Dirty Data

January 30th, 2008 by Potato

Well, I had a disappointing day: I found out that a large SPSS file I was working on was somehow corrupted or had a data entry error all along. Basically, in a file with something like 8 columns and 500 rows full of data, about 200 rows down the data got all mixed up across the columns. I don’t know yet if it’s gone wrong in a predictable way that can be undone, but even if it has, I’m still going to have to manually comb through all of it to make sure it’s right before I can finish analyzing it and getting a paper out.

It’s a real bitch to have that sort of thing happen, especially after massaging a data set for a while to try to plumb the depths of its secrets. It’s a lot of wasted work. Drawing a simple scatter plot for myself might have been enough to catch the error earlier. Of course, if it weren’t for the fact that one data point went way above any value that column should have had (a 91 in a column that should only have numbers 1-60 in it), I probably would never have noticed the scrambling in the first place. If the data were a little less screwed up, it might very well have gone on to get published in that state. Then we’d be stuck trying to defend that dirty data (assuming anyone read the paper). It makes me a little afraid, working in science, knowing how many little unconscious mistakes can potentially go into an experiment and really change things around. The (external) replication step seems even more important now.
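For what it’s worth, the kind of range check that would have flagged that stray 91 automatically is only a couple of lines of code. Here’s a minimal sketch in Python (the column values and the 1-60 range are just illustrative, not my actual data):

```python
def out_of_range(values, lo, hi):
    """Return (row_index, value) pairs that fall outside [lo, hi]."""
    return [(i, v) for i, v in enumerate(values) if not (lo <= v <= hi)]

# A column that should only contain values from 1 to 60;
# the 91 is the sort of outlier that gave away the scrambling.
column = [12, 34, 7, 91, 45]
print(out_of_range(column, 1, 60))  # flags row 3, value 91
```

Running something like that over every column before analysis is a cheap insurance policy, even if it only catches the errors that happen to land outside the expected range.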

Comments are closed.