yun's attic

Write Tests for Your Data Science Project

I never seem to have got into the habit of writing tests as I code. That’s bad, I know.

But there are so many excuses that prevents it, “oh this is just an exploratory thing”, or “Karen and Chad really needs the report/tool soon, no time for test”, or whatever else that might get in your way. Plus there’s a tendency to just use the million open-source projects out-of-the box, and expect them to do what you think they do.

Sometimes they don’t work they way you had hoped, and you end up digging thought lines and lines of code hoping to find out what went wrong where. And if you’re working with particularly large datasets, just the IO along can slow you down very considerably. Or worse yet, you merrily hand over a “finished product” without even realising the bugs inside – because no tests told you so! I can personally attest to this, especially those deadline-rushed projects.

Well, this time I did it again. I’m about a thousand line deep into my autotrader project, when I realised my features are being scaled in inconsistent ways, that I though of “hmm, maybe there should be a test for this…”

It’s a very simple operation, take all the values of one feature in the training dataset, subtract the mean, and divide by the standard deviation. The default behaviour of sklearn.preprocessing.StandardScaler takes the population standard deviation, meaning it’s normalised by N, the sample size, whereas the default behaviour of pandas.DataFrame.std calculates standard deviation with N-1, which is what I expected.

I went back and created a toy dataset, and wrote my own test for my functions and modules which scale different types of features. And many more tests for other functions and modules. I spent a whole week catching up with writing tests. During the process I also refactored some of my maybe-a-bit-too-long functions, so that they’re easier to test, and also easier to read.

Granted, bugs are ever-elusive, and no amount of testing can ensure you don’t ever make mistakes. But I would sleep better at night. (It’s also a great feeling to watch your tests pass one by one, and the green dots appear one after another. So satisfying.)

Just write some tests, it’s better late than never.

Menu