
Lessons Learned from RC2020

I took part in the recently finished ML Reproducibility Challenge 2020; together with a great team, I attempted to reproduce one of the papers as part of a course assignment.

This is just my attempt to get back to blogging, as well as to write down some lessons learned. It’s by no means a criticism of anything, but lately I’ve realised that if I don’t write things down I’ll most likely forget all about them within two or three months.

The short story is, we couldn’t completely reproduce the original paper’s results. From my limited sample of talking to fellow students in this challenge and listening to their presentations, it seems quite common not to be able to reproduce a paper’s results completely, at least not in four weeks. That’s less encouraging than I had hoped.

In a way my group already had a head start, because the paper we tried to reproduce came with some pretty extensive code, nuts and bolts included. But after scrolling through page after page, we realised the code had been adapted from another paper, so a lot of dangling variables and unfinished experiments were hidden inside, which at times was really confusing.

Being the overachieving students that we are, at the very beginning of our reproducibility study we were already thinking about the million different ways we could test the model or examine the authors' claims under more general settings. But training these babies is slow, so it was quite easy to get discouraged by the lack of progress. I also felt somewhat swamped by discussions that were mostly speculation rather than factually grounded, which, coupled with (almost) midnight Zoom meetings, was definitely not helpful for my sleep quality.

Overall the experience was a very rewarding one, and I want to try to summarise what I observed/learned:

  • Document your code.
  • Remove the bits in your code that you don’t actually use.
  • Test your code. Don’t write one thing while your code actually does another (wildly different) thing.
  • At my previous job they made sure we read Clean Code, or at least got the gist of it. I think that’s a pretty good idea, even if you’re “just a data guy/gal/person”.
  • Optimise your code so it doesn’t take an insane amount of time (1.5 days! on an NVIDIA 1080 Ti) to train a not-grossly-complex model.
  • Related to the previous point, make sure you’re not fine-tuning parameters that you don’t actually want to change, such as embeddings, unless that’s your intention (see the first sketch after this list).
  • If the dataset only has a train/test split and you also want a dev split, it’s better to shave it off the test part rather than the train part; otherwise you might overfit your model because your train set and dev set are too similar/correlated.
  • Train your model and evaluate it multiple times with different seeds to make sure you’re not just getting lucky with your results (see the second sketch after this list).
  • Report your hyper-parameters.
  • Report your complete results (in an appendix or supplementary materials); don’t cherry-pick or give the impression of cherry-picking results just to support your claims.
  • It’s probably not a good idea to use the same metric to evaluate different datasets/tasks. An accuracy improvement from 65% to 67% is not as awe-inspiring as one from 95% to 97%: the latter cuts the error rate from 5% to 3% (a 40% relative reduction), while the former only trims it from 35% to 33% (roughly 6%).
  • Don’t speculate before you get your results from the experiments (talking to myself right now :D). It’s easier to come up with one plan after you have solid results than to have ten contingency plans before seeing any actual numbers.
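
To illustrate the point about accidentally fine-tuning parameters such as embeddings: a minimal sketch, assuming PyTorch (the framework is my assumption, and ToyClassifier is a made-up stand-in, not the model from the paper). The idea is to switch off gradients for the frozen module and hand the optimiser only the parameters you actually want to train.

```python
import torch
import torch.nn as nn

# A made-up toy model standing in for whatever architecture you are reproducing.
class ToyClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=50, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        # Average-pool the token embeddings, then classify.
        return self.fc(self.embedding(token_ids).mean(dim=1))

model = ToyClassifier()

# Freeze the embeddings so the optimiser never updates them.
for param in model.embedding.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimiser.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# Sanity check: the embedding weights should not appear in any optimiser param group.
assert all(
    p is not model.embedding.weight
    for group in optimizer.param_groups
    for p in group["params"]
)
```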

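For the point about seeds, here is a minimal sketch of running the same pipeline under several seeds and reporting mean ± standard deviation; train_and_evaluate is a hypothetical placeholder for the real training and evaluation loop.

```python
import random
import statistics

import numpy as np
import torch


def set_seed(seed: int) -> None:
    # Seed every RNG the pipeline might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def train_and_evaluate(seed: int) -> float:
    # Hypothetical placeholder for the real training + evaluation loop;
    # it just returns a fake accuracy so the sketch runs end to end.
    set_seed(seed)
    return 0.65 + 0.02 * random.random()


scores = [train_and_evaluate(seed) for seed in (0, 1, 2, 3, 4)]
print(f"accuracy: {statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```
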
I actually think the scientific community would be much happier, and we’d be making much better progress as a whole, if we all published “failed” experiments rather than just the shiny ones. Knowing what will not work and what not to do is, in my opinion, much more helpful for gaining new insights and coming up with new ideas to experiment with.

Maybe we need to start a negative-arXiv to host the dumb, the unglorious and the unprestigious.
