If you follow this blog, you’ll know that one of the main themes of my research is data provenance, and that one of its main use cases is reproducibility and transparency in science. I’ve been attending and speaking at quite a few events about data sharing, reproducibility and making science more transparent. I’ve even published [1, 2] on these topics.
In this context, I’ve been thinking about my own process as a scientist and whether I’m “eating my own dogfood”. Indeed, at the Beyond the PDF 2 conference in March, I stood up at the end and, in front of ~200 people, said that I would change my work practice: we have enough tools to really change how we do science. I knew I could do better.
So this post is about doing just that. In general, my research work consists of larger collaborative infrastructure projects and smaller work developing experimental prototypes and mucking about with new algorithms. For the former, the projects use all the standard software development tools (GitHub, Jira, wikis), so the work gets documented fairly well.
The bit that’s not as good as it should be is the smaller-scale work. I think my co-authors and I do an OK job of publishing the code and data associated with our publications, although this could be improved (too often they are only on our own websites). The major issue is that the methods are probably not as reproducible or transparent as they should be: it’s a bit messy for other people to figure out exactly what I was up to when doing something new. The work is not in one place, nor is it clearly documented. It also hurts my own process, in that a lot of the mucking about I do gets lost or takes time to find. I see this as a particular problem as I do more web science research, where gathering, cleaning and reanalyzing data is a critical part of the endeavor.
To fix this, I’ve decided to adopt IPython Notebooks as my new note-taking environment. They let me try different things out while keeping all the parts of a project together. Additionally, they let me “narrate my work”, that is, mix commentary with my code, which is pretty cool. My notebook is on GitHub and also contains information about how my system is set up, including the versions of the libraries I’m relying on.
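For anyone wondering what recording the system setup might look like in a notebook cell, here is a minimal sketch. It is an illustration rather than the code from the notebook itself: the function name `describe_environment` and the example package list are assumptions, and it relies on `importlib.metadata`, which needs Python 3.8 or later.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions.

    `packages` is a list of distribution names (hypothetical here);
    anything that is not installed is recorded as "not installed".
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Illustrative package list; replace with whatever the notebook imports.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Running something like this at the top of a notebook keeps the environment description next to the analysis it belongs to, so a reader can see which versions produced the results.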
There’s still a long way to go to pass Phil’s test for research programming effectiveness (see also Why use make?), but I think this is a step in the right direction.
To honor this step, I’m giving $100 to FORCE11 to spread the word about how we can make scholarship better.