20090709

Biologists: Vectorize Thyselves

I'm currently investigating strains recorded from equine hooves. In this case, I wasn't directly involved with the experiment, but I'm helping out with the analysis. The analysis itself fits a common pattern: lots of trials of a similar type that all need to be processed in the same way.

Biologists often tackle this kind of work using spreadsheet programs. The advantages of a spreadsheet are that your analysis is quite accessible and that it's computed in real time. The main disadvantage (in this case, but also in many others I've seen) is that a lot of bookkeeping is required to synchronize the analysis performed on each dataset. That bookkeeping introduces plenty of opportunity for error.

My solution to this problem is to take a "vectorized" approach: write a single program that processes each dataset in turn and performs the analysis automatically. The amount of work required for the analysis is thus decoupled from the sample size of the experiment, and there is no bookkeeping needed to keep the analysis in each spreadsheet in sync. It sounds simple enough, but few biologists seem to understand or implement this idea. The main (only?) drawback of this approach is its somewhat greater complexity; you need a real program to perform the analysis, rather than a spreadsheet. That's really not much of a drawback, though, once your experiment is sufficiently large.
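
To make the idea concrete, here is a minimal sketch in Python. It is not the actual hoof-strain analysis: it assumes each trial lives in its own CSV file with a "strain" column, and the names analyze_trial and analyze_all, as well as the summary statistics, are invented purely for illustration.

```python
import csv
from pathlib import Path


def analyze_trial(path):
    """Summarize one trial file.

    Assumed (hypothetical) layout: a CSV with a header row that
    includes a 'strain' column and at least one data row.
    """
    with open(path, newline="") as f:
        strains = [float(row["strain"]) for row in csv.DictReader(f)]
    return {
        "trial": Path(path).stem,
        "n_samples": len(strains),
        "peak_strain": max(strains),
        "mean_strain": sum(strains) / len(strains),
    }


def analyze_all(data_dir):
    """Run the identical analysis on every CSV file in data_dir."""
    # Sorting gives a stable, reproducible order of trials.
    return [analyze_trial(p) for p in sorted(Path(data_dir).glob("*.csv"))]
```

Adding another trial then means dropping one more file into the directory; a call such as analyze_all("trials") picks it up without any change to the analysis itself, which is exactly the decoupling from sample size described above.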

Thanks to this work, I've also had the pleasure of using the excellent Python plotting library matplotlib. Once you get used to the basics, this library makes plotting using a spreadsheet feel like a terrible kludge.