A biophysicist teaches himself how to code

Category Archives: data management

Although he blogs almost as rarely as I do, John Wilbanks (VP of Science Commons) tends to inspire me with many of the things he writes.

Back at the end of 2009, he had a few posts on why the Open Source metaphor doesn’t work well when talking about science. While he’s speaking in this case more generally about science as a whole, his comments reflect directly on my post from yesterday on data management. I wanted to summarize a few of his key points and my thoughts on them.

Before I do so, however, I’ll put in another plug for the Science Commons Symposium, taking place on February 20th in Seattle. John Wilbanks will be there, along with a host of other strong voices interested in knowledge sharing. It should be a great event. If you can’t make it in person, it will be streamed live at

If you’re interested in reading his posts in their entirety, you can find them in parts 1, 2, & 3. To tell a more continuous story, I’ll be pulling quotes out of order from all three of John’s posts.

Several of the comments here yesterday pointed out some specific LIMS projects that have been started. I can see why (given how tightly I focus on a LIMS at the end of my post) people would latch onto this idea, but what I really had in mind was something more like the following:

We need the biological equivalent of the C compiler, of Emacs […] These tools need to be democratized to bring the beginning of distributed knowledge creation into labs, with the efficiencies we know from eBay and Amazon

Because of the complex and variable nature of “DATA” being generated in science labs, I think making one LIMS to rule them all would be nearly impossible. What I’d rather see are some tools that are accessible to the average bench scientist which can be easily modified and expanded upon by the technically gifted scientist. These tools would (if they are to be truly useful) automate some annotation/tagging/parsing of the data as a precursor to deposition in shared repositories such as:

[OpenWetWare and the Registry of Standard Biological Parts] are resources and toolchains that absolutely support distribution of capability and increase capacity, which are fundamental to early-stage distributed innovation.
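To make the "automate some annotation/tagging" idea a bit more concrete, here is a minimal sketch in Python of what such a helper might look like. Everything in it is an assumption on my part: the function names, the metadata fields, and the `.meta.json` sidecar convention are hypothetical, not an existing standard or any repository's required format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def annotate(data_file, tags, owner):
    """Build a metadata record for one raw data file (hypothetical schema)."""
    path = Path(data_file)
    raw = path.read_bytes()
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return {
        "file": path.name,
        "size_bytes": len(raw),
        # The checksum lets anyone confirm later that the file is unchanged.
        "sha256": hashlib.sha256(raw).hexdigest(),
        "modified_utc": modified.isoformat(),
        "owner": owner,
        # De-duplicate and sort so records are easy to compare.
        "tags": sorted(set(tags)),
    }


def write_sidecar(data_file, tags, owner):
    """Save the record next to the data file as <name>.meta.json."""
    record = annotate(data_file, tags, owner)
    sidecar = Path(str(data_file) + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

A student would call something like `write_sidecar("run_001.csv", ["pilot"], "student_a")` right after an instrument run. The point of the sidecar file is that the annotation travels with the data, so a later deposition step has something structured to read.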

Above the meat-space layer where the science is actually being done and data is being collected, we need decentralized places to store and share the “functional information units” – i.e. the data that other scientists can use. Unfortunately:

science is like writing code in the 1950s – if you didn’t work at a research institution then, you probably couldn’t write code, and if you did, you were stuck with punch cards. Science is in the punch cards stage, and punch cards aren’t so easy to turn into GNU/Linux.

I think John stretches the metaphor a bit here, but I see where he is going. The punch card above has more to do with the controlling influence of the institution than it has to do with the day-to-day practice of science. The key point is that there are interests who will put up a resistance to a more free distribution of scientific knowledge, for a variety of reasons.

He goes on to summarize his argument:

I propose that the point of this isn’t to replicate “open source” as we know it in software. The point is to create the essential foundations for distributed science so that it can emerge in a form that is locally relevant and globally impactful


it’s not something that’s enabled by an open source license, a code version repository, and other hallmarks of open source software. It’s users saying, “screw this, I can do better” – and doing it. It’s users who know the problem best and design the best solutions.

I couldn’t agree more, and I think this is what we’re seeing from the blog posts and conversations that are taking place. There is a subset of people who are doing science or who are avidly interested in aiding the practice of science who feel like they can do better than the current system. These people (probably most people reading this blog, especially if you’ve gotten this far) are the ones who have to effect change. It will take more than writing and talking about it, although these are important as well. I’d like to also see a nascent, community-driven project which we can point to and say “it will be like this, but better”.

One final word from John:

Data and databases are another place where the underlying property regimes don’t work as well for open source as in software. But that’s difficult enough to merit its own post. Suffice to say if Open Data had a facebook page, its relationship status with the law would be “It’s Complicated.”


I’ve been thinking a bit more about open science lately, given the outside chance that I’ll be able to attend the upcoming Science Commons symposium in Seattle. It’s a topic that I’ve unfortunately pushed to the back burner a bit while I’ve been getting settled in my post-doc.

Again I’ve been trying to decide what I think is the key issue for developing a culture of sharing scientific data. At the moment I feel like the main problem is data management. What I mean here is that labs have a hard time keeping track of their data internally, let alone “preparing” it for broader release.

For example, in my lab we are generating a relatively small amount of DATA (easily quantifiable files, like results of instrument runs), on the order of 1 GB/month. Even though this is probably about average for a science lab, it’s surprisingly difficult to keep organized and readily accessible. This is because it’s being produced by several largely independent students on distinct projects. In addition, the tools we have for analyzing this data are clunky and prone to crashes, and using them is an exercise in caveats and “magic numbers”. Combining and parsing data across multiple experiments is a major operation.

I’d like to point out a couple of key points here. Firstly, this is actually a better situation than other labs I’ve been in. At least here there are some common repositories, in the form of a few spreadsheets saved on common-use computers, from which one can find pointers to the raw data files. Secondly, I think this example illuminates the type of ad-hoc system in place for many academic labs. I think there is a desire in many cases to implement a better system, but not really the drive, dedication, and resources that are required to implement one with the tools that are available.

Perhaps we can take a lesson from industry, where data management has financial and legal ramifications. Although my experience in this environment is somewhat limited, I believe that the difference is largely a matter of resources. Industrial labs might have access to a Technical Information Manager on staff and/or use a Laboratory Information Management System (LIMS). Why hasn’t either of these taken hold in academia?

One issue is the separation between IT and scientists in many departments. Often the IT department is lightly staffed, and spends a large portion of their time doing desktop support for individual users (cleaning viruses, updating software, etc). When possible, they may be able to implement some larger projects like deploying a server, managing a common datastore, or things of this nature. The key is that almost all of these activities are more or less completely decoupled from the actual science. They are IT issues, and are handled by the IT folks. Meanwhile, the professors (or more often their students) are generating and analyzing data on the infrastructure that IT has provided. Again, this is decoupled from IT. They use the computers, and when the computers break they call IT. The issue here is that there is no guidance on good practices in data management. It’s an area that falls between the cracks, and is often only addressed as an afterthought or following a major computer failure. Individual professors don’t have the resources (or workload) to hire a full time technical information manager to fill this gap, and this isn’t a position that I’ve ever seen at a departmental level in academia.

The other option is to use a software system which can automate the data management. The term for this software, “LIMS”, has been tarnished by an abundance of clunky, overpriced, closed-source products developed at fly-by-night software houses. I’m sure not all LIMS producers fall under this umbrella, but an unfortunate number do. So what would a good LIMS look like? I think there are just a few simple criteria:

  • It has to be simple & flexible. Getting your data into the LIMS needs to be easier than not doing it. Students are incredibly busy, and will resist anything that involves extra work.
  • It has to be open source, to leverage the power of the community. No development team can anticipate the needs of every lab (or even department), so an easily-extensible core with freely available code is the only way to encourage widespread adoption and contribution.
  • It has to be trustworthy. The data store has to be rock-solid, and backups need to be bulletproof. This data is the highly valuable output of labs, and no one will touch a system that has a whiff of instability.
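As one illustration of the "trustworthy" criterion, here is a rough Python sketch of checking a backup against the original data directory by comparing SHA-256 checksums. This is only one small piece of what "bulletproof" would require, and the function names and the directory layout are my own assumptions, not part of any existing LIMS.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large instrument files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(original_root, backup_root):
    """Return relative paths that are missing or differ in the backup."""
    expected = build_manifest(original_root)
    actual = build_manifest(backup_root)
    return sorted(
        name for name, digest in expected.items() if actual.get(name) != digest
    )
```

An empty list from `verify_backup` means every original file has a byte-identical copy in the backup. Running a check like this on a schedule is the kind of thing that turns "we think we have backups" into something a lab can actually rely on.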

I think these can all be accomplished. Many open-source projects have already found acceptance, such as the Open Bioinformatics member projects, PyMOL, and many others. One key will be developing a package that can be deployed on existing hardware (i.e. as close to a standard LAMP stack as possible), to ease the burden on the IT people who will need to do the on-site support. A web-based tool will also help with ease of use: if a student can upload their data from their own laptop at the coffee shop, it’s a lot more likely to happen than if they need to fight for time on a certain cluttered common-use machine in the lab.

This type of tool would aid in the larger studies that many open science proponents are interested in. How great would it be if you wanted to do a meta-study from the published results of several labs, and all it took to have the data in a consistent format was a simple MySQL statement (or, if the software is coded properly, a couple of button clicks)? What if when you were reviewing a paper for publication you could quickly get all of the source data, again in a format that is immediately accessible and able to be parsed? What if, as a professor, all the data collected by your summer undergraduate from 4 years back was available with a few clicks? It’s possible. It will take a bit of work by a few intelligent people, but the payoff would be worth it many times over.
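To give a feel for the "simple MySQL statement" scenario, here is a small Python sketch. It uses the standard library's `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs anywhere; the `measurements` table, its columns, and the three rows are all invented purely for illustration and are not real results.

```python
import sqlite3

# Hypothetical shared schema: one row per measurement reported by a lab.
SCHEMA = """
CREATE TABLE measurements (
    lab      TEXT NOT NULL,
    sample   TEXT NOT NULL,
    quantity TEXT NOT NULL,
    value    REAL NOT NULL,
    unit     TEXT NOT NULL
)
"""


def pooled_means(conn, quantity):
    """Average one quantity per sample across every lab that reported it."""
    rows = conn.execute(
        """
        SELECT sample, COUNT(DISTINCT lab), AVG(value), unit
        FROM measurements
        WHERE quantity = ?
        GROUP BY sample, unit
        ORDER BY sample
        """,
        (quantity,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up example rows, not real data.
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?, ?)",
    [
        ("lab_a", "protein_x", "melting_temp", 61.0, "C"),
        ("lab_b", "protein_x", "melting_temp", 63.0, "C"),
        ("lab_a", "protein_y", "melting_temp", 48.5, "C"),
    ],
)

for row in pooled_means(conn, "melting_temp"):
    print(row)
```

The query itself is the easy part. What makes it possible is that every lab's data already sits in the same columns with explicit units, which is exactly the consistency a shared tool would need to enforce at deposit time.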