February 9, 2010
Data management: the key to open science?
I’ve been thinking a bit more about open science lately, given the outside chance that I’ll be able to attend the upcoming Science Commons symposium in Seattle. It’s a topic that I’ve unfortunately pushed to the back burner a bit while I’ve been getting settled in my post-doc.
Again I’ve been trying to decide what I think is the key issue for developing a culture of sharing scientific data. At the moment I feel like the main problem is data management. What I mean here is that labs have a hard time keeping track of their data internally, let alone “preparing” it for broader release.
For example, in my lab we are generating a relatively small amount of DATA (easily quantifiable files, like results of instrument runs), on the order of 1 GB/month. Even though this is probably about average for a science lab, it’s surprisingly difficult to keep organized and readily accessible. This is because it’s being produced by several largely independent students on distinct projects. In addition, the tools we have for analyzing this data are clunky and prone to crashes, and using them is an exercise in caveats and “magic numbers”. Combining and parsing data across multiple experiments is a major operation.
I’d like to make a couple of key points here. Firstly, this is actually a better situation than other labs I’ve been in. At least here there are some common repositories, in the form of a few spreadsheets saved on common-use computers, from which one can find pointers to the raw data files. Secondly, I think this example illuminates the type of ad-hoc system in place for many academic labs. I think there is a desire in many cases to implement a better system, but not really the drive, dedication, and resources that are required to implement one with the tools that are available.
Perhaps we can take a lesson from industry, where data management has financial and legal ramifications. Although my experience in this environment is somewhat limited, I believe that the difference is largely a matter of resources. Industrial labs might have access to a Technical Information Manager on staff and/or use a Laboratory Information Management System (LIMS). Why hasn’t either of these taken hold in academia?
One issue is the separation between IT and scientists in many departments. Often the IT department is lightly staffed, and spends a large portion of its time doing desktop support for individual users (cleaning viruses, updating software, etc.). When possible, they may be able to implement some larger projects like deploying a server, managing a common datastore, or things of this nature. The key is that almost all of these activities are more or less completely decoupled from the actual science. They are IT issues, and are handled by the IT folks. Meanwhile, the professors (or more often their students) are generating and analyzing data on the infrastructure that IT has provided. Again, this is decoupled from IT. They use the computers, and when the computers break they call IT. The issue here is that there is no guidance on good practices in data management. It’s an area that falls between the cracks, and is often only addressed as an afterthought or following a major computer failure. Individual professors don’t have the resources (or workload) to hire a full-time technical information manager to fill this gap, and this isn’t a position that I’ve ever seen at a departmental level in academia.
The other option is to use a software system which can automate the data management. The term for this software, “LIMS”, has been tarnished by an abundance of clunky, overpriced, closed-source products developed at fly-by-night software houses. I’m sure not all LIMS producers fall under this umbrella, but an unfortunate number do. So what would a good LIMS look like? I think there are just a few simple criteria:
- It has to be simple & flexible. Getting your data into the LIMS needs to be easier than not doing it. Students are incredibly busy, and will resist anything that involves extra work.
- It has to be open source, to leverage the power of the community. No development team can anticipate the needs of every lab (or even department), so an easily-extensible core with freely available code is the only way to encourage widespread adoption and contribution.
- It has to be trustworthy. The data store has to be rock-solid, and backups need to be bulletproof. This data is the highly valuable output of labs, and no one will touch a system that has a whiff of instability.
I think these can all be accomplished. Many open-source projects have already found acceptance, such as the Open Bioinformatics member projects, PyMOL, and many others. One key will be developing a package that can be deployed on existing hardware (i.e. as close to a standard LAMP stack as possible), to ease the burden on the IT people who will need to do the on-site support. A web-based tool will also help with ease of use: if a student can include their data from their own laptop at the coffee shop, it’s a lot more likely to happen than if they need to fight for time on a certain cluttered common-use machine in the lab.
This type of tool would aid in the larger studies that many open science proponents are interested in. How great would it be if you wanted to do a meta-study from the published results of several labs, and all it took to have the data in a consistent format was a simple MySQL statement (or, if the software is coded properly, a couple of button clicks)? What if when you were reviewing a paper for publication you could quickly get all of the source data, again in a format that is immediately accessible and able to be parsed? What if, as a professor, all the data collected by your summer undergraduate from 4 years back was available with a few clicks? It’s possible. It will take a bit of work by a few intelligent people, but the payoff would be worth it many times over.
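To make the "simple MySQL statement" idea a little more concrete, here is a minimal sketch of what such a query could look like once several labs store results in a shared layout. Everything in it is an assumption for illustration: the `measurements` table, its columns, the lab and experiment names, and the numbers are made up, and Python's built-in `sqlite3` module stands in for MySQL so the example runs on its own.

```python
import sqlite3

# Hypothetical shared schema: one row per measurement, tagged by lab and experiment.
SCHEMA = """
CREATE TABLE measurements (
    lab        TEXT NOT NULL,
    experiment TEXT NOT NULL,
    sample     TEXT NOT NULL,
    value      REAL NOT NULL,
    units      TEXT NOT NULL
)
"""


def build_demo_db():
    """Create an in-memory database with a few made-up rows from two labs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("lab_a", "binding_assay", "s1", 1.0, "uM"),
        ("lab_a", "binding_assay", "s2", 3.0, "uM"),
        ("lab_b", "binding_assay", "s1", 2.0, "uM"),
        ("lab_b", "thermal_melt", "s1", 55.0, "C"),
    ]
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def summarize(conn, experiment):
    """Pull one experiment type across all labs with a single statement."""
    query = """
        SELECT lab, COUNT(*) AS n, AVG(value) AS mean_value, units
        FROM measurements
        WHERE experiment = ?
        GROUP BY lab, units
        ORDER BY lab
    """
    return conn.execute(query, (experiment,)).fetchall()


if __name__ == "__main__":
    for row in summarize(build_demo_db(), "binding_assay"):
        print(row)
```

The point is not the particular schema, but that once the format is consistent the cross-lab part of the work collapses into one `SELECT`.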
- Posted under data management, open science
Neil Smalheiser said:
I agree. My group programmed an extremely lightweight prototype open source lab notebook, WETLAB, available on our site for download. It is open for comments but would need a lot of further development to make it an everyday tool.
GreggT said:
From my perspective there are two different possibilities here. Coming from the top-down, more comprehensive and sophisticated angle, caLIMS (part of caBIG) has some potential to fulfill the need for a common LIMS platform. However, I suspect it’s a good deal more complicated to implement and maintain than what you’re looking for.
Coming from the bottom up, you might look at an electronic lab notebook, which doesn’t directly facilitate sharing, but does get data into a consistent, electronic format that is a great first step towards LIMS. There is an interesting, free SaaS ELN available at https://www.lablife.org, which seems to be in use by quite a few labs.
bill said:
Take a look at labkey.org. It’s open source and supported by a group of professional software engineers. Any kind of data can be managed but it is mainly used for systems biology. They’re based in Seattle at the Fred Hutchinson Cancer Research Center.
jwinget said:
Thanks for pointing out Lablife and Labkey. They are both interesting on the surface; I’ll have to spend some time checking out some of the guts (where possible).
I’d be very interested to see some of these projects migrated to a more canonical OSS development platform, such as SourceForge or GitHub.
Tom Caruso said:
I agree with your comments…particularly that academia needs a better laboratory information management infrastructure. I am familiar with this IT space because I saw a need for a laboratory information system for managing animal information in a toxicology laboratory back in 2003, after which I put together a team and tried to get NIH funding to support my efforts to build national standards into such a database system that could be open source and modular for addition of other capabilities (Sheetz and Caruso, 2006).
You point to several of the problems including the concern about data integrity as provided by trustworthy systems, and the division between faculty and IT. Several other issues are also important:
1.) Faculty want to make sure that their information is secure from computer pirates, who might be their colleagues, even those in the same department or lab.
2.) Extramural academic funding supports direct costs for hypothesis-driven research or, more recently, “Big Science”, not infrastructure costs, and academic institutions have not shown the motivation to invest in laboratory information management infrastructure considering their space and other equipment shortages, shortfalls of state government funding in the past decade, and growing salary needs of their faculty.
Open source solutions do not solve the need for support, and that will require a commitment from deans, provosts and presidents to provide these resources to ensure maximum usage of a laboratory information system.
I applaud the efforts of NCI, driven by a need for better data sharing to improve cancer research, to make major investments in the Cancer Biomedical Informatics Grid (caBIG®) program that seeks solutions to many of the technical problems and provides funding for infrastructure development and early adoption investments.
Sheetz, SD and TP Caruso. (2006). Integrating Electronic Health Record Standards into a Laboratory Information Management System. In Proceedings of the Twelfth Americas Conference on Information Systems, Acapulco, Mexico. August 4-6.
Will FitzHugh said:
A previous comment mentioned the NCI’s caBIG project. The National Cancer Institute is funding a project called caLIMS (which is part of the caBIG effort) that, while just getting started, is meant to be a general-use, open-source LIMS. There’s some information on the web here: https://wiki.nci.nih.gov/display/caLIMS2/caLIMS2+Wiki+Home+page
Frank said:
I hear the same comment a lot: “we are producing data and it’s IT’s problem to help us sort it out”. I think this is a cop-out and downright lazy. If you are carrying out scientific experiments and producing data you have a responsibility to learn how to manage that data, along with your reagents and equipment.
A LIMS system is a heavyweight solution and can be quite a step change from normal paper lab-book keeping. I would suggest trying at least to move to an electronic copy of a lab book first. This does not necessarily have to be an all-in-one ELN system either; it could be in LaTeX, or if you really must, Word, or better yet a wiki. As far as storing the data goes, the simplest solution is to stick it in a code versioning system like SVN or git, with a log saying what experiment it came from.
In the first instance I would choose this ahead of trying to build and maintain a database. I have never heard anyone who purchased a LIMS system say that they are happy with it, in academia or industry.
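As a rough illustration of the "versioned data plus a log" suggestion above, the following sketch appends one line per data file to a plain CSV manifest that can be committed alongside the data. It is only one possible shape for such a log: the function names, the `MANIFEST.csv` file name, and the column layout are all hypothetical, and the version control step itself is left to the usual `git add` and `git commit` (or the SVN equivalents).

```python
import csv
import hashlib
from datetime import date
from pathlib import Path


def log_data_file(data_path, experiment, manifest_path="MANIFEST.csv"):
    """Record which experiment a data file came from, plus a checksum."""
    data_path = Path(data_path)
    manifest_path = Path(manifest_path)
    # The checksum lets anyone confirm later that the file has not changed.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    is_new = not manifest_path.exists()
    with manifest_path.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if is_new:
            # Write the header only when the manifest is first created.
            writer.writerow(["date", "experiment", "file", "sha256"])
        writer.writerow(
            [date.today().isoformat(), experiment, data_path.name, digest]
        )
    return digest


def files_for_experiment(experiment, manifest_path="MANIFEST.csv"):
    """List the data files logged against a given experiment."""
    with Path(manifest_path).open(newline="") as fh:
        return [
            row["file"]
            for row in csv.DictReader(fh)
            if row["experiment"] == experiment
        ]
```

Committing the manifest together with the data file keeps the "what experiment did this come from" answer in the same history as the data itself.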
jwinget said:
Frank, I feel like you’ve missed the main point of what I was trying to say. I’ll take the blame for unclear writing. What I was trying to point out was that it’s not the job of IT to manage the data. Indeed, I’d be really happy if more scientists thought like you do and spent more time on managing their data (in between the times when they are getting it all in order for publication).
Again, I’ve gotten a couple of comments today bagging on LIMS systems. I’ll repeat what I tried to say in the original post: LIMS as they currently exist are almost always a bad solution. I’ve fallen into the trap of using an acronym that carries serious baggage. What I am talking about is more like a wiki, but a wiki that will do some (automated?; at least simple for the end-user) semantic markup so that the data can be compiled across multiple labs & experiments.
Again, one of the keys for me is that such software run on a more or less standard web application stack, to lessen the load on IT.
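A very rough sketch of the "wiki with some semantic markup" idea might look like the following. The `key:: value` line syntax is purely an assumption made up for this example (it is not the markup of any particular wiki engine), as are the function names; the sketch only shows how lightly tagged notebook pages could be compiled into uniform records across labs.

```python
import re

# Hypothetical markup: a wiki line such as "temperature:: 37 C" marks a field.
FIELD = re.compile(r"^\s*([A-Za-z_][\w ]*?)\s*::\s*(.+?)\s*$")


def extract_fields(page_text):
    """Return the tagged fields found on one wiki page as a dict."""
    fields = {}
    for line in page_text.splitlines():
        match = FIELD.match(line)
        if match:
            # Normalize keys so "Sample" and "sample" compile together.
            fields[match.group(1).strip().lower()] = match.group(2)
    return fields


def compile_pages(pages):
    """Combine tagged fields from many pages (title -> text) into records."""
    records = []
    for title, text in pages.items():
        fields = extract_fields(text)
        if fields:  # pages with no markup are simply skipped
            records.append({"page": title, **fields})
    return records
```

Free-text notes stay exactly as the student wrote them; only the tagged lines are picked up, which keeps the extra effort for the end user small.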