As of a year ago, all grant proposals submitted to NSF must be accompanied by a data management plan. Basically, the PIs must explain:
- how sensitive data (for example, data that contains personal information about experimental subjects) will be managed
- how data resulting from the research will be archived and how access to it will be preserved
The requirements seem reasonable, and certainly people asking for public funding should have thought about these issues. The problem is that little guidance is given about what constitutes an appropriate plan, or how the quality of the plan will play into the evaluation of the proposal. Since I (happily!) do not work with sensitive data, this post will deal primarily with the second item: preserving public access to data.
The NSF policy on Dissemination and Sharing of Research Results makes it clear that researchers are expected to share the results of their research in three ways. First, research results must be published. Second, primary data and other supporting materials must be shared “at no more than incremental cost and within a reasonable time.” Third, NSF grantees are permitted to “retain principal legal rights to intellectual property developed under NSF grants to provide incentives for development and dissemination of inventions, software and publications that can enhance their usefulness, accessibility and upkeep.” In other words, it is permissible to create proprietary products based on technologies developed using NSF money.
It is the second item in the previous paragraph that concerns us here, and it seems to be the main focus of the data management plan. The problems that PIs need to solve when writing this plan include determining:
- How much data will need to be archived? A few megabytes of code or a few petabytes of imaging data?
- Over what time period must data be stored? The duration of the grant or 75 years?
- What kind of failures must be tolerated? The failure of a single disk or tape? A major, region-wide disaster (like the one Salt Lake City is certain to get, eventually)?
- In the absence of disasters, how many bit flips in archived data are permissible per year? (Just detecting them requires periodic integrity checks; see the sketch after this list.)
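To make the bit-flip question concrete: you can only count silent corruption if you keep fixity information alongside the archive and verify it on a schedule. Below is a minimal sketch of that idea (my own illustration, not anything NSF prescribes); the `archive/` directory and `manifest.json` filename are made-up examples.

```python
# fixity_check.py -- toy fixity checking for an archive directory (illustrative only)
import hashlib
import json
import pathlib
import sys

def sha256_of(path: pathlib.Path) -> str:
    """Hash a file in chunks so large archives don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(root: str) -> dict:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root_path = pathlib.Path(root)
    return {str(p.relative_to(root_path)): sha256_of(p)
            for p in sorted(root_path.rglob("*")) if p.is_file()}

def verify(root: str, manifest_file: str) -> list:
    """Return the files recorded in the manifest that are now missing or changed."""
    old = json.loads(pathlib.Path(manifest_file).read_text())
    new = make_manifest(root)
    return [name for name, digest in old.items() if new.get(name) != digest]

if __name__ == "__main__":
    # python fixity_check.py init archive/ manifest.json
    # python fixity_check.py verify archive/ manifest.json
    mode, root, manifest = sys.argv[1], sys.argv[2], sys.argv[3]
    if mode == "init":
        pathlib.Path(manifest).write_text(json.dumps(make_manifest(root), indent=2))
    else:
        bad = verify(root, manifest)
        print("corrupted or missing files:", bad if bad else "none")
```

Run the verify step on a schedule and you at least know whether the answer to the bit-flip question is “zero so far.”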
Of course the cost of implementing such a plan might vary over many orders of magnitude depending on the answers. Furthermore, there is the difficult question of how to pay for archiving services after the grant expires. I’m personally a bit cynical about the ability of an educational organization to make a funding commitment that will weather unforeseeable financial problems and changes in administration. Also, I’m guessing that data management plans along the lines of “we’ll mail a couple of TB disks to our buddies across the country” aren’t going to fly anymore.
For large-scale data, a serious implementation of the NSF’s requirements seems to be beyond the capabilities of most individual PIs and probably also out of reach of smallish organizations like academic departments. Rather, it will make sense to pool resources at least across an entire institution and probably across several of them. Here’s an example where this is happening for political and social research. Here’s some guidance provided by my institution, which includes the Uspace repository. I don’t believe they are (yet) equipped to handle very large data sets, and I’d guess that similar repositories at other institutions are in the same boat.
This post was motivated by the fact that researchers like me, who haven’t ever had to worry about data management (because we don’t produce much data), now need to care about it, at least to the level of producing a plausible plan. This has led to a fair amount of confusion among people I’ve talked to. I’d be interested to hear what’s going on at other US CS departments to address this.
6 responses to “NSF Data Management Plans”
Outsource it?
http://code.google.com/apis/storage/
http://aws.amazon.com/s3/
http://explore.live.com/skydrive
For all I know, an academic pricing plan might even be a tax write off for the hosting provider.
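For what it’s worth, here is a minimal sketch of what the outsourcing route might look like with Amazon S3 via the boto3 library; the bucket name and file paths are hypothetical, and a real plan would still have to budget for storage class, retention, and who pays after the grant ends.

```python
# archive_to_s3.py -- illustrative sketch of pushing dataset archives to S3 (not a vetted plan)
import boto3

s3 = boto3.client("s3")  # credentials come from the usual AWS config/environment

BUCKET = "my-nsf-project-archive"  # hypothetical bucket name
FILES = [
    "results/raw_measurements.tar.gz",   # hypothetical archive files
    "results/analysis_scripts.tar.gz",
]

for path in FILES:
    key = path.split("/")[-1]
    # upload_file transparently does multipart uploads for large objects
    s3.upload_file(path, BUCKET, key)
    print(f"archived {path} as s3://{BUCKET}/{key}")
```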
Climate researchers have to deal with this problem thanks to adversarial global warming deniers, who continually demand access to the data (mainly as a time-wasting exercise of harassment); in the process, though, the climate community is ironing out lots of problems to real scientists’ benefit 🙂
Besides storage media that degrade and/or become obsolete, the data formats themselves can become obsolete. And does one store raw data, or the ‘massaged’ data that goes into models (or wherever it goes)? Researchers tend to write throwaway, ad hoc scripts to convert raw data, and these methods also need to be stored somehow for reproducibility.
The LHC has some interesting projects for sharing the vast reams of data it creates; maybe something useful will come out of it (cf. the WWW that came out of CERN). I think the best idea is to store data in a peer-to-peer distributed way.
What are others doing? Listing the type of data that will be produced, and how it will be shared/released to the public. Promising to make the data publicly available for 3 years after the end of the grant, as NSF apparently requires. (I personally think the NSF requirement is pretty sketchy: they expect us to keep sharing data after they have stopped giving us money for it? But whatever, I’ll make the apparently-required promises and move on with my life.)
The U’s Cyberinfrastructure Council is forming a data-oriented subcommittee – would you like to participate?
In the meantime, here is a tool developed by a few unis – haven’t delved into it but want to throw it out for you to look at:
https://dmp.cdlib.org/about/dmp_about.
Also –
DataOne (NSF OCI):
http://www.dataone.org/
EarthCube (NSF Geosciences):
http://earthcube.ning.com/
http://www.nsf.gov/pubs/2012/nsf12024/nsf12024.jsp?org=NSF
Contact me at cvb@utah.edu, 801-585-3918
The NSF could help. Part of a grant proposal could be a request for a certain amount of archival space to hold all of the results. Publications that come out of the grant could then include locators so that anyone reading a paper can find the data in the NSF archive.
Thanks all, this is useful.
bcs, the one data management plan that I have written basically said “code is the only important output and we’ll just open source it and stick it in GitHub.” So that is sort of a variant of your idea — hopefully it’ll fly.
Magnus, we CS people are pretty lucky not to have huge amounts of data or ideologically motivated attacks on our work, I guess!
Lurker, I’m not sure what others are doing but I expect that soon enough people will build up little libraries of text and strategies to deal with different kinds of data.
Thanks for the links, Cassandra; I’ll contact you offline.