- Target:
- Eric Lander, Scientific Advisor to the President
- Region:
- United States of America
News stories about proposed massive funding increases for the sciences have recently broken. Social media is full of ideas and desperate requests for various projects: favorite moonshots, a massive increase in the number of R01 grants, blanket funding increases across the board. But, at least for the biological sciences, I have a slightly different proposal. Even with the massive successes, advances, discoveries, and treatments in medicine and biology, the entire system of science remains mired in 17th-century methods of disseminating scientific results.
We are simply not doing a good job of capturing and collating the results of all (or really any) of the experiments performed under the auspices of NIH research. The issue, in my opinion, is that scientists are rewarded for scientific publications, not for the work they have done or what they have learned. Experiments are performed, models are built, software is developed, data are analyzed. Then the papers are written, reviewed, and revised, and snapshot image figures are made. Results are reported in electronic spreadsheets (1980s technology), supplemental files, addenda, and appendices. In some specific cases (genomics, for example), larger data files are deposited in public repositories run by the NLM or other scientific institutes. It is extremely rare for the analysis software that is created to be reviewed or checked for functionality, and essentially no quality control or structured (not free-form text!) metadata is ever captured. This, in the long run, is an inefficient use of taxpayer funds.
Large genomics projects often have dedicated "Data Coordination Centers," which have demonstrated (at prototype scale) how detailed metadata about experiments and analyses, coupled to public ontologies, can be used to build something like a machine-readable database of a large body of work. However, these efforts are hampered by "translation" or data-wrangling steps, in which work done in the lab must be transformed, compressed, and synthesized semi-automatically into a centralized database. For this to be truly transformative, data coordination needs to be extended to capture experimental proposals, protocols, and preliminary results at the source.
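To make the idea of ontology-coupled metadata concrete, here is a minimal sketch, in Python, of what a structured, machine-readable experiment record might look like. The schema, field names, and accession formats are illustrative assumptions, not the actual data model of any existing Data Coordination Center; the ontologies referenced (OBI, UBERON, NCBI Taxonomy) are real public resources.

```python
import json

# A hypothetical structured experiment record in which free-text fields are
# replaced by references to public ontology terms. The schema and accession
# formats are invented for illustration.
experiment = {
    "accession": "EXP-2021-000123",  # hypothetical accession format
    "assay": {
        "label": "RNA-seq",
        "ontology_term": "OBI:0001271",  # Ontology for Biomedical Investigations
    },
    "sample": {
        "organism": {"label": "Homo sapiens", "ontology_term": "NCBITaxon:9606"},
        "tissue": {"label": "liver", "ontology_term": "UBERON:0002107"},
    },
    "protocol_ref": "PROT-2021-000045",  # link to a registered protocol record
    "raw_data": ["fastq/run1_R1.fastq.gz", "fastq/run1_R2.fastq.gz"],
}

REQUIRED = ["accession", "assay", "sample", "protocol_ref", "raw_data"]

def validate(record: dict) -> list[str]:
    """Return any missing required fields -- the kind of automated quality
    control that free-form spreadsheets and PDFs cannot provide."""
    return [f for f in REQUIRED if f not in record]

if __name__ == "__main__":
    missing = validate(experiment)
    print(json.dumps(experiment, indent=2) if not missing
          else f"Missing fields: {missing}")
```

Because every field is structured rather than free-form text, records like this can be validated, aggregated, and queried by machines at the moment of capture rather than reconstructed by data wranglers after publication.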
There is a better way forward for science. It is not an easy way. It is not inexpensive. We estimate that it would take 5-10% of the TOTAL NIH research budget to actually bring this to fruition across all federally funded labs in the biological arena.
This work needs to be extended throughout all of NIH-funded research. Not just genomics experiments, and not just "machine-learning ready" computable data files: all experimental protocols need to be registered in a machine-readable format. All proposed experiments and results should be wrangled, catalogued, annotated, and stored, successful or not. Analysis software would be reviewed, tested, and stored in public open-source repositories. Preliminary findings and interesting data could be shared instantaneously with collaborators across the globe.
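As a companion sketch, here is one hypothetical shape a machine-readable protocol registration might take. The record type, field names, and registry conventions below are assumptions for illustration, not an existing NIH schema; note the outcome field, which is recorded whether or not the experiment succeeded.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json

# Hypothetical schema for a registered, versioned protocol record.
@dataclass
class ProtocolStep:
    description: str
    reagents: list[str] = field(default_factory=list)
    duration_minutes: Optional[int] = None

@dataclass
class Protocol:
    protocol_id: str
    title: str
    version: str
    steps: list[ProtocolStep]
    # Outcomes are registered for failed experiments too, so negative
    # results are captured rather than discarded.
    outcome: str = "unreported"  # e.g. "successful", "failed", "inconclusive"

protocol = Protocol(
    protocol_id="PROT-2021-000045",  # hypothetical registry identifier
    title="Bulk RNA extraction from frozen liver tissue",
    version="1.2",
    steps=[
        ProtocolStep("Homogenize 30 mg tissue in lysis buffer",
                     reagents=["TRIzol"], duration_minutes=10),
        ProtocolStep("Column purification and DNase treatment",
                     duration_minutes=45),
    ],
)

# Serialize to JSON so the record can be stored, versioned, diffed, and
# queried in a central registry.
print(json.dumps(asdict(protocol), indent=2))
```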
This will never happen without top-down organization. It would require equipment on the ground, computational clouds, full-time data scientists and professionals, thousands (maybe millions) of lines of code, and training. In particular, scientists will have to be retrained to be more collaborative, more open, and more rigorous in detailing exactly how to reproduce their work.
We, the undersigned, feel that a project of this magnitude, while difficult and paradigm-shattering, would immeasurably advance science in the US.
This is an infrastructure project.