New dataset to study the history of genomics

25 June 2020

The team working on the ERC-funded project Medical Translation in the History of Modern Genomics (TRANSGENE), which comprises various Innogen members, has published a freely available online dataset that enables users to identify overlooked individuals, institutions and connections in the history of genomic science.

The dataset comprises more than 13 million records and was compiled over more than two years. It documents the institutions that submitted yeast, human and pig DNA sequences to the European Nucleotide Archive and other open-access databases between 1980 and 2015, indicating for each institution the number of nucleotides submitted and the year of submission. It also lists the PubMed ID, authors and publication year of the articles that first described these sequences in the scientific literature.

A data note describing the search strategy and cleaning protocol, as well as the design and structure of the dataset, has been published on F1000Research, an open-access life sciences platform with open peer review. The source code of the software used to compile the data can also be downloaded without restrictions.

The TRANSGENE team is now analysing a number of co-authorship network visualisations derived from the data. These analyses are being combined with historical knowledge that the project has drawn from oral histories and archival searches. The results of this mixed-methods approach will be published in a history of science journal in 2021 or early 2022.