Review
Syst Biol. 2020 Apr; 69(6):1231-1253.
doi: 10.1093/sysbio/syaa026.

Repositories for Taxonomic Data: Where We Are and What is Missing

Aurélien Miralles 1, Teddy Bruy 1, Katherine Wolcott 2, Mark D Scherz 3, Dominik Begerow 4, Bank Beszteri 5, Michael Bonkowski 6, Janine Felden 7, Birgit Gemeinholzer 8, Frank Glaw 3, Frank Oliver Glöckner 9, Oliver Hawlitschek 3, Ivaylo Kostadinov 10, Tim W Nattkemper 11, Christian Printzen 12, Jasmin Renz 13, Nataliya Rybalka 14, Marc Stadler 15, Tanja Weibulat 10, Thomas Wilke 16, Susanne S Renner 2, Miguel Vences 17
PMID: 32298457

Abstract

Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000-20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term (ideally perpetual) data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasifacts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy.
For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach: linking data via unique specimen identifiers, and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated ≤2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000-40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
