A benchmark dataset of herbarium specimen images with label data

Mathias Dillen, Quentin Groom, Simon Chagnoux, Anton Güntsch, Alex Hardisty, Elspeth Haston, Laurence Livermore, Veljo Runnel, Leif Schulman, Luc Willemse, Zhengzhe Wu, Sarah Phillips

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Background

More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons.

New information

To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.
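As an illustration of how the transcribed data could serve as a reference when optimising a transcription method, the sketch below scores a method's output against a reference transcription using the character error rate (edit distance divided by reference length). This metric is an assumption chosen for illustration, not one prescribed by the article; the function names and the example strings are hypothetical and are not taken from the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference transcription."""
    if not reference:
        # Convention assumed here: an empty reference is only matched by an empty output.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up label text, for illustration only.
    reference = "Flora of Belgium"
    hypothesis = "Flora of Belgum"
    print(character_error_rate(reference, hypothesis))
```

A lower value means the output is closer to the reference. Averaging the rate over all specimens, or per collection, language or text format, would be one way to compare methods on the obstacles listed above.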

Original language: English
Article number: 31817
Journal: Biodiversity Data Journal
Volume: 7
Number of pages: 15
ISSN: 1314-2828
DOIs: https://doi.org/10.3897/BDJ.7.e31817
Publication status: Published - 8 Feb 2019
MoE publication type: A1 Journal article-refereed

Fields of Science

  • 119 Other natural sciences

Cite this

Dillen, M., Groom, Q., Chagnoux, S., Güntsch, A., Hardisty, A., Haston, E., ... Phillips, S. (2019). A benchmark dataset of herbarium specimen images with label data. Biodiversity Data Journal, 7, [31817]. https://doi.org/10.3897/BDJ.7.e31817
@article{cdc7e7db288144e09e2f9d6c871ff65f,
title = "A benchmark dataset of herbarium specimen images with label data",
abstract = "Background: More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons. New information: To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.",
keywords = "119 Other natural sciences",
author = "Mathias Dillen and Quentin Groom and Simon Chagnoux and Anton G{\"u}ntsch and Alex Hardisty and Elspeth Haston and Laurence Livermore and Veljo Runnel and Leif Schulman and Luc Willemse and Zhengzhe Wu and Sarah Phillips",
year = "2019",
month = "2",
day = "8",
doi = "10.3897/BDJ.7.e31817",
language = "English",
volume = "7",
journal = "Biodiversity Data Journal",
issn = "1314-2828",
publisher = "Pensoft Publishers"
}
