
Abstract Detail


Best, Jason H. [1], Neill, Amanda [2], Moen, William E. [3].

A framework and workflow for extraction and parsing of herbarium specimen data.

Millions of specimens in museums and herbaria worldwide need to be digitized to be accessible to scientists. A key challenge faced by all biodiversity collections is determining a transformation process that yields high-quality results in a cost- and time-efficient manner. The University of North Texas’s Texas Center for Digital Knowledge (TxCDK) and the Botanical Research Institute of Texas (BRIT) are developing a web-based application workflow for combining human and machine processes to facilitate the transformation of herbarium label data into machine-processable parsed data. The workflow and framework, called the Apiary Project, are made possible through integration of a variety of existing technologies and the application of standards developed by the Taxonomic Databases Working Group and the Dublin Core Metadata Initiative. The workflow interfaces will allow human participants to inspect and analyze the digital herbarium sheet images and extract textual components with the assistance of software technologies such as Optical Character Recognition (OCR), then parse this text into standardized metadata elements. The workflow will provide a final quality control step in which specimen records are evaluated for accuracy and completeness.
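As a rough illustration of the parsing and quality-control steps described above, the following minimal sketch turns OCR output from a label into a record keyed by Darwin Core term names (Darwin Core is a Taxonomic Databases Working Group standard). It is not the Apiary Project's implementation: the "Field: value" label layout, the mapping table, the choice of required terms, and the function names are all assumptions made for this example, and the sample label text is invented.

```python
import re

# Assumed mapping from label prefixes (as OCR might return them) to
# Darwin Core term names. The prefixes are hypothetical.
LABEL_TO_TERM = {
    "collector": "recordedBy",
    "date": "eventDate",
    "county": "county",
    "locality": "locality",
    "no": "recordNumber",
}

# Assumed minimum set of terms for a record to count as complete.
REQUIRED_TERMS = ("scientificName", "recordedBy", "eventDate")

# Matches lines such as "County: Tarrant" or "No.: 123".
FIELD_LINE = re.compile(r"^\s*([A-Za-z]+)\.?\s*:\s*(.+?)\s*$")


def parse_label(ocr_text):
    """Parse OCR text from one label into a dict of term name -> value."""
    record = {}
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    for line in lines:
        match = FIELD_LINE.match(line)
        if match:
            term = LABEL_TO_TERM.get(match.group(1).lower())
            if term:
                record[term] = match.group(2)
        elif "scientificName" not in record:
            # Assumption: the first unlabeled line is the scientific name.
            record["scientificName"] = line
    return record


def missing_terms(record):
    """Quality-control check: list required terms that are absent or empty."""
    return [term for term in REQUIRED_TERMS if not record.get(term)]


# Invented sample label text, for illustration only.
sample = """Quercus stellata Wangenh.
County: Tarrant
Collector: J. Smith
No.: 123
"""

record = parse_label(sample)
incomplete = missing_terms(record)  # terms a human reviewer would need to supply
```

In a combined human and machine workflow of the kind described, a record with a non-empty list of missing terms would be routed back to a human participant rather than accepted automatically.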



Related Links:
Apiary Project Site

1 - Botanical Research Institute of Texas, 500 East 4th Street, Fort Worth, TX, 76102, USA
2 - Botanical Research Institute of Texas, 500 East 4th Street, Fort Worth, TX, 76102, USA
3 - University of North Texas, College of Information, 3940 North Elm Street, Denton, TX, 76207, USA

Keywords:
text analysis
optical character recognition
text parsing

Presentation Type: Poster:Posters for Topics
Session: P
Location: Hall A/Convention Center
Date: Monday, August 2nd, 2010
Time: 5:30 PM
Number: PBG006
Abstract ID:644

Copyright © 2000-2010, Botanical Society of America. All rights