Specimens of natural history hold a key to addressing a fundamental challenge of our time: biodiversity conservation. Across the globe, museum collections maintain a treasure trove of more than a hundred million specimens, labels, and archives. Unfortunately, much of that information lies dormant, due in part to data inaccessibility and usability challenges. Our work focuses on mobilizing and digitizing insect specimen collections to help scientists accelerate the biodiversity research needed to help our planet thrive.

Our Mission

Digitize and database insect specimen collections in an integrated, accurate, and scalable manner.

The Data

400,000 specimen images from the Essig Museum of Entomology at the University of California, Berkeley.

The Product

We developed a modular pipeline that leverages computer vision and NLP to identify and extract specimen labels.


Over hundreds of years, entomologists around the world have meticulously collected and annotated millions of specimens. Given that the collections were documented in non-digital and non-standard ways, estimates suggest that less than 3% of biological specimen data is web-accessible (Ariño 2010). As a result, most of the data that narrates our ecological history and preserves our biodiversity footprint remains largely untapped.

Our digitization effort focuses on one class of biological specimens that account for 80% of life on earth: insects. From pollinators to pest controllers to decomposers, insects play critical roles in our delicately balanced ecosystem. With the rapid pace of environmental change, it is immensely important for entomologists to understand trends in insect populations over time. But with millions of specimen samples in hand, studying and uncovering patterns efficiently requires a step away from manual annotations and a move towards an automated approach.

A central challenge of automated annotation is that specimen collections come in all shapes and sizes. The variance is particularly pronounced when it comes to the method, quality, and style of documentation and photo-taking.

Variation in Insect Specimen Photographs

Challenges of Automated Specimen Annotation

Through our work, we have prioritized the extraction of five attributes within specimen images: specimen ID, scientific name (genus + species), name of the collector, date of specimen collection, and location of the specimen at the point of collection. Our goal is to develop a pipeline that enables scalable and accurate extraction of insect specimen attributes in order to accelerate the digitization and accessibility of archives.

How It Works

Follow along below to understand our pipeline and get insights on our data-driven decisions throughout the development process.

The Pipeline: At A Glance

Step 1: Web Scraping

We started by creating a scraper to extract the specimen images housed by the Essig Museum of Entomology in Berkeley, CA. The output was split into two categories: databased (n = 50,000 images) and non-databased (n = 400,000 images). The databased images were paired with pre-transcribed attributes, while the non-databased images required full transcription.
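The databased/non-databased split can be sketched as a simple partition. The record structure and field names below are illustrative assumptions, not the scraper's actual output format.

```python
def partition_records(records):
    """Split scraped specimen records into databased (carrying pre-transcribed
    attributes) and non-databased (image only) groups."""
    databased, non_databased = [], []
    for rec in records:
        # A record counts as "databased" if it already has transcribed attributes.
        if rec.get("attributes"):
            databased.append(rec)
        else:
            non_databased.append(rec)
    return databased, non_databased

# Hypothetical records for illustration only.
records = [
    {"file": "EMEC8626.jpg", "attributes": {"genus": "Aphthargelia"}},
    {"file": "EMEC9001.jpg", "attributes": None},
]
db, non_db = partition_records(records)
```

Downstream, the two groups take different paths: databased records skip straight to validation, while non-databased records flow through the full OCR pipeline.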

Step 2: Image Stitching

We explored different platforms to perform optical character recognition (OCR). We found that the Cloud Vision API from Google yielded the most accurate results and handled certain specimen nuances (e.g., tilted text, shadows, cursive handwriting) substantially better than its open-source counterparts (such as Tesseract and OCRopus). However, the robust performance of the Google Cloud Vision API comes at a cost: $1.50 per 1,000 API hits. At that rate, the OCR costs for an archive of 5 million images would total $7,500. As a more cost-effective strategy (without sacrificing accuracy), we developed an approach to stitch batches of images together so that they could be analyzed in a single API hit. By stitching 10 images together so that the batch "reads" as a single image, we were able to reduce costs by 10x.
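One way to implement the stitching bookkeeping is to stack images vertically and record each image's vertical offset, so OCR coordinates can later be mapped back to individual specimens. A minimal sketch of that offset arithmetic follows; the actual pipeline may arrange its batches differently.

```python
def stitch_plan(image_heights):
    """Given the pixel heights of images stacked top-to-bottom, return the
    vertical offset at which each image is pasted into the stitched canvas,
    plus the total canvas height."""
    offsets, y = [], 0
    for h in image_heights:
        offsets.append(y)  # top edge of this image in the stitched canvas
        y += h
    return offsets, y

offsets, total = stitch_plan([600, 450, 500])
# offsets == [0, 600, 1050]; total canvas height == 1550
```

With the offsets recorded per batch, any bounding box the OCR returns on the stitched canvas can be shifted back into the coordinate frame of its source image.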

Step 3: OCR Using Google Cloud Vision API

We ran all stitched images (each containing 10 specimen images) through the Google Cloud Vision API to perform OCR. In addition to the text of each image, the API outputted the coordinates (bounding boxes) of each cluster of words. To ensure we maintained specimen-level traceability through the stitching process, we included code that retained the file name and file size of each specimen. Both attributes played a critical role in mapping the stitched outputs back to their original (un-stitched) forms.
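The traceability bookkeeping could look like the following sketch, where each batch retains the name and size of its member files; the exact record layout is an assumption for illustration.

```python
def make_batches(files, batch_size=10):
    """Group specimen files into batches of up to batch_size, retaining
    per-file metadata so a batch's OCR output can be traced back to the
    individual specimens it contains."""
    batches = []
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        batches.append({
            "members": [{"name": f["name"], "size": f["size"]} for f in batch],
        })
    return batches

# Hypothetical file list: 23 files -> batches of 10, 10, and 3.
files = [{"name": f"EMEC{n}.jpg", "size": (400 + n, 300)} for n in range(23)]
batches = make_batches(files)
```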

Step 4: Deconstructing the OCR Output

Each stitched image had up to 10 constituent specimen images. Utilizing the file sizes, we were able to identify the bounding boxes of each constituent specimen image (within the stitched files) and map each to its original file name, which contained the specimen ID and scientific name. From there, we parsed the text outputs so that they coincided with their constituent images. Through the process of stitching and deconstructing the specimen images, we were able to mimic the output we would have received running OCR individually on each specimen image… but at a tenth of the cost.
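Assuming the constituent images are stacked vertically at known offsets, mapping an OCR bounding box back to its specimen reduces to a coordinate lookup. A minimal sketch:

```python
import bisect

def locate_specimen(y_coord, offsets):
    """Map a y-coordinate from the stitched image's OCR output back to the
    index of the constituent specimen image it falls within.

    offsets[i] is the top edge of image i in the stitched canvas, in
    ascending order."""
    return bisect.bisect_right(offsets, y_coord) - 1

# Three hypothetical images stacked vertically.
offsets = [0, 600, 1050]
locate_specimen(75, offsets)    # falls in image 0
locate_specimen(700, offsets)   # falls in image 1
```

Each OCR text cluster can then be routed to the specimen whose band contains its bounding box, and tagged with that specimen's original file name.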

Step 5: Spell Check to Flag Poor Outputs

In certain instances, we came across specimen images that were not easily readable due to factors like text blurring, superimposed labels, and illegible handwriting. As a result, we anticipated the OCR results for certain specimen images would not be decipherable or usable. To strengthen the quality of extracted results, we incorporated an in-process spell check that flags unusable outputs based on pre-set Levenshtein distance thresholds. If any specimen image fails the spell check, the user is notified that the OCR output does not meet the specified robustness standards and requires manual inspection. The spell check step has been introduced as a quality control measure to bring attention to potentially inaccurate or faulty transcriptions.
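A minimal version of such a spell-check gate might look like this; the vocabulary and threshold here are placeholders, not the pipeline's actual settings.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def flag_output(tokens, vocabulary, threshold=2):
    """Flag an OCR output whose tokens all sit too far (in edit distance)
    from every known word; flagged outputs go to manual inspection."""
    for tok in tokens:
        if min(levenshtein(tok.lower(), w) for w in vocabulary) > threshold:
            return True  # fails the spell check
    return False
```

A token like "colector" passes (edit distance 1 from "collector"), while pure OCR noise exceeds the threshold and gets routed to a human.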

Step 6: Attribute Extraction

There are 5 attributes we extracted from each specimen image: specimen ID, scientific name (genus + species), name of the collector, date of specimen collection, and location of the specimen at the point of collection. Recognizing that each image file name contained the corresponding specimen ID and scientific name (e.g., EMEC8626 Aphthargelia symphoricarpi.txt or SBMNHENT133 Enallagma carunculatum.txt), we performed a direct lookup and parsed the information accordingly.
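The direct lookup from the file name can be sketched as a simple string parse, assuming the "<ID> <Genus> <species>" naming convention shown above:

```python
def parse_filename(filename):
    """Extract specimen ID, genus, and species from a file name of the form
    '<ID> <Genus> <species>.<ext>'."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    specimen_id, genus, species = stem.split(" ", 2)
    return {"specimen_id": specimen_id, "genus": genus, "species": species}

parse_filename("EMEC8626 Aphthargelia symphoricarpi.txt")
# {'specimen_id': 'EMEC8626', 'genus': 'Aphthargelia', 'species': 'symphoricarpi'}
```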

Extracting the collector name from each image required more creativity. When we first performed the scraping, we had 50,000 databased images at our disposal. Many collector names appeared repeatedly, so we pulled the collector names from the databased images and performed a global lookup on the 400,000 non-databased images to check for matches. Using this lookup method, we were able to identify the collector name for 56% of our specimen images. Taking it a step further, we observed that the names on the specimen images were often followed by indicator words like “coll.” or “collector”. We then used regex to define a search pattern that pulled any names preceding the indicator words. In doing so, we identified the collector name for an additional 22% of our specimen images.
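A simplified version of such an indicator-word pattern might look like the following; the actual regex used in the pipeline is likely more involved.

```python
import re

# Capture a run of capitalized words/initials immediately before an
# indicator word ("coll." or "collector"). Illustrative approximation only.
COLLECTOR_RE = re.compile(
    r"([A-Z][\w.\-]*(?:\s+[A-Z][\w.\-]*)*)\s+(?:[Cc]ollector\b|[Cc]oll\.)"
)

def find_collector(text):
    """Return the name preceding a collector indicator word, if any."""
    m = COLLECTOR_RE.search(text)
    return m.group(1) if m else None

find_collector("det. J. A. Powell Coll.")        # -> "J. A. Powell"
find_collector("W. M. Giffard collector, 1913")  # -> "W. M. Giffard"
```

A pattern like this would over-capture when other capitalized tokens (e.g., a state abbreviation) directly precede the name, so in practice it would be combined with the lookup results and further cleanup.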

For the date of specimen collection, we applied regex in a similar fashion, covering 15 different permutations of date formats. Using regex, we captured the collection date for 51% of our specimen images. For the remaining 49% of the images, we leveraged a BERT model fine-tuned on the Stanford Question Answering Dataset (SQuAD). The same approach was used for the location of specimen collection, where we achieved accurate extraction across 67% of specimen images. Based on stakeholder feedback, 60% of the location extractions provided the desired granularity at the landmark, region, and/or county level.
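A sketch of the date-matching idea, using a small illustrative subset of format permutations rather than the full 15:

```python
import re

# A few date-format permutations commonly seen on entomology labels
# (illustrative subset only). Roman numerals often denote the month.
DATE_PATTERNS = [
    re.compile(r"\b\d{1,2}[./-][IVXivx]{1,4}[./-]\d{2,4}\b"),   # 12-VI-1953
    re.compile(r"\b[A-Z][a-z]{2,8}\.?\s+\d{1,2},?\s+\d{4}\b"),  # June 12, 1953
    re.compile(r"\b\d{1,2}\s+[A-Z][a-z]{2,8}\.?\s+\d{4}\b"),    # 12 Jun 1953
]

def find_date(text):
    """Return the first date-like substring matched by any pattern."""
    for pat in DATE_PATTERNS:
        m = pat.search(text)
        if m:
            return m.group(0)
    return None

find_date("Berkeley Hills 12-VI-1953 leg.")  # -> "12-VI-1953"
```

Specimens where no pattern fires would fall through to the question-answering model.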

Step 7: TSV Generation

As part of the pipeline, all extracted attributes are directly imported to a .tsv file to enable quick and easy information retrieval across the archive. The .tsv file also reflects the OCR output of each image to maintain traceback to the “raw data” used for the extractions.
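Writing the extracted attributes to a .tsv can be sketched with the standard library's csv module; the column names here are illustrative assumptions, not the pipeline's actual schema.

```python
import csv
import io

# Hypothetical column set: the five extracted attributes plus the raw OCR text.
FIELDS = ["specimen_id", "genus", "species", "collector", "date",
          "location", "raw_ocr"]

def write_tsv(rows, fh):
    """Write extracted attribute rows to a tab-separated file handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_tsv([{
    "specimen_id": "EMEC8626", "genus": "Aphthargelia",
    "species": "symphoricarpi", "collector": "J. A. Powell",
    "date": "12-VI-1953", "location": "Berkeley, CA",
    "raw_ocr": "raw label text",
}], buf)
```

Keeping the raw OCR column alongside the parsed fields preserves the traceback to the "raw data" described above.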

See it in action.

Please begin by watching the guided video, which provides a walkthrough of our pipeline. From there, we encourage you to explore the process through the interactive demo hosted on Google Colab.

Open In Colab

Interested in learning more?

Below you will find the tools used in our work along with supplementary resources. We encourage you to visit the Insection GitHub Repo to explore the pipeline and experiment with different images.

Project Resources

Further Reading

Project Presentation