Follow along below to understand our pipeline and get insights on our data-driven decisions throughout the development process.
The Pipeline: At A Glance
Step 1: Web Scraping
We started by building a scraper to extract the specimen images housed by the Essig Museum of Entomology in Berkeley, CA. The output fell into two categories: databased (n = 50,000 images) and non-databased (n = 400,000 images). The databased images were paired with pre-transcribed attributes, while the non-databased images required full transcription.
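The extraction pattern behind the scraper can be sketched with the standard library alone. The HTML snippet and paths below are invented for illustration (the Essig Museum's actual page structure differs), and a real scraper would fetch live pages over HTTP rather than parse an inline string:

```python
from html.parser import HTMLParser

# Hand-made stand-in for a scraped specimen listing page; the real
# markup and file paths will differ.
SAMPLE_PAGE = """<html><body>
<img src="/specimens/EMEC8626.jpg">
<img src="/specimens/SBMNHENT133.jpg">
</body></html>"""

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.extend(value for name, value in attrs if name == "src")

parser = ImgCollector()
parser.feed(SAMPLE_PAGE)
print(parser.srcs)
```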
Step 2: Image Stitching
We explored different platforms to perform optical character recognition (OCR). We found that Google's Cloud Vision API yielded the most accurate results and handled certain specimen nuances (e.g., tilted text, shadows, cursive handwriting) substantially better than its open-source counterparts (such as Tesseract and OCRopus). However, the robust performance of the Google Cloud Vision API comes at a cost: $1.50 per 1,000 API hits. At the scale of our pipeline, the OCR costs for an archive of 5 million images would total $7,500. As a more cost-effective strategy (without sacrificing accuracy), we developed an approach to stitch batches of images together so that they could be analyzed as a single API hit. By stitching 10 images together so that the batch "reads" as a single image, we reduced cost by 10x.
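The bookkeeping behind the stitching can be sketched in pure Python. This is a minimal sketch of the idea, not our pipeline's actual code: it only tracks where each image lands in the vertical composite (in practice, an imaging library such as Pillow does the pixel work), and the function and field names are ours for illustration:

```python
def build_manifest(images):
    """Given (filename, width, height) tuples for a batch of up to 10
    specimen images, compute each image's vertical band inside the
    stitched composite and return a manifest for later deconstruction."""
    manifest = []
    y_offset = 0
    for filename, width, height in images:
        manifest.append({
            "filename": filename,   # retains specimen ID + scientific name
            "y_start": y_offset,    # top edge inside the stitched image
            "y_end": y_offset + height,
        })
        y_offset += height          # next image stacks directly below
    return manifest

batch = [("EMEC8626 Aphthargelia symphoricarpi.jpg", 1200, 800),
         ("SBMNHENT133 Enallagma carunculatum.jpg", 1200, 950)]
print(build_manifest(batch))
```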
Step 3: OCR Using Google Cloud Vision API
We ran all stitched images (each containing 10 specimen images) through the Google Cloud Vision API to perform OCR. In addition to the text of each image, the API output the coordinates (bounding boxes) of each cluster of words. To maintain specimen-level traceability through the stitching process, we included code that retained the file name and file size of each specimen. Both attributes were critical in mapping the stitched outputs back to their original (un-stitched) forms.
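Pulling text and bounding boxes out of the OCR result can be sketched as follows. The dict below is a hand-made stand-in shaped loosely after the Vision API's JSON (whose first `textAnnotations` entry holds the full detected text, followed by per-word entries); it is not a live API call, and the helper name is ours:

```python
# Mocked response, loosely shaped after a Cloud Vision text_detection
# result; values are invented for illustration.
mock_response = {
    "textAnnotations": [
        # First entry: the full text of the (stitched) image.
        {"description": "Yosemite",
         "boundingPoly": {"vertices": [{"x": 5, "y": 10}, {"x": 300, "y": 10},
                                       {"x": 300, "y": 60}, {"x": 5, "y": 60}]}},
        # Subsequent entries: individual word clusters.
        {"description": "Yosemite",
         "boundingPoly": {"vertices": [{"x": 5, "y": 10}, {"x": 90, "y": 10},
                                       {"x": 90, "y": 30}, {"x": 5, "y": 30}]}},
    ]
}

def word_boxes(response):
    """Return (word, top_y) pairs for each detected word cluster,
    skipping the first annotation, which holds the full text."""
    out = []
    for annotation in response["textAnnotations"][1:]:
        top_y = min(v["y"] for v in annotation["boundingPoly"]["vertices"])
        out.append((annotation["description"], top_y))
    return out

print(word_boxes(mock_response))
```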
Step 4: Deconstructing the OCR Output
Each stitched image had up to 10 constituent specimen images. Using the file sizes, we identified the bounding boxes of each constituent specimen image (within the stitched files) and mapped it to its original file name, which contained the specimen ID and scientific name. From there, we parsed the text outputs so that they coincided with their constituent images. By stitching and then deconstructing the specimen images, we mimicked the output we would have received by running OCR individually on each specimen image… at a tenth of the cost.
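Conceptually, the deconstruction reduces to an interval lookup: each detected word's y-coordinate falls inside exactly one constituent image's vertical band. A minimal sketch, with illustrative names and synthetic data rather than the pipeline's actual code:

```python
def assign_words(word_positions, bands):
    """word_positions: (word, y) pairs from OCR on the stitched image.
    bands: (filename, y_start, y_end) triples for each constituent image.
    Returns a mapping of original file name -> words detected inside it."""
    grouped = {filename: [] for filename, _, _ in bands}
    for word, y in word_positions:
        for filename, y_start, y_end in bands:
            if y_start <= y < y_end:
                # Word falls inside this constituent image's band.
                grouped[filename].append(word)
                break
    return grouped

bands = [("EMEC8626 Aphthargelia symphoricarpi.jpg", 0, 800),
         ("SBMNHENT133 Enallagma carunculatum.jpg", 800, 1750)]
words = [("Yosemite", 120), ("J.Smith", 950), ("coll.", 955)]
print(assign_words(words, bands))
```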
Step 5: Spell Check to Flag Poor Outputs
In certain instances, we came across specimen images that were not easily readable due to factors like text blurring, superimposed labels, and illegible handwriting. As a result, we anticipated the OCR results for certain specimen images would not be decipherable or usable. To strengthen the quality of extracted results, we incorporated an in-process spell check that flags unusable outputs based on pre-set Levenshtein distance thresholds. If any specimen image fails the spell check, the user is notified that the OCR output does not meet the specified robustness standards and requires manual inspection. The spell check step has been introduced as a quality control measure to bring attention to potentially inaccurate or faulty transcriptions.
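The flagging logic can be sketched as below. The thresholds and vocabulary are illustrative stand-ins, not the pipeline's actual settings, and the edit-distance function is the textbook dynamic-programming formulation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def passes_spell_check(words, vocabulary, max_dist=2, min_pass_ratio=0.5):
    """Treat an OCR output as usable when enough of its words fall within
    the edit-distance threshold of some vocabulary word; otherwise flag
    it for manual inspection. Thresholds here are illustrative."""
    passed = sum(1 for w in words
                 if any(levenshtein(w.lower(), v) <= max_dist
                        for v in vocabulary))
    return passed / len(words) >= min_pass_ratio
```

Outputs that return False would trigger the notification that the transcription requires manual inspection.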
Step 6: Attribute Extraction
We extracted five attributes from each specimen image: specimen ID, scientific name (genus + species), name of the collector, date of specimen collection, and location of the specimen at the point of collection. Because each image file name contained the corresponding specimen ID and scientific name (e.g., EMEC8626 Aphthargelia symphoricarpi.txt or SBMNHENT133 Enallagma carunculatum.txt), we performed a direct lookup and parsed the information accordingly.
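The file-name lookup amounts to splitting on the first space. A minimal sketch (the function name is ours for illustration):

```python
import re

def parse_filename(name):
    """Split a file name like 'EMEC8626 Aphthargelia symphoricarpi.txt'
    into its specimen ID and scientific name (genus + species)."""
    stem = re.sub(r"\.[^.]+$", "", name)            # drop the extension
    specimen_id, _, scientific_name = stem.partition(" ")
    return specimen_id, scientific_name

print(parse_filename("EMEC8626 Aphthargelia symphoricarpi.txt"))
```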
Extracting the collector name from each image required more creativity. When we first performed the scraping, we had 50,000 databased images at our disposal. Many collector names recurred frequently, so we pulled the collector names from the databased images and performed a global lookup on the 400,000 non-databased images to check for matches. The lookup method identified the collector name for 56% of our specimen images. Taking it a step further, we observed that the names on the specimen images were often followed by indicator words like “coll.” or “collector”. We then used regex to define a search pattern that pulls any names preceding the indicator words, which identified the collector name for an additional 22% of our specimen images.
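The indicator-word idea can be sketched with a deliberately narrow pattern. This one only matches names written as initials followed by a surname; the pipeline's actual patterns were broader, and the label text below is invented:

```python
import re

# Capture an initials-plus-surname name immediately preceding an
# indicator word ("coll." or "collector"). Illustrative, not exhaustive.
COLLECTOR_RE = re.compile(r"((?:[A-Z]\.\s*)+[A-Z][a-z]+)\s+(?:coll\.|collector)")

def find_collector(text):
    match = COLLECTOR_RE.search(text)
    return match.group(1) if match else None

print(find_collector("Yosemite Valley CA J. Smith coll. 12-VI-1953"))
```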
For the date of specimen collection, we applied regex similarly, flagging 15 different permutations of date formats and capturing the collection date for 51% of our specimen images. For the remaining 49%, we used a BERT model fine-tuned on the Stanford Question Answering Dataset (SQuAD). We took the identical approach for the location of specimen collection, achieving accurate extraction across 67% of specimen images. Based on stakeholder feedback, 60% of the location extractions provided the desired granularity at the landmark, region, and/or county level.
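To give a flavor of the date permutations, here is a sketch covering three common entomological label formats; the full set of 15 patterns is not reproduced here, and the exact expressions are illustrative:

```python
import re

# Three of the many date permutations, as an illustrative sample:
DATE_PATTERNS = [
    re.compile(r"\b\d{1,2}[-.][IVX]{1,4}[-.]\d{4}\b"),   # 12-VI-1953 (Roman-numeral month)
    re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"),       # June 12, 1953
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),            # 12/06/1953
]

def find_date(text):
    """Return the first substring matching any known date format."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None

print(find_date("J. Smith coll. 12-VI-1953"))
```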
Step 7: TSV Generation
As part of the pipeline, all extracted attributes are written directly to a .tsv file to enable quick and easy information retrieval across the archive. The .tsv file also includes the raw OCR output of each image to maintain traceback to the “raw data” used for the extractions.
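A minimal sketch of the export, using the standard-library csv module with a tab delimiter. The column names are ours for illustration, and the record values are invented:

```python
import csv
import io

# Illustrative column names: one row per specimen, with the raw OCR
# text kept alongside the extracted attributes for traceback.
FIELDS = ["specimen_id", "scientific_name", "collector",
          "collection_date", "location", "ocr_text"]

def write_tsv(records, stream):
    writer = csv.DictWriter(stream, fieldnames=FIELDS, delimiter="\t")
    writer.writeheader()
    writer.writerows(records)

buf = io.StringIO()
write_tsv([{"specimen_id": "EMEC8626",
            "scientific_name": "Aphthargelia symphoricarpi",
            "collector": "J. Smith",
            "collection_date": "12-VI-1953",
            "location": "Yosemite Valley, CA",
            "ocr_text": "Yosemite Valley CA J. Smith coll. 12-VI-1953"}], buf)
print(buf.getvalue())
```

Using DictWriter (rather than hand-joining strings with tabs) means fields that happen to contain tabs or newlines in the raw OCR text are quoted safely.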