Label digitization is a slow process. The most common first step for herbarium sheets is to image the entire sheet. From there, the images with labels can be shipped to transcribers, who gather the content from the labels and type them in the proper fields. Over the last 10 years, Notes from Nature has been a way to ask for help for this work from public participants, and you have helped us collect data from millions of digital specimen records, building databases of specimen data that would not have been possible without your help. We have been thinking about ways to bring in new technologies to help you with this process. We’ve looked at those typewritten labels and thought to ourselves: is there a way to automate some parts of this, leaving the most difficult challenges – such as hand-written labels – to our expert team of label puzzlers?
In 2018, we ran a little experiment to just test the waters, an expedition named Label Babel. We called it that because we wanted to basically figure out if we could automate detection of labels in specimen images. We asked our volunteers to draw rectangles around the main labels and then used those “training” data to see if a computer could build a model using that data to find labels itself. Turns out, this works (give or take).
With funding from the National Science Foundation, we have extended Label Babel into a larger effort called Digi-Leap. Its goal is both very simple and challenging - to speed up label digitization as much as 5-fold faster than its current rates by developing workflows to accelerate specimen digitization to make the data broadly available to museums and stakeholders alike.
How is this going to work? It will be a multi-step process. First, we need to train a machine learning model to find labels on herbarium sheets and to detect the label “type” (handwritten versus typewritten versus both versus non-text, such as barcode). We are going to take typewritten labels and try to use OCR (optical character recognition) approaches to automate capture of content from them. This is hard so we are going to ask you to help us validate quality, to help improve our OCR process. Later we will be developing a new workflow to separate the text extracted from these labels into the different categories of information (date, species name, etc.).
Why do this? Natural history specimens and their data have great use, but they must be digitized to be able to use these critical data in research. The challenge is sheer volume - there are billions of specimens in collections across the nation and globe that need to be fully digitized. Even after years of effort, it is critical that we implement new strategies to help speed up this process. We are hoping that connecting human intelligence and machine intelligence gives us a shot to speed up the process, effectively letting computers do some of the easier things, and saving the really interesting things for us humans to tackle.
Digi-Leap Related Blogs: