Our pilot study used 698 images of sets of handwritten characters, which were verified and labeled by Zooniverse volunteers. As with the current iteration of the study, word images are segmented automatically by Google Cloud's Vision API, and volunteers are asked to transcribe the text and label each word as either handwritten or typewritten. (The Zooniverse workflow has been slightly tweaked based on the pilot study feedback and results, but the core tasks are unchanged.)
The pilot study contained two phases: an initial set of demonstration models, followed by a cross-fold validation run.
Six demo models were created, differing only slightly in the random seed and the number of training epochs. Two standard, off-the-shelf model architectures were used, based on the original architecture proposal and a Keras demonstration tutorial. The figures below show the results for models using the latter architecture.
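For readers curious about what those "slight variations" look like in practice, here is a minimal sketch of training several Keras models that differ only in random seed and epoch count. The tiny stand-in architecture, the seed/epoch values, and the dummy data are all hypothetical placeholders, not the pilot's actual configuration.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Toy stand-in for the handwriting-recognition architecture used in the
    # pilot -- the real models follow a Keras OCR-style network.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 128, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Hypothetical seed/epoch combinations -- not the pilot's actual values.
configs = [(1, 30), (2, 30), (3, 50)]

# Dummy data so the sketch runs end to end; the pilot used segmented word images.
x = np.random.rand(64, 32, 128, 1).astype("float32")
y = np.random.randint(0, 10, size=(64,))

models = []
for seed, epochs in configs:
    tf.keras.utils.set_random_seed(seed)  # seeds Python, NumPy, and TensorFlow
    model = build_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x, y, epochs=epochs, verbose=0)
    models.append(model)
```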
Fig. 1: Character Error Rate for six demonstration models
Character Error Rate (CER) is a metric used to describe the difference between two words, on a letter-by-letter level. (For example, the difference between "cat" and "bat" is 1 letter substitution, and the length of the word "cat" is 3, so the CER is 1 / 3 = 0.333)
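For concreteness, one common way to compute a CER is to divide the Levenshtein (edit) distance between the predicted and true text by the length of the true text. The short Python sketch below illustrates this; it is not the evaluation code used in the pilot.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if characters match)
            ))
        prev = curr
    return prev[-1]

def character_error_rate(predicted: str, truth: str) -> float:
    return levenshtein(predicted, truth) / len(truth)

print(character_error_rate("bat", "cat"))  # 1 edit / 3 characters = 0.333...
```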
Lower values for CER indicate fewer errors, and therefore better performance -- Models 2 and 5 perform significantly better than the other four models.
Fig. 2: Exact text matches for six demonstration models
The blue bars indicate the number of words in the test set which were correctly transcribed by a given model. The yellow bars indicate the words which were correctly transcribed except for one letter. (In our previous example, "cat" being transcribed as "cat" would be part of the blue bar, "cat" being transcribed as "bat" would be part of the yellow bar, and "cat" being transcribed as "dog" would not be shown.)
The best possible score would be a blue bar of 69 or 70, indicating all exact matches (perfect transcriptions). As with Fig. 1, models 2 and 5 perform significantly better than the other models.
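One way to tally the two bars is to bucket each prediction by its edit distance from the ground truth: 0 for the blue bar, exactly 1 for the yellow bar. The sketch below reuses the levenshtein() helper from the CER example above; the word pairs are made up for illustration, not drawn from the pilot data.

```python
# Hypothetical (prediction, ground truth) pairs for illustration only.
pairs = [("cat", "cat"), ("bat", "cat"), ("dog", "cat"), ("house", "house")]

# Blue bar: exact matches.
exact = sum(1 for pred, truth in pairs if pred == truth)

# Yellow bar: off by exactly one character (uses levenshtein() defined earlier).
one_off = sum(1 for pred, truth in pairs if levenshtein(pred, truth) == 1)

print(exact, one_off)  # 2 exact matches, 1 near-miss
```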
Cross-fold validation is a technique used to evaluate the "sturdiness" of a machine learning model. Essentially, our image set is divided into 10 groups (each containing 69-70 images), and we create 10 different models -- each image group is used as the test set for one model (the other 9 groups are combined and used for training).
Cross-fold validation is more time-consuming than training a single model, but it gives us valuable information about whether our models are robust, and how noisy our data set might be.
For this cross-fold validation, we used the settings for Model 2 from the previous phase, and then created 10 new models with those settings.
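As a sketch of the splitting step only (not the actual pilot code), scikit-learn's KFold can produce ten train/test partitions like those described above; with 698 images, each test fold contains 69 or 70 of them.

```python
import numpy as np
from sklearn.model_selection import KFold

image_ids = np.arange(698)  # stand-ins for the 698 pilot images

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(image_ids)):
    # Each of the 10 folds holds out 69-70 images for testing and
    # trains on the remaining ~628.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```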
Fig. 3: Character Error Rate for cross-fold models
This chart shows the same metric as Fig. 1, but for the ten cross-fold models -- no fold was significantly (α=0.05) better or worse than the others, but the variation between folds is an indication that our data are fairly noisy. This isn't surprising, considering the very small size of the pilot data set!