Please give us your feedback using this short Google form https://forms.gle/md9DPqAL7WQb2jf68
In 1969 the Royal Botanic Garden Edinburgh (RBGE) started to digitise its plant records on a computer the size of a room, held securely 6 miles off-site at the Scottish Office.
The plan was to find every plant alive at RBGE Inverleith (an estimated 14,000) and link it to an accession card in the card file drawers, so that we could tell where each plant came from and how long it had been in the collection.
Updates to this off-site computer only occurred once a year, so there was a strict deadline for collecting and linking the data. It quickly became apparent that finding the correct cards for the living accessions among the 60,000 cards on file, dating from 1900, was a time-consuming job that could not be finished before the data-entry deadline.
So the plan was shelved and all remaining unlinked living material was given a new 1969 accession number.
Now, with the benefit of digitisation and improved AI natural language techniques, we wish to find the correct provenance of these 1969 accessions by automatically linking the digitised cards to the latest version of the Living Collection database.
We have scanned the accession cards and are now ready to use OCR and machine learning software to “read” them. The regular format of the cards really helps with this, but we need a training dataset, and this is where this citizen science project steps in.
We need approximately three thousand cards transcribed to train the “reading” software.
There will be two options: the cards can be read as a whole and all the required data transcribed, or the individual segmented data boxes can be transcribed, as in the accession number workflow above or the genus workflow below.
There are different projects for different types of transcriber: the speedster with a liking for quick data entry will like the atomised cards with recurring data, while the whole-record purist will prefer to transcribe the whole card.
We currently have 37,000 scanned archive cards waiting to be read by the machine (about 23,000 more are awaiting scanning). The eight workflow datasets completed in this citizen science project will allow us to train the machine learning (ML) transcription algorithm. For each of the eight workflows, we train a convolutional neural network on that workflow's hand-labelled subset of cells. We estimate an initial need for 3000 subjects per workflow to train the machine. We then predict on all the cells, using dictionaries from our existing collection management software.
The cards span several decades. As a result, half the cards share one format and the other half are a mix of a few other formats, which we separate using a layout classifier (already trained). We currently source all images for classification from the single most prevalent (and most recent) format. For the remaining layouts, our models will likely require fine-tuning with a further, but typically smaller, amount of training data as we “transfer learn” from the initial volunteer effort of 3000 subjects. We estimate that an additional 1500 classifications will be needed to fine-tune the models for the other card layout types. Thus we expect 4500 labelled images (from two replicates) per workflow to suffice to predict on the total card data.
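As an illustration of how a dictionary from the collection management software could support prediction, here is a minimal Python sketch that snaps a raw machine reading of a genus cell to the closest known name. This is only one plausible approach and not necessarily how our pipeline uses its dictionaries: the genus list, the function name `snap_to_dictionary` and the similarity cutoff of 0.8 are all assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical dictionary of genus names. In practice the list would be
# exported from the collection management software.
GENUS_DICTIONARY = ["Rhododendron", "Primula", "Meconopsis", "Sorbus", "Betula"]


def snap_to_dictionary(raw_text, dictionary, cutoff=0.8):
    """Return the dictionary entry closest to raw_text, or None if no
    entry is at least `cutoff` similar (assumed threshold)."""
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {entry.lower(): entry for entry in dictionary}
    matches = get_close_matches(
        raw_text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


if __name__ == "__main__":
    # A reading with one wrong letter is corrected to the known genus.
    print(snap_to_dictionary("Rhododendrom", GENUS_DICTIONARY))
    # A reading that resembles nothing in the dictionary is left unmatched.
    print(snap_to_dictionary("xyz", GENUS_DICTIONARY))
```

Readings that match nothing in the dictionary are returned as `None` rather than forced to the nearest name, so they could be set aside for a human to check.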
You will be able to follow our progress as we update the results section with each completed workflow. We currently aim for more than 96% confidence in the machine transcription before we define a workflow as complete. It is possible that we will do even better.
This project lays the foundation for what we envision becoming part of a wave of digitisation of currently untranscribed handwritten biodiversity knowledge across the globe, as we explore the potential of feeding this pipeline with additional index cards from other collections-based botanical institutions. We are so excited about this particular type of handwritten archive because it contains species-specific information on the ages of individual plants. Plants vary greatly in lifespan, and many species outlive us humans. It is thus safe to say that the longer a plant lives, the less we know about its complete life history. Botanic gardens have kept track of individuals across many human generations. Access to these historic survival records, which cover the majority of life forms in the plant kingdom, is therefore vital in our attempt to paint a picture, from the plants' perspective, of time, space, provenance, and the ability to adapt on a changing planet.