Our understanding of forest ecosystems and their response to climate change relies on consistent long-term observations which provide a baseline or reference. Yet observing and measuring tropical plant species and the climatic (weather) conditions in which they reside is demanding, particularly in the central Congo Basin. Established long-term observation programs are therefore rare. In terms of meteorological observations, the central Congo Basin is currently represented by only a few rain gauges, which limits climate forecasts across the Congo Basin and the central African continent. This lack of long-term (historical) climatological data leaves the central Congo Basin spatially and temporally under-represented in both climatological and ecological (model) studies. However, old climate records could provide valuable information about previous growing conditions of the forest.
Large amounts of ecological and climatological data, covering approximately five decades (~1910 – 1960), exist as unexplored heritage, stored in various Belgian federal archives and collections. As part of a larger project called Congo Basin eco-climatological data recovery and valorization (COBECORE, www.cobecore.org), the "Jungle Weather" project needs your help to transcribe historical climatological data as measured throughout the Congo Basin. Due to the large volumes of data and the limited options to automate transcription with optical character recognition (OCR) techniques, your help is crucial to the full transcription of these historical records. Although various projects aim to improve the automation of transcription tasks, e.g. the European DARIAH project, the transcription of tables of handwritten data remains a challenge to date. The transcribed data will therefore give insight into the climate of the central African rainforest, complement the completed Jungle Rhythms Zooniverse project, and contribute to machine learning training datasets in order to automate future transcription efforts.
Within this project we will focus on data recorded throughout the tropical part of what is currently the Democratic Republic of the Congo (DRC). The area which we will cover is shown above in the map as an open polygon. The project will not cover the southern province of Katanga (red crosshatches), as the climate there transitions from tropical to humid subtropical.
The historical data is archived and stored in the Belgian State Archives. The Belgian State Archives harbour almost all data regarding colonial affairs, ranging from communications about trade to the raw data digitized within the context of the Jungle Weather project. Row upon row of data is stored in the basement. Below you see a part of the INEAC (Institut National pour l’Etude Agronomique du Congo belge) archive, which holds all climatological records.
These climatological records were noted rigorously on carbon copy paper. However, due to the handwritten nature of the data (and the volume involved), fully automated processing is not possible. Although optical character recognition (OCR) works wonderfully on printed data, the high variability in characters and the low-contrast pencil markings contribute to the failure of current automated approaches. Similar to the Old Weather project, and in the spirit of the Jungle Rhythms project, a keen eye is required to decipher the numbers written down on these sheets.
The project will provide you, citizen scientists, with digital pictures of the original sheets. Scanning these climate data sheets was a laborious process: in total, more than 70 000 records were digitized. Unlike the Old Weather project, we do not require you to outline valid sections of the sheet, as this part of the processing has been automated. Below you see an example of the automated machine learning (ML) based screening, visualized for one particular table. The light blue pixels represent those of a template we use to figure out where valuable data is in the table, red/pink pixels represent those of the table data, blue pixels show agreement between the template and the matched table and, finally, white crosses indicate empty cells as predicted through ML.
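The colour coding of such a template overlay can be illustrated with a small sketch. This is not the project's screening code; it only assumes that the template and the scanned table are available as equally sized grids of booleans (True where a line or mark is present). The function names `classify_pixel` and `agreement_fraction` are hypothetical.

```python
from typing import List

Grid = List[List[bool]]


def classify_pixel(in_template: bool, in_table: bool) -> str:
    """Label one pixel the way the overlay colours are described."""
    if in_template and in_table:
        return "agreement"      # drawn in blue
    if in_template:
        return "template_only"  # drawn in light blue
    if in_table:
        return "table_only"     # drawn in red/pink
    return "background"


def agreement_fraction(template: Grid, table: Grid) -> float:
    """Fraction of template pixels that are also present in the table."""
    template_pixels = 0
    matched = 0
    for template_row, table_row in zip(template, table):
        for in_template, in_table in zip(template_row, table_row):
            if in_template:
                template_pixels += 1
                if in_table:
                    matched += 1
    # Avoid dividing by zero for an empty template (an assumed convention).
    return matched / template_pixels if template_pixels else 0.0
```

A high agreement fraction would suggest that the template is well aligned with the scanned table; which threshold counts as a good match is left open here.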
Once digitized and properly aligned, the whole record will be divided into an estimated 30 million cells, a substantial number of which are empty. Only rows which are flagged as completely empty will be omitted from transcription. This routine saves a considerable amount of time and effort and lowers the number of actual values to transcribe to roughly 10 million cells.
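The row-skipping rule can be sketched as follows. This is a minimal illustration, assuming the ML screening yields, for each table, a grid of booleans in which True marks a cell predicted to be empty; the names `rows_to_transcribe` and `count_cells` are hypothetical and not part of the project's pipeline.

```python
from typing import List


def rows_to_transcribe(empty_mask: List[List[bool]]) -> List[int]:
    """Return the indices of rows with at least one non-empty cell.

    A row is skipped only when every one of its cells is flagged empty.
    """
    return [i for i, row in enumerate(empty_mask) if not all(row)]


def count_cells(empty_mask: List[List[bool]], rows: List[int]) -> int:
    """Count all cells (empty or not) in the rows that are kept."""
    return sum(len(empty_mask[i]) for i in rows)


# Toy table: rows 0 and 2 are completely empty and are dropped.
mask = [
    [True, True, True],
    [False, True, True],
    [True, True, True],
    [False, False, True],
]
kept = rows_to_transcribe(mask)
print(kept, count_cells(mask, kept))
```

Note that a kept row may still contain individual empty cells; under this rule only fully empty rows are removed, which errs on the side of not discarding data.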
The cells are padded, so they might include parts of their neighbours. This is needed to ensure that all data is included in case the handwritten numbers do not stick to the nice layout of the table. As with all citizen science projects, your feedback is essential in tracking strange cases and transcribing the correct data. In the example below, the data (on the bottom line) reads 10.1.
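Padding a cell amounts to growing its bounding box by a margin while staying inside the scanned image. The sketch below shows one way this could be done; the pixel coordinates, the padding value and the function name `padded_box` are assumptions for illustration, not the values used in the project.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def padded_box(box: Box, pad: int, image_size: Tuple[int, int]) -> Box:
    """Grow a cell's bounding box by `pad` pixels on every side.

    The result is clamped to the image so that cells on the edge of the
    sheet do not extend beyond the scan.
    """
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


# A cell in the middle of a 1000 x 800 pixel sheet, padded by 10 pixels.
print(padded_box((100, 50, 160, 80), 10, (1000, 800)))
```

A larger padding catches more stray handwriting but also shows more of the neighbouring cells, so the margin is a trade-off between completeness and clarity.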
In addition to data screening, data recovery will also rely on machine learning. Data will be transcribed in sections, using a subset of the complete dataset to provide training data for these automated algorithms. Despite the flexibility and power of these algorithms, large training datasets are required to represent the variety of handwritten numbers in this large dataset.
Two workflows will be provided. The Transcribe Climate Data workflow will cover the measurements. The Transcribe Meta-Data workflow will complement these data with additional information on the site's ID number and the month and year of observation. You are free to choose where to contribute. The meta-data workflow includes multiple steps and is probably less easy to execute on a tablet or smartphone. Transcribing measurements from cells is easy and lends itself better to use on mobile devices.