Hello Genome Detectives. Thank you again for your incredible efforts. We have posted an update on the project results tab. We are taking a break while we develop the next iteration of Genome Detectives, but have left you a data set here for practice only – PLEASE NOTE THESE DATA ARE ALREADY COMPLETE. If you are ready for more of a challenge, please check out our new 'Training Academy' website. To browse other active projects that still need your classifications, check out Zooniverse projects.
DNA sequencing has become more affordable and available in recent years, meaning genomes are being sequenced in increasing numbers. This has led to a "big data" problem: the issue is not in generating the data but in processing and interpreting it to produce meaningful results. The DNA sequences require annotation (or naming) of the genes and gene variants (or alleles) for scientists and doctors to study. This is largely done by automated programmes, but occasionally (as you will see in the tasks) there are aspects that cannot yet be programmed.
The computer programmes for annotating the genes and gene variants in PubMLST.org can identify certain DNA sequences based on their similarity to those already held in the database (those sequences that have been "seen" before). However, there are new variants that have not been identified due to the very large diversity of bacteria (remember bacteria have been around for 3 billion years), so the automated programme will not detect these. Therefore, there is a risk this information is lost if we do not identify these new gene variants. This is why the Zooniverse members are being asked to help us in this search for new gene variants. In order to build better computer programmes for this type of work, machine learning approaches need “training data” so they can learn how to identify new genetic variants; and this is what we hope to obtain from this Zooniverse community curation project.
In the Field Guide you will find four "Mini-quizzes" that you can work through with one of our scientists, so that you can check how and why certain features are identified. We have provided detailed task-related explanations in the Tutorial and Field Guide, as well as more biological background in the About-Education section, so please review these carefully. If you have further questions about the tasks, please use the Talk boards. We will be happy to chat with you and provide more supporting material if necessary.
Decoding DNA in this project involves a detailed review of the DNA sequence and an understanding of the specific features we are hunting for. We would love for you to have a go at this and contribute to the analysis of the genes, but understand that you may be uncertain that you are doing it correctly. When we assess the output of the Zooniverse project, for each subject we look at the consensus of the classifications. With multiple citizen scientists analysing each gene, we see that the majority of people pick the same answer and only a few outliers disagree. The subject is then quickly reviewed by a scientist to check if they agree with the consensus and then the gene is annotated. So do not worry if you think you made a mistake, or if it takes you some time to get into the task. Be assured that the power of the Zooniverse is the strength in numbers!
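For the curious, the idea of a consensus can be sketched in a few lines of Python. This is only an illustration of majority voting, not the project's actual analysis pipeline; the function name `consensus` and the answer labels are assumptions made up for this example.

```python
from collections import Counter

def consensus(classifications):
    """Return the most common answer and the fraction of volunteers who gave it.

    classifications -- list of answers for one subject (hypothetical labels).
    Ties are not handled specially: the answer seen first wins.
    """
    if not classifications:
        raise ValueError("no classifications for this subject")
    counts = Counter(classifications)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(classifications)

# Three volunteers agree, one disagrees.
print(consensus(["yes", "yes", "no", "yes"]))  # ('yes', 0.75)
```

A single outlier barely moves the result, which is why an occasional mistake by one volunteer does not change the final annotation.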
There are many different parameters that the automated program uses to find genes. If the DNA sequence has undergone changes, there may not be a start codon where the program expects to find one. A mutation, insertion, or deletion may have altered the original start codon so that it is no longer present. Your task is to search up and down 50 bases from the beginning of the yellow highlighted DNA sequence to see if you find a start codon (ATG, GTG, TTG, or CTG) that the computer algorithm missed. The rest of the gene sequence may be intact, but without a start codon the DNA code will not be read and so a protein will not be made. In your task, you should mark this gene "no start codon" if you do not find one.
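The search described above can be written as a short Python sketch. It is not the program used by PubMLST.org; the function name, the 0-based `gene_start` position, and the example sequence are all assumptions chosen for illustration.

```python
START_CODONS = {"ATG", "GTG", "TTG", "CTG"}

def find_start_codons(sequence, gene_start, window=50):
    """Return (position, codon) pairs for start codons near gene_start.

    sequence   -- DNA string, e.g. "ACGT..."
    gene_start -- 0-based index of the first highlighted base (assumed input)
    window     -- how many bases to search on either side
    """
    sequence = sequence.upper()
    lo = max(0, gene_start - window)
    hi = min(len(sequence) - 3, gene_start + window)
    hits = []
    for i in range(lo, hi + 1):
        codon = sequence[i:i + 3]
        if codon in START_CODONS:
            hits.append((i, codon))
    return hits

# Toy example: the highlighted region begins at index 6, and an ATG sits 3 bases upstream.
print(find_start_codons("CCCATGCCC", gene_start=6))  # [(3, 'ATG')]
```

An empty list corresponds to answering "no start codon" in the task.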
There are many different parameters that the automated computer program uses to find genes. If the DNA sequence has undergone changes, there may not be a stop codon where the program expects to find one. The DNA could have changed due to a mutation, insertion, or deletion of the DNA affecting the original stop codon and therefore it is no longer present. Your task is to search up and down 50 bases from the end of the yellow highlighted DNA sequence to see if you find a stop codon (TGA, TAA, or TAG) the computer algorithm missed. The rest of the gene sequence may be intact, but without a stop codon the DNA code will continue to be read and a protein made, but not as originally intended. In your task, you should mark this gene "no stop codon" if you do not find one.
This is the symbol for a stop codon in the amino acid sequence encoded by the bases TGA, TAA, or TAG. The stop codons do not make an amino acid, so when one of these three sequences comes along, there are no further amino acids added to the chain and so it terminates and forms a protein.
Bacteria can read the DNA sequence starting at any position. Depending on which letter is read first, a different series of codons will be read. Consider the sequence CCCAAAGGGTTTCCC:
Frame 1 (F1): If you start at position 1 the codons are "CCC" "AAA" "GGG" "TTT" "CCC"
Frame 2 (F2): If you start at position 2 the codons are "CCA" "AAG" "GGT" "TTC"
Frame 3 (F3): If you start at position 3 the codons are "CAA" "AGG" "GTT" "TCC"
Since different codons can code for different amino acids, each of the three reading frames above would produce a different amino acid chain.
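If you would like to try this on other sequences, the three frames can be reproduced with a small Python sketch. The function name `codons_in_frame` is made up for this example; only complete three-base codons are returned.

```python
def codons_in_frame(sequence, frame):
    """Split a DNA string into complete codons, starting at position `frame` (1, 2, or 3)."""
    start = frame - 1  # convert the 1-based frame to a 0-based index
    return [sequence[i:i + 3] for i in range(start, len(sequence) - 2, 3)]

sequence = "CCCAAAGGGTTTCCC"
for frame in (1, 2, 3):
    print(f"F{frame}:", " ".join(codons_in_frame(sequence, frame)))
```

Frame 1 uses all 15 bases and so yields five codons (the last is CCC), while frames 2 and 3 leave trailing bases that do not form a complete codon.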
When N's appear in the string of bases, it means that the base (A, C, T, G) cannot be determined through the sequencing and assembly approach performed. Therefore the bases are unknown for this region. If this occurs at the start or end of the gene you will not be able to determine the START or STOP codons, so just hit "no" when these questions are asked in the task and you will be given the next subject to classify.
When N's appear in the string of bases, there is not a complete codon to be translated into an amino acid, so X is used to denote the unknown amino acid.
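The way N's turn into X's can be illustrated with a minimal translation sketch in Python. It assumes the standard codon-to-amino-acid assignments written as one-letter codes, and it uses "*" for a stop codon, which is a common convention but an assumption here. Any codon containing an N is simply reported as X.

```python
from itertools import product

# Standard genetic code, listed in the order TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(sequence):
    """Translate DNA codon by codon; codons with unknown bases (N) become X."""
    sequence = sequence.upper()
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)

# ATG -> M, GGT -> G, NNN -> X, TAA -> stop (*)
print(translate("ATGGGTNNNTAA"))  # MGX*
```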
If you find more than one start codon in your search up and down from the beginning of the yellow highlighted gene sequence, try to identify one that is in the same frame as the rest of the gene. You can do this by finding the stop codon (highlighted in red at the end of the yellow highlighted sequence) and working out which frame it is in (F1, F2, or F3). If none of the new start codons you find are in the same frame as the stop codon, then answer "yes" you found one, but please add a comment in the "Talk" section giving the frame (F1, F2, or F3) and the codon (ATG, GTG, TTG, or CTG) for the subject and we will investigate further.
If you find more than one stop codon in your search up and down from the end of the yellow highlighted gene sequence, try to identify one that is in the same frame as the rest of the gene. You can do this by finding the start codon (highlighted in green at the beginning of the yellow highlighted sequence) and working out which frame it is in (F1, F2, or F3). If the new stop codons you find are in the same frame as the start codon, identify the one that will give you the shortest gene and then answer "yes" you found one, but please add a comment in the "Talk" section giving the frame (F1, F2, or F3) and the codon (TAA, TAG, or TGA) for the subject and we will investigate further. If none are in the correct reading frame, answer "no".
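Checking whether a stop codon shares a frame with the start codon comes down to asking whether the distance between them is a multiple of three. The Python sketch below illustrates this under stated assumptions: positions are 0-based, `gene_end` is the index just past the last highlighted base, and the function name is invented for this example.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}

def in_frame_stops(sequence, gene_start, gene_end, window=50):
    """List positions of stop codons near gene_end that share the start codon's frame.

    gene_start -- 0-based index of the first base of the start codon
    gene_end   -- 0-based index just past the last highlighted base
    Positions are returned in ascending order, so the first one gives the shortest gene.
    """
    sequence = sequence.upper()
    lo = max(gene_start + 3, gene_end - window)
    hi = min(len(sequence) - 3, gene_end + window)
    hits = []
    for i in range(lo, hi + 1):
        same_frame = (i - gene_start) % 3 == 0
        if same_frame and sequence[i:i + 3] in STOP_CODONS:
            hits.append(i)
    return hits

# TGA at index 6 is in frame with the ATG at index 0; TAA at index 11 is not.
print(in_frame_stops("ATGCCCTGACCTAAG", gene_start=0, gene_end=9))  # [6]
```

An empty result corresponds to answering "no" because no stop codon lies in the correct reading frame.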
Stop codons always result in termination of the amino acid sequence. When we ask about stop codons at the end of the gene, we are referring to those that would normally end the gene sequence to produce a protein of full length. When we ask about Additional stop codons, we are referring to those in the middle of the gene, that disrupt the gene and the resultant protein production.
The DNA sequence can change when errors occur in the replication process, meaning the wrong base (C, A, G, or T) is placed in the sequence. We call these ‘DNA mutations’. DNA can also change by the insertion or deletion of DNA bases in the sequence; these may be short or long. Bacteria are also able to insert large sections of DNA from other sources, and this may provide advantages, e.g. resistance to antibiotics.
As the DNA sequence changes, there is potential for the amino acid sequence to change as well. If the DNA sequence change is simply a single mutation, this may not lead to a change in amino acid, as some similar codons (e.g. GGT, GGA, GGC, GGG) produce the same amino acid (in this example glycine). However, if the DNA sequence change results in an amino acid change (e.g. GGT changed to AGT, which means serine is produced instead of glycine), this may change the function of the protein.
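The two examples above can be checked with a short Python sketch. It assumes the standard codon-to-amino-acid assignments in one-letter codes (G is glycine, S is serine); the function name `mutation_effect` is made up for this illustration.

```python
from itertools import product

# Standard genetic code, listed in the order TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def mutation_effect(codon_before, codon_after):
    """Return (amino acid before, amino acid after, whether the amino acid changed)."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return before, after, before != after

print(mutation_effect("GGT", "GGA"))  # ('G', 'G', False): still glycine
print(mutation_effect("GGT", "AGT"))  # ('G', 'S', True): glycine becomes serine
```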