Read trimming of chloroplast genome sequences may lead to better quality assemblies
Document Type
Oral Presentation
Campus where you would like to present
SURC 137B
Start Date
16-5-2013
End Date
16-5-2013
Abstract
Obtaining the DNA sequence of an entire genome requires the computational assembly of many short segments of DNA sequence data, called reads. Reads are pieced together wherever their sequences overlap. For example, if one read were ACGTTTCGAT and another were TCGATCACTG, they would be ‘overlapped’ because the last five nucleotides of the first read and the first five nucleotides of the second are identical. In reality, reads are longer (80-800 nucleotides, depending on the sequencing method) and there can be millions of them. Because all genomes contain some repeated sequences, and because reads can contain sequencing errors, assembly programs can misassemble reads. These misassemblies can be hard to detect and, when present, lead to an incorrect picture of gene order in the genome. We are investigating the effect of trimming reads by length or by quality before the assembly step, to see whether removing lower-quality sequence data improves assembly quality. We are using 80 bp reads generated for the chloroplast genomes of a family of conifers. So far, we have trimmed by length only, comparing untrimmed 80 bp reads with reads trimmed to 70 bp and to 60 bp. For each of the six species tested, assemblies built from the most extensively trimmed reads (60 bp) produced the best outcomes.
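The overlap example and the length trimming described in the abstract can be illustrated with a minimal Python sketch. It is not the assembly or trimming software used in the study, which the abstract does not name. The function names are hypothetical, and trimming from the end of the read (keeping the first bases) is an assumption, since the abstract does not say which end was trimmed.

```python
def suffix_prefix_overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left: str, right: str) -> str:
    """Join two reads, writing the shared overlap only once."""
    k = suffix_prefix_overlap(left, right)
    return left + right[k:]


def trim_to_length(read: str, length: int) -> str:
    """Keep the first `length` bases (assumption: trimming is from the read's end)."""
    return read[:length]


# The two reads from the abstract's example.
read_a = "ACGTTTCGAT"
read_b = "TCGATCACTG"
overlap = suffix_prefix_overlap(read_a, read_b)   # 5 shared nucleotides
merged = merge_reads(read_a, read_b)              # "ACGTTTCGATCACTG"

# Length trimming as compared in the study: 80 bp reads cut to 70 bp and 60 bp.
example_read = "ACGT" * 20                        # a made-up 80 bp read
trimmed_70 = trim_to_length(example_read, 70)
trimmed_60 = trim_to_length(example_read, 60)
```

Real assemblers compare millions of reads and tolerate mismatches, so this exact-match comparison only illustrates the principle of overlap.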
Recommended Citation
McFadden, Angela, "Read trimming of chloroplast genome sequences may lead to better quality assemblies" (2013). Symposium Of University Research and Creative Expression (SOURCE). 64.
https://digitalcommons.cwu.edu/source/2013/oralpresentations/64
Additional Mentoring Department
Biological Sciences
Faculty Mentor(s)
Linda Raubeson