Read trimming of chloroplast genome sequences may lead to better quality assemblies

Presenter Information

Angela McFadden

Document Type

Oral Presentation

Campus where you would like to present

SURC 137B

Start Date

16-5-2013

End Date

16-5-2013

Abstract

Obtaining the DNA sequence of an entire genome requires the computational assembly of many short segments of DNA sequence data, called reads. Reads are pieced together where their sequences overlap. For example, the reads ACGTTTCGAT and TCGATCACTG overlap because the last five nucleotides of the first are identical to the first five nucleotides of the second. In practice, reads are longer (80-800 nucleotides, depending on the sequencing method) and there can be millions of them. Because all genomes contain some repeated sequences and because reads can contain sequencing errors, assembly programs can misassemble reads. These misassemblies can be hard to detect and, when present, lead to an incorrect understanding of gene order in the genome. We are investigating whether trimming reads by length or by quality before the assembly step, and so removing lower-quality sequence data, improves assembly quality. We are using 80 bp reads generated for the chloroplast genomes of a family of conifers. So far, we have trimmed by length, comparing untrimmed reads with reads trimmed to 70 bp and 60 bp. For each of six species tested, assemblies using the most extensively trimmed reads produce the best outcomes.
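
The overlap example and the length-trimming step described in the abstract can be illustrated with a minimal Python sketch. This is not the assembly software or trimming tool used in the study, which the abstract does not name. The function names (`longest_overlap`, `merge_reads`, `trim_read`) are hypothetical, and the sketch assumes that trimming keeps the first N bases of each read, since the abstract does not say which end was trimmed.

```python
def longest_overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first, then shorter ones.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left: str, right: str) -> str:
    """Join two reads, writing their overlapping bases only once."""
    k = longest_overlap(left, right)
    return left + right[k:]


def trim_read(read: str, length: int) -> str:
    """Trim a read to at most `length` bases.

    Assumption: the first `length` bases are kept (the end of the read
    is removed); the abstract does not specify which end was trimmed.
    """
    return read[:length]


if __name__ == "__main__":
    a, b = "ACGTTTCGAT", "TCGATCACTG"   # the two reads from the abstract
    print(longest_overlap(a, b))        # 5 (shared bases: TCGAT)
    print(merge_reads(a, b))            # ACGTTTCGATCACTG

    read_80bp = "ACGT" * 20             # made-up 80 bp read for illustration
    for n in (70, 60):
        print(n, len(trim_read(read_80bp, n)))
```

Real assemblers must also tolerate sequencing errors and resolve repeated sequences, which exact suffix-prefix matching like this does not handle.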

Faculty Mentor(s)

Linda Raubeson

Additional Mentoring Department

Biological Sciences

May 16th, 1:10 PM May 16th, 1:30 PM
