Improving Genome Annotation with RNA-seq Data

With the advent of next-generation sequencing, researchers can now investigate the genomes of species and individuals in unprecedented detail. Each part of the genome has its own function. Annotation is the process of identifying these parts and their functions.
Deep RNA sequencing (RNA-seq) has emerged as a revolutionary technology for transcriptome analysis and is now widely used to annotate genes. We designed two transcript assemblers, CLASS and CLASS2, to better detect alternative splicing events and to find novel transcripts from RNA-seq data. As sequencing costs drop, experiments now routinely include multiple RNA-seq samples to improve the power of statistical analyses. We took advantage of multiple samples in a new software tool, PsiCLASS, which assembles multiple RNA-seq samples simultaneously and significantly improves performance over the traditional ‘assemble-and-merge’ model.
For many alignment and assembly applications, sequencing errors can confound downstream analyses. We implemented two k-mer-based error correctors, Lighter and Rcorrector, for whole-genome sequencing data and for RNA-seq data, respectively. Lighter was the first k-mer-based error corrector that does not count k-mers, and it is much faster and more memory-efficient than other error correctors while achieving comparable accuracy. Rcorrector searches the de Bruijn graph for the path closest to the current read, using local k-mer thresholds to determine trusted k-mers. Rcorrector measurably improves de novo assembled transcripts, which is critical for annotating species without a high-quality reference genome. A newly assembled genome is typically highly fragmented, which makes it difficult to annotate. Contiguity information from paired-end RNA-seq reads can be used to connect multiple disparate pieces of a gene. We implemented this principle in Rascaf, a tool for assembly scaffolding with RNA-seq read alignments. Rascaf is highly practical and has improved sensitivity and precision compared with traditional approaches that use de novo assembled transcripts. Overall, this collection of algorithms, methods and tools represents a powerful and valuable resource that can be readily and effectively used in any genome sequencing and annotation project and for a wide range of transcriptomic analyses.
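
To make the idea of "trusted k-mers" concrete, the following is a minimal Python sketch of generic k-mer spectrum error correction. It is not the algorithm used by Lighter or Rcorrector: it counts k-mers explicitly (which Lighter avoids), applies a single global threshold (Rcorrector uses local thresholds and a de Bruijn graph path search), and handles only substitutions. The function names, the value of `k`, and the threshold `min_count` are all assumptions chosen for illustration.

```python
from collections import Counter

BASES = "ACGT"


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trusted_kmers(reads, k, min_count):
    """Treat a k-mer as trusted if it occurs at least min_count times (assumed global threshold)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def n_untrusted(read, k, trusted):
    """Count the k-mers of read that are not in the trusted set."""
    return sum(km not in trusted for km in kmers(read, k))


def correct_read(read, k, trusted):
    """Greedily apply single-base substitutions that reduce the number of untrusted k-mers."""
    best = read
    best_bad = n_untrusted(read, k, trusted)
    improved = True
    while best_bad > 0 and improved:
        improved = False
        current = best
        # Try every single-base substitution of the current read; keep the best one.
        for i in range(len(current)):
            for b in BASES:
                if b == current[i]:
                    continue
                candidate = current[:i] + b + current[i + 1:]
                bad = n_untrusted(candidate, k, trusted)
                if bad < best_bad:
                    best, best_bad, improved = candidate, bad, True
    return best


if __name__ == "__main__":
    # Toy data: five error-free copies of a sequence and one read with a G->C error.
    reads = ["ACGTACGTGA"] * 5 + ["ACGTACCTGA"]
    trusted = trusted_kmers(reads, k=4, min_count=2)
    print(correct_read("ACGTACCTGA", 4, trusted))
```

In this toy example the erroneous base creates four k-mers that each occur only once, so they fall below the threshold, and substituting the original base restores a read made entirely of trusted k-mers. Real RNA-seq data has highly uneven coverage across transcripts, which is why a single global threshold such as the one assumed here is generally inadequate for that setting.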

Speaker Biography

Li Song is a Ph.D. candidate in the Department of Computer Science at the Johns Hopkins University. He works in the area of computational biology, advised by Dr. Liliana Florea, and is a member of the Center for Computational Biology. He received a B.S. degree from the Computer Science and Technology Department at Tongji University in 2009, and holds M.S. degrees in Computer Science from Michigan Technological University (2011) and in Applied Mathematics and Statistics from the Johns Hopkins University (2017).