Obtaining the complete set of proteins for each eukaryotic organism is an important step toward understanding how life evolves and functions. The complex physiology of eukaryotic cells, however, makes it difficult to observe proteins and their parent genes directly. An organism’s sequenced genome provides the raw data: it contains the ‘hidden’ set of instructions for generating the complete protein set. Computational gene prediction systems therefore play an important role in identifying protein sets from the sequenced genome.
This talk will discuss the problem of computational gene prediction in eukaryotic genomes and present a framework for predicting single-isoform protein-coding genes and overlapping alternatively spliced exons. Incorporating diverse sources of gene structure evidence is shown to improve prediction accuracy substantially, with performance beginning to match that of expert human annotators. The talk also covers alternative exon prediction experiments, which show that alternatively spliced exons in known genes can be predicted accurately without relying on evidence from gene expression data.