DNA microarrays make it possible to measure the expression levels of thousands of genes simultaneously. The immense volume of data produced by microarray experiments, together with the growth of the related literature, presents a major data analysis challenge. Current analysis methods typically cluster genes by their temporal expression patterns, under the hypothesis that similarly expressed genes share a common function. What that function actually is, however, must still be established through further experiments, human expertise, and the published literature. The ultimate goal is to complement existing analysis techniques with an automated system that extracts the relevant information from the literature.
We present a new approach that uses the literature to establish functional relationships among genes on a genome-wide scale. The first part of the talk introduces a new Expectation-Maximization algorithm, which produces sets of PubMed documents sharing a unifying theme, along with a list of terms characterizing each theme. The second part presents a method built on this algorithm, which finds content-based relationships among PubMed abstracts and translates them into functional relationships among genes. Preliminary results from applying this method to a database of documents discussing yeast genes demonstrate the effectiveness of the approach.
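
As a rough illustration of what theme-finding with Expectation-Maximization can look like, the sketch below fits a standard mixture-of-unigrams model to a handful of toy "abstracts". It is not the algorithm presented in the talk, whose details are not given here. The function name `em_themes`, the smoothing constant `alpha`, the round-robin initialization, and the six toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def em_themes(docs, k, iters=20, alpha=0.01, top_n=3):
    """Group tokenized documents into k themes (mixture of unigrams, fitted by EM).

    Illustrative only: a textbook model, not the algorithm from the talk.
    """
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    # Assumption: deterministic round-robin start so the run is reproducible;
    # random restarts are the more common choice in practice.
    resp = [[1.0 if t == i % k else 0.0 for t in range(k)]
            for i in range(len(docs))]
    for _ in range(iters):
        # M-step: smoothed theme priors and per-theme term distributions.
        prior = [(sum(r[t] for r in resp) + alpha) / (len(docs) + k * alpha)
                 for t in range(k)]
        term_prob = []
        for t in range(k):
            weight = {w: alpha for w in vocab}
            for r, c in zip(resp, counts):
                for w, n in c.items():
                    weight[w] += r[t] * n
            total = sum(weight.values())
            term_prob.append({w: v / total for w, v in weight.items()})
        # E-step: posterior probability of each theme for each document,
        # computed in log space to avoid underflow.
        for i, c in enumerate(counts):
            logp = [math.log(prior[t])
                    + sum(n * math.log(term_prob[t][w]) for w, n in c.items())
                    for t in range(k)]
            peak = max(logp)
            unnorm = [math.exp(x - peak) for x in logp]
            norm = sum(unnorm)
            resp[i] = [u / norm for u in unnorm]
    # Hard assignment per document, plus the most probable terms per theme.
    assignments = [max(range(k), key=lambda t: r[t]) for r in resp]
    top_terms = [sorted(vocab, key=lambda w: -p[w])[:top_n] for p in term_prob]
    return assignments, top_terms


# Hypothetical toy corpus: three "translation" and three "signaling" abstracts.
docs = [
    "ribosome translation rrna ribosome".split(),
    "translation ribosome rrna translation".split(),
    "kinase phosphorylation signaling kinase".split(),
    "phosphorylation kinase signaling phosphorylation".split(),
    "rrna ribosome translation rrna".split(),
    "signaling kinase phosphorylation signaling".split(),
]
assignments, top_terms = em_themes(docs, k=2)
```

Here `assignments` gives one theme index per document and `top_terms` gives the highest-probability terms for each theme, mirroring the "document sets plus characterizing terms" output described above. Mapping such document themes to genes would need an additional gene-to-document index, which this sketch does not attempt.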