Many language processing tasks depend on large databases of lexical semantic information, such as WordNet. These hand-built resources may be poorly suited to a particular domain, both because domain-specific terms are missing and because the lexicon contains many words or meanings which would be extremely rare in that domain.
This talk describes statistical techniques to automatically extract semantic information about words from text. These techniques could be used in the construction of updated or domain-specific semantic resources as needed.
Given a large corpus of text and no additional sources of semantic information, we build a hierarchy of nouns appearing in the text. The hierarchy is in the form of an IS-A tree, where the nodes of the tree contain one or more nouns, and the ancestors of a node contain hypernyms of the nouns in that node. (An English word A is said to be a hypernym of a word B if native speakers of English accept the sentence “B is a (kind of) A.”)
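The IS-A tree described above can be sketched as a simple data structure. This is an illustrative sketch only, with invented names; the talk does not specify an implementation. Each node holds a set of nouns, and a node's hypernyms are gathered by walking its ancestors.

```python
class IsANode:
    """A node of an IS-A tree: holds one or more nouns; ancestor
    nodes hold hypernyms of the nouns stored here."""

    def __init__(self, nouns, parent=None):
        self.nouns = set(nouns)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def hypernyms(self):
        """All nouns in ancestor nodes, i.e. hypernyms of this node's nouns."""
        result = set()
        node = self.parent
        while node is not None:
            result |= node.nouns
            node = node.parent
        return result


# "A poodle is a (kind of) dog"; "a dog is a (kind of) animal".
root = IsANode({"animal"})
dog = IsANode({"dog", "canine"}, parent=root)
poodle = IsANode({"poodle"}, parent=dog)

print("dog" in poodle.hypernyms())     # True
print("animal" in poodle.hypernyms())  # True
```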
The talk will also include a detailed discussion of a particular subproblem: determining which of a pair of nouns is more specific. We identify numerical measures which can be easily computed from a text corpus and which can answer this question with over 80% accuracy.
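One plausible measure of the kind described, easily computed from a corpus, is raw frequency: more general nouns tend to occur more often than more specific ones. The following sketch uses that heuristic; it is an assumption for illustration, not the talk's actual measure, and the function name and toy corpus are invented.

```python
from collections import Counter

def more_specific(noun_a, noun_b, tokens):
    """Guess which of two nouns is more specific, using corpus
    frequency as a proxy: the rarer noun is taken to be more specific."""
    counts = Counter(tokens)
    return noun_a if counts[noun_a] < counts[noun_b] else noun_b

# Toy corpus: "dog" (the more general term) occurs more often than "poodle".
corpus = ("the dog ran and the dog barked while a poodle slept "
          "near another dog and a second poodle").split()
print(more_specific("poodle", "dog", corpus))  # poodle
```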