English and a small set of other languages have a wealth of available linguistic knowledge resources and annotated language data, but the great majority of the world’s languages have little or none. This seminar will describe work on leveraging the detailed and accurate morphosyntactic analyses available for English to improve analytical capabilities for a diverse set of other languages. This includes the targeted enrichment of English morphosyntactic analysis, translingual projection of that analysis to bootstrap analyses of other languages, and exploitation of that richer feature space for improved machine translation and bitext word alignment. Emphasis is on the combination of multiple sources of information, including both explicitly expressed human linguistic knowledge and patterns observed in monolingual and bilingual corpora, and on language pairs where advanced analysis capabilities are available for one language and unavailable for the other.
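As a rough illustration of the translingual projection step, the sketch below copies English tags onto target-language tokens through word-alignment links. It is a minimal sketch under stated assumptions, not the method presented in the seminar: the function name `project_tags`, the tag labels, and the tie-breaking rule are all hypothetical, and the alignment is assumed to be given as index pairs.

```python
from collections import Counter


def project_tags(english_tags, alignment, target_length):
    """Project English tags onto target tokens through word-alignment links.

    english_tags  -- list of tags, one per English token (labels are illustrative)
    alignment     -- list of (english_index, target_index) pairs
    target_length -- number of tokens in the target sentence

    Unaligned target tokens receive None. A target token linked to several
    English tokens takes the most frequent tag among them; ties go to the
    tag of the earliest linked English token.
    """
    votes = [Counter() for _ in range(target_length)]
    # Sorting makes the tie-breaking order deterministic (earliest English token first).
    for en_i, tgt_j in sorted(alignment):
        votes[tgt_j][english_tags[en_i]] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Hypothetical example: "the red houses" aligned to "las casas rojas".
projected = project_tags(["DET", "ADJ", "NOUN+PL"], [(0, 0), (1, 2), (2, 1)], 3)
```

In practice the projected tags would be noisy, which is why the abstract describes them as a starting point for bootstrapping rather than a final analysis.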
Selected contributions to science that will be described include:
- Proposal and execution of the concept of tagging English with a quasi-universal part-of-speech tagset of fine-grained morphosyntactic features, designed for effective translingual annotation transfer from English to a diverse set of world languages.
- Demonstration of the feasibility of automatically tagging English with a quasi-universal part-of-speech tagset with high accuracy, including the large percentage of quasi-universal features that are not realized via surface English morphology.
- Demonstration of the high-performance extraction of fine-grained morphosyntactic tags from several state-of-the-art parsers, the combination of which outperforms the syntactic analysis extracted from any individual parser.
- Demonstration of successful fine-grained tagset mapping between languages to enable translingual projection between non-isomorphic fine-grained tagsets.
- Demonstration of successful bootstrapping from this projection, using automatically trained system combination to integrate multiple information sources.
- Demonstration that enriching the conditioning context for machine translation with fine-grained morphosyntactic tags can provide significant gains in the accuracy of lexical choice.
- Demonstration that morphological expansion of a translation lexicon can provide significant improvements in word-alignment performance.
- Demonstration that such expansion, followed by weighting or filtering by empirically estimated correspondences between source- and target-language inflectional forms, can improve translation performance.
- Demonstration that syntactically transforming the target language into English′, a reordering of parsed English that closely parallels the source-language word order, can provide substantial improvements in word-alignment performance.
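The idea of combining tags extracted from several parsers, mentioned in the list above, can be sketched as a weighted per-token vote. This is only an illustration under stated assumptions: the function name `combine_tags`, the system names, and the fixed weights are hypothetical, whereas the work described uses an automatically trained combination whose details are not given here.

```python
from collections import defaultdict


def combine_tags(predictions, weights):
    """Weighted per-token vote over tag sequences from several systems.

    predictions -- dict mapping a system name to its list of tags
    weights     -- dict mapping a system name to a vote weight
                   (fixed here by assumption; a trained combiner would learn them)
    """
    lengths = {len(tags) for tags in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all systems must tag the same number of tokens")
    combined = []
    for position in range(lengths.pop()):
        scores = defaultdict(float)
        for system, tags in predictions.items():
            scores[tags[position]] += weights[system]
        # Iterate over sorted tag names so ties break deterministically.
        combined.append(max(sorted(scores), key=scores.get))
    return combined


# Hypothetical outputs from three parsers on a two-token sentence.
predictions = {
    "parser_a": ["N", "V"],
    "parser_b": ["N", "ADJ"],
    "parser_c": ["V", "ADJ"],
}
weights = {"parser_a": 0.5, "parser_b": 0.3, "parser_c": 0.3}
combined = combine_tags(predictions, weights)
```

Two weaker systems that agree can outvote a single stronger one, which is one way a combination can outperform every individual parser.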