Language is a sea of uncertainty. Over the past decade, computational linguistics has been learning to navigate this sea by means of probabilities. Given a newspaper sentence that has hundreds of possible parses, for example, recent systems have set their course for the most probable parse. Defining the most probable parse requires external knowledge about the relative probabilities of parse fragments: a kind of soft grammar.
But how could one LEARN such a grammar? This is a higher-level navigation problem: steering through the sea of possible soft grammars. I will present a clean Bayesian probability model that steers toward the most probable grammar. It is guided by (1) a prior belief that much of a natural-language grammar tends to be predictable from other parts of the grammar, and (2) some evidence about the phenomena of the specific language, as might be available from previous parsing attempts or small hand-built databases.
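Schematically, and using notation that is my own gloss rather than part of this abstract, the model can be read as choosing the grammar with the highest posterior probability, combining exactly these two sources of information:

\[
G^{*} \;=\; \arg\max_{G} \; P(G)\,P(\text{evidence} \mid G)
\]

where \(P(G)\) is the prior that favors grammars whose parts predict one another, and \(P(\text{evidence} \mid G)\) measures how well \(G\) accounts for the observed facts about the language. This is only a sketch of the general Bayesian form; the talk spells out the actual parameterization.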
Optimizing this model naturally discovers common linguistic transformations in the target language. This ability to generalize allows it to make more efficient use of evidence: to achieve an equally good grammar, as measured by cross-entropy, it requires only half as much training data as the best methods in the prior literature.
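For concreteness, the cross-entropy criterion mentioned above can be taken as the average negative log-probability that a learned grammar \(G\) assigns to held-out data (the precise evaluation setup is an assumption on my part, not stated in this abstract):

\[
H(G) \;=\; -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_G(x_i)
\]

where \(x_1, \ldots, x_N\) are the held-out events being predicted; a lower value means a better grammar, so "half as much training data" means reaching the same \(H(G)\) from half the evidence.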