Integration of Multiple Information Sources in Automatic Speech Recognition

Eric Fosler-Lussier, University of California, Berkeley

While we are starting to encounter commercial automatic speech recognizers in everyday life, state-of-the-art systems still have trouble with some tasks. When transcribing news reports, current systems misrecognize one word in ten; transcription of human-to-human conversations is even more error-prone, with word error rates around 30%. The foremost reason for poor performance on these tasks is the increased acoustic and linguistic variability found in less constrained conversational situations. Incorporating knowledge about speaking conditions (fast, slow, noisy, reverberant, etc.) into the statistical models of a speech recognizer can compensate in part for this increased variability. In this talk, I describe three projects in which multiple information sources were combined: to improve estimation of the speaking rate of an utterance, to increase robustness to noise and reverberation when transcribing television and radio news reports, and to improve prediction of how and when people will pronounce words in a non-standard manner. These studies suggest that integrating disparate information representations into statistical pattern recognizers is a promising direction for future research.