Building a statistical language model (LM) is a challenging task: the model must handle unseen events and assign proper nonzero probabilities to all possible word sequences on the basis of limited observations. We investigate data sharing with statistical and linguistic knowledge encoded as word classes or labels. A maximum entropy token-based language model (METLM) is proposed as a framework for incorporating word label information into language modeling. While conventional LMs model word sequences, a METLM directly predicts distributions over tokens, where a token is the coupling of a word with its labels. The probability of a word sequence is computed by marginalization, summing over all of its possible token-sequence realizations. With features that capture explicit local linguistic dependencies based on word and label n-grams, the model avoids the additional data sparseness that the more specific tokens would otherwise introduce. It also enables data sharing through coarse-grained word labels, which can be either syntactic word tags or semantic word classes. We further explore semantic data sharing by investigating automatic semantic tagging of all nouns in the corpus with a unique set of labels. We address large-scale semantic analysis on a refined set of semantic labels defined in the Longman dictionary. We employ maximum entropy classifiers and random forest classifiers as the basic tagging techniques and compare their performance. Many modeling strategies with various linguistically motivated or data-driven indicators are proposed, examined, and compared. The automatic semantic labels are further evaluated in language modeling.
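As a rough illustration of the marginalization step, the sketch below sums the probabilities of every token sequence (word plus label) that realizes a given word sequence. It is not the METLM itself: the bigram table, the label inventory, the probabilities, and the names TOKEN_BIGRAM, LABELS, and word_sequence_prob are all hypothetical, and a plain lookup table stands in for the maximum entropy model.

```python
from itertools import product

# Hypothetical toy model: P(token | previous token), where a token is a
# (word, label) pair. All numbers are invented for illustration only.
START = ("<s>", "<s>")
TOKEN_BIGRAM = {
    (START, ("time", "NOUN")): 0.6,
    (START, ("time", "VERB")): 0.1,
    (("time", "NOUN"), ("flies", "VERB")): 0.5,
    (("time", "NOUN"), ("flies", "NOUN")): 0.1,
    (("time", "VERB"), ("flies", "NOUN")): 0.4,
    (("time", "VERB"), ("flies", "VERB")): 0.05,
}

# Hypothetical label inventory: the labels each word may carry.
LABELS = {"time": ["NOUN", "VERB"], "flies": ["NOUN", "VERB"]}


def word_sequence_prob(words):
    """Marginalize over all label assignments for the given words."""
    total = 0.0
    # Enumerate every possible label sequence for the word sequence.
    for labels in product(*(LABELS[w] for w in words)):
        prev, prob = START, 1.0
        for token in zip(words, labels):
            # Token transitions missing from the table get probability 0.
            prob *= TOKEN_BIGRAM.get((prev, token), 0.0)
            prev = token
        total += prob
    return total


if __name__ == "__main__":
    print(word_sequence_prob(["time", "flies"]))
```

For the two-word example the four label assignments contribute 0.30, 0.06, 0.04, and 0.005, so the marginal is approximately 0.405. The brute-force enumeration grows exponentially with sentence length; a forward-style dynamic program over label states would be the usual way to compute the same sum efficiently.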
Speaker Biography
Jia Cui is a PhD student in the Computer Science Department and a member of the Center for Language and Speech Processing (CLSP) at the Johns Hopkins University. She received her BS degree in Computer Science from the University of Science and Technology of China, Hefei, China, and her MS degree in Artificial Intelligence from the Chinese Academy of Sciences, Beijing, China. She now works at the IBM T.J. Watson Research Center, Yorktown Heights, NY. Her research interests include automatic speech recognition and natural language processing.