code and data

I recorded Peter Brown and Bob Mercer's talks and the subsequent Q&A session at the 2013 EMNLP workshop Twenty Years of Bitext. I then had the recordings transcribed, cleaned them up, and annotated them, as a service to posterity. Peter and Bob delivered exactly the sort of talk you might have hoped for, one that was both reminiscent and humorous. It was truly a historic event.

Picture of a Spanish speech translation lattice
We collected ASR output (using Kaldi) and human translations (using Amazon's Mechanical Turk) for the Fisher Spanish and CALLHOME Spanish datasets, which together provide a four-way parallel dataset (among acoustic input, transcripts, ASR output in various forms, and English translations) for research in the translation of Spanish conversational speech. See the dataset release page for download information.

Map of northern India with Bengali, Hindi, and Urdu highlighted
Map of southern India with Malayalam, Telugu, and Tamil highlighted
We released a set of parallel corpora between English and six languages from the Indian subcontinent, which you can download here.

I wrote a jQuery stack decoder to help visualize word-based MT for MT class. You can play with the live online demo or get the code on GitHub.

You can find data (including the grammar) and code for extracting TSG feature sets on GitHub. The release includes a version of Mark Johnson's exhaustive CKY parser, modified to parse with grammars whose rules intermingle terminals and nonterminals and extended with a number of other convenient command-line options.

Picture of a parse tree with TSG annotations
The code for the experiments in our 2009 paper on inferring tree substitution grammars is available on GitHub. It is small, modular, and well-documented, and I have been told that it is easy to understand despite being written in Perl. It includes a patch to Mark Johnson's CKY parser that allows it to be used with TSGs.

Charniak and Johnson's reranking code (from their 2005 ACL paper) extracts a large set of syntactic features from parse trees. An impediment to extracting these features is that the extraction code is integrated into their reranking framework, which requires fairly specialized file formats. I modified their extract-spfeatures program so that it can extract the feature set from a single parse tree in standard bracketed format, e.g.,
$ echo "(S (NP (DT The) (NN child)) (VP (VBD demurred)))" | extract-spfeatures
It is available on GitHub.
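For readers unfamiliar with the bracketed format used in the command above, here is a minimal Python sketch that reads such a string into a nested structure. It is not part of extract-spfeatures or the reranker; the function names `parse_bracketed` and `leaves` are made up for this illustration, and the sketch assumes that every node has a label and that the input is well-formed.

```python
import re


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Hypothetical helper for illustration only. Leaves are plain strings.
    Assumes every opening parenthesis is followed by a node label.
    """
    # Tokens are "(", ")", or any run of non-space, non-parenthesis characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_bracketed("(S (NP (DT The) (NN child)) (VP (VBD demurred)))")
print(tree[0], leaves(tree))
```

Each parenthesized group is a constituent: the first token is its label, and the remaining items are either nested constituents or words. Unbalanced input is not handled gracefully in this sketch and will raise an IndexError.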