code and data

code & data

Bitext Workshop

I recorded Peter Brown and Bob Mercer’s talks and the subsequent Q&A session at the 2013 EMNLP workshop Twenty Years of Bitext. I then had them transcribed, cleaned them up, and annotated them, as a service to posterity. Peter and Bob delivered exactly the sort of talk you might have hoped for, that was both reminiscent and humorous. It was really an historic event.

Picture of PowerPoint slide entitled 'Oh yes, everything's right on schedule, Fred'

Fisher Callhome Spanish Translation Dataset

Picture of a Spanish speech translation lattice

We collected ASR output (using Kaldi) and human translations (using Amazon’s Mechanical Turk) for the Fisher Spanish and CALLHOME Spanish datasets, which together provide a four-way parallel dataset (among acoustic input, transcripts, ASR output in various forms, and English translations) for research in the translation of Spanish conversational speech. The dataset is available through the LDC.

Stack decoder visualizer

I wrote a JQuery stack decoder to help visualize word-based MT for our MT class. You can play with the live online demo or get the code from GitHub.

Syntactic feature extraction

You can find data (including the grammar) and code for extracting TSG feature sets on GitHub. This data includes a version of Mark Johnson’s exhaustive CKY parser modified to parse with grammars containing rules intermingled terminals and nonterminals and with a number of other convenient command-line options.

Bayesian tree substitution grammar learning

Picture of a parse tree with TSG annotations

The code for the experiments in our 2009 paper on inferring tree substitution grammars is available on GitHub. It is small, modular, and well-documented, and despite being written in Perl, I have been told that it is easy to understand. It includes a patch to Mark Johnson’s CKY parser that allows it to be used with TSGs.

Reranking feature extraction

Charniak and Johnson’s reranking code (from their 2005 ACL paper) extracts a large set of syntactic features from parse trees. An impediment to extracting their features is that it’s integrated into their reranking framework, requiring fairly specialized file formats. I modified their extract-spfeatures program to enable the extraction of their feature set from a single parse tree in standard bracketed format, e.g.,

$ echo "(S (NP (DT The) (NN child)) (VP (VBD demurred)))" | extract-spfeatures

It is available on GitHub.