I recorded Peter Brown and Bob Mercer’s talks and the subsequent Q&A session at the 2013 EMNLP workshop Twenty Years of Bitext. I then had them transcribed, cleaned them up, and annotated them, as a service to posterity. Peter and Bob delivered exactly the sort of talk you might have hoped for, that was both reminiscent and humorous. It was really an historic event.
Fisher Callhome Spanish Translation Dataset
We collected ASR output (using Kaldi) and human translations (using Amazon’s Mechanical Turk) for the Fisher Spanish and CALLHOME Spanish datasets, which together provide a four-way parallel dataset (among acoustic input, transcripts, ASR output in various forms, and English translations) for research in the translation of Spanish conversational speech. The dataset is available through the LDC.
Stack decoder visualizer
Syntactic feature extraction
You can find data (including the grammar) and code for extracting TSG feature sets on GitHub. This data includes a version of Mark Johnson’s exhaustive CKY parser modified to parse with grammars containing rules intermingled terminals and nonterminals and with a number of other convenient command-line options.
Bayesian tree substitution grammar learning
The code for the experiments in our 2009 paper on inferring tree substitution grammars is available on GitHub. It is small, modular, and well-documented, and despite being written in Perl, I have been told that it is easy to understand. It includes a patch to Mark Johnson’s CKY parser that allows it to be used with TSGs.
Reranking feature extraction
Charniak and Johnson’s reranking code (from their 2005 ACL paper) extracts a large set of syntactic features from parse trees.
An impediment to extracting their features is that it’s integrated into their reranking framework, requiring fairly specialized file formats.
I modified their
extract-spfeatures program to enable the extraction of their feature set from a single parse tree in standard bracketed format, e.g.,
It is available on GitHub.