Joshua: an open source decoder for parsing-based machine translation

Joshua Release 1.3 is now available for download!
For help, try the Joshua Technical Support Group

HOW-TO GUIDE: Installing and running the Joshua Decoder

by Chris Callison-Burch (Released: June 12, 2009)

Note: these instructions are several years out of date. I recommend going to http://joshua-decoder.org/ and following through the pipeline script, which will show you how to do the equivalent steps with the current version of Joshua.

This document gives instructions on how to install and use the Joshua decoder. Joshua is an open-source decoder for parsing-based machine translation. Joshua uses the synchronous context-free grammar (SCFG) formalism in its approach to statistical machine translation, and the software implements the algorithms that underlie the approach.

These instructions will tell you how to:

  1. Install the software
  2. Prepare your data
  3. Create word alignments
  4. Train a language model
  5. Extract a translation grammar
  6. Run minimum error rate training
  7. Decode a test set
  8. Recase the translations
  9. Score the translations

If you use Joshua in your work, please cite this paper:

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese and Omar Zaidan, 2009. Joshua: An Open Source Toolkit for Parsing-based Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT09). [pdf] [bib]

These instructions apply to Release 1.3 of Joshua, which is described in our WMT09 paper. You can also get the latest version of the Joshua software from the repository with the command:

svn checkout https://joshua.svn.sf.net/svnroot/joshua/trunk joshua

Step 1: Install the software

Prerequisites

The Joshua decoder is written in Java. You'll need to install a few software development tools before you install it:

  • Apache Ant - ant is a tool for compiling Java code which has similar functionality to make.
  • Swig - swig is a tool that connects programs written in C++ with Java.

Before installing these, you can check whether they're already on your system by typing which ant and which swig.

In addition to these software development tools, you will also need to download:

  • The SRI language modeling toolkit - srilm is a widely used toolkit for building n-gram language models, which are an important component in the translation process.
  • The Berkeley Aligner - this software is used to align words across sentence pairs in a bilingual parallel corpus. Word alignment takes place before extracting an SCFG.

After you have downloaded the srilm tar file, type the following commands to install it:

mkdir srilm 
mv srilm.tgz srilm/ 
cd srilm/ 
tar xfz srilm.tgz 
make 

If the build fails, please follow the instructions in SRILM's INSTALL file. For instance, if SRILM's Makefile does not detect that you're running 64-bit Linux, you might have to run "make MACHINE_TYPE=i686-m64 World".

After you successfully compile SRILM, Joshua will need to know which directory it is in. You can type pwd to get the absolute path to the srilm/ directory that you created. Once you've figured out the path, set an SRILM environment variable by typing:

export SRILM="/path/to/srilm"

where "/path/to/srilm" is replaced with your path. You'll also need to set a JAVA_HOME environment variable. On Mac OS X this is usually done by typing:

export JAVA_HOME="/Library/Java/Home"

These variables will need to be set every time you use Joshua, so it's useful to add them to your .bashrc, .bash_profile or .profile file.
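
For example, the lines you append might look like the following (the SRILM path is a placeholder for wherever you unpacked and built it; the JAVA_HOME value shown is the usual Mac OS X location):

# Environment variables needed by Joshua (add to ~/.bashrc, ~/.bash_profile or ~/.profile)
export SRILM="$HOME/srilm"                # placeholder: your actual srilm/ directory
export JAVA_HOME="/Library/Java/Home"     # Mac OS X; will differ on Linux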

Download and Install Joshua

First, download the Joshua release 1.3 tar file. Next, type the following commands to untar the file and compile the Java classes:

tar xfz joshua.tar.gz
cd joshua
ant

Running ant will compile the Java classes and link in srilm. If everything works properly, you should see the message BUILD SUCCESSFUL. If you get a BUILD FAILED message, it may be because you have not properly set the paths to SRILM and JAVA_HOME, or because srilm was not compiled properly, as described above.

For the examples in this document, you will need to set a JOSHUA environment variable:

export JOSHUA="/path/to/joshua/trunk"

Run the example model

To make sure that the decoder is installed properly, we'll translate 5 sentences using a small translation model that loads quickly. The sentences that we will translate are contained in example/example.test.in:

科学家 为 攸关 初期 失智症 的 染色体 完成 定序
( 法新社 巴黎 二日 电 ) 国际 间 的 一 群 科学家 表示 , 他们 已 为 人类 第十四 对 染色体 完成 定序 , 这 对 染色体 与 许多 疾病 有关 , 包括 三十几 岁 者 可能 罹患 的 初期 阿耳滋海默氏症 。
这 是 到 目前 为止 完成 定序 的 第四 对 染色体 , 它 由 八千七百多万 对 去氧 核糖核酸 ( dna ) 组成 。
英国 自然 科学 周刊 发表 的 这 项 研究 显示 , 第十四 对 染色体 排序 由 一千零五十 个 基因 和 基因 片段 构成 。
基因 科学家 的 目标 是 , 提供 诊断 工具 以 发现 致病 的 缺陷 基因 , 终而 提供 可 阻止 这些 基因 产生 障碍 的 疗法 。

The small translation grammar contains 15,939 rules. You can count the rules by running gunzip -c example/example.hiero.tm.gz | wc -l, or view the first few translation rules with gunzip -c example/example.hiero.tm.gz | head:

[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists to [X,2] ||| 2.17609119 0.333095818 1.53173875
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of the [X,1] scientists ||| 2.47712135 0.333095818 2.17681264
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of [X,1] scientists ||| 2.47712135 0.333095818 1.13837981
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] [X,1] scientists ||| 2.47712135 0.333095818 0.218843221
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists [X,2] ||| 1.01472330 0.333095818 0.218843221
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of scientists of [X,1] ||| 2.47712135 0.333095818 2.05791640
[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,1] for [X,2] ||| 2.47712135 0.333095818 2.05956721
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientist [X,2] ||| 1.63202321 0.303409695 0.977472364
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists , [X,2] ||| 2.47712135 0.333095818 1.68990576
[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,2] [X,1] ||| 2.47712135 0.333095818 0.218843221

The different parts of each rule are separated by the ||| delimiter. The first part of the rule is the left-hand side non-terminal. The second and third parts are the source (foreign) and target (English) sides of the rule's right-hand side, respectively. The three numbers listed after each translation rule are negative log probabilities that signify, in order:

  • prob(e|f) - the probability of the English phrase given the foreign phrase
  • lexprob(e|f) - the lexical translation probabilities of the English words given the foreign words
  • lexprob(f|e) - the lexical translation probabilities of the foreign words given the English words
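
The scores in this example grammar appear to be base-10 negative log probabilities (for instance, 2.47712135 ≈ log10 300), so you can recover a probability by negating and exponentiating. A quick check from the shell:

# 10^(-2.17609119) ≈ 1/150 ≈ 0.006667, the prob(e|f) of the first rule shown above
echo "2.17609119" | awk '{ printf "%.6f\n", 10^(-$1) }'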

You can use the grammar to translate the test set by running

java -Xmx1g -cp $JOSHUA/bin \
	-Djava.library.path=$JOSHUA/lib \
	-Dfile.encoding=utf8 joshua.decoder.JoshuaDecoder \
	example/example.config.srilm \
	example/example.test.in \
	example/example.nbest.srilm.out

For those of you who aren't very familiar with Java, the arguments are the following:

  • -Xmx1g -- this tells Java to use 1 GB of memory.
  • -cp $JOSHUA/bin -- this specifies the directory that contains the Java class files.
  • -Djava.library.path=$JOSHUA/lib -- this specifies the directory that contains the libraries that link in C++ code
  • -Dfile.encoding=utf8 -- this tells Java to use UTF-8 as the default file encoding.
  • joshua.decoder.JoshuaDecoder -- This is the class that is run. If you want to look at the source code for this class, you can find it in src/joshua/decoder/JoshuaDecoder.java
  • example/example.config.srilm -- This is the configuration file used by Joshua.
  • example/example.test.in -- This is the input file containing the sentences to translate.
  • example/example.nbest.srilm.out -- This is the output file that the n-best translations will be written to.

You can inspect the output file by typing head example/example.nbest.srilm.out

0 ||| scientists to vital early 失智症 the chromosome completed has ||| -127.759 -6.353 -11.577 -5.325 -3.909 ||| -135.267
0 ||| scientists for vital early 失智症 the chromosome completed has ||| -128.239 -6.419 -11.179 -5.390 -3.909 ||| -135.556
0 ||| scientists to related early 失智症 the chromosome completed has ||| -126.942 -6.450 -12.716 -5.764 -3.909 ||| -135.670
0 ||| scientists to vital early 失智症 the chromosomes completed has ||| -128.354 -6.353 -11.396 -5.305 -3.909 ||| -135.714
0 ||| scientists to death early 失智症 the chromosome completed has ||| -127.879 -6.575 -11.845 -5.287 -3.909 ||| -135.803
0 ||| scientists as vital early 失智症 the chromosome completed has ||| -128.537 -6.000 -11.384 -5.828 -3.909 ||| -135.820
0 ||| scientists for related early 失智症 the chromosome completed has ||| -127.422 -6.516 -12.319 -5.829 -3.909 ||| -135.959
0 ||| scientists for vital early 失智症 the chromosomes completed has ||| -128.834 -6.419 -10.998 -5.370 -3.909 ||| -136.003
0 ||| scientists to vital early 失智症 completed the chromosome has ||| -127.423 -7.364 -11.577 -5.325 -3.909 ||| -136.009
0 ||| scientists to vital early 失智症 of chromosomes completed has ||| -127.427 -7.136 -11.612 -5.816 -3.909 ||| -136.086

This file contains the n-best translations under the model. The first 10 lines shown above are the 10 best translations of the first sentence. Each line contains 4 fields: the first field is the index of the sentence (index 0 for the first sentence), the second field is the translation, the third field contains each of the individual feature function scores for the translation (language model, rule translation probability, lexical translation probability, reverse lexical translation probability, and word penalty), and the final field is the overall score.
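
If you want to check how many candidates the decoder produced for each sentence, you can count the entries per sentence index (this relies only on the first whitespace-delimited field shown above):

cut -d' ' -f1 example/example.nbest.srilm.out | uniq -c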

To get the 1-best translations for each sentence in the test set without all of the extra information, you can run the following command:

java -Xmx1g -cp $JOSHUA/bin \
	-Dfile.encoding=utf8 joshua.util.ExtractTopCand \
	example/example.nbest.srilm.out \
	example/example.nbest.srilm.out.1best

You can then look at the 1-best output file by typing cat example/example.nbest.srilm.out.1best:

scientists to vital early 失智症 the chromosome completed has
( , paris 2 ) international a group of scientists said that they completed to human to chromosome 14 has , the chromosome with many diseases , including more years , may with the early 阿耳滋海默氏症 .
this is to now completed has in the fourth chromosome , which 八千七百多万 to carry when ( dna ) .
the weekly british science the study showed that the chromosome 14 are by 一千零五十 genes and gene fragments .
the goal of gene scientists is to provide diagnostic tools to found of the flawed genes , are still provide a to stop these genes treatments .

If your translations are identical to the ones above then Joshua is installed correctly. With this small model, there are many untranslated words, and the quality of the translations is very low. In the next steps, we'll show you how to train a model for a new language pair, using a larger training corpus that will result in higher quality translations.

Step 2: Prepare your data

To create a new statistical translation model with Joshua, you will need several data sets:

  • A large sentence-aligned bilingual parallel corpus. We refer to this set as the training data, since it will be used to train the translation model. The question of how much data is necessary always arises; the short answer is that more is better. Our parallel corpora typically contain tens of millions of words, and we use as many as 250 million words. (A simple sanity check on parallel data is shown after this list.)
  • A larger monolingual corpus. We need data in the target language to train the language model. You could simply use the target side of the parallel corpus, but it is better to assemble large amounts of monolingual text, since it will help improve the fluency of your translations.
  • A small sentence-aligned bilingual corpus to use as a development set (somewhere around 1000 sentence pairs ought to be sufficient). This data should be disjoint from your training data. It will be used to optimize the parameters of your model in minimum error rate training (MERT). It may be useful to have multiple reference translations for your dev set, although this is not strictly necessary.
  • A small sentence-aligned bilingual corpus to use as a test set to evaluate the translation quality of your system and any modifications that you make to it. The test set should be disjoint from the dev and training sets. Again, it may be useful to have multiple reference translations if you are evaluating using the Bleu metric.
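
A quick sanity check that the two sides of a parallel corpus really are sentence-aligned line-for-line is to compare their line counts, which must match. For example, using the Europarl files unpacked in the tokenization step below:

# Both commands should report the same number of lines
gunzip -c es-en/full-training/europarl-v4.es-en.es.gz | wc -l
gunzip -c es-en/full-training/europarl-v4.es-en.en.gz | wc -l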

There are several sources for training data. A good source of free parallel corpora for European languages is the Europarl corpus that is distributed as part of the Workshop on Statistical Machine Translation. If you sign up to participate in the annual NIST Open Machine Translation Evaluation, you can get access to large Arabic-English and Chinese-English parallel corpora, and a small Urdu-English parallel corpus.

Once you've gathered your data, you will need to perform several preprocessing steps: sentence alignment, tokenization, normalization, and subsampling.

Sentence alignment

In this exercise, we'll start with an existing sentence-aligned parallel corpus. Download this tarball, which contains a Spanish-English parallel corpus, along with a dev and a test set: data.tar.gz

The data tarball contains two training directories: training/, which includes a subset of the corpus, and full-training/, which includes the full corpus. I strongly recommend starting with the smaller set and building an end-to-end system with it, since many steps take a very long time on the full data set. You should debug on the smaller set to avoid wasting time.

If you start with your own data set, you will need to sentence align it. We recommend Bob Moore's bilingual sentence aligner.

Tokenization

Joshua uses whitespace to delineate words. For many languages, tokenization can be as simple as separating punctuation off as its own token. For languages like Chinese, which don't put spaces around words, tokenization can be trickier.

For this example we'll use the simple tokenizer that is released as part of the WMT. It's located in the tarball under the scripts directory. To use it type the following commands:

tar xfz data.tar.gz

cd data/

gunzip -c es-en/full-training/europarl-v4.es-en.es.gz \
	| perl scripts/tokenizer.perl -l es \
	> es-en/full-training/training.es.tok

gunzip -c es-en/full-training/europarl-v4.es-en.en.gz \
	| perl scripts/tokenizer.perl -l en \
	> es-en/full-training/training.en.tok 

Normalization

After tokenization, we recommend that you normalize your data by lowercasing it. The system treats words with variant capitalization as distinct, which can lead to worse probability estimates for their translation, since the counts are fragmented. For other languages you might want to normalize the text in other ways.

You can lowercase your tokenized data with the following script:

cat es-en/full-training/training.en.tok \
	| perl scripts/lowercase.perl \
	> es-en/full-training/training.en.tok.lc 

cat es-en/full-training/training.es.tok \
	| perl scripts/lowercase.perl \
	> es-en/full-training/training.es.tok.lc

The untokenized file looks like this (gunzip -c es-en/full-training/europarl-v4.es-en.en.gz | head -3):

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.

After tokenization and lowercasing, the file looks like this (head -3 es-en/full-training/training.en.tok.lc):

resumption of the session
i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .

You must preprocess your dev and test sets in the same way you preprocess your training data. Run the following commands on the data that you downloaded:

cat es-en/dev/news-dev2009.es \
	| perl scripts/tokenizer.perl -l es \
	| perl scripts/lowercase.perl \
	> es-en/dev/news-dev2009.es.tok.lc

cat es-en/dev/news-dev2009.en \
	| perl scripts/tokenizer.perl -l en \
	| perl scripts/lowercase.perl \
	> es-en/dev/news-dev2009.en.tok.lc

cat es-en/test/newstest2009.es \
	| perl scripts/tokenizer.perl -l es \
	| perl scripts/lowercase.perl \
	> es-en/test/newstest2009.es.tok.lc

cat es-en/test/newstest2009.en \
	| perl scripts/tokenizer.perl -l en \
	| perl scripts/lowercase.perl \
	> es-en/test/newstest2009.en.tok.lc

Subsampling (optional)

Sometimes the amount of training data is so large that it makes creating word alignments extremely time-consuming and memory-intensive. We therefore provide a facility for subsampling the training corpus to select sentences that are relevant for a test set.

mkdir es-en/full-training/subsampled
echo "training" > es-en/full-training/subsampled/manifest
cat es-en/dev/news-dev2009.es.tok.lc es-en/test/newstest2009.es.tok.lc > es-en/full-training/subsampled/test-data

java -Xmx1000m -Dfile.encoding=utf8 -cp "$JOSHUA/bin:$JOSHUA/lib/commons-cli-2.0-SNAPSHOT.jar" \
	joshua.subsample.Subsampler \
	-e en.tok.lc \
	-f es.tok.lc \
	-epath  es-en/full-training/ \
	-fpath  es-en/full-training/ \
	-output es-en/full-training/subsampled/subsample \
	-ratio 1.04 \
	-test es-en/full-training/subsampled/test-data \
	-training es-en/full-training/subsampled/manifest

You can see how much the subsampling step reduces the training data by typing wc -lw es-en/full-training/training.??.tok.lc es-en/full-training/subsampled/subsample.??.tok.lc:

 1411589 39411018 training/training.en.tok.lc
 1411589 41042110 training/training.es.tok.lc
  671429 16721564 training/subsampled/subsample.en.tok.lc
  671429 17670846 training/subsampled/subsample.es.tok.lc

Step 3: Create word alignments

Before extracting a translation grammar, we first need to create word alignments for our parallel corpus. In this example, we show you how to use the Berkeley aligner. You may also use Giza++ to create the alignments, although that program is a little unwieldy to install.

To run the Berkeley aligner you first need to set up a configuration file, which defines the models that are used to align the data, how the program runs, and which files are to be aligned. Here is an example configuration file (you should create your own version of this file and save it as training/word-align.conf):

## word-align.conf
## ----------------------
## This is an example training script for the Berkeley
## word aligner.  In this configuration it uses two HMM
## alignment models trained jointly and then decoded 
## using the competitive thresholding heuristic.

##########################################
# Training: Defines the training regimen 
##########################################

forwardModels	MODEL1 HMM
reverseModels	MODEL1 HMM
mode	JOINT JOINT
iters	5 5

###############################################
# Execution: Controls output and program flow 
###############################################

execDir	alignments
create
saveParams	true
numThreads	1
msPerLine	10000
alignTraining

#################
# Language/Data 
#################

foreignSuffix	es.tok.lc
englishSuffix	en.tok.lc

# Choose the training sources, which can either be directories or files that list files/directories
trainSources	subsampled/
sentences	MAX

#################
# 1-best output 
#################

competitiveThresholding

To run the Berkeley aligner, first set an environment variable saying where the aligner's jar file is located (this environment variable is just used for convenience in this document, and is not necessary for running the aligner in general):

export BERKELEYALIGNER="/path/to/berkeleyaligner/dir"

You'll need to create an empty directory called example/test. This is because the Berkeley aligner generally expects to test against a set of manually word-aligned data:

cd es-en/full-training/
mkdir -p example/test

After you've created the word-align.config file, you can run the aligner with this command:

nohup java -d64 -Xmx10g -jar $BERKELEYALIGNER/berkeleyaligner.jar ++word-align.conf &

If the program finishes right away, then it probably terminated with an error. You can read the nohup.out file to see what went wrong. Common problems include a missing example/test directory, or a file not found exception. When you re-run the program, you will need to manually remove the alignments/ directory.
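
For example, assuming you are still in es-en/full-training/, the cleanup before re-running might be:

rm -rf alignments/    # the execDir created by the previous run
rm -f nohup.out       # optional: clear the old log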

When you are aligning tens of millions of words worth of data, the word alignment process will take several hours to complete. While it is running, you can skip ahead and complete step 4, but not step 5.

After you get comfortable using the aligner and after you've run through the whole Joshua training sequence, you can try experimenting with the amount of training data, the number of training iterations, and different alignment models (the Berkeley aligner supports Model 1, a Hidden Markov Model, and a syntactic HMM).

Step 4: Train a language model

Most translation models also make use of an n-gram language model as a way of assigning higher probability to hypothesis translations that look like fluent examples of the target language. Joshua provides support for n-gram language models, either through a built-in data structure or through external calls to the SRI language modeling toolkit (srilm). To use large language models, we recommend srilm.

If you successfully installed srilm in Step 1, then you should be able to train a language model with the following command:

mkdir -p model/lm

$SRILM/bin/macosx64/ngram-count \
	-order 3 \
	-unk \
	-kndiscount1 -kndiscount2 -kndiscount3 \
	-text training/training.en.tok.lc \
	-lm model/lm/europarl.en.trigram.lm

(Note: the above assumes that you are on a 64-bit machine running Mac OS X. If that's not the case, your path to ngram-count will be slightly different.)

This will train a trigram language model on the English side of the parallel corpus. We use the .tok.lc file because it is important to have the input to the LM training be tokenized and normalized in the same way as the input data for word alignment and translation grammar extraction.

The -order 3 tells srilm to produce a trigram language model. You can set this to a higher value, and srilm will happily output 4-gram, 5-gram or even higher order language models. Joshua supports arbitrary-order n-gram language models, but as the order increases, the amount of memory the models require grows rapidly and the amount of evidence available to estimate each probability shrinks, so there are diminishing returns to increasing n. It's common to use n-gram models up to order 5; in practice, people rarely go much beyond that.
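
For instance, a 5-gram model can be trained by raising -order and adding the corresponding discounting flags; the output file name below is just illustrative:

$SRILM/bin/macosx64/ngram-count \
	-order 5 \
	-unk \
	-kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 \
	-text training/training.en.tok.lc \
	-lm model/lm/europarl.en.5gram.lm

If you do use a higher-order LM, remember to change the order setting in the Joshua configuration file (Step 6) to match.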

The -kndiscountN options tell SRILM to use modified Kneser-Ney discounting as its smoothing scheme. Other smoothing schemes implemented in SRILM include Good-Turing and Witten-Bell.

Given that the English side of the parallel corpus is a relatively small amount of data in terms of language modeling, it only takes a few minutes to produce the LM. The uncompressed LM is about 144 megabytes (du -h europarl.en.trigram.lm).

Step 5: Extract a translation grammar

We'll use the word alignments to create a translation grammar similar to the Chinese one shown in Step 1. The translation grammar is created by finding where the foreign-language phrases from the test set occur in the training set, and then using the word alignments to determine which English phrases they correspond to.

Create a suffix array index

To find the foreign phrases in the test set, we first create an easily searchable index, called a suffix array, for the training data.

java -Xmx500m -cp $JOSHUA/bin/ \
	joshua.corpus.suffix_array.Compile \
	training/subsampled/subsample.es.tok.lc \
	training/subsampled/subsample.en.tok.lc  \
	training/subsampled/training.en.tok.lc-es.tok.lc.align \
	model

This compiles the index that Joshua will use for its rule extraction, and puts it into a directory named model.

Extract grammar rules for the dev set

The following command will extract a translation grammar from the suffix array index of your word-aligned parallel corpus, where the grammar rules apply to the foreign phrases in the dev set dev/news-dev2009.es.tok.lc:

mkdir mert

java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
        joshua.prefix_tree.ExtractRules \
        ./model \
        mert/news-dev2009.es.tok.lc.grammar.raw \
        dev/news-dev2009.es.tok.lc &  

Next, sort the grammar rules and remove the redundancies with the following Unix command:

sort -u mert/news-dev2009.es.tok.lc.grammar.raw \
	-o mert/news-dev2009.es.tok.lc.grammar
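
As a quick sanity check, you can count how many distinct rules were extracted for the dev set:

wc -l mert/news-dev2009.es.tok.lc.grammar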

You will also need to create a small "glue grammar" in a file called model/hiero.glue, containing these rules, which allow hiero-style grammars to reach the goal state:

[S] ||| [X,1] ||| [X,1] ||| 0 0 0
[S] ||| [S,1] [X,2] ||| [S,1] [X,2] ||| 0.434294482 0 0
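
One way to create this file from the shell is with a here-document (any text editor works just as well):

cat > model/hiero.glue <<'EOF'
[S] ||| [X,1] ||| [X,1] ||| 0 0 0
[S] ||| [S,1] [X,2] ||| [S,1] [X,2] ||| 0.434294482 0 0
EOF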

Step 6: Run minimum error rate training

After we've extracted the grammar for the dev set, we can run minimum error rate training (MERT). MERT is a method for setting the weights of the different feature functions in the translation model to maximize translation quality on the dev set. Translation quality is calculated according to an automatic metric, such as Bleu. Our implementation of MERT allows you to easily plug in other metrics and optimize your parameters for them. There's even a YouTube tutorial to show you how.

To run MERT you will first need to create a few files:

  • A MERT configuration file
  • A separate file with the list of the feature functions used in your model, along with their possible ranges
  • An executable file containing the command to use to run the decoder
  • A Joshua configuration file

Create a MERT configuration file. In this example we name the file mert/mert.config. Its contents are:

### MERT parameters
# target sentences file name (in this case, file name prefix)
-r	dev/news-dev2009.en.tok.lc
-rps	1			# references per sentence
-p	mert/params.txt		# parameter file
-m	BLEU 4 closest		# evaluation metric and its options
-maxIt	10			# maximum MERT iterations
-ipi	20			# number of intermediate initial points per iteration
-cmd	mert/decoder_command    # file containing commands to run decoder
-decOut	mert/news-dev2009.output.nbest     # file produced by decoder
-dcfg	mert/joshua.config      # decoder config file
-N	300                     # size of N-best list
-v	1                       # verbosity level (0-2; higher value => more verbose)
-seed   12341234                # random number generator seed

You can see a list of the other parameters available in our MERT implementation by running this command:

java -cp $JOSHUA/bin joshua.zmert.ZMERT -h 

Next, create a file called mert/params.txt that specifies what feature functions you are using in your model. In our baseline model, this file should contain the following information:

lm			|||	1.000000		Opt	0.1	+Inf	+0.5	+1.5
phrasemodel pt 0	|||	1.066893		Opt	-Inf	+Inf	-1	+1
phrasemodel pt 1	|||	0.752247		Opt	-Inf	+Inf	-1	+1
phrasemodel pt 2	|||	0.589793		Opt	-Inf	+Inf	-1	+1
wordpenalty		|||	-2.844814		Opt	-Inf	+Inf	-5	0
normalization = absval 1 lm

Next, create a file called mert/decoder_command that contains the following command:

java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
	joshua.decoder.JoshuaDecoder \
	mert/joshua.config \
	dev/news-dev2009.es.tok.lc \
	mert/news-dev2009.output.nbest 

Next, create a configuration file for Joshua at mert/joshua.config that contains the following:

lm_file=model/lm/europarl.en.trigram.lm

tm_file=mert/news-dev2009.es.tok.lc.grammar
tm_format=hiero

glue_file=model/hiero.glue
glue_format=hiero

#lm config
use_srilm=true
lm_ceiling_cost=100
use_left_equivalent_state=false
use_right_equivalent_state=false
order=3


#tm config
span_limit=10
phrase_owner=pt
mono_owner=mono
begin_mono_owner=begin_mono
default_non_terminal=X
goalSymbol=S

#pruning config
fuzz1=0.1
fuzz2=0.1
max_n_items=30
relative_threshold=10.0
max_n_rules=50
rule_relative_threshold=10.0

#nbest config
use_unique_nbest=true
use_tree_nbest=false
add_combined_cost=true
top_n=300


#remote lm server config, we should first prepare remote_symbol_tbl before starting any jobs
use_remote_lm_server=false
remote_symbol_tbl=./voc.remote.sym
num_remote_lm_servers=4
f_remote_server_list=./remote.lm.server.list
remote_lm_server_port=9000


#parallel decoder: it cannot be used together with remote lm
num_parallel_decoders=1
parallel_files_prefix=/tmp/


###### model weights
#lm order weight
lm 1.0

#phrasemodel owner column(0-indexed) weight
phrasemodel pt 0 1.4037585111897322
phrasemodel pt 1 0.38379188013385945
phrasemodel pt 2 0.47752204361625605

#arityphrasepenalty owner start_arity end_arity weight
#arityphrasepenalty pt 0 0 1.0
#arityphrasepenalty pt 1 2 -1.0

#phrasemodel mono 0 0.5

#wordpenalty weight
wordpenalty -2.721711092619053

Finally, run the command to start MERT:

nohup java -cp $JOSHUA/bin \
	joshua.zmert.ZMERT \
	-maxMem 1500 mert/mert.config &
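
Because the job was started with nohup, its console output is written to nohup.out; you can keep an eye on the MERT iterations with:

tail -f nohup.out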

While MERT is running, you can skip ahead to the first part of the next step and extract the grammar for the test set.

Step 7: Decode a test set

When MERT finishes, it will output a file mert/joshua.config.ZMERT.final that contains the new weights for the different feature functions. You can copy this config file and use it to decode the test set.

Extract grammar rules for the test set

Before decoding the test set, you'll need to extract a translation grammar for the foreign phrases in the test set test/newstest2009.es.tok.lc:

java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
        joshua.prefix_tree.ExtractRules \
        ./model \
        test/newstest2009.es.tok.lc.grammar.raw \
        test/newstest2009.es.tok.lc &  

Next, sort the grammar rules and remove the redundancies with the following Unix command:

sort -u test/newstest2009.es.tok.lc.grammar.raw \
	-o test/newstest2009.es.tok.lc.grammar

Once the grammar extraction has completed, you can edit the joshua.config file for the test set.

cp mert/joshua.config.ZMERT.final test/joshua.config
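
Alternatively, instead of copying and then editing by hand, a single sed command along these lines (a sketch; adjust the paths if your layout differs) writes the test configuration directly:

sed 's|^tm_file=.*|tm_file=test/newstest2009.es.tok.lc.grammar|' \
	mert/joshua.config.ZMERT.final > test/joshua.config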

If you edit by hand, you'll need to replace tm_file=mert/news-dev2009.es.tok.lc.grammar with tm_file=test/newstest2009.es.tok.lc.grammar. After you have done that, you can decode the test set with the following command:

java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
	joshua.decoder.JoshuaDecoder \
	test/joshua.config \
	test/newstest2009.es.tok.lc \
	test/newstest2009.output.nbest

After the decoder has finished, you can extract the 1-best translations from the n-best list using the following command:

java -cp $JOSHUA/bin -Dfile.encoding=utf8 \
	joshua.util.ExtractTopCand \
	test/newstest2009.output.nbest \
	test/newstest2009.output.1best 

Step 8: Recase and detokenize

You'll notice that your output is all lowercased and has the punctuation split off. In order to make the output more readable to human beings (remember us?), it'd be good to restore proper capitalization and spacing. These steps are called recasing and detokenization, respectively. We can do recasing using SRILM, and detokenization with a Perl script.

To build a recasing model, first train a language model on truecased English text (note that we use the tokenized, but not lowercased, side of the corpus):

$SRILM/bin/macosx64/ngram-count \
	-unk \
	-order 5 \
	-kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 \
	-text training/training.en.tok \
	-lm model/lm/training.TrueCase.5gram.lm

Next, you'll need to create a list of all of the alternative ways that each word can be capitalized. This will be stored in a map file that lists a lowercased word as the key and associates it with all of the variant capitalization of that word. Here's an example perl script to create the map:

#!/usr/bin/perl
#
# truecase-map.perl
# -----------------
# This script outputs alternate capitalizations

%map = ();
while($line = <>) {
    @words = split(/\s+/, $line);
    foreach $word (@words) {
	$key = lc($word);
	$map{$key}{$word} = 1;
    }
}

foreach $key (sort keys %map) {
    @words = keys %{$map{$key}};
    if(scalar(@words) > 1 || !($words[0] eq $key)) {
	print $key;
	foreach $word (sort @words) {
	    print " $word";
	}
	print "\n";
    }
}
Run the script over the tokenized (but not lowercased) English training data to create the map:

cat training/training.en.tok | perl truecase-map.perl > model/lm/true-case.map

Finally, recase the lowercased 1-best translations by running the SRILM disambig program, which takes the map of alternative capitalizations, builds a confusion network, and uses the truecased LM to find the best path through it:

$SRILM/bin/macosx64/disambig \
	-lm model/lm/training.TrueCase.5gram.lm \
	-keep-unk \
	-order 5 \
	-map model/lm/true-case.map \
	-text test/newstest2009.output.1best \
	| perl strip-sent-tags.perl \
	> test/newstest2009.output.1best.recased

Where strip-sent-tags.perl is:
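
#!/usr/bin/perl
#
# strip-sent-tags.perl
# --------------------
# Strips the <s> and </s> sentence tags from disambig's output.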

while($line = <>) {
    $line =~ s/^\s*<s>\s*//g;
    $line =~ s/\s*<\/s>\s*$//g;
    print $line . "\n";
}
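
Step 8 also mentioned detokenization. The WMT scripts distribution includes a detokenizer.perl companion to the tokenizer used earlier; if you have it (it may not be included in this data tarball, and you may need to adjust the relative paths for wherever you are running from), the recased output can be detokenized like this:

cat test/newstest2009.output.1best.recased \
	| perl scripts/detokenizer.perl -l en \
	> test/newstest2009.output.1best.recased.detok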

Step 9: Score the translations

The quality of machine translation is commonly measured using the BLEU metric, which automatically compares a system's output against reference human translations. You can score your output using the JoshuaEval class, Joshua's built-in scorer:

java -cp $JOSHUA/bin -Djava.library.path=lib -Xmx1000m -Xms1000m \
	-Djava.util.logging.config.file=logging.properties \
	joshua.util.JoshuaEval \
	-cand test/newstest2009.output.1best \
	-ref test/newstest2009.en.tok.lc \
	-m BLEU 4 closest