The Multitarget TED Talks Task (MTTT)

This is a collection of multitarget bitexts based on TED Talks (https://www.ted.com).
The data is extracted from WIT^3, which is also used for the IWSLT Machine Translation Evaluation Campaigns.

We use a different train/dev/test split from IWSLT: here, all the dev and test sets share the same English side and come from the same talks.
Counting English, there are 20 languages in total, so the dev and test sets are 20-way parallel.
Among other things, this supports bilingual translation for each language pair, multi-target translation into many languages, and direct comparison of translation quality across languages on identical inputs.

The dev and test sets have roughly 2000 sentences each, extracted from 30 talks, and are multi-way parallel.
The train sets for the different languages may have different English sides, ranging from 77k to 188k "sentences" (1.5M to 3.9M English tokens). These train sets are not 20-way parallel; each represents the largest bitext we can extract for that language pair.
The data is preprocessed and tokenized, using the Moses tokenizer by default and language-specific tokenizers where available (PyArabic for Arabic, KyTea for Japanese, MeCab-ko for Korean, Jieba for Chinese).
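
For concreteness, here is a minimal sketch of this per-language tokenizer dispatch, assuming the sacremoses and jieba Python packages as stand-ins for the Moses and Jieba tokenizers; the PyArabic, KyTea, and MeCab-ko hooks would be wired in the same way:

    from sacremoses import MosesTokenizer
    import jieba

    _moses = {}  # cache one Moses tokenizer per language

    def tokenize(line: str, lang: str) -> str:
        if lang == "zh":
            # Chinese word segmentation with Jieba
            return " ".join(jieba.cut(line.strip()))
        # Moses tokenization as the default for other languages
        tok = _moses.setdefault(lang, MosesTokenizer(lang=lang))
        return tok.tokenize(line.strip(), return_str=True)

    print(tokenize("Hello, world!", "en"))  # -> Hello , world !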

Additionally, metadata such as talk ids and seekvideo counters is retained, so document-level processing and speech translation experiments are possible.
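
For example, the talk ids make it easy to regroup the sentence-per-line files into documents. A hypothetical sketch (the tab-separated "talkid<TAB>seekvideo" metadata layout assumed here is an illustration, not necessarily the released format):

    from itertools import groupby

    def read_documents(text_path: str, meta_path: str) -> dict:
        # Pair each sentence with its talk id from the aligned metadata file.
        with open(text_path, encoding="utf-8") as t, \
             open(meta_path, encoding="utf-8") as m:
            pairs = [(meta.split("\t")[0], line.rstrip("\n"))
                     for meta, line in zip(m, t)]
        # Lines from the same talk are contiguous, so groupby suffices.
        return {talk: [sent for _, sent in group]
                for talk, group in groupby(pairs, key=lambda p: p[0])}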

The 19 target languages are: ar (Arabic), bg (Bulgarian), cs (Czech), de (German), fa (Farsi), fr (French), he (Hebrew), hu (Hungarian), id (Indonesian), ja (Japanese), ko (Korean), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), tr (Turkish), uk (Ukrainian), vi (Vietnamese), zh (Chinese). Note that all talks are originally spoken and transcribed in English, then translated by TED translators.
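
Since the dev/test English sides are identical across all 19 pairs, this property can be verified directly. A minimal sketch, assuming a hypothetical directory layout of one en-xx folder per pair with a dev.en file inside (adjust the names to the actual release):

    LANGS = ["ar", "bg", "cs", "de", "fa", "fr", "he", "hu", "id", "ja",
             "ko", "pl", "pt", "ro", "ru", "tr", "uk", "vi", "zh"]

    def check_multiway(root: str = ".") -> None:
        # Read the English dev side of every language pair.
        sides = {}
        for xx in LANGS:
            with open(f"{root}/en-{xx}/dev.en", encoding="utf-8") as f:
                sides[xx] = f.read()
        # All pairs should share one and the same English side.
        reference = sides[LANGS[0]]
        mismatched = [xx for xx, text in sides.items() if text != reference]
        assert not mismatched, f"English sides differ for: {mismatched}"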

Data


Terms of Use

TED makes its collection available under the Creative Commons BY-NC-ND license. Please acknowledge TED when using this data. We acknowledge the authorship of TED Talks (BY condition). We are not redistributing the transcripts for commercial purposes (NC condition) nor making derivative works of the original contents (ND condition).


Leaderboard

The goal here is to create a standard way for researchers to compare and improve their machine translation systems, in a friendly competition format. Feel free to email your BLEU results to x@cs.jhu.edu (x=kevinduh) for inclusion in the tables below (ideally, also provide a link to a paper or a comment about your system). BLEU is computed with the Moses toolkit's multi-bleu.perl script on the provided tokenization. The tables below are sorted by task, then by BLEU in descending order.
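
For reproducibility, the same score can be computed from Python; a minimal sketch, assuming a local Moses checkout at mosesdecoder/ (multi-bleu.perl reads the hypothesis on stdin and takes the tokenized reference file as an argument):

    import subprocess

    def multi_bleu(hyp_path: str, ref_path: str,
                   script: str = "mosesdecoder/scripts/generic/multi-bleu.perl") -> str:
        # Pipe the (tokenized) hypothesis into multi-bleu.perl.
        with open(hyp_path, encoding="utf-8") as hyp:
            result = subprocess.run(["perl", script, ref_path],
                                    stdin=hyp, capture_output=True, text=True)
        return result.stdout.strip()  # e.g. "BLEU = 28.28, ... (BP=..., ratio=...)"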

Translation into English (xx->en)

| Task  | Date       | System Name    | Submitter            | test1 BLEU | Comment |
|-------|------------|----------------|----------------------|------------|---------|
| ar-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 29.93      | Combining BPE subunits with character CNN for addressing source morphology |
| ar-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 28.28      | 6-layer transformer in sockeye-recipes |
| ar-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 27.50      | 2-layer mid-size RNN in sockeye-recipes |
| bg-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 35.93      | 6-layer transformer in sockeye-recipes |
| bg-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 35.84      | 2-layer mid-size RNN in sockeye-recipes |
| cs-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 26.05      | 6-layer transformer in sockeye-recipes |
| cs-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 25.80      | 2-layer mid-size RNN in sockeye-recipes |
| de-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 32.74      | Combining BPE subunits with character CNN for addressing source morphology |
| de-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 32.46      | 6-layer transformer in sockeye-recipes |
| de-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 31.34      | 2-layer mid-size RNN in sockeye-recipes |
| fa-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 22.08      | 6-layer transformer in sockeye-recipes |
| fa-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 21.21      | 2-layer mid-size RNN in sockeye-recipes |
| fr-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 35.49      | Combining BPE subunits with character CNN for addressing source morphology |
| fr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 35.09      | 6-layer transformer in sockeye-recipes |
| fr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 35.01      | 2-layer mid-size RNN in sockeye-recipes |
| he-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 35.09      | 6-layer transformer in sockeye-recipes |
| he-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 32.76      | 2-layer mid-size RNN in sockeye-recipes |
| he-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 30.81      | Combining BPE subunits with character CNN for addressing source morphology |
| hu-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 22.62      | Combining BPE subunits with character CNN for addressing source morphology |
| hu-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 21.14      | 6-layer transformer in sockeye-recipes |
| hu-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 20.64      | 2-layer mid-size RNN in sockeye-recipes |
| id-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 27.47      | 6-layer transformer in sockeye-recipes |
| id-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 26.85      | 2-layer mid-size RNN in sockeye-recipes |
| ja-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 10.90      | 6-layer transformer in sockeye-recipes |
| ja-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 10.42      | 2-layer mid-size RNN in sockeye-recipes |
| ko-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 15.23      | 6-layer transformer in sockeye-recipes |
| ko-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 14.30      | 2-layer mid-size RNN in sockeye-recipes |
| pl-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 23.56      | 6-layer transformer in sockeye-recipes |
| pl-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 21.97      | 2-layer mid-size RNN in sockeye-recipes |
| pt-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 41.80      | 6-layer transformer in sockeye-recipes |
| pt-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 41.67      | Combining BPE subunits with character CNN for addressing source morphology |
| pt-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 40.80      | 2-layer mid-size RNN in sockeye-recipes |
| ro-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 36.97      | Combining BPE subunits with character CNN for addressing source morphology |
| ro-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 34.96      | 6-layer transformer in sockeye-recipes |
| ro-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 34.56      | 2-layer mid-size RNN in sockeye-recipes |
| ru-en | 2018-11-20 | BPE+CharCNN    | Pamela Shapiro (JHU) | 24.14      | Combining BPE subunits with character CNN for addressing source morphology |
| ru-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 24.03      | 6-layer transformer in sockeye-recipes |
| ru-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 22.58      | 2-layer mid-size RNN in sockeye-recipes |
| tr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 22.40      | 6-layer transformer in sockeye-recipes |
| tr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 18.78      | 2-layer mid-size RNN in sockeye-recipes |
| uk-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 17.87      | 6-layer transformer in sockeye-recipes |
| uk-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 16.99      | 2-layer mid-size RNN in sockeye-recipes |
| vi-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 25.39      | 6-layer transformer in sockeye-recipes |
| vi-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 24.15      | 2-layer mid-size RNN in sockeye-recipes |
| zh-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU)      | 16.63      | 6-layer transformer in sockeye-recipes |
| zh-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU)      | 15.83      | 2-layer mid-size RNN in sockeye-recipes |

Translation from English (en->xx)

TODO (set up leaderboard for en->xx)

Related Resources and Reference

We kindly thank WIT^3, which provides ready-to-use versions of the TED Talks data for research purposes. For a detailed description of WIT^3, see: Mauro Cettolo, Christian Girardi, and Marcello Federico, "WIT3: Web Inventory of Transcribed and Translated Talks", Proceedings of EAMT 2012.

You may also be interested in a related dataset from Ye et al. (NAACL 2018), which packages TED Talks in even more languages. The main difference is that its dev/test sets are not multi-way parallel as they are here; they differ for each language.

If you would like to cite this task:

    @misc{duh18multitarget,
        author = {Kevin Duh},
        title = {The Multitarget TED Talks Task},
        howpublished = {\url{http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/}},
        year = {2018},
    }