This is a collection of multitarget bitexts based on TED Talks (https://www.ted.com).
The data is extracted from WIT^3, which is also used for the IWSLT Machine Translation Evaluation Campaigns.
We have a different train/dev/test split from IWSLT. Here, all the dev and test
sets have the same English side and come from the same talks.
There are 20 languages in total (English plus the 19 listed below), so the dev and test sets are 20-way parallel. This supports, for example, controlled comparison of systems across language pairs on identical content, as well as multi-source and multi-target translation experiments.
The dev and test sets have roughly 2000 sentences each, extracted from 30 talks, and are multi-way parallel.
The training sets for different languages may have different English sides, ranging from 77k to 188k "sentences" (1.5M to 3.9M English tokens). These training sets are not 20-way parallel; each represents the largest bitext we can extract for the given language pair.
The data is preprocessed and tokenized with the Moses tokenizer by default, or with language-specific tokenizers where available (PyArabic for Arabic, KyTea for Japanese, MeCab-ko for Korean, Jieba for Chinese).
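To illustrate why language-specific tokenizers are needed, here is a minimal whitespace-and-punctuation tokenizer (a drastic simplification of the Moses tokenizer, for illustration only, not the project's actual preprocessing code). It behaves reasonably for space-delimited languages but treats an unsegmented Chinese string as a single token, which is why segmenters like Jieba and KyTea are used for Chinese and Japanese:

```python
import re

def simple_tokenize(text):
    # Split off punctuation from word tokens: a rough approximation of
    # Moses-style tokenization for space-delimited languages.
    return re.findall(r"\w+|[^\w\s]", text, flags=re.UNICODE)

simple_tokenize("Hello, world!")   # ['Hello', ',', 'world', '!']
simple_tokenize("你好世界")          # ['你好世界'] -- one token; needs a segmenter
```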
Additionally, metadata about talk ids and seekvideo counters is retained, so document-level processing and speech translation experiments are possible.
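With talk ids retained, sentences can be regrouped into documents for document-level experiments. The three-field layout below (talk id, seekvideo counter, sentence) is a hypothetical format for illustration; consult the released files for the actual layout:

```python
from itertools import groupby

# Hypothetical rows: (talk_id, seekvideo, sentence). The field layout is an
# assumption for this sketch, not the released file format.
rows = [
    ("talk_1", "0001", "Thank you so much."),
    ("talk_1", "0002", "It's really an honor."),
    ("talk_2", "0001", "Let me start with a story."),
]

# groupby requires rows already sorted by talk id, as they are here.
documents = {
    talk_id: [sentence for _, _, sentence in group]
    for talk_id, group in groupby(rows, key=lambda row: row[0])
}
# documents["talk_1"] now holds both sentences from the first talk.
```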
The languages are: ar (Arabic), bg (Bulgarian), cs (Czech), de (German), fa (Farsi), fr (French), he (Hebrew), hu (Hungarian), id (Indonesian), ja (Japanese), ko (Korean), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), tr (Turkish), uk (Ukrainian), vi (Vietnamese), zh (Chinese). Note that all talks are originally spoken and transcribed in English, then translated by TED translators.
TED makes its collection available under the Creative Commons BY-NC-ND license. Please acknowledge TED when using this data. We acknowledge the authorship of TED Talks (BY condition). We are not redistributing the transcripts for commercial purposes (NC condition) nor making derivative works of the original contents (ND condition).
The goal here is to create a standard way for researchers to compare and improve their machine translation systems, in a friendly competition format. Feel free to email your BLEU results to x@cs.jhu.edu (x=kevinduh) for inclusion in the tables below (ideally with a link to a paper or a comment about your system). BLEU is computed with the Moses toolkit's multi-bleu.perl on the provided tokenization. The tables below are sorted by task, then by BLEU.
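For reference, the score multi-bleu.perl reports is corpus-level BLEU: clipped n-gram precisions up to 4-grams, combined by geometric mean and scaled by a brevity penalty. A minimal Python sketch of that computation (single reference per sentence; the real script additionally reports per-order precisions and supports multiple references):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) over pre-tokenized sentences."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # total hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_ngrams[g])
                                  for g, c in hyp_ngrams.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # some n-gram order had no matches at all
    log_precision = sum(math.log(m / t)
                        for m, t in zip(matches, totals)) / max_n
    brevity_penalty = (1.0 if hyp_len >= ref_len
                       else math.exp(1 - ref_len / hyp_len))
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100; scores in the tables below are on this 0-100 scale.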
Translation into English (xx->en)

Task | Date | System Name | Submitter | test1 BLEU | Comment
---|---|---|---|---|---
ar-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 29.93 | Combining BPE subunits with character CNN for addressing source morphology |
ar-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 28.28 | 6-layer transformer in sockeye-recipes |
ar-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 27.50 | 2-layer mid-size RNN in sockeye-recipes |
bg-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.93 | 6-layer transformer in sockeye-recipes |
bg-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 35.84 | 2-layer mid-size RNN in sockeye-recipes |
cs-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 26.05 | 6-layer transformer in sockeye-recipes |
cs-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 25.80 | 2-layer mid-size RNN in sockeye-recipes |
de-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 32.74 | Combining BPE subunits with character CNN for addressing source morphology |
de-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 32.46 | 6-layer transformer in sockeye-recipes |
de-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 31.34 | 2-layer mid-size RNN in sockeye-recipes |
fa-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 22.08 | 6-layer transformer in sockeye-recipes |
fa-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 21.21 | 2-layer mid-size RNN in sockeye-recipes |
fr-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 35.49 | Combining BPE subunits with character CNN for addressing source morphology |
fr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.09 | 6-layer transformer in sockeye-recipes |
fr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 35.01 | 2-layer mid-size RNN in sockeye-recipes |
he-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.09 | 6-layer transformer in sockeye-recipes |
he-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 32.76 | 2-layer mid-size RNN in sockeye-recipes |
he-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 30.81 | Combining BPE subunits with character CNN for addressing source morphology |
hu-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 22.62 | Combining BPE subunits with character CNN for addressing source morphology |
hu-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 21.14 | 6-layer transformer in sockeye-recipes |
hu-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 20.64 | 2-layer mid-size RNN in sockeye-recipes |
id-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 27.47 | 6-layer transformer in sockeye-recipes |
id-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 26.85 | 2-layer mid-size RNN in sockeye-recipes |
ja-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 10.90 | 6-layer transformer in sockeye-recipes |
ja-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 10.42 | 2-layer mid-size RNN in sockeye-recipes |
ko-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 15.23 | 6-layer transformer in sockeye-recipes |
ko-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 14.30 | 2-layer mid-size RNN in sockeye-recipes |
pl-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 23.56 | 6-layer transformer in sockeye-recipes |
pl-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 21.97 | 2-layer mid-size RNN in sockeye-recipes |
pt-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 41.80 | 6-layer transformer in sockeye-recipes |
pt-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 41.67 | Combining BPE subunits with character CNN for addressing source morphology |
pt-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 40.80 | 2-layer mid-size RNN in sockeye-recipes |
ro-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 36.97 | Combining BPE subunits with character CNN for addressing source morphology |
ro-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 34.96 | 6-layer transformer in sockeye-recipes |
ro-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 34.56 | 2-layer mid-size RNN in sockeye-recipes |
ru-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 24.14 | Combining BPE subunits with character CNN for addressing source morphology |
ru-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 24.03 | 6-layer transformer in sockeye-recipes |
ru-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 22.58 | 2-layer mid-size RNN in sockeye-recipes |
tr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 22.40 | 6-layer transformer in sockeye-recipes |
tr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 18.78 | 2-layer mid-size RNN in sockeye-recipes |
uk-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 17.87 | 6-layer transformer in sockeye-recipes |
uk-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 16.99 | 2-layer mid-size RNN in sockeye-recipes |
vi-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 25.39 | 6-layer transformer in sockeye-recipes |
vi-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 24.15 | 2-layer mid-size RNN in sockeye-recipes |
zh-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 16.63 | 6-layer transformer in sockeye-recipes |
zh-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 15.83 | 2-layer mid-size RNN in sockeye-recipes |
We gratefully acknowledge WIT3, which provides ready-to-use versions of the TED transcripts for research purposes.
For a detailed description of WIT3, see Cettolo et al., "WIT3: Web Inventory of Transcribed and Translated Talks" (EAMT 2012).
You may also be interested in a related dataset from Ye et al. (NAACL 2018), which packages TED Talks with even more languages. The main difference is that its dev/test sets are not multi-way parallel as here; they differ for each language.
If you would like to cite this task:
@misc{duh18multitarget,
  author = {Kevin Duh},
  title = {The Multitarget TED Talks Task},
  howpublished = {\url{http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/}},
  year = {2018},
}