Large-Scale CLIR Dataset

The Large-Scale CLIR Dataset is a retrieval dataset built for Cross-Language Information Retrieval (CLIR).
The dataset is derived from Wikipedia and contains more than 2.8 million English single-sentence queries with relevant documents from 25 other selected languages.

Terms of Use

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

If you use the corpus in your work, please cite:
Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui
Cross-lingual Learning-to-Rank with Shared Representations
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, LA, USA.
June 2018.

Data

All queries and documents in this dataset are extracted from the August 23, 2017 version of the Wikipedia dump.
For practical purposes, each document is limited to the first 200 words of the article.
Empty documents and category pages are also filtered out.
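
The truncation and empty-document filtering described above can be sketched as follows. This is only an illustration, not the script used to build the dataset: the function names are hypothetical, "words" are assumed to be whitespace-separated tokens (which may not match how words were counted for languages such as Chinese or Japanese), and category-page filtering is not covered.

```python
from typing import Dict


def truncate_document(text: str, max_words: int = 200) -> str:
    """Keep only the first `max_words` whitespace-separated tokens.

    Assumption: a "word" is a whitespace-delimited token.
    """
    return " ".join(text.split()[:max_words])


def prepare_documents(articles: Dict[str, str], max_words: int = 200) -> Dict[str, str]:
    """Truncate every article and drop those left with no text.

    `articles` maps a page id to the raw article text (hypothetical input).
    """
    prepared = {}
    for page_id, text in articles.items():
        truncated = truncate_document(text, max_words)
        if truncated:  # skip empty documents
            prepared[page_id] = truncated
    return prepared
```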

Relevance judgments are constructed from the inter-language links between English Wikipedia articles and Foreign Language Wikipedia articles.
A relevance level of 2 is assigned to the cross-lingual mate of the English article, and a level of 1 to all other articles that both link to the mate and are linked from it.
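
As a rough sketch of this labeling rule, assuming the link graph of one Foreign Language Wikipedia is available as a mapping from each page id to the set of page ids it links to. The function name and the input structure are hypothetical and only illustrate the rule; they are not the authors' implementation.

```python
from typing import Dict, Set


def relevance_labels(mate: str, outlinks: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign relevance levels for one English query.

    mate:     page id of the cross-lingual mate (receives level 2).
    outlinks: page id -> set of page ids that page links to
              (hypothetical representation of the link graph).
    """
    labels = {mate: 2}
    for other in outlinks.get(mate, set()):
        # Level 1 requires a link in both directions:
        # the mate links to `other` and `other` links back to the mate.
        if other != mate and mate in outlinks.get(other, set()):
            labels[other] = 1
    return labels
```

For example, if the mate links to pages a and b but only a links back, the result contains the mate at level 2 and a at level 1; b receives no label.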

For a more detailed description of the corpus construction process, see the above publication.

Language	#Doc	#Query	#SR
Arabic	535	324	194
Catalan	548	339	625
Chinese	951	463	462
Czech	386	233	720
Dutch	1908	687	1646
Finnish	418	273	665
French	1894	1089	4048
German	2091	938	4612
Italian	1347	808	2635
Japanese	1071	426	2912
Korean	394	224	343
Norwegian-Nynorsk	133	99	150
Norwegian-Bokmål	471	299	663
Polish	1234	693	1777
Portuguese	973	611	1130
Romanian	376	199	251
Russian	1413	664	1656
Simple English	127	114	135
Spanish	1302	781	2113
Swahili	37	22	35
Swedish	3785	639	1430
Tagalog	79	48	23
Turkish	295	185	195
Ukrainian	704	348	565
Vietnamese	1392	354	257
(All numbers are in units of one thousand)

Statistics of CLIR Datasets: For each language, the number of documents in that foreign language (#Doc) and the number of English queries (#Query) are shown. The number of "most relevant" documents is by definition equal to #Query. The number of "slightly relevant" documents is shown in the column #SR.

Format

The English queries data (wiki_en.queries file) can be found in the "English" folder.

Each of the other folders contains two data files:
1) Foreign Language documents data (.docs file)
2) relevance judgments (.qrels file)

The format of the English query file is:
EN-wiki-page-id [TAB] first sentence (with article title removed)

The format of a document file is:
[Foreign Language]-wiki-page-id [TAB] article

The format of the relevance judgments file is:
[Foreign Language]-wiki-page-id [TAB] EN-wiki-page-id [TAB] relevance-level
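
A minimal Python sketch for reading the three tab-separated formats above. The function names are hypothetical, the files are assumed to be UTF-8 encoded with one record per line, and the ids in the demo lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_id_text(lines: Iterable[str]) -> Dict[str, str]:
    """Parse a query or document file: page-id [TAB] text."""
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so any later tab stays in the text.
        page_id, text = line.split("\t", 1)
        records[page_id] = text
    return records


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse a relevance file: doc-id [TAB] EN-query-id [TAB] level.

    Returns a mapping: query id -> {document id: relevance level}.
    """
    qrels = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        doc_id, query_id, level = line.split("\t")
        qrels[query_id][doc_id] = int(level)
    return dict(qrels)


# Demo on in-memory lines (made-up ids, not taken from the dataset).
queries = parse_id_text(["12\tis a city in northern Germany .\n"])
qrels = parse_qrels(["345\t12\t2\n", "678\t12\t1\n"])
```

To read real files, pass an open file handle instead of a list, for example open(path, encoding="utf-8"), where path points to wiki_en.queries in the "English" folder or to a .docs or .qrels file in one of the language folders.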

Download

Full Raw Data in all languages (6.6GB)

Datasplit (de, ja, fr, sw, tl) used in Sasaki et al. 2018 (5.8GB)

Contact

Any questions about the dataset can be directed to Shuo Sun (ssun32@jhu.edu)

Acknowledgment

This research is based upon work supported by the Intelligence Advanced Research Projects Activity (IARPA), (contract FA8650-17-C-9115). The views and conclusions herein are those of the authors and should not be interpreted as necessarily representing official policies, expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.