Large-Scale CLIR Dataset
The Large-Scale CLIR Dataset is a retrieval dataset built for Cross-Language Information Retrieval (CLIR).
The dataset is derived from Wikipedia and contains more than 2.8 million English single-sentence queries with relevant documents from 25 other selected languages.
Terms of Use
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
If you use the corpus in your work, please cite:
Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui
Cross-lingual Learning-to-Rank with Shared Representations
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, LA, USA.
June 2018.
Data
All queries and documents in this dataset are extracted from the August 23, 2017 version of the Wikipedia dump.
For practical purposes, each document is limited to the first 200 words of the article.
Empty documents and category pages are also filtered.
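The truncation and filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names are hypothetical, "words" are assumed to be whitespace-separated tokens (which would not be adequate for languages such as Chinese or Japanese), and category pages are assumed to be recognizable by an English "Category:" title prefix.

```python
def truncate_document(text: str, max_words: int = 200) -> str:
    """Keep only the first `max_words` whitespace-separated tokens.

    Assumption: whitespace tokenization; the original pipeline may
    have segmented words differently.
    """
    return " ".join(text.split()[:max_words])


def keep_document(title: str, text: str) -> bool:
    """Return False for empty documents and category pages.

    Assumption: category pages are detected by a "Category:" title
    prefix; other Wikipedia editions use localized prefixes.
    """
    return bool(text.strip()) and not title.startswith("Category:")
```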
Relevance judgments are constructed from the inter-language links between English Wikipedia articles and foreign-language Wikipedia articles.
A relevance level of 2 is assigned to the cross-lingual mate of the English article, and a level of 1 to all other articles that both link to the mate and are linked from the mate.
For a more detailed description of the corpus construction process, see the above publication.
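As a rough sketch of the relevance rule described in this section, the function below assigns levels for a single English query, given the ID of its cross-lingual mate and the link graph of the foreign-language Wikipedia. The function name and the `outlinks` structure (a mapping from an article ID to the set of article IDs it links to) are assumptions made for illustration; the actual construction process is described in the publication cited above.

```python
from typing import Dict, Set


def relevance_judgments(mate: str, outlinks: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign relevance levels for one English query.

    `mate` is the foreign-language article connected to the English
    article by an inter-language link. `outlinks` (hypothetical input)
    maps each foreign article ID to the IDs it links to.
    """
    judgments = {mate: 2}  # the mate itself is "most relevant"
    for target in outlinks.get(mate, set()):
        # Level 1 requires links in both directions:
        # mate -> target and target -> mate.
        if target != mate and mate in outlinks.get(target, set()):
            judgments[target] = 1
    return judgments
```

An article that the mate links to, but that does not link back, receives no judgment under this rule.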
| Language | #Doc | #Query | #SR |
|---|---|---|---|
| Arabic | 535 | 324 | 194 |
| Catalan | 548 | 339 | 625 |
| Chinese | 951 | 463 | 462 |
| Czech | 386 | 233 | 720 |
| Dutch | 1908 | 687 | 1646 |
| Finnish | 418 | 273 | 665 |
| French | 1894 | 1089 | 4048 |
| German | 2091 | 938 | 4612 |
| Italian | 1347 | 808 | 2635 |
| Japanese | 1071 | 426 | 2912 |
| Korean | 394 | 224 | 343 |
| Norwegian-Nynorsk | 133 | 99 | 150 |
| Norwegian-Bokmål | 471 | 299 | 663 |
| Polish | 1234 | 693 | 1777 |
| Portuguese | 973 | 611 | 1130 |
| Romanian | 376 | 199 | 251 |
| Russian | 1413 | 664 | 1656 |
| Simple English | 127 | 114 | 135 |
| Spanish | 1302 | 781 | 2113 |
| Swahili | 37 | 22 | 35 |
| Swedish | 3785 | 639 | 1430 |
| Tagalog | 79 | 48 | 23 |
| Turkish | 295 | 185 | 195 |
| Ukrainian | 704 | 348 | 565 |
| Vietnamese | 1392 | 354 | 257 |

(All numbers are in units of one thousand.)
Statistics of the CLIR datasets: for each foreign language, the number of documents (#Doc) and the number of English queries (#Query) are shown. The number of "most relevant" documents is by definition equal to #Query. The number of "slightly relevant" documents is shown in the column #SR.
Format
The English queries data (wiki_en.queries file) can be found in the "English" folder.
Each of the other folders contains two data files:
1) Foreign Language documents data (.docs file)
2) relevance judgments (.qrels file)
The format of the English query file is:
EN-wiki-page-id [TAB] first sentence (with article title removed)
The format of a document file is:
[Foreign Language]-wiki-page-id [TAB] article
The format of the relevance judgments file is:
[Foreign Language]-wiki-page-id [TAB] EN-wiki-page-id [TAB] relevance-level
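The three tab-separated formats above could be read with something like the following sketch. The function names are hypothetical and not part of the dataset distribution. Both functions take any iterable of lines, so they work on an open file handle as well as on an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_two_column(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'page-id<TAB>text' lines (query or document files)."""
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, in case the text contains tabs.
        page_id, _, text = line.partition("\t")
        records[page_id] = text
    return records


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse 'doc-id<TAB>EN-query-id<TAB>level' lines.

    Returns {query_id: {doc_id: relevance_level}}.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines rather than failing
        doc_id, query_id, level = parts
        qrels[query_id][doc_id] = int(level)
    return dict(qrels)
```

With a real file, a call might look like `parse_two_column(open(path, encoding="utf-8"))`, where `path` points to `wiki_en.queries` or to one of the `.docs` files. Grouping judgments by query ID is a design choice made here because retrieval evaluation usually iterates over queries; the file itself lists the document ID first.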
Download
Full Raw Data in all languages (6.6GB)
Datasplit (de, ja, fr, sw, tl) used in Sasaki et al. 2018 (5.8GB)
Contact
Any questions about the dataset can be directed to Shuo Sun (ssun32@jhu.edu).
Acknowledgment
This research is based upon work supported by the Intelligence Advanced Research Projects Activity (IARPA), (contract FA8650-17-C-9115). The views and conclusions herein are those of the authors and should not be interpreted as necessarily representing official policies, expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.