CLIRMatrix

A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Citation

@inproceedings{sun2020clirmatrix,
  title={CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval},
  author={Sun, Shuo and Duh, Kevin},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={4160--4170},
  year={2020}
}

BI-139

A bilingual dataset of queries in one language matched with relevant documents in another language, covering 139 × 138 = 19,182 language pairs.

afals

query language: Afrikaans (af)
document language: Alemannic (als)

base

TRAIN (3.3M), DEV (334K), TEST1 (334K), TEST2 (333K)

full

TRAIN (1.2M), DEV (177K), TEST1 (179K), TEST2 (174K), UNALIGNED (13K)

MULTI-8

A multilingual dataset of queries and documents jointly aligned in 8 different languages.

arde

query language: Arabic (ar)
document language: German (de)

TRAIN (4.4M), DEV (459K), TEST1 (459K), TEST2 (459K)

Documents

A collection of Wikipedia documents.
Each line has the format document ID<TAB>text.
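
The document ID<TAB>text layout can be read with a few lines of Python. This is a minimal sketch, not an official loader: it assumes one document per line and already-decoded text, the function name `iter_documents` is hypothetical, and the two sample lines are made up for illustration rather than taken from the actual files.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_documents(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (document ID, text) pairs from lines shaped like ID<TAB>text.

    Assumption: one document per line. Only the first tab is treated as
    the separator, so any later tabs stay inside the text field.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        doc_id, separator, text = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: no tab separator found")
        yield doc_id, text


# Made-up sample standing in for a documents file.
sample = io.StringIO("123\tEerste dokument\n456\tTweede dokument\tmet tab\n")
docs = dict(iter_documents(sample))
print(docs["123"])
```

For a real file, the same function could be given a handle from `open(path, encoding="utf-8")`; the UTF-8 encoding is also an assumption. Splitting on the first tab only keeps the text field intact if it happens to contain tabs of its own.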

af

Language: Afrikaans (af)

Truncated Documents (23M)

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.