CLIRMatrix

A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Citation

@inproceedings{sun2020clirmatrix,
  title={CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval},
  author={Sun, Shuo and Duh, Kevin},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={4160--4170},
  year={2020}
}

BI-139

A bilingual dataset of queries in one language matched with relevant documents in another language, covering 139 × 138 = 19,182 language pairs.
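
The pair count follows from treating each pair as ordered: every one of the 139 languages can serve as the query language, with any of the remaining 138 as the document language. A minimal sketch of that arithmetic is below; the function name and the three-code sample list are illustrative and not part of the dataset tooling.

```python
from itertools import permutations


def count_ordered_pairs(n_languages: int) -> int:
    """Number of (query language, document language) pairs with distinct languages."""
    return n_languages * (n_languages - 1)


# Small illustration with three language codes taken from this page.
codes = ["af", "als", "ar"]
pairs = list(permutations(codes, 2))  # ordered pairs, no language paired with itself

print(len(pairs))                # 6, i.e. 3 * 2
print(count_ordered_pairs(139))  # 19182
```

Because the pairs are ordered, Afrikaans queries over Alemannic documents and Alemannic queries over Afrikaans documents count as two separate pairs.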

query language: Afrikaans (af)
document language: Alemannic (als)

base: TRAIN (3.3M), DEV (334K), TEST1 (334K), TEST2 (333K)
full: TRAIN (1.2M), DEV (177K), TEST1 (179K), TEST2 (174K), UNALIGNED (13K)


MULTI-8

A multilingual dataset of queries and documents jointly aligned in 8 different languages.

query language: Arabic (ar)
document language: German (de)

TRAIN (4.4M), DEV (459K), TEST1 (459K), TEST2 (459K)


Documents

A collection of Wikipedia documents.
Format is document ID<TAB>text.

Language: Afrikaans (af)


Truncated Documents (23M)
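
The document format described above (document ID<TAB>text, one document per line) can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name `read_documents` is hypothetical, the sample lines are made up for illustration, and skipping lines that contain no tab is a choice made here rather than something the dataset specifies.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_documents(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (doc_id, text) pairs from lines of the form document ID<TAB>text."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only, in case the text itself contains tabs.
        doc_id, sep, text = line.partition("\t")
        if not sep:
            continue  # assumption: a line without a tab is malformed and is skipped
        yield doc_id, text


# Made-up sample lines standing in for a real documents file.
sample = io.StringIO("123\tSuid-Afrika is 'n land.\n456\tKaapstad is 'n stad.\n")
docs = dict(read_documents(sample))
print(docs["123"])
```

For a real file, the same function can be given any open text handle, for example one returned by `open(path, encoding="utf-8")`.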


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.