A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval
Citation
@inproceedings{sun2020clirmatrix,
title={CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval},
author={Sun, Shuo and Duh, Kevin},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages={4160--4170},
year={2020}
}
BI-139
A bilingual dataset of queries in one language matched with relevant documents in another language, covering 139 x 138 = 19,182 language pairs.
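The pair count can be checked with a short sketch. It assumes each pair is ordered (query language, document language) and that the two languages differ, which is what the 139 x 138 product suggests. The language labels below are placeholders, not the dataset's actual language codes.

```python
from itertools import permutations

NUM_LANGUAGES = 139  # number of languages stated in the description

# Placeholder labels (assumption); the real dataset uses actual language codes.
languages = [f"lang{i}" for i in range(NUM_LANGUAGES)]

# Ordered (query language, document language) pairs with distinct languages.
pairs = list(permutations(languages, 2))
num_pairs = len(pairs)

print(num_pairs)
```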
MULTI-8
A multilingual dataset of queries and documents jointly aligned in 8 different languages.
Documents
A collection of Wikipedia documents.
Each line has the format document ID<TAB>text.
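A minimal sketch of reading documents in this format follows. It assumes one document per line and a single tab between the ID and the text. The function name `read_documents` and the sample lines are hypothetical and not taken from the dataset.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_documents(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (doc_id, text) pairs from lines of the form doc_id<TAB>text."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so any tabs inside the text are kept.
        doc_id, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"missing tab separator in line: {line!r}")
        yield doc_id, text


# Hypothetical sample data standing in for a real documents file.
sample = io.StringIO("123\tFirst document text.\n456\tSecond document text.\n")
docs = list(read_documents(sample))
```

For a real file, the same function could be given a handle from `open(path, encoding="utf-8")`, where the UTF-8 encoding is itself an assumption.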