OSCAR

Apr 27, 2016

Image credit: Alix Chagué

Corpus Linguistics

Pedro Javier Ortiz Suárez
PhD Student
I'm a PhD student in Computer Science at Sorbonne Université and a member of the ALMAnaCH research team at Inria.

Publications

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
ACL 2020. Available on ACL Anthology, HAL and arXiv.

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
We propose a new pipeline to filter, clean and classify Common Crawl by language, and we publish the final corpus under the name OSCAR.
Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
CMLC-7. Available on HAL.