Vietnamese dataset for similarity and relatedness
- Kim Anh Nguyen, Sabine Schulte im Walde, Ngoc Thang Vu
This resource consists of two datasets. The first, ViCon, comprises pairs of synonyms and antonyms across the noun, verb, and adjective classes, offering data to distinguish between similarity and dissimilarity. The second, ViSim-400, is a dataset of semantic relation pairs that provides degrees of similarity across five semantic relations, as rated by human judges.
Kim Anh Nguyen, Sabine Schulte im Walde and Ngoc Thang Vu. Introducing two Vietnamese Datasets for Evaluating Semantic Models of (Dis-)Similarity and Relatedness. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). New Orleans, Louisiana, June 2018.
The resources are freely available for education, research and other non-commercial purposes. For download, click here.