Summary
=======

The anvanLS (ANVAN Lexical Substitution) dataset is based on the CoInCo lexical substitution corpus (which is, in turn, based on texts from the MASC corpus). It contains 165 ANVAN (adjective-noun-verb-adjective-noun) clauses, extracted from CoInCo sentences by searching for transitive verbs and selecting the noun phrases that represent the subject and object of each verb.

Every content word in the source CoInCo corpus is annotated with a list of context-appropriate substitute words, provided by human annotators. For each content word that forms part of an ANVAN clause in the anvanLS dataset, two substitute words were chosen by selecting the two substitutes suggested by the highest number of annotators. In the case of a tie, the word with the highest number of occurrences in the working corpus (BNC+ukWaC+Wikipedia) was selected.

Additionally, for each content word, two context-inappropriate synonyms ('confounders') were selected by finding the target word's nearest neighbours in the working corpus and eliminating all candidates that were suggested as substitutes by the human annotators.

In total, the 165 ANVAN clauses yielded 732 target instances for lexical substitution. A full description of the dataset creation process can be found in the paper cited below.

License
=======

The dataset includes a sample of the CoInCo and MASC corpora, both of which are available under the CC-BY-3.0-US license. CoInCo can be found at http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/coinco.html, and MASC at http://www.anc.org/data/masc/.

The dataset is published under the same CC-BY-3.0-US license. The full text of the license can be found at https://creativecommons.org/licenses/by/3.0/us/.

Citation
========

Details can be found in:

Maja Buljan, Sebastian Padó, Jan Šnajder:
Lexical Substitution for Evaluating Compositional Distributional Models
Proceedings of NAACL
June, 2018. New Orleans, LA, USA.

Files
=====

anvanLS.txt - the anvanLS dataset, in txt format

File Structure
==============

The ANVAN clauses, with their respective substitutes and confounders, are listed in the txt file. For each target word, the file gives the original clause together with the four variants of the clause in which that target word is replaced.

Each substitution instance spans six lines, and instances are separated by a blank line.

Line 1: CoInCo XML sentence ID ('MASCsentID'), followed by 5 integers denoting the indices of the clause constituents within the original sentence.
Line 2: (0) The original ANVAN clause; tab-separated tokens, lemmatized and POS-tagged.
Lines 3-4: (1-2) The two context-appropriate substitution ('substitute') instances of the clause.
Lines 5-6: (3-4) The two context-inappropriate substitution ('confounder') instances of the clause.
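
The six-line layout described above can be read with a short script. The following is a minimal sketch, not part of the dataset distribution. It assumes that the fields on the first line of each instance are whitespace-separated and that the sentence ID comes before the five indices. The function name parse_anvan_blocks and the dictionary keys are hypothetical. Clause lines are returned as raw tab-separated fields, with no assumption about how lemma and POS tag are joined within a token.

```python
from typing import Dict, List


def parse_anvan_blocks(text: str) -> List[Dict]:
    """Split the raw file contents into six-line substitution instances."""
    instances: List[Dict] = []
    block: List[str] = []
    # The trailing empty string flushes the final block.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)
            continue
        if not block:
            continue  # skip runs of consecutive blank lines
        if len(block) != 6:
            raise ValueError(f"expected 6 lines per instance, got {len(block)}")
        # Assumption: sentence ID first, then five whitespace-separated integers.
        header = block[0].split()
        instances.append({
            "sent_id": " ".join(header[:-5]),
            "indices": [int(x) for x in header[-5:]],
            "original": block[1].split("\t"),
            "substitutes": [row.split("\t") for row in block[2:4]],
            "confounders": [row.split("\t") for row in block[4:6]],
        })
        block = []
    return instances
```

With the file in the working directory, the instances could then be loaded with parse_anvan_blocks(open("anvanLS.txt", encoding="utf-8").read()). The encoding is also an assumption.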