Summary:
========

The lexical substitution corpus CoInCo ("Concepts in Context") is based on
contiguous texts provided in MASC (the manually annotated subcorpus of the
Open American National Corpus, OANC). For every content word in selected
(complete) text files, it contains substitute words collected via
crowdsourcing (using Amazon Mechanical Turk, AMT) from 6 different
annotators per token. The dataset comprises more than 150,000 responses for
around 15,000 targets in about 2,500 sentences containing approximately
35,000 words. Targets are roughly balanced across the genres "news" and
"fiction".

Annotators saw a target sentence with two highlighted target words
(sometimes only one, when it was the last remaining content word in the
sentence), each tagged in MASC as noun, verb, adjective, or adverb. The
preceding and following sentences from the original text file were shown as
context, but less prominently. Annotators were asked to provide as many
substitutes as possible (up to 5) that would not change the meaning of the
sentence, or otherwise to mark why they did not provide one, choosing from
"proper name", "part of a fixed expression", "no replacement possible", or
"other problem (with description)". English as the annotators' mother
tongue could not be guaranteed, but the residence address constraint for
AMT participants was set to U.S.A.

We created target ID lists for a development/test set (65%/35% split of the
targets) that are nearly balanced across the two genres (news, fiction)
within each set. For each set we selected MASC text files and included all
targets in these files, so both the development and the test set contain
full substitution data from sentences in contiguous texts. Note that we
used the complete target set for the models described in the paper.


License
=======

The CoInCo download includes a sample of the MASC corpus, which is
available under the CC-BY-3.0-US license. You can find MASC at
http://www.anc.org/data/masc/. The lexical substitution annotations that we
added for CoInCo are hereby published under the same CC-BY-3.0-US license.
Find the full text of the license here:
https://creativecommons.org/licenses/by/3.0/us/


Citation
========

More details can be found in:

Gerhard Kremer, Katrin Erk, Sebastian Padó, Stefan Thater:
What Substitutes Tell Us – Analysis of an "All-Words" Lexical Substitution
Corpus. Proceedings of EACL. Gothenburg, Sweden, April 2014.

@inproceedings{kremer14:_what_subst_tell_us,
  address   = {Gothenburg, Sweden},
  author    = {Kremer, Gerhard and Erk, Katrin and Pad\'o, Sebastian and Thater, Stefan},
  booktitle = {Proceedings of EACL},
  pages     = {540--549},
  title     = {What Substitutes Tell Us -- Analysis of an 'All-Words' Lexical Substitution Corpus},
  url       = {http://www.aclweb.org/anthology/E14-1057.pdf},
  year      = 2014
}


Files:
======

coinco.xml           - "concepts in context" lexical substitution data in XML format
testset-tokenIDs.txt - target IDs for the test set, one per line
devset-tokenIDs.txt  - target IDs for the development set, one per line
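The two token-ID files are plain text with one target ID per line. As a
minimal sketch (Python, not part of the release), they could be read and
turned into ID sets like this; the function name is only illustrative:

    # Minimal sketch: read the dev/test target ID lists (one ID per line).
    def read_token_ids(path):
        with open(path, encoding="utf-8") as handle:
            return {line.strip() for line in handle if line.strip()}

    dev_ids = read_token_ids("devset-tokenIDs.txt")
    test_ids = read_token_ids("testset-tokenIDs.txt")

The resulting ID sets can then be used to filter the "token" elements of
coinco.xml by their "id" attribute (see the XML structure below).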
XML structure:
==============

We represent the annotated data in a simple XML format. The corpus is a
list of sentences with tokens as elements. For each token, we provide the
original MASC part-of-speech tag, the lemma and part-of-speech tag as
determined by TreeTagger, and the list of manually annotated substitutes
(if applicable). A binary attribute ("problematic") marks targets (when set
to "yes") for which fewer than 2 annotators provided substitutes. For each
substitute, we provide its annotation frequency and its TreeTagger lemma
and PoS-tag. To map sentences back onto the MASC corpus, we also include
the original MASC text file name and sentence IDs, as well as the context
sentences shown with the respective target sentence.

The original data was processed in UTF-8 character encoding. In the XML
file, non-ASCII characters have been transformed into XML entities.


XML element structure tree:
===========================
("+": at least one element of this type)

document
|__ sent+
    |__ precontext
    |__ targetsentence
    |__ postcontext
    |__ tokens
        |__ token+
            |__ substitutions
                |__ subst+


XML elements w/ attributes:
===========================

sent (the top category containing all data for a target sentence)
  . MASCfilename: the text file name as given in MASC
  . MASCsentenceID: the sentence ID within the text file, as given in MASC

targetsentence (the original target sentence as appearing in MASC,
                containing substitution targets and non-content words)

precontext (the sentence in MASC appearing before the target sentence)

postcontext (the sentence in MASC appearing after the target sentence)

tokens (containing each word token of the target sentence)

token
  - for non-targets:
    . id: dummy word ID ("XXX")
    . wordform: word token, taken from MASC
    . lemma: word lemma, taken from TreeTagger
    . posMASC: dummy entry ("XXX")
    . posTT: part-of-speech, taken from TreeTagger
  - for targets (w/ substitutions):
    . id: target token ID, unique across the substitution corpus
    . wordform: wordform, taken from MASC (but w/o hyphens)
    . lemma: word lemma, taken from TreeTagger
    . posMASC: part-of-speech, taken from MASC
    . posTT: part-of-speech, taken from TreeTagger
    . problematic: "yes" if fewer than two annotators entered a substitute
      (otherwise: "no")

substitutions (containing all substitution lemmata)

subst (lexical substitution for the target token)
  . lemma: lemma of the substitute, taken from TreeTagger;
    in case of multi-word substitutes, lemmata for all tokens are given
  . pos: part-of-speech of the substitute, taken from TreeTagger;
    in case of multi-word substitutes, PoS-tags for all tokens are given
  . freq: number of annotators that produced the substitute w/ this lemma


Additional information on data peculiarities:
=============================================

From the target set we heuristically removed auxiliary verbs that were
part-of-speech-tagged incorrectly in the gold standard.

In general, tokenisation is the same as in MASC (note that we verified this
for the target words only). However, to provide lemmata from the TreeTagger
output for the target sentences, we had the sentences tokenised by the
TreeTagger. One observation was that words connected with a hyphen are
tokenised in MASC as separate tokens, whereas the TreeTagger tokenises them
as single tokens. To enable a practically feasible and correct automatic
assignment of lemmata (via word position indexes) to tokens, we substituted
hyphens in the target sentences with spaces for the TreeTagger input;
therefore, hyphens are not included in tokens. Nevertheless, the XML
element "targetsentence" contains the original sentence as appearing in
MASC (i.e., with hyphens). To check for tokenisation differences,
concatenate all tokens of a sentence with spaces and compare the result
with "targetsentence" (see the sketch below). Similarly, 2.6 % of the
target sentences in MASC (65) were tokenised differently from the
TreeTagger output because of apostrophes (e.g., in names). We completely
removed these sentences from the data collection.
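As an illustration only (Python, not part of the release), the following
sketch reads coinco.xml using exactly the element and attribute names
documented above, collects the substitutes for each target, and performs
the tokenisation comparison mentioned in the previous paragraph:

    # Minimal sketch: parse coinco.xml with the documented element names.
    import xml.etree.ElementTree as ET

    root = ET.parse("coinco.xml").getroot()

    targets = {}  # target token ID -> (lemma, posTT, {substitute lemma: freq})
    for sent in root.iter("sent"):
        for token in sent.iter("token"):
            if token.get("id") == "XXX":   # dummy ID marks non-target tokens
                continue
            substitutes = {s.get("lemma"): int(s.get("freq"))
                           for s in token.iter("subst")}
            targets[token.get("id")] = (token.get("lemma"),
                                        token.get("posTT"),
                                        substitutes)

        # Tokenisation check from above: tokens contain no hyphens, so the
        # space-joined wordforms may differ from the original MASC sentence.
        joined = " ".join(t.get("wordform") for t in sent.iter("token"))
        original = (sent.findtext("targetsentence") or "").strip()
        if joined != original:
            pass  # expected wherever the MASC sentence contains hyphens

    print(len(targets), "targets read")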
Because of duplicate sentenceID/range lines in MASC, in some cases
(concerning 534 out of 2,474 sentences) the annotators were shown a copy of
the respective target sentence as "postcontext" (resulting in less
informative context than would have been possible during annotation). We
corrected this bug, but kept the "postcontext" in the resulting XML as it
was presented in the experiment. For the correct "postcontext" sentence,
please consult the MASC data.

Because of the duplicate sentence IDs (see above), some target instances
were annotated by more participants than intended, and sometimes the same
annotator saw the same target instance twice. In the open tasks (open to
everyone, used to find capable annotators), there were 10 annotators per
target instance. In the closed tasks (open only to trusted annotators
invited from the open tasks), 6 annotators processed each target instance.
490 targets were processed by more than 6 annotators.


MASC text files with lexical substitution annotations:
======================================================

Genre fiction:

lw1.txt
captured_moments.txt
Nathans_Bylichka.txt

Genre newspaper/-wire:

20000410_nyt-NEW.txt
20000415_apw_eng-NEW.txt
20000419_apw_eng-NEW.txt
20000424_nyt-NEW.txt
A1.E1-NEW.txt
A1.E2-NEW.txt
enron-thread-159550.txt
NYTnewswire1.txt
NYTnewswire2.txt
NYTnewswire3.txt
NYTnewswire4.txt
NYTnewswire5.txt
NYTnewswire6.txt
NYTnewswire7.txt
NYTnewswire8.txt
NYTnewswire9.txt
wsj_0006.txt
wsj_0026.txt
wsj_0027.txt
wsj_0032.txt
wsj_0068.txt
wsj_0073.txt
wsj_0106.txt
wsj_0120.txt
wsj_0124.txt
wsj_0127.txt
wsj_0132.txt
wsj_0135.txt
wsj_0136.txt
wsj_0144.txt
wsj_0150.txt
wsj_0151.txt
wsj_0152.txt
wsj_0157.txt
wsj_0158.txt
wsj_0159.txt
wsj_0160.txt
wsj_0161.txt
wsj_0165.txt
wsj_0167.txt
wsj_0168.txt
wsj_0169.txt
wsj_0171.txt
wsj_0172.txt
wsj_0173.txt
wsj_0175.txt
wsj_0176.txt
wsj_0184.txt
wsj_0187.txt
wsj_0189.txt
wsj_1640.mrg-NEW.txt
wsj_2465.txt
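As a final hedged sketch (Python, not part of the release), the file lists
above can be combined with the "MASCfilename" attribute to assign a genre
to each sentence, e.g. to verify the rough genre balance of the targets
mentioned in the summary:

    # Minimal sketch: count targets per genre using the file lists above
    # (the three fiction files; every other annotated file is news).
    import xml.etree.ElementTree as ET
    from collections import Counter

    FICTION = {"lw1.txt", "captured_moments.txt", "Nathans_Bylichka.txt"}

    genre_counts = Counter()
    for sent in ET.parse("coinco.xml").getroot().iter("sent"):
        genre = "fiction" if sent.get("MASCfilename") in FICTION else "news"
        genre_counts[genre] += sum(1 for t in sent.iter("token")
                                   if t.get("id") != "XXX")
    print(genre_counts)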