Europarl Nominal Compoundhood Ratings

Patrick Ziering

The Europarl Nominal Compoundhood Ratings (ENCR) is a selection of 394 sentences from the English portion of the Europarl corpus (Europarl v7, OPUS (Tiedemann, 2012)), annotated with 824 candidate compounds.

Each candidate compound token carries a rating (1, 2 or 3) for its degree of compoundhood and a rating (1, 2 or 3) for the validity of each of the six linguistic criteria described below.

The compoundhood rating:

[3] very compoundlike     (i.e., a prototypical compound)
[2] rather compoundlike   (i.e., probably a compound)
[1] mildly compoundlike   (i.e., could be considered a compound)

The six linguistic criteria:

  • Spelling:
    Does the spelling of the expression under consideration (i.e., closed or open compounding) point to compoundhood?
  • Inseparability:
    No element should intervene between a compound’s constituents. While 'black bird' can be understood as a compound, 'black ugly bird' is a phrase. Can you think of a way to insert an element between the constituents of the expression under consideration?
  • Inability to modify the modifier:
    Is there a modifying adjective/adverb or can you think of such an element in the surrounding context that modifies any modifier in the expression under consideration?
  • Inability to replace the head by the pronoun 'one':
    Can you replace the head of the expression under consideration by the pronoun 'one'?
  • Inflection of the modifier:
    Is any modifier inflected (with respect to regular word inflection) in the expression under consideration?
  • Prosody:
    While in a phrase such as 'black bird' the head (i.e., 'bird') is stressed (or both parts have equal stress), in a compound such as 'blackbird' the primary stress is commonly on the modifier (i.e., 'black'). How would you stress the expression under consideration?

File format:

Each line corresponds to one candidate compound and contains 12 tab-separated fields:

    internal ID <tab> candidate compound <tab> rating for compoundhood <tab> ratings for the six linguistic criteria (six fields, one per criterion) <tab> the underlying Europarl sentence <tab> the underlying Europarl sentence with the highlighted candidate compound <tab> the ID of the annotator (1 or 2)
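
The 12-field layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the ENCR distribution: the names EncrRecord, parse_line, read_encr and the criterion labels in CRITERIA are hypothetical, the six criterion ratings are assumed to appear in the order in which the criteria are listed on this page, and all ratings are assumed to be stored as the integers 1, 2 or 3.

```python
from dataclasses import dataclass
from typing import Dict, Iterator

# Hypothetical labels for the six criteria. The column order is an
# assumption: it follows the order in which the criteria are listed.
CRITERIA = [
    "spelling",
    "inseparability",
    "modifier_modification",
    "one_replacement",
    "modifier_inflection",
    "prosody",
]


@dataclass
class EncrRecord:
    internal_id: str
    compound: str
    compoundhood: int          # 1, 2 or 3
    criteria: Dict[str, int]   # one rating per criterion
    sentence: str
    highlighted_sentence: str
    annotator: int             # 1 or 2


def parse_line(line: str) -> EncrRecord:
    """Split one tab-separated line into its 12 fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, found {len(fields)}")
    # Fields 4 to 9 (indices 3 to 8) hold the six criterion ratings.
    ratings = [int(value) for value in fields[3:9]]
    return EncrRecord(
        internal_id=fields[0],
        compound=fields[1],
        compoundhood=int(fields[2]),
        criteria=dict(zip(CRITERIA, ratings)),
        sentence=fields[9],
        highlighted_sentence=fields[10],
        annotator=int(fields[11]),
    )


def read_encr(path: str) -> Iterator[EncrRecord]:
    """Yield one record per non-empty line of the file at `path`."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

The name of the data file inside ENCR.tar.gz is not stated here, so read_encr takes the path as an argument. Since each line carries the ID of a single annotator, records can be grouped by the annotator field when the two annotators need to be compared.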

Download: ENCR.tar.gz

Keywords: noun compound, compound noun, multi-word expression, database, list, resource, dataset, rating, ratings, criterion, criteria, linguistic criterion, linguistic criteria


Jörg Tiedemann
Parallel data, tools and interfaces in OPUS.
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), 2012.

