SciCorp

Corpus of full-text English scientific papers on genetics and computational linguistics

Type: Corpus
Author: Ina Rösiger
Description

SciCorp is a corpus of full-text English scientific papers from two disciplines, genetics and computational linguistics. The corpus is annotated with coreference and bridging information as well as information status labels.

The corpus has been reliably annotated by independent human coders with moderate inter-annotator agreement (average kappa = 0.71). In total, we have annotated 14 full papers containing 61,045 tokens and marked about 8,700 definite noun phrases.

The corpus is available for download in two formats: an offset-based format and, for the coreference annotations, the widely used tabular CoNLL-2012 format.
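As a rough illustration of how the CoNLL-2012 coreference annotations might be read, the following Python sketch collects mention spans per coreference chain. It relies only on the general CoNLL-2012 conventions: one token per line, comment lines starting with `#`, and the coreference information in the last column, written as `(id`, `id)`, `(id)` or `-`, with several entries separated by `|`. The function name `read_coref_chains` and the abbreviated sample lines are hypothetical and not taken from SciCorp; the actual files contain more columns and may differ in detail.

```python
from collections import defaultdict


def read_coref_chains(lines):
    """Collect mention spans per chain ID from CoNLL-2012-style lines.

    Returns {chain_id: [(start, end), ...]} with inclusive token indices.
    Assumption: tokens are counted from 0 across the whole input, so the
    indices are running positions, not per-sentence word numbers.
    """
    chains = defaultdict(list)
    open_starts = defaultdict(list)  # chain_id -> stack of open mention starts
    token_index = 0
    for line in lines:
        line = line.strip()
        # Skip sentence separators and "#begin document" / "#end document".
        if not line or line.startswith("#"):
            continue
        coref = line.split()[-1]  # coreference is the last column
        if coref != "-":
            for part in coref.split("|"):
                chain_id = int(part.strip("()"))
                if part.startswith("("):
                    open_starts[chain_id].append(token_index)
                if part.endswith(")"):
                    start = open_starts[chain_id].pop()
                    chains[chain_id].append((start, token_index))
        token_index += 1
    return dict(chains)


# Hypothetical, shortened sample (real files have additional columns).
sample = """#begin document (demo); part 000
demo 0 0 The DT (0
demo 0 1 corpus NN 0)
demo 0 2 contains VBZ -
demo 0 3 it PRP (0)
demo 0 4 . . -

#end document
""".splitlines()

print(read_coref_chains(sample))
```

For the sample above, chain 0 contains two mentions: the span covering tokens 0 to 1 ("The corpus") and the single-token span at position 3 ("it"). The bridging and information status annotations are only described as available in the offset-based format, so they are not covered by this sketch.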

Reference

Ina Rösiger (2016)
SciCorp: A Corpus of English Scientific Articles Annotated for Information-Structural Analysis
Proceedings of LREC. Portorož, Slovenia 2016.
