Corpus of full-text English scientific papers from genetics and computational linguistics


Ina Rösiger

SciCorp is a corpus of full-text English scientific papers from two disciplines, genetics and computational linguistics. The corpus is annotated with coreference and bridging information as well as information status labels.

The corpus has been reliably annotated by independent human coders with moderate inter-annotator agreement (average kappa = 0.71). In total, we have annotated 14 full papers containing 61,045 tokens and marked about 8,700 definite noun phrases.

The corpus is available for download in two formats: an offset-based format and, for the coreference annotations, the widely used tabular CoNLL-2012 format.
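
For readers who want to work with the CoNLL-2012 files, the following is a minimal sketch of how coreference chains can be read from that format. It relies only on the general CoNLL-2012 conventions: one token per line, blank lines between sentences, comment lines starting with "#", and the coreference annotation in the last column, written as "(3" to open a mention of chain 3, "3)" to close it, "(3)" for a single-token mention, "|" between several entries, and "-" for none. The function name read_coref_chains, the word_col parameter (assumed to be the fourth column) and the toy sentences are illustrative assumptions and are not taken from the SciCorp distribution.

```python
from collections import defaultdict


def read_coref_chains(lines, word_col=3):
    """Collect mention strings per chain ID from CoNLL-2012-style lines.

    Assumes the word form is in column `word_col` (0-based) and the
    coreference annotation is in the last column.
    Returns a dict {chain_id: [mention_text, ...]}.
    """
    chains = defaultdict(list)
    open_mentions = defaultdict(list)  # chain_id -> stack of start indices
    tokens = []

    for line in lines:
        line = line.strip()
        if not line:
            # Blank line = sentence boundary; start a fresh token buffer.
            tokens = []
            open_mentions.clear()
            continue
        if line.startswith("#"):
            # "#begin document ..." / "#end document" marker lines.
            continue

        cols = line.split()
        tokens.append(cols[word_col])
        idx = len(tokens) - 1

        coref = cols[-1]
        if coref == "-":
            continue
        for part in coref.split("|"):
            chain_id = int(part.strip("()"))
            if part.startswith("("):
                open_mentions[chain_id].append(idx)
            if part.endswith(")"):
                start = open_mentions[chain_id].pop()
                chains[chain_id].append(" ".join(tokens[start:idx + 1]))
    return dict(chains)


# Toy input invented for illustration (not SciCorp data).
sample = """\
#begin document (toy); part 000
toy 0 0 The DT - - - - - - (0
toy 0 1 corpus NN - - - - - - 0)
toy 0 2 is VBZ - - - - - - -
toy 0 3 annotated VBN - - - - - - -
toy 0 4 . . - - - - - - -

toy 0 0 It PRP - - - - - - (0)
toy 0 1 contains VBZ - - - - - - -
toy 0 2 papers NNS - - - - - - (1)
toy 0 3 . . - - - - - - -
#end document
""".splitlines()

for chain_id, mentions in read_coref_chains(sample).items():
    print(chain_id, mentions)
```

If the files place the word form in a different column, only word_col needs to change; the bracket handling depends solely on the last column.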


Ina Rösiger (2016)
SciCorp: A Corpus of English Scientific Articles Annotated for Information-Structural Analysis
Proceedings of LREC. Portorož, Slovenia 2016.


