Foundational Course
Departament de Traducció i Filologia
Universitat Pompeu Fabra
April 16-20, 2007
Introduction to Corpus Resources, Annotation and Access
Sabine Schulte im Walde
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
Course Description
This course presents an introduction to corpus resources, combining
the theoretical background of corpora, resource examples, annotation
levels, and tools for exploitation.
First, we motivate corpus resources for empirical linguistics, and
describe the properties and problems of corpus data, the levels of
annotation, and standardisation efforts.
We then relate the annotation levels to appropriate tools and uses for
exploitation:
- Tokenisation, tagging and lemmatisation are introduced; we present
CQP for querying corpora with linear patterns (e.g. for collocations),
and Unix tools for shallow statistical analyses, e.g. the type-token
distinction, sorting, bigrams.
- Treebanks are introduced, with cross-linguistic examples; we
describe typical complexities (such as PP attachment), and present
TIGERSearch as a query tool.
- SensEval is introduced as a framework for defining and utilising
semantically annotated corpus data; we demonstrate the exploitation of
word senses.
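
The shallow statistics named in the first item above (type-token
distinction, sorting, bigrams) can be illustrated with a minimal
sketch. The course itself uses Unix tools for this; Python is swapped
in here purely for illustration. The function names, the sample
sentence, and the crude letters-only tokeniser are all assumptions and
not part of the course material.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and return runs of ASCII letters as tokens.

    This is a deliberately crude tokeniser (an assumption for the
    example); real tokenisation has to handle punctuation, clitics,
    abbreviations, and so on.
    """
    return re.findall(r"[a-z]+", text.lower())


def type_token_counts(tokens):
    """Return (number of distinct types, number of tokens)."""
    return len(set(tokens)), len(tokens)


def bigram_frequencies(tokens):
    """Count adjacent token pairs (bigrams)."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat and the cat slept"
tokens = tokenise(sample)

types, total = type_token_counts(tokens)
print(f"{types} types, {total} tokens")

# Sort bigrams by descending frequency, most frequent first.
for (w1, w2), freq in bigram_frequencies(tokens).most_common(3):
    print(freq, w1, w2)
```

In the sample, "the" occurs three times but counts as a single type,
which is the type-token distinction in miniature.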
Schedule (April 16-20)
- Monday, 6-9 pm:
  - Introduction
  - Tokenisation
  - Exercise: Basic Unix Tools and Corpus Frequencies
- Tuesday, 7-9 pm:
  - Morpho-Syntactic Annotation
  - Word Distributions
  - Exercise: TreeTagger and Corpus Query Processor
- Wednesday: no class
- Thursday, 6-9 pm:
  - Syntactic Annotation
  - Evaluation
  - Exercise: Searching Treebanks with TIGERSearch
- Friday, 4-8 pm:
  - Semantic Annotation
  - More Levels of Corpus Annotation
  - Exercise: Semantic Annotation with SALTO
Course Material
Acknowledgements
Most of the course material was adopted from an earlier version of this
course at ESSLLI 2006. Thanks to my colleague Heike Zinsmeister, who
prepared the introduction and the lectures on syntactic annotation and
the web as corpus!