Foundational Course
Departament de Traducció i Filologia
Universitat Pompeu Fabra
April 16-20, 2007
Introduction to Corpus Resources, Annotation and Access
References
Corpora and Annotation
Tokenisation
- Gregory Grefenstette and Pasi Tapanainen (1994): What is a word, what is a sentence? Problems of tokenization. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research, pp. 79-87. Budapest, Hungary.
- Andrei Mikheev (2002): Periods, Capitalized Words, etc. Computational Linguistics, 28(3):289-318.
- Andrei Mikheev (2003): Text segmentation. In: Ruslan Mitkov, editor: The Oxford Handbook of Computational Linguistics, pp. 376-394. Oxford University Press.
- Helmut Schmid (2007?): Tokenizing. In: Anke Lüdeling and Merja Kytö, editors: Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin.
- Tokeniser:
Type/Token Frequency Distributions
Part-of-Speech Tagging
Morphological Annotation
Word Distributions
- Zellig Harris (1954): Distributional Structure. Reprinted in: Jerold J. Katz, editor (1985): The Philosophy of Linguistics, pp. 26-47. Oxford University Press.
- Kenneth W. Church and Patrick Hanks (1990): Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29.
- Ted Dunning (1993): Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.
- Stefan Evert (2004): The statistics of word cooccurrences: word pairs and collocations. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
- Oliver Christ, Bruno M. Schulze, Anja Hofmann, Esther König (1999): The IMS corpus workbench: Corpus Query Processor. Technical report, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
- Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell (2004): The Sketch Engine. In Proceedings of the 11th EURALEX International Congress. Lorient, France.
- Collocations and multiword expressions online:
Syntactic Annotation
- Steven Abney (1991): Parsing by chunks. In: Robert Berwick, Steven Abney and Carol Tenny, editors: Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.
- Geoffrey Leech and Elizabeth Eyes (1997): Syntactic Annotation: Treebanks. In: Richard Garside, Geoffrey Leech and Anthony McEnery, editors: Corpus Annotation. London, New York: Longman, pp. 34-52.
- John Carroll, Guido Minnen, and Ted Briscoe (1999): Corpus annotation for parser evaluation. In Proceedings of Linguistically Interpreted Corpora.
- Anne Abeille, editor (2003): Treebanks: Building and Using Parsed Corpora. Dordrecht, Boston, London: Kluwer Academic Publishers.
- Catherine Lai and Steven Bird (2004): Querying and updating treebanks: A critical survey and requirements analysis. In Proceedings of the Australasian Language Technology Workshop.
- Joakim Nivre (2007?): Treebanks. In: Anke Lüdeling and Merja Kytö, editors: Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin.
- Penn Treebank:
- TIGER Treebank:
Semantic Annotation
Word Senses:
WordNet:
Prague Dependency Treebank:
- Eva Hajicová, Jarmila Panevová, and Petr Sgall (2000): A manual for tectogrammatical tagging of the Prague Dependency Treebank. UFAL/CKL Technical Report TR-2000-09, Charles University, Prague.
- Alena Böhmová, Jan Hajic, Eva Hajicová, and Barbora Hladká (2003): The Prague Dependency Treebank: A three-level annotation scenario. In: Anne Abeille, editor: Treebanks: Building and Using Parsed Corpora. Kluwer Academic Publishers.
- Petr Sgall, Jarmila Panevová, and Eva Hajicová (2004): Deep syntactic annotation: Tectogrammatical representation and beyond. In Proceedings of the HLT-NAACL Workshop on "Frontiers in Corpus Annotation". Boston, MA.
- Jan Hajic and Zdenka Uresová (2005): The Prague Dependency Treebank and Valency Annotation. Tutorial at RANLP, Borovets.
- PDT online
FrameNet:
* general and English:
- Collin F. Baker, Charles J. Fillmore, and John B. Lowe (1998): The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pp. 86-90.
- Thierry Fontenelle, editor (2003): FrameNet and frame semantics. Special issue of the International Journal of Lexicography, 16(3).
* German:
* Spanish:
- Carlos Subirats and Hiroaki Sato (2004): Spanish FrameNet and FrameSQL. In Proceedings of the LREC Workshop on "Building Lexical Resources from Semantically Annotated Corpora".
* Japanese:
- Kyoko Hirose Ohara, Seiko Fujii, Toshio Ohori, Ryoko Suzuki, Hiroaki Saito, and Shun Ishizaki (2004): The Japanese FrameNet project: An introduction. In Proceedings of the LREC Workshop on "Building Lexical Resources from Semantically Annotated Corpora".
* FrameNet online:
PropBank:
OntoBank / OntoNotes:
- Invited talk by Eduard Hovy at LREC 2006: Corpus creation by annotation. Genoa, Italy.
- Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel (2006): OntoNotes: The 90% Solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. New York City, NY.
- OntoBank online: link not yet available, check Eduard Hovy's website
Word Sense Disambiguation and Role Labeling:
Evaluation
More Levels of Corpus Annotation
The Prague Treebank:
- Eva Hajicová (1999): The Prague Dependency Treebank: Crossing the sentence boundary. In Proceedings of the 2nd Workshop on Text, Speech, Dialogue, pp. 20-27. Mariánske Lázne, Czech Republic.
- Eva Hajicová, Jarmila Panevová, and Petr Sgall (2000): Coreference in annotating a large corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pp. 497-500. Athens, Greece.
- Oana Postolache, Ivana Kruijff-Korbayová, and Geert-Jan Kruijff (2005): Data-driven approaches for information structure identification. In Proceedings of the joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 9-16. Vancouver, Canada.
- PDT online
Rhetorical Structure Theory and the RST Discourse Treebank:
The Penn Discourse TreeBank:
- Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber (2004): The Penn Discourse TreeBank. In Proceedings of the 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal.
- Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber (2004): Annotating discourse connectives and their arguments. In Proceedings of the HLT/NAACL Workshop on "Frontiers in Corpus Annotation". Boston, MA.
- Eleni Miltsakaki, Nikhil Dinesh, Rashmi Prasad, Aravind Joshi, and Bonnie Webber (2005): Experiments on sense annotations and sense disambiguation of discourse connectives. In Proceedings of the 4th Workshop on "Treebanks and Linguistic Theories". Barcelona, Spain.
- Bonnie Webber, Aravind Joshi, Eleni Miltsakaki, Rashmi Prasad, Nikhil Dinesh, Alan Lee, and Kate Forbes (2005): A short introduction to the Penn Discourse TreeBank. In Copenhagen Working Papers in Language and Speech Processing.
- Bonnie Webber, Matthew Stone, Aravind Joshi, and Alistair Knott (2003): Anaphora and discourse structure. Computational Linguistics, 29(4):545-587.
- The PDTB Research Group (2006): The Penn Discourse TreeBank 1.0. Annotation Manual. IRCS Technical Report IRCS-06-01, Institute for Research in Cognitive Science, University of Pennsylvania.
- PDTB online
Anaphora and Coreference:
- Ruslan Mitkov, Richard Evans, Constantin Orasan, Catalina Barbu, Lisa Jones, and Violeta Sotirova (2000): Coreference and anaphora: Developing annotating tools, annotated resources and annotation strategies. In Proceedings of the Discourse, Anaphora and Reference Resolution Conference, pp. 49-58. Lancaster, UK.
- Eva Hajicová, Jarmila Panevová, and Petr Sgall (2000): Coreference in annotating a large corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pp. 497-500. Athens, Greece.
- Erhard Hinrichs, Sandra Kübler, Karin Naumann, Heike Telljohann, Julia Trushkina, and Heike Zinsmeister (2005): Recent developments in linguistic annotations of the TüBa-D/Z Treebank. Poster at the 27th Annual Meeting of the German Linguistic Society (Deutsche Gesellschaft für Sprachwissenschaft). Köln, Germany.
Kiel Corpus of Read Speech:
MATE:
NITE:
Web as Corpus