TreeTagger - a language independent part-of-speech tagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Chinese, Swahili, Latin, Estonian, and Old French texts, and it can be adapted to other languages if a lexicon and a manually tagged training corpus are available.
Sample output:
word | pos | lemma |
---|---|---|
The | DT | the |
TreeTagger | NP | TreeTagger |
is | VBZ | be |
easy | JJ | easy |
to | TO | to |
use | VB | use |
. | SENT | . |
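
The three columns shown in the sample (word, tag, lemma) lend themselves to simple post-processing. The following is a minimal sketch, assuming the tagger prints one token per line with the three fields separated by tabs; the function names `lemmas` and `tag_counts` are hypothetical helpers and not part of the TreeTagger distribution.

```shell
# Print only the lemma column (third tab-separated field) of tagger output.
# Assumption: input lines have the form word<TAB>tag<TAB>lemma.
lemmas() {
    awk -F '\t' 'NF >= 3 { print $3 }'
}

# Count how often each part-of-speech tag (second field) occurs,
# most frequent tag first.
tag_counts() {
    awk -F '\t' 'NF >= 2 { n[$2]++ } END { for (t in n) print n[t] "\t" t }' |
        sort -k1,1nr -k2,2
}

# Example with literal input, so it runs without the tagger installed:
printf 'The\tDT\tthe\nTreeTagger\tNP\tTreeTagger\nis\tVBZ\tbe\n' | lemmas
```

With an installed tagger, the input would come from one of the tagging scripts instead, e.g. `echo 'Hello world!' | cmd/tree-tagger-english | lemmas`.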
The TreeTagger can also be used as a chunker for English, German, and French. The parameter file for the French chunker was kindly provided by Michel Généreux.
The tagger is described in the following two papers:
- Helmut Schmid (1995): Improvements in Part-of-Speech Tagging with an Application to German. Proceedings of the ACL SIGDAT-Workshop. Dublin, Ireland.
- Helmut Schmid (1994): Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.
- Download the tagger package for your system (PC-Linux, PC-Linux (64 Bit), Mac OS-X (Intel-CPU), Mac OS-X (PowerPC), Sparc-Solaris, PC-Linux (version for older kernels)).
- Download the tagging scripts into the same directory.
- Download the installation script install-tagger.sh.
- Download the parameter files for your system (PC, Mac-Intel, Mac-Power-PC, Sparc-Solaris).
- Open a terminal window and run the installation script (sh install-tagger.sh) in the directory where you have downloaded the files.
- Run a test, e.g. echo 'Hello world!' | cmd/tree-tagger-english
- Bulgarian parameter file (gzip compressed, UTF-8, tagset)
- Dutch parameter file (gzip compressed, Latin1, tagset)
- Julien Nioche's Dutch parameter file (gzip compressed, Latin1, trained on the Eindhoven corpus)
- English parameter file (gzip compressed, Latin1, tagset)
- French parameter file (Latin1) (gzip compressed, information about this file, tagset documentation)
- French parameter file (UTF-8) (gzip compressed, tagset documentation)
- German parameter file (gzip compressed, Latin1, tagset documentation)
- German parameter file (UTF-8) (gzip compressed, UTF-8, tagset documentation)
- Italian parameter file (gzip compressed, Latin1, information about this file, tagset documentation)
- Italian parameter file (UTF-8) (gzip compressed, tagset documentation)
- Marco Baroni's Italian parameter file (gzip compressed, Latin1, tagset documentation)
- Spanish parameter file (gzip compressed, Latin1, tagset documentation)
- Spanish parameter file (UTF-8) (gzip compressed, UTF-8, tagset documentation)
- Estonian parameter file (gzip compressed, UTF-8, tagset documentation)
- Swahili parameter file (gzip compressed, Latin1) The Swahili parameter file was trained on the Helsinki Corpus of Swahili (HCS) and uses a simplified version of the HCS tagset. The HCS was created by Prof. Arvi Hurskainen by means of his Swahili Language Manager (SALAMA) which uses Lingsoft's TWOL compiler for constructing morphological analysers and Connexor's CG2 parser for syntactic disambiguation.
- Latin parameter file (gzip compressed, Latin1) The corpus and lexicon for training the Latin parameter file were compiled by Gabriele Brandolini from various resources.
- A Russian parameter file created by Serge Sharoff is available here
- A Chinese parameter file created by Serge Sharoff is available here
- Portuguese and Galician parameter files created by Pablo Gamallo are available here
- A parameter file for spoken French texts can be found here
- Mongolian parameter file (gzip compressed, ???) created from a small Mongolian corpus by Khuder Altangerel.
- English chunker parameter file (gzip compressed, Latin1, tagset info). Note: The English tagger parameter file is needed as well.
- French chunker parameter file (gzip compressed, Latin1). Note: The French tagger parameter file is needed as well.
- French chunker parameter file (UTF-8) (gzip compressed). Note: The UTF-8 version of the French tagger parameter file is needed as well.
- German chunker parameter file (gzip compressed, Latin1, tagset info). Note: The German tagger parameter file is needed as well.
- German chunker parameter file (UTF-8) (gzip compressed, UTF-8, tagset info). Note: The UTF-8 version of the German tagger parameter file is needed as well.
- Bulgarian parameter file (gzip compressed, UTF-8, tagset)
- Dutch parameter file (gzip compressed, Latin1, tagset)
- Julien Nioche's Dutch parameter file (gzip compressed, Latin1, trained on the Eindhoven corpus)
- English parameter file (gzip compressed, Latin1, tagset)
- French parameter file (gzip compressed, Latin1, tagset documentation)
- German parameter file (gzip compressed, Latin1, tagset documentation)
- Italian parameter file (gzip compressed, Latin1, tagset documentation)
- Marco Baroni's Italian parameter file (gzip compressed, Latin1, tagset documentation)
- Spanish parameter file (gzip compressed, Latin1, tagset documentation)
- English chunker parameter file (gzip compressed, Latin1, tagset info). Note: The English tagger parameter file is needed as well.
- German chunker parameter file (gzip compressed, Latin1, tagset info). Note: The German tagger parameter file is needed as well.
- English (Penn-Treebank tagset)
- German (in German)
- French
- Italian
- Marco Baroni's Italian tagset
- Spanish
- Bulgarian
- Russian
- Graphical Interface for the Windows version of the TreeTagger (developed by Ciarán Ó Duibhín)
- Serge Sharoff's web page where you can download a tokenizer and a parameter file for Chinese.
- Pablo Gamallo's web page where you can download parameter files for Portuguese and Galician.
- Achim Stein's web page on French and old French POS tagging with the TreeTagger
- Python Wrapper for the TreeTagger (developed by Laurent Pointal)
- Java Wrapper for the TreeTagger (developed by Richard Eckart de Castilho)
- Perl module for calling the TreeTagger and manipulating its output (developed by Aris Xanthos)
- R wrapper for the TreeTagger (developed by Meik Michalke)
- Ruby wrapper for the TreeTagger (developed by Andrei Beliankou)
- Italian Online Tagger at the University of Odense
- Giuseppe Attardi's online interface to the TreeTagger.
- Italian Online Tagger at the University for Foreigners Perugia
- Text Analysis Software developed by LinguLab
- Wikimeta
Download
Executable code for Linux and Windows PCs, Macs, and Sparc workstations, as well as parameter files for English, German, Italian, Dutch, Spanish, Bulgarian, Russian, French, and Old French can be downloaded via the links below. Many thanks to Marco Baroni, Pablo Gamallo, Julien Nioche, Serge Sharoff, Michel Généreux, and Achim Stein for making their parameter files publicly available! Also thanks to Holger Wunsch for compiling the TreeTagger on MacOS!
The French and the Italian parameter files are provided by Achim Stein.
The second Italian parameter file was provided by Marco Baroni.
The English parameter file was trained on the PENN treebank and uses the English morphological database created by Karp, Schabes, Zaidel and Egedi.
The Spanish parameter file was trained on the Spanish CRATER corpus and uses the Spanish lexicon of the CALLHOME corpus of the LDC.
The Bulgarian parameter file was trained by Julien Nioche on the Bulgarian Treebank. It uses UTF-8 encoding and the BulTreeBank tagset.
Michel Généreux created the parameter file for the French chunker.
The Estonian parameter file was trained on the Tartu Morphologically disambiguated corpus. Thanks to Mark Fishel for pointing me to this data!
This software is freely available for research, education and evaluation.
Please read the license terms before you download the software! By downloading the software, you agree to the terms stated there.
The following steps are necessary to install the TreeTagger (see below for the Windows version). Download the files by right-clicking on the link. Then select "save file as".
sh install-tagger.sh
echo 'Hello world!' | cmd/tree-tagger-english
or
echo 'Das ist ein Test.' | cmd/tagger-chunker-german
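
The installation script expects to be run in the directory that holds all downloaded files. The following is a minimal sketch of a pre-flight check for that condition. It assumes the downloaded archives are gzip files whose names end in `.gz` (an assumption; the exact file names are not given here), and `check_downloads` is a hypothetical helper, not part of the TreeTagger distribution.

```shell
# Check that a download directory looks ready for installation.
# Assumption: the downloaded archives are gzip files named *.gz.
check_downloads() {
    dir=${1:-.}
    if [ ! -f "$dir/install-tagger.sh" ]; then
        echo "missing: $dir/install-tagger.sh" >&2
        return 1
    fi
    # Succeed as soon as one regular *.gz file is found in the directory.
    for f in "$dir"/*.gz; do
        if [ -f "$f" ]; then
            return 0
        fi
    done
    echo "no .gz archives found in $dir" >&2
    return 1
}
```

It could be combined with the installation command, e.g. `check_downloads . && sh install-tagger.sh`.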
Parameter files for PC (Linux, Windows, and Mac-Intel)
A Windows version of the TreeTagger is available here. Unpack the zip file and follow the instructions in the INSTALL.txt file. The parameter files have to be downloaded separately. The tagger has to be invoked from a (Windows, cygwin, msys) shell. Therefore, you might want to install the graphical interface kindly provided by Ciarán Ó Duibhín.
Tagsets
Here is some information about the tagsets used in the parameter files:
Links
The TreeTagger is a component of the following software products (and of many others too):
Please send questions, comments, suggestions and bug reports to Helmut Schmid at FirstName.LastName@ims.uni-stuttgart.de.