TreeTagger - a language independent part-of-speech tagger
The TreeTagger is a tool for annotating text with part-of-speech
and lemma information. It was developed by Helmut Schmid in the TC
project at the Institute for Computational Linguistics of the
University of Stuttgart. The TreeTagger has been successfully used to
tag German, English, French, Italian, Dutch, Spanish, Bulgarian,
Russian, Greek, Portuguese, Chinese and Old French texts and is
adaptable to other languages if a lexicon and a manually tagged
training corpus are available.
Sample output:
| word | pos | lemma |
| The | DT | the |
| TreeTagger | NP | TreeTagger |
| is | VBZ | be |
| easy | JJ | easy |
| to | TO | to |
| use | VB | use |
| . | SENT | . |
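The sample output above can be read into a program with a few lines of code. The following is a minimal sketch, assuming the tagger prints one token per line with the three columns separated by tab characters; the `Token` type and the `parse_tagger_output` function are hypothetical names introduced for illustration and are not part of the TreeTagger distribution.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One tagged token: surface form, part-of-speech tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagger_output(text: str) -> List[Token]:
    """Parse lines of the assumed form 'word<TAB>pos<TAB>lemma'.

    Lines that do not have exactly three tab-separated fields
    (blank lines, markup passed through unchanged) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue
        tokens.append(Token(*fields))
    return tokens


# The sample sentence from the table above, written as tab-separated lines.
sample = (
    "The\tDT\tthe\n"
    "TreeTagger\tNP\tTreeTagger\n"
    "is\tVBZ\tbe\n"
    "easy\tJJ\teasy\n"
    "to\tTO\tto\n"
    "use\tVB\tuse\n"
    ".\tSENT\t.\n"
)
tokens = parse_tagger_output(sample)
print([token.lemma for token in tokens])
```

Skipping lines with the wrong number of fields keeps the sketch tolerant of anything that is not a tagged token; a stricter reader could raise an error instead.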
The TreeTagger can also be used as a chunker for English, German, and
French. The parameter file for the French chunker was kindly provided
by Michel Généreux.
The tagger is described in the following two papers:
- "Probabilistic Part-of-Speech Tagging Using Decision Trees" (pdf)
- "Improvements in Part-of-Speech Tagging with an Application to German" (pdf)
Download
Executable code for Sparc workstations, Linux and Windows PCs and Macs
as well as parameter files for English, German, Italian, Dutch,
Spanish, Bulgarian, Russian, French and Old French can be downloaded
via the links below. Many thanks to Marco Baroni, Pablo Gamallo,
Julien Nioche, Serge Sharoff, Michel Généreux, and Achim
Stein for making their parameter files publicly available! Also thanks
to Holger Wunsch for compiling the TreeTagger on MacOS!
The French and the Italian parameter files are provided by Achim
Stein.
The second Italian parameter file was provided by Marco Baroni.
The English parameter file was trained on the Penn Treebank and uses
the English morphological database created by Karp, Schabes, Zaidel
and Egedi.
The Spanish parameter file was trained on the Spanish CRATER corpus
and uses the Spanish lexicon of the CALLHOME corpus of the LDC.
The Bulgarian parameter file was created by Julien Nioche on the
Bulgarian Treebank. It uses UTF-8 encoding and the BulTreeBank tagset.
Michel Généreux created the parameter file for the French chunker.
This software is freely available for research, education and
evaluation.
Please read the license terms before you download the software! By
downloading the software, you agree to the terms stated there.
The following steps are necessary to install the TreeTagger (see
below for the Windows version). Download the files by right-clicking
on the link. Then select "save file as".
- Download the tagger package for your system (PC-Linux,
  Sparc-Solaris, Mac OS-X (PowerPC), Mac OS-X (Intel-CPU)).
- Download the tagging scripts into the same directory.
- Download the installation script install-tagger.sh.
- Download the parameter files for your system (PC,
  Sparc-Solaris, Mac-Power-PC, Mac-Intel).
- Open a terminal window and run the installation script in the
  directory where you have downloaded the files:
  sh install-tagger.sh
- Run a test, e.g.
  echo 'Hello world!' | cmd/tree-tagger-english
  or
  echo 'Das ist ein Test.' | cmd/tagger-chunker-german
Make sure that the files are not automatically unzipped, i.e. that the
file ending .gz is still present. If you have difficulties with the
installation, have a look at the installation hints (kindly provided
by Joachim Wagner).
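Whether a download was silently unpacked can be checked before running the installation script. The sketch below is one possible check, not part of the TreeTagger distribution: it tests both the .gz file ending and the two magic bytes that start every gzip stream. The function name `is_still_gzipped` and the file name `example-par.gz` are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_still_gzipped(path: str) -> bool:
    """Return True if the file ends in .gz and really contains gzip data."""
    if not path.endswith(".gz"):
        return False
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demonstration with throwaway files (hypothetical names, dummy content).
with tempfile.TemporaryDirectory() as tmp:
    intact = os.path.join(tmp, "example-par.gz")
    with gzip.open(intact, "wb") as handle:
        handle.write(b"dummy parameter data")

    unpacked = os.path.join(tmp, "example-par")
    with open(unpacked, "wb") as handle:
        handle.write(b"dummy parameter data")

    results = {
        "intact": is_still_gzipped(intact),
        "unpacked": is_still_gzipped(unpacked),
    }

print(results)
```

Checking the magic bytes in addition to the file ending catches the case where a browser decompressed the content but kept the original file name.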
Parameter files for PC (Linux, Windows, and Mac-Intel)
- Bulgarian parameter file (gzip compressed, UTF-8)
- Dutch parameter file (gzip compressed, Latin1)
- Julien Nioche's Dutch parameter file (gzip compressed, Latin1,
  trained on the Eindhoven corpus)
- English parameter file (gzip compressed, Latin1)
- French parameter file (Latin1) (gzip compressed, information
  about this file)
- French parameter file (UTF-8) (gzip compressed)
- German parameter file (gzip compressed, Latin1)
- Greek parameter file (gzip compressed, ISO 8859-7)
- Italian parameter file (gzip compressed, Latin1, information
  about this file)
- Italian parameter file (UTF-8) (gzip compressed)
- Marco Baroni's Italian parameter file (gzip compressed, Latin1)
- Spanish parameter file (gzip compressed, Latin1)
- A Russian parameter file created by Serge Sharoff is available here
- A Chinese parameter file created by Serge Sharoff is available here
- Portuguese and Galician parameter files created by Pablo Gamallo
  are available here
Chunker parameter files for PC (Linux, Windows, and Mac-Intel)
Parameter files for Sparc-Solaris and Mac-PowerPC
- Bulgarian parameter file (gzip compressed, UTF-8)
- Dutch parameter file (gzip compressed, Latin1)
- Julien Nioche's Dutch parameter file (gzip compressed, Latin1,
  trained on the Eindhoven corpus)
- English parameter file (gzip compressed, Latin1)
- French parameter file (gzip compressed, Latin1)
- German parameter file (gzip compressed, Latin1)
- Italian parameter file (gzip compressed, Latin1)
- Marco Baroni's Italian parameter file (gzip compressed, Latin1)
- Spanish parameter file (gzip compressed, Latin1)
Chunker parameter files for Sparc-Solaris and Mac-PowerPC
A Windows version of the TreeTagger is also available. The parameter
files have to be downloaded separately. This version has to be invoked
from a (Windows, cygwin, msys) shell. Therefore, you might want to
install the graphical interface kindly provided by Ciarán Ó Duibhín.
Tagsets
Here is some information about the tagsets used in the parameter files:
- English (Penn-Treebank tagset)
  The tagset used by the TreeTagger is a refinement of this tagset:
  The second letter of the verb part-of-speech tags is used to
  distinguish between forms of the verb "to be" (B), the verb "to have"
  (H), and all the other verbs (V). So, "VHD" is the POS tag for the
  past tense form of the verb "to have", i.e. for the word "had".
- German (in German)
- French (in French)
- Italian
- Marco Baroni's Italian tagset
- Spanish
- Bulgarian
- Russian
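The refinement of the English verb tags described above can be undone when plain Penn-Treebank tags are needed. The following sketch assumes that the part of a verb tag after its second letter is the unchanged Penn-Treebank suffix, which is how the description above reads; the function names `to_penn_tag` and `verb_class` are hypothetical and are not provided by the TreeTagger.

```python
from typing import Optional

# Second letter of a refined verb tag -> the verb class it marks.
VERB_CLASSES = {"B": "be", "H": "have", "V": "other"}


def verb_class(tag: str) -> Optional[str]:
    """Return 'be', 'have' or 'other' for a verb tag, None for any other tag."""
    if len(tag) >= 2 and tag[0] == "V":
        return VERB_CLASSES.get(tag[1])
    return None


def to_penn_tag(tag: str) -> str:
    """Map a refined verb tag (VB*, VH*, VV*) to the plain Penn-Treebank tag.

    The second letter is replaced by 'B'; non-verb tags are returned unchanged.
    """
    if verb_class(tag) is not None:
        return "VB" + tag[2:]
    return tag


for refined in ["VHD", "VVZ", "VBZ", "NN"]:
    print(refined, "->", to_penn_tag(refined), verb_class(refined))
```

Keeping the class lookup separate from the mapping lets a caller collapse the tags for comparison with Penn-Treebank data while still using the extra be/have distinction elsewhere.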
Links
Please send comments, suggestions and bug reports to Helmut Schmid at FirstName.LastName@ims.uni-stuttgart.de.