TreeTagger - a language independent part-of-speech tagger
The TreeTagger is a tool for annotating text with part-of-speech
and lemma information. It was developed by Helmut Schmid in
the TC
project at the Institute for Computational Linguistics of the
University of Stuttgart. The TreeTagger has been successfully used to
tag German, English, French, Italian, Dutch, Spanish, Bulgarian,
Russian, Greek, Portuguese, Chinese, Swahili, Latin, Estonian and Old French
texts and is adaptable to other languages if a lexicon and a manually
tagged training corpus are available.
Sample output:
| word | pos | lemma |
| The | DT | the |
| TreeTagger | NP | TreeTagger |
| is | VBZ | be |
| easy | JJ | easy |
| to | TO | to |
| use | VB | use |
| . | SENT | . |
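The sample output above can be post-processed with standard text tools. The following is a minimal sketch, assuming the tagger prints one token per line with word, tag and lemma separated by tab characters (an assumption about the output layout, not something stated here). The helper name extract_lemmas is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the third (lemma) column of tab-separated
# word/pos/lemma lines read from standard input.
# Assumption: columns are separated by tab characters.
extract_lemmas() {
    awk -F '\t' 'NF >= 3 { print $3 }'
}

# Simulated tagger output, using rows from the sample table.
printf 'The\tDT\tthe\nTreeTagger\tNP\tTreeTagger\nis\tVBZ\tbe\n' | extract_lemmas
```

Lines with fewer than three columns are skipped, so stray blank lines in the input do not produce empty output lines.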
The TreeTagger can also be used as a chunker for English, German, and
French. The parameter file for the French chunker was kindly provided
by Michel Généreux.
The tagger is described in the following two papers:
Download
Executable code for Sparc workstations, Linux and Windows PCs and Macs
as well as parameter files for English, German, Italian, Dutch,
Spanish, Bulgarian, Russian, French and Old French can be downloaded
via the links below. Many thanks to Marco Baroni, Pablo Gamallo,
Julien Nioche, Serge Sharoff, Michel Généreux, and Achim
Stein for making their parameter files publicly available! Also thanks
to Holger Wunsch for compiling the TreeTagger on MacOS!
The French and the Italian parameter files are provided by Achim
Stein.
The second Italian parameter file was provided by Marco Baroni.
The English parameter file was trained on
the PENN
treebank and uses the English morphological database created by Karp,
Schabes, Zaidel and Egedi.
The Spanish parameter file was trained on
the Spanish CRATER corpus and uses the Spanish lexicon
of the CALLHOME corpus of
the LDC.
The Bulgarian parameter file was created
by Julien Nioche on
the Bulgarian
Treebank. It uses UTF-8 encoding and
the BulTreeBank tagset.
Michel Généreux created the
parameter file for the French chunker.
The Estonian parameter file was trained on
the Tartu Morphologically disambiguated corpus. Thanks
to Mark Fishel for pointing me to this data!
This software is freely available for research, education and
evaluation.
Please read the license terms before you download the software! By
downloading the software, you agree to the terms stated there.
The following steps are necessary to install the TreeTagger (see
below for the Windows version). Download the files by right-clicking
on the link. Then select "save file as".
-
Download the tagger package for your system (PC-Linux,
Sparc-Solaris,
Mac
OS-X (PowerPC),
Mac
OS-X (Intel-CPU),
PC-Linux (version for older kernels)).
-
Download the tagging
scripts into the same directory.
-
Download the installation script install-tagger.sh.
-
Download the parameter files for your system (PC,
Sparc-Solaris, Mac-Power-PC, Mac-Intel).
-
Open a terminal window and run the installation script in the
directory where you have downloaded the files:
sh install-tagger.sh
-
Run a test, e.g.
echo 'Hello world!' | cmd/tree-tagger-english
or
echo 'Das ist ein Test.' | cmd/tagger-chunker-german
Make sure that the files are not automatically unzipped, i.e. that the
file ending .gz is still present. If you have difficulties with the
installation, have a look at the installation hints (kindly provided by
Joachim Wagner).
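The advice about the .gz ending can be checked before running the installation script. The following is a minimal sketch, not part of the TreeTagger distribution: the helper name check_gz_files is hypothetical, and it assumes the standard gzip tool is installed. It tests every file ending in .gz in a directory and reports files that are no longer valid gzip archives, for example because a browser unpacked them but kept the name.

```shell
#!/bin/sh
# Hypothetical helper: report whether every *.gz file in a directory is
# still a valid gzip archive. Returns non-zero if any file fails the
# check or if no .gz file is found at all.
check_gz_files() {
    dir=$1
    found=0
    status=0
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue      # the glob matched nothing
        found=1
        if gzip -t "$f" 2>/dev/null; then
            echo "ok: $f"
        else
            echo "not a valid gzip file: $f"
            status=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no .gz files found in $dir"
        status=1
    fi
    return "$status"
}

# Usage: check the download directory, then install only if all is well.
# check_gz_files . && sh install-tagger.sh
```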
Parameter files for PC (Linux, Windows, and Mac-Intel)
-
Bulgarian
parameter file (gzip compressed, UTF-8, tagset)
-
Dutch
parameter file (gzip compressed, Latin1, tagset)
-
Julien Nioche's Dutch
parameter file (gzip compressed, Latin1, trained on the
Eindhoven corpus)
-
English
parameter file (gzip compressed, Latin1, tagset)
-
French
parameter file (Latin1) (gzip compressed, information
about this file, tagset documentation)
-
French
parameter file (UTF-8) (gzip compressed, tagset documentation)
-
German
parameter file (gzip compressed, Latin1, tagset documentation)
-
German
parameter file (UTF-8) (gzip compressed, UTF-8, tagset documentation)
-
Italian
parameter file (gzip compressed, Latin1, information
about this file, tagset documentation)
-
Italian
parameter file (UTF-8) (gzip compressed, tagset documentation)
-
Marco Baroni's Italian
parameter file (gzip compressed, Latin1, tagset documentation)
-
Spanish
parameter file (gzip compressed, Latin1, tagset documentation)
-
Spanish
parameter file (UTF-8) (gzip compressed, UTF-8, tagset documentation)
-
Estonian
parameter file (gzip compressed, UTF-8, tagset documentation)
-
Swahili
parameter file (gzip compressed, Latin1)
The Swahili parameter file was trained on the Helsinki Corpus of Swahili (HCS) and uses a simplified version of the HCS tagset. The HCS was created by Prof. Arvi Hurskainen by means of his Swahili Language Manager (SALAMA) which uses Lingsoft's TWOL compiler for constructing morphological analysers and Connexor's CG2 parser for syntactic disambiguation.
-
Latin
parameter file (gzip compressed, Latin1)
The corpus and lexicon for training the Latin parameter file were
compiled by Gabriele Brandolini from various resources.
-
A Russian parameter file created by Serge Sharoff is available here
-
A Chinese parameter file created by Serge Sharoff is available here
-
Portuguese and Galician parameter files created by Pablo Gamallo
are available here
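Several of the parameter files listed above are marked Latin1 and others UTF-8. The following is a minimal sketch for converting UTF-8 text to Latin1 before tagging, on the assumption that the input text must use the same encoding as the parameter file. The helper name utf8_to_latin1 is hypothetical, and the standard iconv tool is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper: convert UTF-8 text on standard input to Latin1
# (ISO-8859-1) so that it matches a Latin1 parameter file.
# iconv exits with an error if a character has no Latin1 equivalent.
utf8_to_latin1() {
    iconv -f UTF-8 -t ISO-8859-1
}

# Usage (the script name is the one used in the installation test;
# input.txt is a placeholder file name):
# utf8_to_latin1 < input.txt | cmd/tree-tagger-english
```

For the parameter files marked UTF-8, no conversion should be needed if the input text is already UTF-8.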
Chunker parameter files for PC (Linux, Windows, and Mac-Intel)
-
English
chunker parameter file (gzip compressed, Latin1, tagset info)
Note: The English tagger parameter file is needed as well.
-
French
chunker parameter file (gzip compressed, Latin1)
Note: The French tagger parameter file is needed as well.
-
French
chunker parameter file (UTF-8) (gzip compressed)
Note: The UTF-8 version of the French tagger parameter file is needed as well.
-
German
chunker parameter file (gzip compressed, Latin1, tagset info)
Note: The German tagger parameter file is needed as well.
-
German
chunker parameter file (UTF-8) (gzip compressed, UTF-8, tagset info)
Note: The UTF-8 version of the German tagger parameter file is needed as well.
Parameter files for Sparc-Solaris and Mac-PowerPC
-
Bulgarian
parameter file (gzip compressed, UTF-8, tagset)
-
Dutch
parameter file (gzip compressed, Latin1, tagset)
-
Julien Nioche's Dutch
parameter file (gzip compressed, Latin1, trained on the
Eindhoven corpus)
-
English
parameter file (gzip compressed, Latin1, tagset)
-
French
parameter file (gzip compressed, Latin1, tagset documentation)
-
German
parameter file (gzip compressed, Latin1, tagset documentation)
-
Italian
parameter file (gzip compressed, Latin1, tagset documentation)
-
Marco
Baroni's Italian
parameter file (gzip compressed, Latin1, tagset documentation)
-
Spanish
parameter file (gzip compressed, Latin1, tagset documentation)
Chunker parameter files for Sparc-Solaris and Mac-PowerPC
Windows version
A Windows version of the TreeTagger is available here. Unpack the zip
file and follow the instructions in the INSTALL.txt file. The parameter
files have to be downloaded separately. The tagger has to be invoked
from a shell (Windows, Cygwin or MSYS). If you prefer not to work on the
command line, you might want to install the graphical interface kindly
provided by Ciarán Ó Duibhín.
Tagsets
Here is some information about the tagsets used in the parameter files:
Links
The TreeTagger is a component of the following software products (and a number of others):
In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!
Please send questions, comments, suggestions and bug reports to Helmut
Schmid at FirstName.LastName@ims.uni-stuttgart.de.