TreeTagger - a language independent part-of-speech tagger
The TreeTagger is a tool for annotating text with part-of-speech
and lemma information. It was developed by Helmut Schmid in the TC
project at the Institute for Computational Linguistics of the
University of Stuttgart. The TreeTagger has been successfully used to
tag German, English, French, Italian, Dutch, Spanish, Bulgarian,
Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Slovenian,
Latin, Estonian, Polish and Old French texts, and it can be adapted to
other languages if a lexicon and a manually tagged training corpus are
available.
Sample output:
word       | pos  | lemma
The        | DT   | the
TreeTagger | NP   | TreeTagger
is         | VBZ  | be
easy       | JJ   | easy
to         | TO   | to
use        | VB   | use
.          | SENT | .
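The three columns of the sample output (word, part-of-speech tag, lemma) are easy to post-process. The following Python sketch assumes that the tagger prints one token per line with the columns separated by tab characters; the separator is an assumption that is not stated on this page, and the names Token and parse_tagger_output are made up for illustration.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagger_output(output: str) -> List[Token]:
    """Turn tagger output into Token records.

    Assumption: one token per line, columns separated by tabs.
    Lines that do not have exactly three columns are skipped.
    """
    tokens = []
    for line in output.splitlines():
        columns = line.split("\t")
        if len(columns) == 3:
            tokens.append(Token(*columns))
    return tokens


# Hand-written sample in the assumed tab-separated layout.
sample = "The\tDT\tthe\nTreeTagger\tNP\tTreeTagger\nis\tVBZ\tbe\n"
for token in parse_tagger_output(sample):
    print(token.word, token.pos, token.lemma)
```

Skipping lines with a different column count is a design choice of this sketch, so that any extra markup lines in the output do not raise an error.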
The TreeTagger can also be used as a chunker for English, German,
French, and Spanish.
The tagger is described in the following two papers:
Download
Executable code for Linux and Windows PCs as well as Intel-Macs,
and parameter files for various languages can be downloaded
via the links below.
This software is freely available for research, education and
evaluation.
Please read the license terms before you download the software! By
downloading the software, you agree to the terms stated there.
The following steps are necessary to install the TreeTagger (see
below for the Windows version). Download the files by
right-clicking on the link. Then select "save file as". All files should be
stored in the same directory.
- Download the tagger package for your system (PC-Linux,
  Mac OS-X (Intel-CPU), PC-Linux (version for older kernels)).
- Download the tagging scripts into the same directory.
- Download the installation script install-tagger.sh.
- Download the parameter files for the languages you want to
  process.
- Open a terminal window and run the installation script in the
  directory where you have downloaded the files:
  sh install-tagger.sh
- Run a test, e.g.
  echo 'Hello world!' | cmd/tree-tagger-english
  or
  echo 'Das ist ein Test.' | cmd/tagger-chunker-german
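The test commands above pipe a sentence into one of the tagging scripts. If you would rather call a script from Python, the same pattern can be sketched as follows. The helper name run_tagger is made up for illustration, and the sketch assumes that the script reads UTF-8 text from standard input and writes its result to standard output.

```python
import subprocess
from typing import Sequence


def run_tagger(text: str, command: Sequence[str]) -> str:
    """Send text to the standard input of command and return its output.

    Mirrors `echo TEXT | COMMAND`. Raises subprocess.CalledProcessError
    if the command exits with a non-zero status.
    Assumption: the command reads and writes UTF-8.
    """
    completed = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return completed.stdout
```

When called from the installation directory, run_tagger("Hello world!\n", ["cmd/tree-tagger-english"]) should correspond to the first test command. The trailing newline stands in for the one that echo appends.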
Make sure that the files are not automatically unzipped, i.e. that the
file ending .gz is still present. If you have difficulties with the
installation, have a look at
the installation hints (kindly
provided by Joachim Wagner). You can also try to install TreeTagger via
the Docker image kindly provided by Leonardo di Donato.
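Whether a download has lost its .gz ending can be checked before running the installation script. The Python sketch below lists files in the download directory that look as if they were decompressed on the way. The endings it looks for (.par and .tar), the file names in the demonstration and the function name are assumptions made for illustration; they are not taken from this page.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumption: a download that was silently decompressed ends in .par or
# .tar instead of .par.gz or .tar.gz. These endings are illustrative guesses.
SUSPICIOUS_ENDINGS = (".par", ".tar")


def find_unzipped_downloads(directory: str) -> List[str]:
    """Return the names of files that appear to have lost their .gz ending."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith(SUSPICIOUS_ENDINGS)
    )


# Demonstration with empty placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("english.par.gz", "german.par"):
        (Path(tmp) / name).touch()
    print(find_unzipped_downloads(tmp))  # ['german.par']
```

An empty result means that no file with one of the assumed endings was found. It does not prove that every download is intact.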
Parameter files
- Bulgarian parameter file (gzip compressed, UTF-8, tagset, trained on
  the Bulgarian Treebank)
- A Chinese parameter file and tokenizer created by Serge Sharoff are
  available here
- Dutch parameter file (UTF-8) (gzip compressed, tagset)
- Another Dutch parameter file (gzip compressed, UTF-8, trained on the
  Eindhoven corpus, tagset documentation)
- English parameter file (gzip compressed, UTF-8, tagset)
- Estonian parameter file (UTF-8) (gzip compressed, tagset documentation)
- Finnish parameter file (UTF-8) trained on data provided by
  Joshua Waxman (list of tags used).
- French parameter file (UTF-8) (gzip compressed, tagset documentation)
- A parameter file for spoken French texts can be found here
- Galician parameter file (UTF-8) (gzip compressed, tagset documentation)
- German parameter file (UTF-8) (gzip compressed, tagset documentation)
- Italian parameter file (UTF-8) (gzip compressed, tagset documentation)
- Marco Baroni's Italian parameter file (gzip compressed, Latin1,
  tagset documentation)
- Latin parameter file (gzip compressed, tagset info in Italian)
  The corpus and lexicon for training the Latin parameter file have been
  compiled by Gabriele Brandolini from various resources.
- Another Latin parameter file (gzip compressed, tagset info) which has
  been trained on the Index Thomisticus Treebank, which was kindly
  provided by Marco Passarotti.
- Mongolian parameter file (gzip compressed, ???)
  created from a small Mongolian corpus by Khuder Altangerel.
- Polish parameter file (UTF-8) trained on the Polish National Corpus
  (tagset description).
- Portuguese parameter file (UTF-8) provided by Pablo Gamallo
  (tagset description).
- Portuguese parameter file (UTF-8) with fine-grained tagset
  provided by Pablo Gamallo (tagset description).
- Russian parameter file (UTF-8) (gzip compressed, tagset, trained on a
  corpus created by Serge Sharoff)
- Slovak parameter file (UTF-8) (gzip compressed)
  The Slovak parameter file was trained on the Slovak National
  Corpus. The tagset was simplified.
- Slovak parameter file (UTF-8, full tags) (gzip compressed)
  The Slovak parameter file was trained on the Slovak National
  Corpus. The tagset was not simplified (just a marker for typos was
  removed). Many thanks to Vladimir Benko for suggesting training on the
  full tagset and also for his bug reports.
- Slovenian parameter file (UTF-8) (gzip compressed)
  The Slovenian parameter file was trained on the ssj500k 1.3
  training corpus. The tagset is documented here.
- Spanish parameter file (UTF-8) (gzip compressed, tagset documentation)
- Swahili parameter file (gzip compressed)
  The Swahili parameter file was trained on the Helsinki Corpus of
  Swahili (HCS) and uses a simplified version of the HCS tagset. The HCS
  was created by Prof. Arvi Hurskainen by means of his Swahili Language
  Manager (SALAMA), which uses Lingsoft's TWOL compiler for constructing
  morphological analysers and Connexor's CG2 parser for syntactic
  disambiguation. The creation of the parameter file was joint work with
  Gabriele Brandolini.
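Most parameter files above are listed as UTF-8, while Marco Baroni's Italian file is listed as Latin1. Assuming that the input text has to use the same encoding as the parameter file (this page does not spell that out), text can be re-encoded before it is passed to the tagger. A minimal Python sketch; the function name is made up for illustration:

```python
def reencode_for_tagger(utf8_bytes: bytes, target_encoding: str = "latin-1") -> bytes:
    """Convert UTF-8 encoded input to the encoding of the parameter file.

    Characters that do not exist in the target encoding are replaced
    with '?', so the conversion can lose information.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(target_encoding, errors="replace")


# 'é' is a single byte (0xE9) in Latin-1 but two bytes in UTF-8.
print(reencode_for_tagger("Perché no?".encode("utf-8")))  # b'Perch\xe9 no?'
```

Replacing unsupported characters keeps the token count stable but alters the text. Raising an error instead (errors="strict") may be preferable when exact input matters.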
Chunker parameter files for PC (Linux, Windows, and Mac-Intel)
Windows version
A Windows version of the TreeTagger is available here. Unpack the zip
file and follow the instructions in the INSTALL.txt file. The parameter
files have to be downloaded separately. The tagger has to be invoked
from a (Windows, cygwin, msys) shell. Therefore, you might want to
install the graphical interface kindly provided by Ciarán Ó Duibhín.
Acknowledgments
The Russian parameter file was created on a corpus provided by Serge
Sharoff. He has a webpage with various
resources for Russian NLP.
The French and the Italian parameter files are provided by Achim
Stein.
The parameter file for the French chunker was created by Michel Généreux.
The second Italian parameter file was provided by Marco Baroni.
The English parameter file was trained on
the PENN
treebank and uses the English morphological database created by Karp,
Schabes, Zaidel and Egedi.
The Spanish parameter file was trained on
the Spanish CRATER corpus and uses the Spanish lexicon
of the CALLHOME corpus of
the LDC.
The Spanish chunker was trained on
the IULA Spanish treebank.
The Galician parameter file was trained on
the Xiada corpus provided by the Centro Ramón Piñeiro para a Investigación en Humanidades.
The Bulgarian parameter file was created
by Julien Nioche on
the Bulgarian
Treebank. It uses UTF-8 encoding and
the BulTreeBank tagset.
The Estonian parameter file was trained on
the Tartu Morphologically disambiguated corpus. Thanks
to Mark Fishel for pointing me to this data!
Many thanks to Marco Baroni, Pablo Gamallo,
Julien Nioche, Serge Sharoff, Michel Généreux, and Achim
Stein for making their parameter files publicly available! Also thanks
to Holger Wunsch and Cassio Binkowski for compiling the TreeTagger on MacOS!
Links
The TreeTagger is a component of the following software products (and of many others too):
In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!
Please send questions, comments, suggestions and bug reports to Helmut
Schmid at FirstName.LastName@cis.uni-muenchen.de.