Version date: 20.02.2012
********************

This directory contains scripts for:

1) converting TreeTagger POS tags into MulText for DE, EN, ES and FR
   (EN, ES and FR: only nouns, adjectives and verbs)
2) correction of lemmas output by the TreeTagger and RFTagger (for DE)


Usage instructions
***************

The lemma correction script takes as input a three-column tagged and
lemmatized file (word TAB pos TAB lemma). Since the taggers used in TTC do
not output MulText tags, the tags must first be converted into MulText
(step 1) before the lemma correction (step 2) can be run.


1) Converting POS tags into MulText:
****************************

The script has the following parameters:

   "-i": input file
   "-l": language (en, de, fr, es)
   "-t": tagger used (tt = TreeTagger, rf = RFTagger)
   "-u": retain or replace the unknown lemma ("<unknown>")

The script creates a new file "corpus.multag" containing the input file
with MulText tags.

To run the script, perform the following (Unix):

   1) change to the directory lemma-correction-d32
   2) run the following command to convert the input file (example: EN):

      python toMultext.py -i multext-examples/en.ttag -l en -t tt -u yes


2) Lemma correction:
*****************

Parameters:

   "-t": input file
   "-l": language
   "-s": file with split compounds
   "-r": inflection rules

The result is written to a file "tagged.new", which has the same format as
the input file.

To perform the lemma correction, type the following:

   python correctLemma.py -t corpus.multag -l en -s "" -r infl-files/en-infl-rules

For EN, there is no file with split compounds, so the argument of the
parameter "-s" is empty ("").

For DE, for example, type the following:

   1) python toMultext.py -i multext-examples/de.rftag -l de -t rf
   2) python correctLemma.py -t corpus.multag -l de -s multext-examples/de.split -r infl-files/de-infl-rules
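

Running both steps from Python (sketch)
*****************

The two steps above can also be chained from a small Python driver. The
following is only a minimal sketch, not part of the distributed scripts:
the helper names build_commands, run_pipeline and parse_tagged are
hypothetical, and the interpreter name is a parameter because this README
does not state which Python version the scripts need. The flags, the script
names and the intermediate file "corpus.multag" are taken from the
instructions above. parse_tagged reads the three-column format
(word TAB pos TAB lemma); skipping blank lines and rejecting any other
line are assumptions.

```python
import subprocess


def build_commands(input_file, lang, tagger, rules_file,
                   split_file="", unknown=None, python="python"):
    """Build the argument lists for the two steps (hypothetical helper).

    The flags mirror the ones listed in this README. "-u" is only passed
    when a value is given, as in the DE example, which omits it.
    """
    convert = [python, "toMultext.py",
               "-i", input_file, "-l", lang, "-t", tagger]
    if unknown is not None:
        convert += ["-u", unknown]
    # toMultext.py always writes "corpus.multag", so step 2 reads that file.
    correct = [python, "correctLemma.py",
               "-t", "corpus.multag", "-l", lang,
               "-s", split_file, "-r", rules_file]
    return convert, correct


def run_pipeline(workdir, *args, **kwargs):
    """Run both steps in workdir, stopping at the first failing command."""
    for cmd in build_commands(*args, **kwargs):
        subprocess.run(cmd, cwd=workdir, check=True)


def parse_tagged(lines):
    """Yield (word, pos, lemma) tuples from three-column lines.

    Assumption: blank lines are skipped and any other line without
    exactly three tab-separated fields is treated as an error.
    """
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 tab-separated fields, got %d"
                             % (number, len(fields)))
        yield tuple(fields)
```

For the DE example, run_pipeline("lemma-correction-d32",
"multext-examples/de.rftag", "de", "rf", "infl-files/de-infl-rules",
split_file="multext-examples/de.split") would issue the same two commands
as listed above. Passing the arguments as lists avoids shell quoting, so
the empty "-s" value for EN is passed through unchanged.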