3. Corpus indexing

3.1 Introduction

The indexing of corpora is based on the TIGER-XML format. A corpus encoded in any other format has to be converted to TIGER-XML first, either with one of the existing corpus format filters (i.e. converters to TIGER-XML, cf. subsection 3.5) or by a conversion of your own.

To index a corpus, first mark the parent folder of the new corpus (in the example: German). Now click the Insert Corpus button in the button toolbar or choose the Insert Corpus item in the popup menu (right mouse click):

Figure: Inserting a new corpus

Next the corpus indexing window pops up. First of all, you have to specify the corpus input format: TIGER-XML Format or Other Format:

Figure: Corpus indexing window

The additional parameters of the indexing window are explained in the following subsections (cf. subsection 3.2 and subsection 3.3).

Please note: During the corpus indexing process a corpus directory, which comprises several corpus files, is generated. The directory and the files in it are created in a platform-independent way. So if you are working on a platform that allows for fine-grained user permissions (e.g. Unix), you should check the permissions of the new corpus directory right after the indexing process has finished in order to make sure that the desired group of TIGERSearch users will be able to access the newly created corpus.
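The permission check described above can also be scripted. The following is a minimal Python sketch, assuming a Unix-like platform; the function name make_group_readable is hypothetical and not part of TIGERSearch, and the location of the corpus directory depends on your installation. It only adds read access (and directory traversal) for the owning group and leaves all other permission bits untouched.

```python
import os
import stat
from pathlib import Path


def make_group_readable(corpus_dir: str) -> None:
    """Add group read access to a corpus directory and everything below it.

    Hypothetical helper: corpus_dir is the directory created by the indexing
    process; where it lives depends on your installation.
    """
    root = Path(corpus_dir)
    for path in [root, *root.rglob("*")]:
        mode = stat.S_IMODE(path.stat().st_mode)
        if path.is_dir():
            # Directories also need the execute bit so that they can be entered.
            extra = stat.S_IRGRP | stat.S_IXGRP
        else:
            extra = stat.S_IRGRP
        os.chmod(path, mode | extra)
```

Whether group access is sufficient, or whether all users need access, depends on how the TIGERSearch users are organised on your system.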

3.2 Indexing of TIGER-XML files

If your corpus source is encoded in TIGER-XML format, please mark Corpus is in TIGER-XML format at the top of the window. Selecting this option deactivates some parameters of the window:

Figure: Indexing parameters (TIGER-XML corpus input)

Now you have to specify the following parameters:

Corpus ID

The corpus ID is used by the TIGERSearch software suite to realize corpus-dependent configurations. The corpus ID must be unique with regard to all other indexed corpora. The uniqueness is checked before the indexing process is initiated. The ID has to start with a letter.

TIGER-XML file

The source file (relative paths are evaluated with regard to the working directory) can be either an uncompressed .xml file, or a compressed .xml.gz file, or a .zip file that contains one source file only. Compressed files are automatically decompressed during the indexing process.

Extended indexing

If extended indexing is activated, additional corpus information is retrieved during the indexing process. This information is used to improve corpus query processing efficiency. Efficiency increases by about 50%, at the expense of the main memory requirement, which also increases by about 50%.

Please note: Default indexing requires a constant amount of main memory (about 128 MB). The main memory requirement of the extended indexing process will depend on corpus size. If an out of memory warning is displayed, please modify the main memory configuration of the TIGERRegistry tool (cf. section 4, chapter II).
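The rules for the Corpus ID and the TIGER-XML file parameters can be checked before the indexing window is opened. The following Python sketch is an illustration only: both function names are hypothetical, the set of existing IDs has to be supplied by you, and TIGERRegistry performs its own checks regardless. The sketch tests the two documented ID rules and reads the first bytes of a source file that is either uncompressed, gzip-compressed, or a zip archive containing exactly one file.

```python
import gzip
import zipfile
from pathlib import Path


def is_plausible_corpus_id(corpus_id: str, existing_ids: set[str]) -> bool:
    """Check the two documented rules: starts with a letter, and is unique."""
    return bool(corpus_id) and corpus_id[0].isalpha() and corpus_id not in existing_ids


def read_source_head(path: str, size: int = 64) -> bytes:
    """Return the first bytes of the source file, decompressing if necessary."""
    name = Path(path).name.lower()
    if name.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            # Ignore directory entries; the archive must hold one source file only.
            members = [m for m in archive.namelist() if not m.endswith("/")]
            if len(members) != 1:
                raise ValueError(
                    f"expected exactly one file in the archive, found {len(members)}"
                )
            with archive.open(members[0]) as handle:
                return handle.read(size)
    if name.endswith(".gz"):
        with gzip.open(path, "rb") as handle:
            return handle.read(size)
    with open(path, "rb") as handle:
        return handle.read(size)
```

A result that does not begin with an XML declaration or an opening tag is a hint that the wrong file was selected.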

After specifying the indexing parameters, you can start the indexing process by pressing the Start button. The corpus indexing can be stopped at any time. The current progress of the indexing is displayed by the indexing progress window:

Figure: Indexing progress window

The progress window also shows how many warnings and errors occurred during corpus indexing. These messages are stored in the corpus log file indexing.log, which is placed in the corpus directory. In subsection 3.4 we describe how to view this log file within the TIGERRegistry application.

When the indexing process is finished, the Corpus properties window pops up (see screenshot below). Here you can specify meta information about the corpus such as the corpus name. The corpus properties window is explained in detail in section 4.

Figure: Corpus properties window

After the corpus properties specification, just press the OK button to finish corpus indexing. Now the new corpus can be found in the corpus tree.

3.3 Indexing of non-TIGER-XML files

Indexing of non-TIGER-XML files consists of two steps: first, the corpus is converted to TIGER-XML; afterwards, the generated TIGER-XML corpus is indexed. Thus, you have to choose the Corpus is in Other Format option at the top of the window and specify the following parameters (cf. screenshot):

Corpus ID

The corpus ID is used by the TIGERSearch software to realize corpus-dependent configurations. The corpus ID must be unique with regard to all indexed corpora. The uniqueness is checked before the indexing process is initiated. The ID has to start with a letter.

Import file

The source file can either be an uncompressed file, or a compressed .gz file, or a zip file that contains one source file only (relative paths are evaluated with regard to the working directory). Compressed files are automatically decompressed during the indexing process.

Import filter

Select one of the corpus format filters (cf. screenshot above). For a list of all implemented filters see subsection 3.5.

Convert corpus graphs

You can either convert and index the whole corpus or the first n graphs of the corpus.

Temporary TIGER-XML file

The first step of the indexing is the conversion to TIGER-XML. Thus, a temporary TIGER-XML file is needed. When typing in the corpus ID, a file name is automatically generated by the system (cf. checkbox Default name for XML file). Of course, you can also specify a different file name. Please note that relative paths are evaluated with regard to the working directory.

TIGER-XML file parameters

As the TIGER-XML file is a temporary file, it makes sense to compress it. You can enforce GZIP compression by checking the GZip XML file box. When the indexing process is finished, the temporary file is automatically deleted. If you want to keep the file (e.g. for debugging purposes), just uncheck the Delete after indexing box.

You can also make use of a so-called external header, i.e. a TIGER-XML document header which is stored in a separate file. To use this external header check the External Header box and type in the path of the header file (relative paths are evaluated with regard to the working directory).

Extended indexing

If extended indexing is activated, additional corpus information is retrieved during the indexing process. This information is used to improve corpus query processing efficiency. Efficiency increases by about 50%, at the expense of the main memory requirement, which also increases by about 50%.

Please note: Default indexing requires a constant amount of main memory (about 128 MB). The main memory requirement of the extended indexing process will depend on corpus size. If an out of memory warning is displayed, please modify the main memory configuration of the TIGERRegistry tool (cf. section 4, chapter II).

To start the indexing process press the Start button. The corpus conversion and indexing can be stopped at any time. The current progress is displayed by the Converting & Indexing progress window:

Figure: Conversion and Indexing progress window

The progress window also shows how many warnings and errors occurred during corpus conversion and indexing. These messages are stored in the corpus log files conversion.log and indexing.log, which are placed in the corpus directory. In subsection 3.4 we describe how to view these log files within the TIGERRegistry application.

When the indexing process is finished, a corpus properties window pops up (cf. screenshot below). Here you can fill in meta information about the corpus such as the corpus name. The corpus properties window is explained in section 4.

Figure: Corpus properties window

After specifying the corpus properties, just click the OK button to finish corpus indexing. Now the new corpus can be found in the corpus tree.

3.4 Viewing corpus log files

The conversion of corpora into the TIGER-XML format and the indexing of TIGER-XML corpora are both implemented in a robust way, i.e. both processes can also handle corpus sentences that violate the syntactic and semantic restrictions in minor respects. However, warnings and error messages are produced in these cases. All these messages are collected in two corpus log files which are stored in the corpus directory of the new corpus:

conversion.log

Warnings and errors that have been produced during the corpus conversion process, i.e. during the conversion to TIGER-XML.

indexing.log

Warnings and errors that have been produced during the corpus indexing process.

After corpus indexing you may inspect these messages in order to modify your corpus. Of course, you can view these files using your favourite external editor. However, you can also have a look at these files within the TIGERRegistry window. Just mark the corpus you are interested in and select the Corpus Logfiles item in the Corpus submenu of the context menu or select the corresponding item in the TIGERRegistry menu.

Now the corpus logging window pops up. It displays the content of the two log files. To help you keep track of all the messages, the keywords Warning and Error are displayed in green and red, respectively.

Figure: Viewing corpus log files
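For large corpora it can be handy to summarise a log file outside of TIGERRegistry. The following Python sketch counts the lines that contain the keywords Warning and Error. It rests on assumptions that this manual does not confirm: that each message occupies one line, that the keywords appear literally in that line, and that the file can be read as ISO-Latin-1. The function name count_log_keywords is hypothetical.

```python
from collections import Counter


def count_log_keywords(log_path: str) -> Counter:
    """Count the lines of a corpus log file that mention Warning or Error."""
    counts = Counter()
    # Assumption: latin-1 is used because it accepts any byte sequence;
    # the actual encoding of the log files is not documented here.
    with open(log_path, encoding="latin-1") as handle:
        for line in handle:
            for keyword in ("Warning", "Error"):
                if keyword in line:
                    counts[keyword] += 1
    return counts
```

Calling count_log_keywords on indexing.log or conversion.log in the corpus directory returns a Counter whose entries can be compared with the numbers shown in the progress window.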

3.5 List of implemented import filters

Please note: Corpora to be processed by the text-based import filters of TIGERRegistry (except some XML-based filters) have to be encoded in ISO-Latin-1. If characters outside the ISO-Latin-1 character set have to be used in a corpus, please use the following Unicode encoding convention: prefix the hexadecimal Unicode number of your character with the string \u. For example, the Unicode character corresponding to the hexadecimal number 03a9 (Greek capital letter Omega) has to be encoded as \u03a9.
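The encoding convention above can be illustrated with a few lines of Python. This is a sketch, not part of TIGERSearch: the function name escape_non_latin1 is hypothetical, and writing characters beyond U+FFFF as two escapes (a UTF-16 surrogate pair) is an assumption that this manual does not address.

```python
def escape_non_latin1(text: str) -> str:
    """Replace every character outside ISO-Latin-1 with a \\uXXXX escape."""
    parts = []
    for char in text:
        if ord(char) <= 0xFF:
            # ISO-Latin-1 characters are kept unchanged.
            parts.append(char)
        else:
            # Assumption: characters beyond U+FFFF are written as a UTF-16
            # surrogate pair, i.e. as two consecutive escapes.
            units = char.encode("utf-16-be")
            for i in range(0, len(units), 2):
                parts.append("\\u%02x%02x" % (units[i], units[i + 1]))
    return "".join(parts)


print(escape_non_latin1("Ω"))  # prints \u03a9
```

The result contains only ISO-Latin-1 characters and can therefore be saved in that encoding before the corpus is imported.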

The following import filters have been implemented:

General bracketing formats

general () filter

This filter should work with bracketing-style corpora that use parentheses for structuring. It generates cat, pos and word features.

general [] filter

This filter should work with bracketing-style corpora that use square brackets for structuring. It generates cat, pos and word features.

PennTreebank formats

general PennTreebank filter

This filter should work with UPenn-style corpora. Syntactic functions are modelled as edge labels, traces are modelled as secondary edges.

This filter has been tested for the Wall Street Journal and Brown Corpus (Penn Treebank - bracketing version in mrg/ subdirectory), the Penn-Helsinki Parsed Corpus of Middle English, and the Chinese Treebank.

The Chinese Treebank has to be converted to the Unicode encoding convention described at the beginning of this subsection first. The command line tool native2ascii can be used for this purpose. It is included in Sun's Java Development Kit which you can download at http://java.sun.com. For the Chinese Treebank, the command line is the following:

native2ascii -encoding GB2312 chinese.txt unicodeoutput.txt

ATIS corpus filter

This is a special filter for the ATIS corpus format only (Penn Treebank - bracketing version in mrg/ subdirectory). It handles the different pos and word notation and other corpus-specific differences.

SWITCHBOARD corpus filter

This is a special filter for the SWITCHBOARD corpus format only (Penn Treebank - bracketing version in mrg/ subdirectory). It skips code and interjection sections.

Korean treebank filter

This is a special filter for the Korean Treebank corpus format only. The Korean Treebank has to be converted to the Unicode encoding convention described at the beginning of this subsection first. The command line tool native2ascii can be used for this purpose. It is included in Sun's Java Development Kit which you can download at http://java.sun.com. For the Korean Treebank, the command line is the following:

native2ascii -encoding KSC5601 korean.txt unicodeoutput.txt

Treebanks converted by negra-topenn

This is a special filter for corpora that have been generated using the negra-topenn command line tool. This tool is part of the Negra Corpus deliverable. It has been developed to linguistically transform treebanks which have been annotated according to the Negra annotation scheme to the UPenn style format.

Susanne and Christine

Susanne corpus filter

This is a special filter for the Susanne corpus format only.

Christine corpus filter

This is a special filter for the Christine corpus format only.

Negra format

general Negra format filter

This filter should work with any corpus encoded according to the Negra format, Version 3 or Version 4. It has been tested for the Negra Corpus, the Negra 2000 Corpus, the VerbMobil Treebank, and the TIGER Corpus Release 1.

IMS tools

LoPar format filter

LoPar is an implementation of a parser for head-lexicalised probabilistic context-free grammars. Grammars are currently available for German and English. LoPar has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/LoPar-en.html).

The LoPar format filter is able to process a special output format of LoPar. This output can be generated using the following LoPar command line:

cat input.txt | lopar -in <model> -stems -heads -viterbi -viterbi-probs -tgrep > output.txt

The input file must be a text file in one word per line format.
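If your text is not yet in one word per line format, a rough conversion can be sketched in Python as follows. The function name to_one_word_per_line is hypothetical, and the sketch splits at whitespace only; it does not separate punctuation from words, so a proper tokenizer may be needed for real input.

```python
def to_one_word_per_line(text: str) -> str:
    """Put each whitespace-separated token on a line of its own."""
    return "".join(token + "\n" for token in text.split())
```

The returned string can be written to input.txt and passed to the LoPar command line given above.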

TreeTagger chunking filter

The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It also comprises a chunker that is based on the tagging output. Chunking modules are currently available for German and English. The TreeTagger has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/corplex/). The TreeTagger chunking filter is able to process the XML output of the chunker.

YAC format filter

The chunker YAC (Yet Another Chunker) is a rule-based chunker for German. It has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/corplex/). The YAC format filter is able to process the XML-based YAC output.

Other formats

DEREKO format filter

This filter should work with corpora encoded according to the DEREKO corpus format. The DEREKO corpus format has been developed within the DEREKO project.

3.6 Corpus conversion only

If you just want to convert a corpus to the TIGER-XML format without subsequent corpus indexing, you can use the corpus conversion feature. Choose the Convert Corpus item in the Corpus menu. The corpus conversion window pops up. Specify the parameters, which have been explained in subsections 3.2 and 3.3, and press the Start button to start the conversion. The conversion process can be stopped at any time.