Book: The TIGERRegistry administration tool

Treebanks to be processed by the TIGERSearch search engine have to be converted into a binary representation first - the so-called index. This index-based approach splits working with the TIGERSearch software suite into two parts: The first part deals with corpus indexing (TIGERRegistry, described in the present chapter), the second part with corpus query processing (TIGERSearch, described in chapter IV). The TIGERSearch corpus query processor can only process corpora that have already been indexed.

The indexed corpora are organized in a hierarchical file system: Each corpus is stored in a folder (i.e. in a directory of your local file system), and corpus folders can be grouped in a common folder as well. The following example illustrates the physical content of a corpus directory and its graphical tree representation. Corpus folders are represented as folder icons and corpora are represented as book icons.

1.2 Corpus formats

TIGERSearch supports many existing corpus encoding formats, based on a two-step approach. Since the data model supported by TIGERSearch is more general and more extensible than the data models of most available formats (Penn Treebank format, Negra format, etc.), we have developed the TIGER-XML format. This format maps the supported data model to XML. The TIGER-XML format is described in chapter V. Corpora to be indexed with the TIGERRegistry tool must be encoded in the TIGER-XML format.

To support as many treebank formats as possible, we have also implemented import filters (i.e. converters to TIGER-XML) for many popular formats such as Penn Treebank format or Negra format, and some parser output formats. Indexing of TIGER-XML corpora and indexing of corpora with an import filter are described in subsection 3.2 and subsection 3.3, respectively. A list of implemented import filters can be found in subsection 3.5.

2. Starting TIGERRegistry

2.1 Starting the TIGERRegistry tool

How you start the TIGERRegistry tool depends on your operating system. On Windows machines, a program group called TIGERSearch is created during installation, so you just have to select the TIGERRegistry program in the start menu.

On Unix machines, symbolic links have been created. If your general path is set properly, you may just need to type in TIGERRegistry. However, the TIGERRegistry start program can always be found in the TIGERSearch installation path:

Please note: Relative paths specified by the user are evaluated with regard to the working directory. On Unix machines this directory is defined as the TIGERSearch starting directory (i.e. the directory TIGERSearch has been started from). On Mac and Windows machines the working directory is defined as the user's home directory.
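
The rule above can be illustrated with a minimal sketch. It is not part of the TIGERSearch software; the function name resolve_user_path and the example paths are hypothetical, and the working directory is passed in explicitly because it differs between platforms as described above.

```python
import os

def resolve_user_path(path, working_dir):
    """Return the location a user-supplied path refers to.

    Absolute paths are kept; relative paths are evaluated with regard
    to the given working directory.
    """
    if os.path.isabs(path):
        return os.path.normpath(path)
    return os.path.normpath(os.path.join(working_dir, path))

# Hypothetical example: a relative source file name typed into a dialog.
print(resolve_user_path("corpora/tiger.xml", "/home/user"))
```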

2.2 Corpus administration window

When you start the TIGERRegistry tool, it first checks whether you have the permissions to read, write, and create files in the corpus directory. If you do not have the required permissions, an information window pops up. The TIGERRegistry tool will not be started.
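
If you want to test the permissions yourself before starting the tool, a check along the following lines may help. This is only a sketch under the assumption that read, write, and search permission on the directory are what is needed; it does not reproduce the check that TIGERRegistry actually performs, and the function name can_administer is hypothetical.

```python
import os

def can_administer(corpus_dir):
    """True if the current user may read, write, and create entries in corpus_dir."""
    # On a directory, write plus search (execute) permission is what
    # allows new files and subdirectories to be created.
    return os.path.isdir(corpus_dir) and os.access(
        corpus_dir, os.R_OK | os.W_OK | os.X_OK
    )
```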

If the permission check has been successful, the TIGERRegistry main window appears (cf. screenshot). Position and size of the window are saved when leaving the tool. So the arrangement of your windows will be restored in the next TIGERRegistry session.

The TIGERSearch User's Manual can be accessed directly within the TIGERRegistry user interface. The TIGERRegistry help window can be activated by pressing the Help button in the upper toolbar or selecting one of the items in the Help menu.

Now you can browse through the corpus tree (left-hand side of the window) and have a look at the corpus properties (right-hand side). Just click on a corpus symbol to see the corresponding corpus properties. All operations on this corpus tree (insert a new corpus, delete a corpus, etc.) can be activated by pressing the appropriate button in the toolbar, by selecting the appropriate item in the popup menu (activated by clicking the right mouse button on the corpus symbol), or by selecting the corresponding item in the menu bar.

Please note: In contrast to file management tools there is no Undo function available in the TIGERRegistry tool!

2.3 Folders

To insert a new folder (which can contain corpora and other folders) first mark the parent folder of the new folder (in the following figure the folder Projects has been marked). Now click the Insert folder button in the button toolbar. A new window pops up. Please type in the name of your folder (here: MyFolder) and press the Save button.

To delete a folder mark it and press the Delete folder button in the button toolbar. The folder and its subfolders will be deleted.

To move a folder into a new parent folder, use the drag and drop feature of the TIGERRegistry tool: Click the left mouse button on the folder, keep the mouse button pressed (drag), move the folder to the new parent folder, and release the mouse button (drop).

2.4 Corpora

Please read section 3, section 4, and subsection 3.4 for details about inserting a corpus, changing corpus properties, and viewing corpus log files.

To delete a corpus, first mark the corresponding corpus symbol in the corpus administration tree (cf. corpus TESTCORPUS in the screenshot). Now click the Delete corpus symbol in the button toolbar. A corpus deletion always has to be confirmed:

Please note: Corpus deletion will delete all data of the selected corpus on the hard disc.

To move a corpus into a new parent folder, use the drag and drop feature of the TIGERRegistry tool: Click the left mouse button on the corpus, keep the mouse button pressed (drag), move the corpus to the new parent folder, and release the mouse button (drop).

2.5 Consistency check

The consistency check runs through the corpus administration tree and checks whether all corpus IDs are distinct. An inconsistency may be caused by corpus administrators with different user rights. To start the consistency check click the corresponding button in the button tool bar, which is presented as a yellow warning symbol.
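
The idea behind the check can be sketched as follows. The sketch assumes that the corpus IDs have already been collected from the corpus tree into a list and that IDs are compared as exact strings; both points are assumptions, and find_duplicate_ids is a hypothetical name, not a TIGERRegistry function.

```python
from collections import Counter

def find_duplicate_ids(corpus_ids):
    """Return the sorted list of corpus IDs that occur more than once."""
    counts = Counter(corpus_ids)
    return sorted(cid for cid, n in counts.items() if n > 1)

# Hypothetical IDs collected from a corpus tree.
print(find_duplicate_ids(["TIGER", "NEGRA", "TESTCORPUS", "NEGRA"]))
```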

3. Corpus indexing

3.1 Introduction

The indexing of corpora is based on the TIGER-XML format. Corpora encoded in other formats have to be converted to TIGER-XML first. So if your corpus source file is not encoded in TIGER-XML format, you will have to use one of the existing corpus format filters (i.e. converters to TIGER-XML, cf. subsection 3.5) or convert your corpus to TIGER-XML on your own.

To index a corpus, first mark the parent folder of the new corpus (in the example: German). Now click the Insert Corpus button in the button toolbar or choose the Insert Corpus item in the popup menu (right mouse click):

Next the corpus indexing window pops up. First of all, you have to specify the corpus input format: TIGER-XML Format or Other Format:

The additional parameters of the indexing windows are explained in the following subsections (cf. subsection 3.2 and subsection 3.3).

Please note: During the corpus indexing process a corpus directory, which comprises several corpus files, is generated. The directory and the files in it are created in a platform-independent way. So if you are working on a platform that allows for fine-grained user permissions (e.g. Unix), you should check the permissions of the new corpus directory right after the indexing process has finished in order to make sure that the desired group of TIGERSearch users will be able to access the newly created corpus.
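
On Unix, such a permission check could be scripted roughly as shown below. The sketch only reports entries that are not readable by the owning group; whether group access is the right criterion for your installation is an assumption, and the function name paths_without_group_read is hypothetical.

```python
import os
import stat

def paths_without_group_read(corpus_dir):
    """List the directories and files below corpus_dir that lack group read permission."""
    unreadable = []
    for root, dirs, files in os.walk(corpus_dir):
        # Check the directory itself, then every file stored in it.
        for path in [root] + [os.path.join(root, name) for name in files]:
            if not os.stat(path).st_mode & stat.S_IRGRP:
                unreadable.append(path)
    return unreadable
```

An empty result means that every directory and file of the corpus is readable by its group.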

3.2 Indexing of TIGER-XML files

If your corpus source is encoded in TIGER-XML format, please mark Corpus is in TIGER-XML format at the top of the window. Selecting this option deactivates some parameters of the window:

The corpus ID is used by the TIGERSearch software suite to realize corpus-dependent configurations. The corpus ID must be unique with regard to all other indexed corpora. The uniqueness is checked before the indexing process is initiated. The ID has to start with a letter.
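
The two rules for corpus IDs can be expressed as a small sketch. It checks only the two documented rules (first character is a letter, ID not yet in use); any further restrictions the tool may impose are not covered, the set of existing IDs is assumed to be known, and check_corpus_id is a hypothetical name.

```python
def check_corpus_id(corpus_id, existing_ids):
    """Return a list of problems; an empty list means both documented rules hold."""
    problems = []
    # Assumption: any Unicode letter counts as a letter here.
    if not corpus_id or not corpus_id[0].isalpha():
        problems.append("ID must start with a letter")
    if corpus_id in existing_ids:
        problems.append("ID is already used by another corpus")
    return problems
```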

The source file (relative paths are evaluated with regard to the working directory) can be either an uncompressed .xml file, or a compressed .xml.gz file, or a .zip file that contains one source file only. Compressed files are automatically decompressed during the indexing process.
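
To inspect a source file in the same way outside the tool, the three accepted variants could be opened as in the following sketch. It is an illustration only and does not show how TIGERRegistry itself reads the file; the function name open_corpus_source is hypothetical, and the file type is guessed from the file name extension.

```python
import gzip
import zipfile

def open_corpus_source(path):
    """Open a corpus source for binary reading, decompressing it if necessary."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    if path.endswith(".zip"):
        archive = zipfile.ZipFile(path)
        # Ignore directory entries; exactly one source file is allowed.
        members = [name for name in archive.namelist() if not name.endswith("/")]
        if len(members) != 1:
            archive.close()
            raise ValueError("zip archive must contain exactly one source file")
        return archive.open(members[0])
    return open(path, "rb")
```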

If extended indexing is activated, additional corpus information is retrieved during the indexing process. This information is used to improve corpus query processing efficiency. Efficiency increases by about 50%, at the expense of the main memory requirement, which also increases by about 50%.

Please note: Default indexing requires a constant amount of main memory (about 128 MB). The main memory requirement of the extended indexing process will depend on corpus size. If an out of memory warning is displayed, please modify the main memory configuration of the TIGERRegistry tool (cf. section 4, chapter II).

After specifying the indexing parameters, you can start the indexing process by pressing the Start button. The corpus indexing can be stopped at any time. The current progress of the indexing is displayed by the indexing progress window:

The progress window also shows how many warnings and errors occurred during corpus indexing. These messages are stored in the corpus log file indexing.log, which is placed in the corpus directory. In subsection 3.4 we describe how to view this log file within the TIGERRegistry application.

When the indexing process is finished, the Corpus properties window pops up (see screenshot below). Here you can specify meta information about the corpus such as the corpus name. The corpus properties window is explained in detail in section 4.

After the corpus properties specification, just press the OK button to finish corpus indexing. Now the new corpus can be found in the corpus tree.

3.3 Indexing of non TIGER-XML files

Indexing of non TIGER-XML files consists of two steps: First of all, the corpus is converted to TIGER-XML. Afterwards the generated TIGER-XML corpus is indexed. Thus, you have to choose the Corpus is in Other Format option at the top of the window and specify the following parameters (cf. screenshot):

The corpus ID is used by the TIGERSearch software to realize corpus-dependent configurations. The corpus ID must be unique with regard to all indexed corpora. The uniqueness is checked before the indexing process is initiated. The ID has to start with a letter.

The source file can either be an uncompressed file, or a compressed .gz file, or a zip file that contains one source file only (relative paths are evaluated with regard to the working directory). Compressed files are automatically decompressed during the indexing process.

Select one of the corpus format filters (cf. screenshot above). For a list of all implemented filters see subsection 3.5.

You can either convert and index the whole corpus or the first n graphs of the corpus.

The first step of the indexing is the conversion to TIGER-XML. Thus, a temporary TIGER-XML file is needed. When typing in the corpus ID, a file name is automatically generated by the system (cf. checkbox Default name for XML file). Of course, you can also specify a different file name. Please note that relative paths are evaluated with regard to the working directory.

As the TIGER-XML file is a temporary file, it makes sense to compress it. You can enforce GZIP compression by checking the GZip XML file box. When the indexing process is finished, the temporary file is automatically deleted. If you want to keep the file (e.g. for debugging purposes), just uncheck the Delete after indexing box.

You can also make use of a so-called external header, i.e. a TIGER-XML document header which is stored in a separate file. To use this external header check the External Header box and type in the path of the header file (relative paths are evaluated with regard to the working directory).

To start the indexing process press the Start button. The corpus conversion and indexing can be stopped at any time. The current progress is displayed by the Converting & Indexing progress window:

The progress window also shows how many warnings and errors occurred during corpus conversion and indexing. These messages are stored in the corpus log files conversion.log and indexing.log, which are placed in the corpus directory. In subsection 3.4 we describe how to view these log files within the TIGERRegistry application.

When the indexing process is finished, a corpus properties window pops up (cf. screenshot below). Here you can fill in meta information about the corpus such as the corpus name. The corpus properties window is explained in section 4.

After specifying the corpus properties, just click the OK button to finish corpus indexing. Now the new corpus can be found in the corpus tree.

3.4 Viewing corpus log files

The conversion of corpora into the TIGER-XML format and the indexing of TIGER-XML corpora are both implemented in a robust way, i.e. both processes are also capable of handling corpus sentences that do not fulfill the syntactic and semantic restrictions in some minor points. However, warnings and error messages are produced in these cases. All these messages are collected in two corpus log files which are stored in the corpus directory of the new corpus:

conversion.log

Warnings and errors that have been produced during the corpus conversion process, i.e. during the conversion to TIGER-XML.

indexing.log

Warnings and errors that have been produced during the corpus indexing process.

After corpus indexing you may inspect these messages in order to modify your corpus. Of course, you can view these files using your favourite external editor. However, you can also have a look at these files within the TIGERRegistry window. Just mark the corpus you are interested in and select the Corpus Logfiles item in the Corpus submenu of the context menu, or select the corresponding item in the TIGERRegistry menu.

Now the corpus logging window pops up. It displays the content of the two log files. To help you keep track of all the messages, the keywords Warning and Error are displayed in green and red, respectively.
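
If you prefer to summarize a log file outside the tool, the keywords can also be counted with a few lines of code. The exact line format of conversion.log and indexing.log is not specified here, so the sketch simply assumes that each message line contains the keyword Warning or Error; the function name count_log_keywords and the sample lines are hypothetical.

```python
def count_log_keywords(lines):
    """Count the lines that contain the keyword Warning or Error."""
    counts = {"Warning": 0, "Error": 0}
    for line in lines:
        for keyword in counts:
            if keyword in line:
                counts[keyword] += 1
    return counts

# Hypothetical log lines; real log files may look different.
sample = ["Warning: unknown feature value", "Error: sentence skipped", "done"]
print(count_log_keywords(sample))
```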

3.5 List of implemented import filters

Please note: Corpora to be processed by the text-based import filters of TIGERRegistry (except some XML-based filters) have to be encoded in ISO-Latin-1. If characters outside the ISO-Latin-1 character set have to be used in a corpus, please use the following Unicode encoding convention: Prefix the hexadecimal Unicode number of your character with the string \u. For example, the Unicode character corresponding to the hexadecimal number 03a9 (Greek capital letter Omega) has to be encoded as \u03a9.
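
The encoding convention can be illustrated with a short sketch that rewrites a string accordingly. It is not part of TIGERRegistry; the function name to_tiger_escapes is hypothetical, and characters that need more than four hexadecimal digits are rejected because the convention above does not say how they should be written.

```python
def to_tiger_escapes(text):
    """Replace every character outside ISO-Latin-1 by a backslash-u escape."""
    out = []
    for ch in text:
        code = ord(ch)
        if code <= 0xFF:
            # Character is available in ISO-Latin-1: keep it unchanged.
            out.append(ch)
        elif code <= 0xFFFF:
            # Four lower-case hexadecimal digits, as in the Omega example.
            out.append("\\u%04x" % code)
        else:
            raise ValueError("character needs more than four hex digits: %r" % ch)
    return "".join(out)

# Greek capital letter Omega (hexadecimal 03a9) is written as \u03a9.
print(to_tiger_escapes("\u03a9"))
```

The result contains ISO-Latin-1 characters only, so it can be saved in that encoding.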

general () filter

This filter should work with bracketing-style corpora that use parentheses for structuring. It generates cat, pos and word features.

general [] filter

This filter should work with bracketing-style corpora that use square brackets for structuring. It generates cat, pos and word features.

general PennTreebank filter

This filter should work with UPenn-style corpora. Syntactic functions are modelled as edge labels, traces are modelled as secondary edges.

This filter has been tested for the Wall Street Journal and Brown Corpus (Penn Treebank - bracketing version in mrg/ subdirectory), the Penn-Helsinki Parsed Corpus of Middle English, and the Chinese Treebank.

The Chinese Treebank has to be converted to the mentioned Unicode encoding first. The command line tool native2ascii can be used for this purpose. It is included in Sun's Java Development Kit which you can download at http://java.sun.com. For the Chinese treebank, the command line is the following:

native2ascii -encoding GB2312 chinese.txt unicodeoutput.txt

ATIS corpus filter

This is a special filter for the ATIS corpus format only (Penn Treebank - bracketing version in mrg/ subdirectory). It handles the different pos and word notation and other corpus-specific differences.

SWITCHBOARD corpus filter

This is a special filter for the SWITCHBOARD corpus format only (Penn Treebank - bracketing version in mrg/ subdirectory). It skips code and interjection sections.

Korean treebank filter

This is a special filter for the Korean treebank corpus format only. The Korean Treebank has to be converted to the mentioned Unicode encoding first. The command line tool native2ascii can be used for this purpose. It is included in Sun's Java Development Kit which you can download at http://java.sun.com. For the Korean treebank, the command line is the following:

native2ascii -encoding KSC5601 korean.txt unicodeoutput.txt

Treebanks converted by negra-topenn

This is a special filter for corpora that have been generated using the negra-topenn command line tool. This tool is part of the Negra Corpus deliverable. It has been developed to linguistically transform treebanks which have been annotated according to the Negra annotation scheme to the UPenn style format.

Susanne corpus filter

This is a special filter for the Susanne corpus format only.

Christine corpus filter

This is a special filter for the Christine corpus format only.

general Negra format filter

This filter should work with any corpus encoded according to the Negra format, Version 3 or Version 4. It has been tested for the Negra Corpus, the Negra 2000 Corpus, the VerbMobil Treebank, and the TIGER Corpus Release 1.

LoPar format filter

LoPar is an implementation of a parser for head-lexicalised probabilistic context-free grammars. Grammars are currently available for German and English. LoPar has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/LoPar-en.html).

The LoPar format filter is able to process a special output format of LoPar. This output can be generated using the following LoPar command line:

cat input.txt | lopar -in <model> -stems -heads -viterbi -viterbi-probs -tgrep > output.txt

The input file must be a text file in one word per line format.
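
A text can be brought into one word per line format with a sketch like the following. It splits at whitespace only, which is a simplifying assumption: real tokenization (e.g. separating punctuation from words) is not handled, and the function name to_one_word_per_line is hypothetical.

```python
def to_one_word_per_line(text):
    """Write each whitespace-separated token of text on a line of its own."""
    return "".join(token + "\n" for token in text.split())

# Tokens must already be separated by whitespace in the input.
print(to_one_word_per_line("Das ist ein Test ."), end="")
```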

TreeTagger chunking filter

The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It also comprises a chunker that is based on the tagging output. Chunking modules are currently available for German and English. The TreeTagger has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/corplex/). The TreeTagger chunking filter is able to process the XML output of the chunker.

YAC format filter

The chunker YAC (Yet Another Chunker) is a rule-based chunker for German. It has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/corplex/). The YAC format filter is able to process the XML-based YAC output.

DEREKO format filter

This filter should work with corpora encoded according to the DEREKO corpus format. The DEREKO corpus format has been developed within the DEREKO project.

3.6 Corpus conversion only

If you just want to convert a corpus to the TIGER-XML format without subsequent corpus indexing, you can use the corpus conversion feature. Choose the Convert Corpus item in the Corpus menu. The corpus conversion window pops up. Specify the conversion parameters explained in subsection 3.3 and press the Start button to start the conversion. The conversion process can be stopped at any time.

4. Changing corpus properties

4.1 Introduction

To change the properties of a corpus, please mark the corpus symbol and press the Corpus Properties button in the button toolbar. The corpus properties window pops up. Now you can change the corpus meta information (cf. subsection 4.2), specify a type system (cf. subsection 4.3), include corpus bookmarks (cf. subsection 4.4), or link predefined corpus templates (cf. subsection 4.5).

4.2 Corpus meta information

To edit the meta information of the corpus (which is displayed by the TIGERSearch and TIGERRegistry GUI), select the Meta tab in the corpus properties window (cf. screenshot below). Now you can edit the corpus meta information except the ID which cannot be changed after corpus creation:

4.3 Feature types

To specify a type hierarchy for a feature, you must first create an XML file comprising the hierarchy definition. The concept of feature types and its XML representation are described in section 8, chapter III. After creating the XML file, you have to register it. Please select the Types tab in the corpus properties window (cf. screenshot above). Now select the feature and type in the path to your file (absolute or relative to the corpus path). In the example, a relative link to the file tigerstts.xml in the corpus directory is specified:

If there are any problems loading a type hierarchy in the TIGERSearch tool (e.g. if there is a feature value in the corpus that is not used in the hierarchy, or if there is a feature value used in the hierarchy that is unknown to the corpus), all warnings are collected and can be inspected within the TIGERSearch corpus documentation tab (cf. subsection 2.2, chapter IV).

4.4 Corpus bookmarks

One helpful feature of the TIGERSearch tool is the management of bookmarks (cf. subsection 2.3, chapter IV). Users can store their favourite bookmarks for later inspection or reuse. The so-called corpus bookmarks file can be linked to a corpus so that all users of the corpus have access to it.

To link a corpus bookmarks file to a corpus, select the Bookmarks tab in the corpus properties window. Type in the path to your bookmarks file (absolute or relative to the corpus path). In the example, a relative link to the file tigercorpus.xml in the corpus directory is specified:

4.5 Corpus templates

Template definitions (cf. section 9, chapter III for an introduction) are stored in files, and template files are organized in directories. In order to link a template collection to a corpus, you have to specify the root directory of your collection. Select the Templates tab in the corpus properties window and type in the path to the templates root directory (absolute or relative to the corpus path). In the example, a relative link to the directory templates/ is specified:

If there are any problems loading the templates in the TIGERSearch tool (e.g. if one of the templates is not well-formed), all warnings are collected and can be inspected within the TIGERSearch corpus documentation tab (cf. subsection 2.2, chapter IV).

VI. The TIGERRegistry administration tool

1. An introduction to TIGERRegistry

1.1 Corpus administration