3.2 Indexing of TIGER-XML files

If your corpus source is encoded in TIGER-XML format, select the option Corpus is in TIGER-XML format at the top of the window. Selecting this option deactivates some parameters of the window:

Figure: Indexing parameters (TIGER-XML corpus input)

Now you have to specify the following parameters:

Corpus ID

The corpus ID is used by the TIGERSearch software suite to realize corpus-dependent configurations. The corpus ID must be unique with regard to all other indexed corpora. The uniqueness is checked before the indexing process is initiated. The ID has to start with a letter.
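
The two ID rules stated above can be sketched as a small check. This is only an illustration and not the code TIGERRegistry runs: the function name check_corpus_id and the existing_ids collection are hypothetical, and whether the real check is case-sensitive or accepts non-ASCII letters is not stated in this manual.

```python
def check_corpus_id(corpus_id, existing_ids):
    """Return a list of problems with a proposed corpus ID (empty if none).

    Hypothetical helper: it mirrors the two rules from the manual,
    not the actual TIGERRegistry implementation.
    """
    problems = []
    # Rule 1: the ID has to start with a letter.
    if not corpus_id or not corpus_id[0].isalpha():
        problems.append("ID must start with a letter")
    # Rule 2: the ID must be unique among all indexed corpora.
    if corpus_id in existing_ids:
        problems.append("ID is already used by another indexed corpus")
    return problems


if __name__ == "__main__":
    indexed = {"negra", "tiger"}  # assumed IDs of already indexed corpora
    for candidate in ("mycorpus", "2ndcorpus", "tiger"):
        print(candidate, check_corpus_id(candidate, indexed))
```

An empty result means the ID passes both rules; otherwise each violated rule is reported separately.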

TIGER-XML file

The source file can be an uncompressed .xml file, a compressed .xml.gz file, or a .zip file that contains exactly one source file. Relative paths are resolved against the working directory. Compressed files are automatically decompressed during the indexing process.
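
To check beforehand that a source file matches one of the accepted forms, the rules above can be approximated outside the tool. The following is a minimal sketch using only the Python standard library; the function name read_tiger_xml is hypothetical, and the way TIGERRegistry itself detects and decompresses files may differ.

```python
import gzip
import zipfile
from pathlib import Path


def read_tiger_xml(path):
    """Return the raw bytes of a TIGER-XML source file, decompressing if needed.

    Hypothetical helper that follows the file rules described in the manual.
    """
    path = Path(path)  # a relative path is resolved against the working directory
    name = path.name.lower()
    if name.endswith(".xml.gz"):
        # gzip-compressed source file
        with gzip.open(path, "rb") as handle:
            return handle.read()
    if name.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            # Ignore directory entries; only real files count as sources.
            members = [m for m in archive.namelist() if not m.endswith("/")]
            if len(members) != 1:
                raise ValueError("zip archive must contain exactly one source file")
            return archive.read(members[0])
    if name.endswith(".xml"):
        # uncompressed source file
        return path.read_bytes()
    raise ValueError(f"unsupported file type: {path.name}")
```

A call such as read_tiger_xml("corpus.xml.gz") returns the decompressed XML bytes, while a .zip archive with several entries raises a ValueError instead of guessing which entry is meant.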

Extended indexing

If extended indexing is activated, additional corpus information is retrieved during the indexing process. This information is used to improve the efficiency of corpus query processing. Efficiency increases by about 50%, at the expense of the main memory requirement, which also increases by about 50%.

Please note: Default indexing requires a constant amount of main memory (about 128 MB). The main memory requirement of the extended indexing process depends on corpus size. If an out-of-memory warning is displayed, please modify the main memory configuration of the TIGERRegistry tool (cf. section 4, chapter II).

After specifying the indexing parameters, you can start the indexing process by pressing the Start button. The corpus indexing can be stopped at any time. The current progress of the indexing is displayed by the indexing progress window:

Figure: Indexing progress window

The progress window also shows how many warnings and errors occurred during corpus indexing. These messages are stored in the corpus log file indexing.log, which is placed in the corpus directory. Subsection 3.4 describes how to view this log file within the TIGERRegistry application.

When the indexing process is finished, the Corpus properties window pops up (see screenshot below). Here you can specify meta information about the corpus such as the corpus name. The corpus properties window is explained in detail in section 4.

Figure: Corpus properties window

After specifying the corpus properties, press the OK button to finish corpus indexing. The new corpus now appears in the corpus tree.