3.3 Indexing of non TIGER-XML files

Indexing of non TIGER-XML files consists of two steps: first, the corpus is converted to TIGER-XML; then the generated TIGER-XML corpus is indexed. To index such a file, choose the Corpus is in Other Format option at the top of the window and specify the following parameters (cf. screenshot):

Corpus ID

The TIGERSearch software uses the corpus ID to manage corpus-specific configurations. The corpus ID must be unique among all indexed corpora; uniqueness is checked before the indexing process starts. The ID has to start with a letter.
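
The two constraints on the corpus ID can be illustrated with a small check. This is only a sketch, not the check performed by TIGERSearch itself: the function name check_corpus_id is hypothetical, and it assumes that IDs are compared case-sensitively and that any Unicode letter counts as a letter, neither of which is stated above.

```python
def check_corpus_id(corpus_id, existing_ids):
    """Return a list of problems with a proposed corpus ID (empty if none).

    Hypothetical helper: it only mirrors the two rules stated in the text.
    """
    problems = []
    # Rule 1: the ID has to start with a letter
    # (assumption: str.isalpha() decides what counts as a letter).
    if not corpus_id or not corpus_id[0].isalpha():
        problems.append("ID has to start with a letter")
    # Rule 2: the ID must be unique among all indexed corpora
    # (assumption: the comparison is case-sensitive).
    if corpus_id in existing_ids:
        problems.append("ID is already used by an indexed corpus")
    return problems


if __name__ == "__main__":
    indexed = {"negra"}
    for candidate in ("tiger1", "1tiger", "negra"):
        print(candidate, check_corpus_id(candidate, indexed))
```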

Import file

The source file can be an uncompressed file, a compressed .gz file, or a zip file that contains exactly one source file. Relative paths are evaluated with regard to the working directory. Compressed files are automatically decompressed during the indexing process.
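
To make the three accepted file types concrete, the following sketch shows how a script could open such a source file itself, for example to inspect a corpus before importing it. It is not the code used by TIGERRegistry: the function name open_import_file and the working_dir parameter are hypothetical, and it assumes that the file type can be recognized by the extensions .gz and .zip.

```python
import gzip
import zipfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_import_file(path, working_dir="."):
    """Yield a binary stream for a plain, .gz, or single-member .zip file."""
    path = Path(path)
    if not path.is_absolute():
        # Relative paths are evaluated with regard to the working directory.
        path = Path(working_dir) / path
    suffix = path.suffix.lower()
    if suffix == ".gz":
        with gzip.open(path, "rb") as stream:
            yield stream
    elif suffix == ".zip":
        with zipfile.ZipFile(path) as archive:
            # Ignore directory entries; exactly one source file must remain.
            members = [m for m in archive.namelist() if not m.endswith("/")]
            if len(members) != 1:
                raise ValueError("zip file must contain exactly one source file")
            with archive.open(members[0]) as stream:
                yield stream
    else:
        with open(path, "rb") as stream:
            yield stream
```

A caller would use it as `with open_import_file("corpus.zip") as stream: ...`, where corpus.zip is a placeholder file name; the decompression is transparent in all three cases.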

Import filter

Select one of the corpus format filters (cf. screenshot above). For a list of all implemented filters see subsection 3.5.

Convert corpus graphs

You can either convert and index the whole corpus or the first n graphs of the corpus.

Temporary TIGER-XML file

Because the first step of the indexing is the conversion to TIGER-XML, a temporary TIGER-XML file is needed. When you type in the corpus ID, the system automatically generates a file name (cf. checkbox Default name for XML file). You can also specify a different file name. Please note that relative paths are evaluated with regard to the working directory.

TIGER-XML file parameters

As the TIGER-XML file is a temporary file, it makes sense to compress it. You can enforce GZIP compression by checking the GZip XML file box. When the indexing process is finished, the temporary file is automatically deleted. If you want to keep the file (e.g. for debugging purposes), just uncheck the Delete after indexing box.
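
The interplay of the two checkboxes can be sketched as follows. This is an illustration under stated assumptions, not the TIGERRegistry implementation: the function names, the index callback, and the .gz suffix appended to the compressed file are all hypothetical, and the sketch says nothing about what the tool does with the file if indexing is aborted.

```python
import gzip
import os


def write_temporary_xml(xml_text, file_name, use_gzip=True):
    """Write the converted corpus to a temporary file, optionally gzipped."""
    # Assumption: the compressed file gets a .gz suffix.
    target = file_name + ".gz" if use_gzip else file_name
    opener = gzip.open if use_gzip else open
    with opener(target, "wt", encoding="utf-8") as out:
        out.write(xml_text)
    return target


def index_with_temporary_file(xml_text, file_name, index,
                              use_gzip=True, delete_after=True):
    """Write the temporary file, run the index callback, then clean up."""
    target = write_temporary_xml(xml_text, file_name, use_gzip)
    index(target)  # hypothetical callback standing in for the indexing step
    if delete_after:
        # Corresponds to leaving the Delete after indexing box checked.
        os.remove(target)
    return target
```

Calling index_with_temporary_file with delete_after=False corresponds to unchecking the Delete after indexing box: the file remains on disk and can be inspected afterwards.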

You can also make use of a so-called external header, i.e. a TIGER-XML document header which is stored in a separate file. To use this external header check the External Header box and type in the path of the header file (relative paths are evaluated with regard to the working directory).

Extended indexing

If extended indexing is activated, additional corpus information is retrieved during the indexing process. This information is used to improve the efficiency of corpus query processing. Efficiency increases by about 50%, at the expense of the main memory requirement, which also increases by about 50%.

Please note: Default indexing requires a constant amount of main memory (about 128 MB). The main memory requirement of the extended indexing process will depend on corpus size. If an out of memory warning is displayed, please modify the main memory configuration of the TIGERRegistry tool (cf. section 4, chapter II).

To start the indexing process press the Start button. The corpus conversion and indexing can be stopped at any time. The current progress is displayed by the Converting & Indexing progress window:

Figure: Conversion and Indexing progress window

The progress window also shows how many warnings and errors occurred during corpus conversion and indexing. These messages are stored in the corpus log files conversion.log and indexing.log, which are placed in the corpus directory. In subsection 3.4 we describe how to view these log files within the TIGERRegistry application.

When the indexing process is finished, a corpus properties window pops up (cf. screenshot below). Here you can fill in meta information about the corpus such as the corpus name. The corpus properties window is explained in section 4.

Figure: Corpus properties window

After specifying the corpus properties, just click the OK button to finish corpus indexing. Now the new corpus can be found in the corpus tree.