Another important task of text preprocessing is the handling of abbreviations. They are a major challenge for the text preprocessing component of a TTS system, both because they occur so frequently and because there are no binding rules for how they are formed. Very often the expansion of an abbreviation is ambiguous and can only be resolved using the semantic or pragmatic context of the sentence.
- Duden-style abbreviations
- Abbreviations following the classification scheme given in the ``Duden'' are recognized in the function ``german_token_to_words'' and expanded in the function ``ger_lookup_comb_abbr''. The pronunciations of the resulting words are then looked up separately. If no listing is found, they are spelled.
- Units of measurement
- For a unit to be recognized, the preceding token must be a number and the abbreviation must be found in ``ger_masseinheit_teststring'' or ``ger_masseinheit_teststring2''. If so, the unit is converted on the basis of the information in ``ger_abbr_masseinheiten_dim_tab'' and ``ger_abbr_masseinheiten_tab''.
- Abbreviations of length 1
- Tokens consisting of a single letter are always treated as abbreviations. They are not expanded, because a single letter usually has many different possible meanings.
- Abbreviations consisting of consonants only
- Tokens consisting of consonants only are always abbreviations, because they are not pronounceable in German. The abbreviation is looked up in the abbreviation tables by ``ger_translate_abbr''. If no listing is found, the abbreviation is spelled.
- Abbreviations consisting of capital letters only
- If a token consisting only of capital letters is found in one of the abbreviation tables, it is recognized as an abbreviation. Otherwise it is spoken like a normal word, because words are often written in capital letters for emphasis.
- Abbreviations followed by a period
- If a token has a period as its punctuation feature, it is looked up in the appropriate table. If it is found, it is expanded and the period is deleted from the punctuation feature. Otherwise the period is assumed to mark the end of a sentence.
- Ambiguous tokens
- Abbreviations that also appear as regular words pose a special problem. For example, ``Art.'' may be the abbreviation for ``Artikel'' or the word ``Art'' at the end of a sentence. Resolving this would require taking the context of such abbreviations into account, which is not yet implemented.
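The rules above can be read as a cascade of checks applied to each token. The following Python sketch illustrates that reading only; it is not the Festival implementation. The tables `ABBR_TABLE` and `UNIT_TABLE`, the function `expand_token`, the treatment of `y` as a vowel, and the order of the checks are all assumptions made for illustration. Duden-style combined abbreviations are left out.

```python
import re

# Hypothetical miniature tables; the real tables are not shown in the text.
ABBR_TABLE = {
    "Nr": ["Nummer"],
    "Art": ["Artikel"],
    "EU": ["Europäische", "Union"],
}
UNIT_TABLE = {"km": "Kilometer", "kg": "Kilogramm"}

VOWELS = "aeiouyäöü"  # assumption: y counts as a vowel here


def spell(token):
    """Spell a token letter by letter."""
    return list(token)


def is_number(token):
    """True for digit strings such as '5' or '3,5' (German decimal comma)."""
    return token is not None and re.fullmatch(r"\d+(,\d+)?", token) is not None


def expand_token(token, prev_token=None, has_period=False):
    """Return (words, period_kept) for a single token.

    The order of the checks is an assumption of this sketch.
    """
    # Unit: only after a number, and only if listed in the unit table.
    if is_number(prev_token) and token in UNIT_TABLE:
        return [UNIT_TABLE[token]], has_period
    # Period rule: if listed, expand and drop the period;
    # otherwise the period is kept as a sentence end.
    if has_period and token in ABBR_TABLE:
        return ABBR_TABLE[token], False
    # Single letter: an abbreviation, but left unexpanded.
    if len(token) == 1:
        return [token], has_period
    # Consonants only: look up, otherwise spell.
    if token.isalpha() and not any(c in VOWELS for c in token.lower()):
        return ABBR_TABLE.get(token, spell(token)), has_period
    # Capitals only: an abbreviation only if listed.
    if token.isupper() and token in ABBR_TABLE:
        return ABBR_TABLE[token], has_period
    # Everything else is spoken as a normal word.
    return [token], has_period


print(expand_token("km", prev_token="5"))    # (['Kilometer'], False)
print(expand_token("Art", has_period=True))  # (['Artikel'], False)
print(expand_token("BMW"))                   # (['B', 'M', 'W'], False)
```

Note that ``Art.'' is always expanded to ``Artikel'' in this sketch, even at the end of a sentence. This mirrors the limitation described under ambiguous tokens: without context, a table lookup alone cannot tell the two readings apart.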
Martin Barbisch