One of the most important tasks of text preprocessing is the expansion of numeric expressions. Sequences of digits occur in different contexts and are pronounced differently depending on the context. The following formats are distinguished:
- Fractions
- The numerator is converted with ``german_parse_cardinal'', the fraction bar is not spoken, and the denominator is converted by the function ``german_parse_fractal''. Unfortunately, year ranges written in the same way (e.g. WS 97/98) are currently pronounced incorrectly.
- Ratios
- The cardinals in a ratio are converted by ``german_parse_cardinal'' and ``zu'' is inserted between them (``3:5'' becomes ``drei zu fünf'').
- Phone numbers
- Phone numbers have the same format as fractions, except that the first number (the area code) begins with a zero. A hyphen may be used instead of the slash. Phone numbers are read digit by digit; the function ``german_parse_charlist'' is responsible for this conversion.
- Numeral compositions
- In numeral compositions such as ``Jäger90'' and ``16jährig'', the number is converted with ``german_parse_cardinal'' and the word part is prepended or appended to the expanded number.
- Years
- Years between 1100 and 1999 are spoken in the form ``fünfzehnhundertsiebenundsechzig'' (1567, English: ``fifteen hundred sixty-seven''). Because years cannot be reliably distinguished from cardinals, all cardinals in this range are spoken like years.
- Dates
- Dates are written DAY.MONTH.YEAR. YEAR may have two or four digits, or may be omitted entirely. The conversion is done with the help of ``german_parse_cardinal'' and ``german_parse_ordinal''. The ``german_ordinal_prediction_tree'' is used to determine the inflexional suffixes of ordinals.
- Time
- Times are written HOURS.MINUTES or HOURS:MINUTES, followed by ``Uhr'' or ``h'', and are spoken as HOURS ``Uhr'' MINUTES. Hours and minutes are converted by ``german_parse_cardinal''. The word ``Uhr'' must not be spoken twice, once after the hours and once more after the minutes. Therefore, for each token it is checked whether the preceding token already belongs to a time format.
- Currencies
- Currency amounts are written as CARDINAL,CARDINAL followed by a unit (e.g. ``15,60 DM''; ``7,89 sfr''; ...). They are pronounced by inserting the unit between the two cardinals. The numbers are converted using ``german_parse_cardinal'', and ``german_fetch_currency'' looks up the unit. Again, care must be taken that the unit is not spoken twice.
- Floating point numbers
- All sequences of digits that contain a comma and have not been handled by the preceding rules are converted as floating point (i.e., ``floating comma'') numbers. The digits to the left of the comma are converted with ``german_parse_cardinal'', then the comma is pronounced, and finally the digits to the right of the comma are read one by one with the help of ``german_parse_charlist''.
- Ordinals
- Ordinals are cardinals followed by a period. Thus, we have to distinguish between ordinals and cardinals at the end of a sentence. For this task we use a list of words that can only appear capitalized at the beginning of a sentence. The inflexional suffixes are determined with the help of the ``german_ordinal_prediction_tree''. The ordinal is expanded by ``german_parse_ordinal''.
- Cardinals
- If cardinals are grouped by periods or blanks into blocks of three digits for legibility, the blocks are joined into a single unbroken sequence of digits, which is then expanded with the help of ``german_parse_cardinal''.
- Roman numerals
- Roman numerals are converted into Arabic numbers with ``ger_tok_roman_to_numstring'' and then expanded like ordinals. If the Roman numeral is preceded by the name of a king, queen, emperor or empress, the article ``der''/``die'' has to be inserted between the name and the number.
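
Several of the rules above can be illustrated with a small sketch. The following Python code is not the system's implementation; all function names (``cardinal'', ``year_or_cardinal'', ``digits'', ``floating'', ``ratio'', ``ungroup'', ``roman_to_int'') are hypothetical stand-ins chosen for this illustration, and the cardinal conversion is assumed to cover only 0 to 9999. Ordinals, dates, times and currencies are left out.

```python
import re

# Hypothetical helper names; this is an illustrative sketch, not the
# system's own code. Cardinals are limited to 0..9999 by assumption.
UNITS = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
         "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def _stem(n):
    """Unit word as used inside a compound: 'eins' shortens to 'ein'."""
    return "ein" if n == 1 else UNITS[n]


def below_100(n):
    """German reads units before tens: 67 -> 'siebenundsechzig'."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] if unit == 0 else _stem(unit) + "und" + TENS[tens]


def cardinal(n):
    """German cardinal for 0..9999."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n < 100:
        return below_100(n)
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    words = ""
    if thousands:
        words += _stem(thousands) + "tausend"
    if hundreds:
        words += _stem(hundreds) + "hundert"
    if rest:
        words += below_100(rest)
    return words


def year_or_cardinal(n):
    """Numbers from 1100 to 1999 are read as years (in hundreds)."""
    if 1100 <= n <= 1999:
        century, rest = divmod(n, 100)
        return UNITS[century] + "hundert" + (below_100(rest) if rest else "")
    return cardinal(n)


def digits(text):
    """Read ASCII digits one by one, skipping separators like '/' or '-'."""
    return " ".join(UNITS[int(ch)] for ch in text if ch in "0123456789")


def floating(token):
    """'3,14': cardinal, then 'Komma', then the fraction digit by digit."""
    whole, frac = token.split(",")
    return cardinal(int(whole)) + " Komma " + digits(frac)


def ratio(token):
    """'3:5': two cardinals joined by 'zu'."""
    left, right = token.split(":")
    return cardinal(int(left)) + " zu " + cardinal(int(right))


def ungroup(token):
    """Join blocks of three digits separated by periods or blanks."""
    if re.fullmatch(r"\d{1,3}(?:[. ]\d{3})+", token):
        return re.sub(r"[. ]", "", token)
    return token


def roman_to_int(numeral):
    """Subtract a symbol when a larger one follows it, otherwise add it."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total
```

With these definitions, ``year_or_cardinal(1567)`` yields ``fünfzehnhundertsiebenundsechzig``, whereas ``cardinal(1567)`` yields ``eintausendfünfhundertsiebenundsechzig``; ``ratio("3:5")`` gives ``drei zu fünf`` and ``floating("3,14")`` gives ``drei Komma eins vier``. A full system would additionally need the token classification that decides which of these readings applies.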
Martin Barbisch