See also: Festival im Emacs (in German).

Starting Festival

You can start Festival on the Linux machines in the student pool by opening a terminal and typing

festival

in the terminal. This starts Festival in interactive mode, i.e. Festival is ready to accept commands.
An alternative is to pass text to Festival via standard input, e.g.

echo "Hallo" | festival --tts

If you want umlauts to be synthesized correctly, it is important that you change the encoding to ISO-8859-1 in your terminal settings.

An example session

Start Festival in a terminal and enter the following command at the prompt:

festival> (SayText "Hallo")
#<Utterance 0x407e33e8>

The command "SayText" returns an "utterance"; you can see its internal ID in the second line.

<<< Note: depending on your Festival init file, some voice is the default voice. In interactive mode, you can switch from this voice to another one by calling its voice command, e.g. (voice_german_de4_linginto), the voice mentioned in the module overview below.


You can check which voices are available by starting to type (voice and then using the <tab> key to see possible expansions.

The variable current-voice holds the name of the currently loaded voice. You can display its value by simply typing current-voice at the festival> prompt.


end of Note. >>>

Have a look at the generated utterance structure (in interactive mode, i.e. when you can see the
festival> prompt):

First, save the utterance in a variable called utt:

(set! utt (SayText "Hallo"))

Several "relations" are created during synthesis. Each relation joins the elements that belong to it in a list. Some of these lists are flat; others are hierarchical, i.e. they contain nested lists. Relations can also be linked indirectly because they share elements.

Have a look at which relations are present in your utterance structure:

festival> (utt.relationnames utt)

Inspect single relations:

festival> (utt.relation_tree utt 'Word)
((("Hallo" ((id "_2") (name "Hallo") (pbreak "NB") (pos "ITJ")))))

The "Word" relation in this case contains just one element, viz. the word "Hallo". This element has a name (first position in the list) followed by its attributes as attribute-value pairs. In the example, "Hallo" has an ID; its name is listed again as an attribute; pbreak states whether a phrase break follows the word (here it does not: "NB" stands for "no break"); and pos holds the part-of-speech tag, "ITJ" for interjection. The name therefore appears twice: once in the attribute list and once as the "reference" for the list element itself.
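
The attributes shown in the tree can also be read one at a time. The following is a minimal sketch, assuming that utt still holds the utterance for "Hallo" and that your Festival version provides the usual item accessors utt.relation.items, item.name and item.feat:

```scheme
;; Collect name, phrase break and part-of-speech tag
;; for every item in the Word relation
(mapcar
 (lambda (w)
   (list (item.name w)
         (item.feat w "pbreak")
         (item.feat w "pos")))
 (utt.relation.items utt 'Word))
```

For the utterance above this should return a single entry for "Hallo" with the values NB and ITJ.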

festival> (utt.relation_tree utt 'Token)
((("Hallo"
   ((id "_1") (name "Hallo") (whitespace "") (prepunctuation "")))
  (("Hallo" ((id "_2") (name "Hallo") (pbreak "NB") (pos "ITJ"))))))

The "Token" relation also contains just one element, but it's a nested one. Its name is "Hallo" as well. It is noted that there was no whitespace preceding "Hallo" in the text, and that there was no prepunctuation (such as opening brackets, quotes, ...). Below this Token there is the word "Hallo", which is identical to the element in the Word relation above (check the IDs).

The difference between the two levels in the Token relation can be seen for compounds with "-" or for abbreviations:

festival> (set! utt (SayText "Halli-hallo"))
#<Utterance 0x407865a8>
festival> (utt.relation_tree utt 'Token)
((("Halli-hallo"
   ((id "_1")
    (name "Halli-hallo")
    (whitespace "")
    (prepunctuation "")
    (token_pos "comb_abbr")))
  (("Halli" ((id "_2") (name "Halli") (pbreak "NB") (pos "nil"))))
  (("hallo" ((id "_3") (name "hallo") (pbreak "NB") (pos "ITJ"))))))

Now there are two words connected to the "Halli-hallo" token, "Halli" and "hallo". Similarly for abbreviations: we have the expanded words in the word relation, and the abbreviation as the corresponding token.

festival> (set! utt (SayText "Hallo usw."))
#<Utterance 0x407865a8>
festival> (utt.relation_tree utt 'Token)
((("Hallo"
   ((id "_1") (name "Hallo") (whitespace "") (prepunctuation "")))
  (("Hallo" ((id "_3") (name "Hallo") (pbreak "NB") (pos "ITJ")))))
 (("usw"
   ((id "_2")
    (name "usw")
    (punc ".")
    (whitespace " ")
    (prepunctuation "")))
  (("und" ((id "_4") (name "und") (pbreak "NB") (pos "KO"))))
  (("so" ((id "_5") (name "so") (pbreak "NB") (pos "ADV"))))
  (("weiter" ((id "_6") (name "weiter") (pbreak "BB") (pos "ADJ"))))))
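
The link between words and tokens can also be followed from the Word side. This is a small sketch, assuming that utt holds the "Hallo usw." utterance. It relies on Festival's feature path syntax, in which R:Token switches to the Token relation and parent moves up one level:

```scheme
;; For each word, look up the name of the token it was derived from
(mapcar
 (lambda (w)
   (list (item.name w)
         (item.feat w "R:Token.parent.name")))
 (utt.relation.items utt 'Word))
```

Given the tree shown above, "Hallo" should be paired with the token "Hallo", and "und", "so" and "weiter" with the token "usw".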

The SylStructure relation is another example of a hierarchical relation. It links the Word level, the Syllable level, and the Segment (phone) level.

festival> (set! utt (SayText "Hallo"))
#<Utterance 0x407865a8>
festival> (utt.relation_tree utt 'SylStructure)
((("Hallo" ((id "_2") (name "Hallo") (pbreak "NB") (pos "ITJ")))
  (("ha" ((id "_4") (name "ha") (stress 0)))
   (("h"
     ((id "_5")
      (name "h")
      (dur_factor -0.00021387900051195)
      (end 0.43829074501991))))
   (("a"
     ((id "_6")
      (name "a")
      (dur_factor 0.00046763601130806)
      (end 0.558285176754)))))
  (("lo:" ((id "_7") (name "lo:") (stress 1)))
   (("l"
     ((id "_8")
      (name "l")
      (dur_factor -0.00017604799359106)
      (end 0.60260915756226))))
   (("o:"
     ((id "_9")
      (name "o:")
      (dur_factor -7.5191303039901e-05)
      (end 0.70340883731842)))))))
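
The hierarchy can also be queried from the bottom up with feature paths. The sketch below assumes that utt holds the "Hallo" utterance; it walks the flat Segment relation and looks up the syllable and the word above each segment:

```scheme
;; For each segment: its name, the syllable it belongs to,
;; and the word above that syllable
(mapcar
 (lambda (seg)
   (list (item.name seg)
         (item.feat seg "R:SylStructure.parent.name")
         (item.feat seg "R:SylStructure.parent.parent.name")))
 (utt.relation.items utt 'Segment))
```

Silence segments are not part of SylStructure, so for them the path cannot be resolved and Festival returns a default value instead of a name.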

When synthesizing, utterances are passed through a series of modules. In many cases modules create new relations; sometimes they just add to existing relations or modify them. For instance, the syllable structure is created during lexicon lookup (module "Word"), but the duration info for the segments or phones (the end and dur_factor features in the above example) is added later by the Duration module.

The "UttType" of an utterance determines which modules the utterance is passed through and in which order. When using SayText, the UttType is always "Text". It is "Phones" if you use the command "SayPhones". You can display the defined UttTypes and the corresponding modules by typing "UttTypes" in interactive mode:

festival> UttTypes

This returns a list of lists, one per UttType. In each list the first element is the UttType, and the remaining elements specify the modules that Festival will use when synthesizing input of this type. For the Text type, this is the sequence of modules:

  (Initialize utt)
  (Text utt)
  (Token_POS utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Pauses utt)
  (Intonation utt)
  (PostLex utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt))
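
Since UttTypes is an ordinary Scheme variable holding a list of lists, the module sequence for a single type can be picked out with standard list functions. This is a sketch, assuming the entries are keyed by the symbols Text, Phones and so on, as in the output described above:

```scheme
;; Look up the entry for the Text type and drop the leading type name,
;; leaving only the list of module calls
(cdr (assoc 'Text UttTypes))
```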

The command

(set! utt1 (SayText "Hallo"))

is equivalent to

(set! utt1 (Utterance Text "Hallo"))
(utt.synth utt1)
(utt.play utt1)

(In the first line, we save the utterance with uttType Text in a variable utt1, then we synthesize and play this utterance.)

And (utt.synth utt) for UttType Text is equivalent to the sequence of commands (modules) specified in the UttTypes variable above:

(Initialize utt)
(Text utt)
(Token_POS utt)
(Token utt)
(POS utt)
(Phrasify utt)
(Word utt)
(Pauses utt)
(Intonation utt)
(PostLex utt)
(Duration utt)
(Int_Targets utt)
(Wave_Synth utt)

This means we can execute single modules by just typing the corresponding command with the utterance as an argument. This way, we can examine step by step which module affects the utterance structure in which way, by calling the modules one by one and examining the resulting utterance in between.
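
A minimal sketch of such a step-by-step session, assuming the Text module sequence listed above. The variable name utt2 is chosen only for this example:

```scheme
;; Create the utterance without synthesizing it
(set! utt2 (Utterance Text "Hallo usw."))

;; Run the first modules by hand, in the order given in UttTypes
(Initialize utt2)
(Text utt2)
(utt.relationnames utt2)         ; relations present after tokenization

(Token_POS utt2)
(Token utt2)
(utt.relationnames utt2)         ; the Word relation should now be listed
(utt.relation_tree utt2 'Token)  ; tokens with their words attached
```

The modules build on each other's results, so they should be called in the order in which they appear in the list.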

All modules in short:


Text

Segments text into single tokens.


Token_POS

Guesses the type of each token (e.g. ordinal number, fraction, abbreviation, date, time, year). This is particularly important for abbreviations and numbers, which are ambiguous and should be pronounced differently depending on their type: Token_POS disambiguates them using some context and gives them an unambiguous label for the token type.


Token

Adds the words to the Token relation and thereby creates the Word relation, e.g. by splitting compounds with "-" into several words and expanding abbreviations.


POS

The part-of-speech tagger, if present. In the above examples no tagger was used. A tagger is used for the voice (voice_german_de4_linginto), for instance. The tags then appear as attributes in the Word relation.


Phrasify

Determines the places where phrase boundaries should be inserted. The result can be seen in the pbreak attributes in the Word relation: B is for small breaks (corresponding to intermediate phrase boundaries in the ToBI framework), BB is for "big" breaks (corresponding to full intonation phrase boundaries in ToBI), and NB for no break.


Word

This is the lexicon lookup (extended by letter-to-sound rules for unknown words). It creates the hierarchical SylStructure relation and the flat relations Syllable and Segment. If there are no POS tags from the tagger, the word classes are taken from the lexicon.

Entries in the lexicon can be queried in this way:

festival> (lex.lookup "Hallo")
("hallo" ITJ (((h a) 0) ((l o:) 1)))

The phonemes are in SAMPA notation (for the IMS voices), each syllable is a list which contains a list of phonemes followed by 0 (unstressed) or 1 (stressed).

New entries can be added (possibly overwriting existing entries):

(lex.add.entry '("hallo" ITJ (((h a ) 1) ((l o:) 0))))
festival> (lex.lookup "Hallo")
("hallo" ITJ (((h a) 1) ((l o:) 0)))
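
Because a lexicon entry is an ordinary list, its parts can be taken apart with car and cdr. The helper below is only a sketch: stress-pattern is a name made up for this example, and it assumes that entries have the three-part shape shown above (word, part of speech, syllable list).

```scheme
;; Return the stress value of each syllable in the lexicon entry for WORD
(define (stress-pattern word)
  (mapcar
   (lambda (syl) (car (cdr syl)))          ; second element of a syllable: 0 or 1
   (car (cdr (cdr (lex.lookup word))))))   ; third element of the entry: syllable list

(stress-pattern "Hallo")
```

For the modified entry above this should give (1 0).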


Pauses

Inserts pauses at the places indicated by Phrasify. They are visible as silence segments ("_") in the Segment relation. Beware: silences are not linked in the SylStructure relation!


Intonation

Intonation prediction (just abstract labels, ToBI labels in case of most IMS voices). Results are written to the Intonation relation, in which each accented and each phrase-final syllable is linked to an intonation event with the corresponding label.

festival> (utt.relation_tree utt 'Intonation)
((("ha" ((id "_4") (name "ha") (stress 1)))
  (("H*L" ((id "_12") (name "H*L"))))))
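
Accent labels can also be read per syllable. The sketch below assumes that utt has been synthesized; it uses the feature path step daughter1 to reach the first intonation event below a syllable in the Intonation relation:

```scheme
;; Pair each syllable with the label of its first intonation event
(mapcar
 (lambda (syl)
   (list (item.name syl)
         (item.feat syl "R:Intonation.daughter1.name")))
 (utt.relation.items utt 'Syllable))
```

Syllables without an intonation event are not part of the Intonation relation, so for them the path does not resolve to a label.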


PostLex

Postlexical rules. They are of little importance in the examples above, but they make it possible to change the canonical pronunciation to what would be expected in fluent speech.


Duration

Predicts segment durations; check the "end" attributes in the Segment relation.
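
Since each segment only stores its end time, its duration is the difference between its own end and the end of the preceding segment. This is a sketch, assuming that utt has been synthesized and that the feature function segment_duration is available in your Festival installation:

```scheme
;; List each segment with its end time and its duration in seconds
(mapcar
 (lambda (seg)
   (list (item.name seg)
         (item.feat seg "end")
         (item.feat seg "segment_duration")))
 (utt.relation.items utt 'Segment))
```

With the end times from the SylStructure example above, "a" would last about 0.558 - 0.438 = 0.120 seconds.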


Int_Targets

Predicts concrete F0 targets for the labels predicted by the Intonation module. They can be seen in the Target relation (one or more targets on some segments; each target is specified by its F0 value and its position within the segment).


Wave_Synth

The actual synthesis. Nothing to see, just to hear ;-)

© AntjeSchweitzer, 18.10.16