 Automatic Alignment of Speech signals using the Aligner

 

 0. General

Speech signals can currently be annotated on the following levels using the Aligner tools:

Languages                 Level      Files produced
German, English, French   Phones     datei.phones, datei.phonemic, datei.phoneswithQ
German, English, French   Words      datei.words
German, English           Syllables  datei.syllables or datei.syls

On the phone and word levels, the alignment process can be very time-consuming for long speech signals.

1. Preparing your data

Two files are needed for the alignment: the speech signal (datei.sd or datei.wav) and a text file (datei.txt) that contains the orthographic transcription of the speech signal. The text file is not passed as an argument to the aligner, but it is expected in the same directory as the speech signal.
If there is no text file, the speech signal is played in an endless loop until the text has been typed into the terminal. This can be useful for short signals but is not practical for long signals.

Important:

  • The text file must not contain special characters. Umlauts and ß are allowed.
  • The text file must be encoded in ISO-8859-1.
  • The signal file and the text file must have identical basenames (e.g. datei.wav and datei.txt).
  • Sampling rate must be 16 kHz, and the file should be a mono channel file!
    Convert e.g. using sox -V datei.48kHz.wav -r 16000 -t wav datei.16kHz.wav remix -
    This command creates a new file datei.16kHz.wav with one channel and a sampling rate of 16 kHz from datei.48kHz.wav (48 kHz).
  • The files must not be symbolic links!
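Some of the requirements above can be checked before starting the aligner. The following is only a minimal sketch: check_aligner_inputs is a hypothetical helper, not part of the Aligner tools, and it only verifies that the signal file and the matching text file exist and are not symbolic links. It does not check the encoding or the sampling rate.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Aligner tools).
# Checks that the signal file and its .txt file with the same basename
# both exist as regular files and that neither is a symbolic link.
check_aligner_inputs() {
    signal=$1
    base=${signal%.*}      # strip the extension (.wav or .sd)
    text=$base.txt
    rc=0
    for f in "$signal" "$text"; do
        if [ -L "$f" ]; then
            echo "symbolic link, not allowed: $f"
            rc=1
        elif [ ! -f "$f" ]; then
            echo "missing file: $f"
            rc=1
        fi
    done
    return $rc
}
```

Call it as check_aligner_inputs datei.wav; it returns a non-zero status if something is wrong. If the transcription was written in UTF-8, one possible way to convert it is iconv -f UTF-8 -t ISO-8859-1 datei.utf8.txt > datei.txt (datei.utf8.txt is just an example name).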

 

2. Using the Aligner

If the language is not German, set the ALANG variable first. It should be eng for English and fra for French data, e.g.:

setenv ALANG fra
(if you use a tcsh shell)


or

export ALANG=fra
(if you use a bash shell, the default at IMS nowadays).

To switch back to German, say either unsetenv ALANG or setenv ALANG deu (in tcsh) or export ALANG=deu (in bash).

The aligner runs on all Linux machines.

The following commands produce the annotation files:

phoneme level:

Alignphones datei.sd

word level:

Alignwords datei.sd

syllable level ($-notation):

phonemic2syl datei.phonemic datei.syl

syllable level (syllable notation):

phonemic2syllables datei.phonemic datei.syllables

Note: the phonemes in the syllable names can deviate from the ones in the phone label file because the syllable names are based on the canonical transcription.
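The commands above can be combined for one signal file, for example as in the following sketch. align_all and the DRYRUN variable are hypothetical names introduced here, not Aligner features. The sketch assumes that datei.phonemic is written by the phone-level step (as listed in the table in section 0), which is why the syllable step comes last; whether the word-level step depends on the phone-level step is not stated here.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the Aligner tools).
# Runs the phone, word and syllable steps for one signal file.
# With DRYRUN set to a non-empty value, the commands are only printed.
align_all() {
    signal=$1
    base=${signal%.*}        # datei.wav -> datei
    run=${DRYRUN:+echo}      # "echo" in dry-run mode, empty otherwise
    $run Alignphones "$signal" &&
    $run Alignwords "$signal" &&
    $run phonemic2syllables "$base.phonemic" "$base.syllables"
}
```

For example, DRYRUN=1 align_all datei.wav only shows the three commands, and align_all datei.wav runs them. For non-German data, ALANG still has to be set beforehand as described above.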

 

3. Problems

The following warnings are harmless and can be ignored.

WARNING [-3132] ConvertHParseNetwork: Dict. would be empty: not written in HParse
WARNING [-6553]  LoadESPSLabels: time stamps out of order. in HLEd

If other problems occur, please check the following points before asking for help.

  • Is the signal file's sampling rate really 16 kHz?
    (e.g. sfinfo datei.wav, see Sampling Rate.)
  • Is the audio format correct? (wav if file ends in wav etc.)
    (check sfinfo datei.wav )
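If sfinfo is not at hand, the channel count and sampling rate of a WAV file can also be read directly from the file header. This is only a sketch under assumptions: wav_info and wav_ok are hypothetical helpers, the file is assumed to have a canonical WAV header (the "fmt " chunk directly after the RIFF header), and the machine is assumed to be little-endian. It does not work for .sd files.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the Aligner tools).
# In a canonical WAV header, the channel count is a 2-byte value at
# offset 22 and the sampling rate is a 4-byte value at offset 24.
wav_info() {
    channels=$(od -An -t u2 -j 22 -N 2 "$1" | tr -d '[:space:]')
    rate=$(od -An -t u4 -j 24 -N 4 "$1" | tr -d '[:space:]')
    echo "channels=$channels rate=$rate"
}

# Succeeds only for a mono file sampled at 16 kHz.
wav_ok() {
    [ "$(wav_info "$1")" = "channels=1 rate=16000" ]
}
```

Usage: wav_info datei.wav prints both values; wav_ok datei.wav || echo "please convert first" flags files that need the sox conversion from section 1.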

If the problems persist, please set the variable KEEPALIGNERTMPFILES:

setenv KEEPALIGNERTMPFILES
(for tcsh shell)

or

export KEEPALIGNERTMPFILES=1
(for bash shell).

This causes the temporary files not to be deleted. At the end of its run, the Aligner prints the directory in which they were stored. Please take the files transcribed, dateiname.htkwords and dateiname.net from that directory and send them to me (see section 5), together with dateiname.wav and dateiname.txt and, if possible, the complete output of the Aligner (i.e. the error message as well as the preceding output). Please unset the variable afterwards (unsetenv KEEPALIGNERTMPFILES in tcsh or export KEEPALIGNERTMPFILES="" in bash) and delete the temporary files in /tmp by hand!
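Collecting the files for such a report can be scripted, for example as in the following sketch. collect_aligner_report is a hypothetical helper, not part of the Aligner tools; the temporary directory has to be taken from the Aligner's output and passed in by hand, and the file names are assumed to follow the pattern described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Aligner tools).
# Usage: collect_aligner_report TMPDIR SIGNAL OUTDIR
# Copies transcribed, NAME.htkwords and NAME.net from TMPDIR, plus the
# signal file and its .txt file, into OUTDIR.
collect_aligner_report() {
    tmpdir=$1
    signal=$2
    out=$3
    base=${signal%.*}              # path without extension
    name=$(basename "$base")       # file name without directory
    rc=0
    mkdir -p "$out" || return 1
    for f in "$tmpdir/transcribed" "$tmpdir/$name.htkwords" \
             "$tmpdir/$name.net" "$signal" "$base.txt"; do
        if [ -f "$f" ]; then
            cp "$f" "$out/"
        else
            echo "not found: $f" >&2
            rc=1
        fi
    done
    return $rc
}
```

To keep the complete output of the Aligner as well, the run can be logged with something like Alignphones datei.wav 2>&1 | tee aligner.log in bash (aligner.log is just an example name).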

4. FAQ (source: Markus Fach)

Error:
HVite: ERROR [+8220]
InitPronHolders: Word ä not defined in dictionary

Help:
The .txt file contains special characters for which the aligner cannot generate acoustic models.

Way out: Check the text file and remove the offending characters. Typical culprits are specially encoded characters such as backquotes, typographic quotes and long dashes. The error can also occur for unknown words if the estimated pronunciation contains unknown phones.


Error:
Starting HParse...
HParse: ERROR [+3131]
FindNodeTypes: Different num WD_BEGIN (5063) & WD_END nodes (5046)

Help:
The aligner internally generates a grammar (.net), which is saved in /tmp. This error means that the grammar contains unequal numbers of WD_BEGIN and WD_END nodes, i.e. the formal structure of the .net file is wrong. There are various reasons why this can occur.

Way out: Interrupt the alignment process right after the error message. Then have a look at the .net file.


Error:
The labels do not align at all with the speech signal (this may look like an empty label file, depending on the software you use for inspection).

Help:
Probably the .wav file does not have 0 as its start time, whereas the label files always start at 0. If the labels are not visible in the software you use for displaying sound files, check the label times using an editor or the less command.

Way out: Set the start time of the speech signal to 0 (e.g. using praat).


5. Questions, problems, comments?

Mail to Antje Schweitzer