Indirect measurement of laryngeal behavior (inverse filtering)

Indirect measurements of the glottal waveform rely on inverse filtering (IF). This method lets the researcher estimate the glottal flow signal from the speech signal. It assumes that the source-filter model of speech production holds and that the effects of the vocal tract and lip radiation can be cancelled out by an inverse (whitening) filter. A short overview of the method follows, together with a discussion of its main weakness.

Inverse filtering is used only for voiced speech segments. Its main advantages are that the results can be scaled in flow units (cm³/s) and that it is noninvasive.

There are two techniques of inverse filtering (Javkin et al., 1987; Ladefoged et al., 1988).

For the first technique, the data are recorded with a reference-quality condenser microphone whose frequency response is flat from a very low frequency (even 0 Hz) up to 5 or 8 kHz. The advantage of this technique is its wide frequency response, which facilitates a detailed representation of the glottal flow signal. Its disadvantage is that the DC component is not registered, so the results cannot be calibrated and are only relative. Additionally, the whole recording channel must preserve the phase of the speech signal, and the recording conditions are subject to very strict requirements (Jackson et al., 1985; Karlsson, 1988).

For the second technique, the airflow is registered through a face mask (Rothenberg, 1973), which allows the DC flow component to be recorded and the measurements to be calibrated in physical units. In this technique the useful frequency response is flat (within ±3 dB) only from 0 Hz to about 1000-1500 Hz, which limits the accuracy with which the glottal pulse can be recovered. In particular, information about the abruptness of vocal fold closure is lost (Hanson, 1995:15). During recordings the mask must seal against the face perfectly, since a leak would seriously affect the measurement; this in turn restricts the stimuli to syllables produced with the jaw moderately open. Moreover, as Ladefoged et al. (1988) point out, this kind of additional apparatus is quite difficult to use in the field. In some experiments oral pressure is additionally recorded to derive the interpolated subglottal pressure (Rothenberg, 1981), which in turn may be used to estimate the size of the glottal area (Ishizaka & Flanagan, 1972; Titze & Talkin, 1979; Ananthapadmanabha & Fant, 1982). However, this is a complicated process that is prone to errors (Fant, 1980; Cranen & Boves, 1987), owing both to the limitations of the assumed speech production model and to the complexity of the methodology.

For the purpose of inverse filtering, the vocal tract is approximated as an acoustic tube of a given length composed of a number of sections with different cross-sectional areas. This is equivalent to modelling the sampled vocal tract transfer function H(z) as an all-pole function with a given number of spectral poles, which in the z-domain can be written as (Atal & Hanauer, 1971):

$$H(z) = \frac{G}{\prod_{i=1}^{p} \left(1 - z_i z^{-1}\right)} \qquad (4)$$

where G is a gain factor, p denotes the number of poles, and z_i the i-th pole.
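The link between a pole and the resonance it represents can be illustrated with a minimal Python sketch. It relies on the standard relations between a z-plane pole and its centre frequency (the pole angle) and bandwidth (the pole radius). The function names, the sampling rate, and the 500 Hz / 80 Hz example values are illustrative assumptions and are not taken from the cited literature.

```python
import numpy as np

def pole_to_formant(pole, fs):
    """Return (frequency_hz, bandwidth_hz) of one z-plane pole at sampling rate fs."""
    freq = np.angle(pole) * fs / (2.0 * np.pi)   # pole angle -> centre frequency
    bw = -np.log(np.abs(pole)) * fs / np.pi      # pole radius -> 3 dB bandwidth
    return freq, bw

def formant_to_pole(freq, bw, fs):
    """Inverse mapping: build the upper-half-plane pole of a resonance."""
    radius = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * freq / fs
    return radius * np.exp(1j * theta)

def denominator_from_poles(poles):
    """Expand prod(1 - z_i z^-1) into coefficients of z^0, z^-1, ..., z^-p."""
    return np.real(np.poly(poles))

# Illustrative values only (assumed, not from the text)
fs = 10000.0
pole = formant_to_pole(500.0, 80.0, fs)
freq, bw = pole_to_formant(pole, fs)
# A real-valued filter needs the pole together with its complex conjugate
den = denominator_from_poles([pole, np.conj(pole)])
```

Each resonance contributes a complex-conjugate pole pair, so a model with p poles can represent at most p/2 resonances.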

The sound pressure radiated from the mouth to the surrounding air is proportional to the time derivative of the volume velocity flow at the lips (Davis, 1978); lip radiation is therefore generally approximated as a high-pass filter with a spectral slope of +6 dB/octave. In the inverse filtering technique the frequencies and bandwidths of the poles are estimated by autoregressive (AR) modelling of the signal. This method is also called linear prediction (LPC; see footnote 11) because a linear combination of the previous samples of the signal is used to predict the next sample (a linear, discrete-time system):

$$s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + G\, u(n) \qquad (5)$$

where s(n) is the speech sample at time n, u(n) is the excitation, the model coefficients a_i are time-invariant, and G is the overall system gain. The model coefficients can be calculated recursively, and their estimates are the least-squares approximations of the true values (see Kay, 1988 for a survey of coefficient computation methods). Thus, the vocal tract transfer function is modelled as

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}} \qquad (6)$$

which, once the denominator is factored into its p roots, leads directly back to eq. (4).
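As an illustration of how the predictor coefficients might be computed, the following Python sketch uses the autocorrelation method, one of several least-squares formulations of linear prediction. It assumes the sign convention s(n) = sum a_i s(n-i) + G u(n) and scales the gain so that the excitation has unit variance; both choices, the function name, and the synthetic test signal are assumptions made for this example.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorrelation(frame, order):
    """Estimate predictor coefficients a_1..a_p and gain G for one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation estimates r[0..order] (unnormalised sums of products)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Normal equations: Toeplitz(r[0..p-1]) a = r[1..p]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # Energy of the prediction residual, converted to a per-sample gain
    residual_energy = r[0] - np.dot(a, r[1:order + 1])
    gain = np.sqrt(residual_energy / n)
    return a, gain

# Synthetic check: a known second-order AR process driven by white noise
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
true_gain = 0.5
excitation = rng.standard_normal(20000)
signal = lfilter([true_gain], np.concatenate(([1.0], -true_a)), excitation)
est_a, est_gain = lpc_autocorrelation(signal, order=2)
```

On real speech the frame would normally be windowed and the order chosen according to the sampling rate; neither step is shown here.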

An inverse filter (1/H(z)) is applied to every pitch period of the speech signal and the resulting signal is regarded as an approximation of the source signal. There are two main strategies for the estimation of the vocal tract transfer function:

The latter strategy additionally calls for an identification of the closed phase during a pitch period but is nevertheless judged to be more effective.
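The filtering step itself can be sketched in Python as follows. The sketch assumes that the predictor coefficients are already known and uses a single fixed set of coefficients rather than re-estimating them for every pitch period. It also assumes that lip radiation is modelled as a first-order differentiator, which is cancelled with a leaky integrator. The function name, the coefficient 0.99, and the synthetic source pulses are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def inverse_filter(speech, a, gain=1.0, lip_coeff=0.99):
    """Return (flow_derivative, flow) estimated from a speech signal.

    a holds the predictor coefficients a_1..a_p of
    s(n) = sum a_i s(n-i) + G u(n), so 1/H(z) = (1 - sum a_i z^-i) / G.
    """
    inv_num = np.concatenate(([1.0], -np.asarray(a, dtype=float))) / gain
    # Cancelling the vocal tract leaves the differentiated flow
    flow_derivative = lfilter(inv_num, [1.0], speech)
    # Cancel the assumed lip radiation (1 - lip_coeff z^-1) by integration
    flow = lfilter([1.0], [1.0, -lip_coeff], flow_derivative)
    return flow_derivative, flow

# Synthetic check: build "speech" from a known source, then recover the source
phase = np.arange(400) % 80
source = np.where(phase < 40, np.sin(np.pi * phase / 40.0) ** 2, 0.0)
true_a = np.array([1.3, -0.6])      # hypothetical vocal tract coefficients
true_gain = 0.5
radiated = lfilter([1.0, -0.99], [1.0], source)
speech = lfilter([true_gain], np.concatenate(([1.0], -true_a)), radiated)
flow_derivative, flow = inverse_filter(speech, true_a, true_gain, 0.99)
```

The recovery is exact here only because the same model generates and inverts the signal; with real speech the quality of the estimate depends on how well the assumptions of the model hold.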

Both methods require the pitch periods to be marked, which is usually done by marking the instants of glottal closing (called CGIs or epochs). Although several methods of epoch detection based on processing the speech waveform have been proposed (Rabiner & Schafer, 1978; Hess, 1991; Strube, 1974; Ma et al., 1994; Cheng & Shaugnessy, 1989; Childers & Ahn, 1995), the task is not trivial and the results, especially for distorted speech, are often unsatisfactory. To achieve greater accuracy, a technique of two-channel processing is widely used, in which the CGIs as well as the instants of glottal opening are provided by other means, for example through electroglottography (see section 8) (Krishnamurthy & Childers, 1986; Pinto et al., 1989).
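A very simple form of two-channel epoch marking can be sketched in Python by picking the prominent positive peaks of the differentiated electroglottographic (EGG) signal. The sketch assumes that the EGG signal is oriented so that increasing vocal fold contact gives increasing values; the polarity differs between devices. The function name, the threshold, and the synthetic test signal are illustrative assumptions and do not reproduce any of the cited methods.

```python
import numpy as np
from scipy.signal import find_peaks

def closure_instants_from_egg(egg, fs, max_f0=500.0, rel_height=0.3):
    """Return sample indices of prominent positive peaks in the EGG derivative."""
    egg = np.asarray(egg, dtype=float)
    # First difference as a simple derivative estimate (same length as egg)
    degg = np.diff(egg, prepend=egg[0])
    # Peaks closer together than the shortest expected period are merged
    min_distance = max(1, int(fs / max_f0))
    peaks, _ = find_peaks(degg, height=rel_height * np.max(degg),
                          distance=min_distance)
    return peaks

# Synthetic check: a 100 Hz "EGG" whose derivative peaks 10 samples into each period
fs = 10000.0
phase = np.arange(1000) % 100
degg_true = np.exp(-0.5 * ((phase - 10) / 2.0) ** 2)
egg = np.cumsum(degg_true - degg_true.mean())
epochs = closure_instants_from_egg(egg, fs)
```

Real EGG recordings usually need high-pass filtering and a more robust threshold before such peak picking gives usable epochs.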

Historically, the first inverse filters were operated interactively: the operator adjusted the frequencies and bandwidths of the filter until the results satisfied the researcher's expectations (Miller, 1959; Wong & Markel, 1976).

Assumptions of inverse filtering:

  1. Speech is produced by a linear system in which a source signal is modified by a vocal tract filter.
  2. The system is stationary during an analysis interval.
  3. The glottal pulse spectrum is flat.
  4. The all-pole model of vocal tract characteristics is correct.
  5. The estimates of the bandwidths of spectral poles are correct.

11. A detailed discussion of LPC processing is given by Markel and Gray (1973), Kay (1988), and Makhoul (1975).