Parametrization of the EGG waveform

10. Parametrization of the EGG waveform.

10.1. Pitch period determination

Once the opening and closing instants have been determined, the pitch period and other measures of glottal contact behaviour can be introduced. In EGG signals the pitch period is usually defined as the duration between maximum positive peaks in the differentiated EGG waveform. These peaks are regarded as instants of glottal closure. The marking of pitch periods is usually done automatically by computer programs which use threshold values to detect the peaks of the signal derivative (Fourcin, 1993; Childers & Krishnamurthy, 1985; Baken, 1992; Orlikoff, 1991). The threshold is usually set midway between the minimum and the following maximum peak of the waveform. It must be noted, however, that even for normal voices this simple method may fail due to the rapid baseline changes caused by vertical larynx movements (Gx signal) (Marasek, 1995a). Childers et al. (1990) warn that multiple peaks in the differentiated EGG signal may occur even for modal speakers or if the signal is noisy.
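The threshold comparison described above can be sketched as follows. This is a minimal illustration in Python, not the procedure of any of the cited programs: the function names are hypothetical, the derivative is a plain first difference, and a single threshold midway between the global minimum and maximum of the derivative stands in for a threshold that is recomputed for each cycle.

```python
def differentiate(x):
    """First difference of the signal: d[n] = x[n + 1] - x[n]."""
    return [b - a for a, b in zip(x, x[1:])]


def closure_instants(egg, threshold_fraction=0.5):
    """Return indices of derivative peaks that exceed the threshold.

    Assumption: one global threshold, placed at threshold_fraction of the
    way between the minimum and maximum of the derivative (0.5 = midway).
    """
    d = differentiate(egg)
    threshold = min(d) + threshold_fraction * (max(d) - min(d))
    peaks = []
    for n in range(1, len(d) - 1):
        is_local_max = d[n - 1] < d[n] >= d[n + 1]
        if is_local_max and d[n] > threshold:
            peaks.append(n)
    return peaks


# Synthetic EGG-like signal: fast rise (closure), slow fall (opening).
cycle = [0, 0, 0, 5, 9, 10, 8, 6, 4, 2]
egg = cycle * 3
instants = closure_instants(egg)
periods = [b - a for a, b in zip(instants, instants[1:])]  # in samples
```

On this clean synthetic signal every cycle yields one marker and the periods are constant. A drifting baseline or multiple derivative peaks per cycle would break the single-threshold assumption, which is the failure mode the text warns about.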

In the case of pathological voices the location of the closure instant in the EGG signal is no longer as obvious as for normal voices. The adduction phase is often not smooth, and additional peaks in the differentiated EGG waveform are often observed. These additional peaks are not significant, however, and the waveform as a whole is irregular (Motta et al., 1990; Hunt, 1988).

The EGG signal is usually preconditioned to remove unwanted signal components. This is achieved by the use of digital linear phase high- or bandpass filtering (Hess & Indefrey, 1987; Schoentgen & de Guchteneere, 1991; Eady et al., 1992; Dickson et al., 1994). The filtered waveform is then subjected to the method of threshold comparison as described above.

Yet another approach to solving this problem was proposed by Vieira, McInnes and Jack (1996, 1996b). The band-pass filtered signal (with zero phase shift) is examined in order to determine the location of the "unique zero crossing" between "significant peaks". A sample x(n) is defined to be a positive significant peak when there is no larger value in the preceding and following intervals of 1 ms. Analogously, a negative significant peak is defined as the smallest value in the two adjacent 1-ms windows (Fig. 19). The threshold is then iteratively modified to cope with baseline fluctuations and different speakers. In Vieira's algorithm the unique upward zero crossing (i.e. one occurring while the signal increases) is defined (Vieira et al., 1996) as:

where

and Avg+(0)=120 is empirically adjusted to match the noise level in the first silent interval.

This method was tested on a large set of data. The average proportion of imprecise values was 3.37%, rising to 10.71% in the worst case (weak and noisy recordings of a patient with unilateral vocal fold paralysis). This approach will be used here as a reference method.
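The significant-peak test and the interpolation of the upward zero crossing (the steps shown in Fig. 19) can be sketched as below. This is an illustration only and not Vieira's implementation: the adaptive threshold Avg+ is not reproduced, the function names are hypothetical, the input is assumed to be already band-pass filtered around zero, and the first upward sign change between the two peaks is taken as the crossing.

```python
def is_significant_peak(x, n, half_width, sign=+1):
    """True if x[n] is the extreme value within +/- half_width samples.

    sign=+1 tests a positive peak (no larger value nearby),
    sign=-1 tests a negative peak (no smaller value nearby).
    """
    lo = max(0, n - half_width)
    hi = min(len(x), n + half_width + 1)
    return all(sign * x[n] >= sign * x[k] for k in range(lo, hi))


def upward_zero_crossing(x, t1, t3):
    """Fractional index of the first upward zero crossing between t1 and t3."""
    for n in range(t1, t3):
        if x[n] < 0 <= x[n + 1]:
            # Linear interpolation between the negative and positive sample.
            return n + (-x[n]) / (x[n + 1] - x[n])
    return None


def find_crossings(x, fs, peak_ms=1.0, search_ms=10.0):
    """Locate crossings between negative and positive significant peaks."""
    half_width = max(1, round(fs * peak_ms / 1000))
    search = round(fs * search_ms / 1000)
    crossings = []
    n = 0
    while n < len(x):
        if x[n] < 0 and is_significant_peak(x, n, half_width, sign=-1):
            # Search for the next positive significant peak (t3).
            stop = min(len(x), n + search + 1)
            t3 = next((m for m in range(n + 1, stop)
                       if x[m] > 0 and is_significant_peak(x, m, half_width)),
                      None)
            if t3 is not None:
                t2 = upward_zero_crossing(x, n, t3)
                if t2 is not None:
                    crossings.append(t2)
                n = t3  # continue after the positive peak
        n += 1
    return crossings


# Toy signal sampled at 1 kHz, so 1 ms corresponds to one sample.
signal = [0, -1, -3, -1, 1, 3, 1, -1, -3, -1, 1, 3, 1, 0]
crossings = find_crossings(signal, fs=1000)
```

With real recordings the window lengths would span many samples, and flat peaks (ties) would need a tie-breaking rule that this sketch does not include.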

Two other methods of marking pitch periods have been proposed by the author (Marasek, 1995a). Both are multistage algorithms that use thresholds in both the time and amplitude domains. They operate on raw, unfiltered EGG waveforms in order to preserve the undistorted shape of the impedance changes. This is an advantage because the methods described above rely on filtering, which may distort the data.

Figure 19. The determination of the unique zero crossing according to Vieira et al. (1996). A negative significant peak is located at t₁ and the next positive significant peak is searched for within a 10 ms window. If the search is successful (t₃), an upward zero crossing t₂ is linearly interpolated between the negative and positive signal samples.


In the first algorithm, the Gx waveform (which indicates the coarse movements of the larynx) is estimated. The course of the baseline is calculated using a moving-average (MA) model (Fig. 20). Initially, an average across a chosen observation window is computed, with the window centred at a given signal sample. The window is then advanced one sample at a time until the end of the signal is reached. Finally, the upward intersections of the Lx signal with the MA estimator are used to mark the closure instants. The window length was set to 10 ms. With this setting the slow changes were eliminated, but very rapid changes of the Gx were not recognized and were passed on to the next phase of signal description. Visual inspection of the EGG waveform shows that in such cases the shape of the waveform is highly distorted and should be excluded from analysis (Fig. 20).

The second algorithm (Marasek, 1995b) is similar in principle to that proposed by Vieira et al. (1996). In the first step of the algorithm, the local maxima of the signal are determined under the assumption that the duration of a pitch period lies within a given interval (for example 2 to 20 ms). Afterwards, the minima between the signal maxima are identified. The minimum-maximum pairs formed in this way are compared with their neighbours and, if they are significantly weaker or shorter, attached to stronger pairs (this may occur in the case of creaky or laryngealized phonation). The instant of closure (CGI) is then calculated as the maximum of the first signal derivative between the last minimum and the next maximum of the signal.
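The core of this procedure (maxima, intervening minima, and the steepest rise between them) can be sketched as follows. The sketch is an assumption-laden simplification: the function names are hypothetical, only the lower bound of the pitch-period range is enforced (as a minimum distance between maxima), and the step that attaches weak pairs to stronger neighbours is omitted.

```python
def local_maxima(x, min_distance):
    """Local maxima at least min_distance samples apart (strongest kept)."""
    candidates = [n for n in range(1, len(x) - 1)
                  if x[n - 1] < x[n] >= x[n + 1]]
    kept = []
    for n in candidates:
        if kept and n - kept[-1] < min_distance:
            # Too close to the previous maximum: keep the larger of the two.
            if x[n] > x[kept[-1]]:
                kept[-1] = n
        else:
            kept.append(n)
    return kept


def closure_instants_minmax(x, min_distance):
    """Steepest rise between each maximum and the minimum preceding it."""
    instants = []
    previous = 0
    for i_max in local_maxima(x, min_distance):
        # Minimum between the previous maximum (or signal start) and this one.
        i_min = min(range(previous, i_max), key=lambda n: x[n])
        # First difference on [i_min, i_max); its maximum marks the closure.
        slopes = {n: x[n + 1] - x[n] for n in range(i_min, i_max)}
        instants.append(max(slopes, key=slopes.get))
        previous = i_max
    return instants


cycle = [0, 0, 0, 5, 9, 10, 8, 6, 4, 2]
egg = cycle * 3
maxima = local_maxima(egg, min_distance=5)
instants = closure_instants_minmax(egg, min_distance=5)
```

Because the search for the steepest rise is confined to one minimum-maximum pair, no amplitude threshold on the derivative is needed.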

Figure 20. The EGG waveform with MA-modelled Gx.


The second step of the analysis is shared by both methods. The CGIs obtained are examined once again, this time by a procedure that takes into account the values found for the whole analysed recording. Pairs of markers which occur in isolation or in a very short chain of markers (shorter than five) are classified as incorrect and are removed if they occur between long voiceless segments of speech (of at least 200 ms duration). Differences between successive CGI markers are analysed and, if the duration quotient is greater than five, the algorithm tries to recover the minimum-maximum pair from the original signal. The voiced segments are then marked. The procedure is capable of accurately recognizing the beginning of voicing (Marasek, 1995a).
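The removal of isolated markers can be illustrated with the sketch below. It is not the author's procedure: the function name is hypothetical, marker times are assumed to be given in milliseconds, the start and end of the recording are treated as voiceless (an assumption), and the recovery of missed minimum-maximum pairs is not attempted.

```python
def remove_isolated_chains(markers_ms, min_chain=5, min_gap_ms=200.0):
    """Drop chains of fewer than min_chain markers bounded by long gaps.

    A chain is a run of markers whose successive spacing is below
    min_gap_ms; a gap of min_gap_ms or more is taken to be voiceless.
    """
    chains, current = [], []
    for t in markers_ms:
        if current and t - current[-1] >= min_gap_ms:
            chains.append(current)
            current = []
        current.append(t)
    if current:
        chains.append(current)

    kept = []
    for chain in chains:
        if len(chain) >= min_chain:
            kept.extend(chain)
    return kept


# Two voiced stretches with a spurious pair of markers in the pause.
markers = [0, 10, 20, 30, 40, 50, 400, 410, 800, 810, 820, 830, 840]
cleaned = remove_isolated_chains(markers)
```

The defaults follow the values quoted in the text (chains shorter than five markers, voiceless segments of at least 200 ms).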

EGG signal and CGI markers

The fundamental frequency measured at the voice onset and offset is often erroneous due to voice abnormalities in these phonation phases. The very first as well as the very last pitch period in the EGG waveform are often weaker and differ in duration from the pitch periods in between. These problems are partially addressed by Marasek (1995a) in his description of the two-channel (speech + EGG) automated method of Voice Onset Time determination.

Further processing involves an analysis within voiced segments. The intervals between marked minimum-maximum pairs are compared in order to detect deviations from mean interval duration and mean relative amplitude. If the deviations are greater than the given thresholds, the weaker pairs are connected to stronger ones (provided that the resulting pair does not exceed the allowed pitch range).

The second method was tested on both normal and pathological data and compared to the method proposed by Vieira et al. (1996). The results do not show significant differences for modal voices, but for pathological voices Vieira's method works better. The results are summarized in Table 4. The methods were tested on German words spoken in isolation (with pauses between words), produced by modal (normal) as well as pathological speakers. Several thousand pitch periods were compared. In the results given in Table 4 the comparison of the two methods is biased due to differences in the marking of voiceless segments. This causes a slight delay between the estimated F0 contours, which in turn results in poorer matching. Despite this, the experiment has verified that the two methods are comparably effective: the linear regression shows the difference between the estimated fundamental frequencies to be about 1% (regression coefficient 1.012) with a very small standard error, as follows from the statistics given in Table 4.

Table 4: The statistical results of the F0 discrimination method comparison.

DEP VAR: F1    N: 3995    MULTIPLE R: 1.000    SQUARED MULTIPLE R: 0.999
ADJUSTED SQUARED MULTIPLE R: .999    STANDARD ERROR OF ESTIMATE: 5.153

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE   T         P(2 TAIL)
F2         1.012         0.000       1.000      1.000       .28E+04   0.000

ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF     MEAN-SQUARE   F-RATIO       P
REGRESSION   .211681E+09      1      .211681E+09   7970561.400   0.000
RESIDUAL     106072.004       3994   26.558
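The entries of Table 4 are related by standard regression identities, which can be checked with a few lines of Python. The variable names are illustrative; the numbers are copied from the table, and the identity t² = F holds because there is a single predictor.

```python
import math

# Values copied from Table 4.
ss_regression = 0.211681e9
ss_residual = 106072.004
df_regression = 1
df_residual = 3994

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual      # residual mean square
std_error = math.sqrt(ms_residual)           # standard error of estimate
f_ratio = ms_regression / ms_residual        # F statistic
t_value = math.sqrt(f_ratio)                 # single predictor: t^2 = F
r_squared = ss_regression / (ss_regression + ss_residual)
```

Computed this way, the residual mean square and the standard error of estimate round to the tabulated 26.558 and 5.153. The F ratio agrees with the tabulated value to within the rounding of the regression sum of squares, and the t value is of the order of the tabulated .28E+04.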


The interval between two successive instants of glottal closure is used to estimate the pitch period duration. Hess and Indefrey (1987), using their method of F0 computation based on EGG signals, found an accuracy better than 0.5% for normal voices. Vieira et al. (1996) report that at least 91.8% of the estimated F0 values had a precision better than 0.103%, with an estimated error of 0.058% in relation to the reference F0 contour. These results establish the EGG signal as a very precise and robust carrier of F0, even for moderately pathological voices.
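The conversion from closure instants to an F0 contour is a simple reciprocal of the period. The sketch below assumes the closure instants are given as sample indices at a known sampling rate; the function name and the example values are illustrative.

```python
def f0_contour(closure_samples, fs):
    """F0 in Hz for each pair of successive closure instants.

    closure_samples: sample indices of glottal closure, in increasing order
    fs: sampling rate in Hz
    """
    return [fs / (n1 - n0)
            for n0, n1 in zip(closure_samples, closure_samples[1:])]


# Closure instants at 10 kHz: two periods of 10 ms, then one of 12.5 ms.
f0 = f0_contour([0, 100, 200, 325], fs=10000)
```

Each F0 value refers to a single glottal cycle, so the time resolution of the contour is one pitch period, and the sampling rate limits how finely the period can be resolved unless the instants are interpolated.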

The main parameters can be identified for every pitch period of the EGG waveform. In particular, the Speed and Open Quotients can be calculated (see sections 10.2 and 10.3).