SPEECH IS AT LEAST 4-DIMENSIONAL: RECEPTIVE FIELDS IN TIME-FREQUENCY

Jeff A. Bilmes† and Dan Ellis†
{bilmes,dpwe}@icsi.berkeley.edu

Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
Berkeley, CA 94704, USA

† International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA

1. INTRODUCTION

The successful integration of temporal information is crucial for speech recognition: for example, dynamic time warping and more recently hidden Markov models (HMMs) have been critical in speech recognition technology, and both these methods amount to time-aligning feature vectors to stored word and sentence templates. The correlogram [14] relies on the temporal processing of individual outputs of a cochlea-inspired filter bank. Furui [6] posited the existence of perceptual critical points located in time and containing information crucial for the perception of speech, and this idea inspired the statistical speech recognition model SPAM [11], which focuses modeling power on points of transition or maximal spectral change. This suggests that certain time points of a speech signal are in some sense more "important" than others for a recognition task. With HMMs, the locations of the state transitions influence the particular realization of a model and therefore its likelihood; states between transitions are of secondary importance. The SPAM results [1, 12] in fact show that we can group these "nontransitional" states together into a single broad category and still achieve a good recognition error rate.

All the above methods, however, neglect another potential axis of discrimination, namely cross-spectro-temporal co-information. Non-convex or disjoint patterns with significant temporal and spectral extent cannot be detected directly. Instead, existing methods focus either across frequency on a small slice of time (e.g., LPC cepstral features of a single speech frame) or across time on a small slice of frequency (e.g., the correlogram).

There are in fact several results suggesting that the utilization of cross-spectro-temporal co-information can have a beneficial effect on speech processing and recognition. In the artificial neural network (ANN) speech community [10], it has been shown that using multi-frame context windows can improve recognition scores. The loss of independent feature vectors notwithstanding, ANNs with such "wide" context windows have the potential to learn time- and frequency-like patterns, depending on the features used. In [2], it was shown that a cross-channel correlation algorithm can be used to find formants in voiced speech in high-noise situations. In [5], it was shown that cross-channel correlation can be used to identify individual sound sources in a mixed auditory scene. Also, in [8], it was suggested that long-term cross-channel correlation could serve as a measure of speech quality.

It can also be argued that the use of cross-spectro-temporal information is biologically plausible. Echoic memory is a temporary buffer in the auditory system that holds pre-attentive information for a brief period of time before subsequent, more detailed, and more taxing processing takes place [13]. It is likely that this storage occurs at the post-cochlear level, as we have no evidence for such memory before or during cochlear processing. Therefore, the echoic store can plausibly be thought of as a form of processed spectro-temporal buffer.
Granting such a buffer, it would be surprising if subsequent processing did not attempt to find patterns utilizing not just the temporal or spectral axes alone, but shaped regions spanning both time and frequency. It may therefore be postulated that the auditory system has the capability, over a 200 ms time-span comparable to the echoic store, to observe the co-occurrence of information in different spectro-temporal regions. Similar to the cells in the visual system that respond to particular shapes, one may consider receptive fields over a form of post-cochlear spectro-temporal plane. Later stages of the auditory system could derive arbitrarily shaped regions that perhaps dynamically scale, shift, and transform according to a variety of control mechanisms.

In this paper, we consider a new representation of speech that attempts to explicitly represent non-convex spectro-temporal co-information. Section 2 discusses the computational aspects of our representation. Section 3 illustrates it with an example. Finally, Section 4 discusses current and future work.

2. THE MODCROSSGRAM

The most general (and most naive) way of computationally encoding the information described above is brute force. That is, given a time-frequency grid, we could derive features based on the co-information among all pairs, triples, quadruples, etc. of grid elements. This clearly would be an infeasible representation for current speech recognition systems. To mitigate this combinatorial explosion of features, we henceforth consider only pairs of grid points within a limited spectro-temporal region and resolution.

The modcrossgram (modulation envelopes cross-correlated) is a new feature extraction method that may be used to compute such spectro-temporal co-information. The processing is as follows (see Figure 1). We first compute the modulation envelopes in each channel of a critical-band-like filterbank by rectifying and band-pass filtering. As early as 1939 [4], modulation envelopes have been shown to carry the crucial phonetic information of a speech signal. Also, since the envelopes are narrow-band, they may be down-sampled, recovering full bandwidth at the reduced sampling rate and lessening subsequent computational demands. Note that we band-pass rather than low-pass filter in order to remove low-frequency modulation energy known to be of little importance to the speech signal [3, 7].

The modulation envelopes are then processed by short-term cross-correlation, defined as follows:

    R_{i,j}(t, ℓ) = Σ_{k=0}^{N} x_i(t+k) x_j(t+k+ℓ) w_k        (1)

where x_i is the i-th envelope channel, t is the starting offset within the signals, ℓ is the correlation lag, N determines the number of points used to compute the correlation, and the w_k are windowing coefficients. All pairs of channels are processed by the above for each desired time step. The result is, for each t, a rectangular prism whose three axes are indexed respectively by the first frequency channel, the second frequency channel, and the correlation lag (see Figure 1). The resulting representation provides an estimate of the co-information between two frequency channels separated by a given lag ℓ starting at absolute position t.

General cross-correlation between signals x and y is a function of two lags. If we assume joint stationarity between the signals, the correlation depends on only one variable, the difference between the two lags, and possesses the property R_{xy}(ℓ) = R_{yx}(-ℓ). Because Equation 1 is a short-term cross-correlation (that is, because we always use the same number of points to compute the correlation for each t and ℓ), this property no longer holds. Therefore, we must potentially consider both R_{i,j}(t, ℓ) and R_{j,i}(t, ℓ) for all ℓ.
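To make this concrete, the following Python sketch implements the two stages just described: rectified, band-pass-filtered modulation envelopes from a filterbank, followed by the short-term cross-correlation of Equation 1 over all channel pairs and lags. It is a minimal illustration, not the authors' implementation: the Butterworth filters, the 1-16 Hz modulation pass-band, the Hann window, and the decimation factor are all our assumptions, since the paper specifies only the general structure.

# Minimal sketch of the modcrossgram pipeline; filter types, modulation
# pass-band, window, and decimation factor are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfilt

def modulation_envelopes(x, fs, band_edges, mod_band=(1.0, 16.0), decim=100):
    """Rectify and band-pass filter each filterbank channel, then
    down-sample (with 8 kHz input and decim=100, envelope frames are
    12.5 ms, matching the frame rate quoted for Figure 2)."""
    # Band-pass (not low-pass) filtering of the envelope removes the DC
    # component, so later correlations are not dominated by DC offsets.
    sos_mod = butter(2, mod_band, btype="bandpass", fs=fs, output="sos")
    envs = []
    for f_lo, f_hi in band_edges:              # critical-band-like channels
        sos = butter(4, (f_lo, f_hi), btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, x)
        env = sosfilt(sos_mod, np.abs(band))   # rectify, then band-pass
        envs.append(env[::decim])              # narrow-band, so safe to
    return np.stack(envs)                      # down-sample; (chan, frames)

def modcrossgram(envs, N, max_lag, hop=1):
    """Short-term cross-correlation of Equation 1:
    R[t, i, j, l] = sum_{k=0..N} x_i(t+k) x_j(t+k+l) w_k."""
    C, T = envs.shape
    w = np.hanning(N + 1)                      # windowing coefficients w_k
    starts = list(range(max_lag, T - N - max_lag, hop))
    lags = range(-max_lag, max_lag + 1)
    R = np.zeros((len(starts), C, C, 2 * max_lag + 1))
    for ti, t in enumerate(starts):
        seg = envs[:, t:t + N + 1] * w         # x_i(t+k) w_k for k = 0..N
        for li, l in enumerate(lags):
            lagged = envs[:, t + l:t + l + N + 1]  # x_j(t+k+l)
            R[ti, :, :, li] = seg @ lagged.T   # all channel pairs at once
    return R                                   # 4-D: (t, i, j, lag)

With 12.5 ms envelope frames, N = 3 and max_lag = 10 correspond roughly to the 37.5 ms correlation points and 250 ms bipolar lag span used for Figure 2; note that, per the asymmetry discussed above, both R[t, i, j, :] and R[t, j, i, :] must be retained.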
The modcrossgram computes more than the co-occurrence of energy between two spectro-temporal regions. Because our envelopes have been band-pass rather than low-pass filtered, the DC components have been removed. Therefore, Equation 1 may find correlation of the envelopes' spectral components, within limits determined by the band-pass filters, without being dominated by positive correlation from DC offsets.

Given a representation as described above, a subsequent pattern recognition algorithm can form receptive fields based on disconnected spectro-temporal regions. In addition, deltas computed from such receptive fields can determine spectral change on non-convex, temporally or spectrally separated regions rather than on regions simply collapsed across frequency. By discovering the (i, j, ℓ) positions that contain the information crucial to speech, features can potentially be formed that are preferentially sensitive to speech-like sounds.

3. EXAMPLES

The modcrossgram is inherently 4-dimensional: for each time point t, the result is a 3-dimensional rectangular prism indexed by channel numbers i and j and a bipolar lag time ℓ. It is difficult to visualize all of its information simultaneously. Figure 2 shows three plots. The top plot is the normal spectrogram of the utterance "ka ga." As the spectrogram shows, these two syllables differ principally in the voicing onset time (the delay between the stop-burst and the start of the periodic voicing), which is about 80 ms in "ka" but nearly simultaneous in "ga" [9].

The middle plot shows the complete modcrossgram from a 22-channel filter bank for the utterance, time-aligned to the spectrogram. This picture shows the top 60 dB of the positive correlation, a maximum bipolar lag span of 250 ms (20 frames), and uses 37.5 ms (3 frames) for each correlation point. At each frame number and frequency channel, a small matrix shows the correlation between that channel and all other channels (vertical axis) and the lag from -10 frames to +10 frames (horizontal axis).

Because it is difficult to digest such a large quantity of data, we provide a third plot. The bottom plot shows the top 60 dB of the positive cross-correlation between channels 17 and 8. Each horizontal stripe corresponds to a receptive field over the co-information between these two frequency channels at the corresponding lag. Observe that for "ka", at around frame 70, we see significant correlation at positive lags; a little later, we see significant correlation at negative lags. This reflects the timing difference between the initial stop release and the subsequent voiced onset. As expected, these correlations are not observed for "ga", which exhibits quite different patterns around frame 130.

There are of course many other receptive fields available for observation in the modcrossgram. Our belief is that some combination of these will be useful for a pattern recognizer to discriminate between speech sounds, as in the sketch below.
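As one sketch of how such receptive-field features might be read out of the representation (continuing the Python above), the following treats a receptive field as nothing more than a chosen set of (i, j, ℓ) cells, possibly disjoint in frequency and lag, sampled at each time step together with their temporal deltas. The particular cells, loosely echoing the channel-17/channel-8 stripes of Figure 2, are hypothetical placeholders rather than a tuned feature set.

# Hedged sketch: receptive-field features over the modcrossgram R from
# the previous listing; the cell choices below are hypothetical.
import numpy as np

def receptive_field_features(R, cells, delta_step=2):
    """R: 4-D modcrossgram (time, channel_i, channel_j, lag_index).
    cells: list of (i, j, lag_index) tuples defining the field."""
    idx_i, idx_j, idx_l = (np.array(v) for v in zip(*cells))
    vals = R[:, idx_i, idx_j, idx_l]            # (time, n_cells)
    # Deltas measure change over the (possibly disconnected) region
    # itself, rather than change collapsed across frequency.
    deltas = np.zeros_like(vals)
    deltas[delta_step:] = vals[delta_step:] - vals[:-delta_step]
    return np.hstack([vals, deltas])            # (time, 2 * n_cells)

# Hypothetical field: co-information between channels 17 and 8 at a few
# bipolar lags (with max_lag = 10, lag index 10 corresponds to lag 0).
field = [(17, 8, 10 + l) for l in (-6, -3, 0, 3, 6)]
# feats = receptive_field_features(R, field)

A pattern recognizer could then search over such cell sets to discover the (i, j, ℓ) positions that carry the most discriminative co-information, in the spirit of Section 2.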
4. CURRENT AND FUTURE WORK

In this proposal, we have introduced the modcrossgram, a new representation of speech signals. We are currently in the process of integrating features derived from the modcrossgram into a standard speech recognition system. In the near future, we plan to use the modcrossgram directly to derive receptive fields that will be appropriate to speech, and then use these as features for a speech recognition system.

Figure 1. The process used to compute the ModCrossGram, resulting in a 4-dimensional representation.

Figure 2. Top: Spectrogram; Middle: Modcrossgram; Bottom: Correlation between channels 17 (CF 1560 Hz) and 8 (CF 328 Hz).

REFERENCES

[1] J. Bilmes, N. Morgan, S.-L. Wu, and H. Bourlard. Stochastic perceptual speech models with durational dependence. Proc. ICSLP, November 1996.
[2] L. Deng and I. Kheirallah. Dynamic formant tracking of noisy speech using temporal analysis on outputs from a nonlinear cochlear model. IEEE Trans. Biomedical Engineering, 40(5):456–467, May 1993.
[3] R. Drullman, J. M. Festen, and R. Plomp. Effect of reducing slow temporal modulations on speech reception. JASA, 95(5):2670–2680, May 1994.
[4] H. Dudley. Remaking speech. JASA, 11(2):169–177, October 1939.
[5] D. P. W. Ellis. A simulation of vowel segregation based on across-channel glottal-pulse synchrony. Technical Report 252, MIT Media Lab Perceptual Computing, Cambridge, MA 02139, 1993. Presented at the 126th meeting of the Acoustical Society of America, Denver, October 1993.
[6] S. Furui. On the role of spectral transition for speech perception. JASA, 80(4):1016–1025, October 1986.
[7] H. Hermansky and N. Morgan. RASTA processing of speech. IEEE Trans. on Speech and Audio Processing, 2(4):578–589, October 1994.
[8] T. Houtgast and J. A. Verhave. A physical approach to speech quality assessment: Correlation patterns in the speech spectrogram. In Proc. EUROSPEECH 91, volume 1, pages 285–288. Istituto Int. Comunicazioni, 1991.
[9] R. D. Kent, J. Dembowski, and N. J. Lass. The acoustic characteristics of American English. In N. J. Lass, editor, Principles of Experimental Phonetics, chapter 5. Mosby, 1996.
[10] R. P. Lippmann. Review of neural networks for speech recognition. In A. Waibel and K.-F. Lee, editors, Readings in Speech Recognition, pages 374–392. Morgan Kaufmann, 1990.
[11] N. Morgan, H. Bourlard, S. Greenberg, and H. Hermansky. Stochastic perceptual auditory-event-based models for speech recognition. Proc. Intl. Conf. on Spoken Language Processing, pages 1943–1946, September 1994.
[12] N. Morgan, S.-L. Wu, and H. Bourlard. Digit recognition with stochastic perceptual models. Proc. Eurospeech'95, September 1995.
[13] U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, 1967.
[14] M. Slaney and R. F. Lyon. On the importance of time: a temporal representation of sound. In M. Cooke, S. Beet, and M. Crawford, editors, Visual Representations of Speech Signals, pages 95–116. John Wiley and Sons Ltd., 1994.