IMPROVEMENT OF BODY-CONDUCTED SPEECH SOUND QUALITY USING THE ACCELERATION DIFFERENCE: EVALUATION BY A WORD INTELLIGIBILITY TEST

Masashi Nakayama*1, Aya Kajino*2, Seiji Nakagawa*3 and Shunsuke Ishimitsu*1

*1 Hiroshima City University, 4-1 Otsuka-higashi, Asaminami, Hiroshima 731-3194, Japan
e-mail: masashi@hiroshima-cu.ac.jp

*2 Kagawa National College of Technology, 355 Chokushi, Takamatsu, Kagawa 761-8058, Japan

*3 National Institute of Advanced Industrial Science and Technology (AIST), 1-8-31 Midorigaoka, Ikeda, Osaka 563-8577, Japan

Because body-conducted speech (BCS) sounds are conducted by skin, muscle, and bone and are not easily disrupted by airborne noise, noise-robust methods of speech detection using BCS have been developed. However, the intelligibility of BCS deteriorates quickly because it does not contain high-frequency components above 2 kHz. We therefore previously proposed and evaluated a method that improves the sound quality of BCS using the acceleration difference and noise reduction techniques. However, assessments based on human perception are essential. Consequently, in this paper, we report on evaluations conducted to verify the efficacy of the proposed method using word intelligibility tests with normal-hearing subjects.

1. Introduction

Speech conversation is one of the most important means of communication for humans. However, factors such as airborne noise can easily disrupt speech communication. Consequently, approaches that enable robust communication methods and instruments have been widely proposed and investigated in the fields of speech signal processing and human interfacing. In speech signal processing in particular, robust communication is one of the most significant research topics because speech recognition has not yet achieved sufficiently effective performance for practical use.

One such area of research is body-conducted speech (BCS), including bone-conducted sound, in which speech is conducted via the skin and bones of the human body [1]. Airborne noise does not affect this sound; however, the intelligibility of BCS deteriorates rapidly because it does not contain high-frequency components above 2 kHz. Consequently, many approaches for improving the sound quality and intelligibility of BCS have been proposed and investigated. We previously proposed and evaluated one such method to improve the sound quality of BCS using the acceleration difference and noise reduction methods [2]. In the proposed method, we extract a clear signal from BCS and evaluate its efficacy via signal frequency characteristics and recognition performance. Although BCS provides a robust signal, its sound quality is not clear. Conventional retrieval methods for BCS include the modulation transfer function, linear predictive coefficients (LPCs), direct filtering, and the use of throat microphones [3–6]. However, these methods require air-conducted speech and/or filter coefficients for sound quality improvement. In addition, conventional microphones essentially cannot capture clean speech in noisy environments. Thus, we proposed a retrieval method for BCS [6] and evaluated its efficacy with a speech recognition system as a statistical evaluation, because recognition performance reflects how well the retrieved signal matches the acoustic models used in speech recognition.
However, assessments based on human perception are essential for further development and practical use. Consequently, in this paper, we report on evaluations conducted to verify the effectiveness of the proposed method using word intelligibility tests with normal-hearing subjects. In the evaluations, intelligibility was compared among air-conducted speech, BCS, and two kinds of retrieved BCS signals.

2. Speech and BCS

Speech is air-conducted sound and is easily influenced by surrounding noise. By contrast, because BCS is solid-propagated sound, airborne noise has little effect on it. Figures 1 and 2 show the utterance of the Japanese place name "Asashi" by a twenty-year-old male as representative speech and BCS waveforms. The utterance was chosen from the JEIDA database of 100 Japanese local place names, which is phonetically balanced [7]. Table 1 shows the experimental recording environment. Signals were recorded at 16 kHz with 16 bits. Speech was measured with a microphone positioned at a distance of 30 cm from the mouth, which is the microphone position assumed for practical use, and BCS was measured with an accelerometer placed at the upper lip, which has been confirmed as the best measurement point [1]. The 30 cm distance corresponds to that of a conventional speech interface such as a car navigation system. The measurement position for BCS was established in previous research by comparing feature parameters between speech and BCS [1]. However, BCS does not possess frequency components at 2 kHz or above, so conventional speech recognition does not work well enough for practical use because of the differences in sound quality and feature parameters.

Figure 1. Speech.
Figure 2. BCS.

Table 1. Specifications of the experimental recording environment.
Recorder: TEAC RD-200T
Microphone: Ono Sokki MI-1431
Microphone amplifier: Ono Sokki SR-2200
Microphone position: 30 cm (distance between mouth and microphone)
Accelerometer: Ono Sokki NP-2110
Accelerometer amplifier: Ono Sokki PS-602
Accelerometer position: Upper lip

3. Differential acceleration and noise reduction [2]

BCS requires higher-frequency components in order to be intelligible. Thus, our challenge was to improve the sound quality with our signal retrieval method. Conventional signal retrieval methods need target signals and/or parameters, such as speech recorded by a microphone, which are inadequate in noisy environments. Therefore, a signal retrieval method for practical use should recover the signal from BCS alone. On this basis, we set out to devise a signal retrieval method that forgoes air-conducted speech and other parameters. We found that BCS does have effective components at 2 kHz and above; however, their gain is very small. Thus, our method emphasizes this gain using a simple technique.

3.1 Differential acceleration

Formula (1) shows the equation used for the differential acceleration, which provides pre-emphasis in the high frequencies, where x(i) is the waveform sample at time index i:

x_differential(i) = x(i + 1) - x(i)    (1)

x_differential(i) is the differential acceleration signal between successive samples of the BCS. Because of its low amplitude, it requires adjustment to a gain suitable for listening or further processing. Figure 3 shows the differential acceleration signal estimated from the BCS in Figure 2 using Formula (1), with the gain adjusted.
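To make this step concrete, the following is a minimal Python/NumPy sketch of Formula (1) followed by a simple gain adjustment. The function names, the peak-normalization rule, and the synthetic test waveform are our own illustrative assumptions; the paper does not specify these details.

```python
import numpy as np

def differential_acceleration(x):
    """Formula (1): first-order difference of the BCS waveform.

    The difference acts as a high-frequency pre-emphasis, lifting the
    components above 2 kHz that BCS retains only with very small gain."""
    return x[1:] - x[:-1]

def adjust_gain(x, target_peak=0.9):
    """Scale the low-amplitude difference signal to a level suitable for
    listening or further processing (peak normalization is an assumed
    choice; the paper does not state the adjustment rule)."""
    peak = np.max(np.abs(x))
    return x if peak == 0 else x * (target_peak / peak)

# Example with a synthetic 16 kHz waveform standing in for recorded BCS.
fs = 16000
t = np.arange(fs) / fs
bcs = 0.5 * np.sin(2 * np.pi * 200 * t)   # placeholder, not real BCS data
diff = adjust_gain(differential_acceleration(bcs))
```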
Although the differential acceleration signal may appear to be speech mixed with stationary noise, it actually emphasizes the high-frequency components. This makes it possible to estimate the retrieval signal by removing that noise with a conventional noise reduction method. In this manner, we developed our signal estimation method using differential acceleration and conventional noise reduction.

3.2 Noise reduction

As a first approach, we examined the effectiveness of a spectral subtraction technique for the reduction of stationary noise [2]. However, no improvement in the frequency components was obtained when it was applied to the signal. Whereas spectral subtraction simply subtracts the noise spectrum, the Wiener filtering method estimates the spectral envelope of speech using linear prediction coefficients. Therefore, we attempted to extract a clear signal using the Wiener filtering method and found that it was able to estimate and recover the effective frequency components from the differential acceleration. Thus, we adopted it as the noise reduction technique in our signal retrieval method. Formula (2) defines the Wiener filter:

H_Estimate(ω) = H_Speech(ω) / (H_Speech(ω) + H_Noise(ω))    (2)

The estimated spectrum H_Estimate(ω) converts the differential acceleration into the retrieval signal. It is calculated from the speech spectrum H_Speech(ω) and the noise spectrum H_Noise(ω). H_Speech(ω) is calculated from autocorrelation functions and linear prediction coefficients obtained with the Levinson-Durbin algorithm, and H_Noise(ω) is then estimated from autocorrelation functions.

Figure 3. Differential acceleration.
Figure 4. Retrieval signal.
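As an illustration of the noise reduction step in Formula (2), the following is a minimal single-frame Python/NumPy sketch. It assumes that H_Speech(ω) is approximated by the LPC (all-pole) spectrum of the current differential acceleration frame and H_Noise(ω) by the power spectrum of a noise-only frame; the function names, LPC order, and FFT length are illustrative assumptions rather than settings taken from the paper, and a practical implementation would apply the filter frame by frame with windowing and overlap-add.

```python
import numpy as np

def autocorr(frame, order):
    """Biased autocorrelation coefficients r[0..order] of one frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)]) / n

def levinson_durbin(r):
    """LPC coefficients a (with a[0] = 1) and residual energy e obtained
    from the autocorrelation sequence r by the Levinson-Durbin recursion."""
    order = len(r) - 1
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        e *= 1.0 - k * k
    return a, e

def lpc_power_spectrum(frame, order, nfft):
    """All-pole power spectrum of a frame, standing in for H_Speech(w)."""
    a, e = levinson_durbin(autocorr(frame, order))
    denom = np.abs(np.fft.rfft(a, nfft)) ** 2
    return e / np.maximum(denom, 1e-12)

def wiener_retrieval_frame(diff_frame, noise_frame, order=16, nfft=512):
    """Single-frame sketch of Formula (2) applied to differential acceleration."""
    h_speech = lpc_power_spectrum(diff_frame, order, nfft)
    # Noise power spectrum of a noise-only frame; by the Wiener-Khinchin
    # theorem this periodogram equals the transform of its autocorrelation.
    h_noise = np.abs(np.fft.rfft(noise_frame, nfft)) ** 2 / len(noise_frame)
    gain = h_speech / (h_speech + h_noise + 1e-12)
    spectrum = np.fft.rfft(diff_frame, nfft)
    return np.fft.irfft(gain * spectrum, nfft)[:len(diff_frame)]
```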
4. Word intelligibility test for performance evaluation of BCS

4.1 Methods

We chose the vocabulary from the JEIDA database of 100 Japanese local place names. Speakers with normal hearing read out the words in a chamber at AIST Kansai, a very quiet environment. The speech and BCS were recorded using a microphone and an accelerometer. A database was thus constructed from the utterances of local place names, resulting in a total of 1,800 samples (one speech/two BCS × 3 male speakers × 3 trials × 100 local place names). In the word intelligibility test, participants listened to and answered 30 samples chosen randomly for each kind of sound: speech, BCS, differential acceleration, and the retrieval signals. The participants were four Japanese males and four Japanese females over eighteen years old, with normal hearing and speaking abilities. Each intelligibility score was the average over eight participants × 3 speakers × 30 samples. The local place names in the 30 samples were chosen randomly.

4.2 Results and discussion

Table 2 shows the results of the word intelligibility test, ranked according to intelligibility. The rank from high to low is speech, acceleration difference, retrieval signal 1 (Ret. 1 (IT = 3)), and retrieval signal 2 (Ret. 2 (IT = 6)). Among the BCS-derived sounds, the differential acceleration achieved the best performance in the intelligibility test. This is because the differencing restores components above 2 kHz, albeit together with stationary noise. Furthermore, it was confirmed that word intelligibility performance depends on the noise reduction. These results show the same tendency as those obtained from the evaluation using speech recognition [6]. This means that word intelligibility can be assessed by means of speech recognition, which can also estimate suitable retrieval settings for the acceleration difference and noise reduction automatically.

Table 2. Results of the word intelligibility test.
Sample               Speech   BCS     Accdiff   Ret. 1 (IT = 3)   Ret. 2 (IT = 6)
Intelligibility [%]  94.74    81.12   84.17     82.92             79.85

5. Conclusion and future work

In this paper, the performance of the retrieval signals of BCS was evaluated by means of a word intelligibility test involving eight participants, four males and four females. From the results of the intelligibility test, the differential acceleration showed the best performance among the BCS-derived sounds, and the performance of the retrieval signals derived from it was also presented. In addition, there is a strong correlation between the results obtained in this word intelligibility test and those obtained with speech recognition. In future work, we will investigate and discuss the relationship between intelligibility and recognition performance, and propose a real-time signal retrieval algorithm to improve spoken communication.

References

1. S. Ishimitsu, "Construction of a noise-robust body-conducted speech recognition system," in Speech Recognition, F. Mihelic and J. Zibert (Eds.), chapter 14, 2008.
2. M. Nakayama, S. Ishimitsu, and S. Nakagawa, "A study of making clear body-conducted speech using differential acceleration," IEEJ Transactions on Electrical and Electronic Engineering, vol. 6, issue 2, pp. 144–150, 2011 (online: January 2011).
3. S. Ishimitsu, M. Nakayama, and Y. Murakami, "Study of body-conducted speech recognition for support of maritime engine operation," Journal of the JIME, vol. 39, no. 4, pp. 35–40, 2011.
4. T. Tamiya and T. Shimamura, "Improvement of body-conducted speech quality by adaptive filters," IEICE Technical Report, SP2006-191, pp. 41–46, 2006.
5. T. T. Vu, M. Unoki, and M. Akagi, "A study on restoration of bone-conducted speech with LPC-based model," IEICE Technical Report, SP2005-174, pp. 67–78, 2006.
6. Z. Liu, Z. Zhang, A. Acero, J. Droppo, and X. Huang, "Direct filtering for air- and bone-conductive microphones," Proceedings of the 6th IEEE International Workshop on Multimedia Signal Processing, pp. 363–366, 2004.
7. S. Itahashi, "A noise database and Japanese common speech data corpus," Journal of the ASJ, vol. 47, no. 12, pp. 951–953, 1991.