IMPROVEMENT OF BODY-CONDUCTED SPEECH SOUND QUALITY USING THE ACCELERATION DIFFERENCE: EVALUATION BY A WORD INTELLIGIBILITY TEST

Masashi Nakayama*1, Aya Kajino*2, Seiji Nakagawa*3 and Shunsuke Ishimitsu*1

*1 Hiroshima City University, 4-1 Otsuka-higashi, Asaminami, Hiroshima 731-3194, Japan
e-mail: masashi@hiroshima-cu.ac.jp
*2 Kagawa National College of Technology, 355 Chokushi, Takamatsu, Kagawa 761-8058, Japan
*3 National Institute of Advanced Industrial Science and Technology (AIST), 1-8-31 Midorigaoka, Ikeda, Osaka 563-8577, Japan
Because body-conducted speech (BCS) sounds are conducted through skin, muscle, and bone and are not easily disrupted by airborne noise, noise-robust methods of speech detection using BCS have been developed. However, the intelligibility of BCS deteriorates severely because it does not contain high-frequency components above 2 kHz. We therefore previously proposed and evaluated a method that improves the sound quality of BCS using acceleration difference and noise reduction techniques. However, assessments based on human perception are essential. Consequently, in this paper, we report on evaluations conducted to verify the efficacy of the proposed method using word intelligibility tests with normal-hearing subjects.
1. Introduction
Speech conversation is one of the most important means of communication for humans. However, factors such as airborne noise can easily disrupt speech communication. Consequently, approaches that enable robust communication methods and instruments have been widely proposed and investigated in the fields of speech signal processing and human interfacing. In speech signal processing in particular, robust communication is one of the most significant research topics, because speech recognition research has not yet produced methods whose performance is sufficient for practical use. One such area of research is body-conducted speech (BCS), including bone-conducted sound, in which speech is conducted via the skin and bones of the human body [1]. Airborne noise does not affect this sound; however, the intelligibility of BCS deteriorates rapidly because it does not contain high-frequency components above 2 kHz.
Consequently, many approaches for improving the sound quality and intelligibility of BCS have been proposed and investigated. We previously proposed and evaluated one such method, which improves the sound quality of BCS using acceleration difference and noise reduction methods [2]. In the proposed method, we extract a clear signal from BCS alone and evaluate its efficacy via signal frequency characteristics and recognition performance. Although BCS provides a robust signal, the quality of the signal is not clear. Conventional retrieval methods for BCS include the modulation transfer function, linear predictive coefficients (LPCs), direct filtering, and the use of throat microphones [3–6]. However, these methods require direct speech sound and/or filter coefficients for sound quality improvement. In addition, conventional microphones essentially cannot extract speech in noisy environments. Thus, we proposed a retrieval method for BCS [6] and evaluated its efficacy using a speech recognition system for statistical evaluation, because such a system can assess the improvement through the acoustic matching performed in speech recognition. However, assessments based on human perception are essential for further development and practical use. Consequently, in this paper, we report on evaluations of the effectiveness of the proposed method using word intelligibility tests with normal-hearing subjects. In the evaluations, intelligibility was compared among airborne speech, BCS, and two kinds of retrieved BCS.
2. Speech and BCS
Speech is air-conducted sound and is easily influenced by surrounding noise. By contrast, because BCS propagates through the solid structures of the body, it is difficult for airborne noise to affect it. Figures 1 and 2 show the utterance of the Japanese local place name “Asashi” by a twenty-year-old male, as representative speech and BCS signals. The utterance was chosen from the JEIDA 100 Japanese local place name database, which is phonetically balanced [7]. Table 1 shows the experimental recording environment. Signals were recorded at 16 kHz with 16-bit resolution. Speech was measured with a microphone positioned at a distance of 30 cm from the mouth, a typical microphone position for practical use, and BCS was measured with an accelerometer placed at the upper lip, which has been confirmed as the best measurement point [1]. The 30 cm distance for speech is assumed to be that of a conventional speech interface such as a car navigation system. The measurement position for BCS was shown to be suitable in previous research by comparing the feature parameters of speech and BCS [1]. However, BCS does not possess frequency components of 2 kHz or higher, so conventional speech recognition does not perform well enough for practical use because of the differences in sound quality and feature parameters.
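To illustrate this missing high-frequency band, the following sketch (Python with NumPy/SciPy; the file names are hypothetical and the recordings are assumed to be 16-bit, 16-kHz mono WAV files as in Table 1) compares the averaged spectra of a speech recording and a BCS recording above 2 kHz.

```python
# Sketch: compare long-term spectra of speech and BCS above 2 kHz.
# File names are hypothetical; recordings assumed mono, 16 bit, 16 kHz.
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def long_term_spectrum(path):
    fs, x = wavfile.read(path)
    x = x.astype(np.float64) / 32768.0      # 16-bit PCM to [-1, 1)
    f, pxx = welch(x, fs=fs, nperseg=1024)  # averaged power spectral density
    return f, 10.0 * np.log10(pxx + 1e-12)  # in dB

f, speech_db = long_term_spectrum("speech_asashi.wav")  # air-conducted speech
_, bcs_db = long_term_spectrum("bcs_asashi.wav")        # accelerometer signal

band = f >= 2000.0                          # components above 2 kHz
print("mean level above 2 kHz  speech: %.1f dB   BCS: %.1f dB"
      % (speech_db[band].mean(), bcs_db[band].mean()))
```

For BCS, the level above 2 kHz is expected to be much lower than for speech, which is the gap that the retrieval method described in Section 3 tries to close.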
Figure 1. Speech.
Figure 2. BCS.
Table 1. Specifications of the experimental recording environment.

Recorder: TEAC RD-200T
Microphone: Ono Sokki MI-1431
Microphone amplifier: Ono Sokki SR-2200
Microphone position: 30 cm (distance between mouth and microphone)
Accelerometer: Ono Sokki NP-2110
Accelerometer amplifier: Ono Sokki PS-602
Accelerometer position: Upper lip
3. Differential acceleration and noise reduction [2]
For BCS to be clearly audible, its high-frequency characteristics must be recovered. Thus, our challenge was to improve the sound quality with our signal retrieval method. However, conventional signal retrieval methods require target signals and/or parameters, such as speech recorded by a microphone, which is inadequate in noisy environments. Therefore, a signal retrieval method for practical use should recover the signal from BCS alone. On the basis of this idea, we devised a signal retrieval method that requires neither recorded speech nor other parameters. We found that BCS does contain effective components at 2 kHz and above; however, their gain is very small. Our method therefore emphasizes this gain using a simple technique.
3.1 Differential acceleration
Formula (1) shows the equation used for differential acceleration, which provides pre-emphasis at high frequencies, where x(i) is the waveform sample at time index i.
x_differential(i) = x(i + 1) - x(i)        (1)
x_differential(i) is the differential acceleration signal obtained from adjacent samples of the BCS. Because its amplitude is low, it must be adjusted to a suitable gain level for listening or further processing. Figure 3 shows the differential acceleration signal estimated from the BCS in Figure 2 using Formula (1), with the gain adjusted. Although the differential acceleration signal may appear to be speech mixed with stationary noise, it actually emphasizes the high-frequency components. This facilitates estimation of the retrieval signal once the noise is removed with conventional noise reduction methods. In this manner, we developed our signal estimation method using differential acceleration and conventional noise reduction.
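A minimal sketch of Formula (1) in Python follows. The gain adjustment strategy, rescaling the difference signal to the peak level of the original BCS, is an assumption; the text only states that the gain must be adjusted to a suitable level.

```python
import numpy as np

def differential_acceleration(x):
    """Formula (1): x_differential(i) = x(i + 1) - x(i) for a 1-D BCS waveform."""
    x = np.asarray(x, dtype=np.float64)
    d = x[1:] - x[:-1]          # forward difference: high-frequency pre-emphasis
    # The difference signal has very little gain, so rescale it to a level
    # suitable for listening or further processing (normalisation to the peak
    # of the original BCS is an assumed choice, not specified in the text).
    peak = np.max(np.abs(d))
    if peak > 0.0:
        d *= np.max(np.abs(x)) / peak
    return d
```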
3.2 Noise reduction
As a first approach, we examined the effectiveness of a spectral subtraction technique for reducing the stationary noise [2]. However, no improvement in the frequency components was obtained when it was applied to the signal. Whereas spectral subtraction simply subtracts the noise spectrum, the Wiener filtering method estimates the spectral envelope of speech using linear prediction coefficients. We therefore attempted to extract a clear signal using the Wiener filtering method and found that it was able to estimate and recover the effective frequency components from the differential acceleration. Thus, we adopted it as the noise reduction technique for our signal retrieval method.
Formula (2) defines the Wiener filtering method:
H Speech ( )
(2)
H Estimate ( ) 
H Speech ( )  H Noise ( )
The estimated spectrum HEstimate(ω) is used to recover the retrieval signal from the differential acceleration. It is calculated from the speech spectrum HSpeech(ω) and the noise spectrum HNoise(ω). HSpeech(ω) is calculated from autocorrelation functions and linear prediction coefficients obtained by the Levinson-Durbin algorithm, and HNoise(ω) is estimated from autocorrelation functions.
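The sketch below illustrates how the Wiener gain of Formula (2) could be computed for one frame in Python. The LPC order, FFT length, and the use of an LPC envelope for HNoise(ω) as well as HSpeech(ω) are simplifying assumptions; the noise frame is assumed to come from a segment containing only the stable noise left by the differentiation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_envelope(frame, order=16, nfft=512):
    """LPC spectral envelope via the autocorrelation method (Levinson-type solve)."""
    frame = np.asarray(frame, dtype=np.float64)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])    # LPC normal equations
    A = np.fft.rfft(np.concatenate(([1.0], -a)), nfft)            # prediction-error filter
    return 1.0 / (np.abs(A) ** 2 + 1e-12)

def wiener_gain(speech_frame, noise_frame, order=16, nfft=512):
    """Formula (2): HEstimate(w) = HSpeech(w) / (HSpeech(w) + HNoise(w))."""
    h_speech = lpc_envelope(speech_frame, order, nfft)
    h_noise = lpc_envelope(noise_frame, order, nfft)
    return h_speech / (h_speech + h_noise)
```

In use, this gain would be applied frame by frame to the short-time spectrum of the differential acceleration signal and the result resynthesised, for example by overlap-add.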
Figure 3. Differential acceleration.
Figure 4. Retrieval signal.
4. Word intelligibility test for performance evaluation in BCS
4.1 Methods
We chose the vocabulary from the JEIDA 100 Japanese local place names. Speakers with healthy hearing read out the words in a chamber at AIST Kansai, a very quiet environment. The speech and BCS were recorded using microphones and an accelerometer. A database was thus constructed from the utterances of local place names, giving a total of 1,800 samples (one speech/two BCS × 3 male speakers × 3 trials × 100 local place names).
In the word intelligibility test, participants listened to and answered 30 samples chosen at random for each kind of sound: speech, BCS, differential acceleration, and the retrieval signals. The participants were four Japanese males and four Japanese females over eighteen years old, with healthy hearing and speaking abilities. The word intelligibility score for each sound type was calculated as the average over eight participants × 3 speakers × 30 samples. The local place names in the 30 samples were chosen randomly.
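For illustration only, the following sketch shows one way the intelligibility percentages reported in Table 2 could be computed from the pooled listening-test responses; the data layout and names are hypothetical.

```python
from collections import defaultdict

def intelligibility(responses):
    """Percentage of correctly answered words per sound type."""
    scores = {}
    for sound_type, pairs in responses.items():
        correct = sum(1 for presented, answered in pairs if presented == answered)
        scores[sound_type] = 100.0 * correct / len(pairs)
    return scores

# responses[sound_type] holds (presented_word, answered_word) pairs pooled over
# the eight participants, three speakers, and 30 randomly chosen samples.
responses = defaultdict(list)
responses["Accdiff"].append(("Asashi", "Asashi"))  # illustrative entry only
print(intelligibility(responses))                  # e.g. {'Accdiff': 100.0}
```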
4.2 Results and discussions
Table 2 shows the results of the word intelligibility test. Ranked from high to low intelligibility, the order is speech, acceleration difference, retrieval signal 1 (Ret. 1 (IT = 3)), and retrieval signal 2 (Ret. 2 (IT = 6)). Among the BCS-derived sounds, differential acceleration achieved the best performance in the intelligibility test. This is because the differentiation restores components above 2 kHz, although stable noise is also included. Furthermore, it was confirmed that word intelligibility performance depends on the noise reduction. These results show the same tendency as those obtained from the evaluation using speech recognition [6]. This suggests that word intelligibility can be estimated by using speech recognition, which could then be used to determine suitable retrieval settings for acceleration difference and noise reduction automatically.
Table 2. Results of the word intelligibility test.

Samples             Intelligibility [%]
Speech              94.74
BCS                 81.12
Accdiff             84.17
Ret. 1 (IT = 3)     82.92
Ret. 2 (IT = 6)     79.85
5. Conclusion and future work
In this paper, the performance of the retrieval signals of BCS was evaluated by means of a word intelligibility test involving eight participants, four males and four females. In the intelligibility test, differential acceleration showed the best performance among the BCS-derived sounds, and the performance of the retrieval signals obtained from differential acceleration with noise reduction was also demonstrated. In addition, there is a strong correlation between the results obtained in this word intelligibility test and those of speech recognition.
In future work, we will investigate and discuss the relationship between intelligibility and recognition performance, and propose a real-time signal retrieval algorithm to improve conversations in speech communication.
References
[1] S. Ishimitsu, “Construction of a noise-robust body-conducted speech recognition system,” in Speech Recognition, F. Mihelic and J. Zibert (Eds.), chapter 14, 2008.
[2] M. Nakayama, S. Ishimitsu, and S. Nakagawa, “A study of making clear body-conducted speech using differential acceleration,” IEEJ Transactions on Electrical and Electronic Engineering, vol. 6, issue 2, pp. 144–150, 2011 (online: January 2011).
[3] S. Ishimitsu, M. Nakayama, and Y. Murakami, “Study of body-conducted speech recognition for support of maritime engine operation,” Journal of the JIME, vol. 39, no. 4, pp. 35–40, 2011.
[4] T. Tamiya and T. Shimamura, “Improvement of body-conducted speech quality by adaptive filters,” IEICE Technical Report, SP2006-191, pp. 41–46, 2006.
[5] T. T. Vu, M. Unoki, and M. Akagi, “A study on restoration of bone-conducted speech with LPC-based model,” IEICE Technical Report, SP2005-174, pp. 67–78, 2006.
[6] Z. Liu, Z. Zhang, A. Acero, J. Droppo, and X. Huang, “Direct filtering for air- and bone-conductive microphones,” Proceedings of the 6th IEEE International Workshop on Multimedia Signal Processing, pp. 363–366, 2004.
[7] S. Itahashi, “A noise database and Japanese common speech data corpus,” Journal of ASJ, vol. 47, no. 12, pp. 951–953, 1991.