
Score Information Decision Fusion Using Support Vector
Machine for a Correlation Filter Based Speaker
Authentication System
Dzati Athiar Ramli, Salina Abdul Samad, and Aini Hussain
Department of Electrical, Electronic and Systems Engineering, Faculty of Engineering,
Universiti Kebangsaan Malaysia, 43600 Bangi Selangor, Malaysia
dzati@vlsi.eng.ukm.my, salina@vlsi.eng.ukm.my, aini@vlsi.eng.ukm.my
Abstract. In this paper, we propose a novel decision fusion scheme that fuses score information from multiple correlation filter outputs of a speaker authentication system. A correlation filter classifier is designed to yield a sharp peak in the correlation output for an authentic person, while no discernible peak appears for an imposter. The scores from multiple correlation filter outputs are appended as a feature vector, and a Support Vector Machine (SVM) then performs the decision process. In this study, cepstrumgraphic and spectrographic images are used as features to the system, and Unconstrained Minimum Average Correlation Energy (UMACE) filters are used as classifiers. The first objective of this study is to develop a multiple score decision fusion system using SVM for speaker authentication. The second is to evaluate and compare the performance of the proposed system using both features. The Audio-Visual Digit Database is used for performance evaluation, and an improvement is observed after implementing multiple score decision fusion, which demonstrates the advantages of the scheme.
Keywords: Correlation Filters, Decision Fusion, Support Vector Machine, Speaker Authentication.
Keywords: Correlation Filters, Decision Fusion, Support Vector Machine, Speaker
Authentication.
1 Introduction
Biometric speaker authentication is used to verify a person’s claimed identity. The authentication system compares the claimant’s speech with the client model during the authentication process [1]. The development of a client model database can be a complicated procedure due to voice variations. These variations occur when the condition of the vocal tract is affected by internal problems such as a cold or a dry mouth, and also by external problems, for example temperature and humidity. The performance of a speaker authentication system is also affected by room and line noise, changes of recording equipment and uncooperative claimants [2], [3]. Thus, the implementation of biometric systems has to correctly discriminate the biometric features of one individual from another, and at the same time the system also needs to handle distortions in the features due to the problems stated. In order to overcome these limitations, we improve the performance of speaker authentication systems by extracting more information (samples) from the claimant and then executing fusion techniques in the decision process.
E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 235–242, 2009.
© Springer-Verlag Berlin Heidelberg 2009
springerlink.com
Many fusion techniques in the literature have been implemented in biometric systems for the purpose of enhancing system performance. These include the fusion of multiple modalities, multiple classifiers and multiple samples [4]. Teoh et al. in [5] proposed a combination of features of the face modality and the speech modality so as to improve the accuracy of biometric authentication systems. Person identification based on visual and acoustic features has also been reported by Brunelli and Falavigna in [6]. Suutala and Roning in [7] used Learning Vector Quantization (LVQ) and Multilayer Perceptron (MLP) as classifiers for footstep-profile-based person identification, whereas in [8], Kittler et al. utilized Neural Networks and a Hidden Markov Model (HMM) for a handwritten digit recognition task. Implementations of the multiple-sample fusion approach can be found in [4] and [9]. In general, these studies revealed that the implementation of fusion approaches in biometric systems can improve system performance significantly.
This paper focuses on the fusion of score information from multiple correlation filter outputs for a correlation filter based speaker authentication system. Here, we use scores extracted from the correlation outputs by treating several samples extracted from the same modality as independent samples. The scores are then concatenated to form a feature vector, and a Support Vector Machine (SVM) is executed to classify the feature vector as belonging to either the authentic or the imposter class. Correlation filters have been effectively applied in biometric systems for visual applications such as face verification and fingerprint verification, as reported in [10], [11]. Lower face verification and lip movement for person identification using correlation filters have been implemented in [12] and [13], respectively. A study using correlation filters for speaker verification with speech signals as features can be found in [14]. The advantages of correlation filters are shift invariance, the ability to trade off between discrimination and distortion tolerance, and a closed-form expression.
2 Methodology
The database used in this study is obtained from the Audio-Visual Digit Database (2001) [15]. The database consists of video and corresponding audio of people reciting the digits zero to nine. The video of each person is stored as a sequence of JPEG images with a resolution of 512 x 384 pixels, while the corresponding audio is provided as a monophonic, 16 bit, 32 kHz WAV file.
2.1 Spectrographic Features
A spectrogram is an image representing the time-varying spectrum of a signal. The vertical axis (y) shows frequency, the horizontal axis (x) represents time, and the pixel intensity or color represents the amount of energy (acoustic peaks) in frequency band y at time x [16], [17]. Fig. 1 shows samples of the spectrogram of the word ‘zero’ from person 3 and person 4 obtained from the database. From the figure, it can be seen that the spectrogram image contains personal information in terms of the way the speaker utters the word, such as speed and pitch, as shown by the spectrum.
Fig. 1. Examples of the spectrogram image (frequency versus time) from person 3 and person 4 for the word ‘zero’
Comparing both figures, it can be observed that although the spectrogram image exhibits inter-class variations, it also contains intra-class variations. For the images to be successfully classified by correlation filters, we propose a novel feature extraction technique. The computation of the spectrogram is described below.
a. Pre-emphasis task. The speech signal is filtered with a high-pass filter using the following equation:

x(t) = s(t) − 0.95 s(t − 1)    (1)

where x(t) is the filtered signal, s(t) is the input signal and t represents time.
b. Framing and windowing task. A Hamming window with 20ms length and 50%
overlapping is used on the signal.
c. Specification of FFT length. A 256-point FFT is used and this value determines
the frequencies at which the discrete-time Fourier transform is computed.
d. The logarithm of the energy (acoustic peak) of each frequency bin is then computed.
e. Retaining the high energies. After a spectrogram image is obtained, we aim to eliminate the small blobs in the image which introduce intra-class variations. This is achieved by retaining the high energies of the acoustic peaks through an appropriate threshold: FFT magnitudes above the threshold are maintained, otherwise they are set to zero.
f. Morphological opening and closing. Morphological opening is used to clear up residual noisy spots in the image, whereas morphological closing recovers the shape of the image distorted by the opening process.
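Steps a–f can be sketched in Python as follows. The threshold value is an illustrative assumption, and since a 20 ms frame at 32 kHz (640 samples) is longer than the 256-point FFT stated above, this sketch simply lets the FFT length default to the frame length.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening
from scipy.signal import stft

def spectrographic_features(s, fs=32000, threshold_db=-35.0):
    """Steps a-f: pre-emphasis, framing/FFT, log energy, thresholding, morphology."""
    # a. Pre-emphasis (Eq. 1): x(t) = s(t) - 0.95*s(t-1)
    x = np.append(s[0], s[1:] - 0.95 * s[:-1])
    # b./c. 20 ms Hamming window with 50% overlap (FFT length = frame length here)
    n = int(0.020 * fs)
    _, _, Z = stft(x, fs=fs, window='hamming', nperseg=n, noverlap=n // 2)
    # d. Logarithm of the energy in each time-frequency bin
    log_e = 20.0 * np.log10(np.abs(Z) + 1e-10)
    # e. Retain only the high-energy acoustic peaks above a relative threshold
    mask = log_e > (log_e.max() + threshold_db)
    # f. Opening clears residual noisy spots; closing recovers the shape
    clean = binary_closing(binary_opening(mask))
    return np.where(clean, log_e, 0.0)

sig = np.random.randn(16000)          # stand-in for a speech signal
img = spectrographic_features(sig)
print(img.shape)                      # (frequency bins, time frames)
```

The returned image can then be downsampled to the filter size (32x32 here) before UMACE synthesis.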
2.2 Cepstrumgraphic Features
Linear Predictive Coding (LPC) is used for the acoustic measurements of speech
signals. This parametric modeling is an approach used to match closely the resonant
structure of the human vocal tract that produces the corresponding sounds [17]. The
computation of the cepstrumgraphic features is described below.
a. Pre-emphasis task. By using a high-pass filter, the speech signal is filtered using
equation 1.
b. Framing and windowing task. A Hamming window with 20ms length and 50%
overlapping is used on the signal.
c. Specification of FFT length. A 256-point FFT is used and this value determines
the frequencies at which the discrete-time Fourier transform is computed.
238
D.A. Ramli, S.A. Samad, and A. Hussain
d. Auto-correlation task. For each frame, a vector of LPC coefficients is computed from the autocorrelation vector using the Durbin recursion method. The LPC-derived cepstral coefficients (cepstrum) are then computed, giving 14 coefficients per vector.
e. Resizing task. The feature vectors are then downsampled to a size of 64x64 in order to be verified by the UMACE filters.
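Steps a–e can be sketched as follows, assuming the standard Levinson-Durbin recursion and the usual LPC-to-cepstrum conversion; the spline resizing at the end is one possible way to obtain the 64x64 image.

```python
import numpy as np
from scipy.ndimage import zoom

def levinson_durbin(r, order):
    """Durbin recursion: LPC coefficients from an autocorrelation vector r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]   # reflect previous coefficients
        a[i] = k
        e *= (1.0 - k * k)
    return a

def lpc_cepstrum(a, ncep):
    """LPC-derived cepstral coefficients from prediction coefficients a."""
    c = np.zeros(ncep + 1)
    for n in range(1, ncep + 1):
        acc = a[n] if n < len(a) else 0.0
        for k in range(1, n):
            if n - k < len(a):
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def cepstrumgraphic_features(s, fs=32000, order=14):
    x = np.append(s[0], s[1:] - 0.95 * s[:-1])       # a. pre-emphasis (Eq. 1)
    n = int(0.020 * fs)                              # b. 20 ms Hamming frames
    win = np.hamming(n)
    rows = []
    for start in range(0, len(x) - n + 1, n // 2):   # 50% overlap
        frame = x[start:start + n] * win
        # d. autocorrelation -> Durbin recursion -> 14 cepstra per frame
        r = np.correlate(frame, frame, 'full')[n - 1:n + order]
        rows.append(lpc_cepstrum(levinson_durbin(r, order), order))
    feat = np.array(rows)
    # e. resize the (frames x 14) matrix to a 64x64 image for the UMACE filter
    return zoom(feat, (64 / feat.shape[0], 64 / feat.shape[1]))

img = cepstrumgraphic_features(np.random.randn(16000))
print(img.shape)
```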
2.3 Correlation Filter Classifier
Unconstrained Minimum Average Correlation Energy (UMACE) filters, which evolved from the matched filter, are synthesized in the Fourier domain using a closed-form solution. Several training images are used to synthesize a filter template. The designed filter is then cross-correlated with a test image in order to determine whether the test image is from the authentic class or the imposter class. In this process, the filter optimizes a criterion to produce a desired correlation output plane by minimizing the average correlation energy while maximizing the correlation output at the origin [10], [11].
The UMACE filter solution can be summarized as

U_MACE = D^(-1) m    (2)

where D is a diagonal matrix with the average power spectrum of the training images placed along its diagonal, and m is a column vector containing the mean of the Fourier transforms of the training images. The resulting correlation plane produces a sharp peak at the origin, with values everywhere else close to zero, when the test image belongs to the same class as the designed filter [10], [11]. Fig. 2 shows the correlation outputs when a UMACE filter is applied to a test image from the authentic class (left) and from the imposter class (right).
Fig. 2. Examples of the correlation plane for a test image from the authentic class (left) and the imposter class (right)
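The synthesis in Eq. (2) and the correlation step can be sketched as follows. The small regularization constant and the toy 32x32 images are assumptions added for illustration, not part of the original formulation.

```python
import numpy as np

def umace_filter(train_imgs):
    """Synthesize a UMACE filter (Eq. 2): U = D^-1 m in the Fourier domain."""
    X = np.array([np.fft.fft2(im) for im in train_imgs])
    D = np.mean(np.abs(X) ** 2, axis=0)   # average power spectrum (diagonal of D)
    m = np.mean(X, axis=0)                # mean of the training-image FFTs
    return m / (D + 1e-12)                # element-wise D^-1 m

def correlation_output(filt, test_img):
    """Cross-correlate a test image with the filter via FFT/IFFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(test_img) * np.conj(filt)))

# A sharp peak appears for a test image of the same class as the filter,
# while an unrelated image yields a flat, low correlation plane.
rng = np.random.default_rng(0)
base = rng.standard_normal((32, 32))
train = [base + 0.1 * rng.standard_normal((32, 32)) for _ in range(6)]
filt = umace_filter(train)
authentic = correlation_output(filt, base + 0.1 * rng.standard_normal((32, 32)))
imposter = correlation_output(filt, rng.standard_normal((32, 32)))
print(authentic.max() > imposter.max())   # True
```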
The Peak-to-Sidelobe Ratio (PSR) metric is used to measure the sharpness of the peak. The PSR is given by

PSR = (peak − mean) / σ    (3)

Here, peak is the largest value in the correlation output of the test image, while the mean and standard deviation σ are computed from a 20x20 sidelobe region centered on the peak, excluding a 5x5 central mask [10], [11].
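Equation (3) can be sketched as follows, assuming the peak lies far enough from the image border for the full 20x20 region to fit.

```python
import numpy as np

def psr(corr, sidelobe=20, mask=5):
    """Peak-to-Sidelobe Ratio: (peak - mean) / sigma over the sidelobe region."""
    peak = corr.max()
    py, px = np.unravel_index(np.argmax(corr), corr.shape)
    h = sidelobe // 2
    region = corr[py - h:py + h, px - h:px + h].astype(float).copy()
    m = mask // 2
    region[h - m:h + m + 1, h - m:h + m + 1] = np.nan   # exclude 5x5 centre
    side = region[np.isfinite(region)]
    return (peak - side.mean()) / side.std()

# A dominant peak over low sidelobes gives a large PSR.
rng = np.random.default_rng(1)
plane = 0.01 * rng.standard_normal((32, 32))
plane[16, 16] = 1.0
print(psr(plane))
```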
2.4 Support Vector Machine
In its simplest form, the linear and separable case, the support vector machine (SVM) classifier is the optimal hyperplane that maximizes the distance of the separating hyperplane from the closest training data points, called the support vectors [18], [19].
From [18], the solution of the linearly separable case is given as follows. Consider the problem of separating a set of training vectors belonging to two separate classes,

D = {(x^1, y^1), ..., (x^L, y^L)},  x ∈ R^n, y ∈ {−1, 1}    (4)

with a hyperplane,

⟨w, x⟩ + b = 0    (5)

The hyperplane that optimally separates the data is the one that minimizes

φ(w) = (1/2) ||w||^2    (6)

which is equivalent to minimizing an upper bound on the VC dimension. The solution to this optimization problem is given by the saddle point of the Lagrange functional (Lagrangian)

φ(w, b, α) = (1/2) ||w||^2 − Σ_{i=1}^{L} α_i ( y_i [⟨w, x_i⟩ + b] − 1 )    (7)

where the α_i are the Lagrange multipliers. The Lagrangian has to be minimized with respect to w and b, and maximized with respect to α ≥ 0. Equation (7) is then transformed to its dual problem. Hence, the solution of the linearly separable case is given by

α* = arg min_α (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j y_i y_j ⟨x_i, x_j⟩ − Σ_{k=1}^{L} α_k    (8)

with constraints

α_i ≥ 0, i = 1, ..., L   and   Σ_{j=1}^{L} α_j y_j = 0    (9)
Subsequently, consider the SVM in the non-linear and non-separable case. The non-separable case is handled by adding an upper bound C to the Lagrange multipliers, and the non-linear case by replacing the inner product with a kernel function K. From [18], the solution of the non-linear and non-separable case is given by

α* = arg min_α (1/2) Σ_{i=1}^{L} Σ_{j=1}^{L} α_i α_j y_i y_j K(x_i, x_j) − Σ_{k=1}^{L} α_k    (10)

with constraints

0 ≤ α_i ≤ C, i = 1, ..., L   and   Σ_{j=1}^{L} α_j y_j = 0    (11)
Non-linear mappings (kernel functions) that can be employed are polynomials, radial
basis functions and certain sigmoid functions.
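As an illustration of the non-linear, non-separable formulation, the following sketch uses scikit-learn's SVC (an assumption; any SVM package would do), where C is the upper bound on the multipliers in Eq. (11) and the polynomial kernel plays the role of K(x_i, x_j).

```python
import numpy as np
from sklearn.svm import SVC

# Two classes that are not linearly separable in the input space:
# points inside vs. outside the unit circle.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# A degree-2 polynomial kernel linearizes the circular boundary;
# C bounds the Lagrange multipliers (0 <= alpha_i <= C).
clf = SVC(kernel='poly', degree=2, C=10.0)
clf.fit(X, y)
print(clf.score(X, y), len(clf.support_))   # training accuracy, #support vectors
```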
3 Results and Discussion
Assume that N streams of testing data are extracted from M utterances. Let s = {s_1, s_2, ..., s_N} be the pool of scores from the utterances. The proposed verification system is shown in Fig. 3.
[Fig. 3 depicts the pipeline: for each of the n digit groups, training images (a11 ... am1), ..., (a1n ... amn) are used to design a correlation filter; each test image b1, ..., bn is transformed by FFT, cross-correlated with the corresponding filter, inverse-transformed by IFFT to give the correlation output, and scored as psr1, ..., psrn; the scores feed a support vector machine with a polynomial kernel, which produces the decision.]

(a11 ... am1) ... (a1n ... amn) – training data
b1, b2, ..., bn – testing data
m – number of training data
n – number of groups (zero to nine)

Fig. 3. Verification process using spectrographic / cepstrumgraphic images
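The fusion step of the pipeline can be sketched as follows, using synthetic PSR scores (the score distributions are illustrative assumptions, not values from the experiments) and scikit-learn's SVC with a polynomial kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Each claimant yields n = 10 PSR scores (one per digit filter),
# appended into a single feature vector as in Fig. 3.
rng = np.random.default_rng(2)
n_digits = 10
authentic = rng.normal(15.0, 3.0, size=(40, n_digits))   # sharp peaks -> high PSR
imposter = rng.normal(6.0, 2.0, size=(960, n_digits))    # flat planes -> low PSR
X = np.vstack([authentic, imposter])
y = np.concatenate([np.ones(40), -np.ones(960)])

fusion = SVC(kernel='poly', degree=3)
fusion.fit(X, y)

claim = rng.normal(15.0, 3.0, size=(1, n_digits))        # an authentic claimant
print(int(fusion.predict(claim)[0]))                     # 1 -> accepted
```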
For the spectrographic features, we use 250 filters, one for each word of each of the 25 persons. Our spectrographic image database consists of 10 groups of spectrographic images (zero to nine) for 25 persons, with 46 images per group of size 32x32 pixels, thus 11500 images in total. For each filter, we used 6 training images for the synthesis of a UMACE filter; these six training images were chosen based on the largest variations among the images. Then, 40 images are used for the testing process. In the testing stage, we performed cross-correlations of each corresponding word with 40 authentic images and another 40x24 = 960 imposter images from the other 24 persons.
For the cepstrumgraphic features, we also have 250 filters, one for each word of each of the 25 persons. Our cepstrumgraphic image database consists of 10 groups of cepstrumgraphic images (zero to nine) for 25 persons, with 43 images per group of size 64x64 pixels, thus 10750 images in total. For each filter, we used 3 training images for the synthesis of the UMACE filter, and 40 images are used for the testing process. We performed cross-correlations of each corresponding word with 40 authentic images and another 40x24 = 960 imposter images from the other 24 persons.
For both cases, a polynomial kernel has been employed for the decision fusion procedure using SVM. Table 1 compares the performance of single score decision and multiple score decision fusion for both spectrographic and cepstrumgraphic features. The false acceptance rate (FAR) and false rejection rate (FRR) of multiple score decision fusion are given in Table 2.
Table 1. Performance of single score decision and multiple score decision fusion
features          single score    multiple score
spectrographic    92.75%          96.04%
cepstrumgraphic   90.67%          95.09%
Table 2. FAR and FRR percentages of multiple score decision fusion
features          FAR      FRR
spectrographic    3.23%    3.99%
cepstrumgraphic   5%       4.91%
4 Conclusion
A multiple score decision fusion approach using a support vector machine has been developed in order to enhance the performance of a correlation filter based speaker authentication system. Spectrographic and cepstrumgraphic images are employed as features, and UMACE filters are used as classifiers in the system. By implementing the proposed decision fusion, the error due to the variation of the data can be reduced, which further enhances the performance of the system. The experimental results are promising, and the scheme can serve as an alternative method for biometric authentication systems.
Acknowledgements. This research is supported by Fundamental Research Grant
Scheme, Malaysian Ministry of Higher Education, FRGS UKM-KK-02-FRGS00362006 and Science Fund, Malaysian Ministry of Science, Technology and Innovation,
01-01-02-SF0374.
References
1. Campbell, J.P.: Speaker Recognition: A Tutorial. Proceedings of the IEEE 85, 1437–1462 (1997)
2. Rosenberg, A.: Automatic speaker verification: A review. Proceedings of the IEEE 64(4), 475–487 (1976)
3. Reynolds, D.A.: An Overview of Automatic Speaker Recognition Technology. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing 4, 4065–4072 (2002)
4. Poh, N., Bengio, S., Korczak, J.: A multi-sample multi-source model for biometric authentication. In: 10th IEEE on Neural Networks for Signal Processing, pp. 375–384 (2002)
5. Teoh, A., Samad, S.A., Hussein, A.: Nearest Neighborhood Classifiers in a Bimodal Biometric Verification System Fusion Decision Scheme. Journal of Research and Practice in
Information Technology 36(1), 47–62 (2004)
6. Brunelli, R., Falavigna, D.: Person Identification using Multiple Cues. IEEE Trans. on Pattern Analysis and Machine Intelligence 17(10), 955–966 (1995)
7. Suutala, J., Roning, J.: Combining Classifiers with Different Footstep Feature Sets and Multiple Samples for Person Identification. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 357–360 (2005)
8. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining Classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
9. Cheung, M.C., Mak, M.W., Kung, S.Y.: Multi-Sample Data-Dependent Fusion of Sorted
Score Sequences for Biometric verification. In: IEEE Conference on Acoustics Speech and
Signal Processing (ICASSP 2004), pp. 229–232 (2004)
10. Savvides, M., Vijaya Kumar, B.V.K., Khosla, P.: Face Verification using Correlation Filters. In: 3rd IEEE Automatic Identification Advanced Technologies, pp. 56–61 (2002)
11. Venkataramani, K., Vijaya Kumar, B.V.K.: Fingerprint Verification using Correlation Filters. In: System AVBPA, pp. 886–894 (2003)
12. Samad, S.A., Ramli, D.A., Hussain, A.: Lower Face Verification Centered on Lips using
Correlation Filters. Information Technology Journal 6(8), 1146–1151 (2007)
13. Samad, S.A., Ramli, D.A., Hussain, A.: Person Identification using Lip Motion Sequence.
In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part I. LNCS (LNAI), vol. 4692,
pp. 839–846. Springer, Heidelberg (2007)
14. Samad, S.A., Ramli, D.A., Hussain, A.: A Multi-Sample Single-Source Model using Spectrographic Features for Biometric Authentication. In: IEEE International Conference on
Information, Communications and Signal Processing, CD ROM (2007)
15. Sanderson, C., Paliwal, K.K.: Noise Compensation in a Multi-Modal Verification System. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 157–160 (2001)
16. Spectrogram, http://cslu.cse.ogi.edu/tutordemo/spectrogramReading/spectrogram.html
17. Klevents, R.L., Rodman, R.D.: Voice Recognition: Background of Voice Recognition,
London (1997)
18. Gunn, S.R.: Support Vector Machine for Classification and Regression. Technical Report,
University of Southampton (2005)
19. Wan, V., Campbell, W.M.: Support Vector Machines for Speaker Verification and Identification. In: Proceedings of Neural Networks for Signal Processing, pp. 775–784 (2000)