Speech and Natural Language Processing
Speech
© 2001 Christian Martyn Jones
Speech production
• what’s so good about speech?
1. Speech is a faculty unique to humans and one of the most important, requiring the precise control and co-ordination of over eighty different muscles, making speech the highest learned skill a human can achieve.
2. Speaking generally requires around 1,500 muscle commands every second, yet as children we need only a few years to perfect it.
3. Although a primary means of communication in itself, speech can convey other messages through accents, tone, pitch, and quality.
4. …but there have been only limited achievements in machine speech recognition and synthesis
• The act of speech involves three major anatomical subsystems:
1. the respiratory system
 including the lungs, rib cage, and diaphragm;
2. the phonatory system
 which includes the larynx;
3. the articulatory system
 the lips, teeth, tongue, and jaw.
Respiratory system
Phonatory system
[Diagram of the respiratory and phonatory systems, labelled: nasal cavity, alveolar ridge, velum, teeth, lips, epiglottis, tongue, esophagus, glottis, larynx, lungs, diaphragm]

Articulatory system
Articulation of consonants
• …considering in turn how we create:
 consonants,
 vowels,
 and other sounds
 …using the glottis, tongue, teeth, lips, and nasal cavity
• How we classify the production of consonants involves:
1. the place of articulation
 (the relative positions of the lips, teeth, and tongue),
2. the manner of articulation
 (how the air-stream from the lungs is obstructed:
 stops, fricatives, affricates, nasals, liquids and glides),
3. and whether the vocal cords are set to vibrate

Annotation
• …will use American English rather than British English
 but why?
 my work
 most other researchers' work
• ASR
 takes an acoustic waveform as input and produces as output a string of words
• TtS
 takes a sequence of text words and produces an acoustic waveform
• …and will also simplify
 not consider regional accents

…two main alphabet standards
• 1. International Phonetic Alphabet (IPA)
 a standard originally developed by the International Phonetic Association in 1888 with the aim of transcribing all human languages
 it is more than just a set of symbols (e.g. a one-to-one relationship to sounds) and differs for different languages
• 2. ARPAbet
 uses ASCII characters rather than more 'non-standard' characters
 this makes it much easier to create phonetic dictionaries, syntactic and semantic rules, and build them into ASR systems…as we will see (a small dictionary sketch follows below).
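As a concrete illustration of why an ASCII-based alphabet is convenient, here is a minimal Python sketch of a pronunciation dictionary. It is not taken from the slides: the word list is purely illustrative (the vowel symbols follow the /iy/, /ih/, … examples used later in this lecture), and a real ASR system would load a far larger dictionary.

# Minimal sketch: an ARPAbet pronunciation dictionary kept as plain ASCII.
# Entries are illustrative only; a real system would load a full dictionary.
ARPABET_DICT = {
    "beet": ["B", "IY", "T"],
    "bit":  ["B", "IH", "T"],
    "bait": ["B", "EY", "T"],
    "bet":  ["B", "EH", "T"],
    "bat":  ["B", "AE", "T"],
    "boot": ["B", "UW", "T"],
    "book": ["B", "UH", "K"],
    "boat": ["B", "OW", "T"],
}

def transcribe(words):
    """Return ARPAbet transcriptions for a word sequence (None if unknown)."""
    return [ARPABET_DICT.get(w.lower()) for w in words]

print(transcribe(["beet", "bit", "boat"]))
# [['B', 'IY', 'T'], ['B', 'IH', 'T'], ['B', 'OW', 'T']]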
Place of articulation
• …refers to the relative positions of the lips, teeth, and tongue.
• There are six distinct types of classification:
 bilabial,
 labiodental,
 interdental,
 alveolar,
 alveo-palatal,
 and velar.
• The six places of articulation
 [Diagram of the vocal tract, labelled: nasal cavity, soft palate, uvula, hard palate, alveolar ridge, lips, tip of tongue, blade of tongue, back of tongue, jaw]
• bilabial, labiodental, interdental, alveolar, alveo-palatal, and velar describe the parts of the vocal tract which are responsible for the obstruction of the air flow from the lungs
• the degree of obstruction the airstream incurs must also be considered
 …this is the manner of articulation.
Manner of articulation
• The manner of articulation describes…
 ...how, and to what degree, air from the lungs is obstructed
• Terms used:
 stops,
 fricatives,
 affricates,
 nasals,
 lateral,
 retroflex,
 and glides

Voicing
• Voicing
 the vibration of the vocal cords in order to change the characteristics of the airstream through the mouth or nose and the overall acoustic nature of the phone.
• ...sounds that are generated with the vocal cords vibrating are voiced, and conversely those sounds requiring static vocal cords are voiceless.
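To tie the three classification dimensions together, the following is a small sketch, not part of the original slides, of how a few American English consonants might be tabulated by place, manner, and voicing; the subset of phones chosen is purely illustrative.

# Minimal sketch: a handful of ARPAbet consonants classified by the three
# dimensions above: place of articulation, manner of articulation, voicing.
CONSONANT_FEATURES = {
    # phone: (place,          manner,      voiced?)
    "P":  ("bilabial",      "stop",      False),
    "B":  ("bilabial",      "stop",      True),
    "M":  ("bilabial",      "nasal",     True),
    "F":  ("labiodental",   "fricative", False),
    "V":  ("labiodental",   "fricative", True),
    "TH": ("interdental",   "fricative", False),
    "T":  ("alveolar",      "stop",      False),
    "D":  ("alveolar",      "stop",      True),
    "S":  ("alveolar",      "fricative", False),
    "Z":  ("alveolar",      "fricative", True),
    "SH": ("alveo-palatal", "fricative", False),
    "CH": ("alveo-palatal", "affricate", False),
    "K":  ("velar",         "stop",      False),
    "G":  ("velar",         "stop",      True),
    "NG": ("velar",         "nasal",     True),
}

def same_place(a, b):
    """True if two phones share a place of articulation (a rough predictor
    of how visually similar they look on the lips)."""
    return CONSONANT_FEATURES[a][0] == CONSONANT_FEATURES[b][0]

print(same_place("P", "B"))  # True  - both bilabial
print(same_place("P", "K"))  # False - bilabial vs velar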
Articulation of vowels
• The articulation of American English vowels is not as well defined as that of consonants and can vary a great deal from speaker to speaker, especially due to dialect variations.
• Vowels generally present little obstruction and require wide-open mouth positions.
• The description of vowels will consider the motion of the tongue and lips.

Features to consider
• What we look for are:
1. tongue elevation,
2. part of tongue involved,
3. tongue muscle tension,
4. mouth shape.
Elevation of the tongue
• normally categorised simply in terms of:
 high, mid, or low positions.

Examples
• high front: /iy/ in 'beet' and /ih/ in 'bit'
• mid front: /ey/ in 'bait' and /eh/ in 'bet'
• low front: /ae/ in 'bat'
• /uw/ in 'boot' and /uh/ in 'book'
• /ow/ in 'boat' and /ao/ in 'bought'
• /aa/ in 'bott'
Region of the tongue
• We classify the region of the tongue as either:
 front, central, or back
 we have already seen front vowels:
 'beet', 'bit', 'bait', 'bet', 'bat'
 and back vowels such as:
 'boot', 'book', 'boat', 'bought', and 'bott'
 central vowels such as:
 /ah/ in 'but' and the schwa vowel in 'machine'

Tongue tenseness
• vowels requiring above the 'normal' level of muscle tension are termed tense
• whilst those in which this degree of tension is not needed are simply relaxed.
• Consider the list:
 'beat' and 'bit', 'bait' and 'bet', and 'boot' and 'book'
 - the first phone in each group is tense, with the tongue slightly higher in the mouth than the relaxed second phone
 - in addition, front vowels which are tense are articulated with the tongue slightly ahead of a similarly relaxed phone, whilst back tense vowels will be pronounced with the tongue further back.
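Collecting the vowel features discussed above (elevation, region, tenseness) in one place, here is a small sketch that is not taken from the slides; the feature values follow the example words used in this lecture, tenseness is left as None where the slides do not discuss it, and mouth shape is omitted.

# Minimal sketch: the example American English vowels from these slides,
# described by tongue elevation, tongue region, and tenseness.
VOWEL_FEATURES = {
    # phone: (elevation, region,   tense?)   example word
    "IY": ("high", "front",   True),    # 'beet'
    "IH": ("high", "front",   False),   # 'bit'
    "EY": ("mid",  "front",   True),    # 'bait'
    "EH": ("mid",  "front",   False),   # 'bet'
    "AE": ("low",  "front",   None),    # 'bat'
    "UW": ("high", "back",    True),    # 'boot'
    "UH": ("high", "back",    False),   # 'book'
    "OW": ("mid",  "back",    True),    # 'boat'
    "AO": ("mid",  "back",    None),    # 'bought'
    "AA": ("low",  "back",    None),    # 'bott'
    "AH": ("mid",  "central", None),    # 'but'
    "AX": ("mid",  "central", None),    # schwa, as in 'machine'
}

# The tense vowels, which the slides note are articulated with the tongue
# slightly higher than their relaxed partners.
print([v for v, (_, _, tense) in VOWEL_FEATURES.items() if tense])
# ['IY', 'EY', 'UW', 'OW']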
Mouth shape
• a range of lip positions for all phones
• however vowels possess some degree of generalisation
 vowels:
 'beet', 'bit', 'bait', 'bet', 'bat', 'machine', and 'but'
 whilst the back vowels of:
 'boot', 'book', 'boat', 'bought':

Other phonetic articulations
• these require more dynamic movement of the tongue and lips than can be described within the constraints of consonants and vowels, and include:
 off-glides,
 diphthongs,
 and r-colourisation
 the consonant /h/
Off-glides
• Already considered are the high, front, tense /iy/ ('beet'), the mid, front, tense /ey/ ('bait'), the high, back, tense /uw/ ('boot'), and the mid, back, tense /ow/ ('boat') vowels; however, the articulation of each is more than these simple descriptors can suggest.
• The front vowels (/iy/ and /ey/) are composed of the pure vowels /i/ (idealised position) and /e/ followed immediately by the glide /y/ (blade of tongue near hard palate), and similarly the back vowels involve the vowels /u/ and /o/ followed by /w/.
• Each of the constituent phones is pronounced, and these vowels are known as off-glides, indicating the increased motion of the tongue and lips.

Diphthongs
• Similar in composition to off-glides, diphthongs are complex vowels consisting of a vowel sound followed by the glide /y/ or /w/. However, what separates diphthongs from off-glides is that diphthongs involve considerably greater tongue motion.
• The diphthongs of American English are /aw/ in 'bout', /ay/ in 'bite', and /oy/ in 'boy'.
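As a compact restatement of the compositions described above, the following sketch (not from the slides) lists the off-glide vowels and diphthongs as nucleus-plus-glide pairs; the nucleus symbols for the diphthongs are illustrative approximations rather than a formal analysis.

# Minimal sketch: off-glide vowels and diphthongs decomposed as
# (nucleus vowel, glide), following the description in the slides.
OFF_GLIDES = {
    "IY": ("i", "Y"),  # 'beet'
    "EY": ("e", "Y"),  # 'bait'
    "UW": ("u", "W"),  # 'boot'
    "OW": ("o", "W"),  # 'boat'
}
DIPHTHONGS = {         # involve considerably greater tongue motion
    "AW": ("a", "W"),  # 'bout'
    "AY": ("a", "Y"),  # 'bite'
    "OY": ("o", "Y"),  # 'boy'
}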
R-colourisation
• The combination of vowel and /r/ sounds is termed r-colourisation, and differs from off-glides and diphthongs in that they are in fact two symbols representing a single sound.
• The r-coloured, mid, central vowel /er/ as in 'bird' involves the articulation of two previously described phones at one time:
 the mid, central vowel pronunciation of the schwa as in 'machine'
 together with the tongue curl associated with the retroflexed /r/.
• The tongue is again in motion during pronunciation; however, it does not exceed the boundaries of tongue elevation and region as is the case with diphthongs.

/h/
• the already mentioned consonant /h/ assumes the tongue and lip positions of the following vowel. As the articulation of the consonant /h/ is very much dependent on that of the vowel, but does not require the vibration of the vocal cords, it is often described as a voiceless vowel although grammatically it is a consonant.
Deaf studies
• will consider the field of speechreading, addressing the visual similarity in speech.
• Research
 a number of studies from the 1950s to the present day
 major works from Jeffers, Nitchie, Berger, Fisher, and my own studies
 the one looked at here is Nitchie 1979

…less info for speechreaders
• Larry Thronson, a sign language instructor and counsellor at the Central Coast Centre for Independent Living in California, has said:
 "Lipreading is guesswork; under ideal conditions, only about 30 to 40 percent (of the speech) is retained"
• With my research into lip syncing I found that more like 50% to 60% of the information is lost

lip cues for all…
• how much attention we pay to the visual and auditory information can vary depending on the 'cocktail party effect' and also on the subject matter:
• Predictable words in a conversation are spoken less clearly, as are references to objects present or passed between speakers in view; however, when an object is mentioned for the first time it is named more clearly than on subsequent occasions
• When speakers are face-to-face their speech becomes degraded even when they are not actually facing each other. Although it can be seen that visual cues can greatly assist communication, speakers generally rarely look at each other during conversations, which seems to contradict suggestions that speakers adapt their articulation to the needs of the listener. If this were the case then one might speak less clearly when being watched; however, the reverse is true.
• Although we generally use visual cues less frequently than would initially be imagined (typically less than 50% of the time), when the listener does look at the speaker's face the speaker then articulates much more clearly than before
Eye gaze
• Comfort:
• In the situations of video teleconferencing or cartoon-style animation, the viewer is aware that the image is synthetic and is less prone to feel uneasy looking at the speaker's face. Therefore, the viewer will look at the image for a greater percentage of the overall time, implying that the level of accuracy of the articulation should be as high as possible.
• However, the psychology of human subjects is not always consistent, and in fact listeners find excessive precision in the articulation of the speaker to be generally annoying.
• Similarly, during face-to-face conversations subjects subconsciously negotiate an acceptable level of mutual eye contact depending on the level of intimacy and comfort.
• Note that the very latest computer facial models provide a very life-like persona, and it is plausible that in the future such models may well be perceived to be real-life images, in which case listeners will again find themselves unable to make prolonged eye contact during the communication.

Bad lip sync.
• Although most people have not had any formal speechreading training and therefore cannot accurately lip-read, bad lip synchronisation is easily detectable and highly unacceptable to the viewer
• Extreme cases have been cited by Groß at Oxford University, England, in which films dubbed into different languages cannot be synchronised with the visuals and can cause irritation to the audience
• In countries such as Germany and France, which regularly dub British or American films, the spectators appear to be able to ignore the discrepancy between what they see and what they hear, as if the brain can be trained to accept bad lip synchronisation when the situation is nonsensical
• However, if the same audience watch a German actor dubbed into another language such as French, then they strongly object to the inconsistency between the speech and mouth synchronisation
Cartoon lip sync.
• Cartoon lip synchronisation does not require the same level of accuracy as its human counterpart. Instead, animators generally supply only keyframes in the articulation and the human mind 'fills in' the gaps. The audience perceive the characters in animation to be life-like but at the same time not human, and are thus happy to ignore inaccuracies in the lip synchronisation which would otherwise be unacceptable. Therefore animators are able to get away with poor articulation models for simple characters.
• Disney animators found that although mouth motion required exact synchronisation, other speechreading cues such as head, body, or gestures needed to be synchronised three to four frames ahead of the visual action

Audio/visual mis-matches
• McGurk effect
• what do you…
 ...hear?
 ...see?
 …with both?
McGurk research
• …see results
• The work involved filming a female subject whilst repeatedly uttering 'ba-ba', 'ga-ga', 'pa-pa', and 'ka-ka', and generating four dubbed audio-video sequences in which the original sound-track and lip movements were combined correctly and mis-matched as: [ba] (voice)-[ga] (lips), [ga] (voice)-[ba] (lips), [pa] (voice)-[ka] (lips), and [ka] (voice)-[pa] (lips).

why does it occur?
• The McGurk effect can be explained in terms of visual similarity
• Speechreading studies show that lip positions for [ga] are frequently misread as [da], [ka] misread as [ta], and [pa] as [ba]
• McGurk assumed that the acoustic information for [ba] and [da] contained some common features which were not present in [ga]. Thus a [ba] (voice)-[ga] (lips) presentation provided the viewer with visual information common to [ga] and [da] and auditory information with features common to [da] and [ba]. The spectator would then respond with the phone-code for which there was most data: [da].
• A similar explanation was presupposed for the [pa] (voice)-[ka] (lips) effect and for reversing the audio and visual stimuli. When the acoustic information does not bear any similarity to the articulation, as in the case of [ka] (voice)-[pa] (lips), the viewer invariably must guess the spoken message and responds with combinations such as [kapka], [pakpa], etc.
…further work
• Burnham and Dodd, researching in Australia, found that the McGurk effect was observed in infants as young as 4 months old, and that the auditory [ba] and visual [ga] could be perceived as [dh] as well as [da].
• They also noted that the effect transcends language and phonological constraints.
• Work by Massaro has considered the use of synthetic faces in the McGurk effect. Massaro has extended the sensory mismatch to include auditory /b/ and visual /d/ perceived as /w/.

…by now
• you should have an appreciation of:
 speech,
 how it is produced,
 what articulators are involved,
 how to annotate speech using symbols,
 how speech appears visually,
 similarities in visual speech,
 and mis-information cues and difficulties for the hearing impaired.
• next…how this affects the acoustics of speech…