CS631p - Human-Computer Interaction
Spring 2001
Lecture 8: Speech, Language, and Audition
Communication between humans - speech/hearing -> the audio channel
Communication between humans and machines -> not so (the audio channel is barely used)
Issues of Speech
General characteristics of speech:
- Described as:
  - An acoustic signal - speech is generated by the articulatory apparatus and perceived by the auditory system.
  - A tightly structured system of symbols and meanings - linguistic information in speech signals is always described in terms of discrete and static symbols, like speech sounds, syllables, and words.
- Speech sounds are generated in the vocal tract - the non-uniform tube formed by the throat (pharynx), the oral cavity (whose shape is changed through tongue and jaw movements), and the nasal cavity, whose shape is constant (but very different between speakers).
- The dynamically changing tube is acoustically excited at the very far end (by air pulses released through the vibrating vocal folds) or closer to the near end (the lips), by the turbulence caused by air that is forced across sharp ridges.
- Different speech sounds correspond to different shapes of the vocal tract and to different excitation sources.
- Speech signals are described by their spectro-temporal characteristics. The best way to show these is to stack the short-time spectra of short segments of the signal.
  Below: oscillogram (top) and spectrogram of a short utterance
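A minimal sketch of how such a spectrogram is obtained by stacking short-time spectra. This is illustration only and is not from the original notes; it assumes Python with NumPy/SciPy, and the synthetic chirp merely stands in for a recorded utterance.

    # Sketch: build a spectrogram by stacking short-time spectra.
    import numpy as np
    from scipy import signal

    fs = 16000                                    # sampling rate (Hz)
    t = np.arange(0, 1.0, 1 / fs)                 # one second of signal
    x = signal.chirp(t, f0=200, f1=3000, t1=1.0)  # stand-in for a speech signal

    # Short-time spectra: 25 ms windows with a 10 ms hop (typical speech-analysis values)
    f, frames, Sxx = signal.spectrogram(x, fs=fs,
                                        nperseg=int(0.025 * fs),
                                        noverlap=int(0.015 * fs))

    # Each column of Sxx is one short-time spectrum; stacked over time they
    # form the spectro-temporal picture shown in the figure above.
    print(Sxx.shape)   # (frequency bins, time frames)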
- The vocal tract is described by its resonance frequencies. This is very useful for vowel sounds, which have the excitation at the far end of the tube. The resonances are often termed 'formants'. Two such formants suffice to uniquely identify a vowel.
  Below: Figure showing vowel sounds
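To make the "two formants identify a vowel" claim concrete, here is a small sketch in Python. The F1/F2 reference values are rough textbook averages quoted from memory (not taken from these notes), so treat them, and the nearest-neighbour rule, as illustrative only.

    # Sketch: classify a vowel from its first two formant frequencies (F1, F2).
    # A real system would measure formants from the signal and handle speaker variation.
    REFERENCE_FORMANTS = {           # vowel: (F1 Hz, F2 Hz), approximate values
        "i (beet)":   (270, 2290),
        "ae (bat)":   (660, 1720),
        "a (father)": (730, 1090),
        "u (boot)":   (300,  870),
    }

    def identify_vowel(f1: float, f2: float) -> str:
        """Return the reference vowel whose (F1, F2) pair is closest."""
        return min(REFERENCE_FORMANTS,
                   key=lambda v: (REFERENCE_FORMANTS[v][0] - f1) ** 2 +
                                 (REFERENCE_FORMANTS[v][1] - f2) ** 2)

    print(identify_vowel(310, 2200))   # -> "i (beet)"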
Speech and natural language interfaces
- Hopes: make interaction with machines resemble interaction with people.
- But: conversing with a computer as with a friend is neither technologically feasible nor always desirable.
- Yet: speech and natural language technologies are practical and useful - often recognition and understanding are not required (e.g. voice mail, voice annotation).
- Speech is far more effective when coupled with other modalities of interaction such as gestures.
Designers must worry about:
- It is faster to speak than to write.
- It is faster to read than to listen.
- So speech is good for input but not for output.
- Spatial and temporal relationships are often better articulated with gestures and markings.
- Audio is spatially ubiquitous, while the visual channel is localized (e.g. you can't see behind you, but you can hear 360 degrees).
Stored Speech
- Good for computer-mediated human/human interaction (e.g. voice mail).
  Store-and-forward technology - good for integrating with other document types.
- Comparison of voice vs. written annotations:
  - Voice: used for global, high-level comments
  - Written: local, low-level
  - Conclusion: both kinds of annotations can be supported simultaneously
- Research by Levine and Ehrlich on Freestyle:
  - Annotate e-documents by writing with an electronic stylus or by voice
  - Point, mark, and speak simultaneously
  - Have the result captured for later use
  - All this could be distributed via email
- Problem with speech files - searching and retrieving information.
- Wordspotting research is underway.
Speech Synthesis
- Standard on PCs
- ASCII text to sound output
- Follows a set of production rules
- No context checking (so is "wind" pronounced as in "the wind blows" or as in "wind the clock"? See the toy sketch at the end of this section.)
- Cheap
- Can be used to deliver stored system messages rather than voicemail
- Voice quality varies based on application
- Example: Bell Labs Text-to-Speech System
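A toy sketch of the "production rules without context checking" point. This is not how the Bell Labs system works; it is a made-up Python illustration, and the pronunciation strings are ad-hoc notation rather than a real phoneme set.

    # Toy letter-to-sound conversion: an exception dictionary plus a trivial
    # fall-through rule.  Because there is no context checking, the homograph
    # "wind" (the noun in "the wind blows" and the verb in "wind the clock")
    # always receives the same, fixed pronunciation.
    EXCEPTIONS = {"wind": "w ih n d", "one": "w ah n"}   # ad-hoc notation

    def letter_to_sound(word: str) -> str:
        word = word.lower()
        if word in EXCEPTIONS:
            return EXCEPTIONS[word]
        return " ".join(word)    # naive default rule: one symbol per letter

    for sentence in ["the wind blows", "wind the clock"]:
        print(sentence, "->", [letter_to_sound(w) for w in sentence.split()])
    # Both occurrences of "wind" come out identically.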
Speech Recognition
- Recognizes spoken words
- Has no idea what the words mean - just the words
- To interpret them, results are sent up to higher-level software
- "Words" in a speech recognition system are not necessarily words - they are acoustic patterns matched against acoustic templates (see the sketch after this list)
- Systems appear on PCs for special uses:
  - PC Magazine Evaluation of Voice Recognition Software
  - Metroplex Voice Computing: Application to Math
- Difficulties in Speech Recognition:
  - Speaker-dependent or -independent: does the system need to be trained separately for each user?
  - Size of vocabulary: state-of-the-art systems recognize up to fifty thousand words
  - Isolated words or continuous speech:
    - When does one word end and the next begin?
    - Isolated-word systems must separate words by a gap of 350 ms.
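The notes above describe recognition as matching acoustic patterns against stored templates. One classical way to do this for isolated words is dynamic time warping (DTW); the notes do not name DTW, so the Python/NumPy sketch below is my own illustration, and the "feature vectors" are placeholders for real acoustic features such as short-time spectra.

    # Sketch: isolated-word recognition by template matching with dynamic time warping.
    import numpy as np

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        """DTW distance between two sequences of feature vectors (frames x dims)."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])     # frame-to-frame distance
                cost[i, j] = d + min(cost[i - 1, j],        # insertion
                                     cost[i, j - 1],        # deletion
                                     cost[i - 1, j - 1])    # match
        return float(cost[n, m])

    def recognize(utterance: np.ndarray, templates: dict) -> str:
        """Return the template word whose stored pattern best matches the utterance."""
        return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

    # Dummy 2-D "features" standing in for real spectra:
    templates = {"yes": np.array([[0., 0.], [1., 1.], [2., 2.]]),
                 "no":  np.array([[2., 0.], [1., -1.], [0., -2.]])}
    utterance = np.array([[0.1, 0.], [0.9, 1.1], [1.1, 1.2], [2.1, 1.9]])
    print(recognize(utterance, templates))   # -> "yes"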
- Automatic speech recognition is best treated as a problem in Information Theory - a sender and a receiver.
- During human speech production some message is encoded and transmitted through a channel that is at best partially known and that is often noisy.
- The recogniser's task is to decode the message.
- All existing automatic speech recognition devices attempt to solve the problem by building probabilistic models of all relevant messages, and computing the likelihood that a given signal corresponds to (the model of) each of the possible messages.
- Messages are defined in terms of words. The way in which words are modelled depends very much on the number of different words in the vocabulary, and on the way in which words can be combined to form complex messages.
- If the number of words in the lexicon is small, it is best to build models of full words.
- Otherwise it is necessary to build models of the 45 or so different speech sounds used to form all the words in a language. Words are then modeled as sequences of sounds, or more precisely, sequences of sound models (see the sketch below).
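A small Python sketch of the subword-modelling idea just described: words are spelled out as phone sequences in a lexicon, and a word model is the concatenation of per-phone models. The lexicon entries and phone symbols here are invented for illustration.

    # Sketch: build word models by concatenating per-phone models.
    # A "phone model" is just a placeholder string here; in a real recogniser it
    # would be a statistical model trained for that speech sound.
    LEXICON = {                      # word -> phone sequence (invented entries)
        "speech": ["s", "p", "iy", "ch"],
        "peach":  ["p", "iy", "ch"],
    }

    PHONE_MODELS = {p: f"model({p})"
                    for p in {ph for seq in LEXICON.values() for ph in seq}}

    def word_model(word: str) -> list:
        """A word model is the concatenation of its phone models."""
        return [PHONE_MODELS[p] for p in LEXICON[word]]

    print(word_model("speech"))   # -> ['model(s)', 'model(p)', 'model(iy)', 'model(ch)']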
- For decoding messages, the prior probability of the words is used via a Bayesian statistical analysis.
- We want to know p(w|X) - the probability of the sequence of words w, given the sequence of acoustic observations X.
- p(X|w) can be estimated if there is a sufficient number of tokens of the words w spoken by a relevant set of speakers.
- If a single speaker will use the recogniser, that specific speaker had best produce the training speech. In practice, however, training speech usually has to come from many speakers.
- In the Information Theoretic framework, speech recognition boils down to searching for the (sequence of) words that maximises the likelihood p(X|w) weighted by the prior p(w).
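Written out as an equation (a standard identity, not reproduced from the original notes): Bayes' rule links p(w|X) to the acoustic likelihood p(X|w) and the prior p(w), and p(X) can be dropped because it does not depend on w.

    \hat{w} \;=\; \arg\max_{w}\, p(w \mid X)
            \;=\; \arg\max_{w}\, \frac{p(X \mid w)\, p(w)}{p(X)}
            \;=\; \arg\max_{w}\, p(X \mid w)\, p(w)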
- Big Issues:
  - Accuracy
  - Repeatability of performance
  - Vocabulary size
  - Ease of modification
  - Location of microphone
  - Complete voice control over the system
  - Performance in variable conditions
- Problems: speech vs. mouse - speech was 18% faster than the mouse, but its use interfered with short-term memory tasks.
Speaker Recognition
- Identification of a speaker from speech.
- Two different tasks - both require a database of identity codes and voice patterns of all persons who are known to the system.
- Speaker Identification - the machine must use a speech utterance to determine which of the persons has spoken (or whether the speaker was an unknown person - and therefore probably an intruder).
- Speaker Verification - a speaker claims to be one of the N persons in the system's database, and the task is to decide whether or not that claim can be substantiated.
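A minimal Python/NumPy sketch of how the two tasks differ in their decision logic. The "voiceprint" vectors, the cosine-similarity score, and the threshold are all placeholders of my own, not from the notes; a real system would derive such patterns from training speech.

    # Sketch: speaker identification vs. speaker verification.
    import numpy as np

    ENROLLED = {"alice": np.array([0.9, 0.1, 0.3]),
                "bob":   np.array([0.2, 0.8, 0.5])}

    def similarity(a, b):
        """Cosine similarity between two voiceprints (placeholder scoring)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def identify(utterance, threshold=0.8):
        """Identification: which enrolled speaker spoke, or 'unknown' (intruder)?"""
        best = max(ENROLLED, key=lambda s: similarity(utterance, ENROLLED[s]))
        return best if similarity(utterance, ENROLLED[best]) >= threshold else "unknown"

    def verify(utterance, claimed_id, threshold=0.8):
        """Verification: can the claim 'I am claimed_id' be substantiated?"""
        return similarity(utterance, ENROLLED[claimed_id]) >= threshold

    test = np.array([0.85, 0.15, 0.25])
    print(identify(test))            # -> "alice"
    print(verify(test, "bob"))       # -> False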
- Speaker Recognition Demo
Natural Language Recognition
- Natural language systems input and output ASCII text rather than speech.
  Speech understanding systems operate on phrase or pattern matching - no deep understanding.
- Try to combine natural language with speech.
- Need to understand:
  - Syntactic - grammar/structure
  - Prosodic - inflections, stress, pitch, timing
  - Pragmatic - where in the discussion the utterance takes place - location, time, cultural practices, speakers, hearers, surroundings
  - Semantic - having to do with word meaning
Natural Language Generation
- Messages from computers sometimes come in the form of natural language - status, errors, options.
- These are not natural language systems - the messages are pre-set, not generated on the fly - there is no conversation.
- Genuine natural language generation is an AI problem.
Speech, Gesture & Multimodal Interaction
- Natural language systems are not always the most natural way to interact.
- Gestures can convey the intent of an utterance much more naturally than speech or written language.
  Pointing is easier than trying to say what you want.
- But natural language is better for other things.
  e.g. Put That There (Architecture Machine Group, MIT):
  - The interface consisted of a large room, one wall of which was a back-projection panel.
  - Users sat in the center of the room in a chair, wearing magnetic position-sensing devices on their wrists to measure hand position.
  - Users interacted with objects on the wall-sized map using voice input combined with pointing.
  - Graphic feedback was given on the map and by speech - using text-to-speech technology.
  - The pronouns "that" and "there" were linked to the gestures.
- Taxonomy of gestures:
  - Symbolic gestures - gestures that have a single meaning within a culture (e.g. "OK!")
  - Deictic gestures - gestures of pointing or directing attention
  - Iconic gestures - convey information about size, shape, or orientation
  - Pantomimic gestures - show the use or movement of some invisible tool.
- Only the first kind (symbolic gestures) can be interpreted without further context.
- See also: Put That Where? Voice and Gesture at the Graphics Interface by Mark Billinghurst
Applications
- Most work on voice-based interfaces focuses on telephony.
Non-speech Audio
- Audio messages from computer systems fall into:
  - Alarms and warning systems (these dominate video games)
  - Status and monitoring indicators
  - Encoded messages and data
- To make audio cues more effective, the user must know the meaning of each.
Learning and Remembering
- Assist users in remembering non-speech audio by using metaphors - sounds on the computer have meanings similar to those in the real world.
  e.g. associate a reverberant clunk with empty space: when a file is saved, the amount of reverberation provides a clue as to how much free space remains on the disk.
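A small Python sketch of that reverberation metaphor. The mapping from free space to a reverberation time is made up for illustration, and no audio is actually synthesised or played here.

    # Sketch: map the fraction of free disk space to an amount of reverberation
    # for the "file saved" sound.  More free space -> longer, more "empty" reverb.
    import shutil

    def save_sound_reverb(path: str = "/") -> float:
        """Return a reverb decay time (seconds) proportional to free disk space."""
        usage = shutil.disk_usage(path)
        free_fraction = usage.free / usage.total
        max_decay = 2.0                      # arbitrary cap: 2 s for an empty disk
        return max_decay * free_fraction

    # An audio layer would apply this decay when playing the save sound:
    print(f"reverb decay: {save_sound_reverb():.2f} s")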
Musical and Everyday Listening
Two approaches to non-speech audio:
- Musical listening - the "message" is derived from relationships among acoustical components of the sound, such as pitch, timing, and timbre.
- Everyday listening - what one hears is not the acoustical attributes of the sound but the source of the sound.
  e.g. hearing a door slam - we pay attention to the fact that it was a door.
For auditory design based on musical listening we have equivalents to typography, layout, and color:
- Pitch - the basis for melody - 96 different pitches in the Western musical system.
- Rhythm - change in timing and attack - the most prominent characteristic.
- Tempo - the speed of events.
- Dynamics - relative loudness - e.g. a crescendo could be used to give the idea of zooming a window.
- Timbre - spectral content over time (e.g. sax vs. flute).
- Location - where the sound originates.
e.g. musical themes identify the characters in Peter and the Wolf.
- Auditory Icons (Bill Gaver) - everyday sounds mapped to computer events; the sounds are analogous to those the user might expect if the interaction involved tangible, material objects.
  e.g. SonicFinder (1989) - a drop-in replacement for the standard Finder of the Apple Macintosh:
  Finder Event        Auditory Icon
  Object selection    Hitting sound
  Opening folder      Whooshing sound
  Dragging            Scraping sound
  Drop-in             Noise of object landing
  Copying             Pouring sound
- The SonicFinder's strength is the way it reinforces the existing desktop metaphor by creating the illusion that the components of the system are tangible objects.
- Earcons (Stephen Brewster, U. of Glasgow) - non-verbal audio messages used in the computer/user interface to provide the user with information about some computer object, operation, or interaction. Earcons are constructed from simple building blocks called motives, which vary the parameters listed above (pitch, rhythm, timbre, and so on). Rules for creating earcons:
  - Repetition: exact restatement of a preceding motive and its parameters.
  - Variation: altering one or more of the variable parameters from the preceding motive.
  - Contrast: a decided difference in the pitch and/or rhythmic content from the preceding motive.
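A short Python/NumPy sketch of building earcons from a motive and applying the repetition / variation / contrast rules. The particular motive, note durations, and sine-wave synthesis are my own choices for illustration, not taken from Brewster's work.

    # Sketch: construct simple earcons from a "motive" (a short pitch/duration
    # pattern) and derive new earcons by repetition, variation, and contrast.
    import numpy as np

    FS = 22050                                    # sampling rate

    def tone(midi_note: int, dur: float) -> np.ndarray:
        """Render one note of the motive as a sine wave (timbre kept trivial)."""
        freq = 440.0 * 2 ** ((midi_note - 69) / 12)      # MIDI note -> Hz
        t = np.arange(int(FS * dur)) / FS
        return 0.3 * np.sin(2 * np.pi * freq * t)

    def render(motive):
        """A motive is a list of (midi_note, duration) pairs."""
        return np.concatenate([tone(n, d) for n, d in motive])

    base = [(60, 0.15), (64, 0.15), (67, 0.3)]              # the original motive

    repetition = base + base                                # exact restatement
    variation  = base + [(n + 2, d) for n, d in base]       # alter one parameter (pitch)
    contrast   = base + [(n - 12, d * 2) for n, d in base]  # clear pitch/rhythm change

    for name, motive in [("repetition", repetition),
                         ("variation", variation),
                         ("contrast", contrast)]:
        audio = render(motive)                  # would be written to a file or played back
        print(name, len(audio) / FS, "seconds")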
- Sample Experiments
Testing and Validation
e.g. operating room alarms:
- Staff were able to identify only 10 to 15 of 26 alarms
- Only 9 to 14 of 23 alarms were identifiable in the ICU
- Result - there are too many alarms in critical areas, and they are poorly designed
Human Factors
- Prime attribute - messages can be conveyed without making use of the video channel
- The visual channel is focused, while audio is more omni-directional
- One person's audio icon is another person's noise
Pragmatics of using sound on computers
Perception and Psychoacoustics
- Psychoacoustics tells us about the relationship between perception and the physical properties of the acoustic signal