Speech

Psy 5054


Speech Perception

  • The Study of Speech Sounds
  • Psychological Background
  • A Computational Analysis of Speech Perception
  • Models of Speech Perception
  • Speech Perception and the Brain

The Study of Speech Sounds

  • Acoustic Phonetics
  • Articulatory Phonetics
  • Phonology

Acoustic Phonetics

  • What are the physical properties of speech?
  • Spectrograms
  • Cues

What are the physical properties of speech?

Spectrograms

  • Time (Horizontal Axis)
  • Frequency (Vertical Axis)
  • Intensity
  • Formants
    • Numbering
    • Importance of the First Two Formants

Cues

  • Context Independent (stressed vowels)
  • Context Dependent (e.g., consonants)

Articulatory Phonetics

  • How are speech sounds produced?
  • Consonants
  • Vowels

How are speech sounds produced?

  • The lungs provide a flow of air.
  • The larynx vibrates.
  • The shape of the mouth can be altered to change the sound.
  • The tongue can be used to restrict the flow of air.
  • Air can be redirected through the nasal passage.

Consonants

  • Place of Articulation
  • Manner of Articulation
  • Voicing

Place of Articulation

  • Bi-Labial (pin)
  • Labio-Dental (fin)
  • Dental (thin)
  • Alveolar (zin)
  • Palatal (chin)
  • Velar (kin)
  • Glottal (hin)

Manner of Articulation

  • Stops (tin)
  • Fricatives (sin)
  • Affricates (chin)
  • Nasals (min)
  • Laterals (lin)
  • Semivowels (win)

Voicing

  • Voiced (bin)
  • Unvoiced (pin)
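
The three consonant dimensions above can be treated as a small feature table. The sketch below is illustrative only: the table name, the field names, and the choice of consonants are assumptions made here, and the entries for sounds the outline does not classify on all three dimensions follow the standard phonetic descriptions.

```python
from typing import NamedTuple


class Consonant(NamedTuple):
    place: str     # place of articulation
    manner: str    # manner of articulation
    voiced: bool   # True if the larynx vibrates


# Hypothetical mini-inventory; not a complete description of English.
CONSONANTS = {
    "p": Consonant("bilabial", "stop", False),
    "b": Consonant("bilabial", "stop", True),
    "t": Consonant("alveolar", "stop", False),
    "k": Consonant("velar", "stop", False),
    "f": Consonant("labiodental", "fricative", False),
    "s": Consonant("alveolar", "fricative", False),
    "z": Consonant("alveolar", "fricative", True),
    "m": Consonant("bilabial", "nasal", True),
}


def differ_only_in_voicing(a, b):
    """True when two consonants share place and manner but not voicing."""
    x, y = CONSONANTS[a], CONSONANTS[b]
    return x.place == y.place and x.manner == y.manner and x.voiced != y.voiced
```

With this table, "pin" and "bin" start with consonants that differ only in voicing, while "pin" and "tin" start with consonants that differ in place.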

Vowels

  • Height of Tongue
    • High (beet)
    • Mid (bait)
    • Low (bat)
  • Part of the Tongue Involved
    • Front (bat)
    • Central (but)
    • Back (pot)

Phonology

  • What are the basic units of speech?
  • How do speech sounds change when you combine them?

What are the basic units of speech?

  • Phones are physically different speech sounds (vowels or consonants).
  • Phonemes are the categories into which we classify phones.
  • Allophones are different phones that we place in the same phoneme category.
  • Phonemes and allophones are language specific.

How do speech sounds change when you combine them?

  • Example: To make a noun plural, add an "s".
  • The sound of the "s" depends on the context.
    • glass --> glass+ez
    • lip --> lip+s
    • pig --> pig+z
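
The plural rule above can be sketched as a function of the final sound of the noun. This is a simplified illustration: the function name, the sound labels, and the two sets are assumptions made here, and the input is a rough label for the final sound rather than the spelling of the word.

```python
# Hissing or hushing final sounds take the extra vowel ("ez").
SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}
# Unvoiced non-sibilant final sounds take "s".
UNVOICED = {"p", "t", "k", "f", "th"}


def plural_suffix(final_sound):
    """Pick the spoken plural ending from the noun's final sound."""
    if final_sound in SIBILANTS:
        return "ez"
    if final_sound in UNVOICED:
        return "s"
    # Everything else (voiced consonants and vowels) takes "z".
    return "z"
```

For the three examples, plural_suffix("s") covers "glass", plural_suffix("p") covers "lip", and plural_suffix("g") covers "pig".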

Psychological Background

  • Stages of Speech Perception
  • The Importance of Top-Down Processing
  • The Motor Theory

Stages of Speech Perception

  • The Auditory Stage
  • The Phonetic Stage
  • The Phonological Stage

The Auditory Stage

  • Segmentation
    • The speech stream has to be segmented into phones.
  • Identification of Features
    • Invariant acoustic features (those associated with stressed vowels).
    • What other features are important?
      • Errors in phoneme identification.

Errors in Phoneme Identification

The Phonetic Stage

  • Identify the Phoneme
  • Categorical Perception
    • The independent variable (IV) is voice onset time.
    • The dependent variables (DVs) are phoneme identification ("P" or "B") and ABX discrimination.
    • Results are language specific.
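
Categorical perception can be illustrated with a toy identification function. In the sketch below, the boundary and slope values are assumptions chosen for illustration, not measured data. The discrimination formula is the usual label-based prediction: listeners can tell two sounds apart only to the extent that they label them differently.

```python
import math

BOUNDARY_MS = 25.0  # assumed "B"/"P" boundary, illustrative only
SLOPE = 0.5         # assumed steepness of the identification curve


def p_identify_p(vot_ms):
    """Probability of labeling a sound "P" at a given voice onset time."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))


def predicted_abx(vot_a, vot_b):
    """Predicted proportion correct in ABX if only the labels are available."""
    diff = p_identify_p(vot_a) - p_identify_p(vot_b)
    return 0.5 + 0.5 * diff ** 2
```

Under these assumptions, a 10 ms step within a category (0 vs. 10 ms) is predicted to be discriminated at about chance, while the same step across the boundary (20 vs. 30 ms) is predicted to be discriminated well.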

The Phonological Stage

  • Use rules of combination to check consistency.
  • Day’s Dichotic Listening Task
    • banket + lanket --> blanket
    • Replication?

The Importance of Top-Down Processing

  • Only 50% of the words in a tape recording can be identified out of context.
  • Miller & Isard (1963)
  • Warren & Warren (1970)

Miller & Isard (1963)

  • Both syntactic and semantic constraints improve auditory word recognition.
    • Accidents kill motorists on the highway. (+syntax, +semantics)
    • Accidents carry honey between the house. (+syntax, -semantics)
    • Around accidents country honey the shoot. (-syntax, -semantics)

Warren & Warren (1970)

  • Context alters phoneme identification.
    • It was found that the *eel was on the axle.
    • It was found that the *eel was on the shoe.
    • It was found that the *eel was on the orange.
    • It was found that the *eel was on the table.

The Motor Theory

  • Analysis by Synthesis
    • People understand speech by figuring out how to reproduce the speech stream.
  • The Motor Theory of Speech Perception
    • An extreme version of analysis by synthesis.
    • Assumes involvement of the articulatory mechanism.
    • Problems with Motor Theory
    • The McGurk Effect

The McGurk Effect

  • /ba/ + /ga/ --> /da/
    • Hear /ba/
    • See /ga/
    • Perceive /da/
  • From the UCSC Perceptual Sciences Lab Web Site
  • What does this tell us?

A Computational Analysis of Speech Perception

  • What is the input?
    • Frequency, intensity, and time.
    • Context and knowledge of how sounds are produced must play a role.
  • What is the goal?
    • Classification of phonemes.
  • What strategy is used to achieve the goal with the available input?
    • Interactive processing is a must!
    • Parallel processing is necessary to stay within the 100-step maximum.

Models of Speech Perception

  • HEARSAY (Reddy & Newell, 1974)
  • TRACE (McClelland & Elman, 1986)

HEARSAY (Reddy & Newell, 1974)

  • The Design of HEARSAY
  • The HEARSAY Architecture
  • The Semantic Component
  • The Syntactic Component
  • A Parsing Example
  • The Lexical Component
  • The Phonemic Component
  • The Parametric Component
  • Questions about HEARSAY
  • Evaluating HEARSAY

The Design of HEARSAY

  • HEARSAY is a computer program.
  • HEARSAY was designed to show that computers can understand speech well enough to do something with it.
  • HEARSAY functions in the restricted domain of voice chess.
  • HEARSAY consists of independent but cooperating components.
  • Each component is able to generate, reject and rank order hypotheses.
  • All communication is through a "blackboard".
  • HEARSAY uses procedural representations.
  • HEARSAY uses the same information found in a spectrogram.
  • HEARSAY gives us a preview of where we are headed in this course!
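
The idea of independent components that communicate only through a blackboard can be sketched as follows. This is a toy illustration, not HEARSAY's actual code: the class, the three component functions, the candidate words, and all scores are hypothetical.

```python
class Blackboard:
    """Shared store of word hypotheses and their running scores."""

    def __init__(self):
        self.scores = {}

    def propose(self, word, score):
        # Generate a hypothesis, or add support to an existing one.
        self.scores[word] = self.scores.get(word, 0.0) + score

    def reject(self, word):
        self.scores.pop(word, None)

    def ranked(self):
        # Rank order the surviving hypotheses, best first.
        return sorted(self.scores, key=self.scores.get, reverse=True)


def acoustic_component(board):
    # Hypothetical bottom-up scores for three similar-sounding words.
    for word, score in {"to": 0.6, "two": 0.6, "takes": 0.5}.items():
        board.propose(word, score)


def syntactic_component(board):
    # Assumed toy grammar: a number cannot follow a piece name here.
    board.reject("two")


def semantic_component(board):
    # Assumed board position in which a capture is the most likely move.
    if "takes" in board.scores:
        board.propose("takes", 0.3)


board = Blackboard()
for component in (acoustic_component, syntactic_component, semantic_component):
    component(board)
```

The components never call each other. Each one reads and writes the blackboard, and the final ranking reflects all three sources of knowledge.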

The HEARSAY Architecture

The Semantic Component

  • HEARSAY generates an ordered (best to worst) set of legal moves based on:
    • the rules of chess
    • the current board position
    • a user model (assumes a rational opponent)
  • The current state of the conversation is used to eliminate moves (e.g., "capture" eliminates non-capture moves).

The Syntactic Component

  • HEARSAY’s grammar consists of 18 rewrite rules such as:
    • move --> move1 + check_word OR move1
    • move1 --> regular_move OR capture OR castle
    • capture --> man_loc + capture_word + man_loc OR . . .
    • capture_word --> "takes" OR . . .
  • These rules can generate more than 5 million sentences.
  • A "generate-and-test" procedure is used to predict the next word.
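
A generate-and-test predictor over rewrite rules can be sketched as follows. Only the four rules shown above come from the outline, and the alternatives hidden behind ". . ." are left out. The expansions for check_word, regular_move, castle, and man_loc are hypothetical stand-ins invented here so that the sketch runs, and the function names are assumptions.

```python
from itertools import product

GRAMMAR = {
    # Rules listed in the outline (elided alternatives omitted).
    "move": [["move1", "check_word"], ["move1"]],
    "move1": [["regular_move"], ["capture"], ["castle"]],
    "capture": [["man_loc", "capture_word", "man_loc"]],
    "capture_word": [["takes"]],
    # Hypothetical stand-ins, not taken from HEARSAY.
    "check_word": [["check"]],
    "regular_move": [["pawn", "to", "king", "four"]],
    "castle": [["castles"]],
    "man_loc": [["pawn"], ["bishop"]],
}


def expand(symbol):
    """Yield every word sequence that a symbol can be rewritten as."""
    if symbol not in GRAMMAR:  # a terminal word
        yield [symbol]
        return
    for alternative in GRAMMAR[symbol]:
        parts = [list(expand(s)) for s in alternative]
        for combo in product(*parts):
            yield [word for part in combo for word in part]


# Generate: enumerate every sentence the toy grammar allows.
SENTENCES = list(expand("move"))


def next_words(prefix):
    """Test: keep sentences that match the prefix, return their next word."""
    n = len(prefix)
    return {s[n] for s in SENTENCES if len(s) > n and s[:n] == prefix}
```

In this toy grammar, next_words(["pawn"]) narrows the next word to "to" or "takes". Enumerating every sentence is practical only because the grammar is tiny.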

A Parsing Example

The Lexical Component

  • HEARSAY has a lexicon of 31 chess words.
  • Each lexical entry includes:
    • a phonemic description of the word
    • its stress pattern
    • its grammatical category
    • a procedural representation of its meaning

The Phonemic Component

  • Characteristics (Acoustic Features) of Phonemes
  • Rules for Dealing with Missing or Extra Segments
  • Juncture Rules
  • Rules for Distinguishing Pairs
  • Uniquely Identifiable Sounds (Stressed Vowels)
  • Phoneme Boundaries

The Parametric Component

  • Speaker Characteristics
  • Allophonic Variability and Context

Questions about HEARSAY

  • Is the representation local or distributed?
  • How many levels of representation are assumed?
  • What processes are assumed?
  • Are the processes bottom-up, top-down, or interactive?
  • Are the processes sequential or parallel within levels?
  • Are the processes sequential or parallel between levels?
  • Are the processes controlled or automatic?
  • Are the processes symbolic or sub-symbolic?
  • Are the structures and processes modular?

Evaluating HEARSAY

  • HEARSAY was evaluated on 19 utterances containing 101 words.
  • The intact model was correct on
    • 88% of the words
    • 46% of the sentences
  • Without the semantic component it was correct on
    • 65% of the words
    • 14% of the sentences
  • Without the semantic or syntactic components it was correct on
    • 40% of the words
    • 0% of the sentences
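
The ablation results above can be read as the accuracy lost when each component is removed. The short sketch below only restates the reported percentages and subtracts them. The dictionary keys and the function name are labels chosen here.

```python
# Percent correct, as reported in the outline.
RESULTS = {
    "intact": {"words": 88, "sentences": 46},
    "no semantics": {"words": 65, "sentences": 14},
    "no semantics or syntax": {"words": 40, "sentences": 0},
}


def drop(measure, before, after):
    """Percentage points lost when moving from one configuration to another."""
    return RESULTS[before][measure] - RESULTS[after][measure]


semantic_word_loss = drop("words", "intact", "no semantics")
syntactic_word_loss = drop("words", "no semantics", "no semantics or syntax")
semantic_sentence_loss = drop("sentences", "intact", "no semantics")
```

Removing semantics costs 23 points of word accuracy and 32 points of sentence accuracy. Removing syntax as well costs a further 25 points of word accuracy.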

TRACE (McClelland & Elman, 1986)

  • An interactive activation model for speech recognition.
  • Three levels of representation.
    • Distinctive Features X Time
      • Articulatory properties that influence perception.
      • Example: +/- voicing
    • Phonemes X Time
    • Words
  • No attempt is made to segment the speech stream prior to phoneme identification.
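
Interactive activation can be illustrated with a toy two-level network. This sketch is not TRACE itself. It omits the feature level and the time dimension, uses letters as stand-ins for phonemes, and uses a simplified update rule. The lexicon, the parameters, and the input values are all assumptions. The first segment is deliberately ambiguous between "b" and "p".

```python
LEXICON = ["plug", "blob"]  # hypothetical two-word lexicon
# Bottom-up evidence per position; position 0 is ambiguous.
EVIDENCE = [{"b": 0.5, "p": 0.5}, {"l": 1.0}, {"u": 1.0}, {"g": 1.0}]

INPUT_GAIN, W_UP, W_DOWN, INHIBIT, DECAY = 0.2, 0.1, 0.1, 0.1, 0.1


def update(act, net):
    """Push activation toward 1 for positive net input, toward 0 for negative."""
    delta = net * (1.0 - act) if net > 0 else net * act
    return min(1.0, max(0.0, act + delta - DECAY * act))


def run(steps=30):
    # One unit per (position, phoneme) found in the input or the lexicon.
    units = {(i, ph) for i, ev in enumerate(EVIDENCE) for ph in ev}
    units |= {(i, ph) for w in LEXICON for i, ph in enumerate(w)}
    phon = dict.fromkeys(units, 0.0)
    word = dict.fromkeys(LEXICON, 0.0)
    for _ in range(steps):
        new_phon = {}
        for (i, ph), act in phon.items():
            bottom_up = INPUT_GAIN * EVIDENCE[i].get(ph, 0.0)
            # Top-down feedback from words that contain this phoneme here.
            top_down = W_DOWN * sum(a for w, a in word.items() if w[i] == ph)
            # Inhibition from rival phonemes at the same position.
            rivals = INHIBIT * sum(
                a for (j, q), a in phon.items() if j == i and q != ph
            )
            new_phon[(i, ph)] = update(act, bottom_up + top_down - rivals)
        new_word = {}
        for w, act in word.items():
            support = W_UP * sum(phon[(i, ph)] for i, ph in enumerate(w))
            rivals = INHIBIT * sum(a for v, a in word.items() if v != w)
            new_word[w] = update(act, support - rivals)
        phon, word = new_phon, new_word
    return phon, word
```

After running the network, the unit for "p" at position 0 ends up more active than the unit for "b", even though the input favored neither. The word "plug" matches more of the input, so its feedback resolves the ambiguity.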

Speech Perception and the Brain

  • Massive parallelism satisfies the 100-step maximum.
  • TRACE uses brain-style computation; HEARSAY does not.
  • According to the WLG Model of aphasia:
    • Wernicke’s area stores information about word sounds.
    • Broca’s area is the speech planning and programming area.
  • Functional imaging studies confirm that these areas are involved in speech perception.
  • They are also involved in reading (phonological recoding)!
 


The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by the University of Minnesota.

This page was last updated on 03/02/00.