Auditory-Visual Integration: Implications for the New Millennium

Speech Recognition by Man and Machine

[Figure: Percent correct recognition as a function of speech-to-noise ratio (-15 to +20 dB) for NH auditory consonants, HI auditory consonants, HI auditory sentences, and ASR sentences.]


Auditory-Visual vs. Audio Speech Recognition

[Figure: Percent correct recognition as a function of speech-to-noise ratio (-15 to +20 dB) for NH audiovisual consonants, NH auditory consonants, HI audiovisual consonants, HI auditory consonants, HI auditory sentences, and ASR sentences.]


Common Problems – Common Solutions

  • Automatic Speech Recognition
  • Hearing Aid Technology
  • Synthetic Speech
  • Learning and Education - Classrooms
  • Foreign Language Training


How It All Works - The Prevailing View

  • Information extracted from both sources independently
  • Integration of extracted information
  • Decision statistic

[Diagram: Evaluation → Integration → Decision (from Massaro, 1998); one formalization is sketched below.]
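Massaro's Fuzzy Logical Model of Perception is one concrete published instance of this evaluate-integrate-decide scheme. As a rough sketch (the notation below is assumed for illustration, not taken from the slides), the decision stage for response alternative k, given independently evaluated auditory support a_k and visual support v_k, is:

```latex
% FLMP-style decision rule (sketch): supports are evaluated independently per
% modality, integrated multiplicatively, and normalized over all alternatives j.
P(R_k \mid A, V) = \frac{a_k \, v_k}{\sum_{j} a_j \, v_j}
```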


Possible Roles of Speechreading

  • Provide segmental information that is redundant with acoustic information
    • vowels
  • Provide segmental information that is complementary with acoustic information
    • consonants
  • Direct auditory analyses to the target signal
    • who, where, when, what (spectral)


Visual Feature Distribution

[Pie chart: Percent information transmitted (re: total information received) by visual cues for voicing, manner, place, and other features; place accounts for the large majority (93%), with the remaining features at 0-4%.]


Speech Feature Distribution - Aided A

[Scatter plot: Percent information transmitted for voicing, manner, and place cues, aided auditory versus unaided auditory (0-100%).]


Speech Feature Distribution - Unaided AV

[Scatter plot: Percent information transmitted for voicing, manner, and place cues, unaided auditory-visual versus unaided auditory (0-100%).]


Speech Feature Distribution - Aided AV

[Scatter plot: Percent information transmitted for voicing, manner, and place cues, aided auditory-visual versus unaided auditory (0-100%).]


Hypothetical Consonant Recognition - Perfect Feature Transmission

[Bar chart: Probability correct (0-1) for consonant recognition given perfect transmission of voicing, manner, place, or voicing + manner, shown for the three cases (A, V, PRE) defined below.]

  • A: Auditory consonant recognition based on perfect transmission of the indicated feature; responses within each feature category were uniformly distributed (a worked example follows this list).
  • V: Visual consonant recognition; data from 11 normally-hearing subjects (Grant and Walden, 1996).
  • PRE: Predicted AV consonant recognition based on the PRE model of integration (Braida, 1991).
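To make the "perfect feature transmission" bars concrete: if a feature is transmitted without error and the listener guesses uniformly among the consonants sharing that feature value, the expected proportion correct is simply (number of feature categories) / (number of consonants). A minimal sketch, using a hypothetical consonant set rather than the actual stimuli:

```python
from collections import Counter

def expected_correct(consonants, feature_of):
    """Expected proportion correct when the named feature is transmitted
    perfectly and responses are uniform within each feature category."""
    sizes = Counter(feature_of[c] for c in consonants)                  # category sizes
    return sum(1.0 / sizes[feature_of[c]] for c in consonants) / len(consonants)

# Hypothetical example: six stop consonants split by voicing into two categories.
voicing = {"p": "voiceless", "t": "voiceless", "k": "voiceless",
           "b": "voiced",    "d": "voiced",    "g": "voiced"}
print(expected_correct(list(voicing), voicing))   # 0.333... = 2 categories / 6 consonants
```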


Feature Distribution re: Center Frequency

[Figure: Percent information transmitted (re: total received) for voicing + manner versus place cues as a function of bandwidth center frequency (0-5000 Hz).]



Summary: Hearing Aids and AV Speech

  • Hearing aids enhance all speech features to some extent

    • Place cues are enhanced the most

    • Voicing cues are enhanced the least

  • Visual cues enhance mostly place cues, with little support for manner and virtually none for voicing

  • Model predictions show that auditory place cues are redundant with speechreading and offer little benefit under AV conditions.

  • Model predictions also show that AV speech recognition is best supported by auditory cues for voicing and manner

  • Voicing and manner cues are best represented acoustically in the low frequencies (below 500 Hz).


The Independence of Sensory Systems???

  • Information is extracted independently from A and V modalities
    • Early versus Late Integration
    • Most models are "late integration" models
    • BUT
  • Speechreading activates primary auditory cortex (cf. Sams et al., 1991)
  • Population of neurons in cat Superior Colliculus respond only to bimodal input (cf. Stein and Meredith, 1993)



Sensory Integration: Many Questions

  • How is the auditory system modulated by visual speech activity?

  • What is the temporal window governing this interaction?



The Effect of Speechreading on Masked Detection Thresholds for Spoken Sentences

Ken W. Grant and Philip Franz Seitz

Army Audiology and Speech Center

Walter Reed Army Medical Center

Washington, D.C. 20307, USA

http://www.wramc.amedd.army.mil/departments/aasc/avlab/

[email protected]



Acknowledgments and Thanks

Technical Assistance

Philip Franz Seitz

Research Funding

National Institutes of Health


Coherence Masking Protection

CMP (Gordon, 1997a,b)

Identification or classification of acoustic signals in noise is improved by adding signal components that are non-informative by themselves, provided that the relevant and non-relevant signal components “cohere” in some way (e.g., common fundamental frequency, common onset and offset, etc.)


Bimodal Coherence Masking Protection

BCMP (Grant and Seitz, 2000; Grant, 2001)

Detection of speech in noise is improved by watching a talker (i.e., speechreading) as they produce the target speech signal, provided that the visible movement of the lips and the acoustic amplitude envelope are highly correlated.
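The masking-protection values reported in the slides that follow can be read as a simple threshold difference; as a sketch (with T denoting the masked detection threshold in dB, notation assumed here rather than taken from the slides):

```latex
% BCMP as the improvement in masked detection threshold when the matching
% video (or other coherent cue) is added to the auditory-only condition.
\mathrm{BCMP} = T_{\mathrm{A}} - T_{\mathrm{AV}} \quad \text{(dB)}
```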


Bimodal Coherence

Acoustic/Optic Relationship

  • Reduction of Signal Uncertainty
    • in time domain
    • in spectral domain
  • Greater Neural Drive to Cognitive Centers
    • neural priming - auditory cortex
    • bimodal cells in superior colliculus


Basic Paradigm for BCMP

[Diagram: Two-interval forced-choice detection of speech in noise, run auditory-only (speech + noise versus noise alone) and auditory-visual (the same trials with video of the talker).]


Experiment 1: Natural Speech

  • 9 Interleaved Tracks - 2IFC Sentence Detection in Noise
    • 3 Audio target sentences - IEEE/Harvard Corpus
    • Same 3 target sentences with matched video
    • Same 3 target sentences with unmatched video (from 3 additional sentences)
    • White noise masker centered on target with random 100-300 ms extra onset and offset duration (see the sketch after this list)
  • 10 Normal-Hearing Subjects
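A minimal sketch of the masker timing described in the list above (sampling rate and function name are illustrative assumptions, not the actual stimulus code): the white-noise masker spans the target plus a random 100-300 ms pad at onset and at offset, so the target's temporal position within the noise gives no detection cue.

```python
import numpy as np

def speech_plus_noise_interval(target, fs=22050, pad_ms=(100, 300), rng=None):
    """Build one 2IFC observation interval: a white-noise masker spanning the
    target sentence plus a random 100-300 ms pad at onset and at offset.
    (The noise-alone interval is built the same way, with no target added.)"""
    rng = rng or np.random.default_rng()
    onset = int(rng.uniform(*pad_ms) * fs / 1000)      # random onset pad, in samples
    offset = int(rng.uniform(*pad_ms) * fs / 1000)     # random offset pad, in samples
    interval = rng.standard_normal(onset + len(target) + offset)   # white-noise masker
    interval[onset:onset + len(target)] += target                  # embed the sentence
    return interval
```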


Congruent versus Incongruent Speech

[Bar chart: Masked detection thresholds (dB, roughly -21 to -26) for auditory-only (A), auditory-visual matched (AVM), and auditory-visual unmatched (AVUM) conditions, shown for Sentences 1-3 and their average.]


BCMP for Congruent and Incongruent Speech

[Bar chart: Bimodal coherence protection (dB) for matched (AVM) and unmatched (AVUM) video, shown for Sentences 1-3 and their average. Matched video yields roughly 0.8-2.2 dB of protection (1.56 dB on average); unmatched video yields essentially none (-0.05 to 0.16 dB, 0.06 dB on average).]


Acoustic Envelope and Lip Area Functions

[Figure: RMS amplitude envelopes (dB) for wideband (WB) speech and for the F1, F2, and F3 bands, plotted together with lip area (in²) over time (0-3500 ms) for the sentence "Watch the log float in the wide river".]


Cross Modality Correlation - Lip Area versus Amplitude Envelope

Correlation (r) between lip area and the RMS amplitude envelope, for Sentence 3 and Sentence 2:

  • WB: r = 0.52, r = 0.35
  • F1: r = 0.49, r = 0.32
  • F2: r = 0.65, r = 0.47
  • F3: r = 0.60, r = 0.45
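A hedged sketch of how such cross-modality correlations can be computed (frame length and function names are illustrative assumptions): both signals are reduced to slowly varying time series at a common frame rate and then correlated.

```python
import numpy as np

def rms_envelope_db(signal, fs, frame_ms=20):
    """Frame-by-frame RMS amplitude envelope, in dB."""
    hop = int(fs * frame_ms / 1000)
    frames = [signal[i:i + hop] for i in range(0, len(signal) - hop + 1, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    return 20 * np.log10(rms + 1e-12)                  # small floor avoids log(0)

def cross_modality_r(audio, fs, lip_area, frame_ms=20):
    """Pearson correlation between the acoustic RMS envelope and a lip-area
    track sampled (or resampled) at the same frame rate."""
    env = rms_envelope_db(audio, fs, frame_ms)
    n = min(len(env), len(lip_area))                   # align the two series
    return np.corrcoef(env[:n], lip_area[:n])[0, 1]
```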


Local Correlations (Lip Area versus F2-Amplitude Envelope)

[Figure: Running correlation coefficient between lip area and the F2-band amplitude envelope, plotted together with the RMS amplitude envelope (dB), over time (0-3500 ms) for the sentence "Watch the log float in the wide river".]


Local Correlations (Lip Area versus F2-Amplitude Envelope)

[Figure: Running correlation coefficient between lip area and the F2-band amplitude envelope, plotted together with the RMS amplitude envelope (dB), over time (0-3500 ms) for the sentence "Both brothers wear the same size".]
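The "local correlations" plotted in the two figures above can be thought of as the same lip-area/envelope correlation computed inside a short sliding window rather than over the whole sentence; a minimal sketch (the window length is an illustrative assumption):

```python
import numpy as np

def running_correlation(x, y, win=15):
    """Correlation between two equal-frame-rate series within a sliding window
    of `win` frames, returning one r value per window position."""
    n = min(len(x), len(y))
    return np.array([np.corrcoef(x[i:i + win], y[i:i + win])[0, 1]
                     for i in range(n - win + 1)])
```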


Methodology for Orthographic BCMP

  • Auditory-only speech detection
  • Auditory + orthographic speech detection

[Diagram: Same two-interval detection-in-noise trials, with the text of the target sentence displayed in the orthographic condition.]


Experiment 2: Orthography

  • 6 Interleaved Tracks - 2IFC Sentence Detection in Noise
    • 3 Audio target sentences - IEEE/Harvard Corpus
    • Same 3 target sentences with matched orthography
    • White noise masker centered on target with random 100-300 ms extra onset and offset duration
  • 6 Normal-Hearing Subjects (from EXP 1)


BCMP for Orthographic Speech

[Bar chart: Bimodal coherence protection (dB) for orthographic presentation, shown for Sentences 1-3 and their average; sentence values of 0.33, 0.38, and 0.78 dB, averaging 0.50 dB.]


Methodology for Filtered BCMP

F1 (100-800 Hz), F2 (800-2200 Hz)

  • Auditory-only detection of filtered speech
  • Auditory-visual detection of filtered speech

[Diagram: Same two-interval detection-in-noise trials, with the target speech bandpass filtered into the F1 or F2 band.]



Experiment 3: Filtered Speech

  • 6 Interleaved Tracks - 2IFC Sentence Detection in Noise

    • 2 Audio target sentences - IEEE/Harvard Corpus - each filtered into F1 and F2 bands (a filtering sketch follows this list)

    • Same 2 target sentences that produced largest BCMP

    • White noise masker centered on target with random 100-300 ms extra onset and offset duration

  • 6 Normal-Hearing Subjects (from EXP 2)
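A hedged sketch of the band filtering just described (the filter design and order are illustrative assumptions): each target sentence is bandpass filtered into the F1 (100-800 Hz) or F2 (800-2200 Hz) region before being embedded in the masker.

```python
from scipy.signal import butter, sosfiltfilt

BANDS = {"F1": (100, 800), "F2": (800, 2200)}    # band edges in Hz, from the slide

def bandpass(signal, fs, lo, hi, order=4):
    """Zero-phase Butterworth bandpass filter."""
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def filtered_targets(sentence, fs):
    """Return the F1-band and F2-band versions of a target sentence."""
    return {name: bandpass(sentence, fs, lo, hi) for name, (lo, hi) in BANDS.items()}
```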


    BCMP for Filtered Speech

    [Bar chart: Masking protection (dB, 0-3) for the AV F1, AV F2, AV WB, and AV O conditions, shown for Sentences 2 and 3 and their average.]


    Cross Modality Correlations (lip area versus acoustic envelope)

    [Figure: Four panels (Sentences 2 and 3, F1 and F2 bands) showing the running correlation coefficient between lip area and the band-limited acoustic envelope, plotted together with peak amplitude (dB) over time.]


    BCMP – Summary

    • Speechreading reduces auditory speech detection thresholds by about 1.5 dB (range: 1-3 dB depending on sentence)

    • Amount of BCMP depends on the degree of coherence between acoustic envelope and facial kinematics

    • Providing listeners with explicit (orthographic) knowledge of the identity of the target sentence reduces speech detection thresholds by about 0.5 dB, independent of the specific target sentence

    • Manipulating the degree of coherence between area of mouth opening and acoustic envelope by filtering the target speech has a direct effect on BCMP



    Temporal Window for Auditory-Visual Integration


    Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information

    Ken W. Grant
    Army Audiology and Speech Center
    Walter Reed Army Medical Center
    Washington, D.C. 20307, USA
    http://www.wramc.amedd.army.mil/departments/aasc/avlab
    [email protected]

    Steven Greenberg
    International Computer Science Institute
    1947 Center Street, Berkeley, CA 94704, USA
    http://www.icsi.berkeley.edu/~steveng
    [email protected]



    Acknowledgments and Thanks

    Technical Assistance

    Takayuki Arai, Rosaria Silipo

    Research Funding

    U.S. National Science Foundation


    Spectral-Temporal Integration

    Time-Course of Spectral-Temporal Integration

    • Within Modality
      • 4 narrow spectral slits
      • delay two middle bands with respect to outer bands
    • Across Modality
      • 2 narrow spectral slits (2 outer bands)
      • delay audio with respect to video


    Auditory-Visual Asynchrony - Paradigm

    Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material

    [Diagram: The baseline condition is synchronous A/V; in the asynchronous conditions either the video leads or the audio leads, by 40-400 ms (see the sketch below).]
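    As a small illustrative sketch (the function name and sign convention are assumptions, not the authors' code), each 40-400 ms asynchrony can be realized by advancing or delaying the audio waveform relative to the video timeline:

```python
import numpy as np

def desynchronize_audio(audio, fs, asynchrony_ms):
    """Shift the audio track relative to the video timeline.
    Positive asynchrony_ms: audio leads the video (audio advanced).
    Negative asynchrony_ms: video leads the audio (audio delayed)."""
    shift = int(abs(asynchrony_ms) * fs / 1000)            # offset in samples
    if shift == 0:
        return audio.copy()
    if asynchrony_ms > 0:                                  # audio leads
        return np.concatenate([audio[shift:], np.zeros(shift)])
    return np.concatenate([np.zeros(shift), audio[:-shift]])   # video leads
```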



    Auditory-Visual Integration - Preview

    When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

    When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms

    Why? Why? Why?

    We’ll return to these data shortly

    But first, let’s take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case

    The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences

    9 Subjects


    AUDIO-ALONE EXPERIMENTS


    Audio (Alone) Spectral Slit Paradigm

    Can listeners decode spoken sentences using just four narrow (1/3 octave) channels (“slits”) distributed across the spectrum? – YES (cf. next slide)

    What is the intelligibility of each slit alone and in combination with others?

    The edge of each slit was separated from its nearest neighbor by an octave (see the sketch below)
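    A small sketch (assuming the standard 1/3-octave band definition and the slit center frequencies shown on the next slide) of the slit geometry, which also checks that adjacent slit edges are separated by about an octave, as stated above:

```python
CENTER_FREQS = [334, 841, 2120, 5340]       # slit center frequencies (Hz), next slide

def third_octave_edges(fc):
    """Lower and upper edges of a 1/3-octave band centered at fc."""
    return fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)

edges = [third_octave_edges(fc) for fc in CENTER_FREQS]
for (_, upper), (lower, _) in zip(edges, edges[1:]):
    print(f"edge separation: {lower / upper:.2f}x  (2.00x = one octave)")
```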


    Word Intelligibility - Single and Multiple Slits

    [Bar chart: Word intelligibility for the four slits (center frequencies 334, 841, 2120, and 5340 Hz) presented singly and in combination. Single slits yield only 2-9% of words correct; slit combinations yield 13%, 60%, and 89%.]



    Slit Asynchrony Affects Intelligibility

    Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility

    The effect of asynchrony on intelligibility is relatively symmetrical

    These data are from a different set of subjects than those participating in the study described earlier - hence slightly different numbers for the baseline conditions



    Intelligibility and Slit Asynchrony

    Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility

    Asynchrony greater than 50 ms results in intelligibility lower than baseline


    AUDIO-VISUAL EXPERIMENTS



    Focus on Audio-Leading-Video Conditions

    When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

    These data are next compared with data from the previous slide to illustrate the similarity in the slope of the function



    Comparison of A/V and Audio-Alone Data

    The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition

    Such similarity in the slopes associated with intelligibility for both experiments suggests that the underlying mechanisms may be similar

    The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves



    Focus on Video-Leading-Audio Conditions

    When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms

    These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio

    We’ll consider a variety of interpretations in a few minutes



    Auditory-Visual Integration

    The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions

    WHY? WHY? WHY?

    There are several interpretations of these data – we'll consider them on the following slides


    INTERPRETATION OF THE DATA


    Possible Interpretations of the Data – 1

    The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

    Visual information is processed in the brain much more slowly than auditory information (difference in first spike latency in V1 versus A1 is about 30-50 ms)

    Some problems with this interpretation ….

    Simple differences in neural conduction times may account for a preference for visual leading information BUT cannot account for the asymmetry in the temporal window




    Possible Interpretations of the Data – 2

    The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

    There is certain information in the video component of the speech signal (and in the mid-frequency audio channels) that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?

    The visual component of the speech signal is most closely associated with place-of-articulation information (cf. Grant and Walden, 1996)

    In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)

    BUT ….. the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels


    Possible Interpretations of the Data – 3

    The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

    The brain has evolved under conditions where the visual signal arrives prior to the audio signal, but where the time disparity between the two modalities varies from situation to situation

    Therefore, processing of A and V information is tolerant of asynchronous arrival times so long as the visual information leads the auditory information

    BUT… Why the 200 ms value? Given the speed of light and sound (cf. 1129 ft/sec), 200 ms would suggest that we can process AV stimuli from sources as far away as 225.8 ft. That may be a bit much!

    In addition to speed differences, we also must consider that speech movements precede audio output by variable amounts. The limit of 200 ms may reflect a boundary associated with syllabic length units, over which integration is probably not beneficial.



    SUMMARY


    Audio-Video Integration – Summary

    Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone in the absence of the other modality

    This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)

    When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony

    When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms

    There are many potential interpretations of the data

    The simplest explanation is also the most ecologically valid: we prefer conditions where visual information leads audio information because that's what we're confronted with in the real world. But there are still questions remaining.

    For visual lag conditions, the similarity between within-modality and across-modality integration is intriguing, and may suggest an interpretation based on the linguistic information conveyed by A and V modalities and which modality first activates the system



    That’s All, Folks

    Many Thanks for Your Time and Attention


    VARIABILITY AMONG SUBJECTS


    One Further Wrinkle to the Story ….

    Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects

    For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio

    The length of optimal asynchrony (in terms of intelligibility) varies from subject to subject, but is generally between 80 and 120 ms


    Auditory-Visual Integration - by Individual Subjects

    Video signal leading is better than synchronous for 8 of 9 subjects

    These data are complex, but the implications are clear: audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility



    Hearing Aids and Speechreading

    • Are hearing aids optimally designed and fit for auditory-visual speech input?

      • How can they be when visual cues are not considered in the design or fitting process!


    Consonant Recognition - Amplification versus Speechreading

    [Bar chart: Percent correct consonant recognition at an input level of 50 dB SPL for four test conditions: unaided A, aided A, unaided AV, and aided AV. Walden, Grant, & Cord (2001). Ear and Hearing 22, 333-341.]



    Hearing Aid Summary

    • To optimize AV speech recognition, hearing aids must…

      • Provide accurate reception of speech cues not readily available through speechreading (complementary cues)

        • voicing

        • manner

      • There may have to be a trade-off between providing the best auditory-alone experience and one where AV cues are optimized

