Auditory-Visual Integration: Implications for the New Millennium

Presentation Transcript


Auditory-Visual Integration: Implications for the New Millennium


Speech Recognition by Man and Machine

[Figure: percent correct recognition (0-100%) as a function of speech-to-noise ratio (-15 to +20 dB) for NH auditory consonants, HI auditory consonants, HI auditory sentences, and ASR sentences.]


Auditory-Visual vs. Audio Speech Recognition

[Figure: percent correct recognition (0-100%) as a function of speech-to-noise ratio (-15 to +20 dB) for NH audiovisual consonants, NH auditory consonants, HI audiovisual consonants, HI auditory consonants, HI auditory sentences, and ASR sentences.]


Common Problems – Common Solutions

  • Automatic Speech Recognition
  • Hearing Aid Technology
  • Synthetic Speech
  • Learning and Education - Classrooms
  • Foreign Language Training


How It All Works - The Prevailing View

  • Information extracted from both sources independently

  • Integration of extracted information

  • Decision statistic

[Diagram: Evaluation → Integration → Decision (from Massaro, 1998).]
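As a concrete illustration of this late-integration view, here is a minimal FLMP-style sketch in the spirit of Massaro (1998), not the authors' implementation: each modality is evaluated independently to yield a support value for every response candidate, the supports are multiplied, and the products are normalized into a decision statistic. The support values below are hypothetical.

```python
# Minimal FLMP-style late-integration sketch (illustrative only).
# Each modality supplies an independent "support" value in [0, 1] for every
# candidate response; the values below are hypothetical.

def integrate(audio_support, visual_support):
    """Multiply per-candidate support from the two modalities and normalize."""
    combined = {c: audio_support[c] * visual_support[c] for c in audio_support}
    total = sum(combined.values())
    return {c: v / total for c, v in combined.items()}  # decision statistic

# Example: /ba/ vs. /da/ with ambiguous audio but clear video.
audio = {"ba": 0.55, "da": 0.45}   # evaluation of the acoustic source
video = {"ba": 0.10, "da": 0.90}   # evaluation of the visual source
print(integrate(audio, video))      # the clear visual evidence dominates
```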




Possible Roles of Speechreading

  • Provide segmental information that is redundant with acoustic information

    • vowels

  • Provide segmental information that is complementary to acoustic information

    • consonants

  • Direct auditory analyses to the target signal

    • who, where, when, what (spectral)


Visual Feature Distribution

[Pie chart: percent information transmitted (re: total information received) for the visual features Voicing, Manner, Place, and Other, with values of 93%, 4%, 3%, and 0%; place information dominates.]
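These feature analyses are expressed as relative information transmitted. One way such a percentage can be computed (a rough sketch, not necessarily the exact procedure used here) is to collapse the consonant confusion matrix by feature category, compute the transmitted (mutual) information of the collapsed matrix, and divide by the transmitted information of the full matrix. The confusion matrix and the voicing grouping below are made up for illustration; they are not the study's data.

```python
import numpy as np

def transmitted_info(counts):
    """Mutual information (bits) of a stimulus-by-response count matrix."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)          # stimulus marginal
    py = p.sum(axis=0, keepdims=True)          # response marginal
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

def feature_info_ratio(confusions, groups):
    """Info transmitted for a feature grouping, re: total info received."""
    n = max(groups) + 1
    collapsed = np.zeros((n, n))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            collapsed[gi, gj] += confusions[i, j]
    return transmitted_info(collapsed) / transmitted_info(confusions)

# Hypothetical 4-consonant confusion matrix and an assumed voicing grouping.
conf = np.array([[30.0, 5, 10, 5],
                 [6, 28, 4, 12],
                 [9, 3, 31, 7],
                 [4, 11, 6, 29]])
voicing = [0, 1, 0, 1]                          # e.g., voiceless vs. voiced
print(f"{100 * feature_info_ratio(conf, voicing):.1f}% of received information")
```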


Speech Feature Distribution - Aided A

[Scatter plot: percent information transmitted for the Voicing, Manner, and Place features, aided auditory (0-100) versus unaided auditory (0-100).]


Speech Feature Distribution - Unaided AV

[Scatter plot: percent information transmitted for the Voicing, Manner, and Place features, unaided auditory-visual (0-100) versus unaided auditory (0-100).]


Speech Feature Distribution - Aided AV

[Scatter plot: percent information transmitted for the Voicing, Manner, and Place features, aided auditory-visual (0-100) versus unaided auditory (0-100).]


Hypothetical Consonant Recognition - Perfect Feature Transmission

[Bar chart: probability correct (0-1) as a function of the feature(s) transmitted - Voicing, Manner, Place, Voicing + Manner - for three cases:
A: auditory consonant recognition based on perfect transmission of the indicated feature, with responses uniformly distributed within each feature category;
V: visual consonant recognition, data from 11 normal-hearing subjects (Grant and Walden, 1996);
PRE: predicted AV consonant recognition based on the PRE model of integration (Braida, 1991).]
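A minimal sketch of the "perfect feature transmission" calculation: if a feature is transmitted without error and responses are uniformly distributed within each feature class, the expected probability correct is the mean of 1/(class size) across consonants. The consonant set and class labels below are assumed for illustration; combined features (e.g., Voicing + Manner) can be handled by using the tuple of feature values as the class label.

```python
from collections import Counter

def p_correct_perfect_feature(feature_class):
    """Expected P(correct) when only this feature is transmitted perfectly and
    responses are uniformly distributed within each feature class."""
    sizes = Counter(feature_class.values())
    return sum(1.0 / sizes[c] for c in feature_class.values()) / len(feature_class)

# Hypothetical 6-consonant set with an assumed voicing classification.
voicing = {"p": "vl", "t": "vl", "k": "vl", "b": "vd", "d": "vd", "g": "vd"}
print(p_correct_perfect_feature(voicing))   # two classes of 3 -> 0.333...
```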


Feature Distribution re: Center Frequency

[Line chart: percent information transmitted (re: total received, 0-100) for Voicing + Manner and for Place as a function of bandwidth center frequency (0-5000 Hz).]


Summary: Hearing Aids and AV Speech

  • Hearing aids enhance all speech features to some extent

    • Place cues are enhanced the most

    • Voicing cues are enhanced the least

  • Visual cues enhance mostly place cues, with little support for manner and virtually no support for voicing

  • Model predictions show that auditory place cues are redundant with speechreading and offer little benefit under AV conditions.

  • Model predictions also show that AV speech recognition is best supported by auditory cues for voicing and manner

  • Voicing and manner cues are best represented acoustically in the low frequencies (below 500 Hz).




The Independence of Sensory Systems???

  • Information is extracted independently from A and V modalities

    • Early versus Late Integration

    • Most models are "late integration" models

    • BUT

  • Speechreading activates primary auditory cortex (cf. Sams et al., 1991)

  • A population of neurons in the cat superior colliculus responds only to bimodal input (cf. Stein and Meredith, 1993)


Sensory Integration: Many Questions

  • How is the auditory system modulated by visual speech activity?

  • What is the temporal window governing this interaction?


The Effect of Speechreading on Masked Detection Thresholds for Spoken Sentences

Ken W. Grant and Philip Franz Seitz

Army Audiology and Speech Center

Walter Reed Army Medical Center

Washington, D.C. 20307, USA

http://www.wramc.amedd.army.mil/departments/aasc/avlab/

[email protected]


Acknowledgments and Thanks

Technical Assistance

Philip Franz Seitz

Research Funding

National Institutes of Health


Coherence Masking Protection (CMP; Gordon, 1997a,b)

Identification or classification of acoustic signals in noise is improved by adding signal components that are uninformative by themselves, provided that the relevant and non-relevant components "cohere" in some way (e.g., common fundamental frequency, common onset and offset).


Bimodal Coherence Masking Protection (BCMP; Grant and Seitz, 2000; Grant, 2001)

Detection of speech in noise is improved by watching the talker (i.e., speechreading) produce the target speech signal, provided that the visible movements of the lips and the acoustic amplitude envelope are highly correlated.


Bimodal Coherence: Acoustic/Optic Relationship

  • Reduction of signal uncertainty
    • in the time domain
    • in the spectral domain
  • Greater neural drive to cognitive centers
    • neural priming - auditory cortex
    • bimodal cells in the superior colliculus


Basic Paradigm for BCMP

[Diagram: two-interval forced-choice detection. In the auditory-only condition, the target speech is embedded in one of two noise intervals; in the auditory-visual condition, the same detection task is performed while watching the talker.]


Experiment 1: Natural Speech

  • 9 Interleaved Tracks - 2IFC Sentence Detection in Noise
    • 3 Audio target sentences - IEEE/Harvard Corpus
    • Same 3 target sentences with matched video
    • Same 3 target sentences with unmatched video (from 3 additional sentences)
    • White noise masker centered on target with random 100-300 ms extra onset and offset duration
  • 10 Normal-Hearing Subjects
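The transcript does not state the adaptive rule used to estimate the masked thresholds, so the following is only a rough sketch of one common choice: a 2-down/1-up staircase on the speech level in a 2IFC detection task. The `present_trial` callback, starting level, step size, and stopping rule are all assumptions. BCMP is then simply the auditory-only threshold minus the auditory-visual threshold, in dB.

```python
import random

def two_ifc_staircase(present_trial, start_db=-15.0, step_db=2.0, n_reversals=8):
    """Estimate a masked detection threshold with an assumed 2-down/1-up rule.

    present_trial(level_db) must run one 2IFC trial at the given speech level
    (relative to the noise) and return True if the listener chose the
    speech-bearing interval.  Returns the mean level at the reversal points."""
    level, run, direction, reversals = start_db, 0, 0, []
    while len(reversals) < n_reversals:
        if present_trial(level):
            run += 1
            if run == 2:                          # two correct -> make it harder
                if direction == +1:
                    reversals.append(level)
                level, run, direction = level - step_db, 0, -1
        else:                                     # one wrong -> make it easier
            if direction == -1:
                reversals.append(level)
            level, run, direction = level + step_db, 0, +1
    return sum(reversals) / len(reversals)

# Simulated listener (assumption): always detects speech above -24 dB,
# otherwise guesses.  With real data, BCMP = threshold_A - threshold_AV.
fake_listener = lambda level: level > -24.0 or random.random() < 0.5
print(two_ifc_staircase(fake_listener))
```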


Congruent versus Incongruent Speech

[Figure: masked thresholds (dB, roughly -26 to -21) for the auditory-only (A), matched auditory-visual (AV-M), and unmatched auditory-visual (AV-UM) conditions, for target sentences 1-3 and their average.]


BCMP for Congruent and Incongruent Speech

[Figure: bimodal coherence protection (dB) for the matched (AV-M) and unmatched (AV-UM) conditions, for target sentences 1-3 and their average. Matched values range from roughly 0.8 to 2.2 dB, averaging about 1.5 dB; unmatched values are near zero (roughly -0.05 to 0.16 dB).]


Acoustic Envelope and Lip Area Functions

[Figure: for the sentence "Watch the log float in the wide river" - top panel: RMS amplitude envelopes (dB) of the wideband (WB), F1, F2, and F3 bands over time (0-3500 ms); bottom panel: lip area (in²) over the same time span.]


Cross Modality Correlation - Lip Area versus Amplitude Envelope

  RMS Envelope Band | Sentence 3 | Sentence 2
  WB                | r = 0.52   | r = 0.35
  F1                | r = 0.49   | r = 0.32
  F2                | r = 0.65   | r = 0.47
  F3                | r = 0.60   | r = 0.45
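A minimal sketch of how such cross-modality correlations might be computed, assuming an audio waveform and a lip-area trace sampled at the video frame rate are already available. The band edges follow the F1/F2 definitions given later in the talk (100-800 Hz and 800-2200 Hz); the synthetic signals, frame rate, and filter order below are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_rms_envelope(x, fs, lo, hi, frame_rate=30.0):
    """Bandpass the waveform and return a frame-by-frame RMS envelope in dB."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    y = sosfiltfilt(sos, x)
    hop = int(round(fs / frame_rate))              # one value per video frame
    rms = np.array([np.sqrt(np.mean(y[i:i + hop] ** 2))
                    for i in range(0, len(y) - hop, hop)])
    return 20 * np.log10(rms + 1e-12)

def cross_modality_r(audio, fs, lip_area, band=(800.0, 2200.0)):
    """Pearson correlation between the lip-area trace and a band envelope."""
    env = band_rms_envelope(audio, fs, *band)
    n = min(len(env), len(lip_area))
    return float(np.corrcoef(env[:n], lip_area[:n])[0, 1])

# Hypothetical usage with synthetic stand-ins for a recorded sentence and a
# tracked mouth-area function (30 frames/s).
fs = 16000
t = np.arange(3 * fs) / fs
audio = np.random.randn(len(t)) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
lip_area = 0.5 + 0.5 * np.sin(2 * np.pi * 3 * np.arange(90) / 30.0)
print(cross_modality_r(audio, fs, lip_area))
```

The "local" correlations shown on the next slides could be obtained the same way by computing the correlation inside a short sliding window rather than over the whole sentence.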


Local Correlations (Lip Area versus F2 Amplitude Envelope)

[Figure: running local correlation coefficient (-1 to +1) between lip area and the F2-band RMS amplitude envelope (dB, 0-80), plotted over time (0-3500 ms) for the sentence "Watch the log float in the wide river".]


Local Correlations (Lip Area versus F2 Amplitude Envelope)

[Figure: running local correlation coefficient (-1 to +1) between lip area and the F2-band RMS amplitude envelope (dB, 0-80), plotted over time (0-3500 ms) for the sentence "Both brothers wear the same size".]


Methodology for Orthographic BCMP

  • Auditory-only speech detection
  • Auditory + orthographic speech detection

[Diagram: as in the basic paradigm, the target speech is embedded in one of two noise intervals; in the orthographic condition, the printed text of the target sentence is displayed during detection.]


Experiment 2: Orthography

  • 6 Interleaved Tracks - 2IFC Sentence Detection in Noise
    • 3 Audio target sentences - IEEE/Harvard Corpus
    • Same 3 target sentences with matched orthography
    • White noise masker centered on target with random 100-300 ms extra onset and offset duration
  • 6 Normal-Hearing Subjects (from EXP 1)


BCMP for Orthographic Speech

[Figure: bimodal coherence protection (dB) for target sentences 1-3 and their average; values range from 0.33 to 0.78 dB, averaging roughly 0.5 dB.]


Methodology for Filtered BCMP

F1 (100-800 Hz), F2 (800-2200 Hz)

  • Auditory-only detection of filtered speech
  • Auditory-visual detection of filtered speech

[Diagram: as in the basic paradigm, the filtered target speech is embedded in one of two noise intervals, with or without the video of the talker.]


Experiment 3: Filtered Speech

  • 6 Interleaved Tracks - 2IFC Sentence Detection in Noise

    • 2 Audio target sentences - IEEE/Harvard Corpus - each filtered into F1 and F2 bands

    • Same 2 target sentences that produced largest BCMP

    • White noise masker centered on target with random 100-300 ms extra onset and offset duration

  • 6 Normal-Hearing Subjects (from EXP 2)


BCMP for Filtered Speech

[Figure: masking protection (dB, 0-3) for the AV-F1, AV-F2, AV-WB, and AV-O conditions, for target sentences 2 and 3 and their average.]


Cross Modality Correlations (lip area versus acoustic envelope)

[Figure: four panels showing the running local correlation coefficient (-1 to +1) between lip area and the F1- and F2-band envelopes (peak amplitude, dB, 0-80) for sentences S2 and S3, plotted over time (ms).]




    BCMP – Summary

    • Speechreading reduces auditory speech detection thresholds by about 1.5 dB (range: 1-3 dB depending on sentence)

    • Amount of BCMP depends on the degree of coherence between acoustic envelope and facial kinematics

    • Providing listeners with explicit (orthographic) knowledge of the identity of the target sentence reduces speech detection thresholds by about 0.5 dB, independent of the specific target sentence

    • Manipulating the degree of coherence between area of mouth opening and acoustic envelope by filtering the target speech has a direct effect on BCMP


    Temporal Window for Auditory-Visual Integration


    Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information

    Ken W. Grant

    Army Audiology and Speech Center

    Walter Reed Army Medical Center

    Washington, D.C. 20307, USA

    http://www.wramc.amedd.army.mil/departments/aasc/avlab

    [email protected]

    Steven Greenberg

    International Computer Science Institute

    1947 Center Street, Berkeley, CA 94704, USA

    http://www.icsi.berkeley.edu/~steveng

    [email protected]


    Acknowledgments and Thanks

    Technical Assistance

    Takayuki Arai, Rosaria Silipo

    Research Funding

    U.S. National Science Foundation


    Time-Course of Spectral-Temporal Integration

    • Within modality
      • 4 narrow spectral slits
      • delay the two middle bands with respect to the outer bands
    • Across modality
      • 2 narrow spectral slits (the 2 outer bands)
      • delay the audio with respect to the video


    Auditory-Visual Asynchrony - Paradigm

    Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material.

    [Diagram: relative to the synchronous A/V baseline condition, either the video leads the audio by 40-400 ms or the audio leads the video by 40-400 ms.]
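A minimal sketch of how the audio track might be desynchronized from the video for this paradigm; the sampling rate, offset values, and zero-padding strategy are assumptions (a positive offset delays the audio, so the video leads).

```python
import numpy as np

def shift_audio(audio, fs, offset_ms):
    """Delay (offset_ms > 0: video leads) or advance (offset_ms < 0: audio
    leads) the audio relative to a fixed video track, keeping its length."""
    n = int(round(abs(offset_ms) * fs / 1000.0))
    if offset_ms >= 0:
        return np.concatenate([np.zeros(n), audio])[:len(audio)]  # prepend silence
    return np.concatenate([audio[n:], np.zeros(n)])               # trim the start

# Hypothetical usage: generate the video-leading conditions for one waveform.
fs = 16000
audio = np.random.randn(2 * fs)          # stands in for a recorded sentence
conditions = {f"video_leads_{ms}ms": shift_audio(audio, fs, ms)
              for ms in range(40, 401, 40)}
print(len(conditions), sorted(conditions)[:3])
```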


    Auditory-Visual Integration - Preview

    When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

    When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms

    Why? Why? Why?

    We’ll return to these data shortly

    But first, let’s take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case

    The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences

    9 Subjects


    AUDIO-ALONE

    EXPERIMENTS


    Audio (Alone) Spectral Slit Paradigm

    The edge of each slit was separated from its nearest neighbor by an octave

    Can listeners decode spoken sentences using just four narrow (1/3 octave) channels (“slits”) distributed across the spectrum? – YES (cf. next slide)

    What is the intelligibility of each slit alone and in combination with others?
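A minimal sketch of how such spectral slits might be constructed. The center frequencies are taken from the next slide (roughly 334, 841, 2120, and 5340 Hz); the filter order and the exact 1/3-octave edge definition are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_slit(x, fs, cf, order=6):
    """Bandpass the signal into a 1/3-octave-wide slit centered at cf (Hz)."""
    lo, hi = cf * 2 ** (-1 / 6), cf * 2 ** (1 / 6)   # 1/3-octave band edges
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

# Hypothetical usage: split a waveform into the four slits used in the talk.
fs = 16000
speech = np.random.randn(3 * fs)             # stands in for a spoken sentence
centers = [334, 841, 2120, 5340]             # slit center frequencies (Hz)
slits = {cf: third_octave_slit(speech, fs, cf) for cf in centers}
combined = sum(slits.values())               # e.g., all four slits together
print({cf: round(float(np.sqrt(np.mean(s ** 2))), 3) for cf, s in slits.items()})
```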


    Word Intelligibility - Single and Multiple Slits

    [Figure: word intelligibility for the four 1/3-octave slits (slit numbers 1-4, center frequencies 334, 841, 2120, and 5340 Hz). Single slits yield 2%, 9%, 9%, and 4% correct; slit combinations yield 13%, 60%, and 89% correct.]


    Slit Asynchrony Affects Intelligibility

    Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility

    The effect of asynchrony on intelligibility is relatively symmetrical

    These data are from a different set of subjects than those participating in the study described earlier - hence slightly different numbers for the baseline conditions


    Intelligibility and Slit Asynchrony

    Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility

    Asynchrony greater than 50 ms results in intelligibility lower than baseline


    AUDIO-VISUAL

    EXPERIMENTS


    Focus on Audio-Leading-Video Conditions

    When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

    These data are next compared with data from the previous slide to illustrate the similarity in the slope of the function


    Comparison of A/V and Audio-Alone Data

    The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition

    Such similarity in the slopes of the intelligibility functions for the two experiments suggests that the underlying mechanisms may be similar

    The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves


    Focus on Video-Leading-Audio Conditions

    When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms

    These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio

    We’ll consider a variety of interpretations in a few minutes


    Auditory-Visual Integration

    The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions

    WHY? WHY? WHY?

    There are several interpretations of these data – we'll consider them on the following slides


    INTERPRETATION

    OF THE DATA




    Possible Interpretations of the Data – 1

    The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

    Visual information is processed in the brain much more slowly than auditory information (the difference in first-spike latency in V1 versus A1 is about 30-50 ms)

    Some problems with this interpretation ….

    Simple differences in neural conduction times may account for a preference for visual-leading information BUT cannot account for the asymmetry of the temporal window






    Possible Interpretations of the Data – 2

    The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

    There is certain information in the video component of the speech signal (and in the mid-frequency audio channels) that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?

    The visual component of the speech signal is most closely associated with place-of-articulation information (Grant and Walden, 1996)

    In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)

    BUT ….. the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels




    Possible Interpretations of the Data – 3

    The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

    The brain has evolved under conditions where the visual signal arrives prior to the audio signal, but where the time disparity between the two modalities varies from situation to situation

    Therefore, processing of A and V information is tolerant of asynchronous arrival times so long as the visual information leads the auditory information

    BUT… Why the 200 ms value? Given the speeds of light and sound (sound travels roughly 1129 ft/sec), a 200 ms visual lead corresponds to a sound source about 225.8 ft away. That may be a bit much!

    In addition to speed differences, we must also consider that speech movements precede the audio output by variable amounts. The limit of 200 ms may reflect a boundary associated with syllable-length units, over which integration is probably not beneficial.


    SUMMARY




    Audio-Video Integration – Summary

    Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone in the absence of the other modality

    This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)

    When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony

    When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms

    There are many potential interpretations of the data

    The simplest explanation is also the most ecologically valid: we prefer conditions where visual information leads audio information because that's what we're confronted with in the real world. But there are still questions remaining.

    For visual lag conditions, the similarity between within-modality and across-modality integration is intriguing, and may suggest an interpretation based on the linguistic information conveyed by the A and V modalities and on which modality first activates the system


    That’s All, Folks

    Many Thanks for Your Time and Attention


    VARIABILITY

    AMONG SUBJECTS




    Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects

    For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio

    The length of optimal asynchrony (in terms of intelligibility) varies from subject to subject, but is generally between 80 and 120 ms

    One Further Wrinkle to the Story ….


    Auditory-Visual Integration - by Individual S's

    Video signal leading is better than synchronous for 8 of 9 subjects

    These data are complex, but the implications are clear.

    Audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility

    Variation across subjects


    Hearing Aids and Speechreading

    • Are hearing aids optimally designed and fit for auditory-visual speech input?

      • How can they be when visual cues are not considered in the design or fitting process!


    Consonant Recognition - Amplification versus Speechreading

    Input level = 50 dB SPL. Walden, Grant, & Cord (2001), Ear and Hearing, 22, 333-341.

    [Bar chart: percent correct consonant recognition (0-100) for the Unaided-A, Aided-A, Unaided-AV, and Aided-AV test conditions.]


    Hearing Aid Summary

    • To optimize AV speech recognition, hearing aids must…

      • Provide accurate reception of speech cues not readily available through speechreading (complementary cues)

        • voicing

        • manner

      • There may have to be a trade-off between providing the best auditory-alone experience and one in which AV cues are optimized

