
Multilingual Access to Large Spoken Archives

Douglas W. Oard

University of Maryland, College Park, MD, USA

MALACH Project’s Goal

Dramatically improve access to large multilingual spoken word collections … by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.

Spoken Word Collections
  • Broadcast programming
    • News, interview, talk radio, sports, entertainment
  • Scripted stories
    • Books on tape, poetry reading, theater
  • Spontaneous storytelling
    • Oral history, folklore
  • Incidental recording
    • Speeches, oral arguments, meetings, phone calls
Some Statistics
  • 2,000 U.S. radio stations webcasting
  • 250,000 hours of oral history in the British Library
  • 35 million audio streams indexed by SingingFish
    • Over 1 million searches per day
  • ~100 billion hours of phone calls each year
Economics of the Web in 1995
  • Affordable storage
    • 300,000 words/$
  • Adequate backbone capacity
    • 25,000 simultaneous transfers
  • Adequate “last mile” bandwidth
    • 1 second/screen
  • Display capability
    • 10% of US population
  • Effective search capabilities
    • Lycos, Yahoo
Spoken Word Collections Today

  • Affordable storage
    • 1.5 million words/$ (1995: 300,000 words/$)
  • Adequate backbone capacity
    • 30 million simultaneous transfers (1995: 25,000)
  • Adequate “last mile” bandwidth
    • 20% of capacity (1995: 1 second/screen)
  • Display capability
    • 38% recent use (1995: 10% of US population)
  • Effective search capabilities
    • Lycos, Yahoo

MALACH Research Issues
  • Acquisition
  • Segmentation
  • Description
  • Synchronization
  • Rights management
  • Preservation
Description Strategies
  • Transcription
    • Manual transcription (with optional post-editing)
  • Annotation
    • Manually assign descriptors to points in a recording
    • Recommender systems (ratings, link analysis, …)
  • Associated materials
    • Interviewer’s notes, speech scripts, producer’s logs
  • Automatic
    • Create access points with automatic speech processing
Key Results from TREC/TDT
  • Recognition and retrieval can be decomposed
    • Word recognition/retrieval works well in English
  • Retrieval is robust with recognition errors
    • Up to 40% word error rate is tolerable
  • Retrieval is robust with segmentation errors
    • Vocabulary shift/pauses provide strong cues
Supporting Information Access

Diagram: the search-system interaction cycle, Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Recording → Examination → Delivery, with feedback loops for query reformulation and relevance feedback and for source reselection.

Broadcast News Retrieval Study
  • NPR Online
    • Manually prepared transcripts
    • Human cataloging
  • SpeechBot
    • Automatic Speech Recognition
    • Automatic indexing
Study Design
  • Seminar on visual and sound materials
    • Recruited 5 students
  • After training, we provided 2 topics
    • 3 searched NPR Online, 2 searched SpeechBot
  • All then tried both systems with a 3rd topic
    • Each choosing their own topic
  • Rich data collection
    • Observation, think aloud, semi-structured interview
  • Model-guided inductive analysis
    • Coded to the model with QSR NVivo
Some Useful Insights
  • Recognition errors may not bother the system, but they do bother the user!
  • Segment-level indexing can be useful
Shoah Foundation’s Collection
  • Enormous scale
    • 116,000 hours; 52,000 interviews; 180 TB
  • Grand challenges
    • 32 languages, accents, elderly, emotional, …
  • Accessible
    • $100 million collection and digitization investment
  • Annotated
    • 10,000 hours (~200,000 segments) fully described
  • Users
    • A department working full time on dissemination
Existing Annotations
  • 72 million untranscribed words
    • From ~4,000 speakers
  • Interview-level ground truth
    • Pre-interview questionnaire (names, locations, …)
    • Free-text summary
  • Segment-level ground truth
    • Topic boundaries: average ~3 min/segment
    • Labels: Names, topic, locations, year(s)
    • Descriptions: summary + cataloguer’s scratchpad
Annotated Data Example

  Location-Time   Subject                           Person
  Berlin-1939     Employment                        Josef Stein
  Berlin-1939     Family life                       Gretchen Stein, Anna Stein
  Dresden-1939    Relocation, Transportation-rail
  Dresden-1939    Schooling                         Gunter Wendt, Maria

(Segments shown in interview-time order.)
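A minimal sketch of how one of these catalogued segments might be represented in code; the field names below are illustrative and not the Foundation's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    """One catalogued interview segment (illustrative field names, not the real schema)."""
    interview_id: str
    onset_seconds: float                                 # start time within the interview
    location_time: str                                   # e.g., "Berlin-1939"
    subjects: List[str] = field(default_factory=list)    # e.g., ["Family life"]
    persons: List[str] = field(default_factory=list)     # e.g., ["Gretchen Stein", "Anna Stein"]
    summary: str = ""                                     # cataloguer's free-text summary/scratchpad

example = Segment("interview-001", onset_seconds=540.0, location_time="Berlin-1939",
                  subjects=["Family life"], persons=["Gretchen Stein", "Anna Stein"])
```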

MALACH Overview

Diagram: the system pipeline (Query Formulation, Speech Recognition, Automatic Search, Boundary Detection, Content Tagging, Interactive Selection) annotated with the project's research threads:
  • User needs: observational studies, formative evaluation, summative evaluation
  • ASR: spontaneous, accented, and language-switching speech
  • NLP components: evidence integration, translingual search, spatial/temporal, multi-scale segmentation, multilingual classification, entity normalization
  • Prototype

MALACH Overview (repeat of the pipeline diagram, highlighting the ASR thread: spontaneous, accented, and language-switching speech)

ASR Research Focus
  • Accuracy
    • Spontaneous speech
    • Accented/multilingual/emotional/elderly
    • Application-specific loss functions
  • Affordability
    • Minimal transcription
    • Replicable process
Application-Tuned ASR
  • Acoustic model
    • Transcribe short segments from many speakers
    • Unsupervised adaptation
  • Language model
    • Transcribed segments
    • Interpolation
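The “interpolation” bullet refers to mixing a small in-domain language model built from the transcribed segments with a larger general-purpose model. A minimal sketch of linear interpolation over unigram probabilities, with an assumed mixture weight (real systems interpolate full n-gram models and tune the weight on held-out data):

```python
def interpolate_lm(p_in_domain: dict, p_general: dict, lam: float = 0.3) -> dict:
    """P(w) = lam * P_in_domain(w) + (1 - lam) * P_general(w); lam is an assumed weight."""
    vocab = set(p_in_domain) | set(p_general)
    return {w: lam * p_in_domain.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
            for w in vocab}

# Toy example: the in-domain model gives much more weight to collection-specific terms.
mixed = interpolate_lm({"ghetto": 0.002, "the": 0.05}, {"ghetto": 0.00001, "the": 0.06})
```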
ASR Game Plan

  Language   Hours Transcribed   Word Error Rate
  English    200                 39.6%
  Czech      84                  39.4%
  Russian    20 (of 100)         66.6%
  Polish
  Slovak

As of May 2003
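Word error rate, as reported in the table above, is the standard ASR measure: the minimum number of word substitutions, insertions, and deletions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A small self-contained sketch (the example sentences are invented):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("they sent us to the ghetto", "they went to the ghetto"))  # 2/6 ≈ 0.33
```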

English Transcription Time

Roughly 2,000 hours of effort were needed to manually transcribe 200 hours of speech from 800 speakers.

Figure: histogram of hours needed to transcribe 15 minutes of speech (N=830 instances).

English ASR Error Rate

Figure: English ASR word error rate. Training: 65 hours (acoustic model) / 200 hours (language model).

MALACH Overview (repeat of the pipeline diagram, highlighting the user-needs thread: observational studies, formative evaluation, summative evaluation)

Who Uses the Collection?

  • Disciplines: history, linguistics, journalism, material culture, education, psychology, political science, law enforcement
  • Products: book, documentary film, research paper, CD-ROM, study guide, obituary, evidence, personal use

Based on analysis of 280 access requests

Question Types
  • Content
    • Person, organization
    • Place, type of place (e.g., camp, ghetto)
    • Time, time period
    • Event, subject
  • Mode of expression
    • Language
    • Displayed artifacts (photographs, objects, …)
    • Affective reaction (e.g., vivid, moving, …)
  • Age appropriateness
Observational Studies

Workshop 1 (June)
  • Four searchers: history/political science, two in Holocaust studies, documentary filmmaker
  • Sequential observation
  • Rich data collection: intermediary interaction, semi-structured interviews, observational notes, think-aloud, screen capture

Workshop 2 (August)
  • Four searchers: ethnography, German studies, sociology, high school teacher
  • Simultaneous observation
  • Opportunistic data collection: intermediary interaction, semi-structured interviews, observational notes, focus group discussions

Observed Selection Criteria
  • Topicality (57%)
    • Judged based on: Person, place, …
  • Accessibility (23%)
    • Judged based on: Time to load video
  • Comprehensibility (14%)
    • Judged based on: Language, speaking style
MALACH Overview (repeat of the pipeline diagram, highlighting the NLP components thread: multi-scale segmentation, multilingual classification, entity normalization)

Topic Segmentation

“True” segmentation: transcripts aligned with the cataloguer's scratchpad-based segment boundaries.
Rethinking the Problem
  • Segment-then-label approaches model planned speech well
    • Producers assemble stories to create programs
    • Stories typically have a dominant theme
  • The structure of natural speech is different
    • Creation: digressions, asides, clarification, …
    • Use: intended use may affect desired granularity
      • Documentary film: brief snippet to illustrate a point
      • Classroom teacher: longer self-contextualizing story
OntoLog: Labeling Unplanned Speech
  • Manually assigned labels may start and end at any time
    • Ontology-based aggregation helps manage complexity
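A rough sketch of what “ontology-based aggregation” can mean in practice: specific labels are rolled up to broader ancestors in a concept hierarchy so the display stays manageable. The tiny hierarchy below is invented for illustration and is not the project's thesaurus:

```python
# Illustrative parent links in a small concept hierarchy (not the real thesaurus).
PARENT = {
    "Transportation-rail": "Transportation",
    "Transportation": "Relocation",
    "Schooling": "Daily life",
    "Family life": "Daily life",
}

def roll_up(label: str, hops: int = 1) -> str:
    """Replace a specific label with a broader ancestor to simplify the display."""
    for _ in range(hops):
        label = PARENT.get(label, label)
    return label

labels = ["Transportation-rail", "Schooling", "Family life"]
print(sorted({roll_up(l) for l in labels}))   # ['Daily life', 'Transportation']
```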
Goal

Use available data to estimate the temporal extent of labels in a way that optimizes the utility of the resulting estimates for interactive searching and browsing

Characteristics of the Problem
  • Clear sequential dependencies
    • Living in Dresden negates living in Berlin
  • Heuristic basis for class models
    • Persons, based on type of relationship
    • Date/Time, based on part-whole relationship
    • Topics, based on a defined hierarchy
  • Heuristic basis for guessing without training
    • Text similarity between labels and spoken words
  • Heuristic basis for smoothing
    • Sub-sentence retrieval granularity is unlikely
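A minimal sketch of the “guessing without training” and smoothing heuristics listed above: score each word position of the ASR output by how many label terms occur in a surrounding window, then smooth the scores so that estimated label extents do not fragment below a sensible granularity. The window size, smoothing width, and threshold are assumed values for illustration, not the project's actual method:

```python
def label_affinity(asr_words, label_terms, window=30):
    """For each word position, the fraction of nearby words that match the label's terms."""
    terms = {t.lower() for t in label_terms}
    hits = [1.0 if w.lower() in terms else 0.0 for w in asr_words]
    scores = []
    for i in range(len(asr_words)):
        lo, hi = max(0, i - window), min(len(asr_words), i + window)
        scores.append(sum(hits[lo:hi]) / (hi - lo))
    return scores

def smooth(scores, width=50):
    """Moving average, so extents do not fragment at sub-sentence granularity."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - width), min(len(scores), i + width)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def estimated_extent(scores, threshold=0.02):
    """Word positions where the smoothed affinity exceeds an assumed threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]
```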
Manually Assigned Onset Marks

Diagram: label onset marks along the interview-time axis for the example interview.
  • Location-Time: Berlin-1939, then Dresden-1939
  • Subject: Employment, Family Life, Relocation, Transportation-rail, Schooling
  • Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria

Some Additional Results
  • Named entity recognition
    • F > 0.8 (on manual transcripts)
  • Cross-language ranked retrieval (on news)
    • Czech/English similar to other language pairs
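For reference, the F measure quoted for named entity recognition is, in its usual balanced form, the harmonic mean of precision and recall; the numbers in the example call below are made up:

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f1(0.83, 0.79))   # ≈ 0.81, i.e., in the "F > 0.8" range reported above
```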
Looking Forward: 2003
  • Component development
    • ASR, segmentation, classification, retrieval
  • Ranked retrieval test collection
    • 1,000 hours of English recognition
    • 25 judged topics in English and Czech
  • Interactive retrieval
    • Integrating free text and thesaurus-based search
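Test collections like the one described above (1,000 hours of English recognition, 25 judged topics) are commonly scored with ranked-retrieval measures such as mean average precision; the source does not say which measure MALACH used, so the sketch below is purely illustrative and assumes binary relevance judgments:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the ranks where relevant segments appear."""
    hits, precisions = 0, []
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Toy example: 3 relevant segments, two retrieved at ranks 1 and 3.
print(average_precision(["s7", "s2", "s9", "s4"], {"s7", "s9", "s5"}))  # ≈ 0.56
```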
Relevance Categories
  • Overall relevance: assessment is informed by the assessments for the individual reasons for relevance (categories of relevance), but the relationship is not straightforward
  • Provides direct evidence
  • Provides indirect / circumstantial evidence
  • Provides context (e.g., causes for the phenomenon of interest)
  • Provides comparison (similarity or contrast, same phenomenon in different environment, similar phenomenon)
  • Provides pointer to source of information
Scale for overall relevance

Strictly from the point of view of finding out about the topic, how useful is this segment for the requester? This judgment is made independently of whether another segment (or 25 other segments) gives the same information.

4 Makes an important contribution to the topic, right on target

3 Makes an important contribution to the topic

2 Should be looked at for an exhaustive treatment of the topic

1 Should be looked at if the user wants to leave no stone unturned

0 No need to look at this at all

Direct relevance

Direct evidence for what the user asks for

Directly on topic, direct aboutness. The information describes the events or circumstances asked for or otherwise speaks directly to what the user is looking for. First-hand accounts are preferred, e.g., the testimony contains a report on the interviewee's own experience, an eye-witness account of what happened, or a self-report on how a survivor felt. Second-hand accounts (hearsay) are acceptable, such as a report on what an eyewitness told the interviewee or a report on how somebody else felt.

* Direct Evidence * - Evidence that stands on its own to prove an alleged fact, such as testimony of a witness who says she saw a defendant pointing a gun at a victim during a robbery. Direct proof of a fact, such as testimony by a witness about what that witness personally saw or heard or did. ('Lectric Law Library's Lexicon)

Indirect relevance

Provides indirect evidence on the topic, indirect aboutness (data from which one could infer, with some probability, something about the topic; what in law is known as circumstantial evidence). Such evidence often deals with events or circumstances that could not have happened, or would not normally have happened, unless the event or circumstance of interest (the one to be proven) had happened. It may also deal with events or circumstances that precede the events or circumstances of interest, either enabling them (establishing their possibility) or establishing their impossibility. This category takes precedence over context: one could say that indirect evidence also provides context (but the reverse is not true).

* Circumstances, Circumstantial Evidence * Circumstantial evidence is best explained by saying what it is not - it is not direct evidence from a witness who saw or heard something. Circumstantial evidence is a fact that can be used to infer another fact.

Context

Provides background / context for the topic, sheds additional light on it, and facilitates understanding that some piece of information is directly on topic.

This category covers a variety of things: things that influence, set the stage, or provide the environment for what the user asks for. (To take the law analogy again, anything in the history of a person who committed a crime that might explain why he committed it.)

Includes support for or hindrance of an activity that is the topic of the query, and activities or circumstances that immediately follow the activity or circumstance of interest.

In a way, this category is broader than indirect relevance. If a context element can serve as indirect evidence, indirect takes precedence.

Comparison

Provides information on similar / parallel situations or on a contrasting situation for comparison

The basic theme of what the user is interested in, but played out in a different place or time or type of situation.

Comparable segments will be those that provide information either on similar/parallel topics or on contrasting topics. This type of relevance relationship identifies items that can aid understanding of the larger framework, perhaps contributing to identification of query terms or revision of search strategies. An example would be a segment in which an interviewee describes activities like those described in a topic description, but which occurred at a different place or time.

Pointer
  • Provides pointers to a source of more information. This could be a person, a group, another segment, etc.
  • Pointers will be segments that provide suggestions or explicit evidence of where to find more relevant information. An example of a pointer segment would be one in which an interviewee identifies another interviewee who had personal experiences directly associated with the topic. The value of these segments is in identifying other relevant segments, particularly but not limited to segments about a topic.
Quality Assurance
  • 20 topics were redone and 10 were reviewed.
  • Redo: a second assessor did the topic from scratch.
  • Review: a second assessor reviewed the first assessor's work and did additional searches when needed.
  • The assessors then met to discuss their interpretations of the topic and resolve differences in relevance judgments.
  • Assessors kept notes on the process.
Looking Forward: 2006
  • Working systems in five languages
    • Real users searching real data
  • Rich experience beyond broadcast news
    • Frameworks, components, systems
  • Affordable application-tuned systems
    • Oral history, lectures, speeches, meetings, …
For More Information
  • The MALACH project
    • http://www.clsp.jhu.edu/research/malach/
  • NSF/EU Spoken Word Access Group
    • http://www.dcs.shef.ac.uk/spandh/projects/swag/
  • Speech-based retrieval
    • http://www.glue.umd.edu/~dlrg/speech/