
Multilingual Access to Large Spoken Archives



Presentation Transcript


  1. Multilingual Access to Large Spoken Archives Douglas W. Oard University of Maryland, College Park, MD, USA

  2. MALACH Project’s Goal Dramatically improve access to large multilingual spoken word collections … by capitalizing on the unique characteristics of the Survivors of the Shoah Visual History Foundation's collection of videotaped oral history interviews.

  3. Spoken Word Collections • Broadcast programming • News, interview, talk radio, sports, entertainment • Scripted stories • Books on tape, poetry reading, theater • Spontaneous storytelling • Oral history, folklore • Incidental recording • Speeches, oral arguments, meetings, phone calls

  4. Some Statistics • 2,000 U.S. radio stations webcasting • 250,000 hours of oral history in British Library • 35 million audio streams indexed by SingingFish • Over 1 million searches per day • ~100 billion hours of phone calls each year

  5. Economics of the Web in 1995 • Affordable storage • 300,000 words/$ • Adequate backbone capacity • 25,000 simultaneous transfers • Adequate “last mile” bandwidth • 1 second/screen • Display capability • 10% of US population • Effective search capabilities • Lycos, Yahoo

  6. Spoken Word Collections Today • Affordable storage • 300,000 words/$ • Adequate backbone capacity • 25,000 simultaneous transfers • Adequate “last mile” bandwidth • 1 second/screen • Display capability • 10% of US population • Effective search capabilities • Lycos, Yahoo • Updated figures overlaid on the 1995 values, in slide order: 1.5 million words/$; 30 million; 20% of capacity; 38% recent use

  7. MALACH Research Issues • Acquisition • Segmentation • Description • Synchronization • Rights management • Preservation

  8. Description Strategies • Transcription • Manual transcription (with optional post-editing) • Annotation • Manually assign descriptors to points in a recording • Recommender systems (ratings, link analysis, …) • Associated materials • Interviewer’s notes, speech scripts, producer’s logs • Automatic • Create access points with automatic speech processing

  9. Key Results from TREC/TDT • Recognition and retrieval can be decomposed • Word recognition/retrieval works well in English • Retrieval is robust with recognition errors • Up to 40% word error rate is tolerable • Retrieval is robust with segmentation errors • Vocabulary shift/pauses provide strong cues
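The "word error rate" threshold on slide 9 is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming whitespace tokenization; the function name `word_error_rate` and the toy strings are illustrative and do not come from the presentation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one substitution and one deletion against five reference words.
print(word_error_rate("a b c d e", "a x c e"))
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.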

  10. Supporting Information Access [process diagram] • Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery • Artifacts passed between stages: Query, Ranked List, Recording • Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection • Box label: Search System

  11. Broadcast News Retrieval Study • NPR Online • Manually prepared transcripts • Human cataloging • SpeechBot • Automatic Speech Recognition • Automatic indexing

  12. NPR Online

  13. SpeechBot

  14. Study Design • Seminar on visual and sound materials • Recruited 5 students • After training, we provided 2 topics • 3 searched NPR Online, 2 searched SpeechBot • All then tried both systems with a 3rd topic • Each choosing their own topic • Rich data collection • Observation, think aloud, semi-structured interview • Model-guided inductive analysis • Coded to the model with QSR NVivo

  15. Criterion-Attribute Framework

  16. Some Useful Insights • Recognition errors may not bother the system, but they do bother the user! • Segment-level indexing can be useful

  17. Shoah Foundation’s Collection • Enormous scale • 116,000 hours; 52,000 interviews; 180 TB • Grand challenges • 32 languages, accents, elderly, emotional, … • Accessible • $100 million collection and digitization investment • Annotated • 10,000 hours (~200,000 segments) fully described • Users • A department working full time on dissemination

  18. Example Video

  19. Existing Annotations • 72 million untranscribed words • From ~4,000 speakers • Interview-level ground truth • Pre-interview questionnaire (names, locations, …) • Free-text summary • Segment-level ground truth • Topic boundaries: average ~3 min/segment • Labels: Names, topic, locations, year(s) • Descriptions: summary + cataloguer’s scratchpad

  20. Annotated Data Example • Columns: Location-Time; Subject; Person (rows ordered by interview time) • Berlin-1939; Employment; Josef Stein • Berlin-1939; Family life; Gretchen Stein, Anna Stein • Dresden-1939; Relocation, Transportation-rail; (no person listed) • Dresden-1939; Schooling; Gunter Wendt, Maria

  21. MALACH Overview • System components: Query Formulation, Speech Recognition, Automatic Search, Boundary Detection, Content Tagging, Interactive Selection • User Needs: Observational studies, Formative evaluation, Summative evaluation • ASR: Spontaneous, Accented, Language switching • NLP Components: Multi-scale segmentation, Multilingual classification, Entity normalization • Prototype: Evidence integration, Translingual search, Spatial/temporal

  22. MALACH Overview • Focus: ASR (Spontaneous, Accented, Language switching) • System components: Query Formulation, Speech Recognition, Automatic Search, Boundary Detection, Content Tagging, Interactive Selection

  23. ASR Research Focus • Accuracy • Spontaneous speech • Accented/multilingual/emotional/elderly • Application-specific loss functions • Affordability • Minimal transcription • Replicable process

  24. Application-Tuned ASR • Acoustic model • Transcribe short segments from many speakers • Unsupervised adaptation • Language model • Transcribed segments • Interpolation
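Slide 24 lists "Interpolation" under the language model without saying which models are combined or how the weight is chosen. One common reading is linear interpolation of an in-domain model (trained on the transcribed segments) with a broader background model. The sketch below illustrates that reading with add-one-smoothed unigram models; the toy texts, the weight `lam`, and the helper names are all hypothetical and are not taken from the MALACH system.

```python
from collections import Counter


def unigram_model(text: str, vocab: set) -> dict:
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(text.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_domain: dict, p_background: dict, lam: float) -> dict:
    """Linear interpolation: lam * P_domain(w) + (1 - lam) * P_background(w)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return {w: lam * p_domain[w] + (1.0 - lam) * p_background[w] for w in p_domain}


# Toy data standing in for transcribed interview segments and general text.
domain_text = "we moved to dresden by train"
background_text = "the train arrived late and the station was crowded"
vocab = set(domain_text.split()) | set(background_text.split())

p_dom = unigram_model(domain_text, vocab)
p_bg = unigram_model(background_text, vocab)
p_mix = interpolate(p_dom, p_bg, lam=0.7)  # 0.7 is an arbitrary illustrative weight

print(round(sum(p_mix.values()), 6))  # a valid distribution sums to 1
```

In practice the interpolation weight is usually tuned on held-out in-domain text rather than fixed by hand.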

  25. ASR Game Plan (as of May 2003) • Columns: Language; Hours Transcribed; Word Error Rate • English; 200; 39.6% • Czech; 84; 39.4% • Russian; 20 (of 100); 66.6% • Polish; no figures listed • Slovak; no figures listed

  26. English Transcription Time • ~2,000 hours to manually transcribe 200 hours from 800 speakers • [Chart axes: Instances (N=830); Hours to transcribe 15 minutes of speech]
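The transcription figures on slide 26 imply an effort ratio that helps explain why "minimal transcription" is listed as an affordability goal. The short calculation below works through that arithmetic using only numbers quoted in this transcript (slides 17 and 26); extrapolating the English ratio to the whole collection is an assumption made purely for illustration, since the slides do not claim the rate holds across languages.

```python
# Figures quoted on slides 17 and 26 of this transcript.
TRANSCRIPTION_HOURS = 2_000   # approximate manual effort (slide 26)
SPEECH_HOURS = 200            # English speech transcribed (slide 26)
COLLECTION_HOURS = 116_000    # size of the full collection (slide 17)


def effort_ratio(effort_hours: float, speech_hours: float) -> float:
    """Hours of manual work needed per hour of recorded speech."""
    return effort_hours / speech_hours


ratio = effort_ratio(TRANSCRIPTION_HOURS, SPEECH_HOURS)
hours_per_15_min = ratio * 0.25

# Assumption: the English ratio applies uniformly to the entire collection.
full_collection_effort = ratio * COLLECTION_HOURS

print(f"{ratio:.0f}x real time")
print(f"{hours_per_15_min} hours per 15 minutes of speech (average)")
print(f"{full_collection_effort:,.0f} hours for the full collection")
```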

  27. English ASR Error Rate Training: 65 hours (acoustic model)/200 hours (language model)

  28. MALACH Overview • Focus: User Needs (Observational studies, Formative evaluation, Summative evaluation) • System components: Query Formulation, Speech Recognition, Automatic Search, Boundary Detection, Content Tagging, Interactive Selection

  29. Who Uses the Collection? • Discipline: History, Linguistics, Journalism, Material culture, Education, Psychology, Political science, Law enforcement • Products: Book, Documentary film, Research paper, CDROM, Study guide, Obituary, Evidence, Personal use • Based on analysis of 280 access requests

  30. Question Types • Content • Person, organization • Place, type of place (e.g., camp, ghetto) • Time, time period • Event, subject • Mode of expression • Language • Displayed artifacts (photographs, objects, …) • Affective reaction (e.g., vivid, moving, …) • Age appropriateness

  31. Observational Studies • Workshop 1 (June) • Four searchers: History/Political Science, Holocaust studies, Holocaust studies, Documentary filmmaker • Sequential observation • Rich data collection: Intermediary interaction, Semi-structured interviews, Observational notes, Think-aloud, Screen capture • Workshop 2 (August) • Four searchers: Ethnography, German Studies, Sociology, High school teacher • Simultaneous observation • Opportunistic data collection: Intermediary interaction, Semi-structured interviews, Observational notes, Focus group discussions

  32. Segment Viewer

  33. Observed Selection Criteria • Topicality (57%) • Judged based on: Person, place, … • Accessibility (23%) • Judged based on: Time to load video • Comprehensibility (14%) • Judged based on: Language, speaking style

  34. References to Named Entities

  35. Functionality

  36. MALACH Overview • Focus: NLP Components (Multi-scale segmentation, Multilingual classification, Entity normalization) • System components: Query Formulation, Speech Recognition, Automatic Search, Boundary Detection, Content Tagging, Interactive Selection

  37. Topic Segmentation • “True” segmentation: transcripts aligned with scratchpad-based boundaries • [Figure label: cataloguer]

  38. Effect of ASR Errors

  39. Rethinking the Problem • Segment-then-label models planned speech well • Producers assemble stories to create programs • Stories typically have a dominant theme • The structure of natural speech is different • Creation: digressions, asides, clarification, … • Use: intended use may affect desired granularity • Documentary film: brief snippet to illustrate a point • Classroom teacher: longer self-contextualizing story

  40. OntoLog: Labeling Unplanned Speech • Manually assigned labels; start and end at any time • Ontology-based aggregation helps manage complexity

  41. Goal Use available data to estimate the temporal extent of labels in a way that optimizes the utility of the resulting estimates for interactive searching and browsing

  42. Multi-Scale Segmentation • [Figure axes: Labels; Time]

  43. Characteristics of the Problem • Clear sequential dependencies • Living in Dresden negates living in Berlin • Heuristic basis for class models • Persons, based on type of relationship • Date/Time, based on part-whole relationship • Topics, based on a defined hierarchy • Heuristic basis for guessing without training • Text similarity between labels and spoken words • Heuristic basis for smoothing • Sub-sentence retrieval granularity is unlikely
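Slide 43 mentions "text similarity between labels and spoken words" as a way to guess labels without training data, but does not specify a similarity measure. A minimal sketch of one possibility follows, assuming cosine similarity over raw term counts with lowercase whitespace tokenization; the functions `cosine` and `best_label` and the sample window are hypothetical illustrations, not the project's method.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_label(window: str, labels: list) -> str:
    """Return the label whose words best match a window of transcribed speech."""
    window_terms = Counter(window.lower().split())
    return max(labels, key=lambda lab: cosine(window_terms, Counter(lab.lower().split())))


# Toy transcript window; the label strings reuse subject names from slide 20.
labels = ["Family life", "Schooling", "Employment"]
window = "my schooling continued after we arrived in the new city"
print(best_label(window, labels))
```

Exact word overlap is a weak signal on recognizer output, so a sketch like this would mainly serve as a fallback where no training data exists for a label.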

  44. Manually Assigned Onset Marks • [Figure: label onsets marked along interview time] • Location-Time: Berlin-1939, Dresden-1939 • Subject: Employment, Family Life, Relocation, Transportation-rail, Schooling • Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria

  45. Some Additional Results • Named entity recognition • F > 0.8 (on manual transcripts) • Cross-language ranked retrieval (on news) • Czech/English similar to other language pairs
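The "F > 0.8" result on slide 45 does not say which F-measure is meant. Assuming the usual balanced F1 (the harmonic mean of precision and recall), the score can be computed from sets of predicted and reference entities as sketched below. The entity tuples are toy data built from names on slide 20, and the function name is illustrative.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Precision, recall, and balanced F1 for exact-match entity sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy reference and system output as (text, type) pairs.
gold = {
    ("Josef Stein", "PERSON"),
    ("Anna Stein", "PERSON"),
    ("Gunter Wendt", "PERSON"),
    ("Berlin", "LOCATION"),
    ("Dresden", "LOCATION"),
}
predicted = {
    ("Josef Stein", "PERSON"),
    ("Anna Stein", "PERSON"),
    ("Berlin", "LOCATION"),
    ("Dresden", "LOCATION"),
    ("Stein", "LOCATION"),  # a spurious entity
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```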

  46. Looking Forward: 2003 • Component development • ASR, segmentation, classification, retrieval • Ranked retrieval test collection • 1,000 hours of English recognition • 25 judged topics in English and Czech • Interactive retrieval • Integrating free text and thesaurus-based search
