Problems and Prospects in Collecting Spoken Language Data


Kishore Prahallad

Suryakanth V Gangashetty

B. Yegnanarayana

Raj Reddy

IIIT Hyderabad, India

Carnegie Mellon University, USA.

Outline
  • Need for digital library of audio and video data
  • Characteristics of spoken language data
  • Prototype data collection
    • IIIT Hyderabad
    • IIT Madras
    • Lessons Learnt
  • Proposal to collect Indian language (IL) data
    • As part of Jim Baker's global project
Need for Digital Library of Audio & Video Data
  • Current and future data will be in audio and video formats
  • Current technology makes it possible to digitize and store such large amounts of data
  • Collection, storage and indexing of such data make it possible to provide information to current and future generations
  • Acts as a test bed for the research challenges that exist in organizing, indexing and retrieving such large data collections
    • Algorithms for quicker and easier access to information in AV format, queried through text, audio or video modes
    • Algorithms using multi-modal data for bio-metric authentication
    • Development of multi-lingual speech synthesis and speech recognition systems
Characteristics of Spoken Language Data
  • Message – Information to be conveyed
  • Speaker – Who is the speaker?
  • His/her background – Age, gender, literacy level, knowledge level, mannerisms etc.
  • Emotions – Anger, sadness, happiness etc.
  • Idiolect – An individual's distinctive style of speaking
  • Medium of transmission – Microphone, telephone, satellite etc.
  • Environment – Party, airport/station etc.
  • Language
  • Dialect – Grammar and vocabulary associated with a regional or social use of a language
  • Culture and civilization – The richness of vocabulary and grammar usage indicates the times of the language and the society
Characteristics of Spoken Language Data
  • How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond?
  • How was a famous poem recited or sung by its author?
  • How was a particular language spoken in different geographical locations of a state/country?
  • How has a particular language/dialect evolved over a period of time?
  • What were the rare languages/dialects (which are no longer in existence)? How were they spoken?
Phase 0: Prototype data collection at IIIT Hyd
  • High quality studio recordings
    • 2 hrs of single speaker recordings for speech synthesis
    • Telugu, Hindi, Tamil and Indian-English
    • Developed text to speech systems in these 4 languages
  • Telephone and Cell-phone corpus
    • 150 hrs (540 speakers)
    • Telugu, Tamil and Marathi
    • Developed speech recognition systems in these 3 languages
Phase 0: Prototype data collection at IIT Madras
  • 15 hours (72 speakers)
  • TV news in Tamil, Telugu and Hindi Languages
    • Text to speech systems (TTS)
    • Language Identification
    • Duration modeling for TTS systems
Tools Aiding Acquisition/Correction of Speech Data
  • Transcription correction tool (TCT)
    • Spoken errors at phone, syllable, word level
    • Background noise, abrupt begin or end, low SNR
    • TCT corrects the above errors at three levels
  • Audio & Video Transcription Tool
    • Used to annotate movie databases
  • Correction of Segment labels
    • Emulabel
Lessons Learnt
  • Speech correction needs 3-6 times more effort than collection
    • Better to collect more data than to correct it
  • Needs a unified framework
    • Standardize processes, procedures and tools
  • Need larger collection of spoken and text corpora
    • For building practical speech systems in Indian languages
Proposal for Collection of Larger Spoken Language Data for Indian Languages (IL)
  • Focus on information present in speech mode
  • Collect spoken language data from all Indian languages and also from neighboring countries
  • Collect about 200,000 (0.2 M) hours of speech
    • As part of Jim Baker's global project of collecting 1 million hours of speech
New in our approach
  • Collection of large speech data, up to 200,000 (0.2 M) hours
    • All Indian languages and dialects
      • 23 official Indian languages
      • Approx. 10,000 hours per language
    • All types: traditional, read, spoken, conversational, dialog, movies, broadcast etc.
    • All modes: microphone, clean, telephone, cellphone, satellite etc.
  • Standard procedure for organizing, annotating and indexing
  • More focus on larger collection (and on elimination rather than correction)
  • Make this data available for general public use
Key Make-A-Difference Capability
  • Availability of information (Stories, lectures, poems, books, articles) in spoken language
      • For the illiterate
      • For the vision impaired
  • Collection and Storage of spoken language data of popular as well as rare languages & dialects
  • Promotes research and development in
    • Speech Technology
      • Speech-to-speech translation in Indian languages
      • Phonetic engine (Language Independent)
      • Speech synthesis (Text-to-speech for Indian languages)
      • Speaker recognition (Text independent and dependent)
      • Language Identification
      • Speech enhancement
      • Speech signal processing
    • Biometrics:
      • Multimodal: Audio-Video modes
    • Information Access, Storage and Retrieval
      • Audio-video data (indexing)
      • Data Mining (searching)
      • Speech Coding (Ultra-low bit coding)
Implementation Plan
  • Phase 1: (3.5 months)
    • 10 languages
    • 33,300 hours
  • Phase 2: (8 months)
    • 10 (of phase 1) languages
    • 66,000 hours
  • Phase 3: (10 months)
    • 13 remaining languages
    • 80,000 hours
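
As a quick check on the plan above, the following sketch sums the three phase targets. The dictionary layout and variable names are illustrative and not from the slides, and the total duration assumes the phases run back to back, which the slides do not state.

```python
# Phase targets as listed on the slide: (duration in months, hours of speech).
phases = {
    "Phase 1": (3.5, 33_300),
    "Phase 2": (8, 66_000),
    "Phase 3": (10, 80_000),
}

# Assumption: phases are sequential, so durations simply add up.
total_months = sum(months for months, _ in phases.values())
total_hours = sum(hours for _, hours in phases.values())

print(f"Total duration: {total_months} months")
print(f"Total speech:   {total_hours:,} hours")
```

The listed targets add up to 179,300 hours, somewhat below the 200,000-hour headline figure, so the headline is best read as an upper bound.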
Mid-Term and Final Terms
  • Mid-Term
    • Phase 1, collection of 33,300 hours of speech
    • Collection, Storage and Indexing of speech data for public information access
    • Visible research output using the speech data
    • Demonstrations of speech technology products
      • Speech recognition in 10 languages
  • Final Term
    • Phase 1 + Phase 2
Impact of Audio Digital Library
  • Availability of information in spoken language form for the illiterate and others
  • Promotes research in speech technology for Indian languages
  • Enables development of speech technology products useful for the common man
  • Examples:
    • Speech-to-speech translation systems
      • For information exchange
    • Screen readers
      • For the illiterate and physically challenged
    • Naturally speaking dialog systems
      • For information access over voice mode
Phase 1: Time Estimate
  • Phase 1:
    • 10 official Indian languages
    • Parallel collection of data
    • ~ 3000 hours per language
      • 5,000 - 10,000 speakers
      • > 10 min of speech per speaker
    • Total: 33,300 hours
  • Time Estimates: (~ 3.5 months all 10 languages)
    • 10 persons-team per language
    • Each person works
      • 8 hours a day
      • 30 mins of speech recording per hour
        • 1-3 speakers per hour
      • 240 mins of speech per day
        • 8-24 speakers per day
    • 240 speakers per day
    • 20,000 speakers per language in 84 working days
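
The time estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and it assumes every speaker contributes exactly 10 minutes (the slide's stated minimum), which is what yields 24 speakers per person per day.

```python
# Figures taken from the slide.
team_size = 10               # persons per language
hours_per_day = 8            # working hours per person
recorded_min_per_hour = 30   # minutes of speech recorded per working hour
working_days = 84
languages = 10

# Assumption: each speaker records exactly 10 minutes (the stated minimum).
min_per_speaker = 10

speech_min_per_person_day = hours_per_day * recorded_min_per_hour
speakers_per_person_day = speech_min_per_person_day // min_per_speaker
speakers_per_team_day = team_size * speakers_per_person_day
speakers_per_language = speakers_per_team_day * working_days
hours_per_language = speakers_per_language * min_per_speaker / 60
total_hours = hours_per_language * languages

print(f"Speech per person per day: {speech_min_per_person_day} min")
print(f"Speakers per team per day: {speakers_per_team_day}")
print(f"Speakers per language:     {speakers_per_language:,}")
print(f"Hours per language:        {hours_per_language:,.0f}")
print(f"Total hours (all langs):   {total_hours:,.0f}")
```

Under these assumptions the team reaches 20,160 speakers and about 3,360 hours per language, which the slide rounds to 20,000 speakers and a 33,300-hour total.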
Phase 1: Cost Estimate
  • Manpower cost: Rs 140 Lakhs
  • Equipment cost: Rs 55 Lakhs
  • Communication cost: Rs 40 Lakhs
  • Contingency (10%): Rs 25 Lakhs

Total Cost: Rs 2.6 Crores (~ $ 565,000)
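
The total can be checked by summing the four line items. In this sketch the names are illustrative, and the exchange rate is not taken from the slides; it is only what the quoted dollar figure implies.

```python
LAKH = 100_000
CRORE = 100 * LAKH

# Line items from the slide, in lakhs of rupees.
costs_lakhs = {
    "Manpower": 140,
    "Equipment": 55,
    "Communication": 40,
    "Contingency": 25,
}

subtotal_lakhs = sum(v for k, v in costs_lakhs.items() if k != "Contingency")
total_lakhs = sum(costs_lakhs.values())
total_rupees = total_lakhs * LAKH

# Implied exchange rate, derived from the slide's "~ $ 565,000" figure.
implied_rate = total_rupees / 565_000

print(f"Subtotal before contingency: Rs {subtotal_lakhs} lakhs")
print(f"10% of subtotal:             Rs {subtotal_lakhs * 0.10:.1f} lakhs")
print(f"Total:                       Rs {total_rupees / CRORE:.1f} crores")
print(f"Implied rate:                Rs {implied_rate:.1f} per USD")
```

Ten percent of the Rs 235 lakh subtotal is Rs 23.5 lakhs, so the Rs 25 lakh contingency appears to be rounded up.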

Man-Power Cost
  • Data collection Team: Rs 86 lakhs
      • 10 (for data collection) x Rs 10 K PM
      • 10 (for data correction) x Rs 10 K PM
      • 1 data manager (Rs 15 K PM)
      • 4 months' cost: Rs 8,60,000 per language
  • 5 engineers: Rs 4 Lakhs
    • B.Tech Level (Rs 20,000 PM)
  • Gifts per speaker: Rs 50 Lakhs
    • Rs 25 per speaker
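
The manpower figures can be rebuilt from the per-person rates. This is a sketch under two assumptions that the slide does not state outright: the engineers are also paid for 4 months, and the gift budget covers 20,000 speakers in each of the 10 Phase 1 languages. Variable names are illustrative.

```python
LAKH = 100_000
languages = 10
months = 4

# Data collection team, per language (rates in rupees per month).
collectors = 10
correctors = 10
staff_salary = 10_000
manager_salary = 15_000

team_monthly = (collectors + correctors) * staff_salary + manager_salary
team_per_language = team_monthly * months
team_total = team_per_language * languages

# Assumption: the 5 engineers are paid for the same 4 months.
engineers_total = 5 * 20_000 * months

# Assumption: 20,000 speakers per language across 10 languages.
speakers = 20_000 * languages
gifts_total = 25 * speakers

manpower_total = team_total + engineers_total + gifts_total

print(f"Team cost per language: Rs {team_per_language:,}")
print(f"Team cost, all langs:   Rs {team_total / LAKH:.0f} lakhs")
print(f"Engineers:              Rs {engineers_total / LAKH:.0f} lakhs")
print(f"Gifts:                  Rs {gifts_total / LAKH:.0f} lakhs")
print(f"Manpower total:         Rs {manpower_total / LAKH:.0f} lakhs")
```

With these assumptions the three items come to Rs 86, 4 and 50 lakhs, matching the Rs 140 lakh manpower line in the Phase 1 cost estimate.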
Machines Cost
  • Machines:
    • 30 servers: Rs 30 Lakhs
      • 3 servers per language
      • Each server has 4 ports for data collection
    • 30 CTI cards: Rs 20 Lakhs
  • Storage: 20 TB: Rs 5 Lakhs
    • Two copies of 20 TB
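
A short sketch of the capacity and cost implied by the figures above. The names are illustrative, and the assumption that one port handles one recording session at a time is mine, not the slide's.

```python
languages = 10
servers = 30
ports_per_server = 4

servers_per_language = servers // languages
# Assumption: one port serves one recording session at a time.
ports_per_language = servers_per_language * ports_per_server
total_ports = servers * ports_per_server

# Equipment line items from the slide, in lakhs of rupees.
equipment_lakhs = {"Servers": 30, "CTI cards": 20, "Storage": 5}
equipment_total_lakhs = sum(equipment_lakhs.values())

print(f"Servers per language: {servers_per_language}")
print(f"Ports per language:   {ports_per_language}")
print(f"Total ports:          {total_ports}")
print(f"Equipment total:      Rs {equipment_total_lakhs} lakhs")
```

The items sum to Rs 55 lakhs, the equipment line of the Phase 1 cost estimate, and 12 ports per language gives a 10-person team slightly more lines than collectors.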
Communications Cost
  • Telephonic charges: Rs 20 Lakhs
    • Rs 1 per min (local telephonic charges)
  • Transportation: Rs 20 Lakhs
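
The telephone charge is consistent with billing every recorded minute of Phase 1 at the local rate. The sketch below assumes exactly that (no call setup overhead or failed calls), which the slide does not spell out; names are illustrative.

```python
LAKH = 100_000

phase1_hours = 33_300        # Phase 1 target from the time estimate
rate_per_min = 1             # Rs per minute, local telephone charge

# Assumption: only recorded minutes are billed.
telephone_cost = phase1_hours * 60 * rate_per_min
transport_cost = 20 * LAKH
communications_total = telephone_cost + transport_cost

print(f"Telephone:      Rs {telephone_cost:,}")
print(f"Transportation: Rs {transport_cost:,}")
print(f"Total:          Rs {communications_total / LAKH:.1f} lakhs")
```

This gives Rs 19.98 lakhs for telephony and just under Rs 40 lakhs overall, in line with the rounded figures above.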