Problems and Prospects in Collecting Spoken Language Data

Kishore Prahallad

Suryakanth V Gangashetty

B. Yegnanarayana

Raj Reddy

IIIT Hyderabad, India

Carnegie Mellon University, USA.


Outline

  • Need for digital library of audio and video data

  • Characteristics of spoken language data

  • Prototype data collection

    • IIIT Hyderabad

    • IIT Madras

    • Lessons Learnt

  • Proposal to collect IL data

    • as a part of Jim Baker’s global project.


Need for Digital Library of Audio & Video Data

  • Current and future data will be in audio and video formats

  • Current technology makes it possible to digitize and store such large amounts of data

  • Collection, storage and indexing of such data make it possible to provide information to current and future generations

  • Acts as a test bed for the several research challenges that exist in organizing, indexing and retrieving such large data collections

    • Algorithms for quicker and easier access to the information present in AV format by providing a query using text / audio / video modes

    • Algorithms using multi-modal data for biometric authentication

    • Development of multi-lingual speech synthesis and speech recognition systems


Characteristics of Spoken Language Data

  • Message – Information to be conveyed

  • Speaker – Who is the speaker?

  • His/her background – Age, gender, literacy levels, knowledge levels, mannerisms etc.

  • Emotions – Anger, sadness, happiness etc.

  • Idiolect – An individual's distinctive style of speaking

  • Medium of transmission – Microphone, telephone, satellite etc.

  • Environment – Party, airport/station etc.

  • Language

  • Dialect – Grammar and vocabulary associated with a regional or social use of a language

  • Culture and civilization – The richness of vocabulary and grammar usage reflects the era of the language and the society


Characteristics of Spoken Language Data

  • How was a language spoken 25 years ago, 50 years ago, 100 years ago and beyond?

  • How was a famous poem recited or sung by its author?

  • How was a particular language spoken in different geographical locations of a state/country?

  • How has a particular language/dialect evolved over a period of time?

  • What were the rare languages/dialects (which no longer exist)? How were they spoken?


Phase 0: Prototype data collection at IIIT Hyd

  • High quality studio recordings

    • 2 hrs of single speaker recordings for speech synthesis

    • Telugu, Hindi, Tamil and Indian-English

    • Developed text to speech systems in these 4 languages

  • Telephone and Cell-phone corpus

    • 150 hrs (540 speakers)

    • Telugu, Tamil and Marathi

    • Developed speech recognition systems in these 3 languages


Phase 0: Prototype data collection at IIT Madras

  • 15 hours (72 speakers)

  • TV news in Tamil, Telugu and Hindi Languages

    • Text to speech systems (TTS)

    • Language Identification

    • Duration modeling for TTS systems


Tools Aiding Acquisition/Correction of Speech Data

  • Transcription correction tool (TCT)

    • Spoken errors at phone, syllable, word level

    • Background noise, abrupt begin or end, low SNR

    • TCT corrects the above errors at three levels

  • Audio & Video Transcription Tool

    • Used to annotate movie databases

  • Correction of Segment labels

    • Emulabel


Lessons Learnt

  • Speech correction needs 3-6 times as much effort as collection

    • Better to collect more data than to correct it

  • Needs a unified framework

    • Standardize processes, procedures and tools

  • Need a larger collection of spoken and text corpora

    • For building practical speech systems in Indian languages


Proposal for collection of larger Spoken Language Data for IL

  • Focus on information present in speech mode

  • Collect spoken language data from all Indian languages and also from neighboring countries

  • Collect about 200,000 (0.2 M) hours of speech

    • As a part of Jim Baker’s global project of collecting 1 million hours of speech


New in our approach

  • Collection of large speech data up to 200,000 (0.2 M) hours

    • All Indian languages and dialects

      • 23 official Indian languages

      • Approx. 10,000 hours per language

    • All types: traditional, read, spoken, conversational, dialog, movies, broadcast etc.

    • All modes: microphone, clean, telephone, cellphone, satellite etc.

  • Standard procedure for organizing, annotating and indexing

  • More focus on larger collection (and on elimination rather than correction)

  • Make this data available for general public use


Key Make-A-Difference Capability

  • Availability of information (Stories, lectures, poems, books, articles) in spoken language

    • For the illiterate

    • For the vision impaired

  • Collection and Storage of spoken language data of popular as well as rare languages & dialects

  • Promotes research and development in

    • Speech Technology

      • Speech-to-speech translation in Indian languages

      • Phonetic engine (Language Independent)

      • Speech synthesis (Text-to-speech for Indian languages)

      • Speaker recognition (Text independent and dependent)

      • Language Identification

      • Speech enhancement

      • Speech signal processing

    • Biometrics:

      • Multimodal: Audio-Video modes

    • Information Access, Storage and Retrieval

      • Audio-video data (indexing)

      • Data Mining (searching)

      • Speech Coding (Ultra-low bit-rate coding)


Implementation Plan

  • Phase 1: (3.5 months)

    • 10 languages

    • 33,300 hours

  • Phase 2: (8 months)

    • 10 (of phase 1) languages

    • 66,000 hours

  • Phase 3: (10 months)

    • 13 remaining languages

    • 80,000 hours
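
The phase targets can be tallied with a short Python sketch. The names `phases` and `TARGET_HOURS` are illustrative and not from the slides. The slides do not say whether the per-phase hours are cumulative, so this sketch assumes they are additive and compares the sum with the 200,000-hour goal stated in the proposal.

```python
# Illustrative tally of the phase targets listed on the slide.
# Assumption: per-phase hours are additive, not cumulative.
phases = [
    # (name, duration in months, target hours)
    ("Phase 1", 3.5, 33_300),
    ("Phase 2", 8.0, 66_000),
    ("Phase 3", 10.0, 80_000),
]
TARGET_HOURS = 200_000  # overall goal stated in the proposal

total_months = sum(months for _, months, _ in phases)
total_hours = sum(hours for _, _, hours in phases)
shortfall = TARGET_HOURS - total_hours

print(f"Total duration: {total_months} months")
print(f"Total planned hours: {total_hours:,}")
print(f"Gap to the stated target: {shortfall:,} hours")
```

On this additive reading the three phases span 21.5 months and cover 179,300 hours, so the listed phases fall about 20,700 hours short of the round 200,000-hour figure.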


Mid-Term and Final Terms

  • Mid-Term

    • Phase 1, collection of 33,300 hours of speech

    • Collection, storage and indexing of speech data for public information access

    • Visible research output using the speech data

    • Demonstrations of speech technology products

      • Speech recognition in 10 languages

  • Final Term

    • Phase 1 + Phase 2


Q & A



Impact of Audio Digital Library

  • Availability of information in spoken language form for the illiterate and others

  • Promotes research in speech technology for Indian languages

  • Enables development of speech technology products useful for the common man

  • Examples:

    • Speech-to-speech translation systems

      • For information exchange

    • Screen readers

      • For the illiterate and physically challenged

    • Naturally speaking dialog systems

      • For information access over voice mode


Phase 1: Time Estimate

  • Phase 1:

    • 10 official Indian languages

    • Parallel collection of data

    • ~ 3000 hours per language

      • 5,000 - 10,000 speakers

      • > 10 min of speech per speaker

    • Total: 33,300 hours

  • Time Estimates: (~ 3.5 months for all 10 languages)

    • 10-person team per language

    • Each person works

      • 8 hours a day

      • 30 mins of speech recording per hour

        • 1-3 speakers per hour

      • 240 mins of speech per day

        • 1-24 speakers per day

    • 240 speakers per day (10-person team)

    • 20,000 speakers per language in 84 working days
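
The throughput arithmetic on this slide can be reproduced with a minimal Python sketch. All variable names are illustrative; the numbers come from the slide. Because the slide says "> 10 min" per speaker, the speaker counts computed here are upper bounds.

```python
# Reproduces the Phase 1 throughput arithmetic from the slide.
# Variable names are hypothetical; the figures are the slide's own.
HOURS_PER_DAY = 8
RECORDED_MIN_PER_HOUR = 30
MIN_SPEECH_PER_SPEAKER = 10   # slide says "> 10 min", so counts below are upper bounds
TEAM_SIZE = 10
WORKING_DAYS = 84
LANGUAGES = 10

# Speech recorded by one person in one day, in minutes.
minutes_per_person_day = HOURS_PER_DAY * RECORDED_MIN_PER_HOUR

# Most speakers one person can record in a day at 10 minutes each.
max_speakers_per_person_day = minutes_per_person_day // MIN_SPEECH_PER_SPEAKER

# Scale up to the whole team and the whole collection period.
max_speakers_per_team_day = max_speakers_per_person_day * TEAM_SIZE
speakers_per_language = max_speakers_per_team_day * WORKING_DAYS
hours_per_language = minutes_per_person_day * TEAM_SIZE * WORKING_DAYS / 60
total_hours = hours_per_language * LANGUAGES

print(f"Minutes per person per day: {minutes_per_person_day}")
print(f"Speakers per team per day:  {max_speakers_per_team_day}")
print(f"Speakers per language:      {speakers_per_language:,}")
print(f"Hours per language:         {hours_per_language:,.0f}")
print(f"Total hours, all languages: {total_hours:,.0f}")
```

At exactly 10 minutes per speaker this gives 20,160 speakers and 3,360 hours per language (33,600 hours overall), close to the slide's rounded figures of 20,000 speakers and 33,300 hours.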


Phase 1: Cost Estimate

  • Man power cost: Rs 140 Lakhs

  • Equipment cost: Rs 55 Lakhs

  • Communication cost: Rs 40 Lakhs

  • Contingency (10%): Rs 25 Lakhs

  Total Cost: Rs 2.6 Crores (~ $ 565,000)
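
As a quick check of the total, the line items can be summed in Python. The dictionary keys and variable names are illustrative. The exchange rate is not given on the slide, so the sketch only derives the rate implied by the quoted dollar figure.

```python
# Sums the Phase 1 cost line items from the slide (amounts in lakhs of rupees).
# 1 lakh = 100,000 and 1 crore = 10,000,000 (standard Indian units).
LAKH = 100_000
CRORE = 10_000_000

costs_lakhs = {"man_power": 140, "equipment": 55, "communication": 40}
contingency_lakhs = 25          # slide labels this "Contingency (10%)"
quoted_usd = 565_000            # dollar figure quoted on the slide

subtotal_lakhs = sum(costs_lakhs.values())
ten_percent_lakhs = 0.10 * subtotal_lakhs      # exact 10% of the subtotal
total_lakhs = subtotal_lakhs + contingency_lakhs
total_crores = total_lakhs * LAKH / CRORE

# Rupees per dollar implied by the slide's own conversion.
implied_rate = total_lakhs * LAKH / quoted_usd

print(f"Subtotal: Rs {subtotal_lakhs} lakhs")
print(f"Exact 10% contingency: Rs {ten_percent_lakhs} lakhs")
print(f"Total: Rs {total_lakhs} lakhs = Rs {total_crores} crores")
print(f"Implied exchange rate: Rs {implied_rate:.2f} per USD")
```

The items sum to Rs 260 lakhs (Rs 2.6 crores). An exact 10% of the Rs 235 lakh subtotal is Rs 23.5 lakhs, so the slide's Rs 25 lakhs appears to be rounded up.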


Man-Power Cost

  • Data collection team: Rs 86 Lakhs

    • 10 (for data collection) x Rs 10 K PM

    • 10 (for data correction) x Rs 10 K PM

    • 1 data manager (Rs 15 K PM)

    • 4 months' cost: Rs 8,60,000 per language

  • 5 engineers: Rs 4 Lakhs

    • B.Tech level (Rs 20,000 PM)

  • Gifts for speakers: Rs 50 Lakhs

    • Rs 25 per speaker
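
The man-power breakdown can be reconstructed with a small Python sketch. Two inputs are assumptions rather than statements on this slide: that the engineers are also paid for 4 months, and that gifts cover 20,000 speakers in each of 10 languages (the figure from the Phase 1 time estimate). Variable names are illustrative.

```python
# Reconstructs the man-power cost items (all amounts in rupees).
LAKH = 100_000
MONTHS = 4        # the slide quotes a "4 months cost" for the collection team
LANGUAGES = 10    # Phase 1 covers 10 languages

# Per-language team: 10 collectors and 10 correctors at Rs 10,000 per month,
# plus one data manager at Rs 15,000 per month.
team_monthly = 10 * 10_000 + 10 * 10_000 + 1 * 15_000
team_per_language = team_monthly * MONTHS
team_total = team_per_language * LANGUAGES

# Assumption: the 5 engineers are also costed over the same 4 months.
engineers_total = 5 * 20_000 * MONTHS

# Assumption: 20,000 speakers per language, as in the Phase 1 time estimate.
speakers = 20_000 * LANGUAGES
gifts_total = 25 * speakers

man_power_total = team_total + engineers_total + gifts_total

print(f"Team cost per language: Rs {team_per_language:,}")
print(f"Team cost, 10 languages: Rs {team_total / LAKH:.0f} lakhs")
print(f"Engineers: Rs {engineers_total / LAKH:.0f} lakhs")
print(f"Gifts: Rs {gifts_total / LAKH:.0f} lakhs")
print(f"Man-power total: Rs {man_power_total / LAKH:.0f} lakhs")
```

Under these assumptions the items come to Rs 86, 4 and 50 lakhs, which add up to the Rs 140 lakhs quoted in the Phase 1 cost estimate.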


Machines Cost

  • Machines:

    • 30 servers: Rs 30 Lakhs

      • 3 servers per language

      • Each server has 4 ports for data collection

    • 30 CTI cards: Rs 20 Lakhs

  • Storage: 20 TB: Rs 5 Lakhs

    • Two copies of 20 TB


Communications Cost

  • Telephonic charges: Rs 20 Lakhs

    • Rs 1 per min (local telephonic charges)

  • Transportation: Rs 20 Lakhs
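
One way to see where the telephonic figure may come from is sketched below in Python. It assumes, which the slide does not state, that all 33,300 Phase 1 hours are recorded over the telephone at the local rate of Rs 1 per minute. Variable names are illustrative.

```python
# Rough check of the communications cost (amounts in rupees).
LAKH = 100_000
RATE_RS_PER_MIN = 1        # local telephonic charge from the slide
PHASE1_HOURS = 33_300      # Phase 1 target
TRANSPORT_LAKHS = 20       # transportation cost from the slide

# Assumption: every Phase 1 hour is collected over a telephone line.
call_minutes = PHASE1_HOURS * 60
telephone_lakhs = call_minutes * RATE_RS_PER_MIN / LAKH
communications_lakhs = telephone_lakhs + TRANSPORT_LAKHS

print(f"Call minutes: {call_minutes:,}")
print(f"Telephonic charges: Rs {telephone_lakhs:.2f} lakhs")
print(f"Communications total: Rs {communications_lakhs:.2f} lakhs")
```

On that assumption the call charges come to about Rs 19.98 lakhs, consistent with the Rs 20 lakhs on the slide, and the total matches the Rs 40 lakhs communication cost in the Phase 1 estimate.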