
OPTIMAL TEXT SELECTION ALGORITHM

ASR Project Meetings, 08 June 2004. Rohit Kumar, LTRC, IIIT Hyderabad.


Presentation Transcript


  1. OPTIMAL TEXT SELECTION ALGORITHM. ASR Project Meetings, 08 June 2004. Rohit Kumar, LTRC, IIIT Hyderabad.

  2. Basic Greedy Algorithm
     • Get the frequency distribution of basic units in a language by analyzing a large corpus.
     • Iterate for as many sentences as you want to select:
       • For each sentence in the corpus, score the sentence for its desirability in the optimal text.
       • Choose the sentence with the best score into the optimal text.
       • Delete the selected sentence from the corpus.
       • Update the frequency distribution based on the sentence selected.

  3. Analysis Step
     • Text Corpus: basically a set of sentences.
     • Phonetizer: gives you a sequence of phonemes.
     • Syllabifier: gives you a sequence of basic units (diphones, triphones, syllables, …).
     • Unit Distribution Analysis: counts the number of occurrences of each basic unit.
     • Result: the Corpus Frequency Distribution, for example:
       Unit   Frequency
       ka     10000
       ek     8756
       ne     6593
       …
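The counting stage of this analysis can be sketched with Python's `collections.Counter`. The sketch assumes the Phonetizer and Syllabifier stages have already run, so the input is one list of units per sentence; the function name and the toy data are made up for illustration.

```python
from collections import Counter


def unit_frequency_distribution(sentences_as_units):
    """Count how often each basic unit occurs across the whole corpus."""
    freq = Counter()
    for units in sentences_as_units:
        freq.update(units)  # adds one count per occurrence of each unit
    return freq


# Toy corpus: each inner list is one sentence already split into units.
corpus_units = [["ka", "ek", "ne"], ["ka", "ne"], ["ka"]]
freq = unit_frequency_distribution(corpus_units)

# Print units from most to least frequent, like the table on the slide.
for unit, count in freq.most_common():
    print(unit, count)
```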

  4. How to Score Each Sentence
     • The sentence goes through the Phonetizer (which gives a sequence of phonemes) and the Syllabifier (which gives a sequence of basic units: diphones, triphones, syllables, …).
     • The Sentence Ranking Algorithm then scores the units against the Corpus Frequency Distribution:
       1. Each unit is scored on the basis of its desirability.
       2. Desirability is proportional to the frequency of the unit in the large corpus.
       3. Sentence score = Sum of Fn(score of each unit in the sentence) / Number of units.
     • The scoring function Fn can be either a linear or an inverse function.
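A minimal sketch of this scoring rule, with both a linear and an inverse choice for Fn. Two details are assumptions not stated on the slide: a unit missing from the distribution is treated as having frequency 0, and the inverse function returns 0 for non-positive frequencies to avoid dividing by zero. The frequencies are the sample values from the analysis slide.

```python
def linear(frequency):
    """Linear Fn: frequent units score high."""
    return float(frequency)


def inverse(frequency):
    """Inverse Fn: rare units score high.

    Assumption: non-positive frequencies contribute nothing.
    """
    return 1.0 / frequency if frequency > 0 else 0.0


def sentence_score(units, freq, fn=linear):
    """Sum of fn(unit frequency) over the sentence, divided by the unit count."""
    if not units:
        return 0.0
    # Assumption: a unit absent from the distribution has frequency 0.
    return sum(fn(freq.get(u, 0)) for u in units) / len(units)


freq = {"ka": 10000, "ek": 8756, "ne": 6593}
common = sentence_score(["ka", "ek"], freq, linear)   # (10000 + 8756) / 2
rare = sentence_score(["ne", "ne"], freq, inverse)    # (1/6593 + 1/6593) / 2
print(common, rare)
```

Dividing by the number of units keeps long sentences from winning simply because they contain more units.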

  5. How to Update the Frequency Distribution
     • The selected sentence goes through the Phonetizer (which gives a sequence of phonemes) and the Syllabifier (which gives a sequence of basic units: diphones, triphones, syllables, …).
     • Unit Distribution Analysis counts the number of occurrences of each basic unit, giving the Sentence Level Unit Frequency Distribution.
     • For each unit in the sentence frequency distribution, subtract K x (frequency of the unit in the sentence) from its corpus frequency.
     • The result is the Modified Corpus Frequency Distribution.
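The update step can be sketched as below. The slide does not give a value for K or say whether frequencies should be clamped at zero, so `k=100` in the example is purely illustrative and no clamping is applied; the function name is hypothetical.

```python
from collections import Counter


def update_distribution(corpus_freq, sentence_units, k=1.0):
    """Subtract k * (frequency of unit in sentence) from each corpus frequency.

    corpus_freq is modified in place and also returned.
    Assumption: no clamping, so a frequency may drop below zero.
    """
    sentence_freq = Counter(sentence_units)  # sentence-level distribution
    for unit, count in sentence_freq.items():
        corpus_freq[unit] = corpus_freq.get(unit, 0) - k * count
    return corpus_freq


freq = {"ka": 10000, "ek": 8756, "ne": 6593}
# "ka" occurs twice and "ne" once in the selected sentence.
update_distribution(freq, ["ka", "ne", "ka"], k=100)
print(freq)
```

A larger K makes units that are already covered fall out of favor faster in later iterations.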

  6. Issues
     • Complete desirable coverage will not be possible with a simple one-step selection, as it would bring a large number of sentences into the optimal text.
     • “Optimal Text means Maximum Coverage and Minimum Size”
     How to solve: follow multiple small steps, as described ahead.

  7. Our Strategy for Optimal Text Selection
     • From the large database, discard sentences whose length is not between 5 and 15 words.
     • From the frequency analysis of the units, choose a set of N units (out of M units in total) whose frequency is higher than a threshold (say, around 100).
     • Select the sentences (say X of them) which cover these N units.
     • Repeat the process with the remaining P units (P = M - N), but restrict the number of sentences to no more than 2 * X.
     • For all the units still remaining, select words which cover these units.
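The first four steps of this strategy might look like the sketch below. Several points are assumptions: the length bounds are treated as inclusive, words are split on whitespace, and the "select sentences which cover these units" step is implemented as a plain greedy cover (pick the sentence covering the most still-uncovered target units), which may differ from the scoring-based selection described on the earlier slides. All function names and data are hypothetical.

```python
def filter_by_length(sentences, min_words=5, max_words=15):
    """Keep sentences of min_words to max_words words (inclusive bounds assumed)."""
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]


def split_units_by_threshold(freq, threshold=100):
    """Return (units with frequency above threshold, all other units)."""
    frequent = {u for u, c in freq.items() if c > threshold}
    return frequent, set(freq) - frequent


def cover_units(sentence_units, targets, max_sentences=None):
    """Greedy cover: repeatedly take the sentence covering most uncovered targets.

    sentence_units maps a sentence id to its list of units.
    Returns (chosen sentence ids, targets left uncovered).
    """
    uncovered = set(targets)
    candidates = dict(sentence_units)
    chosen = []
    while uncovered and candidates:
        if max_sentences is not None and len(chosen) >= max_sentences:
            break
        best = max(candidates, key=lambda sid: len(uncovered & set(candidates[sid])))
        gain = uncovered & set(candidates[best])
        if not gain:
            break  # no remaining sentence covers anything new
        chosen.append(best)
        uncovered -= gain
        del candidates[best]
    return chosen, uncovered


# Toy data, invented for the example.
kept = filter_by_length(["one two three four five", "too short"])
freq = {"ka": 500, "ek": 300, "ne": 40, "zu": 3}
units_by_sentence = {"s1": ["ka", "ek"], "s2": ["ka", "ne"], "s3": ["ne", "zu"]}

frequent, rare = split_units_by_threshold(freq, threshold=100)   # N units, P units
first, _ = cover_units(units_by_sentence, frequent)              # X sentences
rest = {sid: u for sid, u in units_by_sentence.items() if sid not in first}
second, leftover = cover_units(rest, rare, max_sentences=2 * len(first))
print(kept, first, second, leftover)
```

Any units left in `leftover` would go to the final word-level step, which is not sketched here.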

  8. Phonetizer: a class that takes a text as input and gives a sequence of phonemes as output.
     Which phonemes? We will be following ITrans-3 as the notation across all our work.
     Word       ITrans-3     Phonemes
     namaste    namaste      n , a , m , a , s , t , e
     dhanywad   dhanywaad    dh , a , n , y , w , aa , d
     textile    t'ekstaail   t' , e , k , s , t , aa , i , l
     khabrein   qhabren'     qh , a , b , r , e , n'
     krishna    krxshhnaa    k , rx , shh , n , aa

  9. Class Details
     1. Class constructor
     2. AddText (inputs a string, no output)
     3. GetPhoneme (no input, outputs one phoneme)
     4. IsEmpty (no input, outputs a flag if no text is left to work on)
     • 3 (GetPhoneme) is the phonetizing function, which breaks a text into phonemes and will be broadly the same for all languages.
     • The list of phonemes is shown in the next slide.

  10. Phoneme list (for Hindi; minor modifications for other languages)
      a a1 aa aa* aa1 i ii u uu e e* e1 ai o au n' : *' * rx lx rxx lxx k kh g gh ng- ch chh j jh nj- t' t'h d' d'h nd- t th d dh n n~ p ph b bh m y r r~ l l' l'~ v sh shh s s- h q qh gx z dr~ dd~ f y~
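A possible shape for the Phonetizer class, using the phoneme list above. This is a sketch under stated assumptions rather than the project's code: the input is taken to be ITrans-3 text already, symbols are matched longest-first (so "shh" wins over "sh" and "s"), and characters outside the inventory are skipped. The method names mirror AddText, GetPhoneme, and IsEmpty in Python naming style.

```python
from collections import deque

# Phoneme inventory copied from the slide (Hindi).
PHONEMES = """a a1 aa aa* aa1 i ii u uu e e* e1 ai o au n' : *' * rx lx rxx lxx
k kh g gh ng- ch chh j jh nj- t' t'h d' d'h nd- t th d dh n n~
p ph b bh m y r r~ l l' l'~ v sh shh s s- h q qh gx z dr~ dd~ f y~""".split()

# Assumption: longest-match-first segmentation.
_BY_LENGTH = sorted(set(PHONEMES), key=len, reverse=True)


class Phonetizer:
    def __init__(self):
        self._queue = deque()  # phonemes waiting to be read

    def add_text(self, text):
        """Segment ITrans-3 text into phonemes and queue them."""
        for word in text.split():
            pos = 0
            while pos < len(word):
                match = next(
                    (p for p in _BY_LENGTH if word.startswith(p, pos)), None
                )
                if match is None:
                    pos += 1  # assumption: skip characters not in the inventory
                    continue
                self._queue.append(match)
                pos += len(match)

    def get_phoneme(self):
        """Return the next phoneme; call only when is_empty() is False."""
        return self._queue.popleft()

    def is_empty(self):
        return not self._queue


phonetizer = Phonetizer()
phonetizer.add_text("krxshhnaa")
phonemes = []
while not phonetizer.is_empty():
    phonemes.append(phonetizer.get_phoneme())
print(phonemes)
```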

  11. Handling English Words
      • Dictionary lookup
      • Letter-to-sound module

  12. Implementation: Syllabifier
      • Basic units: diphones (2 phones), triphones (3 phones), syllables.
      • The Syllabifier takes the phonemes from the Phonetizer and gives units. For example, when working with triphones:
        t' , e , k , s , t , aa , i , l >> t'-e-k , e-k-s , k-s-t , s-t-aa , t-aa-i , aa-i-l
      • Class details
        • Class constructor
        • AddPhoneme (inputs a string, no output)
        • GetUnit (no input, outputs one string)
        • IsEmpty (no input, outputs a flag if no phonemes are left)
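For diphones and triphones, the Syllabifier interface can be sketched as a sliding window over the incoming phonemes. The sketch makes some assumptions: `n` selects the unit size (2 for diphones, 3 for triphones), units are joined with "-", and `is_empty` reports whether any completed units are waiting. True syllables need language-specific rules and are not covered here.

```python
from collections import deque


class Syllabifier:
    def __init__(self, n=3, separator="-"):
        self._n = n                      # unit size: 2 = diphone, 3 = triphone
        self._sep = separator
        self._window = deque(maxlen=n)   # the last n phonemes seen
        self._units = deque()            # completed units waiting to be read

    def add_phoneme(self, phoneme):
        """Feed one phoneme; emits a unit once the window is full."""
        self._window.append(phoneme)
        if len(self._window) == self._n:
            self._units.append(self._sep.join(self._window))

    def get_unit(self):
        """Return the next unit; call only when is_empty() is False."""
        return self._units.popleft()

    def is_empty(self):
        return not self._units


syllabifier = Syllabifier(n=3)
for phoneme in ["t'", "e", "k", "s", "t", "aa", "i", "l"]:
    syllabifier.add_phoneme(phoneme)

units = []
while not syllabifier.is_empty():
    units.append(syllabifier.get_unit())
print(units)
```

The window is a `deque` with `maxlen=n`, so appending a new phoneme drops the oldest one automatically and each full window yields exactly one overlapping unit.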

  13. Effort for Each Language
      1. Collect the corpus (most of Hindi, Telugu, Tamil, Marathi already available).
      2. Automatic cleaning and conversions on the corpus.
         * English word to ITrans conversion by dictionary lookup.
      3. Modify the Phonetizer (and Syllabifier) for the language.
      4. Run the OTS strategy.
      5. Manually check the selected corpus and make corrections.
      6. Optional: repeat one or more steps of the OTS strategy if need be.
