
Speech and Language Modeling



  1. Speech and Language Modeling Shaz Husain Albert Kalim Kevin Leung Nathan Liang

  2. Voice Recognition • The field of Computer Science that deals with designing computer systems that can recognize spoken words. • Voice Recognition implies only that the computer can take dictation, not that it understands what is being said.

  3. Voice Recognition (continued) • A number of voice recognition systems are available on the market. The most powerful can recognize thousands of words. • However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent.

  4. Voice Recognition (continued) • Many systems also require that the speaker speak slowly and distinctly and separate each word with a short pause. These systems are called discrete speech systems. • Recently, great strides have been made in continuous speech systems -- voice recognition systems that allow you to speak naturally. There are now several continuous-speech systems available for personal computers.

  5. Voice Recognition (continued) • Because of their limitations and high cost, voice recognition systems have traditionally been used only in a few specialized situations. For example, such systems are useful in instances when the user is unable to use a keyboard to enter data because his or her hands are occupied or disabled. Instead of typing commands, the user can simply speak into a headset. • Increasingly, however, as the cost decreases and performance improves, speech recognition systems are entering the mainstream and are being used as an alternative to keyboards.

  6. Natural Language Processing • Comprehending human languages falls under a different field of computer science called natural language processing. • Natural Language: human language. English, French, and Mandarin are natural languages. Computer languages, such as FORTRAN and C, are not. • Probably the single most challenging problem in Computer Science is to develop computers that can understand natural languages. So far, the complete solution to this problem has proved elusive, although a great deal of progress has been made.

  7. Proteus Project • At New York University, members of the Proteus Project have been doing Natural Language Processing (NLP) research since the 1960s. • Basic Research: Grammars and Parsers, Translation Models, Domain-Specific Language, Bitext Maps and Alignment, Evaluation Methodologies, Paraphrasing, and Predicate-Argument Structure.

  8. Proteus Project: Grammars and Parsers • Grammars are models of linguistic structure. Parsers are algorithms that infer linguistic structure, given a grammar and a linguistic expression. • Given a grammar, we can design a parser to infer structure from linguistic data. Also, given some parsed data, we can learn a grammar. • Example of Research Applications: Apple Pie Parser for English. For example, "I love an apple pie" will be parsed as (S (NP (PRP I)) (VP (VBP love) (NP (DT an) (NN apple) (NN pie))) (. -PERIOD-)) Web-based application: http://complingone.georgetown.edu/~linguist/applepie.html
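
  Bracketed parses like the one above can be loaded into a tree structure for programmatic inspection. A minimal Python sketch, assuming the NLTK package is installed (NLTK is not part of the Apple Pie Parser itself):

    # Load the bracketed parse from the slide into a tree and inspect it.
    from nltk import Tree

    # "-PERIOD-" is the parser's token for the sentence-final period.
    bracketed = "(S (NP (PRP I)) (VP (VBP love) (NP (DT an) (NN apple) (NN pie))) (. -PERIOD-))"
    tree = Tree.fromstring(bracketed)

    tree.pretty_print()    # draw the tree as ASCII art
    print(tree.label())    # "S" -- the root constituent
    print(tree.leaves())   # the words, plus the -PERIOD- token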

  9. Proteus Project: Translation Models • Translation models describe the abstract/mathematical relationship between two or more languages. • Also called models of translational equivalence because the main thing that they aim to predict is whether expressions in different languages have equivalent meanings. • A good translation model is the key to many trans-lingual applications, the most famous of which is machine translation.

  10. Proteus Project: Domain-specific Language • Sentences in different domains of discourse are structurally different. • For example, imperative sentences are common in computer manuals, but not in annual company reports. It would be useful to characterize these differences in a systematic way.

  11. Proteus Project: Bitext Maps and Alignment • A "bitext" consists of two texts that are mutual translations. • A bitext map is a description of the correspondence relation between elements of the two halves of a bitext. • Finding such a map is the first step to building translation models. It is also the first step in applications like automatic detection of omissions in translations.

  12. Proteus Project: Evaluation Methodologies • There are many correct ways to say almost anything, and many shades of meaning. This "ambiguity" of natural languages makes the evaluation of NLP systems difficult enough to be a research topic in itself. • The Proteus Project has invented new evaluation methods in two areas of NLP where evaluation is notoriously difficult: translation modeling and word sense disambiguation. An example of research applications: General Text Matcher (GTM). GTM measures the similarity between texts. Simple Applet for GTM: http://nlp.cs.nyu.edu/call_gtm.html

  13. Proteus Project: Paraphrasing • A paraphrase relation exists between two phrases which convey the same information. • The recognition of paraphrases is an essential part of many natural language applications: if we want to process text reporting fact "X", we need to know all the alternative ways in which "X" can be expressed. • Capturing paraphrases by hand is an almost overwhelming task because they are so common and many are domain specific. • Therefore, the Proteus Project has begun to develop procedures which learn paraphrases from text. The basic idea is to look for news stories from the same day which report the same event, and then examine the different ways in which the same fact gets reported.

  14. Proteus Project: Predicate-Argument Structure • An analysis of sentences in terms of predicates and arguments. • It is a "deeper" level of linguistic analysis than constituent structure or simple dependency structure, in particular one that regularizes over nearly equivalent surface strings.

  15. Language Modeling • A bad language model [slide image]

  16. Language Modeling (continued) [slide image]

  17. Language Modeling (continued) [slide image]

  18. Language Modeling: Introduction • Language modeling is one of the basic tasks in building a speech recognition system. • It helps a speech recognizer figure out how likely a word sequence is, independent of the acoustics. • This lets the recognizer make the right guess when two different sentences sound the same.

  19. Basics of Language Modeling • Language modeling has been studied from two different points of view. • First, as a problem of grammar inference: the model has to discriminate the sentences which belong to the language from those which do not. • Second, as a problem of probability estimation: if the model is used for recognition, the decision is usually based on the maximum a posteriori rule. The best sentence L is chosen so that the probability of the sentence, given the observations O, is maximized.

  20. What is a Language Model • A language model is a probability distribution over word sequences • P("And nothing but the truth") ≈ 0.001 • P("And nuts sing on the roof") ≈ 0

  21. How Language Models Work • Hard to compute P("And nothing but the truth") as a whole • Decompose the probability with the chain rule: P("And nothing but the truth") = P("And") × P("nothing" | "And") × P("but" | "And nothing") × P("the" | "And nothing but") × P("truth" | "And nothing but the")
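
  A minimal Python sketch of this chain-rule decomposition; the individual conditional probabilities below are invented purely for illustration:

    # Multiply the chain-rule factors to get the sentence probability.
    factors = [
        0.010,  # P("And")
        0.020,  # P("nothing" | "And")
        0.300,  # P("but" | "And nothing")
        0.400,  # P("the" | "And nothing but")
        0.050,  # P("truth" | "And nothing but the")
    ]
    p = 1.0
    for f in factors:
        p *= f
    print(p)  # P("And nothing but the truth") under these made-up factors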

  22. Types of Language Modeling • Statistical Language Modeling • N-gram / Trigram Language Modeling • Structured Language Modeling

  23. Statistical Language Model • A statistical language model (SLM) is a probability distribution P(s) over strings s that attempts to reflect how frequently a string s occurs as a sentence.

  24. The Trigram / N-gram LM • Assume each word depends only on the previous two (or n−1) words. Three words total for a trigram: "tri" means three, "gram" means writing. • P("the" | "… whole truth and nothing but") ≈ P("the" | "nothing but") • P("truth" | "… whole truth and nothing but the") ≈ P("truth" | "but the")
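
  A minimal Python sketch of a maximum-likelihood trigram model estimated from counts; the one-sentence corpus is a toy stand-in for real training text:

    from collections import Counter

    corpus = "the whole truth and nothing but the truth".split()

    trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p_trigram(w, u, v):
        """P(w | u v) = count(u v w) / count(u v); 0.0 if the bigram is unseen."""
        if bigram_counts[(u, v)] == 0:
            return 0.0
        return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

    print(p_trigram("the", "nothing", "but"))  # P("the" | "nothing but") = 1.0 here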

  25. Structured Language Models • Language has structure – noun phrases, verb phrases, etc. • Use structure of language to detect long distance information • Promising results • But: time consuming; language is right branching

  26. Evaluation • Perplexity is the geometric average inverse probability of the test data. • It measures language model difficulty, not acoustic difficulty. • The lower the perplexity, the closer we are to the true model.
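
  A minimal Python sketch of the perplexity computation, PP = P(w_1 … w_N)^(−1/N); the per-word probabilities are invented for illustration:

    import math

    word_probs = [0.1, 0.25, 0.5, 0.05]  # P(w_i | history) from some model

    n = len(word_probs)
    log2_prob = sum(math.log2(p) for p in word_probs)
    perplexity = 2 ** (-log2_prob / n)
    print(perplexity)  # lower is better: closer to the true model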

  27. Language Modeling Techniques • Smoothing • addresses the problem of data sparsity: there is rarely enough data to accurately estimate the parameters of a language model. • gives a way to combine less specific, more accurate information with more specific, but noisier, data. • E.g., deleted interpolation, Katz (or Good-Turing) smoothing, and modified Kneser-Ney smoothing. • Caching • a widely used technique based on the observation that recently observed words are likely to occur again. Models built from recently observed data can be combined with more general models to improve performance.
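
  A minimal Python sketch of one smoothing technique named above, linear (deleted) interpolation: combine trigram, bigram, and unigram estimates so that unseen trigrams still receive probability mass. The lambda weights are invented here; in practice they are tuned on heldout data (see slide 29):

    # Weights for the trigram, bigram, and unigram estimates; must sum to 1.
    LAMBDAS = (0.6, 0.3, 0.1)

    def p_interpolated(p_tri, p_bi, p_uni, lambdas=LAMBDAS):
        """P(w | u v) = l1*P(w | u v) + l2*P(w | v) + l3*P(w)."""
        l1, l2, l3 = lambdas
        return l1 * p_tri + l2 * p_bi + l3 * p_uni

    # Even if the trigram was never seen (p_tri = 0.0), the estimate is nonzero:
    print(p_interpolated(0.0, 0.02, 0.001))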

  28. LM Techniques (continued) • Skipping models • use the observation that even words that are not directly adjacent to the target word contain useful information. • Sentence-mixture models • use the observation that there are many different kinds of sentences. By modeling each sentence type separately, performance is improved. • Clustering • Words can be grouped together into clusters through various automatic techniques; then the probability of a cluster can be predicted instead of the probability of the word. • can be used to make smaller models or better performing ones.

  29. Smoothing: Finding Parameter Values • Split data into training, "heldout", and test sets • Try lots of different values for λ on the heldout data, pick the best • Test on the test data • Sometimes, can use tricks like EM (expectation maximization) to find values • Heldout data should have (at least) 100-1000 words per parameter • Use enough test data to be statistically significant (1000s of words, perhaps)
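
  A minimal Python sketch of this procedure: grid-search the interpolation weight λ on heldout data and keep the value with the highest heldout log-likelihood. The per-word probabilities of the two component models are invented for illustration:

    import math

    # (specific-model prob, general-model prob) per heldout word -- toy values.
    heldout = [(0.0, 0.01), (0.2, 0.05), (0.1, 0.02), (0.0, 0.03)]

    best_lam, best_ll = None, float("-inf")
    for i in range(1, 100):
        lam = i / 100
        ll = sum(math.log(lam * ps + (1 - lam) * pg) for ps, pg in heldout)
        if ll > best_ll:
            best_lam, best_ll = lam, ll

    print(best_lam)  # use this weight when evaluating on the test data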

  30. Caching: Real Life • Someone says "I swear to tell the truth" • System hears "I swerve to smell the soup" • The cache remembers! • The person says "The whole truth", and, with the cache, the system hears "The whole soup." Errors are locked in. • Caching works well when the user corrects errors as they go; it works poorly, or even hurts, without correction.

  31. Caching • If you say something, you are likely to say it again later. • Interpolate trigram with cache
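
  A minimal Python sketch of a cache language model: interpolate the static trigram estimate with a unigram estimate over the most recently seen words. The cache size and the 0.9/0.1 mixing weights are assumptions; real systems tune them:

    from collections import Counter, deque

    CACHE_SIZE = 200
    cache = deque(maxlen=CACHE_SIZE)  # the last few hundred words
    cache_counts = Counter()

    def observe(word):
        """Add a recognized word to the cache window."""
        if len(cache) == cache.maxlen:
            cache_counts[cache[0]] -= 1  # word about to fall out of the window
        cache.append(word)
        cache_counts[word] += 1

    def p_with_cache(word, p_trigram):
        """0.9 * static trigram estimate + 0.1 * cache unigram estimate."""
        p_cache = cache_counts[word] / len(cache) if cache else 0.0
        return 0.9 * p_trigram + 0.1 * p_cache

    for w in "i swear to tell the truth".split():
        observe(w)
    print(p_with_cache("truth", p_trigram=0.001))  # boosted: "truth" is recent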

  32. Skipping • P(z | …rstuvwxy) ≈ P(z | vwxy) • Why not P(z | v_xy)? A "skipping" n-gram skips the value of the 3-back word. • Example: P("time" | "show John a good") -> P("time" | "show ____ a good") • P(z | …rstuvwxy) ≈ λP(z | vwxy) + μP(z | vw_y) + (1−λ−μ)P(z | v_xy)
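
  A minimal Python sketch of the skipping interpolation above; the λ and μ weights are invented for illustration:

    LAM, MU = 0.5, 0.25

    def p_skip(p_vwxy, p_vw_y, p_v_xy, lam=LAM, mu=MU):
        """lam*P(z|vwxy) + mu*P(z|vw_y) + (1-lam-mu)*P(z|v_xy)."""
        return lam * p_vwxy + mu * p_vw_y + (1 - lam - mu) * p_v_xy

    # "show John a good ___": even if the exact context "show John a good" was
    # never seen in training, the skipping variant "show ____ a good" may have been.
    print(p_skip(0.0, 0.3, 0.2))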

  33. Clustering • CLUSTERING = CLASSES (same thing) • What is P(Tuesday | party on)? • Similar to P(Monday | party on) • Similar to P(Tuesday | celebration on) • Put words in clusters: • WEEKDAY = Sunday, Monday, Tuesday, … • EVENT = party, celebration, birthday, …

  34. Predictive Clustering Example • Find P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party on) × Psmooth(Tuesday | party on WEEKDAY) • C(party on Tuesday) = 0 • C(party on Wednesday) = 10 • C(arriving on Tuesday) = 10 • C(on Tuesday) = 100 • Psmooth(WEEKDAY | party on) is high • Psmooth(Tuesday | party on WEEKDAY) backs off to Psmooth(Tuesday | on WEEKDAY)
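
  A minimal Python sketch of this factorization; the two smoothed probabilities are invented for illustration:

    # P(word | history) = P(cluster | history) * P(word | history, cluster)
    p_cluster_given_history = 0.30  # Psmooth(WEEKDAY | "party on") -- high
    p_word_given_cluster = 0.14     # Psmooth(Tuesday | "party on" WEEKDAY),
                                    # backed off to Psmooth(Tuesday | "on" WEEKDAY)

    p_tuesday = p_cluster_given_history * p_word_given_cluster
    print(p_tuesday)  # nonzero even though C(party on Tuesday) = 0 in training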

  35. Microsoft Language Modeling Research Microsoft language modeling research falls into several categories: • Language Model Adaptation. Natural language technology in general, and language models in particular, are very brittle when moving from one domain to another. Current statistical language models are built from text specific to newspapers and TV/radio broadcasts, which has little to do with the everyday use of language by a particular individual. We are investigating means of adapting a general-domain statistical language model to a new domain/user when we have access to limited amounts of sample data from the new domain/user.

  36. Microsoft Language Modeling Research • Can Syntactic Structure Help? Current language models make no use of the syntactic properties of natural language, but rather use very simple statistics such as word co-occurrences. Recent results show that incorporating syntactic constraints in a statistical language model reduces the word error rate on a conventional dictation task by 10%. We are working on finding the best way of "putting language into language models" as well as exploring the new possibilities opened by such structured language models for other tasks such as speech and language understanding.

  37. Microsoft Language Modeling Research • Speech Utterance Classification. A simple first step to more natural user interfaces in interactive voice response systems is automated call routing. Instead of listening to prompts like "If you are trying to reach department X say Yes, otherwise say No" or punching keys on your telephone keypad, one could simply state in a sentence what the problem is, for example "There is a fraudulent transaction on my last statement", and get connected to the right customer service representative. We are developing technology that aims at classifying speech utterances into a limited set of classes, enhancing the role of the traditional language model such that it also assigns a category to a given utterance.

  38. Microsoft Language Modeling Research • Building the best language models we can. In general, the better the language model, the lower the error rate of the speech recognizer. By putting together the best results available on language modeling, we have created a language model that outperforms a standard baseline by 45%, leading to a 10% reduction in error rate for our speech recognizer. The system has the best reported results of any language model.

  39. Microsoft Language Modeling Research • Language modeling for other applications. Speech recognition is not the only use for language models. They are also useful in fields like handwriting recognition, spelling correction, even typing Chinese! Like speech recognition, all of these are areas where the input is ambiguous in some way, and a language model can help us guess the most likely input. We're also working on finding new uses for language models in other areas.

  40. Microsoft Speech Software Development Kit • Enables developers to create, debug, and deploy speech-enabled ASP.NET Web applications intended for deployment to a Microsoft Speech Server. • Applications are designed for devices ranging from telephones to Windows Mobile™-based devices and desktop PCs.

  41. Speech Application Language Tags (SALT) • SALT is an XML-based API that brings speech interactions to the Web. • SALT is an extension of HTML and other markup languages (cHTML, XHTML, WML) that adds a powerful speech interface to Web pages, while maintaining and leveraging all the advantages of the Web application model. These tags are designed to be used for both voice-only browsers (for example, a browser accessed over the telephone) and multimodal browsers. • SALT is a small set of XML elements, with associated attributes and DOM object properties, events, and methods, which may be used in conjunction with a source markup document to apply a speech interface to the source page. The SALT formalism and semantics are independent of the nature of the source document, so SALT can be used equally effectively within HTML and all its flavors, or with WML, or with any other SGML-derived markup.

  42. What kind of applications can we build with SALT? • SALT can be used to add speech recognition, speech synthesis, and telephony capabilities to HTML- or XHTML-based applications, making them accessible from telephones as well as from GUI-based devices such as PCs, tablet PCs, and wireless personal digital assistants (PDAs).

  43. XML (Extensible Markup Language) • XML is a collection of protocols for representing structured data in a text format that makes it straightforward to interchange XML documents on different computer systems. • XML allows new markups to be defined. • XML documents contain structured data, which can be transformed into other formats using stylesheet languages such as XSL/XSLT.

  44. The main top-level elements • <prompt …> • For speech synthesis configuration and prompt playing • <listen …> • For speech recognizer configuration, recognition execution and post-processing, and recording • <dtmf …> • For configuration and control of DTMF collection • <smex …> • For general-purpose communication with platform components

  45. The input elements <listen> and <dtmf> also contain grammars and binding controls • <grammar …> • For specifying input grammar resources • <bind …> • For processing of recognition results • <record …> • For recording audio input

  46. Speech Library Example [slide image]

  47. Speech Library Example [slide image]

  48. Example

    <input name="Date" type="Dates" />
    <input name="PersonToMeet" type="text" />
    <input name="Duration" type="time" />
    …
    <prompt …>
      Schedule a meeting
      <value targetElement="Date"/> Date
      <value targetElement="Duration"/> Duration
      <value targetElement="PersonToMeet"/> Person
    </prompt>
    <listen …>
      <grammar …/>
      <bind test="/@confidence $lt$ 50"
            targetElement="prompt_confirm" targetMethod="start" />
      <bind test="/@confidence $lt$ 50"
            targetElement="listen_confirm" targetMethod="start" />
      <bind test="/@confidence $ge$ 50"
            targetElement="Date" value="//Meeting/Date" />
      <bind test="/@confidence $ge$ 50"
            targetElement="Duration" value="//Meeting/Duration" />
      <bind test="/@confidence $ge$ 50"
            targetElement="PersonToMeet" value="//Meeting/Person" />
      …
    </listen>

  49. Example (continued)

    <l propname="DayOfWeek">
      <p valstr="Sun"> Sunday </p>
      <p valstr="Mon"> Monday </p>
      <p valstr="Mon"> first day </p>
      .. .. ..
      <p valstr="Sat"> Saturday </p>
    </l>

  Voice: "monday" generates an XML element:

    <DayOfWeek text="first day">Mon</DayOfWeek>

    <l propname="Person">
      <p valstr="Nathan"> CEO </p>
      <p valstr="Nathan"> Nathan </p>
      <p valstr="Nathan"> boss </p>
      <p valstr="Albert"> programmer </p>
      ……
    </l>

  Voice: "CEO" generates:

    <Person text="CEO">Nathan</Person>

    <rule name="MeetingProperties">
      <l>
        <ruleref name="Date"/>
        <ruleref name="Duration"/>
        <ruleref name="Time"/>
        <ruleref name="Person"/>
        <ruleref name="Subject"/>
        .. ..
      </l>
      <o>
        <ruleref name="Meeting"/>
      </o>
      <output>
        <Calendar:meeting>
          <DateTime>
            <xsl:apply-templates name="DayOfWeek"/>
            <xsl:apply-templates name="Time"/>
            <xsl:apply-templates name="Duration"/>
          </DateTime>
          <PersonToMeet>
            <xsl:apply-templates name="Person"/>
          </PersonToMeet>
        </Calendar:meeting>
      </output>
    </rule>

  50. XML Result

    <calendar:meeting text="…">
      <DateTime text="…">
        <DayOfWeek text="…">Monday</DayOfWeek>
        <Time text="…">2:00</Time>
        <Duration text="…">3600</Duration>
      </DateTime>
      <Person>Nathan</Person>
    </calendar:meeting>
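
  A minimal Python sketch of consuming this recognition result with the standard library; the xmlns:calendar declaration (with a made-up URI) is added here so the snippet is well-formed on its own:

    import xml.etree.ElementTree as ET

    result = """
    <calendar:meeting xmlns:calendar="urn:example:calendar" text="...">
      <DateTime text="...">
        <DayOfWeek text="...">Monday</DayOfWeek>
        <Time text="...">2:00</Time>
        <Duration text="...">3600</Duration>
      </DateTime>
      <Person>Nathan</Person>
    </calendar:meeting>
    """

    meeting = ET.fromstring(result)
    print(meeting.findtext("DateTime/DayOfWeek"))  # Monday
    print(meeting.findtext("Person"))              # Nathan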
