
Probabilistic Language Processing



Presentation Transcript


  1. Probabilistic Language Processing Chapter 23

  2. Probabilistic Language Models • Goal -- define a probability distribution over a set of strings • Unigram, bigram, n-gram • Count using a corpus, but counts need smoothing: • add-one • linear interpolation • Evaluate with the perplexity measure • E.g., segment text written without spaces ("segmentwordswithoutspaces") with Viterbi
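A minimal sketch of these ideas -- a bigram model with add-one smoothing, a linearly interpolated estimate, and perplexity -- assuming a toy corpus (all data below is illustrative):

    import math
    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    V = len(unigrams)            # vocabulary size
    N = sum(unigrams.values())   # corpus size

    def p_add_one(w1, w2):
        # Add-one (Laplace) smoothing: (count(w1 w2) + 1) / (count(w1) + V)
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

    def p_interp(w1, w2, lam=0.5):
        # Linear interpolation of the bigram and unigram estimates
        return lam * p_add_one(w1, w2) + (1 - lam) * unigrams[w2] / N

    def perplexity(words):
        # 2 ** (average negative log2 probability per bigram)
        log_prob = sum(math.log2(p_add_one(w1, w2))
                       for w1, w2 in zip(words, words[1:]))
        return 2 ** (-log_prob / (len(words) - 1))

    print(perplexity("the cat sat on the mat".split()))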

  3. PCFGs • Rewrite rules have probabilities. • The prob of a string is the sum of the probs of its parse trees. • Context-freedom means no lexical constraints. • Prefers short sentences (each extra rule application multiplies in another probability < 1).
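A minimal sketch of scoring one parse tree under a toy PCFG (the grammar, probabilities, and tree encoding are illustrative); the probability of a string is then the sum of such scores over all of its parses:

    pcfg = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("she",)): 0.6,
        ("NP", ("fish",)): 0.4,
        ("VP", ("eats",)): 0.3,
        ("VP", ("VP", "NP")): 0.7,
    }

    def tree_prob(tree):
        # A tree is (symbol, children); leaves are plain word strings.
        if isinstance(tree, str):
            return 1.0
        symbol, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = pcfg[(symbol, rhs)]
        for child in children:
            p *= tree_prob(child)
        return p

    tree = ("S", [("NP", ["she"]),
                  ("VP", [("VP", ["eats"]), ("NP", ["fish"])])])
    print(tree_prob(tree))  # 1.0 * 0.6 * 0.7 * 0.3 * 0.4 = 0.0504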

  4. Learning PCFGs • Parsed corpus -- count trees. • Unparsed corpus • Rule structure known -- use EM (inside-outside algorithm) • Rules unknown -- Chomsky normal form… problems.
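For the parsed-corpus case, a minimal sketch of maximum-likelihood estimation by counting rules in trees (same (symbol, children) tree encoding as the sketch above; the toy treebank is illustrative):

    from collections import Counter, defaultdict

    def count_rules(tree, counts):
        if isinstance(tree, str):
            return
        symbol, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        counts[(symbol, rhs)] += 1
        for child in children:
            count_rules(child, counts)

    def estimate_pcfg(treebank):
        counts = Counter()
        for tree in treebank:
            count_rules(tree, counts)
        lhs_totals = defaultdict(int)
        for (lhs, rhs), n in counts.items():
            lhs_totals[lhs] += n
        # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
        return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

    treebank = [("S", [("NP", ["she"]), ("VP", ["eats"])]),
                ("S", [("NP", ["fish"]), ("VP", ["eats"])])]
    print(estimate_pcfg(treebank))
    # NP -> she and NP -> fish each get probability 0.5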

  5. Information Retrieval • Goal: Google. Find docs relevant to the user's needs. • An IR system has a document collection, a query in some language, a set of results, and a presentation of the results. • Ideally, we would parse docs into a knowledge base… too hard.

  6. IR 2 • Boolean keyword model -- in or out? • Problem -- a single bit of "relevance" • Boolean combinations are a bit mysterious • How to compute P(R=true | D,Q)? • Estimate a language model for each doc; compute the probability of the query given that model. • Can rank documents by P(R=true | D,Q) / P(R=false | D,Q)
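A minimal sketch of that ranking idea, assuming an add-one-smoothed unigram language model per document (the collection and query are toy data):

    from collections import Counter

    docs = {"d1": "the cat sat on the mat".split(),
            "d2": "dogs chase cats in the park".split()}
    vocab = {w for words in docs.values() for w in words}

    def query_likelihood(query, words):
        counts = Counter(words)
        p = 1.0
        for w in query:
            # Add-one smoothing so unseen query words don't zero the score
            p *= (counts[w] + 1) / (len(words) + len(vocab))
        return p

    query = "cat mat".split()
    ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d]),
                    reverse=True)
    print(ranked)  # ['d1', 'd2']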

  7. IR 3 • For this, we need a model of how queries are related to docs. Bag of words: frequency of words in the doc; naïve Bayes. • Good example on pp. 842-843.
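A minimal sketch of a bag-of-words naïve Bayes scorer, trained on toy labeled documents (all data is illustrative); the log-odds score mirrors the P(R=true | D,Q) / P(R=false | D,Q) ratio from the previous slide:

    import math
    from collections import Counter

    relevant = ["cheap flights to paris".split(), "paris hotel deals".split()]
    irrelevant = ["python list comprehension".split(), "rust borrow checker".split()]

    def train(doc_lists):
        counts = Counter(w for doc in doc_lists for w in doc)
        return counts, sum(counts.values())

    rel_counts, rel_total = train(relevant)
    irr_counts, irr_total = train(irrelevant)
    vocab = set(rel_counts) | set(irr_counts)

    def log_odds(words):
        # log P(words | R=true) - log P(words | R=false), add-one smoothed
        score = 0.0
        for w in words:
            p_rel = (rel_counts[w] + 1) / (rel_total + len(vocab))
            p_irr = (irr_counts[w] + 1) / (irr_total + len(vocab))
            score += math.log(p_rel) - math.log(p_irr)
        return score

    print(log_odds("paris flights".split()))  # > 0 suggests relevant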

  8. Evaluating IR • Precision is the proportion of results that are relevant. • Recall is the proportion of relevant docs that are in the results. • ROC curve (there are several varieties): one plots false negatives vs. false positives; the more common convention plots the true-positive rate against the false-positive rate. • More "practical" for the web: reciprocal rank of the first relevant result, or just "time to answer"
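A minimal sketch of these measures over a toy result list and relevance judgments (all data is illustrative):

    def precision(results, relevant):
        return sum(1 for d in results if d in relevant) / len(results)

    def recall(results, relevant):
        return sum(1 for d in results if d in relevant) / len(relevant)

    def reciprocal_rank(results, relevant):
        for i, d in enumerate(results, start=1):
            if d in relevant:
                return 1 / i
        return 0.0

    results = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2", "d5"}
    print(precision(results, relevant))        # 2/4 = 0.5
    print(recall(results, relevant))           # 2/3 ≈ 0.67
    print(reciprocal_rank(results, relevant))  # first hit at rank 2 -> 0.5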

  9. IR Refinements • Case • Stems • Synonyms • Spelling correction • Metadata -- keywords
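A minimal sketch of the normalization refinements (case, stems, synonyms); the suffix stripping and synonym table here are toy stand-ins for a real stemmer such as Porter's and a real thesaurus:

    synonyms = {"automobile": "car"}

    def stem(w):
        # Toy suffix stripping, standing in for a real stemming algorithm
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                return w[:-len(suffix)]
        return w

    def normalize(text):
        words = [w.lower() for w in text.split()]     # case folding
        words = [stem(w) for w in words]              # stemming
        return [synonyms.get(w, w) for w in words]    # synonym mapping

    print(normalize("Automobiles Parked"))  # ['car', 'park']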

  10. IR Presentation • Give a list in order of relevance; deal with duplicates • Cluster results into classes • Agglomerative • K-means • How to describe automatically generated clusters? A word list? The title of the centroid doc?
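A minimal k-means sketch over bag-of-words vectors, describing each cluster by its heaviest centroid words (all data is illustrative; agglomerative clustering would instead repeatedly merge the closest pair of clusters):

    import random
    from collections import Counter

    def vectorize(words, vocab):
        c = Counter(words)
        return [c[w] for w in vocab]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def kmeans(vectors, k, iters=10):
        centroids = random.sample(vectors, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for v in vectors:
                clusters[min(range(k), key=lambda i: dist2(v, centroids[i]))].append(v)
            # Recompute each centroid as the mean of its cluster
            centroids = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
                         for i, c in enumerate(clusters)]
        return centroids

    random.seed(0)
    docs = ["cat cat mat", "cat mat sat", "dog park run", "dog run fetch"]
    vocab = sorted({w for d in docs for w in d.split()})
    vecs = [vectorize(d.split(), vocab) for d in docs]
    for c in kmeans(vecs, k=2):
        top = sorted(zip(vocab, c), key=lambda t: -t[1])[:2]
        print([w for w, _ in top])  # word-list description of each cluster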

  11. IR Implementation • CSC172! • Lexicon with a "stop list" • An "inverted" index: where words occur • Match with vectors: a vector of word frequencies dotted with the query terms.
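A minimal sketch of these pieces -- a stop list, an inverted index, and dot-product matching (all data is illustrative):

    from collections import Counter, defaultdict

    stop = {"the", "a", "of", "on"}
    docs = {"d1": "the cat sat on the mat", "d2": "the dog sat on the log"}

    index = defaultdict(set)   # word -> set of doc ids where it occurs
    freqs = {}                 # doc id -> word-frequency vector
    for doc_id, text in docs.items():
        words = [w for w in text.split() if w not in stop]
        freqs[doc_id] = Counter(words)
        for w in words:
            index[w].add(doc_id)

    def score(query):
        terms = [w for w in query.split() if w not in stop]
        candidates = set().union(*(index[w] for w in terms if w in index))
        # Dot product of each document's frequency vector with the query terms
        return sorted(((sum(freqs[d][w] for w in terms), d) for d in candidates),
                      reverse=True)

    print(score("cat mat"))  # [(2, 'd1')]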

  12. Information Extraction • Goal: create database entries from docs. • Emphasis on massive data, speed, stylized expressions • Regular-expression grammars are OK if the text is stylized enough • Cascaded finite-state transducers -- stages of grouping and structure-finding
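A minimal sketch of regular-expression extraction over suitably stylized text, pulling (company, price) tuples ready for database entry (the pattern and sample text are invented):

    import re

    text = "Shares of Acme Corp. closed at $12.50; Widget Inc. closed at $7.25."
    pattern = re.compile(r"([A-Z][A-Za-z]+ (?:Corp|Inc)\.) closed at \$(\d+\.\d+)")

    records = [(name, float(price)) for name, price in pattern.findall(text)]
    print(records)  # [('Acme Corp.', 12.5), ('Widget Inc.', 7.25)]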

  13. Machine Translation Goals • Rough translation (e.g., p. 851) • Restricted domain (mergers, weather) • Pre-edited (Caterpillar or Xerox English) • Literary translation -- not yet! • Interlingua -- a canonical semantic representation like Conceptual Dependency • Basic problem: different languages, different categories

  14. MT in Practice • Transfer -- uses a database of rules for translating small units of language • Memory-based -- memorize sentence pairs • Good diagram on p. 853

  15. Statistical MT • Bilingual corpus • Find the most likely translation given the corpus. • argmax_F P(F|E) = argmax_F P(E|F) P(F) • P(F) is the language model • P(E|F) is the translation model • Lots of interesting problems: fertility ("home" vs. "à la maison"). • Horribly drastic simplifications and hacks work pretty well!
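A minimal sketch of that decision rule, with made-up numbers standing in for the real language model P(F) and translation model P(E|F):

    candidates = {
        "a la maison": {"p_lm": 0.02, "p_trans": 0.30},  # P(F), P(E|F)
        "chez nous":   {"p_lm": 0.05, "p_trans": 0.10},
    }

    def score(f):
        m = candidates[f]
        return m["p_trans"] * m["p_lm"]  # P(E|F) * P(F)

    best = max(candidates, key=score)    # the argmax over candidate F
    print(best, score(best))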

  16. Learning and MT • Statistical MT needs: a language model, a fertility model, a word-choice model, an offset model. • Millions of parameters • Counting, estimation, EM.
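A minimal sketch of EM for the word-choice (translation) model, in the spirit of IBM Model 1, with toy sentence pairs and uniform initialization (all data is illustrative):

    from collections import defaultdict

    pairs = [("the house".split(), "la maison".split()),
             ("the book".split(), "le livre".split()),
             ("a house".split(), "une maison".split())]

    e_vocab = {w for e, _ in pairs for w in e}
    f_vocab = {w for _, f in pairs for w in f}
    t = {(e, f): 1 / len(e_vocab) for e in e_vocab for f in f_vocab}  # P(e|f)

    for _ in range(10):                       # EM iterations
        counts = defaultdict(float)
        totals = defaultdict(float)
        for e_sent, f_sent in pairs:
            for e in e_sent:
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:              # E-step: expected alignment counts
                    c = t[(e, f)] / norm
                    counts[(e, f)] += c
                    totals[f] += c
        for (e, f), c in counts.items():      # M-step: re-estimate P(e|f)
            t[(e, f)] = c / totals[f]

    print(round(t[("house", "maison")], 3))   # rises toward 1.0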
