Final Projects

Presentation Transcript


  1. Final Projects • Please make an appointment to come talk to me (or come to office hours) • What additional things should you add to your project? • Are you on the right track with your method? • Any time next week • Next week: Invited speakers from Google: Wisam Dakka and Kevin Lerman • Stay tuned for the room!

  2. Web-based Models for NLP • Can web-based counts be used to develop good bigram models? • Verified for the same task on web vs. corpus • British National Corpus (BNC) • Unsupervised n-gram models • Many tasks. Let’s take a sample: • Candidate target word selection for MT • Compound noun bracketing • PP attachment

  3. Obtaining web counts • The number of hits for queries generated for this n-gram • Literal queries: “airplanes flew” • Inflected queries: “airplane flew”, “airplanes fly”, “airplane flies”, etc. • Google and AltaVista • Noise • False positives when punctuation is ignored • Matches may include links, filenames, etc.
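
The query generation on this slide can be sketched as follows. This is a minimal illustration, not the original system: `build_queries`, `web_count`, and `fake_hit_count` are hypothetical names, the hit counts are invented, and summing the hits over all inflected queries is an assumption about how the per-query counts are combined.

```python
from typing import Callable, Iterable, List

def build_queries(forms_1: Iterable[str], forms_2: Iterable[str]) -> List[str]:
    """Pair every inflected form of word 1 with every form of word 2,
    quoting each pair so a search engine treats it as an exact phrase."""
    forms_2 = list(forms_2)
    return [f'"{w1} {w2}"' for w1 in forms_1 for w2 in forms_2]

def web_count(queries: Iterable[str], hit_count: Callable[[str], int]) -> int:
    """Assumed combination rule: add up the hits of all generated queries."""
    return sum(hit_count(q) for q in queries)

# Invented numbers standing in for a real search engine (hypothetical).
FAKE_HITS = {
    '"airplanes flew"': 120,
    '"airplane flew"': 300,
    '"airplanes fly"': 450,
    '"airplane flies"': 200,
}

def fake_hit_count(query: str) -> int:
    """Look up a query in the toy table; unseen queries get zero hits."""
    return FAKE_HITS.get(query, 0)

queries = build_queries(["airplane", "airplanes"], ["fly", "flies", "flew"])
total = web_count(queries, fake_hit_count)
print(len(queries), total)
```

Passing the hit-count function as a parameter keeps the sketch independent of any particular search engine; a real backend would also need to deal with the noise sources listed on the slide.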

  4. Combining Web and Corpus Counts • Backoff model: if the n-gram's corpus count falls below a threshold, use the web count • Interpolation model: the count is a combination of web and corpus counts, weighted by λ • Threshold and λ are set through tuning on a development corpus
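
The two combination schemes on this slide can be sketched as below. The slide's interpolation formula did not survive in the transcript, so the linear form `lam * corpus + (1 - lam) * web` is an assumption, and the function names are hypothetical.

```python
def backoff_count(corpus_count: float, web_count: float, threshold: float) -> float:
    """Backoff: trust the corpus count unless it falls below the threshold,
    in which case fall back to the web count."""
    return corpus_count if corpus_count >= threshold else web_count

def interpolated_count(corpus_count: float, web_count: float, lam: float) -> float:
    """Interpolation (assumed linear form): weight the corpus count by lam
    and the web count by 1 - lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * corpus_count + (1.0 - lam) * web_count

# Invented counts: an n-gram unseen in the corpus but present on the web.
print(backoff_count(0, 50, threshold=1))
print(interpolated_count(10, 30, lam=0.5))
```

In practice `threshold` and `lam` would be chosen on a development corpus, as the slide notes, for example by trying a small grid of values and keeping the best-scoring one.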

  5. Candidate selection for MT • a. Die Geschichte ändert sich, nicht jedoch die Geographie. • b. {History, story, tale, saga, strip} changes but geography does not. • Translate verb-object pairs where the translation of the verb is known; select the most likely noun that goes with it • Web counts are obtained for all verb/object translations and the noun is selected using • Collocation frequency: f(v,n) • Conditional probability: f(v,n)/f(n)
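
The two selection criteria on this slide can be compared with a small sketch. The counts below are invented for illustration, and `select_noun` is a hypothetical helper rather than the original implementation.

```python
from typing import Dict, Iterable, Tuple

def select_noun(
    verb: str,
    candidates: Iterable[str],
    pair_count: Dict[Tuple[str, str], int],
    noun_count: Dict[str, int],
    use_conditional: bool = False,
) -> str:
    """Pick the candidate noun with the highest score for the given verb.

    Score is f(v, n) by default, or f(v, n) / f(n) when use_conditional is set.
    """
    def score(noun: str) -> float:
        f_vn = pair_count.get((verb, noun), 0)
        if not use_conditional:
            return f_vn
        f_n = noun_count.get(noun, 0)
        return f_vn / f_n if f_n else 0.0

    return max(candidates, key=score)

# Invented counts (not real web data).
pair_count = {("change", "history"): 400, ("change", "story"): 500, ("change", "tale"): 20}
noun_count = {"history": 1000, "story": 5000, "tale": 100}
nouns = ["history", "story", "tale"]

by_frequency = select_noun("change", nouns, pair_count, noun_count)
by_conditional = select_noun("change", nouns, pair_count, noun_count, use_conditional=True)
print(by_frequency, by_conditional)
```

With these toy numbers the raw collocation frequency favors the very common noun "story", while dividing by f(n) corrects for noun frequency and picks "history"; this shows how the two criteria can disagree.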

  6. Results

  7. Bracketing of Compound Nouns • [backup compiler] disk vs. backup [compiler disk] • Lauer: sophisticated dependency model using a thesaurus: 77.5% accuracy
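
A count-based version of the bracketing decision might look like the sketch below. It is a simplification under stated assumptions: raw pair counts stand in for Lauer's thesaurus-based estimates, the counts are invented, the `bracket` helper is hypothetical, and ties are resolved to the left bracketing arbitrarily. The dependency variant compares (n1, n2) against (n1, n3); the adjacency variant compares (n1, n2) against (n2, n3).

```python
from typing import Callable

def bracket(n1: str, n2: str, n3: str,
            count: Callable[[str, str], int],
            model: str = "dependency") -> str:
    """Choose between [n1 n2] n3 and n1 [n2 n3] by comparing pair counts."""
    left = count(n1, n2)
    if model == "dependency":
        right = count(n1, n3)   # does n1 attach to the head n3 instead?
    elif model == "adjacency":
        right = count(n2, n3)   # is n2 n3 the tighter adjacent pair?
    else:
        raise ValueError(f"unknown model: {model}")
    if left >= right:
        return f"[{n1} {n2}] {n3}"
    return f"{n1} [{n2} {n3}]"

# Invented pair counts (not real corpus or web data).
PAIR_COUNTS = {("backup", "compiler"): 5, ("backup", "disk"): 80, ("compiler", "disk"): 12}

def toy_count(a: str, b: str) -> int:
    """Look up a pair in the toy table; unseen pairs count as zero."""
    return PAIR_COUNTS.get((a, b), 0)

dep = bracket("backup", "compiler", "disk", toy_count, model="dependency")
adj = bracket("backup", "compiler", "disk", toy_count, model="adjacency")
print(dep, "|", adj)
```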

  8. Interpretation of compound nouns • Pet spray: a spray for pets • Onion tears: tears caused by onions • Lauer: avoids hand labeling with semantic tags by using prepositions: • War story: must find “story about the war” in the corpus over “story for the war” • Uses 8 prepositions • Model: p* = argmax_p P(p | n1, n2) • To avoid sparse data:
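
The argmax on this slide can be illustrated with a relative-frequency estimate over paraphrase counts. Several things here are assumptions: the slide does not enumerate the eight prepositions, so the list below is supplied for illustration; the paraphrase template "n2 p the n1" follows the slide's "story about the war" example; the counts are invented; and the slide's own remedy for sparse data is not reproduced because it is missing from the transcript.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumed preposition inventory (the slide only says there are eight).
PREPOSITIONS = ["of", "for", "in", "at", "on", "from", "with", "about"]

def best_preposition(
    n1: str, n2: str, phrase_count: Callable[[str], int]
) -> Tuple[Optional[str], Dict[str, float]]:
    """Estimate P(p | n1, n2) by relative frequency of 'n2 p the n1'
    and return the highest-scoring preposition with the full distribution."""
    counts = {p: phrase_count(f"{n2} {p} the {n1}") for p in PREPOSITIONS}
    total = sum(counts.values())
    if total == 0:
        return None, {}          # no evidence for any paraphrase
    probs = {p: c / total for p, c in counts.items()}
    return max(probs, key=probs.get), probs

# Invented phrase counts (not real corpus or web data).
PHRASE_COUNTS = {
    "story about the war": 30,
    "story of the war": 18,
    "story for the war": 2,
}

def toy_phrase_count(phrase: str) -> int:
    """Look up a paraphrase in the toy table; unseen phrases count as zero."""
    return PHRASE_COUNTS.get(phrase, 0)

best, probs = best_preposition("war", "story", toy_phrase_count)
print(best, round(probs[best], 2))
```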