
Eugene Agichtein and Silviu Cucerzan Microsoft Research

Predicting Accuracy of Extracting Information from Unstructured Text Collections. Eugene Agichtein and Silviu Cucerzan Microsoft Research. Web Documents. Information Extraction System. Extracting and Managing Information in Text. Text Document Collections.


Presentation Transcript


  1. Predicting Accuracy of Extracting Information from Unstructured Text Collections Eugene Agichtein and Silviu Cucerzan Microsoft Research

  2. Extracting and Managing Information in Text • Text document collections (web documents, blogs, news alerts, …): varying properties, different languages, varying consistency, noise/errors • Information extraction system: a complex problem, usually with many parameters; tuning is often required • Extracted output: entities, relations, events • Success ~ accuracy

  3. The Goal: Predict Extraction Accuracy Estimate the expected success of an IE system that relies on contextual patterns before • running expensive experiments • tuning parameters • training the system Useful when adapting an IE system to • a new task • a new document collection • a new language

  4. Specific Extraction Tasks • Named Entity Recognition (NER), with entity types Organization, Location, Person, and Misc: "European champions Liverpool paved the way to the group stages of the Champions League taking a 3-1 lead over CSKA Sofia on Wednesday [...] Gerard Houllier's men started the match in Sofia on fire with Steven Gerrard scoring [...]" • Relation Extraction (RE): "Abraham Lincoln was born on Feb. 12, 1809, in a log cabin in Hardin (now Larue) County, Ky"

  5. Contextual Clues • NER (left context, entity, right context): "… yesterday, Mrs Clinton told reporters the move to the East Room" • RE (left context, first entity, middle context, second entity, right context): "engineers Orville and Wilbur Wright built the first working airplane in 1903."
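
A minimal sketch of how left and right contexts could be collected around an entity mention. The function name `context_windows`, the whitespace tokenization, and the window size `k=3` are illustrative assumptions, not details taken from the slides.

```python
def context_windows(tokens, entity, k=3):
    """Return a (left, right) pair of word lists for each occurrence of `entity`.

    `tokens` and `entity` are lists of words; `k` is the number of words
    kept on each side (an assumed default, not a value from the talk).
    """
    n = len(entity)
    windows = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity:
            left = tokens[max(0, i - k):i]      # up to k words before the mention
            right = tokens[i + n:i + n + k]     # up to k words after the mention
            windows.append((left, right))
    return windows


# Hypothetical pre-tokenized version of the NER example sentence.
sentence = "yesterday , Mrs Clinton told reporters the move to the East Room".split()
windows = context_windows(sentence, ["Clinton"], k=3)
print(windows)
```

For relation extraction, the same idea would extend to a middle context between the two arguments; that variant is not shown here.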

  6. Approach: Language Modelling • Presence of contextual clues for a task appears related to extraction difficulty • The more “obvious” the clues, the easier the task • Can be modelled as “unexpectedness” of a word • Use Language Modelling (LM) techniques to quantify intuition

  7. Language Models (LM) • An LM is a summary of the word distribution in text • Can define unigram, bigram, trigram, n-gram models • More complex models exist (distance, syntax, word classes), but they are not robust for the web, other languages, … • LMs are used in IR, ASR, text classification, clustering: • Query Clarity: predicting query performance [Cronen-Townsend et al., SIGIR 2002] • Context modelling for NER [Cucerzan et al., EMNLP 1999], [Klein et al., CoNLL 2003] …

  8. Document Language Models • A basic LM is a normalized word histogram for the document collection • Unigram (word) models commonly used • Higher-order n-grams (bigrams, trigrams) can be used
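
As a rough illustration of "a normalized word histogram", the sketch below builds a unigram model from a token list. The function name `unigram_lm` and the toy collection are assumptions made for the example.

```python
from collections import Counter


def unigram_lm(tokens):
    """Map each word to its relative frequency in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty token list yields an empty model (no division is performed).
    return {word: count / total for word, count in counts.items()}


# Toy collection, purely illustrative.
collection = "the cat sat on the mat".split()
lm_bg = unigram_lm(collection)
print(lm_bg["the"])
```

Higher-order models could be built the same way by counting tuples of adjacent words instead of single words.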

  9. Context Language Models • NER: PERSON • "Senator Christopher Dodd, D-Conn., named general chairman of the Democratic National Committee last week by President Bill Clinton, said it was premature to talk about lifting the U.S. embargo against Cuba…" • "Although the Clinton's health plan failed to make it through Congress this year, Mrs Clinton vowed continued support for the proposal." • "A senior White House official, who accompanied Clinton, told reporters…" • RE: INVENTIONS • "By the fall of 1905, the Wright brothers' experimental period ended. With their third powered airplane, they now routinely made flights of several …" • "Against this backdrop, we see the Wright brothers efforts to develop an airplane …"

  10. Key Observation • If normally rare words consistently appear in contexts around entities, the extraction task tends to be "easier". • Contexts for a task are an intrinsic property of the collection and the extraction task, and are not restricted to a specific information extraction system.

  11. Divergence Measures • Cosine Divergence: • Relative entropy: KL Divergence
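
The formulas on this slide did not survive in the transcript. The sketch below uses the textbook definitions as an assumption: KL(P || Q) = sum over w of P(w) log(P(w) / Q(w)), and cosine divergence taken as one minus the cosine similarity of the two word-probability vectors. The `floor` parameter is an illustrative smoothing choice, not something stated in the talk.

```python
import math


def kl_divergence(p, q, floor=1e-10):
    """Relative entropy KL(p || q) between two word -> probability dicts.

    Words missing from `q` are floored to a tiny probability (an assumed
    smoothing choice) so the logarithm stays defined.
    """
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), floor))
        for w, pw in p.items()
        if pw > 0
    )


def cosine_divergence(p, q):
    """One minus the cosine similarity of the two probability vectors."""
    dot = sum(pw * q.get(w, 0.0) for w, pw in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # treat an empty model as maximally divergent
    return 1.0 - dot / (norm_p * norm_q)


# Toy models: a context model concentrated on one word vs. a flat background.
lm_c = {"born": 1.0}
lm_bg = {"born": 0.5, "the": 0.5}
print(kl_divergence(lm_c, lm_bg), cosine_divergence(lm_c, lm_bg))
```

Note that KL divergence is asymmetric; the slides consistently place the context model first and the background model second.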

  12. Interpreting Divergence: Reference LM • Need to calibrate the observed divergence • Compute the reference model LMR: pick K random non-stopwords R and compute the context language model around each Ri, e.g., "… the five-star Hotel Astoria is a symbol of elegance and comfort. With an unbeatable location in St Isaac's Square in the heart of St Petersburg, ..." • Normalized KL(LMC) = • Normalization corrects for the bias introduced by small sample size
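
One possible way to build such a reference model is sketched below. Several choices are assumptions rather than details from the slides: the tiny `STOPWORDS` set, sampling uniformly over distinct word types, single-word targets, and a window of `k=3` words on each side.

```python
import random
from collections import Counter

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "with"}


def context_lm(tokens, targets, k=3):
    """Unigram model over words within k positions of any word in `targets`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in targets:
            counts.update(tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k])
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def reference_lm(tokens, num_words, k=3, seed=0):
    """Context model around `num_words` randomly chosen non-stopwords."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    vocab = sorted({t for t in tokens if t not in STOPWORDS})
    sample = set(rng.sample(vocab, min(num_words, len(vocab))))
    return context_lm(tokens, sample, k)


tokens = ("the five-star Hotel Astoria is a symbol of elegance and comfort "
          "with an unbeatable location in the heart of St Petersburg").split()
lm_r = reference_lm(tokens, num_words=2)
print(lm_r)
```

Because the reference words are arbitrary, their contexts should look increasingly like the collection as a whole when more words are sampled.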

  13. Reference LM (cont) • LMR converges to LMBG for large sample sizes • Divergence of LMR is substantial for small samples

  14. Predicting Extraction Accuracy: The Algorithm • Start with a small sample S of entities (or relation tuples) to be extracted • Find occurrences of S in the given collection • Compute LMBG for the collection • Compute LMC for S and the collection • Pick |S| random words R from LMBG • Compute the context LM for R: LMR • Compute KL(LMC || LMBG) and KL(LMR || LMBG) • Return normalized KL(LMC)
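
The steps above can be sketched end to end as follows. This is a toy version under stated assumptions: entities are single words, contexts are `k` words on each side, reference words are sampled uniformly from the vocabulary, and the normalized score is taken to be the ratio KL(LMC || LMBG) / KL(LMR || LMBG). The exact normalization formula is not preserved in the transcript, so the ratio is a guess at its form.

```python
import math
import random
from collections import Counter


def lm_from_counts(counts):
    """Normalize a Counter into a word -> probability dict."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def context_lm(tokens, targets, k=3):
    """Unigram model over words within k positions of any target word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in targets:
            counts.update(tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k])
    return lm_from_counts(counts)


def kl(p, q, floor=1e-10):
    """KL(p || q); `floor` guards against zero probabilities in q."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), floor)) for w, pw in p.items())


def normalized_kl(tokens, sample, k=3, seed=0):
    """Assumed score: KL(LMC || LMBG) divided by KL(LMR || LMBG)."""
    lm_bg = lm_from_counts(Counter(tokens))           # background model
    lm_c = context_lm(tokens, set(sample), k)         # contexts of the sample S
    rng = random.Random(seed)
    vocab = sorted(set(tokens) - set(sample))
    reference = set(rng.sample(vocab, min(len(sample), len(vocab))))
    lm_r = context_lm(tokens, reference, k)           # contexts of |S| random words
    kl_c, kl_r = kl(lm_c, lm_bg), kl(lm_r, lm_bg)
    return kl_c / kl_r if kl_r > 0 else float("inf")


# Toy collection, purely illustrative.
tokens = ("Mrs Clinton told reporters yesterday . A senior official who "
          "accompanied Clinton told reporters . President Bill Clinton said it "
          "was premature . The hotel is a symbol of elegance and comfort .").split()
score = normalized_kl(tokens, ["Clinton"])
print(score)
```

On a collection this small the score is not meaningful; the sketch only shows how the pieces fit together.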

  15. Experimental Evaluation • How to measure success? • Compare predicted ease of task vs. observed extraction accuracy • Extraction Tasks: NER and RE • NER: Datasets from the CoNLL 2002, 2003 evaluations • RE: Binary relations between NEs and generic phrases

  16. Extraction Task Accuracy [tables: NER, RE]

  17. Document Collections • Note that the Spanish and Dutch corpus sizes are much smaller

  18. Predicting NER Performance (English) • Absolute and normalized KL-divergence (Reuters 1/10, context = 3 words, stopwords discarded, avg) • LOC exception: large overlap between locations in the training and test collections (i.e., simple gazetteers are effective)

  19. NER – Robustness / Different Dimensions • Counting stopwords (w) or not (w/o): Reuters 1/100, context ±3, avg • Context size: Reuters 1/100, no stopwords, avg • Corpus size: Reuters, context ±3, no stopwords, avg

  20. Other Dimensions: Sample Size • Normalized divergence of LMC remains high • Contrast with LMR for larger sample sizes

  21. Other Dimensions: N-gram Size • Higher-order n-grams may help in some cases.

  22. Other Languages • Spanish • Dutch • Problem: very small collections

  23. Predicting RE Performance (English) • 2- and 3-word contexts correctly distinguish between "easy" tasks (BORN, DIED) and "difficult" tasks (INVENT, WROTE). • A 1-word context does not appear to be sufficient for predicting RE performance

  24. Other Dimensions: Sample Size • Divergence increases with sample size

  25. Results Summary • Context models can be effective in predicting the success of information extraction systems • Even a small sample of available entities can be sufficient for making accurate predictions • The size of the available collection is the most important limiting factor

  26. Other Applications and Future Work • Could use results for: • Active learning/training IE • Improved boundary detection for NER • Improved confidence estimation of extraction, e.g., Culotta and McCallum [HLT 2004] • For better results, could incorporate: • Internal contexts, gazetteers (e.g., for LOC entities), e.g., Agichtein & Ganti [KDD 2004], Cohen & Sarawagi [KDD 2004] • Syntax/logical distance • Coreference resolution • Word classes

  27. Summary • Presented the first attempt to predict information extraction accuracy for a given task and collection • Developed a general, system-independent method utilizing language modelling techniques • Estimates for extraction accuracy can help: • Deploy information extraction systems • Port information extraction systems to new tasks, domains, collections, and languages

  28. For More Information • Text Mining, Search, and Navigation Group: http://research.microsoft.com/tmsn/
