Information Extraction from Web Documents

Presentation Transcript

  1. Information Extraction from Web Documents • CS 652 Information Extraction and Integration • Li Xu, Yihong Ding

  2. IR and IE • IR (Information Retrieval) • Retrieves relevant documents from collections • Information theory, probabilistic theory, and statistics • IE (Information Extraction) • Extracts relevant information from documents • Machine learning, computational linguistics, and natural language processing

  3. History of IE • Large amount of both online and offline textual data. • Message Understanding Conference (MUC) • Quantitative evaluation of IE systems • Tasks • Latin American terrorism • Joint ventures • Microelectronics • Company management changes

  4. Evaluation Metrics • Precision • Recall • F-measure
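
The slide only names the three metrics. As a minimal sketch, assuming an IE system's output is scored as a set of extracted (slot, value) pairs against a hand-labeled gold set, they can be computed as follows. The function name, the `beta` parameter, and the sample slot fills are illustrative and do not come from the slides.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Score a set of extracted items against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: fraction of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of gold items that were extracted.
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: weighted harmonic mean; beta = 1 weights both equally.
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical slot fills for one document (exact-match scoring).
gold = {("perpetrator", "Carlos"), ("target", "Parliament building")}
extracted = {("perpetrator", "Carlos"), ("target", "Parliament")}
p, r, f = precision_recall_f(extracted, gold)
```

Here one of the two extracted fills matches the gold set exactly, so precision, recall, and F all come out to 0.5. Exact matching is a simplifying assumption; an evaluation may also give partial credit.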

  5. Web Documents • Unstructured (Free) Text • Regular sentences and paragraphs • Linguistic techniques, e.g., NLP • Structured Text • Itemized information • Uniform syntactic clues, e.g., table understanding • Semistructured Text • Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …) • Specialized programs, e.g., wrappers

  6. Approaches to IE • Knowledge Engineering • Grammars are constructed by hand • Domain patterns are discovered by human experts through introspection and inspection of a corpus • Much laborious tuning and “hill climbing” • Machine Learning • Use statistical methods when possible • Learn rules from annotated corpora • Learn rules from interaction with user

  7. Knowledge Engineering • Advantages • With skill and experience, well-performing systems are not conceptually hard to develop. • The best-performing systems have been handcrafted. • Disadvantages • Very laborious development process • Some changes to specifications can be hard to accommodate • Required expertise may not be available

  8. Machine Learning • Advantages • Domain portability is relatively straightforward • System expertise is not required for customization • “Data driven” rule acquisition ensures full coverage of examples • Disadvantages • Training data may not exist, and may be very expensive to acquire • Large volume of training data may be required • Changes to specifications may require reannotation of large quantities of training data

  9. Wrapper • A specialized program that identifies data of interest and maps them to some suitable format (e.g., XML or relational tables) • Challenge: recognizing the data of interest among many other uninteresting pieces of text • Tasks • Source understanding • Data processing
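
To make the wrapper idea concrete, here is a minimal hand-written sketch: it picks the data of interest out of a page and maps it to XML, ignoring the surrounding text. The page fragment, the field names (`model`, `year`, `price`), and the regular expression are all invented for illustration; they are not taken from any system in the slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment; markup and content are invented.
PAGE = """
<html><body>
<h1>Used cars</h1>
<p>Ads updated daily.</p>
<ul>
<li><b>Honda Civic</b> <i>1998</i> $4,500</li>
<li><b>Ford Escort</b> <i>1995</i> $2,100</li>
</ul>
</body></html>
"""

# One pattern per record: the tags act as delimiters around each field.
ITEM = re.compile(
    r"<li><b>(?P<model>[^<]+)</b>\s*<i>(?P<year>\d{4})</i>\s*\$(?P<price>[\d,]+)</li>"
)


def wrap(page):
    """Map every matching record on the page to a <car> XML element."""
    root = ET.Element("cars")
    for match in ITEM.finditer(page):
        car = ET.SubElement(root, "car")
        for field, value in match.groupdict().items():
            ET.SubElement(car, field).text = value
    return root


xml_text = ET.tostring(wrap(PAGE), encoding="unicode")
```

The heading and the "Ads updated daily." paragraph are skipped because they do not match the record pattern. A hand-coded pattern like this breaks as soon as the page layout changes, which is the motivation for the learned wrappers covered later in the deck.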

  10. Free Text • AutoSlog • LIEP • PALKA • HASTEN • CRYSTAL • Webfoot • WHISK

  11. AutoSlog [1993] The Parliament building was bombed by Carlos.

  12. LIEP [1995] The Parliament building was bombed by Carlos.

  13. PALKA [1995] The Parliament building was bombed by Carlos.

  14. HASTEN [1995] The Parliament building was bombed by Carlos. Egraphs (SemanticLabel, StructuralElement)

  15. CRYSTAL [1995] The Parliament building was bombed by Carlos.

  16. CRYSTAL + Webfoot [1997]

  17. WHISK [1999] The Parliament building was bombed by Carlos. • WHISK Rule: *(PhyObj)*@passive *F ‘bombed’ * {PP ‘by’ *F (Person)} • Context-based patterns
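
As a rough illustration of what a context-based pattern does, the sketch below approximates the rule above with an ordinary Python regular expression. This is not WHISK's rule language: the syntactic tags (`@passive`, `PP`) are replaced by the literal words "was bombed" and "by", and the semantic classes `PhyObj` and `Person` are stood in for by small invented word lists.

```python
import re

# Hypothetical semantic classes; a real system would rely on a lexicon
# or a semantic tagger rather than fixed word lists.
PHY_OBJ = ["Parliament building", "embassy", "bridge"]
PERSON = ["Carlos", "Maria"]


def alt(words):
    """Build a regex alternation from a list of literal phrases."""
    return "|".join(re.escape(w) for w in words)


# Target first, then a passive "was bombed", then "by" + perpetrator.
PATTERN = re.compile(
    rf"(?P<target>{alt(PHY_OBJ)})"
    rf".*?\bwas\s+bombed\b"
    rf".*?\bby\s+(?P<perpetrator>{alt(PERSON)})"
)


def extract(sentence):
    """Return the filled slots, or None if the pattern does not apply."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None


result = extract("The Parliament building was bombed by Carlos.")
```

For the example sentence, both slots are filled in a single match (`target` with "Parliament building", `perpetrator` with "Carlos"). An active-voice sentence such as "Carlos bombed the Parliament building." does not match, because the pattern is tied to the passive context.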

  18. Web Documents • Semistructured and Unstructured • RAPIER (E. Califf, 1997) • SRV (D. Freitag, 1998) • WHISK (S. Soderland, 1998) • Semistructured and Structured • WIEN (N. Kushmerick, 1997) • SoftMealy (C-H. Hsu, 1998) • STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)

  19. Inductive Learning • Task • Inductive Inference • Learning Systems • Zero-order • First-order, e.g., Inductive Logic Programming (ILP)

  20. RAPIER [1997] • Inductive Logic Programming • Extraction Rules • Syntactic information • Semantic information • Advantage • Efficient learning (bottom-up) • Drawback • Single-slot extraction

  21. RAPIER Rule

  22. SRV [1998] • Relational Algorithm (top-down) • Features • Simple features (e.g., length, character type, …) • Relational features (e.g., next-token, …) • Advantages • Expressive rule representation • Drawbacks • Single-slot rule generation • Requires a large volume of training data
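
The difference between the two feature kinds can be sketched in a few lines: a simple feature maps a token to a value, while a relational feature maps a token to another token, so conditions can be chained along the text. The code below is only an illustration of that distinction, not SRV itself; the `Token` class, the feature functions, and the hand-written rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    index: int


def tokenize(text):
    """Whitespace tokenizer; each token remembers its position."""
    return [Token(t, i) for i, t in enumerate(text.split())]


# Simple features: token -> value.
def length(tok):
    return len(tok.text)


def capitalized(tok):
    return tok.text[:1].isupper()


def numeric(tok):
    return tok.text.isdigit()


# Relational features: token -> neighboring token (or None at the edge).
def next_token(tokens, tok):
    i = tok.index + 1
    return tokens[i] if i < len(tokens) else None


def prev_token(tokens, tok):
    i = tok.index - 1
    return tokens[i] if i >= 0 else None


def matches(tokens, tok):
    """Hand-written example rule: a capitalized token preceded by 'Dr.'."""
    prev = prev_token(tokens, tok)
    return capitalized(tok) and prev is not None and prev.text == "Dr."


tokens = tokenize("Contact Dr. Smith at extension 4521")
hits = [t.text for t in tokens if matches(tokens, t)]
```

Only "Smith" satisfies the rule here. A learner in this style would search for such conjunctions of feature tests automatically instead of having them written by hand.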

  23. SRV Rule

  24. WHISK [1998] • Covering Algorithm (top-down) • Advantages • Learns multi-slot extraction rules • Handles various orders of the items to be extracted • Handles document types from free text to structured text • Drawbacks • Must see all the permutations of items • Less expressive feature set • Needs a large volume of training data

  25. WHISK Rule

  26. WIEN [1997] • Assumes • Items are always in fixed, known order • Introduces several types of wrappers • Advantages • Fast to learn and extract • Drawbacks • Cannot handle permutations and missing items • Must label entire pages • Does not use semantic classes
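
The fixed-order assumption is easiest to see in a delimiter-based wrapper. The sketch below assumes the simplest such form, one left and one right delimiter per attribute (often called an LR wrapper), and shows only the extraction step, not the learning of the delimiters. The function name and the sample page are illustrative assumptions.

```python
def exec_lr(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute.

    The pairs must be listed in the fixed order in which the attributes
    appear on the page.
    """
    rows, pos = [], 0
    # Keep going while another first-attribute delimiter lies ahead.
    while page.find(delimiters[0][0], pos) != -1:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # incomplete tuple: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Hypothetical page listing countries and calling codes.
PAGE = (
    "<html><b>Congo</b> <i>242</i><br>"
    "<b>Egypt</b> <i>20</i><br>"
    "<b>Spain</b> <i>34</i><br></html>"
)
rows = exec_lr(PAGE, [("<b>", "</b>"), ("<i>", "</i>")])

# A record with a missing code: the scan picks up the next record's value.
BROKEN = "<b>Congo</b><br><b>Egypt</b> <i>20</i><br>"
misaligned = exec_lr(BROKEN, [("<b>", "</b>"), ("<i>", "</i>")])
```

On the regular page the three (country, code) tuples come out in order. On the page with a missing code, "Congo" is paired with Egypt's code, which illustrates the drawback listed above: a wrapper of this kind cannot cope with missing items or permutations.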

  27. WIEN Rule

  28. SoftMealy [1998] • Learns a transducer • Advantages • Learns order of items • Allows item permutations and missing items • Allows both the use of semantic classes and disjunctions • Drawbacks • Must see all possible permutations • Cannot use delimiters that do not immediately precede and follow the relevant items

  29. SoftMealy Rule

  30. STALKER [1998,1999,2001] • Hierarchical Information Extraction • Embedded Catalog Tree (ECT) Formalism • Advantages • Extracts nested data • Allows item permutations and missing items • Need not see all of the permutations • One hard-to-extract item does not affect others • Drawbacks • Does not exploit item order
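
A small sketch can show why per-item rules tolerate permutations and missing items. It assumes landmark-style rules in which each item has its own start rule (a sequence of landmarks to skip to) and an end landmark, applied independently to the parent text. Several things are simplifications made for this example: the end landmark is searched forward from the start position, rules are hand-written rather than learned, landmarks are plain strings, and the documents and field names are invented.

```python
def skip_to(text, landmarks, start=0):
    """Skip past each landmark in turn; return the position after the
    last one, or -1 if any landmark is not found."""
    pos = start
    for landmark in landmarks:
        i = text.find(landmark, pos)
        if i == -1:
            return -1
        pos = i + len(landmark)
    return pos


def extract(text, start_rule, end_landmark):
    """Extract one item, or None if its rule does not apply."""
    begin = skip_to(text, start_rule)
    if begin == -1:
        return None
    end = text.find(end_landmark, begin)
    return text[begin:end] if end != -1 else None


# Hypothetical per-item rules: (start landmarks, end landmark).
RULES = {
    "name": (["Name:", "<b>"], "</b>"),
    "address": (["Address:", "<i>"], "</i>"),
    "phone": (["Phone:", "<u>"], "</u>"),
}


def extract_all(text):
    """Apply every rule independently to the same parent text."""
    return {field: extract(text, s, e) for field, (s, e) in RULES.items()}


DOC_A = "<p>Name: <b>Yala</b><p>Address: <i>4000 Colfax</i><p>Phone: <u>555-0100</u>"
# Items permuted and the address missing.
DOC_B = "<p>Phone: <u>555-0199</u><p>Name: <b>Sushi Ko</b>"

record_a = extract_all(DOC_A)
record_b = extract_all(DOC_B)
```

Because each rule runs on its own, the second document still yields its name and phone, and the missing address simply comes back as `None` without disturbing the other items. For nested data, the same functions could be applied again to an extracted substring, one level of the tree at a time.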

  31. STALKER Rule

  32. Web IE Tools (main technique used) • Wrapper languages (TSIMMIS, Web-OQL) • HTML-aware (W4F, XWRAP, RoadRunner, Lixto) • NLP-based (RAPIER, SRV, WHISK) • Inductive learning (WIEN, SoftMealy, STALKER) • Modeling-based (NoDoSE, DEByE) • Ontology-based (BYU ontology)

  33. Degree of Automation • Trade-off: page-layout dependent • RoadRunner • Assumes target pages were automatically generated from some data source • The only fully automatic wrapper generator • BYU ontology • Manually created with a graphical editing tool • Extraction process fully automatic

  34. Support of Complex Objects • Complex objects: nested objects, graphs, trees, complex tables, … • Earlier tools, such as RAPIER, SRV, WHISK, and WIEN, do not support extraction of complex objects. • BYU ontology: supported

  35. Page Contents • Semistructured data (table type, richly tagged) • Semistructured text (text type, rarely tagged) • NLP-based tools: text type only • Other tools (except ontology-based): table type only • BYU ontology: both types

  36. Ease of Use • HTML-aware tools, easiest to use • Wrapper languages, hardest to use • Other tools, in the middle

  37. Output • XML is the best output format for data sharing on the Web.

  38. Support for Non-HTML Sources • NLP-based and ontology-based tools: supported automatically • Other tools: may support them but need additional helpers, such as syntactic and semantic analyzers • BYU ontology: supported

  39. Resilience and Adaptiveness • Resilience: continuing to work properly when the target pages change • Adaptiveness: working properly with pages from other sources in the same application domain • Only the BYU ontology has both features.

  40. Summary of Qualitative Analysis

  41. Graphical Perspective of Qualitative Analysis

  42. X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus provides the required training data; ? means the system has the capability to some degree; * means the extraction pattern itself does not show the capability, but the overall system has it.

  43. Problem of IE (unstructured documents) • [Figure labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]

  44. Problem of IE (structured documents) • [Figure labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]

  45. Problem of IE (semistructured documents) • [Figure labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]

  46. Solution of IE (the Semantic Web) • [Figure labels: Data, Information, Knowledge, Meaning; Source, Target; Information Extraction]