
The Web’s Many Models




Presentation Transcript


  1. The Web’s Many Models
  Michael J. Cafarella, University of Michigan
  AKBC, May 19, 2010

  2. Web Information Extraction
  • Much recent research in information extractors that operate over Web pages:
    • Snowball (Agichtein and Gravano, 2001)
    • TextRunner (Banko et al., 2007)
    • Yago (Suchanek et al., 2007)
    • WebTables (Cafarella et al., 2008)
    • DBPedia, ExDB, and Freebase (which make use of IE data)
  • A Web crawl plus domain-independent IE should allow comprehensive Web KBs with:
    • very high, “web-style” recall
    • “more-expressive-than-search” query processing
  • But where is it?

  3. Web Information Extraction
  • Omnivore: “Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009, Asilomar, CA.
    • Suggested remedies for data ingestion and user interaction
  • This talk explains why the ideas in that paper might already be out of date, and gives alternative ideas
  • If there are mistakes here, then you have a chance to save me years of work!

  4. Outline
  • Introduction
  • Data Ingestion
    • Previously: Parallel Extraction
    • Alternative: The Data-Centric Web
  • User Interaction
    • Previously: Model Generation for Output
    • Alternative: Data Integration as UI
  • Conclusion

  5. Parallel Extraction
  • Previous hypothesis:
    • There are many data models for interesting data, e.g., relational tables, E/R graphs, etc.
    • We should build a large integration infrastructure to consume many extraction streams

  6. Database Construction (1)
  • Start with a single large Web crawl

  7. Database Construction (2)
  • Each of k extractors emits output that:
    • has an extractor-dependent model
    • has an extractor-and-Web-page-dependent schema
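To make that heterogeneity concrete, here are illustrative shapes for two of the extraction streams named on slide 2. These are sketches of my own, not the systems’ actual output formats.

```python
# Illustrative output shapes for two extractors (assumptions, not the
# systems' actual formats): each stream has its own model and schema.

# TextRunner-style open IE: (subject, relation, object) strings
textrunner_triple = ("Ann Arbor", "is a city in", "Michigan")

# WebTables-style relational extraction: a per-page schema plus rows
webtables_table = {
    "schema": ["city", "state", "population"],
    "rows": [["Ann Arbor", "Michigan", "113,934"]],
}
```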

  8. Database Construction (3)
  • For each extractor’s output, unfold the results into a common entity-relation model (a minimal sketch follows)
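A minimal sketch of the unfolding step, reusing the illustrative webtables_table from the sketch above. The heuristic that column 0 names the entity is my assumption, not the talk’s.

```python
def unfold_table(table, subject_col=0):
    """Unfold one extracted table into (entity, attribute, value) triples."""
    triples = []
    for row in table["rows"]:
        entity = row[subject_col]  # assumed: the subject column names the entity
        for col, attr in enumerate(table["schema"]):
            if col != subject_col:
                triples.append((entity, attr, row[col]))
    return triples

print(unfold_table(webtables_table))
# [('Ann Arbor', 'state', 'Michigan'), ('Ann Arbor', 'population', '113,934')]
# A TextRunner-style triple is already in entity-relation form and passes through.
```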

  9. Database Construction (4)
  • Unify the results

  10. Database Construction (5)
  • Emit the final database

  11. Potential Problems
  • Pressing problems:
    • recall
    • simple intra-source reconciliation
    • time
  • Tables and entities are probably OK for now
    • Many data sources (DBPedia, Facebook, IMDB) already match one of these two models pretty well
  • One possible different direction: the Data-Centric Web
    • Addresses recall only

  12–23. The Data-Centric Web
  [Twelve figure-only slides; the visuals were not captured in the transcript.]

  24. Data-Centric Lists
  • Lists of Data-Centric Entities give hints (a sketch of using one such hint follows):
    • about what the target entity contains
    • that all members of the set are DCEs, or that none are
    • that members of the set belong to a class or type (e.g., program committee members)
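One way the second hint could be used is majority smoothing over a list’s members. This is a hypothetical sketch; the function and its 2/3 threshold are my assumptions, not the talk’s algorithm.

```python
# Hypothetical sketch: if most members of a Data-Centric List were
# classified as DCEs, treat the minority as misclassifications.
# The 2/3 threshold is an assumption, not from the talk.

def smooth_with_list(list_urls, is_dce, threshold=2 / 3):
    """Flip classifications inside one list to the list's majority label."""
    votes = sum(is_dce[url] for url in list_urls)
    majority = votes / len(list_urls) >= threshold
    return {url: majority for url in list_urls}

is_dce = {"a.edu/~alice": True, "b.org/bob": True, "c.com/carol": False}
print(smooth_with_list(list(is_dce), is_dce))
# {'a.edu/~alice': True, 'b.org/bob': True, 'c.com/carol': True}
```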

  25. Build the Data-Centric Web
  • Download the Web
  • Train classifiers to detect DCEs and DCLs (a classifier sketch follows)
  • Filter out all pages that fail both tests
  • Use lists to fix up incorrect Data-Centric Entity classifications
  • Run attribute/value extractors on the DCEs
  • This yields an E/R dataset, for insertion into DBPedia, YAGO, etc.
  • In progress now, with student Ashwin Balakrishnan; the entity detector is >95% accurate
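A minimal sketch of the classifier-training step, assuming plain bag-of-words features over page text; the talk does not say which features or learner are behind the reported accuracy, and the toy data here is invented for illustration.

```python
# Minimal DCE page-classifier sketch (assumed features and learner;
# the talk does not specify either). Toy training data for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "Alice Smith associate professor office hours email phone",  # entity page
    "Top ten recipes for summer grilling read more comments",    # not one
]
labels = [1, 0]  # 1 = Data-Centric Entity page

dce_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
dce_clf.fit(pages, labels)
print(dce_clf.predict(["Bob Jones contact CV publications office"]))
```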

  26. Research Question 1
  • How many useful entities…
    • lack a page in the Data-Centric Web (that means no homepage, no Amazon page, no public Facebook page, etc.)
    • AND are otherwise well-described enough online that IE can recover an entity-centric view?
  • Put differently: does every entity worth extracting already have a homepage on the Web?

  27. Research Question 2
  • Does a single real-world entity have more than one “authoritative” URL?
  • Note that Wikipedia provides pretty minimal assistance in choosing the right entity, yet does a good job

  28. Outline
  • Introduction
  • Data Ingestion
    • Previously: Parallel Extraction
    • Alternative: The Data-Centric Web
  • User Interaction
    • Previously: Model Generation for Output
    • Alternative: Data Integration as UI
  • Conclusion

  29. Model Generation for Output
  • Previous hypothesis:
    • Many different user applications are built against a single back-end database
    • The difficult task is translating from the back-end data model to each application’s data model

  30. Query Processing (1)
  • A query arrives at the system

  31. Query Processing (2)
  • The entity-relation database processor yields entity results

  32. Query Processing (3)
  • The Query Renderer chooses an appropriate output schema

  33. Query Processing (4)
  • User corrections are logged and fed into later iterations of database construction

  34. Potential Problems
  • Many plausible front-end applications, but none yet totally compelling and novel:
    • Ad- and search-driven applications are not novel
    • Freebase and Wolfram Alpha are not compelling
    • Raw input to learners is useful, but not an end-user application
  • We need to explore possible applications rather than build multi-app infrastructure
  • One possible different direction: data integration as a user primitive

  35. Data Integration as UI
  • Can we combine tables to create new data sources?
  • There are many existing “mashup” tools, but they ignore the realities of Web data:
    • A lot of useful data is not in XML
    • The user cannot know all sources in advance
    • Integrations are transient
    • Data is dirty

  36. Interaction Challenge
  • Try to create a database of all “VLDB program committee members”

  37. Octopus
  • Provides a “workbench” of data integration operators to build the target database (the operator surface is sketched below)
  • Most operators are not correct/incorrect, but high/low quality (like search)
  • Also provides prosaic, traditional operators
  • Originally ran on WebTables data
  • [VLDB 2009, Cafarella, Khoussainova, Halevy]
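A sketch of what that operator surface might look like as an API. The Table shape and these Python signatures are illustrative assumptions of mine, not the actual Octopus interface; the VLDB 2009 paper defines the real operators.

```python
# Illustrative operator surface (assumed signatures, not Octopus's API).
from dataclasses import dataclass

@dataclass
class Table:
    header: list
    rows: list
    source_url: str = ""
    score: float = 0.0  # operators return ranked guesses, like search results

def SEARCH(query: str) -> list[Table]:
    """Return a ranked list of candidate tables for a keyword query."""
    ...

def CONTEXT(table: Table) -> Table:
    """Recover values implied by the table's source page (e.g., a year)."""
    ...

def UNION(*tables: Table) -> Table:
    """Combine compatible tables into a single dataset."""
    ...

def EXTEND(table: Table, topic: str, col: int) -> Table:
    """Add a column about `topic`, joining on the values in column `col`."""
    ...
```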

  38. Walkthrough - Operator #1
  • SEARCH(“VLDB program committee members”)

  39–40. Walkthrough - Operator #2
  • Recover relevant data from each retrieved table’s source page: CONTEXT()

  41. Walkthrough - Union
  • Combine the datasets: UNION()

  42. Walkthrough - Operator #3
  • Add a column to the data
  • Similar to “join”, but the join target is a topic: EXTEND(“publications”, col=0)
  • The user has integrated data sources with little effort (the full composition is sketched below)
  • No wrappers; the data was never intended for reuse
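Putting the walkthrough together with the operator sketch from slide 37: since those operator bodies are stubs, this is pseudocode showing the call sequence rather than a runnable program, and the user’s picks and comments are invented for illustration.

```python
# Illustrative composition of slides 38-42, using the sketched operators.
candidates = SEARCH("VLDB program committee members")
t_a, t_b = candidates[0], candidates[1]  # the user inspects and keeps two tables
t_a, t_b = CONTEXT(t_a), CONTEXT(t_b)    # e.g., pull each PC's year off its source page
members = UNION(t_a, t_b)                # one table of committee members
members = EXTEND(members, topic="publications", col=0)  # add each member's publications
```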

  43. CONTEXT Algorithms
  • Input: a table and its source page
  • Output: data values to add to the table
  • SignificantTerms sorts the terms in the source page by “importance” (tf-idf); a sketch follows
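A minimal SignificantTerms sketch following the tf-idf description on this slide. The whitespace tokenizer and the toy background corpus are stand-in assumptions.

```python
# Minimal SignificantTerms sketch: rank the source page's terms by tf-idf.
import math
from collections import Counter

def significant_terms(page_text, corpus, k=5):
    """Return the page's top-k terms by tf-idf; these are CONTEXT candidates."""
    tf = Counter(page_text.lower().split())
    def idf(term):
        df = sum(term in doc.lower().split() for doc in corpus)
        return math.log((1 + len(corpus)) / (1 + df))
    return sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)[:k]

corpus = ["vldb 2006 call for papers", "database conference home page"]
print(significant_terms("vldb 2005 program committee members", corpus))
# Page-specific terms like "2005" and "committee" outrank the common "vldb".
```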

  44. Related View Partners
  • Looks for different “views” of the same data

  45. CONTEXT Experiments

  46. Data Integration as UI
  • Compelling for database researchers, but will large numbers of people use it?

  47. Conclusion
  • Automatic Web KBs are rapidly progressing
    • Recall is still not good enough for many tasks, but progress is rapid
  • It is not clear what those tasks should be, and progress there is much slower
    • It is difficult to predict what will be useful
    • It is sometimes difficult to write a “new app” paper
  • Omnivore’s approach is not wrong, but it did not directly address these problems
