
-- MetaQuerier and Beyond -- A Trilogy of Search, Integration, and Mining

-- MetaQuerier and Beyond -- A Trilogy of Search, Integration, and Mining. Kevin C. Chang. Joint work with: Bin He, Zhen Zhang, Joe Kelley, Tao Cheng, Bill Davis, Shui-Lung Chuang. Do you believe it? Google is only the start of search. Web search is still full of challenges and opportunities.


Presentation Transcript


  1. -- MetaQuerier and Beyond -- A Trilogy of Search, Integration, and Mining. Kevin C. Chang. Joint work with: Bin He, Zhen Zhang, Joe Kelley, Tao Cheng, Bill Davis, Shui-Lung Chuang

  2. Do you believe it? Google is only the start of search. Web search is still full of challenges and opportunities. In terms of problems: • Dual challenges we must tackle. In terms of solutions: • Trio techniques we must develop.

  3. The Dual Challenges on the Web: Getting structured data from … • The “deep” Web • semantic-rich, structured data hidden “deeply” inside databases on the Web • structure ready; access non-trivial. • The “surface” Web • semantic-rich, structured data hidden “implicitly” on the surface Web • access ready; structure non-trivial.

  4. I am inspired: Good stories must go in “trio.” Sociology. Science. History.

  5. The Web “Trilogy” (My three circles...): Search, Integration, Mining

  6. First: When we started… On the Internet, search must eventually resort to integration. [Figure: three circles: Search, Integration, Mining]

  7. The previous Web: Search used to be “crawl and index”

  8. The current Web: Search must eventually resort to integration

  9. How to enable effective access to the deep Web? [Figure: example sources: Cars.com, Amazon.com, Biography.com, Apartments.com, 411localte.com, 401carfinder.com]

  10. Amy is a new graduate, just starting her new career • Finding sources: • Wants to upgrade her car: where can she study her options? (cars.com, edmunds.com) • Wants to buy a house: where can she look for houses in her town? (realtor.com) • Wants to write a grant proposal. (NSF Award Search) • Wants to check for patents. (uspto.gov) • Querying sources: • Then, she needs to learn the grueling details of querying each source

  11. MetaQuerier: Exploring and integrating the deep Web • Explorer (FIND sources, building a “db of dbs”): • source discovery • source modeling • source indexing • Integrator (QUERY sources through a unified query interface): • source selection • schema integration • query mediation [Figure: example sources: Amazon.com, Cars.com, Apartments.com, 411localte.com]

  12. Toward large scale integration: MetaQuerier for the deep Web. We are facing very different “large scale” scenarios! • Many sources on the Web, on the order of 10^5. Such integration must be dynamic and ad-hoc: • Dynamic discovery: • Sources are dynamically changing • On-the-fly integration: • Queries are ad-hoc and need different sources • Our proposal: MetaQuerier for the deep Web

  13. Second: Then we realized… Large scale integration must essentially resort to mining of semantics. [Figure: three circles: Search, Integration, Mining]

  14. The challenge boils down to: How to deal with “deep” semantics across a large scale? “Semantics” is the key in integration! • How to understand a query interface? • Where is the first condition? What’s its attribute? • How to match query interfaces? • What does “author” on this source match on that? • How to translate queries? • How to ask this query on that source?

  15. Survey the frontier before going to the battle. We found… • Challenge reassured: • 450,000 online databases • 1,258,000 query interfaces • 307,000 deep web sites • 3-7 times increase in 4 years • Insight revealed: • Web sources are not arbitrarily complex • “Amazon effect” – convergence and regularity naturally emerge

  16. “Amazon effect” in action… Attributes converge in a domain! Condition patterns converge even across domains!

  17. Unified insight: Holistic integration • Holistic integration: • Take a holistic view to account for many sources together in integration • Globally exploit clues across all sources for resolving the “semantics” of interest • A conceptually unifying framework: • Many of our tasks implicitly share this framework

  18. Large scale itself presents opportunity -- Shallow integration across holistic sources • Shallow observable clues: • “underlying” semantics often relate to the “observable” presentations through some way of connection. • Holistic hidden regularities: • Such connections often follow some implicit properties, which reveal themselves holistically across sources [Figure: Semantics (to be discovered) relate to Presentations (observed) through Some Way of Connection following Hidden Regularities; Reverse Analysis goes from presentations back to semantics]

  19. Some evidence for “holistic integration” • Evidence 1: [SIGMOD04] Query Interface Understanding: Hidden-syntax parsing (condition pattern: attribute, operator, value) • Evidence 2: [SIGMOD03, KDD04] Matching Query Interfaces: Hidden-model discovery

  20. Demo. Knocking the Door to the Deep Web

  21. Interface Understanding: Does a hidden syntactic model exist?

  22. Our Paradigm: Best-Effort Visual Language Parsing Framework [Figure: Input: HTML query form → Tokenizer (HTML Layout Engine) → BE-Parser (Ambiguity Resolution, Error Handling), using a 2P Grammar (Productions, Preferences) → Output: semantic structure]

  23. Interface Matching: Does a hidden statistical model exist? • Our view: a hidden statistical model M over a finite vocabulary generates query interfaces (QIs) with different probabilities; instantiation probability: P(QI1|M) • Now the problem is: given the QIs, can we discover M?

  24. Towards hidden model discovery: Statistical schema matching (MGS) 1. Define an abstract model structure M to solve the target question: P(QI|M) = … 2. Given the observed QIs, generate the model candidates (those with P(QIs|M) > 0), e.g., M1, M2 3. Select the model candidate with the highest confidence: what is the confidence of M1 given the observed QIs? [Figure: candidate models M1 and M2 over attribute labels AA, BB, CC, SS, TT, PP]
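
The three steps on this slide (define a model structure, generate candidates consistent with the observed interfaces, select the best candidate) can be sketched in a few lines. This is only a toy under stated assumptions, not the MGS algorithm: a "model" here is a partition of attribute names into synonym groups, the check "no interface uses two attributes from one group" stands in for P(QIs|M) > 0, and "fewest groups" is a placeholder for the confidence measure, which the slides do not specify. The attribute names and interfaces are illustrative.

```python
def partitions(items):
    """Yield every way to split a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # `first` forms its own group ...
        yield [[first]] + part
        # ... or joins one of the existing groups.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def consistent(model, interfaces):
    """Toy stand-in for P(QIs|M) > 0 (an assumption of this sketch):
    synonyms never appear together in the same query interface."""
    return all(len(group & qi) <= 1 for qi in interfaces for group in model)


def select_model(interfaces):
    """Steps 2 and 3: generate candidate models, keep the consistent ones,
    and pick one by a placeholder confidence (fewest groups)."""
    vocabulary = sorted(set().union(*interfaces))
    candidates = [[frozenset(g) for g in p] for p in partitions(vocabulary)]
    feasible = [m for m in candidates if consistent(m, interfaces)]
    return min(feasible, key=len)


# Illustrative observed query interfaces, each a set of attribute names.
observed = [
    {"author", "title"},
    {"name", "title"},
    {"author", "subject"},
    {"name", "subject"},
    {"title", "subject"},
]

best = select_model(observed)
for group in best:
    print(sorted(group))
```

With these interfaces, "author" and "name" never co-occur, so the selected model groups them as one concept and leaves "title" and "subject" separate. Enumerating all partitions is exponential and only workable for a handful of attributes.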

  25. Evidence for holistic integration • Evidence 1: [SIGMOD04] Query Interface Understanding by Hidden-syntax parsing • Evidence 2: [SIGMOD03, KDD04] Query Interface Matching by Hidden-model discovery [Figure, left: Hidden Syntax (Grammar), with Syntactic Composer, Syntactic Analyzer, Query Capabilities, Visual Patterns. Figure, right: Hidden Generative Model, with Statistic Generator, Statistic Analyzer, Attribute Matchings, Attribute Occurrences]

  26. Putting together: The MetaQuerier system • Front-end: Query Execution: • Source Selection • Query Translation • Result Compilation • Deep Web Repository: • Query Interfaces • Query Capabilities • Subject Domains • Unified Interfaces • Back-end: Semantics Discovery: • Database Crawler • Interface Extraction • Source Clustering • Schema Matching [Figure labels also shown: Find Web databases, Query Web databases, Type Patterns, Grammar, The Deep Web]

  27. MetaQuerier: Where we are… • Completed several key subtasks: • Query-interface understanding [SIGMOD’04] • Schema matching [SIGMOD’03, KDD’04] • Source clustering [CIKM’04] • Query translation [VLDB-IIWeb’04] • DB search [ICDE-WIRI’05] • Deep Web survey [SIGMOD-Record Sep’04] • Shallow, holistic integration approach [VLDB-IIWeb’04, SIGMOD-Record Dec’04] • System demo [SIGMOD’04, ICDE’05, SIGMOD’05] • System integration [CIDR’05] • Moving forward to exciting system issues: • System integration for building an integration system • Scale up by deploying actual crawling

  28. Third: What next? The Web trio. [Figure: three circles: Search, Integration, Mining]

  29. So here we are… Now, from mining to search? Ask not what you can do with Google; ask what Google should do for you.

  30. What can you do with Google? You are very creative, and the only limit is … After all, Google is designed for page retrieval. [Figure: Creative Mining Applications, each with heavy logic, send keywords to the Search Engine over the Web and get back pages and counts]

  31. Your creativity is amazing: A few examples • WSQ/DSQ at Stanford • use page counts to rank term associations • QXtract at Columbia • generate keywords to retrieve docs useful for extraction • KnowItAll at Washington • both ideas in one framework • And there must be many I don’t know yet… • Time to distill these ideas and build a better “mining” engine?
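
As a rough illustration of "use page counts to rank term associations", the sketch below ranks candidate terms by how much more often they co-occur with a target term than chance would predict. Everything here is an assumption for illustration: the scoring function (pointwise mutual information) is a common choice and not necessarily what WSQ/DSQ uses, `page_count` is a hypothetical stand-in for a search engine's hit count, and the counts are made-up numbers.

```python
import math

# Made-up hit counts standing in for a search engine (illustrative only).
FAKE_COUNTS = {
    "database": 100,
    "index": 50,
    "banana": 200,
    "database index": 20,
    "database banana": 2,
}
TOTAL_PAGES = 1000  # assumed corpus size for this toy example


def page_count(query):
    """Hypothetical hit-count lookup; a real system would query a search engine."""
    return FAKE_COUNTS.get(query, 0)


def association(term, candidate):
    """Pointwise mutual information computed from page counts alone."""
    joint = page_count(f"{term} {candidate}")
    if joint == 0:
        return float("-inf")
    expected = page_count(term) * page_count(candidate) / TOTAL_PAGES
    return math.log(joint / expected)


def rank_associations(term, candidates):
    """Order candidates from most to least associated with `term`."""
    return sorted(candidates, key=lambda c: association(term, c), reverse=True)


ranking = rank_associations("database", ["banana", "index"])
print(ranking)
```

The application logic sits entirely outside the engine and sees only keywords in and counts out, which is the limitation the preceding slide points at.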

  32. The WISDM Goal. WISDM: Web Indexing and Search for Dynamic Mining • To begin with, what functions to provide? [Figure: several Mining Applications built on top of WISDM, which sits over the Web]

  33. First step. Entity-Relation discovery: Tag basic entities; weave them into relations. WISDM-ER performs Entity-Relation Discovery over the Web, producing relations such as: • R1 = <prof, phone, email>: (David DeWitt, 608-263-5489, dewitt@cs.wisc.edu); (Marianne Winslett, 333-3536, winslett@cs.uiuc.edu); … • R2 = <prof, univ, research>: (David DeWitt, U. Wisconsin, database systems); (Chris Clifton, Purdue U., data mining); …
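
A minimal sketch of "tag basic entities; weave them into relations", using the two example rows from this slide as page text. The regular expressions, the list of known professor names, and the proximity rule (attach the first phone and email that follow a name within a character window) are all assumptions made for illustration; the slides do not describe how WISDM-ER tags or weaves.

```python
import re

# Simplified patterns, assumed for this sketch only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"(?:\d{3}-)?\d{3}-\d{4}")


def tag_entities(text, known_profs):
    """Return (type, value, offset) tags sorted by position in the text."""
    tags = [("email", m.group(), m.start()) for m in EMAIL.finditer(text)]
    tags += [("phone", m.group(), m.start()) for m in PHONE.finditer(text)]
    for name in known_profs:
        for m in re.finditer(re.escape(name), text):
            tags.append(("prof", name, m.start()))
    return sorted(tags, key=lambda t: t[2])


def weave(tags, window=80):
    """Build <prof, phone, email> tuples by attaching to each prof the first
    phone and first email found within `window` characters after the name."""
    rows = []
    for kind, value, pos in tags:
        if kind != "prof":
            continue
        row = {"prof": value}
        for other_kind, other_value, other_pos in tags:
            in_window = 0 <= other_pos - pos <= window
            if other_kind in ("phone", "email") and in_window and other_kind not in row:
                row[other_kind] = other_value
        if "phone" in row and "email" in row:
            rows.append((row["prof"], row["phone"], row["email"]))
    return rows


page = (
    "David DeWitt, phone 608-263-5489, dewitt@cs.wisc.edu\n"
    "Marianne Winslett, phone 333-3536, winslett@cs.uiuc.edu"
)
r1 = weave(tag_entities(page, ["David DeWitt", "Marianne Winslett"]))
for row in r1:
    print(row)
```

A fixed character window is the crudest possible co-occurrence rule; it is used here only to show where tagging ends and weaving begins.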

  34. Demo. We decided to quickly build Ver. 0.1, to understand the promises and issues.

  35. Current testbed: A small corpus to peek at the potential • Data pages: 6 “US-Central” CS departments • Basic entities: prof, email, phone, univ, research, state

  36. Entity-Relation Discovery: How to define the function conceptually? Our view: An ERD Query = (S, E, F, C)

  37. System: Page retrieval to relation discovery

  38. Promises of the ERD Concept • From IR to a mining engine • not only page retrieval but also construction • From offline to online query processing • enable large scale ad-hoc mining over the web • From tuple at a time to table at a time • global relation construction by “constraints” • From Web to controlled corpus • enhance not only efficiency but also effectiveness • From passive to active application-driven indexing • enable mining applications

  39. Issues? Where is the science? • Tagging of basic entities? • Powerful pattern language • Linguistic; visual • Advanced statistical analysis • correlation; sampling • Scalable query processing • new components scale?

  40. Thank You! For more information: http://metaquerier.cs.uiuc.edu kcchang@cs.uiuc.edu

  41. -- MetaQuerier and Beyond -- A Trilogy of Search, Integration, and Mining. Kevin C. Chang. Joint work with: Bin He, Zhen Zhang, Joe Kelley, Tao Cheng, Bill Davis

  42. At MSRA, I am probably preaching to the choir: Google is only the start of search. Web search is still full of challenges and opportunities. In terms of problems: • Dual challenges we must tackle. In terms of solutions: • Trio techniques we must develop.

  43. Thank You! And a team of excellent students… Bin He, Zhen Zhang, Joe Kelley, Tao Cheng, Bill Davis, Shui-Lung Chuang. For more information: http://metaquerier.cs.uiuc.edu kcchang@cs.uiuc.edu

  44. Example applications: “Relation” is the essence of many info searches • CSContact: By weaving R1 = <prof, phone, email>: • What are the phone and email of, say, Marianne Winslett? • What are the emails of all profs at Illinois? • CSResearch: By weaving R2 = <prof, univ, research>: • What is the research area of Winslett? • Who are database professors at various universities? • Which area has the most faculty at Illinois?
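
Once relations like R1 and R2 have been woven, questions of this kind reduce to simple lookups over tuples. The sketch below uses only the example rows shown on the Entity-Relation discovery slide; the function names and the substring match on research area are assumptions for illustration, not part of the CSContact or CSResearch applications.

```python
from collections import Counter, namedtuple

Contact = namedtuple("Contact", "prof phone email")
Research = namedtuple("Research", "prof univ research")

# Example rows taken from the slides.
R1 = [
    Contact("David DeWitt", "608-263-5489", "dewitt@cs.wisc.edu"),
    Contact("Marianne Winslett", "333-3536", "winslett@cs.uiuc.edu"),
]
R2 = [
    Research("David DeWitt", "U. Wisconsin", "database systems"),
    Research("Chris Clifton", "Purdue U.", "data mining"),
]


def contact_of(name):
    """CSContact-style question: phone and email of a given professor."""
    return [(r.phone, r.email) for r in R1 if r.prof == name]


def profs_in_area(keyword):
    """CSResearch-style question: professors whose research mentions `keyword`."""
    return [(r.prof, r.univ) for r in R2 if keyword in r.research]


def top_area(univ):
    """CSResearch-style question: the area with the most faculty at `univ`."""
    counts = Counter(r.research for r in R2 if r.univ == univ)
    return counts.most_common(1)[0][0] if counts else None


print(contact_of("Marianne Winslett"))
print(profs_in_area("database"))
print(top_area("Purdue U."))
```

The point of the slide is that the hard part is constructing R1 and R2 from Web pages; the queries themselves are ordinary relational operations.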

  45. Is this possible? Our Hypotheses: “Tuple” patterns will not only emerge but also converge [Figure: pages containing entity occurrences e1, e2, …, en; labels: S, H, Page Creation, Entity Occurrences, Tuple Semantics, Cooccurrence Patterns, Pattern-based Cooccurrence Analysis]
