MetaQuerier and Beyond: A Trilogy of Search, Integration, and Mining

MetaQuerier and Beyond: A Trilogy of Search, Integration, and Mining. Kevin C. Chang. Joint work with: Bin He, Zhen Zhang, Joe Kelley, Tao Cheng, Bill Davis, Shui-Lung Chuang

Do you believe it? Google is only the start of search. Web search is still full of challenges and opportunities. In terms of problems: • Dual challenges we must tackle. In terms of solutions: • Trio techniques we must develop.

The Dual Challenges on the Web: Getting structured data from … • The “deep” Web • semantic-rich, structured data hidden “deeply” inside databases on the Web • structure ready; access non-trivial. • The “surface” Web • semantic-rich, structured data hidden “implicitly” on the surface Web • access ready; structure non-trivial.

I am inspired: Good stories must go in “trio.” Sociology. Science. History.

The Web “Trilogy” (My three circles...) Search Integration Mining

First: When we started… Search Integration Mining On the Internet, search must eventually resort to integration.

The previous Web: Search used to be “crawl and index”

The current Web: Search must eventually resort to integration

How to enable effective access to the deep Web? Cars.com Amazon.com Biography.com Apartments.com 411localte.com 401carfinder.com

Amy is a new graduate, just moving to her new career • Finding sources: • Wants to upgrade her car: where can she study her options? (cars.com, edmunds.com) • Wants to buy a house: where can she look for houses in her town? (realtor.com) • Wants to write a grant proposal. (NSF Award Search) • Wants to check for patents. (uspto.gov) • Querying sources: • Then, she needs to learn the grueling details of querying

MetaQuerier: Exploring and integrating the deep Web • FIND sources (Explorer): • source discovery • source modeling • source indexing • builds a db of dbs (e.g., Amazon.com, Cars.com) • QUERY sources (Integrator): • source selection • schema integration • query mediation • offers a unified query interface (e.g., Apartments.com, 411localte.com)

Toward large scale integration: MetaQuerier for the deep Web. We are facing very different “large scale” scenarios! • Many sources on the Web, on the order of 10^5 • Such integration must be dynamic and ad-hoc: • Dynamic discovery: • Sources are dynamically changing • On-the-fly integration: • Queries are ad-hoc and need different sources • Our proposal: MetaQuerier for the deep Web

Second: Then we realized… Search Integration Mining Large scale integration must essentially resort to mining of semantics.

The challenge boils down to: How to deal with “deep” semantics across a large scale? “Semantics” is the key in integration! • How to understand a query interface? • Where is the first condition? What’s its attribute? • How to match query interfaces? • What does “author” on this source match on that? • How to translate queries? • How to ask this query on that source?

Survey the frontier before going to the battle. We found… • Challenge reassured: • 450,000 online databases • 1,258,000 query interfaces • 307,000 deep web sites • 3-7 times increase in 4 years • Insight revealed: • Web sources are not arbitrarily complex • “Amazon effect” – convergence and regularity naturally emerge

“Amazon effect” in action… Attributes converge in a domain! Condition patterns converge even across domains!

Unified insight: Holistic integration • Holistic integration: • Take a holistic view to account for many sources together in integration • Globally exploit clues across all sources for resolving the “semantics” of interest • A conceptually unifying framework: • Many of our tasks implicitly share this framework

Large scale itself presents opportunity: Shallow integration across holistic sources • Shallow observable clues: • “underlying” semantics often relate to the “observable” presentations through some way of connection. • Holistic hidden regularities: • Such connections often follow some implicit properties, which reveal themselves holistically across sources • Diagram: Semantics (to be discovered) → some way of connection, with hidden regularities → Presentations (observed); reverse analysis goes from presentations back to semantics

Some evidence for “holistic integration” • Evidence 1: [SIGMOD04] Query Interface Understanding: hidden-syntax parsing (attribute, operator, value) • Evidence 2: [SIGMOD03, KDD04] Matching Query Interfaces: hidden-model discovery

Demo. Knocking the Door to the Deep Web

Interface Understanding: Does a hidden syntactic model exist?

Our Paradigm: Best-Effort Visual Language Parsing Framework • Input: HTML query form • Tokenizer (HTML layout engine) • BE-Parser (ambiguity resolution, error handling), driven by the 2P grammar (productions, preferences) • Output: semantic structure

Interface Matching: Does a hidden statistical model exist? • Our view: a statistical model M over a finite vocabulary generates QIs with different probabilities; instantiation probability: P(QI1|M) • Now the problem is: given the QIs, can we discover M?

Towards hidden model discovery: Statistical schema matching (MGS) • 1. Define an abstract model structure M to solve the target question: P(QI|M) = … • 2. Given the observed QIs, generate the model candidates (M1, M2, …) with P(QIs|M) > 0 • 3. Select the model candidate with the highest confidence: what is the confidence of M1 given the observed QIs?
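The three MGS steps above can be sketched on a toy scale. This is only an illustration under loud assumptions, not the authors' algorithm: the model structure (a partition of attributes into concepts, where an interface picks at most one attribute per concept), the maximum-likelihood parameter estimates, and the use of log-likelihood as the "confidence" are all stand-ins chosen here, since the slide does not specify them. The function names and the sample interfaces are hypothetical.

```python
import math

def partitions(items):
    """Yield every partition of the list `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # `first` forms its own concept
        for i in range(len(part)):                  # or joins an existing concept
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def log_likelihood(model, interfaces):
    """log P(QIs|M) under the toy model; -inf when P(QIs|M) = 0."""
    n = len(interfaces)
    total = 0.0
    for concept in model:
        hits = [set(concept) & qi for qi in interfaces]
        if any(len(h) > 1 for h in hits):
            return float("-inf")                    # two "synonyms" co-occur
        used = sum(1 for h in hits if h)
        alpha = used / n                            # P(concept appears), estimated
        for h in hits:
            if h:
                (attr,) = h
                beta = sum(1 for x in hits if attr in x) / used  # P(attr | concept)
                total += math.log(alpha * beta)
            else:
                total += math.log(1 - alpha)
    return total

def discover_model(interfaces):
    """Step 2: keep candidates with P(QIs|M) > 0. Step 3: pick the best-scoring one."""
    vocab = sorted(set().union(*interfaces))
    candidates = [m for m in partitions(vocab)
                  if log_likelihood(m, interfaces) > float("-inf")]
    return max(candidates, key=lambda m: log_likelihood(m, interfaces))

# Made-up query interfaces (sets of attribute names), for illustration only.
SAMPLE_QIS = [
    {"author", "title", "subject"},
    {"name", "title", "subject"},
    {"author", "title"},
    {"name", "subject"},
]
```

On this sample, "author" and "name" never co-occur, so the candidate that groups them survives step 2 and scores best in step 3. Enumerating all partitions is exponential in the vocabulary size, so the sketch only makes sense for a handful of attributes.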

Evidence for holistic integration • Evidence 1: [SIGMOD04] Query Interface Understanding by hidden-syntax parsing: hidden syntax (grammar) → syntactic composer → visual patterns → syntactic analyzer → query capabilities • Evidence 2: [SIGMOD03, KDD04] Query Interface Matching by hidden-model discovery: hidden generative model → statistic generator → attribute occurrences → statistic analyzer → attribute matchings

Putting together: The MetaQuerier system • Front-end: Query Execution (query Web databases) • Source Selection • Query Translation • Result Compilation • Deep Web Repository (find Web databases) • Query Interfaces • Query Capabilities • Subject Domains • Unified Interfaces • Back-end: Semantics Discovery (over the deep Web) • Database Crawler • Interface Extraction • Source Clustering • Schema Matching • Other labels in the diagram: Type Patterns, Grammar

MetaQuerier: Where we are… • Completed several key subtasks: • Query-interface understanding [SIGMOD’04] • Schema matching [SIGMOD’03, KDD’04] • Source clustering [CIKM’04] • Query translation [VLDB-IIWeb’04] • DB search [ICDE-WIRI’05] • Deep Web survey [SIGMOD-Record Sep’04] • Shallow, holistic integration approach [VLDB-IIWeb’04, SIGMOD-Record Dec’04] • System demo [SIGMOD’04, ICDE’05, SIGMOD’05] • System integration [CIDR’05] • Moving forward to exciting system issues: • System integration for building an integration system • Scale up by deploying actual crawling

Third: What next? The Web trio. Search Integration Mining

So here we are… Now, from mining to search? Ask not what you can do with Google; ask what Google should do for you.

What can you do with Google? You are very creative, and the only limit is … After all, Google is designed for page retrieval. • Diagram: creative mining applications (heavy logic) send keywords to the search engine over the Web and get back pages and counts

Your creativity is amazing: A few examples • WSQ/DSQ at Stanford • use page counts to rank term associations • QXtract at Columbia • generate keywords to retrieve docs useful for extraction • KnowItAll at Washington • both ideas in one framework • And there must be many I don’t know yet… • Time to distill to build a better “mining” engine?
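The "page counts to rank term associations" idea can be sketched as follows. This is a minimal illustration, not the method of WSQ/DSQ or any other system named above: the scoring rule (co-occurrence count divided by the candidate term's own count), the `page_count` callback, and every number in `FAKE_COUNTS` are assumptions made up for the example. A real application would obtain counts from a search engine.

```python
# Made-up page counts standing in for search engine results (illustrative only).
FAKE_COUNTS = {
    '"query"': 1000,
    '"index"': 600,
    '"guitar"': 1000,
    '"database" "query"': 800,
    '"database" "index"': 300,
    '"database" "guitar"': 5,
}

def fake_page_count(query):
    """Hypothetical stand-in for asking a search engine how many pages match."""
    return FAKE_COUNTS.get(query, 0)

def rank_associations(seed, candidates, page_count):
    """Rank candidate terms by how often their pages also mention the seed term."""
    scored = []
    for term in candidates:
        together = page_count(f'"{seed}" "{term}"')   # pages containing both terms
        alone = page_count(f'"{term}"')               # pages containing the candidate
        score = together / alone if alone else 0.0
        scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_associations("database", ["guitar", "index", "query"], fake_page_count)
```

Each candidate costs two count lookups, and all the logic lives in the application. That is the "heavy logic over keywords and counts" pattern the talk argues a mining engine should absorb.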

The WISDM Goal • WISDM: Web Indexing and Search for Dynamic Mining • Diagram: many mining applications sit on top of WISDM, which sits on top of the Web • To begin with, what functions to provide?

First step. Entity-Relation discovery: Tag basic entities; weave them into relations (WISDM-ER, over the Web)
R1 = <prof, phone, email>:
  David DeWitt | 608-263-5489 | dewitt@cs.wisc.edu
  Marianne Winslett | 333-3536 | winslett@cs.uiuc.edu
  … | … | …
R2 = <prof, univ, research>:
  David DeWitt | U. Wisconsin | database systems
  Chris Clifton | Purdue U. | data mining
  … | … | …
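A toy sketch of "tag basic entities; weave them into relations" for R1 = <prof, phone, email>. It is not the WISDM-ER implementation. The regular expressions, the list of known professor names, the weaving rule (take the first phone and email that follow a name within a character window), and the `window` value are all assumptions made for illustration.

```python
import re

# Hypothetical taggers for two basic entity types; real taggers would be richer.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b(?:\d{3}-)?\d{3}-\d{4}\b")

def tag_entities(text, known_profs):
    """Return (entity_type, value, offset) triples sorted by position in the text."""
    tags = [("email", m.group(), m.start()) for m in EMAIL.finditer(text)]
    tags += [("phone", m.group(), m.start()) for m in PHONE.finditer(text)]
    for name in known_profs:
        for m in re.finditer(re.escape(name), text):
            tags.append(("prof", name, m.start()))
    return sorted(tags, key=lambda t: t[2])

def weave(tags, schema=("prof", "phone", "email"), window=120):
    """Build one tuple per anchor entity (first schema type) from what follows it."""
    tuples = []
    for etype, value, pos in tags:
        if etype != schema[0]:
            continue
        row = [value]
        for wanted in schema[1:]:
            # Assumption: related entities appear after the anchor, within `window` chars.
            following = [t for t in tags
                         if t[0] == wanted and 0 <= t[2] - pos <= window]
            if not following:
                row = None
                break
            row.append(following[0][1])   # tags are sorted, so this is the closest one
        if row:
            tuples.append(tuple(row))
    return tuples

PAGE = ("David DeWitt 608-263-5489 dewitt@cs.wisc.edu\n"
        "Marianne Winslett 333-3536 winslett@cs.uiuc.edu\n")
R1 = weave(tag_entities(PAGE, ["David DeWitt", "Marianne Winslett"]))
```

The fixed-window rule is the weakest part of the sketch. The talk's hypothesis is that tuple patterns emerge and converge across pages, which suggests learning such co-occurrence patterns rather than hard-coding them.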

Demo. We decided to quickly build Ver. 0.1, to understand the promises and issues.

Current testbed: A small corpus to peek at the potential • Data pages: 6 “US-Central” CS departments • Basic entities: prof, email, phone, univ, research, state

Entity-Relation Discovery: How to define the function conceptually? Our view: An ERD Query = (S, E, F, C)

System: Page retrieval to relation discovery

Promises of the ERD Concept • From IR to a mining engine • not only page retrieval but also construction • From offline to online query processing • enable large scale ad-hoc mining over the web • From tuple at a time to table at a time • global relation construction by “constraints” • From Web to controlled corpus • enhance not only efficiency but also effectiveness • From passive to active application-driven indexing • enable mining applications

Issues? Where is the science? • Tagging of basic entities? • Powerful pattern language • Linguistic; visual • Advanced statistical analysis • correlation; sampling • Scalable query processing • new components scale?

Thank You! For more information: http://metaquerier.cs.uiuc.edu kcchang@cs.uiuc.edu


At MSRA, I am probably preaching to the choir: Google is only the start of search. Web search is still full of challenges and opportunities. In terms of problems: • Dual challenges we must tackle. In terms of solutions: • Trio techniques we must develop.

Thank You! And a team of excellent students… Bin He Zhen Zhang Joe Kelley Tao Cheng Bill Davis Shui-Lung Chuang For more information: http://metaquerier.cs.uiuc.edu kcchang@cs.uiuc.edu

Example applications: “Relation” is the essence of many information searches • CSContact: By weaving R1 = <prof, phone, email>: • What are the phone and email of, say, Marianne Winslett? • What are the emails of all profs at Illinois? • CSResearch: By weaving R2 = <prof, univ, research>: • What is the research area of Winslett? • Who are the database professors at various universities? • Which area has the most faculty at Illinois?

Is this possible? Our Hypotheses: “Tuple” patterns will not only emerge but also converge • Page creation (S): tuple semantics → entity occurrences (e1, e2, …, en) across pages • Pattern-based cooccurrence analysis (H): cooccurrence patterns → tuple semantics
