Peer Data-Management Systems: Plumbing for the Semantic Web

Peer Data-Management Systems: Plumbing for the Semantic Web. Alon Halevy University of Washington Joint work with Anhai Doan, Jayant Madhavan, Phil Bernstein, and Pedro Domingos. Agenda. Elements of the Semantic Web Piazza: a peer data-management system


Peer Data-Management Systems: Plumbing for the Semantic Web

Alon Halevy

University of Washington

Joint work with Anhai Doan, Jayant Madhavan, Phil Bernstein, and Pedro Domingos

Agenda
  • Elements of the Semantic Web
  • Piazza: a peer data-management system
    • A database guy’s contribution to the semantic web
  • The key issue: mapping between different models:
    • Some recent progress and current directions.
  • The critical issue: crossing the structure chasm.
  • The talk I’m not giving today:
    • A critique of the Semantic Web.
  • Work and thoughts are in progress
The Semantic Web (my view)
  • Web sites include structural annotations
    • You can pose meaningful queries on them.
    • Ontologies provide the semantic glue.
    • Internal implementation of web sites left open.
  • Agents perform tasks:
    • Query one or more web sites
    • Perform updates (e.g., set schedules)
    • Coordinate actions
    • Trust each other (or not).
  • I.e., agents operating on a gigantic heterogeneous distributed database.
Getting there
  • Robust infrastructure for querying
    • Peer data management systems.
  • Facilitate mapping between different structures. Need tools for:
    • Locating relevant structures
    • Easily joining the semantic web.
  • Get data into structured form
    • Should we worry about the legacy web?
Agenda
  • Elements of the Semantic Web (personal view)
  • Piazza: a peer data-management system
    • A database guy’s contribution to the semantic web
  • The key issue: mapping between different models:
    • Some recent progress and current directions.
  • The critical issue: crossing the structure chasm.
Piazza: Peer Data-Management

Goal: To enable users to share data across local or wide area networks in an ad-hoc, highly dynamic distributed architecture.

  • Peers can:
    • Export base data
    • Provide views on base data
    • Serve as logical mediators for other peers
  • Every peer can be both a server and a client.
  • Peers join and leave the PDMS at will.
Relationship of PDMS to…
  • P2P overlay networks (the “S” word)
  • Data integration systems (no central logical mediated schema)
  • Federated databases (scale, ad-hoc nature)
  • Distributed databases (no central administration)
Representing Data
  • A spectrum of possibilities:
    • Relational tables, some integrity constraints
    • XML: can encode relational, hierarchical, OO
      • XQuery – emerging standard query language (SQL for XML)
    • RDF: “XML on drugs”.
      • Sees only the logic; ignores other aspects.
    • DAML+OIL
      • Full blown Knowledge representation language.
  • They all have semantics; just different expressive powers.
  • We keep the data simple. Mappings between data at different peers are more complex.
Piazza Querying
  • Semantic mappings between peers provide glue:

LH:CritBed(bed, hosp, room, PID, status) ⊆ H:CritBed(bed, hosp, room) & H:Patient(PID, bed, status)

9DC:SkilledPerson(PID, "Doctor") :- H:Doctor(SID, h, l, s, e)

9DC:SkilledPerson(PID, "EMT") :- H:EMT(SID, h, vid, s, e)

  • Query processing phases:
    • Reformulate a query into queries over stored data.
      • MiniCon algorithm (++) for answering queries using views.
      • Extensions in Piazza enable chaining multiple peer mappings.
    • Find best plan for the query and execute it:
      • Tukwila data integration engine – an efficient processor for network bound XML/relational data.
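
As a rough illustration of chaining peer mappings (this is not the MiniCon algorithm, only a toy traversal), the sketch below follows mappings transitively from a queried relation to the stored relations that could contribute answers. The first two edges mirror the 9DC:SkilledPerson rules above; the FH:* relations, the STORED set, and the function name are hypothetical and introduced only for this example.

```python
from collections import deque

# Each key is a peer relation; the values are the relations its mappings draw on.
# The 9DC -> H edges come from the rules on this slide.
# The H -> FH edges and the FH relations are hypothetical.
MAPPINGS = {
    "9DC:SkilledPerson": ["H:Doctor", "H:EMT"],
    "H:Doctor": ["FH:Staff"],       # hypothetical mapping
    "H:EMT": ["FH:Ambulance"],      # hypothetical mapping
}
STORED = {"FH:Staff", "FH:Ambulance"}  # hypothetical stored relations


def reachable_stored(relation):
    """Return the stored relations reachable from `relation` by chaining mappings."""
    seen = {relation}
    queue = deque([relation])
    found = []
    while queue:
        current = queue.popleft()
        if current in STORED:
            found.append(current)
        for nxt in MAPPINGS.get(current, []):
            if nxt not in seen:  # guard against cyclic mappings between peers
                seen.add(nxt)
                queue.append(nxt)
    return found


sources = reachable_stored("9DC:SkilledPerson")
```

A real reformulation would also rewrite the query's variables and predicates at each step; the traversal only shows why a query at one peer can reach data several mappings away.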
Efficiency Issues in Piazza
  • Intelligent data placement:
    • We may want to place views over data at key points in the PDMS:
      • Save work for frequently asked queries.
      • Increase availability in cases of failures.
    • Akamai for structured data
    • A form of automated reformulation.
    • Large search space of possibilities
    • Surprising lower bounds on very simple cases [Chirkova et al, VLDB 2001].
  • Efficient propagation of updates:
    • Approach: publish updategrams as first-class citizens.
Additional Piazza Issues
  • The catalog of data sources
    • What does a catalog of structured data sources look like?
    • How can it be browsed by humans?
    • How do we facilitate joining a PDMS?
    • How can the catalog be distributed physically?
  • Systems issues:
    • Architecture of a Piazza node: what are the components?
    • Naming issues
    • Security
  • Piazza collaborators: Etzioni, Gribble, Ives, Levy, Suciu, Mork, Rodrig, Tatarinov.
Agenda
  • Elements of the Semantic Web
  • Piazza: a peer data-management system
    • A database guy’s contribution to the semantic web
  • The key issue: mapping between different models:
    • Some recent progress and current directions.
  • The critical issue: crossing the structure chasm.
It’s All About the Mappings

It’s not about understanding the data:

It’s about understanding each other.

  • Whenever you see a model for some domain, there is another one hiding around the corner.
  • Mappings provide semantic relationships between different peers.
  • Specifying mappings: inherently a human-assisted task.
  • Goal: make it easy, fast, incremental.
  • Not a new problem!
Example Semantic Mapping
  • Mapping between XML DTDs
    • DTD 1: house (address, contact-info (agent-name, agent-phone), num-baths)
    • DTD 2: house (location, contact (name, phone), full-baths, half-baths)
    • [Figure: the two DTD trees, with lines marking 1-1 mappings and non 1-1 mappings between their elements.]

Desiderata from Proposed Solutions
  • Accuracy, efficiency, ease of use.
  • Extensible: accommodate in a principled fashion:
    • User feedback
    • Domain constraints
    • General heuristics
  • “Memory”, knowledge reuse:
    • System should exploit knowledge from previous matching tasks [LSD].
  • Some underlying semantics.
Why Matching is Difficult
  • Structures represent same entity differently
    • different names => same entity:
      • area & address => location
    • same names => different entities:
      • area => location or square-feet
  • Intended semantics is typically subjective!
    • IBM Almaden Lab = IBM?
  • Schema, data and rules never fully capture semantics!
    • not adequately documented, certainly not for machine consumption.
  • Often hard for humans (committees are formed!)
Learning for Mapping
  • We started simple: generating semantic mappings between a mediated schema and a large set of data source schemas.
  • Key idea: generate the first mappings manually, and learn from them to generate the rest.
  • Technique: multi-strategy learning (extensible!)
  • L(earning) S(ource) D(escriptions) [SIGMOD 2001].
  • Recent and current work:
    • (simple) Ontology mapping [WWW-02]
    • Complex mappings [COMAP]
    • Semantics [Madhavan et al., AAAI-02]
Data Integration (a simple PDMS)

Query: Find houses with four bathrooms priced under $500,000

  • The query is posed over the mediated schema.
  • Query reformulation and optimization translate it into queries over the sources:
    • source schema 1: realestate.com
    • source schema 2: homeseekers.com
    • source schema 3: homes.com

Applications: WWW, enterprises, science projects

Techniques: virtual data integration, warehousing, custom code.

Learning from the Manual Mappings

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

realestate.com

listed-price   contact-name   contact-phone    office           comments
$250K          James Smith    (305) 729 0831   (305) 616 1822   Fantastic house
$320K          Mike Doan      (617) 253 1429   (617) 112 2315   Great location

Learned rules:
  • If “office” occurs in the name => office-phone
  • If “fantastic” & “great” occur frequently in data instances => description

homes.com

sold-at   contact-agent    extra-info
$350K     (206) 634 9435   Beautiful yard
$230K     (617) 335 4243   Close to Seattle
$190K     (512) 342 1263   Great lot

Multi-Strategy Learning
  • Use a set of base learners:
    • Name learner, Naïve Bayes, Whirl, XML learner
  • And a set of recognizers:
    • County name, zip code, phone numbers.
  • Each base learner produces a prediction weighted by confidence score.
  • Combine base learners with a meta-learner, using stacking.
Base Learners
  • Training: from training examples (X1,C1), (X2,C2), ..., (Xm,Cm), where each Xi is an object and Ci its observed label, learn a classification model (hypothesis).
  • Matching: given an object X, return labels weighted by confidence score.
  • Name Learner
    • training: (“location”, address) (“contactname”, name)
    • matching: agent-name => (name,0.7),(phone,0.3)
  • Naive Bayes Learner
    • training: (“Seattle, WA”,address) (“250K”,price)
    • matching: “Kent, WA” => (address,0.8),(name,0.2)
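
To make the train/match interface concrete, here is a minimal sketch of a token-based Naive Bayes base learner with add-one smoothing. It is an assumption about how such a learner could look, not LSD's implementation: the class name, the tokenizer, and the smoothing choice are all hypothetical. The confidences it produces depend on the training data and smoothing, so they need not equal the values on the slide.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase word tokens, e.g. 'Kent, WA' -> ['kent', 'wa']."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesLearner:
    """Hypothetical base learner: maps a data instance to labels with confidences."""

    def __init__(self):
        self.label_counts = Counter()             # training examples per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokens(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def match(self, text):
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            n = sum(self.token_counts[label].values())
            log_p = math.log(count / total)  # prior P(label)
            for tok in tokens(text):
                # add-one smoothed estimate of P(token | label)
                log_p += math.log(
                    (self.token_counts[label][tok] + 1) / (n + len(self.vocab))
                )
            scores[label] = math.exp(log_p)
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}


learner = NaiveBayesLearner()
learner.train([("Seattle, WA", "address"), ("250K", "price")])
prediction = learner.match("Kent, WA")
best_label = max(prediction, key=prediction.get)
```

With only the two training pairs from the slide, the shared token "WA" is what tips “Kent, WA” toward address.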

Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]
  • Training
    • uses training data to learn weights
    • one for each (base-learner,mediated-schema element) pair
    • weight (Name-Learner,address) = 0.2
    • weight (Naive-Bayes,address) = 0.8
  • Matching: combine predictions of base learners
    • computes weighted average of base-learner confidence scores

Example: source element area, with data instances “Seattle, WA”, “Kent, WA”, “Bend, OR”
  • Name Learner: (address,0.4)
  • Naive Bayes: (address,0.9)
  • Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
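
The combination step can be sketched in a few lines. This is a minimal illustration using the weights and confidences from this slide; the function name and dictionary layout are assumptions, and learning the weights themselves (the stacking step) is not shown.

```python
def combine(predictions, weights):
    """Weighted combination of base-learner confidences, per label.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}
    """
    scores = {}
    for learner, label_scores in predictions.items():
        for label, confidence in label_scores.items():
            weight = weights[(learner, label)]
            scores[label] = scores.get(label, 0.0) + weight * confidence
    return scores


# Values taken from the slide.
weights = {("Name-Learner", "address"): 0.2, ("Naive-Bayes", "address"): 0.8}
predictions = {
    "Name-Learner": {"address": 0.4},
    "Naive-Bayes": {"address": 0.9},
}
combined = combine(predictions, weights)  # address: 0.4*0.2 + 0.9*0.8
```

Because the weights are kept per (base-learner, mediated-schema element) pair, a learner that is reliable for address can still be down-weighted for another element.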

The LSD Architecture
  • Training Phase
    • Inputs: mediated schema, source schemas
    • Training data for base learners => Base-Learner1 ... Base-Learnerk => Hypothesis1 ... Hypothesisk
    • Meta-Learner => weights for base learners
  • Matching Phase
    • Base-Learner1 ... Base-Learnerk and Meta-Learner => predictions for instances
    • Prediction Combiner => predictions for elements
    • Constraint Handler (with domain constraints) => mappings

Domain Constraints
  • Encode user knowledge about the domain
  • Specified by examining mediated schema
  • Examples
    • at most one source-schema element can match address
    • if a source-schema element matches house-id then it is a key
    • avg-value(price) > avg-value(num-baths)
  • Given a mapping combination
    • can verify if it satisfies a given constraint

  • Example mapping combination:
    • area: address
    • sold-at: price
    • contact-agent: agent-phone
    • extra-info: address
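
Checking a mapping combination against a constraint is mechanical. The sketch below tests the first example constraint ("at most one source-schema element can match address") on the mapping combination shown on this slide; the function name is hypothetical, and the other two constraints would need schema and data statistics that are not modeled here.

```python
from collections import Counter


def at_most_one_match(mapping, target):
    """True if at most one source-schema element is mapped to `target`."""
    return Counter(mapping.values())[target] <= 1


# Mapping combination from the slide: source element -> mediated-schema element.
candidate = {
    "area": "address",
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "address",
}

address_ok = at_most_one_match(candidate, "address")  # area and extra-info both map to address
price_ok = at_most_one_match(candidate, "price")
```

Here the candidate fails the address constraint, so a constraint handler would need to consider a different combination.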

Empirical Evaluation
  • Four domains
    • Real Estate I & II, Course Offerings, Faculty Listings
  • For each domain
    • create mediated DTD & domain constraints
    • choose five sources
    • extract & convert data listings into XML (faithful to schema!)
    • mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48
  • Ten runs for each experiment - in each run:
    • manually provide 1-1 mappings for 3 sources
    • ask LSD to propose mappings for remaining 2 sources
    • accuracy = % of 1-1 mappings correctly identified
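
The accuracy measure can be written down directly. In this sketch the function name is an assumption, and the gold mapping for homes.com is hypothetical, made up only to exercise the function.

```python
def matching_accuracy(proposed, gold):
    """Percentage of gold 1-1 mappings that the proposed mapping gets right."""
    if not gold:
        raise ValueError("gold mapping must not be empty")
    correct = sum(
        1 for element, target in gold.items() if proposed.get(element) == target
    )
    return 100.0 * correct / len(gold)


# Hypothetical gold mapping and proposed mapping for one source.
gold = {
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "description",
}
proposed = {
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "address",
}
accuracy = matching_accuracy(proposed, gold)  # two of three mappings correct
```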
Matching Accuracy

[Figure: average matching accuracy (%)]
  • LSD’s accuracy: 71 - 92%
  • Best single base learner: 42 - 72%
  • + Meta-learner: + 5 - 22%
  • + Constraint handler: + 7 - 13%
  • + XML learner: + 0.8 - 6%

Sensitivity to Amount of Available Data

[Figure: average matching accuracy (%) vs. number of data listings per source (Real Estate I)]

Contribution of Schema vs. Data

[Figure: average matching accuracy (%) for LSD with only schema info., LSD with only data info., and complete LSD]

  • More experiments in the paper [Doan et al. 01]
Contribution of Each Component

[Figure: average matching accuracy (%) without Name Learner, without Naive Bayes, without Whirl Learner, without Constraint Handler, and for the complete LSD system]

The Next Steps
  • Learning is a useful component. But it needs to be combined with:
    • User feedback
    • Domain constraints
    • General heuristics
  • Need a representation of mappings:
    • First step – see [Madhavan et al., AAAI-02]
      • Also defines key inference problems for such a representation,
      • Provides answers for the mapping language used in Piazza.
    • Ultimately, some first-order probabilistic representation.
  • Need benchmarks to measure progress.
Agenda
  • Elements of the Semantic Web
  • Piazza: a peer data-management system
    • A database guy’s contribution to the semantic web
  • The key issue: mapping between different models:
    • Some recent progress and current directions.
  • The critical issue: crossing the structure chasm.
Can We Cross the Structure Chasm?
  • There are two worlds:
    • U-world: the current web, keyword search, Google
    • S-world: databases, knowledge bases, structured queries
  • The web succeeded because it’s in the U-world.
  • For the semantic web to succeed, we need to make it dead simple for people to:
    • Structure data, locate relevant data and data sets, query.
  • However:
    • People have a hard time structuring their data
    • It’s harder to query structured data: need to know a terminology.
    • It’s harder to understand each other in the S-world.
  • DB and KR people have no clue how to deal with this.
  • More expressive power in the languages won’t help.