
Rapidly Constructing Integrated Applications from Online Sources


Presentation Transcript


  1. Rapidly Constructing Integrated Applications from Online Sources. Craig A. Knoblock, Information Sciences Institute, University of Southern California

  2. Motivating Example (slide shows screenshots of BiddingForTravel.com, Priceline, and Orbitz, linked to a map by a "?")

  3. Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans

  4. Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans

  5. Ungrammatical & Unstructured Text

  6. Ungrammatical & Unstructured Text
  • For simplicity, these snippets are called "posts"
  • Goal: <hotelArea>univ. ctr.</hotelArea> <price>$25</price> <hotelName>holiday inn sel.</hotelName>
  • Wrapper-based IE does not apply (e.g., Stalker, RoadRunner)
  • NLP-based IE does not apply (e.g., Rapier)

  7. Reference Sets
  • IE infused with outside knowledge: "reference sets"
  • Collections of known entities and their associated attributes
  • Online (or offline) sets of documents, e.g., the CIA World Fact Book
  • Online (or offline) databases, e.g., the Comics Price Guide, Edmunds, etc.

  8. Algorithm Overview – Use of Ref Sets

  9. Our Record Linkage Problem
  • Posts are not yet decomposed into attributes
  • Posts contain extra tokens that match nothing in the reference set
  Post: "$25 winning bid at holiday inn sel. univ. ctr."
  Reference set fields: hotel name, hotel area

  10. Our Record Linkage Solution
  P = "$25 winning bid at holiday inn sel. univ. ctr."
  • Compute the record-level similarity plus field-level similarities:
  VRL = <RL_scores(P, "Hyatt Regency Downtown"), RL_scores(P, "Hyatt Regency"), RL_scores(P, "Downtown")>
  • Feed VRL to an SVM, then apply binary rescoring
  • Output: the best-matching member of the reference set for the post
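The slide compresses the whole matching pipeline into a few labels. As a rough illustration, here is a minimal Python sketch of the idea: build a similarity vector for each (post, record) pair and let an SVM pick the best match. The Jaccard measure, the toy reference set, and the training labels are illustrative assumptions; the actual system uses a richer set of string metrics and a binary rescoring step not shown here.

```python
from sklearn.svm import SVC

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, standing in for the richer
    RL_scores (edit distance, TF-IDF, etc.) of the real system."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def rl_vector(post: str, record: dict) -> list:
    """V_RL: one record-level score plus one score per attribute field."""
    fields = [record["hotel_name"], record["hotel_area"]]
    return [jaccard(post, " ".join(fields))] + [jaccard(post, f) for f in fields]

# Toy reference set and hand-labeled training pairs (hypothetical data).
reference_set = [
    {"hotel_name": "Holiday Inn Select", "hotel_area": "University Center"},
    {"hotel_name": "Hyatt Regency", "hotel_area": "Downtown"},
]
train_posts = ["$25 winning bid at holiday inn sel. univ. ctr.",
               "$60 hyatt regency downtown great stay"]
X = [rl_vector(p, r) for p in train_posts for r in reference_set]
y = [1, 0, 0, 1]  # 1 = this post matches this reference record

clf = SVC().fit(X, y)

# At query time: score the post against every record, keep the best match.
post = "$25 winning bid at holiday inn sel. univ. ctr."
scores = clf.decision_function([rl_vector(post, r) for r in reference_set])
print(reference_set[int(scores.argmax())])  # should pick the Holiday Inn record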

  11. Extraction Algorithm
  Post: "$25 winning bid at holiday inn sel. univ. ctr."
  • Generate VIE = <common_scores(token), IE_scores(token, attr1), IE_scores(token, attr2), ...> for each token
  • A multiclass SVM labels each token: $25 → price; holiday inn sel. → hotel name; univ. ctr. → hotel area
  • Clean the whole extracted attribute
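In the same spirit, a minimal sketch of the token-level extraction step: each token gets a feature vector V_IE, and a multiclass SVM assigns it an attribute label. The features and the hand-labeled training tokens below are made up for illustration; the real V_IE includes similarity scores computed against the reference-set record matched in the previous step, which is what actually disambiguates hotel name from hotel area.

```python
from sklearn.svm import SVC

LABELS = ["other", "price", "hotel name", "hotel area"]

def vie(token: str) -> list:
    """V_IE for one token: a few 'common' scores standing in for the
    per-attribute similarity scores of the real system (all assumptions)."""
    return [
        float(token.startswith("$")),            # looks like a price
        float(any(c.isdigit() for c in token)),  # contains a digit
        float(token.endswith(".")),              # abbreviation marker
        min(len(token), 10) / 10.0,              # normalized token length
    ]

# Hypothetical training tokens with hand labels (indices into LABELS).
train = [("$25", 1), ("winning", 0), ("bid", 0), ("at", 0),
         ("holiday", 2), ("inn", 2), ("sel.", 2), ("univ.", 3), ("ctr.", 3)]
clf = SVC().fit([vie(t) for t, _ in train], [lab for _, lab in train])

post = "$25 winning bid at holiday inn sel. univ. ctr."
for tok in post.split():
    print(tok, "->", LABELS[int(clf.predict([vie(tok)])[0])])
```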

  12. Experimental Data Sets: Hotels
  • Posts: 1125 posts from www.biddingfortravel.com (Pittsburgh, Sacramento, San Diego); attributes: star rating, hotel area, hotel name, price, date booked
  • Reference set: 132 records, built from special posts on the BFT site that list, per area, any hotels ever bid on in that area; attributes: star rating, hotel area, hotel name

  13. Comparison to Existing Systems
  • Record linkage: WHIRL, a record linkage system that allows non-decomposed attributes
  • Information extraction: Simple Tagger (CRF), a state-of-the-art IE system, and Amilcare, an NLP-based IE system

  14. Record Linkage Results (results chart): 10 trials, 30% train / 70% test

  15. Token-level Extraction Results: Hotel Domain (results chart; some differences marked not significant)

  16. Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans

  17. Discovering Models of Sources
  Required for integration:
  • A mediator provides uniform access to heterogeneous sources (in the diagram, web services from United, Lufthansa, and Qantas)
  • Source definitions are used to reformulate queries: SELECT MIN(price) FROM flight WHERE depart="MXP" AND arrive="PIT" is reformulated into calls such as lowestFare("MXP","PIT")
  • For a new service, e.g., Alitalia's calcPrice("MXP","PIT","economy"), there is no source model, so no integration!
  • Can we discover models automatically?

  18. Inducing Source Definitions: A Simple Example
  • Step 1: use metadata to classify input types
  • Step 2: invoke the service and classify output types
  Semantic types: currency = {USD, EUR, AUD}; rate = {1936.2, 1.3058, 0.53177}
  Predicates: exchange(currency, currency, rate), e.g., {<EUR,USD,1.30799>, <USD,EUR,0.764526>, ...}
  Known source: LatestRates($country1, $country2, rate) :- exchange(country1, country2, rate)
  New source: RateFinder($fromCountry, $toCountry, val) :- ?
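A minimal sketch of how steps 1-2 might classify sample output values into the slide's semantic types. The regex recognizers are assumptions for illustration, not the classifier used in the actual system (which classifies inputs from metadata and outputs from data):

```python
import re

# Hypothetical recognizers for the two semantic types on the slide.
SEMANTIC_TYPES = {
    "currency": re.compile(r"^[A-Z]{3}$"),   # ISO codes: USD, EUR, AUD
    "rate":     re.compile(r"^\d+\.\d+$"),   # decimals: 1.3058, 0.53177
}

def classify(values):
    """Assign the semantic type whose pattern matches every sample value."""
    for name, pattern in SEMANTIC_TYPES.items():
        if all(pattern.match(v) for v in values):
            return name
    return None

print(classify(["USD", "EUR", "AUD"]))            # currency
print(classify(["1936.2", "1.3058", "0.53177"]))  # rate
```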

  19. Inducing Source Definitions: A Simple Example (Cont'd)
  • Step 3: generate plausible source definitions:
  def_1($from, $to, val) :- exchange(from, to, val)
  def_2($from, $to, val) :- exchange(to, from, val)
  • Step 4: reformulate in terms of other sources:
  def_1($from, $to, val) :- LatestRates(from, to, val)
  def_2($from, $to, val) :- LatestRates(to, from, val)
  • Step 5: invoke the service and compare outputs (matching the currency and rate values)

  20. The Framework
  Intuition: services often have similar semantics, so we should be able to use what we know to induce what we don't.
  Two-phase algorithm; for each operation provided by the new service:
  • Classify its input/output data types: classify inputs based on metadata similarity, then invoke the operation and classify outputs based on data
  • Induce a source definition: generate candidates via Inductive Logic Programming, then test individual candidates by reformulating them (a sketch of this test phase follows)
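Here is a minimal sketch of that test phase, under strong assumptions: sources are modeled as plain Python callables, and a candidate is scored by how often its reformulated answer agrees with the new service on sample inputs. All names and data are hypothetical stand-ins for the currency example from slides 18-19.

```python
def test_candidate(candidate, new_service, samples):
    """Score a candidate definition: invoke the new service on sample
    inputs and check agreement with the candidate's reformulated answer."""
    agree = 0
    for inputs in samples:
        try:
            if new_service(*inputs) == candidate(*inputs):
                agree += 1
        except Exception:
            pass  # the reformulation may not cover these inputs
    return agree / len(samples)

def latest_rates(frm, to):          # known source: LatestRates($from,$to,rate)
    table = {("EUR", "USD"): 1.30799, ("USD", "EUR"): 0.764526}
    return table.get((frm, to))

candidates = [
    lambda f, t: latest_rates(f, t),   # def_1: exchange(from, to, val)
    lambda f, t: latest_rates(t, f),   # def_2: exchange(to, from, val)
]

def rate_finder(f, t):              # the new, unmodeled service
    return latest_rates(f, t)       # (stub: happens to behave like def_1)

samples = [("EUR", "USD"), ("USD", "EUR")]
print([test_candidate(c, rate_finder, samples) for c in candidates])
# [1.0, 0.0]: def_1 agrees on every sample, def_2 on none
```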

  21. Use Case: Zip Code Data
  • A single real zip-code service with multiple operations
  • The first operation is already defined as:
  getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
      centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance).
  • The goal is to induce a definition for a second operation:
  getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
      centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance2),
      (distance2 ≤ distance1), (distance1 ≤ 300).
  • Same service, so there is no need to classify inputs/outputs or match constants!

  22. Generating Definitions: ILP
  • We want to induce a source definition for: getZipCodesWithin($zip1, $distance1, zip2, distance2)
  • Predicates available for generating definitions: {centroid, distanceInMiles, ≤, =}
  • The new type signature contains that of the known source
  • Use the known definition as the starting point for a local search:
  getDistanceBetweenZipCodes($zip1, $zip2, distance) :-
      centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance).
  • Candidates can be pruned: some are invalid (e.g., d2 unbound; #d is a constant), some are uncheckable (e.g., lt1 inaccessible), and some are contained in other definitions (defs 2 & 4)

  23. Preliminary Results
  Settings:
  • Number of zip code constants initially available: 6
  • Number of samples performed per trial: 20
  • Number of candidate definitions in the search space: 5
  Results:
  • Converged on an "almost correct" definition!
  • Number of iterations to convergence: 12
  getZipCodesWithin($zip1, $distance1, zip2, distance2) :-
      centroid(zip1, lat1, long1), centroid(zip2, lat2, long2),
      distanceInMiles(lat1, long1, lat2, long2, distance2),
      (distance2 ≤ distance1), (distance1 ≤ 243).

  24. Related Work
  • Classifying web services (Hess & Kushmerick 2003; Johnston & Kushmerick 2004): classify inputs/outputs/services using metadata and data; we learn semantic relationships between inputs and outputs
  • Category translation (Perkowitz & Etzioni 1995): learns functions describing operations available on the internet; we concentrate on a relational modeling of services
  • CLIO (Yan et al. 2001): helps users define complex mappings between schemas, but does not automate the discovery of those mappings
  • iMAP (Dhamankar et al. 2004): automates the discovery of certain complex mappings; our approach is more general (ILP), is tailored to web sources, and must deal with the problem of generating valid input tuples

  25. Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans

  26. Dynamically Building Integration Plans
  Traditional data integration techniques: a user asks a mediator to "Find information about all proteins that participate in the Transcription process" and gets back results such as (1) SwissProtein: P36246, (2) GeneBank: AAS60665.1, ...

  27. Dynamically Building Integration Plans (Cont'd)
  Problem solved here: create a new web service that accepts the name of a biological process, <bname>, and returns information about the proteins that participate in it, composed on top of the mediator

  28. Problem Statement (Cont'd)
  • Assumption: information-producing web service operations
  • Applicability: biological data web services, geospatial services (WMS, WFS), and other applications that do not focus on transactions

  29. Query-based Web Service Composition
  • View web service operations as source relations with binding restrictions (these can be inferred from WSDL)
  • Create a domain ontology
  • Describe source relations in terms of domain relations (a combined Global-as-View / Local-as-View approach)
  • Use a data integration system to answer user queries

  30. Template-based Web Service Composition
  • Our goal is to compose new web services, so we need to answer template queries, not specific queries
  • Template-based query approach: generate plans that account for general parameter values, i.e., a universal plan [Schoppers 1987]
  • A universal plan is easy to generate and answers the template query as opposed to a specific query, but such plans can be very inefficient
  • We therefore need to generate optimized "universal integration plans"

  31. Example Scenario
  Sources:
  Protein:
  • HSProtein($id, name, location, function, seq, pubmedid)
  • MMProtein($id, name, location, function, seq, pubmedid)
  • TranducerProtein($id, name, location, taxonid, seq, pubmedid)
  • MembraneProtein($id, name, location, taxonid, seq, pubmedid)
  • DipProtein($id, name, location, taxonid, function)
  Protein-Protein Interactions:
  • MMProteinInteractions($fromid, toid, source, verified)
  • HSProteinInteractions($fromid, toid, source, verified)

  32. Example Rules and Query
  ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
      HSProteinInteractions(fromid, toid, source, verified), (taxonid = 9606)
  ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
      MMProteinInteractions(fromid, toid, source, verified), (taxonid = 10090)
  ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
      ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
      ProteinProteinInteractions(itoid, toid, taxonid, source, verified)
  Q(fromid, toid, taxonid, source, verified) :-
      (fromid = !fromid), (taxonid = !taxonid),
      ProteinProteinInteractions(fromid, toid, taxonid, source, verified)

  33. Unoptimized Plan

  34. Optimized Plan
  • Exploit constraints in the source descriptions to filter queries to sources (a toy sketch of this pruning follows)
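A toy sketch of that idea: each source description carries an equality constraint (the taxonid constants from slide 32), and a source is pruned whenever its constraint contradicts the query's bindings. The dictionary encoding is an assumption for illustration, not the planner's actual representation.

```python
# Source descriptions carry equality constraints (from slide 32):
SOURCES = {
    "HSProteinInteractions": {"taxonid": 9606},   # human
    "MMProteinInteractions": {"taxonid": 10090},  # mouse
}

def relevant_sources(bindings):
    """Keep a source only if its constraints agree with the query bindings;
    inconsistent sources are pruned at planning time."""
    return [name for name, constraints in SOURCES.items()
            if all(bindings.get(k, v) == v for k, v in constraints.items())]

print(relevant_sources({"taxonid": 9606}))  # ['HSProteinInteractions']
print(relevant_sources({}))                 # both: taxonid is unbound
```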

  35. Example Scenario
  Q1(fromid, fromname, fromseq, frompubid, toid, toname, toseq, topubid) :-
      (fromid = !fromproteinid),
      Protein(fromid, fromname, loc1, f1, fromseq, frompubid, taxonid1),
      ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
      Protein(toid, toname, loc2, f2, toseq, topubid, taxonid2)
  Composed plan (diagram): the input fromproteinid feeds a Protein lookup (yielding fromseq), is joined with Protein-Protein Interactions (yielding toproteinid), which feeds a second Protein lookup (yielding toseq); output: fromproteinid, fromseq, toproteinid, toseq

  36. Example Integration Plan

  37. Adding Sensing Operations for Tuple-level Filtering
  • Compute the original plan for a template query
  • For each constraint on the sources: introduce the constraint into the query, rerun the inverse rules algorithm, and compare the cost of the new plan to the original plan
  • Save the plan with the lowest cost (see the sketch below)
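A minimal sketch of this loop, with the planner and the cost model reduced to stubs: `plan_for` simply echoes the query where the real system would run the inverse rules algorithm, and `cost` halves with each pushed constraint. Both stubs are pure assumptions made so the loop structure is runnable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    relation: str
    constraints: tuple = ()
    def with_constraint(self, c):
        return Query(self.relation, self.constraints + (c,))

def plan_for(query):
    # Stand-in for the inverse rules algorithm: here a "plan" is just the
    # query itself; the real planner produces a datalog integration plan.
    return query

def cost(plan):
    # Toy cost model: each constraint pushed into the plan filters tuples
    # early and halves the estimated work (an illustrative assumption).
    return 100 / (2 ** len(plan.constraints))

def optimize_with_sensing(query, source_constraints):
    """The slide's loop: fold each source constraint into the template
    query, re-plan, and keep the cheapest plan found."""
    best = plan_for(query)
    for c in source_constraints:
        candidate = plan_for(query.with_constraint(c))
        if cost(candidate) < cost(best):
            best = candidate
    return best

q = Query("ProteinProteinInteractions")
print(optimize_with_sensing(q, ["taxonid=9606"]))
```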

  38. Optimized Universal Integration Plan

  39. Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans

  40. Dataflow-style, Streaming Execution
  • Map datalog plans into a streaming, dataflow execution system (e.g., a network query engine)
  • We use the Theseus execution system, since it supports recursion
  • Key challenges: mapping non-recursive plans; mapping recursive plans (data processing, loop detection, query-results update, termination check, recursive callback)

  41. Example Translation
  ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
      HSProteinInteractions(fromid, toid, source, verified), (taxonid = 9606)
  ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
      MMProteinInteractions(fromid, toid, source, verified), (taxonid = 10090)
  ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :-
      ProteinProteinInteractions(fromid, itoid, taxonid, source, verified),
      ProteinProteinInteractions(itoid, toid, taxonid, source, verified)
  Q(fromid, toid, taxonid, source, verified) :-
      ProteinProteinInteractions(fromid, toid, taxonid, source, verified),
      (fromid = !fromproteinid), (taxonid = !taxonid)
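The Theseus plan itself is not reproduced in the transcript, but the recursion challenges listed on slide 40 (loop detection, termination check) can be illustrated with a small semi-naive fixpoint over the recursive interaction rule. This is a sketch of the evaluation semantics only, not the Theseus dataflow implementation, and the interaction edges are fabricated for the example.

```python
def transitive_interactions(direct):
    """Semi-naive fixpoint for the recursive rule: derive new interaction
    pairs only from the previous round's delta; stop when no new tuples
    appear (the 'termination check'), and discard already-seen tuples
    (the 'loop detection' on slide 40)."""
    total, delta = set(direct), set(direct)
    while delta:
        derived = {(a, c)
                   for (a, b) in delta
                   for (b2, c) in direct if b2 == b}
        delta = derived - total   # dropping duplicates is what breaks cycles
        total |= delta
    return total

# Toy interaction edges, including a cycle P1 -> P2 -> P3 -> P1.
edges = {("P1", "P2"), ("P2", "P3"), ("P3", "P1")}
print(sorted(transitive_interactions(edges)))
```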

  42. Example Theseus Plan

  43. Bio-informatics Domain Results
  • Experiments in the bio-informatics domain, where we have 60 real web services provided by NCI
  • We varied the number of domain relations in a query from 1 to 30 and report composition time together with execution time

  44. Tuple-level Filtering • Tuple-level filtering can improve the execution time of the generated integration plan by up to 53.8%

  45. Improvement due to Theseus • Theseus can improve the execution time of the generated web service with complex plans by up to 33.6%

  46. Discussion • Huge number of sources available • Need tools and systems that support the dynamic integration of these sources • In this talk, I described techniques for: • Extracting data from unstructured and ungrammatical sources • Discovering models of online sources required for integration • Dynamic and efficient integration of web sources • Efficient execution of integration plans • Much work still left to be done…

  47. More information…
  • http://www.isi.edu/~knoblock
  • Matthew Michelson and Craig A. Knoblock. Semantic Annotation of Unstructured and Ungrammatical Text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.
  • Mark James Carman and Craig A. Knoblock. Inducing Source Descriptions for Automated Web Service Composition. In Proceedings of the AAAI 2005 Workshop on Exploring Planning and Scheduling for Web Services, Grid, and Autonomic Computing, 2005.
  • Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, Optimizing, and Executing Plans for Bioinformatics Web Services. VLDB Journal, Special Issue on Data Management, Analysis and Mining for Life Sciences, 14(3):330-353, September 2005.
