Web Service Composition and Record Linking

Mark Cameron, Kerry Taylor & Rohan Baxter, CSIRO Information & Communication Technologies Centre.

Presentation Transcript


  1. Web Service Composition and Record Linking
     Mark Cameron, Kerry Taylor & Rohan Baxter
     CSIRO Information & Communication Technologies Centre

  2. Problem…
     • As more and more data sources become available to an information integration system, it becomes more difficult to track an individual entity.
     • This is especially true where there is no unique global identifier for an entity.
     • There are many causes:
         • Data sources contain inconsistent information
         • Collected for different purposes
         • Quality control of data entry
         • Currency of data
     [Figure labels: New Patients, All Patients]

  3. Problem & One Solution…
     • How can we reconcile entity identity across multiple inconsistent data sources?
     • Apply record-linking technique(s)
     • Record linking is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources.

  4. Record Linking
     [Figure labels: Link Table, New Patients, All Patients]

  5. Record Linking
     [Figure: Classification plot with regions labelled match, partial match and no match]
     • But record linking is a difficult task for non-specialists
         • Need to know how to go about record linking to avoid the worst-case n² record comparison scenario
         • Access (multiple) specialist applications
             • Commercial systems cost big $$$$
             • Academic systems are free but built for research:
                 • Febrl, Freely Extensible Biomedical Record Linkage (Python): http://sourceforge.net/projects/febrl
                 • SecondString (Java string comparison library)
         • Write task specifications for each application
             • Bewildering array of parameters for most functions
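
The worst-case n² scenario mentioned on this slide can be illustrated with a small count. This is only a sketch: the function names and the first-letter blocking key are assumptions made for the example, not part of the presentation or of Febrl.

```python
from collections import Counter

def full_comparisons(a, b):
    """Worst case: every record in a is compared with every record in b."""
    return len(a) * len(b)

def blocked_comparisons(a, b, key):
    """Only records that share a blocking key are compared."""
    count_a = Counter(key(r) for r in a)
    count_b = Counter(key(r) for r in b)
    # A Counter returns 0 for keys it has not seen, so unmatched blocks cost nothing.
    return sum(n * count_b[k] for k, n in count_a.items())

new_patients = ["smith", "smyth", "jones"]
all_patients = ["smith", "jones", "brown", "baxter"]

def first_letter(name):
    # Hypothetical blocking key chosen for illustration only.
    return name[0]

print(full_comparisons(new_patients, all_patients))
print(blocked_comparisons(new_patients, all_patients, first_letter))
```

With these toy lists the full comparison needs 12 pairs, while blocking on the first letter leaves 3.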

  6. Record Linking
     • Wouldn’t it be nice if…
         • Non-specialists could use record-linking services for their domain-specific purposes
         • We could mix and match different linking service implementations as required
         • We had the flexibility of both high-level (non-specialist) and low-level (specialist) use of record linking services
             • Non-specialists just use a virtual process definition, supplying data and parameters as needed
             • Specialists can set low-level parameters
             • Specialists can modify virtual process specifications

  7. Our Approach…
     • Aim to support service composition to deliver knowledge-intensive applications and data products
         • rapidly
         • flexibly/adaptively
         • scalably (in complexity, numbers of resources, data volumes)
         • knowledgeably
     • Consequently, we need
         • Data-level interoperability
             • Data integration through views (where standards may not apply)
         • Fine-grained services
             • Machine-readable, fine-grained description
         • Sensible data management
     • And
         • Declarative, goal-directed composition (like SQL)
         • Tools for extracting, merging and displaying the knowledge
         • Put knowledge tools in the hands of the domain expert
             • like a spreadsheet

  8. Information Integration Theory…
     • I = <G,M,S> (Lenzerini 2002):
         • Source schemas (S)
             • A (local) representation of the data and services available to or known about by the IIS.
         • Mapping statements (M)
             • Expressions that map schemas and services between G and S in one of four styles: P2P; GAV; LAV; GLAV.
         • A global schema (G)
             • This is a (global) representation of the (integrated) domain, including real and virtual data schemas, real and virtual transformation services and domain constraints, against which actors can address queries: Q::q(G)
     • The system essentially turns a user query Q::q(G) into a union of conjunctive queries against individual resources Q::q'(S)

  9. Our Approach…
     [Architecture diagram labels: RuntimeEnvironment, CompositionCompiler, WorkflowEngine, DomainGUI, DomainModel, Mappings, IntegrationDatabase, Data Resources, Transformation Resources]

  10. Our Approach…
     [Diagram labels: Query Generation, Declarative Resource Model, Web Service «call», User Query, Integration Compiler, Workflow Execution, Delegated Call Evaluation, Service Execution Monitoring, Workflow]
     • Process flow computed at compile-time from declarative resource model and user query
     • Runtime infrastructure invokes services

  11. Disease Registry Example
     • Will our approach work?
         • How do we build web services?
             • extract fine-grained functional components from existing packages
         • Is compile-time composition feasible?
         • Is our runtime environment going to perform acceptably?
     • Experiment to see what impact our choices have
         • Service/operation interface for packages
         • SQL generation technique
         • Composition (i.e. workflow) vs stand-alone application performance

  12. Disease Registry Example
     • Problem specification written in domain terms
     • To link New Patients with Existing Patients
         • Get new patient data
         • Get existing patient data
         • Use probabilistic linking to identify individuals who match in both datasets
     • To probabilistically link datasets A and B
         • Standardise A and Standardise B
         • Index standardised A and Index standardised B
         • Compare each in A with each in B when indexes match
         • Classify compared into (match; partial match; no match)

  13. A Process Model for Record Linking
     link(A,B,C) ←
         standardize(A,As), standardize(B,Bs),
         index(As,AsI), index(Bs,BsI),
         comparison(AsI,BsI,Cs),
         classify(Cs,C).
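
The process model above reads as a chain of four stages. A minimal Python sketch of that chain follows. Every stage body here is a toy stand-in chosen for illustration (lower-casing, first-letter blocking, `difflib` similarity and the 0.9/0.7 thresholds are all assumptions); the presentation's actual services wrap packages such as Febrl.

```python
from difflib import SequenceMatcher

def standardize(records):
    """Toy standardisation: trim and lower-case the name of each (id, name) record."""
    return [(rid, name.strip().lower()) for rid, name in records]

def index(records):
    """Toy blocking index: group records by the first letter of the name."""
    blocks = {}
    for rid, name in records:
        blocks.setdefault(name[:1], []).append((rid, name))
    return blocks

def comparison(index_a, index_b):
    """Compare each record in A with each record in B only where index keys match."""
    for key, recs_a in index_a.items():
        for id_a, name_a in recs_a:
            for id_b, name_b in index_b.get(key, []):
                yield id_a, id_b, SequenceMatcher(None, name_a, name_b).ratio()

def classify(compared, upper=0.9, lower=0.7):
    """Label each compared pair; the thresholds are illustrative assumptions."""
    labelled = []
    for id_a, id_b, score in compared:
        if score >= upper:
            label = "match"
        elif score >= lower:
            label = "partial match"
        else:
            label = "no match"
        labelled.append((id_a, id_b, label))
    return labelled

def link(a, b):
    """Mirror of link(A,B,C): standardize, index, compare, classify."""
    a_idx, b_idx = index(standardize(a)), index(standardize(b))
    return classify(comparison(a_idx, b_idx))

new_patients = [(1, " Smith "), (2, "Jones")]
all_patients = [(10, "smith"), (11, "smyth"), (12, "jonas")]
result = link(new_patients, all_patients)
print(result)
```

Because each stage only consumes the previous stage's output, any one of them could be swapped for a different implementation without touching `link`, which is the point of the process model.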

  14. Process Template
     Virtual Process Specification:
     link(A,B,C) ←
         standardize(A,As), standardize(B,Bs),
         index(As,AsI), index(Bs,BsI),
         comparison(AsI,BsI,Cs),
         classify(Cs,C).
     Specify Virtual Service(s):
     index( standard(Id, Gname, Sname, Dob, …),
            indexed(ClusterId, standard(Id, Gname, Sname, Dob, …))) ←
         truncate(Gname,4,GnameTrunc),
         truncate(Sname,4,SnameTrunc),
         block_index(GnameTrunc, SnameTrunc, ClusterId).
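
The virtual `index` service above is defined by composing `truncate` and `block_index`. A hedged Python rendering of that composition is sketched below. The dictionary record layout and the `|` separator used to build the cluster identifier are assumptions for the example; the slide does not say how `block_index` derives `ClusterId`.

```python
def truncate(value, length):
    """Keep the first `length` characters, as in truncate(Gname,4,GnameTrunc)."""
    return value[:length]

def block_index(gname_trunc, sname_trunc):
    """Hypothetical cluster id: the two truncated names joined by '|'."""
    return f"{gname_trunc}|{sname_trunc}"

def index(standard):
    """Wrap a standardised record with its cluster id, like indexed(ClusterId, standard(...))."""
    cluster_id = block_index(truncate(standard["Gname"], 4),
                             truncate(standard["Sname"], 4))
    return {"ClusterId": cluster_id, "standard": standard}

record = {"Id": 1, "Gname": "rohan", "Sname": "baxter", "Dob": "unknown"}
indexed = index(record)
print(indexed["ClusterId"])
```

Records whose given name and surname share the same first four characters land in the same cluster, so only those pairs reach the comparison stage.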

  15. Query The Process Model
     probabilistic_link(classified(Z)) ←
         'newpatients@laboratoryDS'(A,B,C,D,E,F,G),
         'allpatients@registryDS'(H,I,J,K,L,M),
         link( ('newpatients'(A,B,C,D,E,F,G)),
               ('allpatients'(H,I,J,K,L,M)),
               classified(Z)).

  16. Unfolded Query

  17. Compiler Generates Dependency Graph

  18. SQL Generation
     • Leaf nodes are data generators
     • Each non-terminal node is a web service call
     • SQL for input generated from backward closure of dependency graph
     • Call results stored in table
     • Cartesian products have disjoint backward closures
     • Heuristic optimization delays CP evaluation
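
To make the backward-closure idea concrete, here is a small sketch over a hand-written dependency graph for the record-linking process. The node names and the dictionary representation are assumptions for illustration; the compiler's real data structures are not shown in the presentation.

```python
def backward_closure(graph, node):
    """Return every node that `node` depends on, directly or transitively.

    `graph` maps each node to the list of nodes that feed it.
    """
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen

# Hypothetical dependency graph: leaves are data generators,
# every other node stands for a web service call.
deps = {
    "standardize_A": ["newpatients"],
    "standardize_B": ["allpatients"],
    "index_A": ["standardize_A"],
    "index_B": ["standardize_B"],
    "comparison": ["index_A", "index_B"],
    "classify": ["comparison"],
}

closure_a = backward_closure(deps, "index_A")
closure_b = backward_closure(deps, "index_B")
print(closure_a, closure_b, closure_a.isdisjoint(closure_b))
```

In this toy graph the two inputs of `comparison` have disjoint backward closures, which is one reading of the slide's remark that such inputs combine as a Cartesian product.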

  19. Performance
     • We were disappointed
         • On 5k run, one large join taking over ½ hour and consuming all available disk space
         • SQL statements not being optimised
     • Use a temporary table to speed join
         • 30 sec to process join!
     • We were surprised

  20. Why Did We Get Non-linear Improvement?
     • Distribution of work between database and record linking service machines
     • Lots of data parallelism
         • Task level (e.g. date_standardize; truncate; name_standardize)
         • Message level (e.g. time to process 500-record message block)
     • Bounded memory requirement for linking service to process 500-record message block
         • Non-linear virtual memory requirement a known issue for Febrl
     • Limited synchronisation points
     • Execution time of implementation languages is significant!
         • Java vs Python between 1:200 and 1:1000
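
The message-level point can be sketched as a generator that cuts a record stream into fixed-size blocks, so a service only ever holds one block in memory. The function name is hypothetical; only the block size of 500 comes from the slide.

```python
from itertools import islice

def message_blocks(records, size=500):
    """Yield successive lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block

blocks = list(message_blocks(range(1200)))
print([len(b) for b in blocks])
```

Each block could be sent as one web service message, which keeps the memory needed per call bounded regardless of the total data volume.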

  21. Conclusions & Future Work
     • Simple process model for record linking
         • We plan to incorporate recursion & iteration
     • Array-style invocation enables us to pass more (uniform) data in each web service message
         • Much more work needed for complex structured messages
     • Virtual service specifications enable flexible implementation choice
         • But someone or something must construct them!
     • Compiler automates the tedious work
         • We need to look more closely at join performance!
     • Incremental treatment of changes not addressed
         • Obvious application of incremental view maintenance techniques

