Crossing the Structure Chasm Alon Halevy University of Washington, Seattle UCLA, April 15, 2004
The Structure Chasm
• Authoring: writing text vs. creating a schema or using someone else's schema.
• Querying: keywords vs. structured queries.
• Data sharing: easy for text vs. committees and standards for structured data.
• The payoff of structure: we can pose complex queries.
Why is This a Problem? • Databases used to be isolated and administered only by experts. • Today’s applications call for large-scale data sharing: • Big science (bio-medicine, astrophysics, …) • Government agencies • Large corporations • The web (over 100,000 searchable data sources) • The vision: • Content authoring by anyone, anywhere • Powerful database-style querying • Use relevant data from anywhere to answer the query • The Semantic Web • Fundamental problem: reconciling different models of the world.
Outline • Two motivating scenarios: • A web of structured data • Personal data management • A tour of recent data sharing architectures • Data integration systems • Peer-data management systems • The algorithmic problems: • Query reformulation • Reconciling semantic heterogeneity • Reconsidering authoring and querying challenges
Large-Scale Scientific Data Sharing
Sources and sites: Swiss-Prot, HUGO, OMIM, GeneClinics, UW Microbiology, UW Genome Sciences, UCLA Genetics.
Non-urgent Applications
Tax time: data shared among B of A, Fidelity, employer tax reports, a county real-estate DB, the IRS 1040 DB, state tax agencies (California, NY), and UW.
Personal Data Management [Semex: Sigurdsson, Nemes, H.]
• Today, data is organized by application: mail & calendar, HTML files, presentations, papers.
• Semex instead exposes a graph of objects (Person, Web Page, Document, Paper, Presentation, Message, Event) linked by associations (Author, Sender, Recipients, Organizer, Participants, Homepage, Softcopy, Cached, Cites).
Finding Publications
Publication: "What Can Peer-to-Peer Do for Databases, and Vice Versa"
Persons: A. Halevy, Dan Suciu, Maya Rodrig, Steven Gribble, Zachary Ives
Following Associations (1): from the person Bernstein to his publications.
Following Associations (2): Bernstein's publications include:
• "A survey of approaches to automatic schema matching"
• "Corpus-based schema matching"
• "Database management for peer-to-peer computing: A vision"
• "Matching schemas by learning from others"
Following Associations (3): from a publication to its citations and the publications that cite it.
Following Associations (4): from Bernstein's publications to the authors they cite.
PIM Data Sharing Challenges
• Need to combine data from multiple applications/sources.
• After an initial set of concepts is given, we need to:
• extend and personalize the concept hierarchy,
• share (parts of) our data with others,
• incorporate external data into our view.
• Also need instance-level reconciliation:
• Alon Halevy, A. Halevy, Alon Y. Levy – same guy!
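The instance-level reconciliation problem can be sketched with a toy heuristic (invented here for illustration): collapse name variants to a last-name-plus-first-initial key. Note that a changed surname, as in "Alon Y. Levy" vs. "Alon Halevy", defeats such a simple key and needs richer evidence.

```python
import re

def name_signature(name):
    """Crude signature for reconciling author-name variants:
    lowercase last name plus first initial, e.g.
    "Alon Halevy" -> ("halevy", "a")."""
    parts = [p for p in re.split(r"[\s.,]+", name.lower()) if p]
    last = parts[-1]
    first_initial = parts[0][0] if len(parts) > 1 else ""
    return (last, first_initial)

variants = ["Alon Halevy", "A. Halevy", "Alon Y. Halevy"]
sigs = {name_signature(v) for v in variants}
# All three variants collapse to one signature, ("halevy", "a").
```

This catches abbreviation and middle-initial variants only; real reconciliation systems combine many such signals.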
Outline • Two motivating scenarios: • A web of structured data • Personal data management • A tour of recent data sharing architectures • Data integration systems • Peer-data management systems • The algorithmic problems: • Query reformulation • Reconciling semantic heterogeneity • Reconsidering authoring and querying challenges
Data Integration • Goal: provide a uniform interface to a set of autonomous data sources. • New abstraction layer over multiple sources. • Many research projects (DB & AI) • Mine: Information Manifold, Tukwila, BioMediator • Others: Garlic (IBM), Ariadne (USC), XMAS (UCSD), … • Recent "Enterprise Information Integration" industry: • Startups: Nimble, Enosys, Composite, MetaMatrix • Products from big players: BEA, IBM
Relational Abstraction Layer
• Schema: the template for data (Students, Takes, Courses).
• Queries:
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
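The relational abstraction can be exercised end-to-end with an in-memory SQLite database; the table contents below are invented for the example.

```python
import sqlite3

# Toy Students/Takes/Courses instance for the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students(ssn TEXT, name TEXT);
    CREATE TABLE Takes(ssn TEXT, cid TEXT);
    CREATE TABLE Courses(cid TEXT, name TEXT);
    INSERT INTO Students VALUES ('123', 'Mary');
    INSERT INTO Takes VALUES ('123', 'CSE444');
    INSERT INTO Courses VALUES ('CSE444', 'Databases');
""")
rows = conn.execute("""
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
""").fetchall()
# rows == [('Databases',)]
```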
Data Integration: Higher-Level Abstraction
A query Q is posed over the mediated schema; semantic mappings reformulate it into queries Q1, Q2, Q3 over the individual sources.
Mediated Schema [www.biomediator.org; Tarczy-Hornoch, Mork]
• Concepts: Entity, Sequenceable Entity (Gene, Nucleotide Sequence, Protein), Structured Vocabulary, Experiment (Microarray Experiment), Phenotype.
• Sources: OMIM, HUGO, Swiss-Prot, GO, GeneClinics, LocusLink, Entrez, GEO.
• Query: For the micro-array experiment I just ran, what are the related nucleotide sequences, and for what protein do they code?
Semantic Mappings
Inventory Database A:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
Inventory Database B:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  Artists(ASIN, ArtistName, GroupName)
• Differences in: names in schemas, attribute grouping, coverage of databases, granularity and format of attributes.
Key Issues
• Formalism for mappings between the sources and the mediated schema.
• Reformulation algorithms: translating a query Q over the mediated schema into queries Q' over the sources.
• How will we create the mappings?
Beyond Data Integration • Mediated schema is a bottleneck for large-scale data sharing • It’s hard to create, maintain, and agree upon.
Peer Data Management Systems
Piazza: [Tatarinov, H., Ives, Suciu, Mork]
• Mappings specified locally.
• Map to the most convenient nodes.
• Queries answered by traversing semantic paths (e.g., among CiteSeer, DBLP, Stanford, UW, UCSD, UCLA, UC Berkeley).
PDMS-Related Projects • Hyperion (Toronto) • PeerDB (Singapore) • Local relational models (Trento) • Edutella (Hannover, Germany) • Semantic Gossiping (EPFL, Lausanne) • Raccoon (UC Irvine) • Orchestra (U. Penn)
A Few Comments about Commerce
• Until 5 years ago: data integration = data warehousing.
• Since then:
• A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys.
• The big players made announcements (IBM, BEA), and after a delay released products.
• Success: analysts have a new buzzword – EII, a new addition to the acronym soup (with EAI).
• Lessons: performance was fine; what's needed are management tools.
Data Integration: Before
A single mediated schema over the sources; a query Q is reformulated into a query Q' per source.
Data Integration: After (Nimble's architecture)
• User applications and front-end tools: Lens Builder™, Lens™ File, InfoBrowser™, Software Developer's Kit, NIMBLE™ APIs, management tools.
• Integration layer (Nimble Integration Engine™): compiler, executor, metadata server, cache, security tools, common XML view.
• Sources: XML, relational, data warehouse/mart, legacy, flat files, web pages.
• Roles: software developers, integration builder, concordance developer, data administrator.
Sound Business Models
• Explosion of intranet and extranet information:
• 80% of corporate information is unmanaged.
• By 2004, 30× more enterprise data than in 1999.
• The average company:
• maintains 49 distinct enterprise applications,
• spends 35% of its total IT budget on integration-related efforts.
Source: Gartner, 1999
Outline • Two motivating scenarios: • A web of structured data • Personal data management • A tour of recent data sharing architectures • Data integration systems • Peer-data management systems • The algorithmic problems • Query reformulation • Reconciling semantic heterogeneity • Reconsidering authoring and querying challenges
Languages for Schema Mapping
A query Q over the mediated schema is reformulated into queries Q' over the sources. Mapping formalisms: GAV (global-as-view), LAV (local-as-view), GLAV.
GLAV Mappings
Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name)
Sources: R1a, R1b, R2, R3, R4, R5
Example mapping (R1a and R1b cover books before 1970):
R1a(isbn, title, n), R1b(isbn, genre, n) ⊆ Book(isbn, title, genre, year), Author(isbn, n), year < 1970
Query Reformulation
Query: find authors of humor books.
Mediated schema: Book(ISBN, Title, Genre, Year); Author(ISBN, Name)
Sources R1–R5 include books before 1970 and humor books; R5 is described by the view:
R5(x, y) :- Book(x, y, "Humor")
Plan: R1 ⋈ R5
Answering Queries Using Views • Formal Problem: can we use previously answered queries to answer a new query? • Challenge: need to invert query expression. • Results depend on: • Query language used for sources and queries, • Open-world vs. Closed-world assumption • Allowable access patterns to the sources • MiniCon [Pottinger and H., 2001]: scales to thousands of sources. • Every commercial DBMS implements some version of answering queries using views.
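A minimal concrete instance of answering a query using a view, with an invented schema and data: here R5 plays the role of a previously materialized view (humor books) over Book, and the new query "authors of humor books" is rewritten to read from the view rather than re-deriving it.

```python
import sqlite3

# Toy base tables and a materialized view; names and data are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Book(isbn TEXT, title TEXT, genre TEXT, year INT);
    CREATE TABLE Author(isbn TEXT, name TEXT);
    INSERT INTO Book VALUES ('1', 'Funny Stuff', 'Humor', 1965);
    INSERT INTO Author VALUES ('1', 'Jones');
    -- R5 plays the role of a materialized view: humor books only.
    CREATE TABLE R5 AS SELECT isbn, title FROM Book WHERE genre = 'Humor';
""")
# The new query, rewritten to use the view R5 instead of scanning Book.
rows = conn.execute(
    "SELECT A.name FROM R5 V, Author A WHERE V.isbn = A.isbn").fetchall()
# rows == [('Jones',)]
```

Deciding when such a rewriting is correct in general (query languages, open vs. closed world, access patterns) is exactly the formal problem described above.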
Some Open Research Issues
• Managing large networks of mappings:
• Consistency
• Trust
• Improving networks: finding additional mappings.
• Indexing heterogeneous data across the network.
• Caching: where? what?
Outline • Two motivating scenarios: • A web of structured data • Personal data management • A tour of recent data sharing architectures • Data integration systems • Peer-data management systems • The algorithmic problems • Query reformulation • Reconciling semantic heterogeneity • Reconsidering authoring and querying challenges
Semantic Mappings
• Need mappings in every data sharing architecture.
• "Standards are great, but there are too many."
Inventory Database A:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  Authors(ISBN, FirstName, LastName)
Inventory Database B:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  Artists(ASIN, ArtistName, GroupName)
Why is it so Hard? • Schemas never fully capture their intended meaning: • We need to leverage any additional information we may have. • A human will always be in the loop. • Goal is to improve designer’s productivity. • Solution must be extensible. • Two cases for schema matching: • Find a map to a common mediated schema. • Find a direct mapping between two schemas.
Typical Matching Heuristics
• We build a model for every element from multiple sources of evidence in the schemas:
• Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
• Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book
• Data types and data instances: DateTime vs. Integer; addresses have similar formats
• Schema structure: all books have similar attributes
• In isolation, these techniques are incomplete or brittle: we need a principled combination.
• The models consider only the two schemas being matched.
Using Past Experience
• Matching tasks are often repetitive.
• Humans improve over time at matching; a matching system should improve too!
• LSD: learns to recognize elements of the mediated schema across data sources.
• [Doan, Domingos, H., SIGMOD-01, MLJ-03]
• Doan: 2003 ACM Distinguished Dissertation Award.
Example: Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
  location: Miami, FL; Boston, MA; …
  listed-price: $250,000; $110,000; …
  phone: (305) 729 0831; (617) 253 1429; …
  comments: Fantastic house; Great location; …
Learned hypotheses:
  If "phone" occurs in the element name => agent-phone
  If "fantastic" & "great" occur frequently in data values => description
homes.com: price ($550,000; $320,000; …), contact-phone ((278) 345 7215; (617) 335 2315; …), extra-info (Beautiful yard; Great beach; …)
Learning Source Descriptions • We learn a classifier for each element of the mediated schema. • Training examples are provided by the given mappings. • Multi-strategy learning: • Base learners: name, instance, description • Combine using stacking. • Accuracy of 70-90% in experiments. • Learning about the mediated schema.
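The flavor of learning source descriptions can be sketched with a stand-in "learner" (simple token counting, invented here): train a model per mediated-schema element from already-mapped data, then classify columns of a new source. LSD's real base learners (name, instance, description) and stacking combiner are far richer.

```python
from collections import Counter

def train(examples):
    """examples: mediated element -> list of observed values.
    The 'model' is just a bag of value tokens per element."""
    models = {}
    for element, values in examples.items():
        models[element] = Counter(t for v in values for t in v.lower().split())
    return models

def classify(models, values):
    """Assign a new column to the mediated element whose token bag
    overlaps its values the most."""
    tokens = Counter(t for v in values for t in v.lower().split())
    def overlap(element):
        return sum(count * models[element][t] for t, count in tokens.items())
    return max(models, key=overlap)

# Training data taken from the real-estate example above.
models = train({
    "agent-phone": ["(305) 729 0831", "(617) 253 1429"],
    "description": ["Fantastic house", "Great location"],
})
label = classify(models, ["Beautiful yard", "Great beach"])
# "great" appears in both training and test values -> "description".
```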
Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, H.]
• Can we use previous experience to match two new schemas? Can we learn about a domain?
• Maintain a corpus of schemas and matches (e.g., music and book schemas: Authors, Artists, Items, Publisher Information, Literature, CDs, Categories).
• Train a classifier for every corpus element, learn general-purpose knowledge, and reuse the extracted knowledge to match new schemas.
Exploiting the Corpus
• Given elements s ∈ S and t ∈ T, how do we determine whether s and t are similar?
• The PIVOT method: elements are similar if they are similar to the same corpus concepts.
• The AUGMENT method: enrich the knowledge about an element by exploiting similar elements in the corpus.
Pivot: Measuring (Dis)agreement
• Compute each element's interpretation w.r.t. the corpus: a vector I(s) with one entry per corpus concept, where Pk = Probability(s ~ ck).
• The interpretation captures how similar an element is to each corpus concept.
• Interpretations are compared using cosine distance: Similarity(I(s), I(t)).
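The comparison step can be sketched directly; the interpretation vectors below are invented numbers over three hypothetical corpus concepts.

```python
import math

def cosine(u, v):
    """Cosine similarity between two interpretation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

# Interpretations over three corpus concepts, say (price, phone, comments).
I_s = [0.9, 0.1, 0.0]   # element "listed-price" in schema S
I_t = [0.8, 0.0, 0.2]   # element "price" in schema T
sim = cosine(I_s, I_t)
# sim is close to 1: both elements look like the corpus "price" concept.
```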
Augmenting Element Models
• Search for similar corpus concepts: pick the most similar ones from the element's interpretation.
• Build augmented element models (name, instances, type, …) from those concepts.
• Robust, since there is more training data to learn from.
• Compare elements using the augmented models.
Experimental Results
• Five domains:
• Auto and real estate: web forms
• Invsmall and inventory: relational schemas
• Nameaddr: real XML schemas
• Performance measure: F-measure, with precision and recall measured in terms of the predicted matches.
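The measure is standard and easy to state in code; the gold and predicted match sets below are made up for illustration.

```python
def f_measure(predicted, gold):
    """F1 over predicted vs. gold sets of matches."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("listed-price", "price"), ("phone", "agent-phone"),
        ("comments", "description")}
predicted = {("listed-price", "price"), ("phone", "agent-phone"),
             ("location", "description")}
f = f_measure(predicted, gold)
# 2 of 3 predictions correct: precision = recall = 2/3, so F = 2/3.
```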
Comparison over domains: corpus-based techniques perform better in all domains.
"Tough" schema pairs: significant improvement on difficult-to-match schema pairs.
Mixed corpus: a corpus with schemas from different domains can also be useful.
Other Corpus Based Tools • A corpus of schemas can be the basis for many useful tools: • Mirror the success of corpora in IR and NLP? • Back to the structure chasm: • Authoring and querying. • Auto-complete: • I start creating a schema (or show sample data), and the tool suggests a completion. • Formulating queries on new databases: • I ask a query using my terminology, and it gets reformulated appropriately.
Conclusion
• Vision: data authoring, querying, and sharing by everyone, everywhere.
• Need to make it easier to enjoy the benefits of structured data.
• Challenge: reconciling semantic heterogeneity, aided by a corpus of schemas and schema mappings.