
Text Joins in an RDBMS for Web Data Integration


Presentation Transcript


  1. Text Joins in an RDBMS for Web Data Integration. Luis Gravano, Panagiotis G. Ipeirotis (Columbia University); Nick Koudas, Divesh Srivastava (AT&T Labs - Research)

  2. Why Text Joins? Problem: The same entity often has multiple textual representations across Web data sources (e.g., the fictional companies EUROAFT INC and EUROAFT CORP in the running example of later slides).

  3. Matching Text Attributes. Need for a similarity metric! Many desirable properties:
• Match entries with typing mistakes: Microsoft Windpws XP vs. Microsoft Windows XP
• Match entries with abbreviated information: Zurich International Airport vs. Zurich Intl. Airport
• Match entries with different formatting conventions: Dept. of Computer Science vs. Computer Science Dept.
• …and combinations thereof

  4. Matching Text Attributes using Edit Distance. Edit Distance: the minimum number of character insertions, deletions, and modifications needed to transform one string into the other (e.g., Microsoft Windpws XP is one modification away from Microsoft Windows XP). Good for: spelling errors, short word insertions and deletions. Problems: word order variations, long word insertions and deletions. ("Approximate String Joins" - VLDB 2001)
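For reference, this is the standard dynamic-programming recurrence behind the metric (not spelled out on the slide): with D(i, j) the edit distance between the first i characters of s1 and the first j characters of s2,

\[
D(i, j) = \min\bigl( D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + [\, s_1[i] \neq s_2[j] \,] \bigr),
\]

with D(i, 0) = i and D(0, j) = j; the bracket is 1 when the two characters differ and 0 otherwise.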

  5. Matching Text Attributes using Cosine Similarity. Similar entries should share "infrequent" tokens: an infrequent token gets a high weight, a common token a low weight.

Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2)

Different token choices result in similarity metrics with different properties.
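The slide leaves the weights abstract; a standard choice for this kind of system (a sketch only, and our assumption as to the exact scheme) is unit-normalized tf.idf, with N the number of entries in the relation, tf(τ, t) the number of occurrences of token τ in entry t, and df(τ) the number of entries containing τ:

\[
w(\tau, t) = \frac{\mathit{tf}(\tau, t) \cdot \mathit{idf}(\tau)}{\sqrt{\sum_{\tau'} \bigl( \mathit{tf}(\tau', t) \cdot \mathit{idf}(\tau') \bigr)^{2}}},
\qquad
\mathit{idf}(\tau) = \log\frac{N}{\mathit{df}(\tau)}
\]

The log makes infrequent tokens weigh more than common ones, and the denominator normalizes each entry's weight vector to unit length, which is what keeps the similarity between 0 and 1.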

  6. Using Words and Cosine Similarity. Using words as tokens:
• Split each entry into words; similar entries share infrequent words (high weight), while common words carry low weight
• Good for word order variations and common word insertions/deletions: Computer Science Dept. ~ Dept. of Computer Science
• Problems with misspellings: Biotechnology Department ≠ Bioteknology Dept.
("WHIRL" - W. Cohen, SIGMOD'98)
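As a minimal sketch of word tokenization inside the RDBMS (our illustration, not from the deck; assumes an auxiliary table Integers(n) holding 1, 2, 3, … and single-space-separated values in a hypothetical relation R1(tid, attr)):

    -- a word starts at position n if the preceding character is a space
    -- and the character at n is not; the word ends at the next space
    SELECT r1.tid,
           SUBSTRING(r1.attr, i.n,
                     CHARINDEX(' ', r1.attr + ' ', i.n) - i.n) AS word
    FROM   R1 r1, Integers i
    WHERE  i.n <= LEN(r1.attr)
      AND  SUBSTRING(' ' + r1.attr, i.n, 1) = ' '
      AND  SUBSTRING(r1.attr, i.n, 1) <> ' ';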

  7. Using q-grams and Cosine Similarity. Using q-grams as tokens:
• Split each string into small substrings of length q (q-grams)
• Similar entries share many, infrequent q-grams

Biotechnology Department → Bio, iot, ote, tec, ech, chn, hno, nol, olo, log, ogy, …, tme, men, ent
Bioteknology Department → Bio, iot, ote, tek, ekn, kno, nol, olo, log, ogy, …, tme, men, ent

Handles naturally misspellings, word order variations, and insertions and deletions of common or short words.
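The q-grams themselves are also easy to generate in SQL with the same Integers(n) helper table (again a sketch with hypothetical table and column names; q = 3 here, and common variants first pad each string with q - 1 copies of a special character at both ends):

    -- one q-gram per valid start position in the string
    SELECT r1.tid,
           SUBSTRING(r1.attr, i.n, 3) AS qgram
    FROM   R1 r1, Integers i
    WHERE  i.n <= LEN(r1.attr) - 2;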

  8. Problem. For two entries t1, t2:

Similarity(t1, t2) = Σ_token weight(token, t1) * weight(token, t2), with 0 ≤ Similarity ≤ 1

(the bounds hold because each entry's weight vector is normalized to unit length). Problem that we address: Given two relations, report all pairs of tuples with cosine similarity above a threshold φ.

  9. Computing Text Joins in an RDBMS. Create in SQL relations RiWeights (token weights from Ri), then compute the similarity of each tuple pair:

    SELECT   r1w.tid AS tid1, r2w.tid AS tid2
    FROM     R1Weights r1w, R2Weights r2w
    WHERE    r1w.token = r2w.token
    GROUP BY r1w.tid, r2w.tid
    HAVING   SUM(r1w.weight * r2w.weight) ≥ φ

Example contents (tid, token, weight):
R2Weights: (1, HATRONIC, 0.98), (1, CORP, 0.02), (2, EUROAFT, 0.95), (2, INC, 0.05), (3, EUROAFT, 0.97), …
R1Weights: (1, EUROAFT, 0.98), (1, CORP, 0.02), (2, HATRONIC, 0.98), (2, INC, 0.01), (3, CORP, 0.03), …

This is an expensive operation: it computes the similarity of many useless pairs.
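A sketch of how the RiWeights relations can be populated, assuming a tokenized relation R1Tokens(tid, token) (a hypothetical name) produced by one of the tokenizations above; for brevity this computes raw tf.idf weights and leaves out the final normalization:

    -- idf per token: log(N / df), with df the number of tuples containing it
    SELECT token,
           LOG((SELECT COUNT(DISTINCT tid) FROM R1Tokens) * 1.0
               / COUNT(DISTINCT tid)) AS idf
    INTO   R1IDF
    FROM   R1Tokens
    GROUP BY token;

    -- unnormalized weight = tf * idf per (tid, token) pair
    SELECT   t.tid, t.token, COUNT(*) * i.idf AS weight
    INTO     R1Weights
    FROM     R1Tokens t, R1IDF i
    WHERE    t.token = i.token
    GROUP BY t.tid, t.token, i.idf;

A final pass dividing each weight by SQRT(SUM(weight * weight)) over its tid would yield the unit-length vectors the similarity definition assumes.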

  10. Sampling Step for Text Joins. Similarity = Σ_token weight(token, t1) * weight(token, t2)
• Similarity is a sum of products
• Products cannot be high when one weight is small
• Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA 97] for efficient execution inside an RDBMS)

RiWeights → RiSample (sampling, e.g., 20 times)

This eliminates low-similarity pairs (e.g., "EUROAFT INC" with "HATRONIC INC").
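The sampling step can itself stay in SQL. The sketch below is a simplified, deterministic stand-in for the randomized sampling (it materializes the expected sample count c ≈ S * weight / total instead of drawing S random trials); R2sum, defined with the next slide's query, holds the total weight of each token across R2, and the sample size S appears as the literal 128:

    -- keep each (tid, token) pair with (roughly) its expected sample count
    SELECT r2w.tid, r2w.token,
           ROUND(128 * r2w.weight / r2sum.total, 0) AS c
    INTO   R2Sample
    FROM   R2Weights r2w, R2sum r2sum
    WHERE  r2w.token = r2sum.token
      AND  128 * r2w.weight / r2sum.total >= 0.5;  -- drop counts that round to 0

Low-weight (tid, token) pairs get a count of zero and disappear, which is exactly the pruning the slide describes.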

  11. Sampling-Based Text Joins in SQL. Fully implemented in pure SQL, joining R1Weights with the sampled R2Sample:

    SELECT   r1w.tid AS tid1, r2s.tid AS tid2
    FROM     R1Weights r1w, R2Sample r2s, R2sum r2sum
    WHERE    r1w.token = r2s.token AND r1w.token = r2sum.token
    GROUP BY r1w.tid, r2s.tid
    HAVING   SUM(r1w.weight * r2sum.total * r2s.c) ≥ S * φ
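The auxiliary table R2sum referenced here can be built with one aggregation (a sketch consistent with its use in the HAVING clause):

    SELECT   token, SUM(weight) AS total
    INTO     R2sum
    FROM     R2Weights
    GROUP BY token;

Why the HAVING clause approximates the similarity test: each sampled count c has expectation S * weight(token, t2) / total(token), so SUM(r1w.weight * r2sum.total * r2s.c) has expectation S * Σ_token weight(token, t1) * weight(token, t2), i.e., S times the true similarity; comparing it against S * φ therefore approximates Similarity ≥ φ.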

  12. Experimental Setup
• 40,000 entries from an AT&T customer database, split into R1 (26,000 entries) and R2 (14,000 entries)
• Tokenizations: words; q-grams with q = 2 and q = 3
• Methods compared: variations of sample-based joins; baseline in SQL; WHIRL [SIGMOD 98], adapted for handling q-grams

  13. Metrics. Execute the (approximate) join for similarity > φ and measure:
• Precision (accuracy): the fraction of pairs in the answer whose real similarity is > φ
• Recall (completeness): the fraction of pairs with real similarity > φ that are also in the answer
• Execution time
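In symbols (our formalization of the definitions above), with A the set of pairs the approximate join returns and T the set of pairs whose real similarity exceeds φ:

\[
\mathrm{precision} = \frac{|A \cap T|}{|A|},
\qquad
\mathrm{recall} = \frac{|A \cap T|}{|T|}
\]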

  14. Comparing WHIRL and Sample-based Joins
• Sample-based joins: good recall across similarity thresholds
• WHIRL: very low recall (almost 0 for thresholds below 0.7)

  15. Changing Sample Size
• Increased sample size → better recall and precision
• Drawback: increased execution time

  16. Execution Time
• WHIRL and sample-based text joins break even at sample sizes around S = 64 to 128

  17. Contributions
• "WHIRL [Cohen, SIGMOD 98] inside an RDBMS": scalability, no data exporting/importing
• Different token choices:
  • Words: captures word swaps, deletion of common words
  • Q-grams: all of the above, plus spelling mistakes, but slower

SQL statements tested in MS SQL Server and available for download at: http://www.cs.columbia.edu/~pirot/DataCleaning/

  18. Questions?
