
Search for Approximate Matches in Large Databases



Presentation Transcript


  1. Search for Approximate Matches in Large Databases. Eugene Fink, Jaime Carbonell, Aaron Goldstein, Philip Hayes

  2. Motivation Fast identification of approximate matches in large sets of records. Applications: • Medical databases • Customer records • National security

  3. Outline • Records and queries • Search for matches • Experimental results

  4. Table of records We specify a table of records by a list of attributes. Example: We can describe patients in a hospital by their sex, age, and diagnosis.

  5. Records and queries A record includes a specific value for each attribute. A query may include lists of values and numeric ranges. Example: Record (Sex: female; Age: 30; Dx: asthma). Query (Sex: male, female; Age: 20..40; Dx: asthma, flu).

  6. Query types A point query includes a specific value for each attribute. A region query includes lists of values or numeric ranges. Example: Point query (Sex: female; Age: 30; Dx: asthma). Region query (Sex: male, female; Age: 20..40; Dx: asthma, flu).
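The two query types can be sketched in Python. This is only an illustration: the slides do not say how records or queries are stored, so the dictionary layout, the use of sets for value lists and `(low, high)` tuples for numeric ranges, and the helper name `is_point_query` are all assumptions.

```python
# Hypothetical representation; the slides do not specify any data structures.
record = {"sex": "female", "age": 30, "dx": "asthma"}

# Region query: per attribute, a set of allowed values
# or an inclusive (low, high) numeric range.
region_query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}

# Point query: the same layout, with exactly one value per attribute.
point_query = {"sex": {"female"}, "age": (30, 30), "dx": {"asthma"}}


def is_point_query(query):
    """Return True if every attribute is restricted to a single value."""
    for allowed in query.values():
        if isinstance(allowed, tuple):
            low, high = allowed
            if low != high:
                return False
        elif len(allowed) != 1:
            return False
    return True
```

Under this layout a point query is just a degenerate region query, so one matching routine can serve both types.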

  7. Exact matches A record is an exact match for a query if every value in the record belongs to the respective range in the query. [Figure: a record inside the query region, plotted on Sex, Age, and Dx axes]
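The exact-match definition translates into a short membership test. The sketch below assumes a hypothetical layout (sets for value lists, inclusive `(low, high)` tuples for numeric ranges); the names `value_in_range` and `is_exact_match` are illustrative and do not come from the slides.

```python
def value_in_range(value, allowed):
    """Check one attribute value against a set of values or an inclusive range."""
    if isinstance(allowed, tuple):
        low, high = allowed
        return low <= value <= high
    return value in allowed


def is_exact_match(record, query):
    """A record matches exactly if every value lies in the query's respective range."""
    return all(value_in_range(record[attr], allowed) for attr, allowed in query.items())


# Example data taken from the slides' record and query.
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
matching = {"sex": "female", "age": 30, "dx": "asthma"}
non_matching = {"sex": "female", "age": 30, "dx": "fracture"}
```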

  8. Approximate matches A record is an approximate match for a query if it is “close” to the query region. [Figure: a record outside but near the query region, plotted on Sex, Age, and Dx axes]

  9. Approximate queries An approximate query includes: • Point or region • Distance function • Number of matches • Distance limit
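The four parts of an approximate query can be illustrated with a brute-force scan. The slides do not define the distance function, so the one below is an assumption: a numeric value contributes its gap to the nearest range endpoint, a nominal value contributes 0 if it is listed and 1 otherwise, and the contributions are summed. All function names are hypothetical.

```python
import heapq


def attribute_distance(value, allowed):
    """Assumed per-attribute distance: range gap for numbers, 0/1 for nominal values."""
    if isinstance(allowed, tuple):
        low, high = allowed
        if value < low:
            return low - value
        if value > high:
            return value - high
        return 0
    return 0 if value in allowed else 1


def distance(record, query):
    """Assumed distance from a record to the query region: sum over attributes."""
    return sum(attribute_distance(record[attr], allowed) for attr, allowed in query.items())


def approximate_matches(records, query, num_matches, distance_limit):
    """Return up to num_matches (record, distance) pairs within the limit, closest first."""
    scored = [(distance(rec, query), i) for i, rec in enumerate(records)]
    within = [(d, i) for d, i in scored if d <= distance_limit]
    return [(records[i], d) for d, i in heapq.nsmallest(num_matches, within)]


records = [
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 45, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "fracture"},
]
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
closest = approximate_matches(records, query, num_matches=2, distance_limit=3)
```

With this distance, an exact match is simply a record at distance 0, and the second record (age 45) falls outside the limit of 3.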

  10. Outline • Records and queries • Search for matches • Experimental results

  11. Indexing structure Maintain a PATRICIA tree of records. Group nodes into fixed-size disk blocks. [Figure: tree that branches on sex, then age, then diagnosis; leaves: (female, 30, asthma), (female, 30, ulcer), (female, 30, fracture), (female, 50, flu), (male, 30, asthma), (male, 40, flu)]
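The shape of the tree in the figure can be approximated with nested dictionaries that branch on sex, then age, then diagnosis. This is a simplified stand-in, not the authors' structure: it is a plain in-memory trie without PATRICIA path compression and without grouping nodes into disk blocks, and `ATTRIBUTES` and `build_index` are hypothetical names.

```python
ATTRIBUTES = ("sex", "age", "dx")  # branching order, as in the slide's figure


def build_index(records):
    """Build a nested-dict trie: sex -> age -> diagnosis -> list of records."""
    root = {}
    for record in records:
        node = root
        for attr in ATTRIBUTES[:-1]:
            node = node.setdefault(record[attr], {})
        node.setdefault(record[ATTRIBUTES[-1]], []).append(record)
    return root


# The six records shown in the slide's figure.
RECORDS = [
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 30, "dx": "ulcer"},
    {"sex": "female", "age": 30, "dx": "fracture"},
    {"sex": "female", "age": 50, "dx": "flu"},
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 40, "dx": "flu"},
]
index = build_index(RECORDS)
```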

  12. Search for matches Depth-first search for exact matches. Best-first search for approximate matches. [Figure: the same tree, branching on sex, then age, then diagnosis, with leaves (female, 30, asthma), (female, 30, ulcer), (female, 30, fracture), (female, 50, flu), (male, 30, asthma), (male, 40, flu)]
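Both search strategies can be sketched over a small attribute trie. Everything here is an assumption made for illustration: the nested-dict trie replaces the PATRICIA tree, the additive distance (range gap for numbers, 0/1 for nominal values) is not taken from the slides, and the function names are hypothetical. Because the distance is a sum of non-negative terms, the partial distance along a path never exceeds the distance of any record below it, which is what lets the best-first search return leaves in order of distance.

```python
import heapq
from itertools import count

ATTRIBUTES = ("sex", "age", "dx")


def build_index(records):
    """Nested-dict trie; a root-to-leaf path of values identifies a record."""
    root = {}
    for record in records:
        node = root
        for value in record:
            node = node.setdefault(value, {})
    return root


def attribute_distance(value, allowed):
    """Assumed distance: gap to an inclusive range, or 0/1 for a value set."""
    if isinstance(allowed, tuple):
        low, high = allowed
        return max(low - value, value - high, 0)
    return 0 if value in allowed else 1


def exact_search(node, query, depth=0, path=()):
    """Depth-first search that follows only branches inside the query region."""
    if depth == len(ATTRIBUTES):
        yield path
        return
    allowed = query[ATTRIBUTES[depth]]
    for value, child in node.items():
        if attribute_distance(value, allowed) == 0:
            yield from exact_search(child, query, depth + 1, path + (value,))


def best_first_search(root, query, num_matches, distance_limit):
    """Expand the node with the smallest partial distance first."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    frontier = [(0, next(tie), 0, (), root)]
    results = []
    while frontier and len(results) < num_matches:
        dist, _, depth, path, node = heapq.heappop(frontier)
        if depth == len(ATTRIBUTES):
            results.append((path, dist))
            continue
        allowed = query[ATTRIBUTES[depth]]
        for value, child in node.items():
            new_dist = dist + attribute_distance(value, allowed)
            if new_dist <= distance_limit:
                heapq.heappush(frontier, (new_dist, next(tie), depth + 1, path + (value,), child))
    return results


RECORDS = [
    ("female", 30, "asthma"), ("female", 30, "ulcer"), ("female", 30, "fracture"),
    ("female", 50, "flu"), ("male", 30, "asthma"), ("male", 40, "flu"),
]
index = build_index(RECORDS)
region = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
point = {"sex": {"female"}, "age": (30, 30), "dx": {"asthma"}}
exact = set(exact_search(index, region))
nearest = best_first_search(index, point, num_matches=3, distance_limit=1)
```

In this sketch the distance limit also prunes the frontier, so branches such as age 50 for the point query are never expanded.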

  13. Outline • Records and queries • Search for matches • Experimental results

  14. Performance Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002: • Twenty-one attributes • 1.6 million records Use of a Pentium computer: • 2.4 GHz CPU • 1 GByte memory • 400 MHz bus

  15. Variables Control variables: • Number of records • Memory size • Query type Measurements: • Retrieval time

  16. Small memory Number of records: 100 to 1,672,016. Memory size: 4 MByte. [Figure: log-log plot of retrieval time (msec, 1 to 100) versus number of records (10^2 to 10^6), with curves for approximate, range, and exact queries, growth-rate annotations lg n, n^0.15, and n^0.5, and a marker for available memory]

  17. Large memory Number of records: 1,672,016. Memory size: 64 to 1,024 MByte. [Figure: log-log plot of retrieval time (msec, 1 to 10,000) versus memory size (64 to 1,024 MBytes), with curves for approximate, range, and exact queries]

  18. Summary Retrieval time grows as a fractional power (about 0.5) of database size. If we extrapolate this growth rate, retrieval times are reasonable for very large databases.
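The extrapolation in the summary can be made concrete with a small calculation. It assumes retrieval time scales as n^0.5 and that this rate continues to hold beyond the measured range; the function name `growth_factor` is hypothetical, and no measured times from the experiments are used.

```python
def growth_factor(size_ratio, exponent=0.5):
    """Factor by which retrieval time grows when the database grows by size_ratio,
    assuming time is proportional to (number of records) ** exponent."""
    return size_ratio ** exponent


# Under this assumption, a database 100 times larger
# takes about 10 times longer to search.
factor_100x = growth_factor(100)
```

By the same rule, doubling the database raises retrieval time by a factor of about 1.41, which is why the slides regard the extrapolated times as reasonable.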

