1 / 15

Efficiently Querying Contradictory and Uncertain Genealogical Data

Efficiently Querying Contradictory and Uncertain Genealogical Data. Lars E. Olson and David W. Embley DEG Lab BYU Computer Science Dept. Supported by National Science Foundation Grant #0083127. Introduction. Integrating data from multiple sources Some data just doesn’t fit the data model

taryn
Download Presentation

Efficiently Querying Contradictory and Uncertain Genealogical Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficiently Querying Contradictory and Uncertain Genealogical Data Lars E. Olson and David W. Embley DEG Lab BYU Computer Science Dept. Supported by National Science Foundation Grant #0083127

  2. Introduction • Integrating data from multiple sources • Some data just doesn’t fit the data model • Multiple data sources conflicting data • Uncertain or imprecise data • Data that violates constraints • Sometimes it’s not possible to resolve the data • PAF / Gedcom 2

  3. Disjunctive Databases “OR-tables,” Imielinski and Vadaparty, 1989 3

  4. Shortcomings of “OR-tables” • Can’t correlate between possible values • Answering queries in general is CoNP-complete (Imielinski & Vadaparty) 4

  5. Sub-relation Data Construct • Solution: store the correlated data in its own relation 5

  6. Disjunctive Database Problems • How do we avoid the CoNP-completeness problem and answer queries efficiently? • If more than one value is possible, which one is the most likely? • Other questions to be solved: • Where are the constraint violations? • How do we map sub-relations to physical storage? • How do we efficiently update the database? 6

  7. Transitive Closure of Disjunctive Graphs Solving the CoNP-completeness problem [LYY95] Disjunctive graph Possible interpretation b b e e a c a c f f d d Transitive closure of a: {a, d, e} 7

  8. Using Disjunctive Graphs to Answer Queries Table Person: Table Place: 8

  9. State(ID=1Person Place) Using Disjunctive Graphs to Answer Queries Person John Doe Name 12 Mar 1840 ID# Birth Date 1 12 Mar 1841 Place Nauvoo ID# Marriage Date City 1 Birth Place Commerce 15 Jun 1869 16 Jun 1869 State Illinois State City ID# 2 Quincy 9

  10. City,State(ID=1Person Place) …meaning what? • Definitely known? • All possible values? • Most likely value? City City City Birth Place Birth Place Birth Place Using Disjunctive Graphs to Answer Queries Place Nauvoo ID# 1 Person Commerce ID# State 1 Illinois State City ID# 2 Quincy 10

  11. City,State(ID=1Person Place) Greedy Algorithm solution Using Disjunctive Graphs to Answer Queries …meaning what? • Definitely known? • All possible values? • Most likely value? Place Nauvoo 1.0 ID# City 1 Person Commerce 0.2 ID# State Birth Place 1 Illinois 0.8 State City ID# 2 Quincy 11

  12. P1.Name, P2.Name(Person P1P1.BirthDate = P2.BirthDate Person P2) Using Disjunctive Graphs to Answer Queries Person P1 John Doe 12 Mar 1840 Person P2 ID #1 ID #2 12 Mar 1841 James Doe 13 Mar 1840 12

  13. Limiting the Search Space • In genealogy, most disjunctions are mutually independent • Disjunctions that aren’t independent are limited to immediate family relations • Build a relation containing all immediate family members (Person P1 P1.parent = P2.ID Person P2 P2.ID = P3.parent Person P3) 13

  14. 0.4 0.7 0.7 0.4 0.6 0.3 0.6 0.3 Limiting the Search Space • Example constraints: • Each parent should be born before their children • Each child should be born at least 9 months apart (except multiple births) Person P1 Person P2 Person P3 ID #1 ID #1 ID #1 1.0 ID #2 ID #2 ID #2 1.0 ID #3 ID #3 ID #3 ID #4 ID #4 ID #4 parent child = parent-1 14

  15. Conclusions • Genealogical data can be stored in a disjunctive database format. • Many common queries can be computed in polynomial time. • We can detect intractable queries and limit the search space required, usually enough to get polynomial time. 15

More Related