Gazetteer database

Gazetteer database • 4.5 million items, each having: • 1+ names • fair to good discriminator • 1 geospatial footprint • 99.9% points, 0.1% boxes & polygons • anticipate getting more (overlapping) polygons • 1+ types • excellent to horrible discriminator • No obvious way to partition them Greg Janée • chit-chat with CS database folks • 10/26/01

Gazetteer database • Informix Dynamic Server 2000 • Indexes • names: Verity (external; blade interface) • footprint: MapInfo (R-tree; blade interface) • types: B-tree Greg Janée • chit-chat with CS database folks • 10/26/01

Gazetteer database • Queries • dynamic • general case: arbitrary boolean combinations • in practice: • [name] AND [footprint] AND [([type] OR [type] OR ...)] • Desired behavior • see some results immediately  queries • after seeing some results, ability to kill query Greg Janée • chit-chat with CS database folks • 10/26/01

Gazetteer database • Observed behavior: • Either: query answered quickly (< 2 minutes) • optimizer picks “right” index • Or: query takes forever (> 30 minutes) • optimizer picks “wrong” index • Complications: • no way to kill or interrupt database thread, through JDBC or otherwise Greg Janée • chit-chat with CS database folks • 10/26/01

Challenge #1 • Solve ADEPT’s query problem • Multiple, different data types • spatial, text, traditional linear types • Discriminability of any given index greatly depends on both data & query • Dynamic queries Greg Janée • chit-chat with CS database folks • 10/26/01

Load balancing • Why • distribute those queries-from-hell • increase reliability • How • multiple independent, identical databases • middleware directs query to “best” database • database executes query in its entirety Greg Janée • chit-chat with CS database folks • 10/26/01

Load balancing • What the middleware knows about a database: • current & maximum connection count • current queries • amount of time each query has been processing (QPT) • Idea: • score each DB inversely proportional to QPT2 Greg Janée • chit-chat with CS database folks • 10/26/01

Challenge #2 • Define metric • overall query processing time • response time • Do better • incorporate connection counts into formula • analyze queries • keep history Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • Gazetteer is an instance of a much more general problem • To wit: • how to describe the automatic translation of dynamic queries written in an abstract query language to SQL • in an easy, powerful, flexible way • making as few assumptions as possible about the underlying schema • and producing “reasonable” SQL • not so bad as to preclude database’s optimizer from working Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • Collection ( database) contains items having IDs • query should return IDs; duplicates OK • response time more important than overall QPT • Search bucket • abstract, typed thing against which constraints may be placed;  standard buckets • Each collection supports 1+ buckets in idiosyncratic ways • Query language • arbitrary boolean combinations of bucket constraints Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • Standard buckets • subject-related text • title • assigned term • originator • geographic location • coverage date • object type, feature type, ... • format • identifier Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • Buckets grouped by types • spatial • e.g., overlaps a given polygon • temporal • e.g., contains a given date range • hierarchical • e.g., is any kind of geographic work • textual, qualified textual • e.g., contains the phrase “luis obispo” • numeric • e.g., > 5.7 meters Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • Example: translating a spatial bucket • MapInfo datablade • ST_Contains(table.column, “HG_Box(coords)”) • Geodetic datablade • Inside(“GeoBox(coords)”, table.column) • four bounding coordinate columns • table.northcolumn >= ... and table.southcolumn <= ... and ... Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • Existing Python-based scripting system • easy to configure & extend • comes with library of standard translation techniques • "geographic-location" :Bucket( "spatial", standardSpatialOperators,spatialToInformixMapInfo, ["j_holding", "footprint"]) • How to extend to boolean queries? Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • Too-easy solution: • given constraints C1, C2 that translate into SQL constraints (T1, S1), (T2, S2) then constraintC1op C2 where op is AND, OR, or AND NOT becomes • select id from T1 where S1op id in (select id from T2 where C2) • But: Informix appears to execute subqueries in their entirety before considering the outer query Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • The problems with JOINs • handling ANDs • tables may have {1, 1?, 1+, 0+} rows per item • handling ORs • may require UNION • but UNION is not nestable Greg Janée • chit-chat with CS database folks • 10/26/01

Query translation • The problem with disjunctive normal form: • may be inefficient • select id from table where S1 and (S2 or S3) • versus • select id from table where S1 and S2unionselect id from table where S1 and S3 Greg Janée • chit-chat with CS database folks • 10/26/01

Challenge #3 • Design a translation description system • easy to configure & extend • makes as few assumptions as possible about the underlying schema • should produce reasonable SQL by default • supports customization of translation process • supports pattern-based overrides Greg Janée • chit-chat with CS database folks • 10/26/01

Gazetteer database