180 likes | 310 Views
The Gazetteer database comprises 4.5 million items with diverse names and geospatial footprints. It faces challenges such as query efficiency, dynamic processing, and diverse data types. With limited partitioning options, queries often suffer from slow response times, typically taking less than 2 minutes or exceeding 30 minutes based on index selection. Implementing a load balancing middleware can enhance reliability by distributing complex queries across multiple identical databases. Additionally, automating query translation from an abstract language to SQL could simplify database interactions, improving overall performance.
E N D
Gazetteer database • 4.5 million items, each having: • 1+ names • fair to good discriminator • 1 geospatial footprint • 99.9% points, 0.1% boxes & polygons • anticipate getting more (overlapping) polygons • 1+ types • excellent to horrible discriminator • No obvious way to partition them Greg Janée • chit-chat with CS database folks • 10/26/01
Gazetteer database • Informix Dynamic Server 2000 • Indexes • names: Verity (external; blade interface) • footprint: MapInfo (R-tree; blade interface) • types: B-tree Greg Janée • chit-chat with CS database folks • 10/26/01
Gazetteer database • Queries • dynamic • general case: arbitrary boolean combinations • in practice: • [name] AND [footprint] AND [([type] OR [type] OR ...)] • Desired behavior • see some results immediately queries • after seeing some results, ability to kill query Greg Janée • chit-chat with CS database folks • 10/26/01
Gazetteer database • Observed behavior: • Either: query answered quickly (< 2 minutes) • optimizer picks “right” index • Or: query takes forever (> 30 minutes) • optimizer picks “wrong” index • Complications: • no way to kill or interrupt database thread, through JDBC or otherwise Greg Janée • chit-chat with CS database folks • 10/26/01
Challenge #1 • Solve ADEPT’s query problem • Multiple, different data types • spatial, text, traditional linear types • Discriminability of any given index greatly depends on both data & query • Dynamic queries Greg Janée • chit-chat with CS database folks • 10/26/01
Load balancing • Why • distribute those queries-from-hell • increase reliability • How • multiple independent, identical databases • middleware directs query to “best” database • database executes query in its entirety Greg Janée • chit-chat with CS database folks • 10/26/01
Load balancing • What the middleware knows about a database: • current & maximum connection count • current queries • amount of time each query has been processing (QPT) • Idea: • score each DB inversely proportional to QPT2 Greg Janée • chit-chat with CS database folks • 10/26/01
Challenge #2 • Define metric • overall query processing time • response time • Do better • incorporate connection counts into formula • analyze queries • keep history Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • Gazetteer is an instance of a much more general problem • To wit: • how to describe the automatic translation of dynamic queries written in an abstract query language to SQL • in an easy, powerful, flexible way • making as few assumptions as possible about the underlying schema • and producing “reasonable” SQL • not so bad as to preclude database’s optimizer from working Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • Collection ( database) contains items having IDs • query should return IDs; duplicates OK • response time more important than overall QPT • Search bucket • abstract, typed thing against which constraints may be placed; standard buckets • Each collection supports 1+ buckets in idiosyncratic ways • Query language • arbitrary boolean combinations of bucket constraints Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • Standard buckets • subject-related text • title • assigned term • originator • geographic location • coverage date • object type, feature type, ... • format • identifier Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • Buckets grouped by types • spatial • e.g., overlaps a given polygon • temporal • e.g., contains a given date range • hierarchical • e.g., is any kind of geographic work • textual, qualified textual • e.g., contains the phrase “luis obispo” • numeric • e.g., > 5.7 meters Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • Example: translating a spatial bucket • MapInfo datablade • ST_Contains(table.column, “HG_Box(coords)”) • Geodetic datablade • Inside(“GeoBox(coords)”, table.column) • four bounding coordinate columns • table.northcolumn >= ... and table.southcolumn <= ... and ... Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • Existing Python-based scripting system • easy to configure & extend • comes with library of standard translation techniques • "geographic-location" :Bucket( "spatial", standardSpatialOperators,spatialToInformixMapInfo, ["j_holding", "footprint"]) • How to extend to boolean queries? Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • Too-easy solution: • given constraints C1, C2 that translate into SQL constraints (T1, S1), (T2, S2) then constraintC1op C2 where op is AND, OR, or AND NOT becomes • select id from T1 where S1op id in (select id from T2 where C2) • But: Informix appears to execute subqueries in their entirety before considering the outer query Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • The problems with JOINs • handling ANDs • tables may have {1, 1?, 1+, 0+} rows per item • handling ORs • may require UNION • but UNION is not nestable Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation • The problem with disjunctive normal form: • may be inefficient • select id from table where S1 and (S2 or S3) • versus • select id from table where S1 and S2unionselect id from table where S1 and S3 Greg Janée • chit-chat with CS database folks • 10/26/01
Challenge #3 • Design a translation description system • easy to configure & extend • makes as few assumptions as possible about the underlying schema • should produce reasonable SQL by default • supports customization of translation process • supports pattern-based overrides Greg Janée • chit-chat with CS database folks • 10/26/01