1 / 64

Universal, Composable Indexing Queries, Text, Spatial Data, XML Structure, and XML Semantics

Universal, Composable Indexing Queries, Text, Spatial Data, XML Structure, and XML Semantics. Chris Biow, Federal CTO Wayne Feick, Lead Engineer Mark Logic. Universal Indexing: Agenda. MarkLogic Server Text XML Structure XML Semantics Scalars and aggregation Spatial location

kaspar
Download Presentation

Universal, Composable Indexing Queries, Text, Spatial Data, XML Structure, and XML Semantics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Universal, Composable IndexingQueries, Text, Spatial Data, XML Structure, and XML Semantics Chris Biow, Federal CTO Wayne Feick, Lead Engineer Mark Logic

  2. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  3. MarkLogic XML Server DBMS Search Application Server

  4. What is MarkLogic Server? • An XML Server • A hybrid (integrated parts) • Special purpose DBMS for XML, with enterprise expectations: • ACID transactions • DBA, backup, replication • Search engine kernel, with enterprise expectations: • Full text • Faceted navigation, at massive scale • Boolean, proximity, stemming, tokenization, decompounding, case, diacritics • Application Server • HTTP • XCC Java/.NET • WebDav

  5. MarkLogic as Special DBMS • Not relational (RDBMS) • Schema agnostic • Text a first-class citizen among data types • XQuery (SQL) • Search engine algorithms for many DB queries • Order(1) in number of docs • O(log(n)) for range indexing • O(n) in results • Very low DBA overhead (0.5 FTE / 100 hosts) • 5-minute install • 5-minute scale-out • Database and search engine are the same

  6. MarkLogic as Special Search Engine timeline • Understands document structure • Transactional: high CRUD load • Unicode • Diacritics • Collations • Languages • Holds the documents • Reindexing • Delivery • Geospatial: Box, Point/radius, polygon • Alerting: Profile, alert, filter, tipping, selector, “trigger,” … • Analytics: Facets, Co-occurrence, word lexicons, … • Everything composes (e.g. geo-alerting, geo-text-data, search-alerting) • Processing near the data • Relational joins and inferencing • Database and search engine are the same message message @id 3 @id 5 status status Testing XProc Oh boy… Element node Attribute Node Text Node

  7. MarkLogic as Special App Server • Native HTTP(S) server • RESTfulXML by default • Transform to HTML • Transform to PDF, MS Office, etc. • PKI with no dependencies • Optionally with external Auth • HTTP(S) client • XCC Java / .NET server • Similar to JDBC / ADO.NET • WebDAV • Folder on the user’s desktop RESTful Architecture user/ Representations + get .json.xqy .xml.xqy + … user.xqy + Get + Put + Post + Delete Resources notes/ URL Rewriter note.xqy Routes.xqy

  8. MarkLogic at Scale • Scale up: typically 1TB or more per server • Scale out: low hundreds ++ • Commodity hardware • Typically ~$5K HW budget per server • 2-CPU x 4-core • 32+ GB RAM • 3x disk: local mount with failover • OS • Linux RHEL 5 • Solaris 10 • Windows 2003/8 (XP/Vista/7 for dev)

  9. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  10. Full-text Indexing (Search) Find all documents that contain the phrase “uniquely identify” 129 122 127 123 130 126 124 125 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 1 can uniquely identify however

  11. Full-text Indexing (Search) Find all documents that contain the phrase “uniquely identify” UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “uniquely identify” 126, 130, 167, 212, 219, 377 . . . “which uniquely” . . . 126, 130, 167, 212, 219, 377 . . . “identify each” . . .

  12. Word Lexicons • Revertible index of words • Word-level search (wildcarding) • Corpus analysis • Most common words • Document-frequency of words • Result-set intersection • Most common words in result set • Document-frequency of chosen words in result

  13. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  14. XML Structure Find all articles that have an abstract <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>

  15. XML Structure Find all articles that have an abstract UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “uniquely identify” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, 212, 219, 377 . . . <article>/<abstract> . . .

  16. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  17. XML Semantics Find all documents that mention the product “IMS” <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>

  18. XML Semantics Find all documents that mention the product “IMS” UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “uniquely identify” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, 212, 219, 377 . . . <article>/<abstract> . . . <product>IMS</product>

  19. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  20. Range Indexing Bi-directionally index ordered values (YEAR:int, Volume:text) to document (fragment) IDs UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . YEAR “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, … <article>/<abstract> . . . <product>IMS</product> Volume

  21. Range indexes / lexicons • Scalar or ordered discrete • XS data types • Inequalities • Range queries • Aggregations • Corpus or Result-specific • Discrete • Bucketed (query-time bounds) • Faceted Navigation • Fundamentally O(log(n)) • Memory-mapped

  22. Aggregate Queries How many of the articles that contain “data base” were written in each of the last 5 decades? <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>

  23. Aggregates How many of the articles that contain “data base” were written in each of the last 5 decades? UNIVERSAL INDEX “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . YEAR “identify” 123, 126, 130, 142, 143, 167 . . . Document References “each” 123, 130, 131, 135, 162, 177 . . . “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . 126, 130, 167, … <article>/<abstract> . . . <product>IMS</product> Volume

  24. Composition Count all articles that contain “data” in the title and mention the product IMS in a section, grouping by year. <article> <title>A Relational Model of Data for Large Shared Data Banks</title> <author><first-name>Edgar</first-name><last-name>Codd</last-name></author> <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract> <body> <section> <section> … has values which uniquely identify each element ... </section> </section> <section>… version of <product>IMS</product> provides the user . . . </section> </body> <metadata><vol>13</vol><number>6</number><year>1970</year></metadata> </article>

  25. The Universal Index Range Indexes UNIVERSAL INDEX Term Term List “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . “each” 123, 130, 131, 135, 162, 177 . . . Document References “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . <article>/<abstract> . . . 126, 130, 167, … <product>IMS</product>

  26. Back to Discrete Indexing • Directories • Exclusive, hierarchical, analogous to file system, map to URI • Collections • Set-based, N:N relationship • Security • Invisible to your app

  27. The Universal Index Range Indexes UNIVERSAL INDEX Term Term List “which” 123, 127, 129, 152, 344, 791 . . . “uniquely” 122, 125, 126, 129, 130, 167 . . . “identify” 123, 126, 130, 142, 143, 167 . . . “each” 123, 130, 131, 135, 162, 177 . . . Document References “data base” 126, 130, 167, 212, 219, 377 . . . <article> . . . <article>/<abstract> . . . 126, 130, 167, … <product>IMS</product> Directory(“/articles/”) Collection(Red) Role:Editor + Action:Read

  28. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  29. Scalar Pairs • Data examples • Latitude / Longitude • Any other pair (e.g. volume / closing price) • Query types • Point (exact value) • Point-Radius (circle) • Lat/Lon bound (Mercator “rectangle”) • Polygon (10K+ vertices) • Aggregations: co-occurrence • Discrete facets (symptom : treatment) • Bucketed continuous • Frequency-ordered • Heat Map

  30. Geospatial Indexing Points ordered in latitude major order; special scan operators apply geospatial query constraints GEOSPATIAL INDEX 130 0, -124 123 -10,10.5 126 127 0,0 -10,10.5 167 0,0 126 0,-167 ... 130 0, -29 113 0,-12 126 0,0 167 0,0 113 10.1, 35.553 …

  31. Query Registration • Canonicalize a cts:query() • Hash for an ID • Resolve the query • Persist the term list in memory • Reuse as materialized sub-query • (AKA Topics, Concepts, Macros, etc.)

  32. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  33. Query Indexing • “Alerting” • Selectors, tippers, standing query, filters, “triggers*”, real-time search, content-based routing, stream DBMS, etc. • Search(query, Index) -> docs • Alert(doc, Index) -> queries • Common sub-expression factoring • O(1) in number of alerts • O(n) in results returned * MarkLogic has pre- and post-commit DBMS triggers, which are unrelated Alert New patient matching your study profile

  34. The Reverse Index REVERSE INDEX Document References Query Unified Expression Trees year >= 1970 and (“data” and “size”) “data” and and 437 year > 2003 and (“data” and “web”) “size” year < 2000 and (“data” and “web”) “web” and and 562 (2000 <= year <= 2010) and “web” and 597 . . . year >= 1970 and 623 year < 2000 year >= 2000 and year > 2003 year <= 2010

  35. Alerting in Composition • Scalar • Search is on scalar data with range queries • Alert on range data with scalar reverse-queries • Geospatial • Search on point data with box, circle, polygon query constraint • Compose with text, structure, semantic query • Alert on box, circle, polygon data with point reverse-query • Compose with text, structure, semantic data • Search • Queries compose with Boolean operations (AND, OR, NOT, ((())) • Reverse- and forward-query compose (AND, OR, NOT, ((())) • Why would you ever want to do that?

  36. Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting

  37. Search Composed with Alerting • In Soviet Russia, the document queries YOU! • If you are also a document • Documents are XML • Elements • Typed data • Text • Serialized XQuery against documents • Attributes • Typed data • Composition • XQuery incorporating document and Booleans

  38. Matchmaking • Constraints upon each other. • Matching pairs or one-to-many • Suitable date • man / woman • mixed pool • Carpool ride: driver / rider • Employment: job / resume • Medication: patient / drug • Search security: document / user • Battle: target / shooter

  39. Carpool • Driver • Non-smoking woman driving from San Ramon to San Carlos, leaving at 8AM, listens to rock, pop, hip-hop, wants $10 for gas • Requires female passenger within five miles of start and end • Passenger • Woman will pay up to $20 • From: 3001 Summit View Dr, San Ramon, CA 94582 • To: 300 Oracle Pkwy, Redwood City, CA 94065 • Requires non-smoking car • Won’t listen to country music

  40. Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } </preferences> </driver>)

  41. Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } ...

  42. Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } ...

  43. Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } </preferences> </driver>)

  44. Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) ...

  45. Driver let $from := cts:point(37.751658,-121.898387) (: San Ramon :) let $to := cts:point(37.507363, -122.247119) (: San Carlos :) return xdmp:document-insert( "/driver.xml", <driver> <from>{$from}</from> <to>{$to}</to> <when>2010-01-20T08:00:00-08:00</when> <gender>female</gender> <smoke>no</smoke> <music>rock, pop, hip-hop</music> <cost>10</cost> <preferences> {cts:and-query(( cts:element-value-query(xs:QName("gender"), "female"), cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)), cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to)) )) } </preferences> </driver>)

  46. Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)

  47. Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)

  48. Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)

  49. Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)

  50. Driver xdmp:document-insert( "/passenger.xml", <passenger> <from>37.739976,-121.915821</from> <to>37.53107,-122.264122</to> <gender>female</gender> <preferences> { cts:and-query(( cts:not-query(cts:element-word-query( xs:QName("music"), "country")), cts:element-range-query(xs:QName("cost"), "<=", 20), cts:element-value-query(xs:QName("smoke"), "no"), cts:element-value-query(xs:QName("gender"), "female") )) } </preferences> </passenger>)

More Related