
QUASAR query language and system



  1. QUASAR query language and system Luying Chen and Michael Benedikt, Computer Science Department, University of Oxford; Evgeny Kharlamov, KRDB Research Centre, Free University of Bozen-Bolzano

  2. QUASAR system [Quasar’12] • QUASAR system is about • QUerying • Annotations • Structure And • Reasoning • QUASAR is • a query answering system • to query annotated data • and exploit the structure of the data • together with logical reasoning over annotations

  3. QUASAR system • QUASAR = Querying Annotations Structure And Reasoning • Annotations come from annotated data • What is this data? • What is the source of this data? • Structure is the data's / documents' structure • Which documents? • Why are they annotated? • Reasoning over annotations to improve the quality of query answering • Why is reasoning possible? • Why is reasoning beneficial?

  4. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  5. Semantically annotated Web • Goal: • to nest semantics within existing content on web pages • to help search engines, crawlers and browsers find the right data • Person: • name • photo • URL • ... (diagram: text → annotated text)

  6. Standards for semantic markup • Microformats • started in 2003 • small data islands within HTML pages • small set of fixed formats • hCard: people, companies, organizations, and places • XFN: relationships between people • hCalendar: calendaring and events • RDFa: Resource Description Framework in attributes • proposed in 2004, W3C recommendation • serialization format for embedding RDF data into HTML pages • can be used together with any vocabulary, e.g. FOAF • Microdata • an alternative technique for embedding structured data • proposed in 2009, comes with HTML5

  7. Is semantic markup important? • Schema.org initiative: • started in June 2011 • initiated by Bing, Google, Yahoo! • they propose: to mark up / annotate websites with metadata • they support: Microdata

  8. Is semantic markup important? • Metadata by Schema.org: • Person • Organization • Event • Place • Product • ... • 200+ types

  9. Who uses semantic markup? • Common Crawl foundation • goal: building and maintaining an open crawl of the Web • WebDataCommons.org project • goal: extracting Microformats, Microdata, RDFa from the Common Crawl corpus • Feb 2012: • processed 1.4 billion HTML pages of the CC corpus • 20.9 terabytes of compressed data • this is a big fraction of the Web

  10. Who uses semantic markup? • 1.4 billion HTML pages processed • 188 million of them contain structured data in Microformats, Microdata, RDFa [CB’12] • This data amounts to 3.2 billion RDF triples • 13% of the HTML pages contain structured (meta)data

  11. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  12. Automatic document annotation • There are more and more systems that do automatic text annotation • OpenCalais, Evri API, Alchemy API, Zemanta, ... • How do they work: • intelligent processing of textual data • use of machine learning • use of natural language processing • ... • Goal of annotation: transforming text and web pages into knowledge

  13. Annotated documents: examples

  14. What are annotated documents? (annotated document, screenshot from OpenCalais) • An annotated document is • a sequence of tokens • with annotations overlaying them • Each annotation has • a span: start and end token • a type: • concept (e.g., Person), • sentiment (e.g., positive), etc. • a canonical name (e.g., Ferdinand Magellan) • a URI (e.g., a link to DBpedia) • an accuracy of recognition • ...
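A minimal sketch of how one such annotation could be represented as a nested record; the field names below are illustrative assumptions, not the actual QUASAR schema:

    # One concept annotation as a hypothetical nested Python record.
    annotation = {
        "annotator": "OpenCalais",
        "annotationType": "concept",
        "snippet": {"corpus": "explorer_corpus", "docNum": 1, "paraNum": 3,
                    "span": (12, 13)},                  # start and end token
        "assertion": {
            "predicate": "Person",
            "args": [{"naiveName": "Magellan",
                      "canonicalName": "Ferdinand Magellan",
                      "uris": ["http://dbpedia.org/resource/Ferdinand_Magellan"]}],
        },
        "confidence": 0.92,                             # accuracy of recognition
    }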

  15. Annotated doc = bag of annotations (annotated document, screenshot from OpenCalais) • ABox statements + metadata

  16. Types of annotations • Concept annotations: • Person(Ferdinand Magellan) • Continent(Europe) • (n-ary) Relationship annotations, i.e., events and facts: • Person_Career(John II, King, Political, current) • General_Relation(Another expl., name, Magellan) • Person_Travel(Henry Hudson, Delaware, past) • Born_In(Magellan, Portugal) • Travel(Magellan, Spain, September Past) • Sentiment annotations: • Positive(the first in Europe) • Neutral(Mediterranean Sea) • Negative(died last year)

  17. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  18. Naive approach to QA

  19. Issues with missing information • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • Triples are missing lots of information: • Who discovered the triple? -- reliability of triples • In which corpus or paragraph does it appear? -- coordinates of triples • What is the URI of an annotated object? -- disambiguation • ... • We claim that this missing information is vital for answering queries

  20. Issues with sets vs. (ordered) bags • Return places visited by Magellan • Available triples: • Magellan Type Person -- occurs 50 times (in the corpus) • Siberia Type Place -- occurs 2 times • Philippines Type Country -- occurs 240 times • Charles Type City -- occurs 1 time • A triple store is a set of triples • every triple has the same “weight” or “importance” • document order and distance between triples are ignored • Some triples are in the triple set • due to annotator mistakes or • because they are noise • To avoid irrelevant triples, ordered bags or triples with weights are needed (see the sketch below)
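A toy sketch (my own illustration, not QUASAR code) of how occurrence counts can turn the set of triples into a weighted bag:

    from collections import Counter

    # Hypothetical (subject, predicate, object) triples extracted from all
    # annotations in the corpus, with repetitions kept instead of discarded.
    extracted = [
        ("Magellan", "Type", "Person"),
        ("Philippines", "Type", "Country"),
        ("Magellan", "Type", "Person"),
        ("Charles", "Type", "City"),
    ]

    # Counting occurrences gives each triple a weight: frequent triples rank
    # high, rare (possibly noisy) triples such as Charles Type City rank low.
    weights = Counter(extracted)
    for triple, count in weights.most_common():
        print(triple, count)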

  21. Issues with joins • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • the triple (Magellan Visited Philippines) is absent • Correlations between triples are missing • How can we join triples? • Standard way: using values • It does not help in our case • Structure (same paragraph), names of annotations, etc., is a way to join triples (as sketched below)
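A minimal sketch of such a structural join, assuming annotation records shaped as in the earlier example (illustrative code, not the QUASAR evaluator):

    # Pair up annotations that occur in the same document and paragraph,
    # e.g. Country annotations with Person("Magellan") annotations.
    def same_paragraph_join(anns_a, anns_b):
        index = {}
        for b in anns_b:
            key = (b["snippet"]["docNum"], b["snippet"]["paraNum"])
            index.setdefault(key, []).append(b)
        for a in anns_a:
            key = (a["snippet"]["docNum"], a["snippet"]["paraNum"])
            for b in index.get(key, []):
                yield (a, b)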

  22. Issues with “being in the box” • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • How can we find out that (Country SubClassOf Place)? • the schema might not be available • How can we be sure that Charles is indeed a city? • annotators make mistakes • Using external sources of knowledge is the way to go: DBpedia, YAGO, ... (see the sketch below)
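One way to consult such an external source is a SPARQL ASK query against the public DBpedia endpoint; the sketch below uses the SPARQLWrapper Python library and only illustrates the idea, it is not the QUASAR implementation:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    # Is Country a (transitive) subclass of Place in the DBpedia ontology?
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        ASK { dbo:Country rdfs:subClassOf+ dbo:Place }
    """)
    print(sparql.query().convert()["boolean"])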

  23. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  24. QUASAR data model • Nested objects: • annotation • snippet • assertion • arg[i] • naive name (string) • canonical name (string) • a list of URIs (list of strings) • Strings: • corpus; document, paragraph, sentence nr. • predicate • annotator • annotation type
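A minimal sketch of this nesting as Python dataclasses; the exact types and field names are my reading of the slide, not the system's actual classes:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Arg:
        naiveName: str          # the string as it occurs in the text
        canonicalName: str
        uris: List[str]         # candidate URIs, e.g. DBpedia links

    @dataclass
    class Assertion:
        predicate: str
        args: List[Arg]         # arg[0], arg[1], ...

    @dataclass
    class Snippet:
        corpus: str
        docNum: int             # document, paragraph, sentence number
        paraNum: int
        sentNum: int

    @dataclass
    class Annotation:
        snippet: Snippet
        assertion: Assertion
        annotator: str          # e.g. "OpenCalais"
        annotationType: str     # e.g. "concept", "event", "sentiment"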

  25. Query answering over annotated docs • We want to retrieve annotations with specified • doc location • type • predicates • entities • sentiments • URIs • annotators • confidence • ... • QUASAR is an annotation-oriented query language • Return annotations about places visited by Magellan

  26. Example QUASAR queries • Return annotations from the first 2 paragraphs of the corpus: SELECT a FROM explorer_corpus.Annotation a WHERE a.snippet.paraNum <= 2 • Return annotations found by OpenCalais: SELECT a FROM explorer_corpus.Annotation a WHERE a.annotator = "OpenCalais"

  27. Example QUASAR queries • Return event annotations: SELECT a FROM explorer_corpus.Annotation a WHERE a.annotationType = "event" • Return annotations about persons: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.predicate = "Person" • The same query in atom-based notation: ... WHERE a.assertion = Person(?x)

  28. Example QUASAR queries • Return annotations about Magellan • Fuzzy match: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.arg[0] like "Magellan" • Exact match: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.arg[0] = "Magellan"

  29. Example QUASAR queries • Which places has Magellan visited? • Return (annotations about) countries located in the same paragraph as (annotations with the assertion) Person(Magellan): SELECT a FROM explorer_corpus.Annotation a, explorer_corpus.Annotation b WHERE a.assertion = Country(?x) and a.snippet.docNum = b.snippet.docNum and a.snippet.paraNum = b.snippet.paraNum and b.assertion.predicate = "Person" and b.assertion.arg[0] like "Magellan"

  30. QUASAR queries: general form • The QUASAR query language combines • SQL syntax with • object-oriented navigation • At the moment we support conjunctive queries only • Three clauses of queries: • SELECT annotation | attribute of annotation -- output annotations from one of the annotation sets, or an attribute of annotations • FROM annotation set a, ..., annotation set n -- a list of sets of annotations • WHERE conditions on annotations -- filters on annotations: conditions on annotation attributes, joins of annotations

  31. Are we happy with the quality of answers? • There are too few expected answers • Country(Philippines) – where are the cities? • Place(Atlantic Ocean) – how can we avoid oceans? • How to find all relevant places visited by Magellan? a.assertion.predicate = "Country" | "Province" | "City" | ... • How to get rid of oceans? a.assertion.predicate = not "Ocean" • Annotation vocabularies (concepts, roles, ...) are flat • Annotators cannot expand queries automatically => the user has to do it and write many or complex queries • How can this be avoided? • We use ontologies to • address the “too few expected answers” problem • by expanding queries (see the sketch below)
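A toy sketch of what this query expansion could look like, assuming a small table of direct subclass relationships (illustrative code, not the QUASAR rewriter):

    # Direct subclass relationships, e.g. read from an ontology.
    subclass_of = {"Country": "Place", "Province": "Place",
                   "City": "Place", "Ocean": "Place"}

    def expand(concept, excluded=()):
        """All concepts that are transitively subclasses of `concept`,
        minus the excluded ones; the query is then rewritten as a
        disjunction over these predicates."""
        result, changed = {concept}, True
        while changed:
            changed = False
            for sub, sup in subclass_of.items():
                if sup in result and sub not in result:
                    result.add(sub)
                    changed = True
        return result - set(excluded)

    print(expand("Place", excluded=["Ocean"]))
    # e.g. {'Place', 'Country', 'Province', 'City'}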

  32. TBox reasoning in QUASAR • Return all (explicit and implicit) places which are not oceans: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = ?X(?y) [ontologyFilter: subClassOf(?X, "Place") and disjointWith(?X, "Ocean")] • There are many tools for TBox reasoning • Pellet, Racer, Jena, ... • We use Jena for TBox reasoning and support • subclass of • disjoint with • We allow users to upload ontologies for reasoning • For the demo we use the DBpedia ontology extended with disjointness assertions

  33. Are we happy with the quality of answers? • There are too many wrong answers: Country(John II) and Person(Strait of Magellan) • Annotators make errors • How can we do a semantic check on the results of annotators? • Knowledge bases (KBs) can be used to check the quality of answers • We use available knowledge bases to • address the “too many wrong answers” problem • by exploiting them as filters

  34. Query answering over KBs in QUASAR • Return all places known by DBpedia to be populated: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = Place(?y) [ontologyFilter: Populated(?y)] • Return all organizations known by DBpedia to be educational institutes located in Bolzano: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = Organization(?y) [ontologyFilter: EducationInstitute(?y) and locatedIn(?y, "Bolzano")]

  35. Query answering over KBs in QUASAR • There are tools to support query answering over KBs • Quest [Quest] • Owlim [Owlim] • ... • QUASAR uses REQUIEM [RQ] and supports • conjunctive queries • over KBs with ontologies that have • subclass-of • disjointness • In QUASAR users can choose the KBs to be used for reasoning • For the demo we use DBpedia

  36. QUASAR philosophy of query answering • QUASAR = Querying Annotations Structure and Reasoning • External ontologies are used to • do query expansion • increase the number of answers • Annotated documents are used to • retrieve annotations and their attributes • filter annotations using • the structure of documents • the metadata of annotations • External KBs are used to • filter out wrong instances • reason over instances • (The overall strategy is sketched below.)
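Put together, the answering strategy could be sketched roughly as follows; the three helper functions passed in are placeholders standing for the ontology reasoner, the annotation store, and the KB reasoner (a conceptual outline, not the actual architecture):

    def answer(query, expand_with_ontology, evaluate, consistent_with):
        # 1. Query expansion over the ontology, e.g. replace "Place"
        #    by all of its subclasses (more answers).
        expanded = expand_with_ontology(query)

        # 2. Retrieval over the annotated documents, using document
        #    structure and annotation metadata for selections and joins.
        candidates = [a for q in expanded for a in evaluate(q)]

        # 3. Filtering with the external KB, dropping candidates it
        #    contradicts, e.g. Person("Strait of Magellan").
        return [a for a in candidates if consistent_with(a)]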

  37. QUASAR Architecture

  38. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  39. Top-k • The answer set is comparable to the corpus size • annotators are able to discover a lot of entities • the same entity can be annotated with several annotations • combining several annotators makes it even worse • Ranking is needed • Directions: • deterministic ranking based on • document structure (the closer to a reference point, the higher in the ranking) • frequency (the higher the frequency of an annotation, the higher the ranking) • reliability of annotators • ... • probabilistic ranking based on statistics of mistakes
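One possible deterministic scoring function along these lines, with purely illustrative weights and field names:

    def score(annotation, freq, annotator_reliability, ref_para=0):
        # Higher frequency, a more reliable annotator, and closeness to a
        # reference point in the document all push an answer up the ranking.
        distance = abs(annotation["snippet"]["paraNum"] - ref_para)
        return (freq * annotator_reliability) / (1 + distance)

    # Answers are then sorted by score and only the top k are returned, e.g.
    # top_k = sorted(answers, key=lambda a: score(a, freq[a], rel[a]), reverse=True)[:k]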

  40. Visualization • What would be the best way to display answers? • to make the answers more intuitive • good aggregation mechanisms are needed due to large answer sets • The system does not support projections of annotations at the moment • How to visualize different projections? • User studies are needed

  41. User feedback • A form of (indirect) feedback loop is currently present: • user asks a query • user observes the answer set • user refines the query • go to 1 • This is the standard feedback approach in search engines: • the user sends keywords to Google • the user refines the keywords if the result is not good • We want to incorporate direct feedback: • the user should be able to rate answers • good – bad • keep – dismiss • based on the feedback the system should adjust the answer set
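A toy sketch (my own illustration, nothing from the talk) of how keep/dismiss ratings might be folded back into the ranking:

    feedback_boost = {}   # learned adjustment per asserted fact

    def fact_key(annotation):
        args = annotation["assertion"]["args"]
        return (annotation["assertion"]["predicate"],
                tuple(a["canonicalName"] for a in args))

    def record_feedback(annotation, keep):
        # Reward kept answers, penalise dismissed ones.
        key = fact_key(annotation)
        feedback_boost[key] = feedback_boost.get(key, 1.0) * (1.2 if keep else 0.8)

    def adjusted_score(annotation, base_score):
        return base_score * feedback_boost.get(fact_key(annotation), 1.0)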

  42. Probabilistic query answers • Currently the answers are deterministic • We want answers of the form: • (Magellan Visited Philippines) with probability 0.7 • We are working on a probabilistic model • it combines several annotators based on their reliability • the model is an annotator itself • it is a form of probabilistic transducer • it produces annotations with probabilities • probabilities are based on an aggregate opinion of other annotators
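A rough sketch of one possible way to combine several annotators into a probability, namely a reliability-weighted vote (one candidate model, not necessarily the one under development):

    def combined_probability(votes, reliability):
        # votes: {annotator: True/False} -- did the annotator assert the fact?
        # reliability: {annotator: weight in (0, 1]}.
        total = sum(reliability[a] for a in votes)
        positive = sum(reliability[a] for a, v in votes.items() if v)
        return positive / total if total else 0.0

    p = combined_probability(
        {"OpenCalais": True, "AlchemyAPI": True, "Zemanta": False},
        {"OpenCalais": 0.9, "AlchemyAPI": 0.8, "Zemanta": 0.6})
    print(round(p, 2))   # 0.74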

  43. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  44. Summary • Annotations are an important reality of today’s Web • 13% of crawled HTML pages contain structured data • any text can easily be annotated using automatic annotators • There is a need for query answering • techniques and • tools in order to leverage annotated data for intelligent information search • Current approaches to query answering over triple stores are not adequate, or at least hard to adopt directly • QUASAR’s response to the problem: • a data model • a query language • a demo system

  45. References • [CB’12] C. Bizer: Topology of the Web of Data. Joint keynote talk at LWDM 2012 and BEWEB 2012, EDBT workshops, Berlin, Germany, March 2012. • [Quest] http://obda.inf.unibz.it/protege-plugin/quest/quest.html • [Owlim] http://www.ontotext.com/owlim • [RQ] http://www.cs.ox.ac.uk/projects/requiem/index.html • [Quasar’12] L. Chen, M. Benedikt, and E. Kharlamov. QUASAR: Querying Annotation, Structure, and Reasoning. In Proc. of EDBT, Berlin, March 2012. Demonstration. • [Alchemyapi] www.alchemyapi.com/api/entity/ • [Evriapi] www.evri.com/ • [Jenaapi] jena.sourceforge.net • [Opencalais] www.opencalais.com

  46. References • [KIM’04] A. Kiryakov, B. Popov, I. Terziev, D. Manov, and D. Ognyanoff. Semantic annotation, indexing, and retrieval. J. Web Semantics, 2(1):49–79, 2004. • [Docqs’10] M. Zhou, T. Cheng, and K. C.-C. Chang. DoCQS: a prototype system for supporting data-oriented content query. In SIGMOD, 2010.
