
QUASAR query language and system



  1. QUASAR query language and system Luying Chen and Michael Benedikt, Computer Science Department, University of Oxford; Evgeny Kharlamov, KRDB Research Centre, Free University of Bozen-Bolzano

  2. QUASAR system [Quasar’12] • QUASAR system is about • QUerying • Annotations • Structure And • Reasoning • QUASAR is • a query answering system • to query annotated data • and exploit the structure of the data • together with logical reasoning over annotations

  3. QUASAR system • QUASAR = Querying Annotations Structure And Reasoning • Annotations come from annotated data • What is this data? • What is the source of this data? • Structure is the data's / documents' structure • Which documents? • Why are they annotated? • Reasoning over annotations to improve the quality of query answering • Why is reasoning possible? • Why is reasoning beneficial?

  4. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  5. Semantically annotated Web • Goal: • to nest semantics within existing content on web pages • to help search engines, crawlers and browsers find the right data • Person: • name • photo • URL • ... (diagram: text → annotated text)

  6. Standards for semantic markup • Microformats • started in 2003 • small data islands within HTML pages • small set of fixed formats • hCard: people, companies, organizations, and places • XFN: relationships between people • hCalendar: calendaring and events • RDFa: Resource Description Framework in attributes • proposed in 2004, W3C recommendation • serialization format for embedding RDF data into HTML pages • can be used together with any vocabulary, e.g. FOAF • Microdata • an alternative technique for embedding structured data • proposed in 2009, comes with HTML5

  7. Is semantic markup important? • Schema.org initiative: • started in June 2011 • initiated by Bing, Google, Yahoo! • they propose: to mark up / annotate websites with metadata • they support: Microdata

  8. Is semantic markup important? • Metadata by Schema.org: • Person • Organization • Event • Place • Product • ... • 200+ types

  9. Who uses semantic markup? • Common Crawl foundation • goal: building and maintaining an open crawl of the Web • WebDataCommons.org project • goal: extracting Microformats, Microdata, RDFa from the Common Crawl corpus • Feb 2012: • processed 1.4 billion HTML pages of the CC corpus • 20.9 terabytes of compressed data • this is a big fraction of the Web

  10. Who uses semantic markup? • 1.4 billion HTML pages processed • 188 million of them contain structured data in Microformats, Microdata, RDFa [CB’12] • This data amounts to 3.2 billion RDF triples • 13% of the HTML pages contain structured (meta)data

  11. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  12. Automatic document annotation • There are more and more systems that do automatic text annotation • OpenCalais, Evri API, Alchemy API, Zemanta, ... • How do they work: • intelligent processing of textual data • use of machine learning • use of natural language processing • ... • Goal of annotation: transforming text and web pages into knowledge

  13. Annotated documents: examples

  14. What are annotated documents? (annotated document, screenshot from OpenCalais) • An annotated document is • a sequence of tokens • with annotations overlaying them • Each annotation has • a span: start and end token • a type: • concept (e.g., Person), • sentiment (e.g., positive), etc. • a canonical name (e.g., Ferdinand Magellan) • a URI (e.g., a link to DBpedia) • an accuracy of recognition • ...
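A minimal sketch of how one such annotation could be represented as a nested record; the field names below are illustrative assumptions, not the actual QUASAR schema:

    # One concept annotation as a hypothetical nested Python record.
    annotation = {
        "annotator": "OpenCalais",
        "annotationType": "concept",
        "snippet": {"corpus": "explorer_corpus", "docNum": 1, "paraNum": 3,
                    "span": (12, 13)},                  # start and end token
        "assertion": {
            "predicate": "Person",
            "args": [{"naiveName": "Magellan",
                      "canonicalName": "Ferdinand Magellan",
                      "uris": ["http://dbpedia.org/resource/Ferdinand_Magellan"]}],
        },
        "confidence": 0.92,                             # accuracy of recognition
    }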

  15. Annotated doc = bag of annotations (annotated document, screenshot from OpenCalais) • ABox statements + metadata

  16. Types of annotations • Concept annotations: • Person(Ferdinand Magellan) • Continent(Europe) • (n-ary) Relationship annotations, i.e., events and facts: • Person_Career(John II, King, Political, current) • General_Relation(Another expl., name, Magellan) • Person_Travel(Henry Hudson, Delaware, past) • Born_In(Magellan, Portugal) • Travel(Magellan, Spain, September Past) • Sentiment annotations: • Positive(the first in Europe) • Neutral(Mediterranean Sea) • Negative(died last year)

  17. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  18. Naive approach to QA

  19. Issues with missing information • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • Triples are missing lots of information: • Who discovered the triple? -- reliability of triples • In which corpus or paragraph does it appear? -- coordinates of triples • What is the URI of an annotated object? -- disambiguation • ... • We claim that this missing information is vital for answering queries

  20. Issues with sets vs. (ordered) bags • Return places visited by Magellan • Available triples: • Magellan Type Person -- occurs 50 times (in the corpus) • Siberia Type Place -- occurs 2 times • Philippines Type Country -- occurs 240 times • Charles Type City -- occurs 1 time • A triple store is a set of triples • every triple has the same “weight” or “importance” • document order and distance between triples are ignored • Some triples are in the triple set • due to annotator mistakes or • because they are noise • To avoid irrelevant triples, ordered bags or triples with weights are needed (see the sketch below)
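A toy sketch (my own illustration, not QUASAR code) of how occurrence counts can turn the set of triples into a weighted bag:

    from collections import Counter

    # Hypothetical (subject, predicate, object) triples extracted from all
    # annotations in the corpus, with repetitions kept instead of discarded.
    extracted = [
        ("Magellan", "Type", "Person"),
        ("Philippines", "Type", "Country"),
        ("Magellan", "Type", "Person"),
        ("Charles", "Type", "City"),
    ]

    # Counting occurrences gives each triple a weight: frequent triples rank
    # high, rare (possibly noisy) triples such as Charles Type City rank low.
    weights = Counter(extracted)
    for triple, count in weights.most_common():
        print(triple, count)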

  21. Issues with joins • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • the triple (Magellan Visited Philippines) is absent • Correlations between triples are missing • How can we join triples? • Standard way: using values • It does not help in our case • Structure (same paragraph), names of annotations, etc., is a way to join triples (as sketched below)
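A minimal sketch of such a structural join, assuming annotation records shaped as in the earlier example (illustrative code, not the QUASAR evaluator):

    # Pair up annotations that occur in the same document and paragraph,
    # e.g. Country annotations with Person("Magellan") annotations.
    def same_paragraph_join(anns_a, anns_b):
        index = {}
        for b in anns_b:
            key = (b["snippet"]["docNum"], b["snippet"]["paraNum"])
            index.setdefault(key, []).append(b)
        for a in anns_a:
            key = (a["snippet"]["docNum"], a["snippet"]["paraNum"])
            for b in index.get(key, []):
                yield (a, b)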

  22. Issues with “being in the box” • Return places visited by Magellan • Available triples: • Magellan Type Person • Siberia Type Place • Philippines Type Country • Charles Type City • How can we find out that (Country SubClassOf Place)? • the schema might not be available • How can we be sure that Charles is indeed a city? • annotators make mistakes • Using external sources of knowledge is the way to go: DBpedia, YAGO, ... (see the sketch below)
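One way to consult such an external source is a SPARQL ASK query against the public DBpedia endpoint; the sketch below uses the SPARQLWrapper Python library and only illustrates the idea, it is not the QUASAR implementation:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    # Is Country a (transitive) subclass of Place in the DBpedia ontology?
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        ASK { dbo:Country rdfs:subClassOf+ dbo:Place }
    """)
    print(sparql.query().convert()["boolean"])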

  23. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  24. QUASAR data model • Nested objects: • annotation • snippet • assertion • arg[i] • naive name (string) • canonical name (string) • a list of URIs (list of strings) • Strings: • corpus; document, paragraph, sentence nr. • predicate • annotator • annotation type
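A minimal sketch of this nesting as Python dataclasses; the exact types and field names are my reading of the slide, not the system's actual classes:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Arg:
        naiveName: str          # the string as it occurs in the text
        canonicalName: str
        uris: List[str]         # candidate URIs, e.g. DBpedia links

    @dataclass
    class Assertion:
        predicate: str
        args: List[Arg]         # arg[0], arg[1], ...

    @dataclass
    class Snippet:
        corpus: str
        docNum: int             # document, paragraph, sentence number
        paraNum: int
        sentNum: int

    @dataclass
    class Annotation:
        snippet: Snippet
        assertion: Assertion
        annotator: str          # e.g. "OpenCalais"
        annotationType: str     # e.g. "concept", "event", "sentiment"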

  25. Query answering over annotated docs • We want to retrieve annotations with specified • doc location • type • predicates • entities • sentiments • URIs • annotators • confidence • ... • QUASAR is an annotation-oriented query language • Return annotations about places visited by Magellan

  26. Example QUASAR queries • Return annotations from the first 2 paragraphs of the corpus: SELECT a FROM explorer_corpus.Annotation a WHERE a.snippet.paraNum <= 2 • Return annotations found by OpenCalais: SELECT a FROM explorer_corpus.Annotation a WHERE a.annotator = "OpenCalais"

  27. Example QUASAR queries • Return event annotations: SELECT a FROM explorer_corpus.Annotation a WHERE a.annotationType = "event" • Return annotations about persons: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.predicate = "Person" • The same query in atom-based notation: ... WHERE a.assertion = Person(?x)

  28. Example QUASAR queries • Return annotations about Magellan • Fuzzy match: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.arg[0] like "Magellan" • Exact match: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion.arg[0] = "Magellan"

  29. Example QUASAR queries • Which places has Magellan visited? • Return (annotations about) countries located in the same paragraph as (annotations with the assertion) Person(Magellan): SELECT a FROM explorer_corpus.Annotation a, explorer_corpus.Annotation b WHERE a.assertion = Country(?x) and a.snippet.docNum = b.snippet.docNum and a.snippet.paraNum = b.snippet.paraNum and b.assertion.predicate = "Person" and b.assertion.arg[0] like "Magellan"

  30. QUASAR queries: general form • The QUASAR query language combines • SQL syntax with • object-oriented navigation • At the moment we support conjunctive queries only • Three clauses of queries: • SELECT annotation | attribute of annotation -- output annotations from one of the annotation sets, or an attribute of annotations • FROM annotation set a, ..., annotation set n -- a list of sets of annotations • WHERE conditions on annotations -- filters on annotations: conditions on annotation attributes, joins of annotations

  31. Are we happy with the quality of answers? • There are too few expected answers • Country(Philippines) – where are the cities? • Place(Atlantic Ocean) – how can we avoid oceans? • How to find all relevant places visited by Magellan? a.assertion.predicate = "Country" | "Province" | "City" | ... • How to get rid of oceans? a.assertion.predicate = not "Ocean" • Annotation vocabularies (concepts, roles, ...) are flat • Annotators cannot expand queries automatically => the user has to do it and write many or complex queries • How can this be avoided? • We use ontologies to • address the “too few expected answers” problem • by expanding queries (see the sketch below)
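A toy sketch of what this query expansion could look like, assuming a small table of direct subclass relationships (illustrative code, not the QUASAR rewriter):

    # Direct subclass relationships, e.g. read from an ontology.
    subclass_of = {"Country": "Place", "Province": "Place",
                   "City": "Place", "Ocean": "Place"}

    def expand(concept, excluded=()):
        """All concepts that are transitively subclasses of `concept`,
        minus the excluded ones; the query is then rewritten as a
        disjunction over these predicates."""
        result, changed = {concept}, True
        while changed:
            changed = False
            for sub, sup in subclass_of.items():
                if sup in result and sub not in result:
                    result.add(sub)
                    changed = True
        return result - set(excluded)

    print(expand("Place", excluded=["Ocean"]))
    # e.g. {'Place', 'Country', 'Province', 'City'}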

  32. TBox reasoning in QUASAR • Return all (explicit and implicit) places which are not oceans: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = ?X(?y) [ontologyFilter: subClassOf(?X, "Place") and disjointWith(?X, "Ocean")] • There are many tools for TBox reasoning • Pellet, Racer, Jena, ... • We use Jena for TBox reasoning and support • subclass of • disjoint with • We allow users to upload ontologies for reasoning • For the demo we use the DBpedia ontology extended with disjointness assertions

  33. Are we happy with the quality of answers? • There are too many wrong answers: Country(John II) and Person(Strait of Magellan) • Annotators make errors • How can we do a semantic check on the results of annotators? • Knowledge bases (KBs) can be used to check the quality of answers • We use available knowledge bases to • address the “too many wrong answers” problem • by exploiting them as filters

  34. Query answering over KBs in QUASAR • Return all places known by DBpedia to be populated: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = Place(?y) [ontologyFilter: Populated(?y)] • Return all organizations known by DBpedia to be educational institutes located in Bolzano: SELECT a FROM explorer_corpus.Annotation a WHERE a.assertion = Organization(?y) [ontologyFilter: EducationInstitute(?y) and locatedIn(?y, "Bolzano")]

  35. Query answering over KBs in QUASAR • There are tools to support query answering over KBs • Quest [Quest] • Owlim [Owlim] • ... • QUASAR uses REQUIEM [RQ] and supports • conjunctive queries • over KBs with ontologies that have • subclass-of • disjointness • In QUASAR users can choose the KBs to be used for reasoning • For the demo we use DBpedia

  36. QUASAR philosophy of query answering • QUASAR = Querying Annotations Structure and Reasoning • External ontologies are used to • do query expansion • increase the number of answers • Annotated documents are used to • retrieve annotations and their attributes • filter annotations using • the structure of documents • the metadata of annotations • External KBs are used to • filter out wrong instances • reason over instances • (The overall strategy is sketched below.)
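Put together, the answering strategy could be sketched roughly as follows; the three helper functions passed in are placeholders standing for the ontology reasoner, the annotation store, and the KB reasoner (a conceptual outline, not the actual architecture):

    def answer(query, expand_with_ontology, evaluate, consistent_with):
        # 1. Query expansion over the ontology, e.g. replace "Place"
        #    by all of its subclasses (more answers).
        expanded = expand_with_ontology(query)

        # 2. Retrieval over the annotated documents, using document
        #    structure and annotation metadata for selections and joins.
        candidates = [a for q in expanded for a in evaluate(q)]

        # 3. Filtering with the external KB, dropping candidates it
        #    contradicts, e.g. Person("Strait of Magellan").
        return [a for a in candidates if consistent_with(a)]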

  37. QUASAR Architecture

  38. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  39. Top-k • The answer set is comparable to the corpus size • annotators are able to discover a lot of entities • the same entity can be annotated with several annotations • combining several annotators makes it even worse • Ranking is needed • Directions: • deterministic ranking based on • document structure (the closer to a reference point, the higher in the ranking) • frequency (the higher the frequency of an annotation, the higher the ranking) • reliability of annotators • ... • probabilistic ranking based on statistics of mistakes
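One possible deterministic scoring function along these lines, with purely illustrative weights and field names:

    def score(annotation, freq, annotator_reliability, ref_para=0):
        # Higher frequency, a more reliable annotator, and closeness to a
        # reference point in the document all push an answer up the ranking.
        distance = abs(annotation["snippet"]["paraNum"] - ref_para)
        return (freq * annotator_reliability) / (1 + distance)

    # Answers are then sorted by score and only the top k are returned, e.g.
    # top_k = sorted(answers, key=lambda a: score(a, freq[a], rel[a]), reverse=True)[:k]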

  40. Visualization • What would be the best way to display answers? • to make the answers more intuitive • good aggregation mechanisms are needed due to large answer sets • The system does not support projections of annotations at the moment • How to visualize different projections? • User studies are needed

  41. User feedback • A form of (indirect) feedback loop is currently present: • user asks a query • user observes the answer set • user refines the query • go to 1 • This is the standard feedback approach in search engines: • the user sends keywords to Google • the user refines the keywords if the result is not good • We want to incorporate direct feedback: • the user should be able to rate answers • good – bad • keep – dismiss • based on the feedback the system should adjust the answer set
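A toy sketch (my own illustration, nothing from the talk) of how keep/dismiss ratings might be folded back into the ranking:

    feedback_boost = {}   # learned adjustment per asserted fact

    def fact_key(annotation):
        args = annotation["assertion"]["args"]
        return (annotation["assertion"]["predicate"],
                tuple(a["canonicalName"] for a in args))

    def record_feedback(annotation, keep):
        # Reward kept answers, penalise dismissed ones.
        key = fact_key(annotation)
        feedback_boost[key] = feedback_boost.get(key, 1.0) * (1.2 if keep else 0.8)

    def adjusted_score(annotation, base_score):
        return base_score * feedback_boost.get(fact_key(annotation), 1.0)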

  42. Probabilistic query answers • Currently the answers are deterministic • We want answers of the form: • (Magellan Visited Philippines) with probability 0.7 • We are working on a probabilistic model • it combines several annotators based on their reliability • the model is an annotator itself • it is a form of probabilistic transducer • it produces annotations with probabilities • probabilities are based on an aggregate opinion of other annotators
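A rough sketch of one possible way to combine several annotators into a probability, namely a reliability-weighted vote (one candidate model, not necessarily the one under development):

    def combined_probability(votes, reliability):
        # votes: {annotator: True/False} -- did the annotator assert the fact?
        # reliability: {annotator: weight in (0, 1]}.
        total = sum(reliability[a] for a in votes)
        positive = sum(reliability[a] for a, v in votes.items() if v)
        return positive / total if total else 0.0

    p = combined_probability(
        {"OpenCalais": True, "AlchemyAPI": True, "Zemanta": False},
        {"OpenCalais": 0.9, "AlchemyAPI": 0.8, "Zemanta": 0.6})
    print(round(p, 2))   # 0.74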

  43. Outline • Sources of annotated data • Semantic markup • Document annotators • How to query annotated data? • QUASAR data model and query language • QUASAR challenges • Summary

  44. Summary • Annotations are an important reality of today’s Web • 13% of crawled HTML pages contain structured data • any text can easily be annotated using automatic annotators • There is a need for query answering • techniques and • tools in order to leverage annotated data for intelligent information search • Current approaches to query answering over triple stores are not adequate, or at least hard to adopt directly • QUASAR’s response to the problem: • a data model • a query language • a demo system

  45. References • [CB’12] C. Bizer: Topology of the Web of Data. Joint keynote talk at LWDM 2012 and BEWEB 2012, EDBT workshops, Berlin, Germany, March 2012. • [Quest] http://obda.inf.unibz.it/protege-plugin/quest/quest.html • [Owlim] http://www.ontotext.com/owlim • [RQ] http://www.cs.ox.ac.uk/projects/requiem/index.html • [Quasar’12] L. Chen, M. Benedikt, and E. Kharlamov. QUASAR: Querying Annotation, Structure, and Reasoning. In Proc. of EDBT, Berlin, March 2012. Demonstration. • [Alchemyapi] www.alchemyapi.com/api/entity/ • [Evriapi] www.evri.com/ • [Jenaapi] jena.sourceforge.net • [Opencalais] www.opencalais.com

  46. References • [KIM’04] A. Kiryakov, B. Popov, I. Terziev, D. Manov, and D. Ognyanoff. Semantic annotation, indexing, and retrieval. J. Web Semantics, 2(1):49–79, 2004. • [Docqs’10] M. Zhou, T. Cheng, and K. C.-C. Chang. DoCQS: a prototype system for supporting data-oriented content query. In SIGMOD, 2010.
