
Problem



Presentation Transcript


  1. Problem
  Searching across heterogeneous sources: Web, Intranet, Enterprise Information Systems, Databases, …
  Search for…
  • The inventor of the World Wide Web
  • Gothic and romantic churches that are located in the same place
  • Movies with an actor who is the governor of California

  2. Arising Questions
  • What do we know about the structure of the data?
  • Why do we care what the structure of the data is?
  • Can we “structure” data? How?
  • How do we interpret links? Does John Doe also work in Apple?
  Example: “Roy Smith works in Apple” can be annotated as <worker><name>Roy Smith</name><company>Apple</company></worker> and then answered by the XPath query //worker[company="Apple"]/name.
  Annotations may also appear inline and differ across pages: <person>Roy Smith</person> works in <company>Apple</company>; <company>Apple</company> employs <person>John Doe</person>.
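To see why structure matters, the slide's XPath query can be run against a small annotated document. This is a minimal sketch using Python's ElementTree, which supports only a subset of XPath; the <page> wrapper and the free-text sentence are invented for the example:

```python
import xml.etree.ElementTree as ET

# Roy Smith is annotated as a worker; John Doe appears only in free text.
doc = ET.fromstring("""
<page>
  <worker><name>Roy Smith</name><company>Apple</company></worker>
  <p>Apple employs John Doe</p>
</page>
""")

# ElementTree's XPath subset supports [tag='text'] predicates.
names = [n.text for n in doc.findall(".//worker[company='Apple']/name")]
print(names)  # → ['Roy Smith']; the unannotated mention of John Doe is missed
```

The structured query finds exactly the annotated worker, which is the point of the slide: without annotations (or with inconsistent ones), John Doe's employment is invisible to the query.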

  3. Example query #1: What are the publications of Max Planck?
  Here “Max Planck” should be an instance of the concept person, not of the concept institute.
  Concept Awareness

  4. Example query #2: Conferences about XML in Norway in 2005
  The information is not present on a single page but distributed across linked pages: a page “VLDB Conference 2005, Trondheim, Norway” links to a page “Call for Papers … XML …”.
  Context Awareness

  5. Example query #3: Which professors from the Technion do research on theory of computer science?
  A relevant page reads “Dr. Amir Shpilka … senior lecturer …”: the query and the Web pages use different terminology.
  Abstraction Awareness

  6. SphereSearch Concepts
  Goal: increase recall & precision for hard queries on linked and heterogeneous data
  • Unified search for unstructured, semistructured, and structured data from heterogeneous sources
  • Graph-based data model, including links
  • Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries
  • Flexible yet simple abstraction-aware query language with context-aware scoring
  • Compactness-based scores

  7. Outline
  • Challenges in search engines
  • SphereSearch Concepts
  • Transformation and Annotation
  • Query Language and Scoring
  • Experimental Evaluation
  • Current and Future Work

  8. Unifying Search on Heterogeneous Data
  Heterogeneous sources (Web, Intranet, Enterprise Information Systems, Databases, …) are brought into a common representation by heuristics and type-specific transformations.

  9. Heuristic Transformation of HTML
  Goal: transform layout tags into semantic annotations
  • Headlines: <h1>Experiments</h1><h2>Settings</h2>We evaluated...<h2>Results</h2>Our system... becomes <Experiments><Settings>...</Settings><Results>...</Results></Experiments>
  • Patterns: <b>Topic:</b>XML becomes <Topic>XML</Topic>
  • Rules for tables, lists, …
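The headline heuristic can be sketched as a stack-based transformation in which heading levels induce nesting. A minimal sketch; the function name and the (level, payload) encoding are illustrative, not the actual SphereSearch implementation:

```python
import xml.etree.ElementTree as ET

def headings_to_xml(items):
    """items: (level, payload) pairs in document order; level 0 means body text."""
    root = ET.Element("doc")
    stack = [(0, root)]                       # (heading level, open element)
    for level, payload in items:
        if level == 0:                        # body text joins the current section
            cur = stack[-1][1]
            cur.text = (cur.text or "") + payload
        else:
            while stack[-1][0] >= level:      # close sections at this level or deeper
                stack.pop()
            el = ET.SubElement(stack[-1][1], payload)
            stack.append((level, el))
    return root

tree = headings_to_xml([(1, "Experiments"), (2, "Settings"), (0, "We evaluated..."),
                        (2, "Results"), (0, "Our system...")])
print(ET.tostring(tree, encoding="unicode"))
# → <doc><Experiments><Settings>We evaluated...</Settings><Results>Our system...</Results></Experiments></doc>
```

The stack mirrors the currently open sections: a new <h2> closes any open <h2> section before starting its own, which reproduces the nesting shown on the slide.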

  10. Basic Data Model
  Tags annotate content with the corresponding concept. The document <Professor>Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research></Professor> yields the elements:
  (1) docid=1, tag=“Professor”, content=“Gerhard Weikum Saarbrücken”
  (2) docid=1, tag=“Course”, content=“IR”
  (3) docid=1, tag=“Research”, content=“XML”
  Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction.

  11. Named Entity Recognition
  An Information Extraction (IE) technique based on part-of-speech tagging and large dictionaries containing names of cities, countries, common person names, etc.
  • Mature (out-of-the-box products, e.g., GATE/ANNIE)
  • Extensible
  Example: “The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm.” becomes “The <company>Pelican Hotel</company> in <location>Salvador</location>, operated by <person>Roberto Cardoso</person>, offers comfortable rooms starting at <price>$100</price> a night, including breakfast. Please check in before <time>7pm</time>.”
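GATE/ANNIE is a Java toolkit; in the same spirit, a toy gazetteer-plus-pattern annotator can be sketched in a few lines. The entity lists and regexes below are illustrative stand-ins, not ANNIE's actual resources:

```python
import re

# Toy gazetteer (dictionary lookup) and surface patterns, in the spirit of
# ANNIE's gazetteer + transducer pipeline. All entries are illustrative.
GAZETTEER = {
    "location": ["Salvador"],
    "person": ["Roberto Cardoso"],
}
PATTERNS = {
    "price": r"\$\d+",
    "time": r"\b\d{1,2}(?:am|pm)\b",
}

def annotate(text):
    # Dictionary lookups first, then regex patterns for prices and times.
    for concept, names in GAZETTEER.items():
        for name in names:
            text = text.replace(name, f"<{concept}>{name}</{concept}>")
    for concept, pat in PATTERNS.items():
        text = re.sub(pat, lambda m: f"<{concept}>{m.group(0)}</{concept}>", text)
    return text

print(annotate("Check in at the hotel in Salvador before 7pm."))
# → Check in at the hotel in <location>Salvador</location> before <time>7pm</time>.
```

Real NER systems add part-of-speech context and disambiguation on top of such lookups, but the output shape (inline concept tags) is exactly what the annotation-aware data model on the next slides consumes.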

  12. Annotation-Aware Data Model
  Annotation with GATE: “Saarbrücken” is recognized as type “location”. The document <Professor>Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research></Professor>, which previously yielded
  (1) docid=1, tag=“Professor”, content=“Gerhard Weikum Saarbrücken”
  (2) docid=1, tag=“Course”, content=“IR”
  (3) docid=1, tag=“Research”, content=“XML”
  now yields
  (1) docid=1, tag=“Professor”, content=“Gerhard Weikum”
  (2) docid=1, tag=“Course”, content=“IR”
  (3) docid=1, tag=“Research”, content=“XML”
  (4) docid=1, tag=“location”, content=“Saarbrücken”
  Annotation introduces new tags.

  13. Unifying Search on Heterogeneous Data
  Heterogeneous sources (Web, Intranet, Enterprise Information Systems, Databases, …) are transformed by heuristics and type-specific transformations and then annotated with named entities using IE tools (e.g., GATE), yielding annotated XML.

  14. Outline
  • Challenges in search engines
  • SphereSearch Concepts
  • Transformation and Annotation
  • Query Language and Scoring
  • Experimental Evaluation
  • Current and Future Work

  15. Data Model
  Given a collection of XML documents and links, we define the element-level graph G = (V, E):
  • V is the union of the elements of all documents
  • E contains all parent-child edges and links
  • Attributes are treated as if they were elements
  • G is undirected, so queries can be phrased without thinking about the direction of the edges
  • For each element e, tag(e) denotes its tag name and content(e) its content
  • Each edge is assigned a nonnegative weight, which is 1 for parent-child edges and larger for links
  • d(n1, n2) takes two elements as input and computes the weight of the shortest path in G between them
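The distance function over the element-level graph is a plain weighted shortest-path computation. A minimal sketch with Dijkstra's algorithm; the element names and the link weight of 2.0 are illustrative assumptions:

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path weights from src in an undirected weighted graph."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def add_edge(adj, u, v, w):
    adj.setdefault(u, []).append((v, w))
    adj.setdefault(v, []).append((u, w))  # G is undirected

adj = {}
add_edge(adj, "doc1/root", "doc1/sec", 1.0)     # parent-child edge, weight 1
add_edge(adj, "doc1/sec", "doc2/root", 2.0)     # hyperlink, heavier weight
print(dijkstra(adj, "doc1/root")["doc2/root"])  # → 3.0
```

Because link edges are heavier than parent-child edges, elements reached via hyperlinks count as farther away, which later damps their contribution to sphere scores.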

  16. SphereSearch Queries
  Extended keyword queries with:
  • similarity conditions: ~professor, ~“information retrieval”
  • concept-based conditions: person=Max Planck, location=Paris
  • grouping
  • join conditions
  Ranked results with context-aware scoring

  17. SphereSearch Queries: Examples
  A basic condition has the form concept (tag) = value (content); each query group is a disjunction of basic conditions.
  • R(professor, location=~Jordan) C(course, ~database)
  • R(~seminar, ~XML)
  • A(gothic, church) B(romantic, church) with join condition A.location=B.location
  Traditional keyword search is supported as a special case.

  18. Formal Query Language
  • A SphereSearch query consists of a set of query groups and a set of join conditions (possibly empty)
  • Each query group consists of keyword conditions and concept-value conditions
  • A join has the form A.T=B.S for an exact-match join and A.T~B.S for a similarity join (A, B are query groups; T, S are tag names)
  A result for a query with g groups is a list of g-tuples of elements, sorted by score.

  19. Local Node Score
  Query: Course ~XML location=Max Planck
  • We compute a node score ns for each node, based on an IR scoring model (tf·idf, Okapi BM25, …) adapted for XML
  • How can we use such a measure to score ~XML or location=Max Planck?
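For the keyword part, a minimal Okapi BM25 sketch over element contents shows the shape of such a node score. The parameters and formula are the textbook BM25, not necessarily SphereSearch's exact XML adaptation:

```python
import math

def bm25(term, elements, k1=1.2, b=0.75):
    """BM25 score of `term` for each element, given elements as token lists."""
    N = len(elements)
    avgdl = sum(len(e) for e in elements) / N
    df = sum(1 for e in elements if term in e)
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    scores = []
    for e in elements:
        tf = e.count(term)
        scores.append(idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(e) / avgdl)))
    return scores

# Element 0 mentions "xml" twice, element 1 not at all.
docs = [["xml", "retrieval", "xml"], ["database", "systems"]]
s = bm25("xml", docs)
print(s[0] > s[1])  # → True
```

This plain node score handles literal keywords; the following slides extend it to similarity conditions (~XML) and concept-value conditions (location=Max Planck).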

  20. Similarity Conditions
  Similarity conditions like ~professor, ~XML are evaluated against a thesaurus/ontology: concepts, relationships, and glosses from WordNet, gazetteers, Web forms & tables, and Wikipedia. For example, professor is related to lecturer, scientist, scholar, researcher, teacher, mentor, and academic/faculty member (e.g., via a HYPONYM edge with weight 0.7), but not to artist, wizard, or primadonna. Relationships are quantified by statistical co-occurrence measures.
  For ~K we first compute exp(K), the set of all terms similar to K according to the ontology: δ-exp(K) = {w | sim(K, w) > δ}
  Local score: weighted max over all expansion terms, i.e., the maximum over w in δ-exp(K) of sim(K, w) · ns(w, e)
  Abstraction awareness
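The expansion-and-weighted-max step can be sketched as follows. The similarity table SIM is a hypothetical stand-in for the ontology, and δ = 0.5 is an arbitrary threshold:

```python
# Hypothetical similarity table standing in for the WordNet-based ontology.
SIM = {"professor": {"professor": 1.0, "lecturer": 0.8, "scientist": 0.6, "artist": 0.2}}

def expand(k, delta=0.5):
    """δ-exp(K) = {w | sim(K, w) > δ}, keeping the similarity weights."""
    return {w: s for w, s in SIM.get(k, {k: 1.0}).items() if s > delta}

def local_score(k, node_scores, delta=0.5):
    """Weighted max over the expansion set: max_w sim(K, w) * ns(w, e)."""
    exp_k = expand(k, delta)
    return max((s * node_scores.get(w, 0.0) for w, s in exp_k.items()), default=0.0)

# An element that mentions "lecturer" but never "professor" still scores:
print(round(local_score("professor", {"lecturer": 0.9}), 2))  # → 0.72
```

This is how ~professor matches the "senior lecturer" page from example query #3: the expansion bridges the terminology gap, and the similarity weight keeps exact matches ranked higher than approximate ones.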

  21. Concept-Based Conditions
  Goal: exploit explicit tags and automatic annotations in documents
  A concept-value condition location=Jordan matches a node such as docid=1, tag=“location”, content=“Jordan”.
  Concept-specific distance measures are possible, e.g., 1970<date<1980.
  Concept awareness

  22. Spheres
  • Most existing retrieval systems consider only the content of an element itself to assess its relevance for a query
  • In SphereSearch this type of score is provided by the node score (ns)
  • The local score may not be sufficient:
  • in the presence of links
  • when content is spread over several elements

  23. Score Aggregation: SphereScore
  • Local score for each element e (tf·idf, BM25, …)
  • Weighted aggregation of the local scores in the environment of the element (sphere score): nodes at distance d from e form its d-th sphere, and their contributions are damped as d grows
  [Figure: an element whose neighbors match “research XML” receives a high sphere score]
  Context awareness
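A minimal sketch of the damped aggregation; the decay factor α = 0.5 and the sphere radius of 2 are illustrative choices, not the tuned parameters:

```python
def sphere_score(node_scores, dist_from_e, alpha=0.5, max_d=2):
    """Aggregate local node scores around e, damped by alpha per distance step."""
    total = 0.0
    for n, ns in node_scores.items():
        d = dist_from_e.get(n)
        if d is not None and d <= max_d:
            total += (alpha ** d) * ns    # d-th sphere contributes alpha^d * ns
    return total

# e itself scores 1.0; a child at distance 1 contributes half its local score;
# a node at distance 3 lies outside the outermost sphere and is ignored.
print(sphere_score({"e": 1.0, "child": 0.6, "far": 0.9},
                   {"e": 0, "child": 1, "far": 3}))  # → 1.3
```

An element can thus score well for "research XML" even when the two terms sit in neighboring elements rather than in the element itself, which is the context awareness the slide illustrates.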

  24. Query Groups
  Goal: related terms should occur in the same context
  Group conditions that relate to the same “entity”: for a professor teaching IR whose research is on XML, write professor T(teaching IR) R(research XML).
  Find compact sets of elements with one result for each group; a SphereScore is computed for each group.

  25. Scores for Query Results
  A query result R contains one result element per query group; its score reflects compactness:
  compactness ~ 1 / size of a minimal spanning tree connecting the result elements
  [Figure: alternative result tuples for groups A and B with spanning trees of different sizes]
  Context awareness
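The compactness idea can be sketched by building a spanning tree over one matched element per group from precomputed pairwise distances (Prim's algorithm over the metric closure). The 1/(1 + weight) normalization is an illustrative choice, not the exact formula:

```python
def mst_weight(nodes, dist):
    """Weight of a minimum spanning tree over `nodes` (Prim's algorithm);
    dist maps frozenset({u, v}) to the pairwise distance."""
    in_tree, total = {nodes[0]}, 0.0
    while len(in_tree) < len(nodes):
        w, v = min((dist[frozenset((u, x))], x)
                   for u in in_tree for x in nodes if x not in in_tree)
        in_tree.add(v)
        total += w
    return total

def compactness(nodes, dist):
    return 1.0 / (1.0 + mst_weight(nodes, dist))

# One matched element per group (A, B) plus a connecting element X.
d = {frozenset(("A", "X")): 1.0, frozenset(("X", "B")): 2.0, frozenset(("A", "B")): 3.0}
print(compactness(["A", "X", "B"], d))  # → 0.25
```

Result tuples whose elements sit close together in the element-level graph get small trees and therefore high compactness, implementing the "same context" preference from the previous slide.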

  26. Join Conditions
  Goal: connect results of different query groups
  Example: A(research, XML) B(VLDB 2005 paper) A.person=B.person, where group A matches a page about “Ralf Schenkel … research XML” and group B matches a VLDB 2005 paper by “R. Schenkel” (similarity 0.9)
  • Join conditions do not change the score of a node
  • Join conditions create a new link with a specific weight

  27. Score for Join Conditions
  Join condition A.T=B.S: for all nodes n1 with type T and n2 with type S, add an edge (n1, n2) with weight sim(n1, n2)^-1, where sim(n1, n2) is a content-based similarity.
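A minimal sketch of adding join edges; the name_sim heuristic and the 0.5 threshold are illustrative, not the actual similarity measure:

```python
def join_edges(nodes_t, nodes_s, sim, threshold=0.5):
    """For node pairs with sufficient content similarity, create a virtual
    edge weighted by the inverse similarity (more similar = closer)."""
    edges = []
    for n1, c1 in nodes_t:
        for n2, c2 in nodes_s:
            s = sim(c1, c2)
            if s > threshold:
                edges.append((n1, n2, 1.0 / s))
    return edges

# Toy similarity: 1.0 for exact match, 0.9 when only the surname matches.
def name_sim(a, b):
    if a == b:
        return 1.0
    if a.split()[-1] == b.split()[-1]:
        return 0.9
    return 0.0

edges = join_edges([("a1", "Ralf Schenkel")], [("b1", "R. Schenkel")], name_sim)
print(edges)  # one virtual edge a1–b1 with weight 1/0.9
```

Because the new edges enter the same element-level graph, highly similar join partners sit close together and compact result tuples across groups score well, without any node score being altered.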

  28. Architecture
  Components: Crawler, Transformer, Annotator, Indexer, Index, Query Processor, Ontology Service, Client GUI

  29. Outline
  • Challenges in search engines
  • SphereSearch Concepts
  • Transformation and Annotation
  • Query Language and Scoring
  • Experimental Evaluation
  • Current and Future Work

  30. Setup for Experiments
  No existing benchmark (INEX, TREC, …) fits. Three corpora:
  • Wikipedia
  • extended Wikipedia with links to IMDB
  • extended DBLP corpus with links to homepages
  50 queries like:
  • A(actor birthday 1970<date<1980) western
  • G(California, governor) M(movie)
  • A(Madonna, husband) B(director) A.person=B.director
  Opponent: keyword queries with a standard tf·idf-based score (a “simplified Google”)

  31. Incremental Language Levels
  • SSE-basic (keywords, SphereScores)
  • SSE-CV (concept-based conditions)
  • SSE-QG (query groups)
  • SSE-Join (join conditions)

  32. Experimental Results on Wikipedia

  33. Experimental Results on Wiki++ and DBLP++ • SphereScores better than local scores • New SSE features improve precision

  34. Qualitative Query Examples
  • Concept-value: (American, politician, rice) vs. (person=rice, politician): precision@10 improves from 0 to 0.6; another concept-value example is (1970<date<1980, actor)
  • Query groups: (California, governor, movie) vs. G(California, governor) M(movie): precision@10 improves from 0 to 0.4
  • Joins: A(Madonna, husband) B(director) A.person=B.director: precision@10 = 0.4; one matching element pair combines (1) a movie directed by Guy Ritchie with (2) the information that Guy Ritchie is Madonna’s husband

  35. Conclusion
  • We introduced SphereSearch for unified ranked retrieval of heterogeneous data
  • Transformation of heterogeneous data into a unified format
  • Annotation of latent concepts
  • Incorporates concept, context, and abstraction features
  • A query language that is
  • more expressive than traditional keyword search
  • simpler than a full-fledged XML query language / SPARQL
  • The Spheres idea seems to be beneficial
  • Preliminary experiments show improvements on certain queries

  36. Future Work
  • Further experimentation is needed
  • Integration with the Semantic Web
  • use of existing tools (such as SPARQL)
  • ontology-based, inheritance-aware search
  • standardization of annotated tags
  • easier integration with existing RDF data
  • More efficient query evaluation
  • Better similarity measures
  • Parameter tuning with relevance feedback
  • Deep Web search through automatic queries

  37. Thank you!
