Ranking



  1. Ranking

  2. Classification of ranking approaches • Three main classification criteria: • Ranking unit: semantic entity (object), semantic subgraph, semantic document, textual document, multimedia item (picture, video, audio), ... • Ranking features: content features, link-structure features, usage features • Ranking models: set-theoretic models, algebraic models, probabilistic models

  3. Set-theoretic models • Represent documents as sets of words or phrases and apply set-theoretic operations on those sets • Standard Boolean model • Extended Boolean model • Example of the standard Boolean model: q = retrieval ∧ (text ∨ ¬multimedia), whose disjunctive normal form over the terms (retrieval, text, multimedia) is q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
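
A minimal Python sketch of how the standard Boolean model evaluates this DNF query against documents represented as term sets; the vocabulary and the toy documents are illustrative, not from the slides:

```python
# Standard Boolean retrieval over the DNF form of the query above.
TERMS = ["retrieval", "text", "multimedia"]

# q = retrieval AND (text OR NOT multimedia), in disjunctive normal form as
# binary weight vectors over (retrieval, text, multimedia).
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def doc_vector(doc_terms):
    """Binary vector: 1 if the index term occurs in the document, else 0."""
    return tuple(1 if t in doc_terms else 0 for t in TERMS)

def matches(doc_terms):
    """A document is retrieved iff its binary vector equals one DNF component."""
    return doc_vector(doc_terms) in Q_DNF

docs = {
    "d1": {"retrieval", "text"},        # (1, 1, 0) -> retrieved
    "d2": {"retrieval", "multimedia"},  # (1, 0, 1) -> not retrieved
    "d3": {"retrieval"},                # (1, 0, 0) -> retrieved
}
print([d for d, terms in docs.items() if matches(terms)])  # ['d1', 'd3']
```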

  4. Algebraic models • Represent documents and queries as vectors, matrices, or tuples; the similarity between the query vector and a document vector is expressed as a scalar value • Vector space model • Latent semantic analysis • Example of the vector space model: a document d_j is a vector of term weights (t_i, d_j) → w_{i,j}, the query q is a vector of term weights (t_i, q) → w_{i,q}, and sim(q, d_j) = cos(q, d_j)
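
A small sketch of vector-space ranking, computing cos(q, d_j) over made-up term weights for a three-term vocabulary:

```python
import math

def cosine(q, d):
    """Cosine of the angle between the query vector and a document vector."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    nq = math.sqrt(sum(w * w for w in q))
    nd = math.sqrt(sum(w * w for w in d))
    return dot / (nq * nd) if nq and nd else 0.0

# Illustrative term weights over a shared three-term vocabulary.
q = [0.8, 0.0, 0.6]             # query weights  w_{i,q}
docs = {"d1": [0.5, 0.3, 0.7],  # document weights w_{i,j}
        "d2": [0.0, 0.9, 0.1]}

ranking = sorted(docs, key=lambda name: cosine(q, docs[name]), reverse=True)
print(ranking)  # documents in decreasing order of sim(q, d_j): ['d1', 'd2']
```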

  5. Probabilistic models • Treat document retrieval as probabilistic inference • Given a query q and a collection of documents D, a subset R of D is assumed to exist which contains exactly the documents relevant to q (the ideal answer set) • Probabilistic retrieval models rank documents in decreasing order of the probability of belonging to R, P(R | q, d_j), where d_j is a document in D • Examples: language models, BM25, BM25F (which additionally considers the document structure, i.e. fields)
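
A plain BM25 sketch over tokenised documents (standard formulation with parameters k1 and b; the toy corpus is illustrative):

```python
import math

def bm25(query_terms, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of one document for a bag-of-words query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)              # document frequency
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)   # smoothed IDF
        f = doc.count(t)                                    # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "semantic retrieval of text documents".split(),
    "multimedia retrieval and ranking of documents".split(),
    "probabilistic models for ranking".split(),
]
ranked = sorted(corpus, key=lambda d: bm25(["retrieval", "ranking"], d, corpus), reverse=True)
print([" ".join(d) for d in ranked])
```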

  6. Classification of ranking approaches (*) systems participating in the 2010 Ad-hoc object retrieval competition (Halpin et al., 2010) Semantic Search Workshop http://km.aifb.kit.edu/ws/semsearch10/

  7. Classification of ranking approaches (*) systems participating in the 2010 Ad-hoc object retrieval competition (Halpin et al., 2010) Semantic Search Workshop http://km.aifb.kit.edu/ws/semsearch10/

  8. History • Ranking semantic entities belonging to an individual ontology (similar to keyword search in databases or to vertical search) • RDQL/SPARQL queries (Boolean retrieval models) • Answers are assumed to be 100% precise (no ranking) • Size: ontologies and KBs tend to be rather small, so only a few answers are returned per information need • Use of non-structured queries (keywords, NL, ...) • Ambiguity: query-term to concept mappings • Size: KBs start to be rather big (TAP, KIM), so ranking becomes necessary

  9. History • Ranking semantic documents • Ambiguity • Query-term to concept mappings • Selection of relevant semantic information sources • Content analysis • Link structure analysis • Human assessments • Size • Increasing number of ontologies • Generation of the first ontology repositories (Protégé) and semantic search engines (Swoogle, Watson, ...)

  10. History • Ranking semantic entities / subgraphs within and across semantic information sources • Ambiguity • Query-term to concept mappings • Selection of relevant semantic information sources • Combination of elements across data sources • Heterogeneity (schema and domains) • Size • Massive amounts of heterogeneous interlinked semantic information (the Web of Data) • Adaptation of traditional IR models for ranking

  11. History • Ranking unstructured information sources by exploiting an individual ontology • Ambiguity: query to concept mapping / document to concept mapping • Size: medium-large document repositories / small or medium-scale semantic data • Problem: semantic incompleteness • Ranking unstructured information sources by exploiting multiple semantic information sources • Problem: scalability / heterogeneity (balance between quality and quantity of annotations)

  12. An example of a semantic ranking approach for the Web of Data: PowerAqua (Lopez et al., ASWC 2009)

  13. Architecture

  14. Search Approach • Step 1: Linguistic analysis • From the NL query to linguistic query triples: • "Which are the languages in Islamic countries?" • <languages, spoken, countries / Islamic> • <Islamic, ?, countries> • Step 2: Identify the relevant semantic sources • Associate each query term with a set of semantic entities from different ontologies, e.g.:
  • COUNTRIES: DBpedia State (property, synonym); SWETO country (class, exact), state (inst., synonym); animals Land (inst., synonym); KIM Commonwealth (inst., synonym); utexas State (class, synonym), country (class, approx.); etc. (more than 1000 mappings for country plus lexically related words)
  • ISLAMIC: DBpedia Islamic_Cairo (inst. heritageSite, approx.), Islamic_University (inst. university, approx.), Islamic_Republic (inst. ?, approx.), etc.; KIM Islamic New Year (inst. festival, approx.), Islamic_Jihad (inst. organization, approx.); SWETO Armed Islamic Group (class, approx.); etc.

  15. Search Approach • Step 3: Map the linguistic query triples against the selected semantic sources to find ontology statements from which answers are extracted

  16. Search Approach • Step 4: merging (see the sketch below) • Scenario 1: A query translates into one query triple only • UNION of answers • e.g.: "Find me cities in Virginia" • Scenario 2: A query translates into two query triples linked by the subject • INTERSECTION of answers • e.g.: "Give me all movies with Morgan Freeman and Brad Pitt" • Scenario 3: A query translates into two query triples linked by the object • Answers of the first CONDITIONED to the second • e.g.: <languages, ?, country> <country, ?, islamic> • Scenario 4: Multiple query triples that combine the previous scenarios
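
A sketch of the three basic merging operations, using hypothetical answer sets for the <languages, ?, country> / <country, ?, islamic> example; this only illustrates the scenarios above, it is not PowerAqua's actual code:

```python
def union(answer_sets):
    """Scenario 1: one query triple -> UNION of the answers from all sources."""
    return set().union(*answer_sets)

def intersection(answer_sets):
    """Scenario 2: two triples linked by the subject -> INTERSECTION of answers."""
    return set.intersection(*answer_sets)

def conditioned(first, second):
    """Scenario 3: triples linked by the object -> keep answers of the first
    triple whose object is itself an answer of the second triple."""
    return {(subj, obj) for subj, obj in first if obj in second}

# <languages, ?, country> as (language, country) pairs, and <country, ?, islamic>;
# both answer sets are made up for illustration.
languages_in_country = {("Arabic", "Egypt"), ("French", "France"), ("Urdu", "Pakistan")}
islamic_countries = {"Egypt", "Pakistan"}

print(conditioned(languages_in_country, islamic_countries))
# {('Arabic', 'Egypt'), ('Urdu', 'Pakistan')}
```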

  17. Search Approach • Step 5: ranking • Ranking by semantic similarity • Ranking by confidence • Ranking by popularity

  18. Ranking by semantic similarity • Obtain the meaning of the query terms in each of the selected semantic sources and rank answers according to the popularity of their semantic interpretation • WordNet + word sense disambiguation • Example: for "find me cities in Virginia", the term Virginia is interpreted as state in O1, state in O2 and person in O3, so the interpretation State has popularity = 2 and Person has popularity = 1

  19. Ranking by confidence • Based on the confidence of the mapping between the linguistic query triples and the ontology statements • Exact mappings are preferred over approximate mappings • Query = "find me capitals in USA" • QT = <capital, ?, USA> • OT1 = <capital (exact), cityof, State> <State, stateOf, USA> • OT2 = <city (hypernym), attribute_country, USA> • Ontology mappings with the highest coverage of the query triple are preferred (see the sketch below)
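
A rough sketch of how a confidence score could prefer exact mappings and higher query-triple coverage; the scoring function and mapping weights below are assumptions for illustration, not the values used in PowerAqua:

```python
# Illustrative mapping-quality weights (assumed, not PowerAqua's).
MAPPING_WEIGHT = {"exact": 1.0, "synonym": 0.8, "hypernym": 0.5, "approx": 0.3}

def confidence(statement_mappings):
    """statement_mappings: list of (query_term, mapping_type) pairs for one
    candidate ontology statement; score = coverage * average mapping quality."""
    coverage = len(statement_mappings)
    quality = sum(MAPPING_WEIGHT[m] for _, m in statement_mappings) / coverage
    return coverage * quality

# Query triple <capital, ?, USA> mapped onto the two candidate statements above.
ot1 = [("capital", "exact"), ("USA", "exact")]      # <capital, cityof, State> <State, stateOf, USA>
ot2 = [("capital", "hypernym"), ("USA", "exact")]   # <city, attribute_country, USA>
print(confidence(ot1), confidence(ot2))             # 2.0 1.5 -> OT1 ranked first
```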

  20. Ranking by popularity • Answers are ranked according to their popularity across ontologies • Example: for the PowerAqua query "Where is Paris?", the answer France has popularity = 4, while United States (for Paris, the city in Texas) has popularity = 1 (see the sketch below)
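
A minimal sketch of the popularity criterion, counting in how many ontologies each answer occurs; the per-ontology answer sets are illustrative:

```python
from collections import Counter

# Hypothetical answers returned by different ontologies for "Where is Paris?".
answers_per_ontology = {
    "dbpedia": {"France"},
    "kim":     {"France"},
    "sweto":   {"France", "United States"},   # Paris, the city in Texas
    "tap":     {"France"},
}

# Popularity of an answer = number of ontologies that return it.
popularity = Counter(a for answers in answers_per_ontology.values() for a in answers)
print(popularity.most_common())   # [('France', 4), ('United States', 1)]
```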

  21. Conclusions • Evaluation • 3 GB of data stored in 130 Sesame repositories (more than 700 RDF/OWL documents: SWETO, TAP, ATO, SWC, FAO, DBpedia, etc.) • 40 user questions collected from the PowerAqua website • Human judgments • Results • The best results are obtained by the confidence measure • Semantic similarity is affected by: • Heterogeneity: labels of entities modelled at different levels of granularity • Entities not covered in WordNet • Not enough taxonomical information to elicit the meaning • Popularity is overshadowed by the sparseness of knowledge in the SW • As the SW matures, this will result in direct improvements for the semantic similarity and popularity measures

  22. Demo • http://technologies.kmi.open.ac.uk/poweraqua/demo.html

  23. An example of a semantic ranking approach for the Web of Documents (hybrid search space) (Fernandez et al., JWS, 2010)

  24. Limitations of semantic retrieval on the Web of Documents • Applying semantic retrieval to a decentralized, heterogeneous and massive repository of content such as the Web is still an open problem • Heterogeneity: Web contents span a potentially unlimited number of domains, impossible to fully cover with a predefined set of ontologies and KBs • Need to collect and provide fast access to the online-available semantic metadata • Scalability: need to manage massive amounts of structured and unstructured information sources • Need to create scalable and flexible indexing methods • Usability: provide users with a usable query interface • Need to support user-friendly query interfaces (beyond SPARQL)

  25. Proposal: a semantic retrieval framework • Prior annotation of the documents is not assumed

  26. Keyword vs. concept-based document indices • Example documents: Document 1: "The blue butterfly...", Document 2: "The sky is blue but the sea is...", Document 3: "The kids were playing in the sea..." • Index terms: blue, sea, butterfly • A keyword index maps each term to the documents containing it; a concept-based index maps semantic entities to the documents they annotate (see the sketch below)
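
A small sketch contrasting the two index types for the three example documents; the concept URIs (prefixed ex:) are hypothetical:

```python
from collections import defaultdict

docs = {
    "d1": "The blue butterfly",
    "d2": "The sky is blue but the sea is",
    "d3": "The kids were playing in the sea",
}

# Keyword (inverted) index: term -> documents containing that term.
keyword_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        keyword_index[term].add(doc_id)

# Concept-based index: documents are annotated with semantic entities, so the
# "blue" in "blue butterfly" can point to an insect rather than to the colour.
concept_index = {
    "ex:Colour_Blue":          {"d2"},
    "ex:Insect_BlueButterfly": {"d1"},
    "ex:Sea":                  {"d2", "d3"},
}

print(sorted(keyword_index["blue"]))            # ['d1', 'd2']  (ambiguous term)
print(sorted(concept_index["ex:Colour_Blue"]))  # ['d2']        (disambiguated)
```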

  27. The ranking model • Adapting the vector-space IR model

  28. The ranking model • Building the query vector • Execute the query (using PowerAqua) → result set R ⊆ O^|V| • Variable weights: for each variable v ∈ V in the query, w_v ∈ [0, 1] • For each x ∈ O, the query vector component is q_x = w_v if x instantiates v in some tuple in R, and 0 otherwise • Building the document vector • Map concepts to keywords • Weight of an instance x ∈ O that annotates a document d: TF-IDF, w_{x,d} = (freq_{x,d} / max_y freq_{y,d}) · log(N / n_x), where freq_{x,d} = number of occurrences of keywords of x in d, n_x = number of documents annotated by x, and N = total number of documents
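
A sketch of this adapted vector-space model: building the query vector from the result set and variable weights, weighting annotations with TF-IDF, and ranking by cosine similarity. Entity IDs, variable weights, frequencies and collection statistics are made up for illustration:

```python
import math

def query_vector(result_set, variable_weights, entities):
    """q_x = w_v if entity x instantiates variable v in some tuple of R, else 0."""
    q = {x: 0.0 for x in entities}
    for row in result_set:                      # row: {variable: entity}
        for v, x in row.items():
            q[x] = max(q[x], variable_weights.get(v, 0.0))
    return q

def annotation_weight(freq_x_d, max_freq_d, n_x, N):
    """TF-IDF weight of instance x in document d, from the definitions above."""
    return (freq_x_d / max_freq_d) * math.log(N / n_x)

def cosine(q, d):
    """Semantic rank value: cosine between query and document vectors."""
    dot = sum(w * d.get(x, 0.0) for x, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

entities = ["ex:JohnnyRogers", "ex:MedicalTest"]
R = [{"?player": "ex:JohnnyRogers", "?test": "ex:MedicalTest"}]   # hypothetical result set
q = query_vector(R, {"?player": 1.0, "?test": 0.6}, entities)
d = {"ex:JohnnyRogers": annotation_weight(3, 5, 12, 1000),
     "ex:MedicalTest":  annotation_weight(2, 5, 40, 1000)}
print(round(cosine(q, d), 2))   # semantic rank value for this document
```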

  29. An example • Document (news article): "Johnny Rogers and Berni Tamames went yesterday through the medical revision required at the beginning of each season, which consisted of a thorough exploration and several cardiovascular and stress tests, that their team mates had already passed the day before. Both players passed without major problems the examinations carried through by the medical team of the club, which is now awaiting the arrival of the North Americans Bramlett and Derrick Alston to conclude the revisioning." • Query vector: (…, 0.87, 0.61, 0.55, 0.22) • Found documents: 66 news articles ranked from 0.1 to 0.89; e.g., for the 1st result: • Document vector: (…, 0.73, …, 0.65, …) • Semantic rank value: cos(d, q) = 0.65 • Keyword rank value: cos(d, q) = 0.06 • Combined rank value: 0.47 • The semantic rank is combined with the keyword rank to avoid the problem of knowledge incompleteness

  30. Conclusions • Evaluation benchmark • Document collection: TREC WT10G • Data collection: 2 GB of publicly available RDF and OWL data, stored and indexed • Queries and judgments: TREC 9 and TREC 2001 test corpora (100 queries with their corresponding judgments); 20 queries selected and adapted to be usable by PowerAqua (our QA query processing module) • Experimental conditions: keyword-based search (Lucene) vs. semantic-based search vs. best TREC search • Evaluation metrics: P@10, MAP

  31. Conclusions • Results • By P@10, semantic retrieval outperforms the other two approaches • By MAP, there is no clear winner • Bias in the MAP measure: more than half of the documents retrieved by the semantic retrieval approach have not been rated in the TREC judgments • Heterogeneity • Enhancing the semantic coverage results in improved retrieval performance • Scalability • The proposed semantic indexing (annotation) methods are able to manage large amounts of content • A trade-off is needed between annotation quantity and quality • Knowledge incompleteness • If the query processing module does not find any answer, the ranking module ensures that the system degrades gracefully to behave as a traditional keyword-based retrieval approach

  32. Some advantages • Better precision by using structured semantic queries (more precise information needs) • E.g. a football player playing in Juventus vs. playing against Juventus • Better recall when querying for instances by class (query expansion) • E.g. "News about companies quoted on NASDAQ" • Better recall by using inference • E.g. "Watersports in Spain" → ScubaDiving, Windsurf, etc. in Cadiz, Valencia, Alicante, etc. • Ambiguity is easier to deal with at the level of concepts

  33. References

  34. References • Harry Halpin, Daniel M. Herzig, Peter Mika, Roi Blanco, Jeffrey Pound, Henry S. Thompson and Duc Thanh Tran. Evaluating Ad-Hoc Object Retrieval. Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST 2010), 9th International Semantic Web Conference (ISWC 2010), Shanghai, PR China, November 2010. • Fernandez, M., Cantador, I., Lopez, V., Vallet, D., Castells, P. and Motta, E. (2010) Semantically Enhanced Information Retrieval: An Ontology-Based Approach. Journal of Web Semantics, Special Issue on Semantic Search. In press. • Lopez, V., Nikolov, A., Fernandez, M., Sabou, M., Uren, V. and Motta, E. (2009) Merging and Ranking Answers in the Semantic Web: The Wisdom of Crowds. The 4th Asian Semantic Web Conference (ASWC 2009), Shanghai, China. • Cantador, I., Fernández, M. and Castells, P. (2007) Improving Ontology Recommendation and Reuse in WebCORE by Collaborative Assessments. Workshop on Social and Collaborative Construction of Structured Knowledge at the 16th International World Wide Web Conference (WWW 2007), Banff, Canada, May 2007. • Fernandez, M., Lopez, V., Motta, E., Sabou, M., Uren, V., Vallet, D. and Castells, P. (2008) Semantic Search Meets the Web. International Conference on Semantic Computing (ICSC 2008), Santa Clara, USA. • Duc Thanh Tran, Haofen Wang and Peter Haase (2009) Hermes: Data Web Search on a Pay-As-You-Go Integration Infrastructure. Journal of Web Semantics, 7(3).

  35. References • Duc Thanh Tran, Haofen Wang, Sebastian Rudolph and Philipp Cimiano (2009) Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. Proceedings of the 25th International Conference on Data Engineering (ICDE 2009), Shanghai, China, March 2009. • Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow and Gerhard Weikum (2009) Language-Model-Based Ranking for Queries on RDF Graphs. Proceedings of the ACM 18th Conference on Information and Knowledge Management (CIKM 2009), Hong Kong. • José R. Pérez-Agüera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias and Victor Fresno (2010) Using BM25F for Semantic Search. Semantic Search Workshop at the 19th International World Wide Web Conference (WWW 2010), April 26, 2010, Raleigh, NC, USA. • Harith Alani, Christopher Brewster and Nigel Shadbolt (2006) Ranking Ontologies with AKTiveRank. In Proceedings of the International Semantic Web Conference (ISWC 2006).
