A review on “Answering Relationship Queries on the Web”

Bhushan Pendharkar, ASU ID 993934582

Problem statement
• Existing search engines excel at keyword matching and document ranking but cannot answer relationship queries.
• The paper focuses on finding the relationship between two entities given as queries: it retrieves the top-ranked Web pages for each query and matches them to form a list of Web page pairs.
• Connecting terms are used to determine the relationship and to rank the Web page pairs.
• Given two entities E1 and E2, a Web search engine displays top pages that do not show any relationship between E1 and E2.
• The paper attempts to overcome this shortcoming of current search engines by providing a system and interface for relationship queries.
• The proposed system depends on the Google search engine.

Solution Proposed
• The proposed system accepts two entities as queries through its interface.
• The top-ranked pages for each entity, E1 and E2, are retrieved separately from a search engine such as Google.
• These pages are preprocessed: HTML tags are removed, words are stemmed (Porter stemmer), stop words are removed, and irrelevant words are eliminated (noise removal).
• A term weight is calculated for each common term ‘t’ that indicates a relationship between P1 and P2 (P1 is a result page for query E1, P2 for E2).
• Connecting terms are the terms with the highest term weights.
• Cosine similarity (Okapi method) is used to calculate the similarity between P1 and P2, with P1 and P2 taking the places of ‘document’ and ‘query’ respectively.
• The Web page pairs are sorted in descending order of similarity (or weight) and displayed along with the connecting terms for each pair.
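The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: pages are passed in as HTML strings instead of being fetched from Google, stemming is omitted, a plain smoothed TF-IDF weight stands in for the Okapi weighting, and the stop-word list and all function names (`preprocess`, `tfidf_vectors`, `cosine`, `rank_pairs`) are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; not taken from the paper.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for", "was", "by"}


def preprocess(html):
    """Strip HTML tags, lowercase, tokenize, and drop stop words (no stemming)."""
    text = re.sub(r"<[^>]+>", " ", html)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption): terms present in every page keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pairs(pages_e1, pages_e2, top_terms=3):
    """Score every (P1, P2) pair and return tuples
    (similarity, index in pages_e1, index in pages_e2, connecting terms),
    sorted by descending similarity."""
    docs1 = [preprocess(p) for p in pages_e1]
    docs2 = [preprocess(p) for p in pages_e2]
    vecs = tfidf_vectors(docs1 + docs2)
    v1, v2 = vecs[:len(docs1)], vecs[len(docs1):]
    results = []
    for i, u in enumerate(v1):
        for j, v in enumerate(v2):
            # Connecting terms: shared terms ordered by their combined weight.
            shared = sorted(u.keys() & v.keys(), key=lambda t: (-(u[t] * v[t]), t))
            results.append((cosine(u, v), i, j, shared[:top_terms]))
    results.sort(key=lambda r: -r[0])
    return results
```

Calling `rank_pairs(pages_for_e1, pages_for_e2)` with two lists of page strings returns the page pairs in descending order of similarity, each with its highest-weighted shared terms, which mirrors the final display step described above.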

Criticism of the solution
• Assumption: the top-ranked pages for E1 and the top-ranked pages for E2 do not contain any relationship between E1 and E2. No ground truth is provided to support this; the opposite might be true.
• The overview of the relationship between entities E1 and E2 is given as an arbitrary term ‘Ec’, with no explanation of what ‘Ec’ is.
• The system does little processing of its own and depends heavily on Google results. If the Google results are imperfect or incorrect (rare as that is), the system fails. The paper explicitly mentions “changes in results” if the Google results vary.
• The standard Porter stemmer is used, which is imperfect: “ignition” is stemmed to “ignit” and “Monday” to “Mondai”.
• The paper concludes with an unnecessary explanation of how the results are affected when the steps of the proposed approach are eliminated one at a time, although all steps are necessary for the proper implementation of the system.

Relevance to IRM
• The paper is significantly relevant to the topics taught in the course.
• The crux of the paper is the similarity calculation between Web page pairs (P1, P2), for which cosine similarity is used.
• TF-IDF is used to determine the term weights for terms present in the documents P1 and P2.
• Stemming is used to obtain root words.
• Ranking is based on the similarity values of the Web page pairs.
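For reference, the two course concepts named above can be written out in their textbook form. This is the standard formulation, given here as an assumption about notation; the paper's exact weighting may differ, and implementations often smooth the IDF term. Here $w_{t,P}$ is the weight of term $t$ in page $P$, $\mathrm{tf}_{t,P}$ is the frequency of $t$ in $P$, $N$ is the number of pages in the collection, and $\mathrm{df}_t$ is the number of pages containing $t$.

```latex
w_{t,P} = \mathrm{tf}_{t,P} \cdot \log\frac{N}{\mathrm{df}_t}
\qquad
\mathrm{sim}(P_1, P_2) =
  \frac{\sum_{t} w_{t,P_1}\, w_{t,P_2}}
       {\sqrt{\sum_{t} w_{t,P_1}^{2}}\;\sqrt{\sum_{t} w_{t,P_2}^{2}}}
```

Only terms common to both pages contribute to the numerator, which is why the highest-weighted common terms serve as connecting terms.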
