
CS533 Information Retrieval


Presentation Transcript


  1. CS533 Information Retrieval Dr. Weiyi Meng Lecture #17 April 4, 2000

  2. Metasearch Engine Two observations about search engines: • Web pages a user needs are frequently stored in multiple search engines. • The coverage of each search engine is limited. • Combining multiple search engines may increase the coverage. A metasearch engine is a good mechanism for solving these problems.

  3. Metasearch Engine Solution [Architecture diagram: the user issues a query to the user interface; a query dispatcher sends it to search engine 1, search engine 2, …, search engine n (each over its own text source 1, …, text source n); a result merger combines the returned results and presents them to the user.]

  4. Some Observations • When n is small (say < 10), we can afford to send each query to all local search engines. • When n is large (imagine n in the 1000s), most sources are not useful for a given query, and sending a query to a useless source would: • incur unnecessary network traffic • waste local resources evaluating the query • increase the cost of merging the results

  5. A More Efficient Metasearch Engine [Architecture diagram: the same components as in slide 3, with a database selector and a document selector added between the user interface and the query dispatcher / result merger, so the query reaches only the selected search engines and only selected documents are returned.]

  6. Introduction to Metasearch Engine (1) Database Selection Problem • Select potentially useful databases for a given query • essential if the number of local databases is large • reduce network traffic • avoid wasting local resources

  7. Introduction to Metasearch Engine (2) • Potentially useful database: a database that contains potentially useful documents. • Potentially useful document: • Its global similarity with the query is above a threshold. • Its global similarity with the query is among the m highest for some m.
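
The two criteria can be made concrete with a small Python sketch, assuming the global similarities are already known (function and variable names are illustrative, not from the lecture):

    # Potentially useful documents: global similarity above a threshold,
    # or among the m highest global similarities.
    def useful_by_threshold(global_sims, threshold):
        return [d for d, s in global_sims.items() if s > threshold]

    def useful_by_top_m(global_sims, m):
        ranked = sorted(global_sims, key=global_sims.get, reverse=True)
        return ranked[:m]

    sims = {"d1": 0.8, "d2": 0.5, "d3": 0.2}
    print(useful_by_threshold(sims, 0.4))  # ['d1', 'd2']
    print(useful_by_top_m(sims, 1))        # ['d1']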

  8. Introduction to Metasearch Engine (3) • Need some knowledge about each database in advance in order to perform database selection • Database Representative

  9. Introduction to Metasearch Engine (4) Document Selection Problem Select potentially useful documents from each selected local database efficiently • Retrieve all potentially useful documents while minimizing the retrieval of useless documents • from global similarity threshold to tightest local similarity threshold

  10. Introduction to Metasearch Engine (5) Result Merging Problem Objective: Merge returned documents from multiple sources into a single ranked list. [Diagram: the result lists of DB1 (d11, d12, …) through DBN (dN1, dN2, …) feed into a merger, which outputs one ranked list (d12, d54, …).]

  11. Introduction to Metasearch Engine (6) An “Ideal” Metasearch Engine: • Retrieval effectiveness: the same as if all documents were in a single collection. • Efficiency: optimize the retrieval process.

  12. Introduction to Metasearch Engine (7) Implications of an ideal metasearch engine: it should aim at: • selecting only useful search engines • retrieving and transmitting only useful documents • ranking documents according to their degrees of relevance

  13. Database Selection: Basic Idea Goal: Identify potentially useful databases for each user query. General approach: • use a representative to approximately indicate the content of each database • use these representatives to select databases for each query

  14. Solution Classification • Naive Approach: Select all databases (e.g. MetaCrawler, NCSTRL) • Qualitative Approaches: estimate the quality of each local database • based on rough representatives • based on detailed representatives

  15. Solution Classification (cont.) • Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly • Learning-based Approaches: database representatives are obtained through training or learning

  16. Qualitative Approaches Using Rough Representatives (1) • typical representative: • a few words or a few paragraphs in a certain format • manual construction is often needed General remarks: • may work well for special-purpose local search engines • selection can be inaccurate

  17. Approaches Using Rough Representatives (2) Example: ALIWEB (Koster 94) Template-Type: DOCUMENT Title: Perl Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext. Keywords: perl, perl-faq, language

  18. Qualitative Approaches Using Detailed Representatives (1) • Use detailed statistics for each term. • Estimate the usefulness or quality of each search engine for each query. • The usefulness measures are less direct/explicit compared to those used in quantitative approaches. • Scalability starts to become an issue.

  19. Qualitative Approaches Using Detailed Representatives (2) Example: gGlOSS (generalized Glossary-Of-Servers Server, Gravano 95) • representative: (dfi, Wi) for term ti, where dfi is the document frequency of ti and Wi is the sum of the weights of ti in all documents

  20. gGlOSS (continued) • database usefulness: sum of high similarities usefulness(q, D, T) = Σ { sim(q, d) : d ∈ D and sim(q, d) ≥ T }, where D is a database and T is a threshold.

  21. gGlOSS (continued) Suppose for query q , we have D1 d11: 0.6, d12: 0.5 D2 d21: 0.3, d22: 0.3, d23: 0.2 D3 d31: 0.7, d32: 0.1, d33: 0.1 usefulness(q, D1, 0.3) = 1.1 usefulness(q, D2, 0.3) = 0.6 usefulness(q, D3, 0.3) = 0.7
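
As a small sketch, the usefulness measure of this example can be computed directly when the per-document similarities are known (in practice gGlOSS only estimates this sum from the (dfi, Wi) representative); ">=" is used because the slide's numbers include documents whose similarity equals the threshold:

    # usefulness(q, D, T): sum of the similarities that reach the threshold T.
    def usefulness(similarities, T):
        return sum(s for s in similarities if s >= T)

    D1 = [0.6, 0.5]
    D2 = [0.3, 0.3, 0.2]
    D3 = [0.7, 0.1, 0.1]
    print(round(usefulness(D1, 0.3), 2),
          round(usefulness(D2, 0.3), 2),
          round(usefulness(D3, 0.3), 2))   # 1.1 0.6 0.7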

  22. gGlOSS (continued) Usefulness is estimated based on two cases: • high-correlation case: if dfi ≤ dfj, then every document having ti also has tj. • disjoint case: for any two query terms ti and tj, no document contains both ti and tj.

  23. gGlOSS (continued) Example (high-correlation case): Consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.
         actual weights         assumed under high correlation
         t1    t2    t3         t1    t2    t3
    d1   0.2   0.1   0.3        0.3   0.2   0.3
    d2   0.4   0.3   0.2        0.3   0.2   0.3
    d3   0     0.2   0.4        0     0.2   0.3
    d4   0     0     0.3        0     0     0.3
  • usefulness(q, D, 0.5) = 2.1
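
A sketch of the high-correlation estimate that reproduces the computation above, assuming query weights of 1 and the (dfi, Wi) representative; under the nesting assumption, the dfi documents containing ti are assumed to also contain every term with a larger document frequency, each with its average weight Wi/dfi (names are illustrative):

    # gGlOSS-style high-correlation estimate of usefulness(q, D, T).
    # terms: list of (query_weight, df, W) for the query terms.
    def usefulness_high_correlation(terms, T):
        terms = sorted(terms, key=lambda t: t[1])   # ascending document frequency
        estimate, prev_df = 0.0, 0
        for i, (_, df, _) in enumerate(terms):
            band_size = df - prev_df                # docs assumed to contain ti..tr only
            band_sim = sum(qw * (Wj / dfj) for qw, dfj, Wj in terms[i:])
            if band_sim >= T:                       # ">=" matches the slide's numbers
                estimate += band_size * band_sim
            prev_df = df
        return estimate

    # Slide example: q = (1, 1, 1), df = (2, 3, 4), W = (0.6, 0.6, 1.2), T = 0.5
    print(round(usefulness_high_correlation([(1, 2, 0.6), (1, 3, 0.6), (1, 4, 1.2)], 0.5), 2))
    # 2.1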

  24. Quantitative Approaches Two types of quantities may be estimated: 1. the number of documents in a database D with similarities higher than a threshold T: NoDoc(q, D, T) = |{ d : d ∈ D and sim(q, d) > T }| 2. the global similarity of the most similar document in D: msim(q, D) = max { sim(q, d) : d ∈ D }
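
Stated as Python for clarity, these are the exact quantities when all similarities sim(q, d) are available (they normally are not, which is why the estimation techniques below are needed):

    # Exact NoDoc and msim, given the similarities of q with every document of D.
    def NoDoc(similarities, T):
        return sum(1 for s in similarities if s > T)

    def msim(similarities):
        return max(similarities)

    sims = [0.6, 0.5, 0.2]
    print(NoDoc(sims, 0.3), msim(sims))   # 2 0.6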

  25. Quantitative Approaches Qualitative approaches versus quantitative approaches: • Usefulness measures in quantitative approaches are easier to understand and easier to use. • Quantitative measures are usually more difficult to estimate and require more information.

  26. Estimating NoDoc(q, D, T) (1) Basic Approach (Meng 98) • representative: (pi, wi) for term ti pi: probability ti appears in a document wi: average weight of ti among documents having ti Ex: Normalized weights of ti in 10 docs are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6). pi = 0.6, wi = 0.4
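
A small sketch of how (pi, wi) is obtained from the normalized weights of a term across the documents, reproducing the example above (names are illustrative):

    # p_i: fraction of documents containing the term (non-zero weight).
    # w_i: average weight of the term among the documents that contain it.
    def term_representative(weights):
        nonzero = [w for w in weights if w > 0]
        p = len(nonzero) / len(weights)
        w = sum(nonzero) / len(nonzero)
        return p, w

    p, w = term_representative([0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6])
    print(p, round(w, 2))   # 0.6 0.4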

  27. Estimating NoDoc(q, D, T) (2) Example: Consider query q = (1, 1). Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1. A generating function: (0.2X^2 + 0.8)(0.4X + 0.6) = 0.08X^3 + 0.12X^2 + 0.32X + 0.48 aX^b: a is the probability that a document in D has similarity b with q. NoDoc(q, D, 1) = 10*(0.08 + 0.12) = 2

  28. Estimating NoDoc(q, D, T) (3) Example: Consider query q = (1, 1, 1) and documents: (0, 2, 2), (1, 0, 1), (0, 2, 0), (0, 0, 3), (0, 0, 0). (p1, w1) = (0.2, 1), (p2, w2) = (0.4, 2), (p3, w3) = (0.6, 2) The generating function for this query: (0.2X + 0.8)(0.4X^2 + 0.6)(0.6X^2 + 0.4) = 0.048X^5 + 0.192X^4 + 0.104X^3 + 0.416X^2 + 0.048X + 0.192 The accurate function for this query: 0X^5 + 0.2X^4 + 0.2X^3 + 0.4X^2 + 0X + 0.2

  29. Estimating NoDoc(q, D, T) (4) Consider query q = (q1, ..., qr). Proposition. If the terms are independent and the weight of term ti whenever present in a document is wi, then the coefficient of X^s in the generating function (p1X^(q1*w1) + (1 - p1)) (p2X^(q2*w2) + (1 - p2)) … (prX^(qr*wr) + (1 - pr)) is the probability that a document in D has similarity s with q.

  30. Estimating NoDoc(q, D, T) (5) Suppose the expanded generating function is: a1X^b1 + a2X^b2 + … + acX^bc, with b1 > … > bc. For a given threshold T, let v be the largest integer satisfying bv > T. Then NoDoc(q, D, T) can be estimated by: n*(a1 + a2 + … + av), where n is the number of documents in D.
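
A sketch of the whole estimation, assuming independent terms and query weights of 1 as in the examples above; the generating function is represented as a map from exponent (similarity value) to coefficient (probability), multiplying in one factor per query term:

    from collections import defaultdict

    # reps: list of (p_i, w_i) for the query terms.
    def generating_function(reps):
        poly = {0.0: 1.0}                        # start with the constant 1
        for p, w in reps:
            new = defaultdict(float)
            for expo, coef in poly.items():
                new[expo + w] += coef * p        # term present in the document
                new[expo] += coef * (1 - p)      # term absent
            poly = dict(new)
        return poly

    def NoDoc_estimate(reps, T, n):
        poly = generating_function(reps)
        return n * sum(c for s, c in poly.items() if s > T)

    # Slide 27 example: p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1, n = 10, T = 1
    print(round(NoDoc_estimate([(0.2, 2), (0.4, 1)], 1, 10), 2))   # 2.0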

  31. Database Selection Using msim(q,D) Optimal Ranking of Databases (Yu 99) User: for query q, find the m most similar documents. Definition: Databases [D1, D2, …, Dp] are optimally ranked with respect to q if there exists a k such that each of the databases D1, …, Dk contains one of the m most similar documents, and all of these m documents are contained in these k databases.

  32. Database Selection Using msim(q,D) Optimal Ranking of Databases Example: For a given query q: D1 d1: 0.8, d2: 0.5, d3: 0.2, ... D2 d9: 0.7, d2: 0.6, d10: 0.4, ... D3 d8: 0.9, d12: 0.3, … other databases have documents with small similarities When m = 5: pick D1, D2, D3

  33. Database Selection Using msim(q,D) Proposition: Databases [D1, D2, …, Dp] are optimally ranked with respect to a query q if and only if msim(q, Di) ≥ msim(q, Dj) for all i < j. Example: D1 d1: 0.8, … D2 d9: 0.7, … D3 d8: 0.9, … Optimal rank: [D3, D1, D2, …]
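
The proposition gives a direct ranking rule: order the databases by their (estimated) msim values in descending order. A minimal sketch (names are illustrative):

    # Rank databases by the global similarity of their most similar document.
    def optimal_order(msims):
        return sorted(msims, key=msims.get, reverse=True)

    print(optimal_order({"D1": 0.8, "D2": 0.7, "D3": 0.9}))   # ['D3', 'D1', 'D2']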

  34. Estimating msim(q, D) • global database representative: global dfi of term ti • local database representative: anwi: average normalized weight of ti mnwi: maximum normalized weight of ti Ex: term ti: d1 0.3, d2 0.4, d3 0, d4 0.7 anwi = (0.3 + 0.4 + 0 + 0.7)/4 = 0.35, mnwi = 0.7

  35. Estimating msim(q, D) Term weighting scheme: query term weight: tf*gidf; document term weight: tf. Query q = (q1, q2, …, qk); modified query: q' = (q1*gidf1, …, qk*gidfk). msim(q, D) = max over 1 ≤ i ≤ k of { qi*gidfi*mnwi + Σ over j ≠ i of qj*gidfj*anwj } / |q'|
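
A sketch of this estimate, assuming the per-term statistics are passed as parallel lists (parameter names are illustrative): one query term is assumed to take its maximum normalized weight in the best document, while every other term takes its average normalized weight.

    # q: query term weights; gidf, anw, mnw: per-term statistics for database D.
    def msim_estimate(q, gidf, anw, mnw):
        k = len(q)
        adjusted = [q[i] * gidf[i] for i in range(k)]      # components of q'
        norm = sum(a * a for a in adjusted) ** 0.5         # |q'|
        best = max(
            adjusted[i] * mnw[i]
            + sum(adjusted[j] * anw[j] for j in range(k) if j != i)
            for i in range(k)
        )
        return best / norm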

  36. Learning-based Approaches Basic idea: Use past retrieval experiences to predict future database usefulness. Different types of learning methods • Static learning : learning based on static training queries before the system is used by real users.

  37. Learning-based Approaches Different types of learning methods (cont.) • Dynamic learning : learning based on real evaluated user queries. • Combined learning: learned knowledge based on training queries will be adjusted based on real user queries.

  38. Dynamic Learning (1) Example: SavvySearch (Dreilinger 97) • database representative of database D: wi: indicates how well D responds to query term ti cfi: number of databases containing ti ph: penalty due to low return pr: penalty due to long response time • Initially, wi = ph = pr = 0

  39. Dynamic Learning: SavvySearch • Learning the value of wi for database D. After a k-term query containing term ti is processed: • if no document is retrieved: wi = wi - 1/k • if some returned document is clicked: wi = wi + 1/k • otherwise, no change to wi • Over time, a large positive wi indicates that database D responds well to the term ti.
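
A sketch of this update rule (the per-database weight dictionary and the argument names are illustrative):

    # Update the weights of database D after a k-term query is processed.
    def update_weights(weights, query_terms, retrieved_any, clicked_any):
        k = len(query_terms)
        for t in query_terms:
            if not retrieved_any:                 # no document was retrieved
                weights[t] = weights.get(t, 0.0) - 1.0 / k
            elif clicked_any:                     # some returned document was clicked
                weights[t] = weights.get(t, 0.0) + 1.0 / k
            # otherwise: results returned but nothing clicked -> no change

    w = {}
    update_weights(w, ["perl", "faq"], retrieved_any=True, clicked_any=True)
    print(w)   # {'perl': 0.5, 'faq': 0.5}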

  40. Dynamic Learning: SavvySearch • Compute ph and pr for each database D. • if the avg. number of hits h returned for the 5 most recent queries < Th (default: Th = 1): ph = (Th - h)^2 / Th^2 • if the avg. response time r for the 5 most recent queries > Tr (default: Tr = 15 seconds): pr = (r - Tr)^2 / (45 - Tr)^2
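
The two penalties as a sketch, using the slide's defaults (Th = 1 hit, Tr = 15 seconds, and 45 as the constant in the response-time formula); h and r denote the averages over the five most recent queries:

    def return_penalty(h, Th=1.0):
        # Penalty for returning too few hits on recent queries.
        return (Th - h) ** 2 / Th ** 2 if h < Th else 0.0

    def response_penalty(r, Tr=15.0):
        # Penalty for responding too slowly on recent queries.
        return (r - Tr) ** 2 / (45.0 - Tr) ** 2 if r > Tr else 0.0

    print(round(return_penalty(0.4), 2), round(response_penalty(30.0), 2))   # 0.36 0.25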

  41. Dynamic Learning: SavvySearch Compute the ranking score r(q, D) of database D for query q = (t1, ..., tk), where N is the number of local databases. [The ranking formula itself was not captured in this transcript.]
