
Searching the Web

Searching the Web. Mark Levene (Follow the links to learn more!). Mechanics of a Typical Search. Query submitted to Google. Mechanics of a Typical Search. Google results for the query . Search Engines as Information Gatekeepers.


Presentation Transcript


  1. Searching the Web • Mark Levene (Follow the links to learn more!)

  2. Mechanics of a Typical Search • Query submitted to Google

  3. Mechanics of a Typical Search • Google results for the query

  4. Search Engines as Information Gatekeepers • Search engines are becoming the primary entry point for discovering web pages. • Ranking of web pages influences which pages users will view. • Exclusion of a site from search engines will cut off the site from its intended audience. • The privacy policy of a search engine is important.

  5. Search Engine Wars • The battle for domination of the web search space is heating up! • The competition is good news for users! • The way in which advertising is combined with search results is crucial! • There are serious implications if one of the search engines manages to dominate the space!

  6. Google • The verb “google” has become synonymous with searching for information on the web. • Has raised the bar on search quality. • Has been the most popular search engine in the last few years. • Had a very successful IPO in August 2004. • Is innovative and dynamic.

  7. Yahoo! • Synonymous with the dot-com boom, probably the best known brand on the web. • Started off as a web directory service. • Has very strong advertising and e-commerce partnerships. • Acquired leading search engine technology in 2003.

  8. MSN Search • Synonymous with PC software. • Remember its victory in the browser wars with Netscape. • Developed its own search engine technology only recently, officially launched in Feb. 2005. • May link web search into its next version of Windows.

  9. Others • Ask Jeeves: specialises in natural language question answering; search driven by Teoma. • Looksmart: has its own directory service; search driven by Wisenut. • …

  10. Statistics from search engine logs

  11. Experiment with search engine query syntax • Default is AND, e.g. “computer chess” is normally interpreted as “computer AND chess”, i.e. both keywords must be present in all hits. • “+chess” in a query means the user insists that “chess” be present in all hits. • “computer OR chess” means that at least one of the keywords must be present in each hit. • "computer chess" (typed with the double quotes) means that the phrase “computer chess” must be present in all hits.
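
The three query forms on this slide can be sketched over a toy collection. This is only an illustration: the documents, the whitespace tokenizer and the function names are assumptions, and real engines apply far more elaborate parsing.

```python
def tokenize(text):
    """Lower-case and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def match_and(doc_tokens, terms):
    """AND (the default): every term must occur in the document."""
    return all(t in doc_tokens for t in terms)

def match_or(doc_tokens, terms):
    """OR: at least one term must occur in the document."""
    return any(t in doc_tokens for t in terms)

def match_phrase(doc_tokens, phrase_tokens):
    """Phrase: the terms must occur next to each other, in order."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

# Hypothetical toy documents.
docs = {
    "d1": "computer chess programs play strong chess",
    "d2": "chess openings for beginners",
    "d3": "the computer plays chess",
}
tokens = {d: tokenize(t) for d, t in docs.items()}
terms = ["computer", "chess"]

and_hits = sorted(d for d in tokens if match_and(tokens[d], terms))
or_hits = sorted(d for d in tokens if match_or(tokens[d], terms))
phrase_hits = sorted(d for d in tokens if match_phrase(tokens[d], terms))
```

Here d3 satisfies the AND query but not the phrase query, because its two keywords are not adjacent.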

  12. The most popular search keywords

  13. Search Engine Architecture

  14. Crawler Algorithm • A crawler is a program that traverses web pages, downloads them for indexing and follows (or harvests) the hyperlinks on the downloaded pages. • A crawler will typically start from a multitude of web pages and aims to cover as much of the indexable web as possible. • The standard algorithm uses a breadth-first strategy. • Focused crawlers use a best-first strategy.
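
A minimal sketch of the breadth-first strategy, assuming the web is given as an in-memory link graph rather than fetched over HTTP. The names `crawl_bfs`, `get_links` and `toy_web` are hypothetical.

```python
from collections import deque

def crawl_bfs(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    get_links(url) is assumed to return the hyperlinks found on that page.
    """
    visited = []              # pages "downloaded" so far, in crawl order
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    frontier = deque(seeds)   # FIFO queue gives breadth-first order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)   # a real crawler would fetch and index here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical four-page web: page -> outgoing links.
toy_web = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}
order = crawl_bfs(["a"], lambda u: toy_web.get(u, []))
```

A best-first (focused) crawler would follow the same loop but replace the FIFO queue with a priority queue ordered by some estimate of each page's relevance.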

  15. Search Index - Inverted File • Also store position of word in web page and info. on HTML structure.
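
An inverted file that also records word positions might look like the following sketch. The nested-dictionary layout and the name `build_inverted_index` are assumptions made for illustration; HTML structure information is left out.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions of the word in that doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

# Hypothetical two-document collection.
docs = {1: "computer chess", 2: "chess game chess"}
index = build_inverted_index(docs)
chess_postings = dict(index["chess"])
```

Storing positions is what makes phrase queries possible: two words form a phrase in a document when their positions differ by one.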

  16. The query engine • The interface between the search index, the user and the web. • Algorithmic details of commercial search engines kept as trade secrets. • First step is retrieval of potential results from the index. • Second step is the ranking of the results based on their “relevance” to the query.

  17. Vector Space Model – Content Relevance

  18. Term Frequency (TF) • Count the number of occurrences of each term. • Bag of words approach. • Ignore stopwords such as is, a, of, the, … • Stemming - computer is replaced by comput, as are its variants: computers, computing, computation and computed. • Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag. • Example bag: is, a, chess, game, game, computer, chess, programming, chess.
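
The counting and normalisation steps can be sketched as follows, using the toy word bag from this slide. The stopword list is limited to the four words named on the slide, stemming is omitted, and the function name and `normalise` parameter are assumptions.

```python
from collections import Counter

# Only the stopwords named on the slide; real lists are much longer.
STOPWORDS = {"is", "a", "of", "the"}

def term_frequencies(text, normalise="max"):
    """Return normalised term frequencies for a bag of words.

    normalise="max" divides by the count of the most frequent word,
    normalise="length" divides by the number of non-stopword tokens.
    """
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    denom = max(counts.values()) if normalise == "max" else len(words)
    return {w: c / denom for w, c in counts.items()}

tf = term_frequencies("is a chess game game computer chess programming chess")
tf_by_length = term_frequencies(
    "is a chess game game computer chess programming chess", normalise="length")
```

With max normalisation the most frequent term ("chess", three occurrences) gets a TF of 1 and every other term is scaled relative to it.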

  19. Inverse Document Frequency (IDF) • N is number of documents in the corpus. • ni is number of docs in which word i appears. • Log dampens the effect of IDF. • IDF is also number of bits to represent the term.
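
Assuming the standard definition, which matches the symbols listed on this slide, the IDF of word i can be written as:

```latex
\mathrm{idf}_i = \log \frac{N}{n_i}
```

With a base-2 logarithm, a word that appears in 1 out of every 1024 documents has an IDF of 10, which is the sense in which IDF counts bits.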

  20. Ranking with TF-IDF • j – refers to document j • i – refers to word (or term) i in doc j • q – is the query, which is a sequence of terms • scorej – is the score for document j given q • Rank results according to the scoring function.
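
One common form of the scoring function sums TF times IDF over the query terms. The sketch below assumes that form, max-normalised TF and a natural logarithm; the function name and the toy documents are hypothetical, and commercial engines do not publish their exact formulas.

```python
import math
from collections import Counter

def rank_tfidf(docs, query):
    """Score each document j as the sum over query terms i of tf_ij * idf_i."""
    n_docs = len(docs)
    bags = {j: Counter(text.lower().split()) for j, text in docs.items()}

    def idf(term):
        n_i = sum(1 for bag in bags.values() if term in bag)
        return math.log(n_docs / n_i) if n_i else 0.0

    scores = {}
    for j, bag in bags.items():
        max_count = max(bag.values(), default=1)   # max-normalised TF
        scores[j] = sum((bag[t] / max_count) * idf(t)
                        for t in query.lower().split())
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical three-document corpus.
docs = {
    "d1": "computer chess computer chess programming",
    "d2": "chess game",
    "d3": "cooking recipes",
}
ranking = rank_tfidf(docs, "computer chess")
ranked_ids = [j for j, _ in ranking]
```

The rarer term "computer" contributes more to the score of d1 than "chess" does, because it appears in fewer documents and so has a higher IDF.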

  21. Content Relevance • Phrase matching. • Synonyms. • URL analysis. • Date last updated. • Spell checking. • Home page detection.

  22. Link Text (Anchor Text) • Include link text for a link pointing to a web page, say P, as part of the content of P. • Link text is very useful in finding home pages. • Link text behaves like user queries. • It acts as a short summary. • It often matches query terms.

  23. HTML Weighting • Normal retrieval = (111101) ranking with TF-IDF • (181882) – 39.6% improvement. • (181782) – 48.3% improvement – C2, C4 and C5. • (181582) - 43.5% improvement • Meta tag text is mostly ignored by search engines

  24. Factor in Link Metrics • Multiply by PageRank of document (web page). • We do not know exactly how Google factors in the PR; it may be that log(PR) is used.

  25. Popularity Based Metrics • Factor in users’ opinions as represented in the query logs. • Document space modification adjusts the weights of keywords in popular pages. • Clickthrough data can also be taken into account to improve the ranking of search engine query results.

  26. Precision and Recall • Precision is Overlap/Retrieved (first results page retrieved is most important). • Recall is Overlap/Relevant (for web search recall is related to index coverage).
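
Both ratios can be computed directly from the set of retrieved results and the set of relevant documents. The function name and the document identifiers below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = overlap / retrieved, recall = overlap / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = retrieved & relevant
    precision = len(overlap) / len(retrieved) if retrieved else 0.0
    recall = len(overlap) / len(relevant) if relevant else 0.0
    return precision, recall

# Four results returned, three documents actually relevant, two in common.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
```

In this example half of the returned results are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).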

  27. Typical Recall-Precision Curve • Top-n precision – the proportion of relevant results among the top n ranked results. • Measure top-n precision at fixed recall points, with n ranging from 0% to 100% of the ranked results.
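
Top-n precision can be sketched as a function of the cutoff n. The ranked list and relevance judgements here are made up for illustration.

```python
def precision_at_n(ranked, relevant, n):
    """Proportion of the top n ranked results that are relevant."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)

# Hypothetical ranked result list and relevance judgements.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}

# Precision at every cutoff from 1 to the full list.
curve = [(n, precision_at_n(ranked, relevant, n))
         for n in range(1, len(ranked) + 1)]
```

Plotting these values against the recall reached at each cutoff gives a recall-precision curve of the kind the slide title refers to.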

  28. Probabilistic IR • Basic question: What is the probability that a document, D, is relevant to a query Q? • Probability ranking assumption: If docs retrieved are ordered by decreasing probability of relevance then the overall effectiveness of the system is the best obtainable given the input documents.

  29. Bayes Formulation of Relevance • R – relevance of D with respect to a query Q • D – document (web page) • P(R|D) – probability that a page is relevant given its description (or representation) • NR – D not relevant with respect to Q
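
In the notation of this slide, Bayes' rule relates the probability of relevance given the document to the probability of the document given relevance:

```latex
P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)},
\qquad
P(NR \mid D) = \frac{P(D \mid NR)\, P(NR)}{P(D)}
```

Since P(D) appears in both expressions, it cancels when the two classes are compared.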

  30. Naïve Bayes Independence Assumption • n – the number of words in D • wi – the word in position i in D • Also assume that the probability of a word is independent of its position in the document.
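
Under the independence assumption, the probability of a document given a class factorises into a product over its words. Written with the symbols defined on this slide:

```latex
P(D \mid R) = \prod_{i=1}^{n} P(w_i \mid R),
\qquad
P(D \mid NR) = \prod_{i=1}^{n} P(w_i \mid NR)
```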

  31. Computing the probabilities • ri – number of times wi occurs in relevant docs • RW – number of words in relevant docs (counting duplicate words multiple times, since docs are bags) • nri – number of times wi occurs in non-relevant docs • NRW – number of words in non-relevant docs • P(R) is estimated from the number of docs that are relevant with respect to Q. • P(NR) is estimated from the number of docs that are not relevant with respect to Q. • c – is a smoothing constant greater than or equal to one

  32. Is a Document Relevant? • Assume we have a set of training examples of relevant and non-relevant documents to compute P(wi|R) and P(wi|NR) for words wi. • The user could mark docs as R or NR. • Choose the class (R or NR) which has higher probability.
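
The training and classification steps can be sketched as a small naive Bayes classifier. The smoothing used here, (count + c) / (total words + c × vocabulary size), is an assumption and may differ from the slide's own formula; the function name and the training documents are hypothetical. Log probabilities are summed to avoid numerical underflow.

```python
import math
from collections import Counter

def train_naive_bayes(relevant_docs, non_relevant_docs, c=1.0):
    """Return a classify(doc) function that labels a doc "R" or "NR".

    Both training lists are assumed to be non-empty.
    """
    r = Counter(w for d in relevant_docs for w in d.lower().split())
    nr = Counter(w for d in non_relevant_docs for w in d.lower().split())
    vocab_size = len(set(r) | set(nr))
    rw, nrw = sum(r.values()), sum(nr.values())   # RW and NRW on the slide
    total_docs = len(relevant_docs) + len(non_relevant_docs)
    log_p_r = math.log(len(relevant_docs) / total_docs)
    log_p_nr = math.log(len(non_relevant_docs) / total_docs)

    def classify(doc):
        score_r, score_nr = log_p_r, log_p_nr
        for w in doc.lower().split():
            # Assumed smoothing: c keeps unseen words from giving probability 0.
            score_r += math.log((r[w] + c) / (rw + c * vocab_size))
            score_nr += math.log((nr[w] + c) / (nrw + c * vocab_size))
        return "R" if score_r > score_nr else "NR"

    return classify

# Hypothetical training examples, as if marked R or NR by a user.
classify = train_naive_bayes(
    ["computer chess programming", "chess engine search"],
    ["cooking pasta recipes", "garden flower tips"],
)
label_chess = classify("chess programming")
label_cooking = classify("pasta recipes")
```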

  33. Ranking Documents • D1 is more relevant than D2 given Q, where • nD1 - the number of words in D1 • nD2 – the number of words in D2 • If we are ranking, then P(wi|R) can be approximated on the basis of a document with a weighting function such as TF-IDF.

  34. Other types of Search Engine • Directory – e.g. Yahoo! (Open Directory) • MetaSearch – e.g. Dogpile (Mamma) • Clustering – e.g. Clusty • Question Answering – e.g. Ask Jeeves, WolframAlpha, True Knowledge, as well as Google, Yahoo and Bing • Visual – e.g. Quintura • Collaborative – e.g. Omgili, Sproose, AfterVote • Human Input – e.g. ChaCha • Social Tagging – e.g. Blekko, MrTaggy

  35. Directions in Search • Mobile search – e.g. Google Mobile • Local search – e.g. Google Local • Video search – e.g. YouTube • Image search – e.g. Picsearch • Audio search – e.g. Yahoo Audio • Blog search – e.g. Technorati • Social bookmarking – e.g. Delicious • Some new ideas – e.g. Cuil

  36. Paid Inclusion and Paid Placement • Paid inclusion – payment to speed up inclusion in the search index. • Pay-Per-Click (PPC) or Cost-Per-Click (CPC) – payment for being advertised on the search engine’s sponsored results list. • The sponsored list should be separated from the organic list. • PPC is a major revenue source for search engines. • Click fraud is a problem!

  37. Behavioural Targeting • Contextual targeting is a weaker form based on a single user session. • Personalised advertising in order to increase the effectiveness of advertising. • Data collected from individual users, normally through cookies.

  38. Pay-Per-Action (PPA) • Charge the advertiser only when an action takes place such as a purchase, a download or any other trackable action. • Ad network will require the advertiser to place a script in the web page triggering the action. • Some level of trust is needed.
