World Wide Web • The largest and most widely known repository of hypertext • Hypertext : text, hyperlinks • Comprises billions of documents, authored by millions of diverse people
Brief (non-technical) history • Early keyword-based engines • Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997 • Sponsored search ranking: Goto.com (morphed into Overture.com → Yahoo!) • Your search ranking depended on how much you paid • Auction for keywords: casino was expensive!
Brief (non-technical) history • 1998+: Link-based ranking pioneered by Google • Blew away all early engines • Great user experience in search of a business model • Meanwhile Goto/Overture’s annual revenues were nearing $1 billion • Result: Google added paid-placement “ads” to the side, independent of search results • Yahoo followed suit
[Figure: a search results page, labeled with its two regions: Ads and Algorithmic results]
Web search basics • [Figure: system diagram with the components User, Search, Web spider, Indexer, The Web, Indexes, and Ad indexes]
User Needs • Need [Brod02, RL04] • Informational – want to learn about something (~40% / 65%) • Navigational – want to go to that page (~25% / 15%) • Transactional – want to do something (web-mediated) (~35% / 20%) • Access a service • Downloads • Shop • Gray areas • Find a good hub • Exploratory search “see what’s there” • Example queries: Seattle weather, Mars surface images, Canon S410, Low hemoglobin, United Airlines, Car rental Brasil
How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users’ empirical evaluation of results • Quality of pages varies widely • Relevance is not enough • Other desirable qualities (non IR!!) • Content: Trustworthy, diverse, non-duplicated, well maintained • Web readability: display correctly & fast • Precision vs. recall • On the web, recall seldom matters • What matters • Precision at 1? Precision above the fold? • Comprehensiveness – must be able to deal with obscure queries • Recall matters when the number of matches is very small
Users’ empirical evaluation of engines • Relevance and validity of results • UI – Simple, no clutter, error tolerant • Trust – Results are objective • Coverage of topics for polysemic queries • Pre/Post process tools provided • Mitigate user errors (auto spell check, search assist,…) • Explicit: Search within results, more like this, refine ...
Spam Search Engine Optimization
The trouble with sponsored search … • It costs money. What’s the alternative? • Search Engine Optimization: • “Tuning” your web page to rank highly in the algorithmic search results for select keywords • Alternative to paying for placement • Thus, intrinsically a marketing function
Simplest forms • First generation engines relied heavily on tf/idf • The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s • SEOs responded with dense repetitions of chosen terms • e.g., maui resort maui resort maui resort • Often, the repetitions would be in the same color as the background of the web page • Repeated terms got indexed by crawlers • But not visible to humans on browsers • Pure word density cannot be trusted as an IR signal
Cloaking • Serve fake content to search engine spider • [Figure: decision flow. Is this a Search Engine spider? Y: serve SPAM. N: serve Real Doc]
More spam techniques • Doorway pages • Pages optimized for a single keyword that re-direct to the real target page • Link spamming • Mutual admiration societies, hidden links, awards – more on these later • Domain flooding: numerous domains that point or re-direct to a target page
More on spam • Web search engines have policies on SEO practices they tolerate/block • http://help.yahoo.com/help/us/ysearch/index.html • http://www.google.com/intl/en/webmasters/ • Adversarial IR: the unending (technical) battle between SEO’s and web search engines • Research http://airweb.cse.lehigh.edu/
The Web document collection • Distributed content creation, linking, democratization of publishing • Content includes truth, lies, obsolete information, contradictions … • Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases)… • Scale much larger than previous text collections … • Content can be dynamically generated
Crawling basics • Web is a collection of billions of documents written in a way that enables them to cite each other using hyperlinks. • Basic principle of crawlers • Start from a given set of URLs • Progressively fetch and scan them for new URLs (outlinks), and then fetch these pages in turn, in an endless cycle • There is no guarantee that all accessible web pages will be located in this fashion
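The basic crawl loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary, the `crawl` function, and the `fetch_outlinks` callback are hypothetical names, and an in-memory link table stands in for real HTTP fetching and HTML parsing.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
    "d.example/": ["a.example/"],  # no page links here
}


def crawl(seeds, fetch_outlinks, max_pages=1000):
    """Fetch each URL once, in breadth-first order, queuing unseen outlinks."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued or fetched
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_outlinks(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```

Here `d.example/` is never visited because no page reachable from the seed links to it, which is the sense in which a crawl cannot guarantee to locate every accessible page.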
Engineering Large-Scale Crawlers(1) • Performance is important! • Main concerns • DNS caching, prefetching, and resolution • Address resolution is a significant bottleneck • A crawler may generate dozens of mapping requests per second. • Many crawlers avoid fetching too many pages from one server, which might overload it; rather, they spread their access over many servers at a time. • This lowers the locality of access to the DNS cache
Engineering Large-Scale Crawlers(2) • Eliminating already-visited URLs • Before adding a new URL to the work pool, we must check if it has already been fetched at least once • How to check quickly • Hash function • Remember that the set of visited URLs usually cannot fit in main memory • Random access is expensive! • Luckily, there is some locality of access on URLs • Relative URLs within sites • Once the crawler starts exploring a site, URLs within the site are frequently checked for a while • However, a good hash function maps the domain strings uniformly over the range, which destroys this locality. • To preserve locality of access, a two-level hash function is used.
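One way to read the two-level hash is sketched below: the host name and the path are hashed separately and the two values are concatenated, so that URLs from the same site share a key prefix and land near each other in a sorted on-disk structure. The bit widths, the use of SHA-1, and the names `url_key` and `first_visit` are assumptions made for illustration; the in-memory `set` stands in for the disk-resident store.

```python
import hashlib
from urllib.parse import urlsplit

HOST_BITS = 24  # assumed width of the host-level hash
PATH_BITS = 40  # assumed width of the path-level hash


def _hash_bits(text, bits):
    """Return the top `bits` bits of a 64-bit slice of the SHA-1 digest."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


def url_key(url):
    """Concatenate a host hash and a path hash into one integer key."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_part = _hash_bits(parts.netloc.lower(), HOST_BITS)
    path_part = _hash_bits(path, PATH_BITS)
    return (host_part << PATH_BITS) | path_part


_seen_keys = set()  # stand-in for a sorted, disk-resident key store


def first_visit(url):
    """Return True the first time a URL's key is seen, False afterwards."""
    key = url_key(url)
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True
```

Because only a fixed-width key is stored, two distinct URLs can collide and the second would be wrongly treated as visited; the widths trade storage against that risk.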
Engineering Large-Scale Crawlers(3) • Spider traps • Commercial crawlers need to protect themselves from crashing on ill-formed HTML or misleading sites. • The best policy is to prepare regular statistics about the crawl • If a site starts dominating the collection, it can be added to the guard module.
Engineering Large-Scale Crawlers(4) • Refreshing Crawled pages • Search engine’s index should be fresh! • There is no general mechanism of update notifications. • General idea • Depending on the bandwidth available, a round of crawling may run up to a few weeks. • Can we do better? • Statistics • Sort of score reflecting the probability that each page has been modified • A crawler is run at a smaller scale to monitor fast-changing sites, especially related to current news, weather.
Information Retrieval • Input: Document collection • Goal: Retrieve documents or text with information content that is relevant to user’s information need • Two aspects: 1. Processing the collection 2. Processing queries (searching)
Classic information retrieval • Ranking is a function of query term frequency within the document (tf) and across all documents (idf) • This works because of the following assumptions in classical IR: • Queries are long and well specified “What is the impact of the Falklands war on Anglo-Argentinean relations” • Documents (e.g., newspaper articles) are coherent, well authored, and are usually about one topic • The vocabulary is small and relatively well understood
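As a rough illustration of tf/idf ranking, the sketch below scores documents with raw term frequency multiplied by log(N/df). This is only one common variant, and the function name `tf_idf_scores` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()  # number of documents containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # occurrences of each term in this document
        score = 0.0
        for term in query.lower().split():
            if df[term]:
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


docs = [
    "maui resort with ocean view",
    "maui resort maui resort maui resort",
    "weather report for seattle",
]
scores = tf_idf_scores("maui resort", docs)
```

With raw term frequency, the second document outscores the first purely through repetition, which is harmless when documents are coherent and well authored but not when authors write to the ranking function.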
Web information retrieval • None of these assumptions hold: • Queries are short: 2.35 terms on average • Huge variety in documents: language, quality, duplication • Huge vocabulary: hundreds of millions of terms • Deliberate misinformation • Ranking is a function of the query terms and of the hyperlink structure
Connectivity-based ranking • Hyperlink analysis • Idea: Mine structure of the web graph • Each web page is a node • Each hyperlink is a directed edge • Ranking Returned Documents • Query dependent ranking • Query independent ranking
Query dependent ranking • Assigns a score that measures the quality and relevance of a selected set of pages to a given user query. • The basic idea is to build a query-specific graph, called a neighborhood graph, and perform hyperlink analysis on it.
Building a neighborhood graph • A start set of documents matching the query is fetched from a search engine (typically 200-1000 nodes). • The start set is augmented by its neighborhood, which is the set of documents that either hyperlink to or are hyperlinked to by documents in the start set (up to 5000 nodes). • Each document in both the start set and the neighborhood is modeled by a node. There exists an edge from node A to node B if and only if document A hyperlinks to document B. • Hyperlinks between pages on the same Web host can be omitted.
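The construction can be sketched as follows, assuming the link structure is already available as a dictionary. The names `neighborhood_graph`, `host`, and `links`, the scheme-less toy URLs, and the way the node cap is applied are all assumptions for illustration.

```python
def host(url):
    # Toy URLs carry no scheme, so the host is the text before the first "/".
    return url.split("/", 1)[0]


def neighborhood_graph(start_set, links, max_nodes=5000):
    """Build nodes and edges from a start set and a url -> outlinks mapping."""
    nodes = set(start_set)

    def add(url):
        if len(nodes) < max_nodes:  # simple cap on the graph size
            nodes.add(url)

    # Forward set: pages that start-set documents link to.
    for page in start_set:
        for target in links.get(page, []):
            add(target)
    # Back set: pages that link to some start-set document.
    start = set(start_set)
    for source, targets in links.items():
        if any(t in start for t in targets):
            add(source)
    # One edge per hyperlink between nodes, omitting same-host links.
    edges = {
        (s, t)
        for s in nodes
        for t in links.get(s, [])
        if t in nodes and host(s) != host(t)
    }
    return nodes, edges


links = {
    "hub.example/list": ["cars.example/jaguar", "zoo.example/jaguar"],
    "cars.example/jaguar": ["cars.example/dealers", "maker.example/"],
    "zoo.example/jaguar": [],
    "other.example/": ["news.example/"],
}
nodes, edges = neighborhood_graph(["cars.example/jaguar"], links)
```

In this toy run the link from `cars.example/jaguar` to `cars.example/dealers` is dropped as a same-host link, while the page itself stays in the node set.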
Neighborhood graph • Subgraph associated to each query • [Figure: the Back Set (b1, b2, ..., bm) links into the Query Results = Start Set (Result1, Result2, ..., Resultn), which links out to the Forward Set (f1, f2, ..., fs)] • An edge for each hyperlink, but no edges within the same host
Hyperlink-Induced Topic Search (HITS) • In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: • Hub pages are good lists of links on a subject. • e.g., “Bob’s list of cancer-related links.” • Authority pages occur recurrently on good hubs for the subject. • Best suited for “broad topic” queries rather than for page-finding queries. • Gets at a broader slice of common opinion.
Hubs and Authorities • Thus, a good hub page for a topic points to many authoritative pages for that topic. • A good authority page for a topic is pointed to by many good hubs for that topic. • Circular definition - will turn this into an iterative computation.
HITS [K’98] • Goal: Given a query find: • Good sources of content (authorities) • Good sources of links (hubs)
Intuition • Authority comes from in-edges. Being a good hub comes from out-edges. • Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities.
[Figure: page p with in-links from q1, q2, ..., qk (authority, A) and out-links to r1, r2, ..., rk (hub, H)]
Distilling hubs and authorities • Compute, for each page x in the base set, a hub score h(x) and an authority score a(x). • Initialize: for all x, h(x) ← 1; a(x) ← 1; • Iteratively update all h(x), a(x); • After the iterations • output pages with highest h() scores as top hubs • highest a() scores as top authorities.
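Iterative update • Repeat the following updates, for all x: • $h(x) \leftarrow \sum_{y:\, x \to y} a(y)$ (sum the authority scores of the pages x links to) • $a(x) \leftarrow \sum_{y:\, y \to x} h(y)$ (sum the hub scores of the pages linking to x)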
Scaling • To prevent the h() and a() values from getting too big, can scale down after each iteration. • Scaling factor doesn’t really matter: • we only care about the relative values of the scores.
How many iterations? • Claim: relative values of scores will converge after a few iterations: • in fact, suitably scaled, h() and a() scores settle into a steady state! • We only require the relative orders of the h() and a() scores - not their absolute values. • In practice, ~5 iterations get you close to stability.
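A minimal sketch of the iteration, under a few assumptions: the graph is given as a list of (source, target) edges, hub scores are updated before authority scores within each round, and scores are scaled to unit Euclidean length after each update. The function names and the sample graph are hypothetical.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length; leave an all-zero vector alone."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=5):
    """Return (hub, authority) score dictionaries for a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # h(x): sum of authority scores of the pages x points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = _normalize(new_hub)
        # a(x): sum of hub scores of the pages pointing to x.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        auth = _normalize(new_auth)
    return hub, auth


edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
```

Only the ordering of the scores is used, so any positive scaling factor would give the same top hubs and authorities.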
Problems with the HITS algorithm(1) • Only a relatively small part of the Web graph is considered, so adding edges to a few nodes can change the resulting hub and authority scores considerably. • It is relatively easy to manipulate these scores.
Problems with the HITS algorithm(2) • We often find that the neighborhood graph contains documents not relevant to the query topic. If these nodes are well connected, the topic drift problem arises. • The most highly ranked authorities and hubs tend not to be about the original topic. • For example, when running the algorithm on the query “jaguar and car” the computation drifted to the general topic “car” and returned the home pages of different car manufacturers as top authorities, and lists of car manufacturers as the best hubs.
Query-independent ordering • First generation: using link counts as simple measures of popularity. • Two basic suggestions: • Undirected popularity: Each page gets a score = the number of in-links plus the number of out-links (3+2=5). • Directed popularity: Score of a page = number of its in-links (3).
Query processing • First retrieve all pages meeting the text query (say venture capital). • Order these by their link popularity (either variant on the previous page).
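Both popularity variants, and the ordering step, can be sketched as below. The edge list reproduces the example counts of 3 in-links and 2 out-links for a page `p`; the function names and the tie-breaking by page name are assumptions for illustration.

```python
from collections import Counter


def popularity_scores(edges):
    """Return (undirected, directed) popularity for every page in the edge list."""
    in_links = Counter(dst for _, dst in edges)
    out_links = Counter(src for src, _ in edges)
    pages = set(in_links) | set(out_links)
    directed = {p: in_links[p] for p in pages}
    undirected = {p: in_links[p] + out_links[p] for p in pages}
    return undirected, directed


def rank_by_popularity(matching_pages, scores):
    """Order pages that matched the text query by descending popularity."""
    return sorted(matching_pages, key=lambda p: (-scores.get(p, 0), p))


edges = [("a", "p"), ("b", "p"), ("c", "p"), ("p", "d"), ("p", "e")]
undirected, directed = popularity_scores(edges)
ranking = rank_by_popularity(["d", "p", "a"], directed)
```

Text matching is assumed to have happened already; `matching_pages` is simply the list of pages that met the query.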
Spamming simple popularity Exercise: How do you spam each of the following heuristics so your page gets a high score?
Pagerank scoring • Imagine a browser doing a random walk on web pages: • Start at a random page • At each step, go out of the current page along one of the links on that page, equiprobably • “In the steady state” each page has a long-term visit rate - use this as the page’s score. • [Figure: a page with three out-links, each followed with probability 1/3]
Not quite enough • The web is full of dead-ends. • Random walk can get stuck in dead-ends. • Makes no sense to talk about long-term visit rates.
Teleporting • At a dead end, jump to a random web page. • At any non-dead end, with probability 10%, jump to a random web page. • With remaining probability (90%), go out on a random link. • The 10% is a parameter.
Result of teleporting • Now cannot get stuck locally. • There is a long-term rate at which any page is visited (not obvious, will show this). • How do we compute this visit rate?
Markov chains • A Markov chain consists of $n$ states, plus an $n \times n$ transition probability matrix $P$. • At each step, we are in exactly one of the states. • For $1 \le i, j \le n$, the matrix entry $P_{ij}$ tells us the probability of $j$ being the next state, given we are currently in state $i$. • $P_{ii} > 0$ is OK. • [Figure: states i and j joined by an arrow labeled $P_{ij}$]
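To connect the teleporting random walk with this definition, the sketch below builds a transition matrix from a small link graph and then estimates long-term visit rates by repeatedly applying it to a starting distribution. The function names, the integer page ids, the sample graph, and the fixed step count are assumptions for illustration; this is not presented as the only way to compute the rates.

```python
def transition_matrix(out_links, n, teleport=0.10):
    """Row i holds the probabilities of moving from page i to each page j."""
    P = []
    for i in range(n):
        targets = out_links.get(i, [])
        if not targets:
            # Dead end: jump to a page chosen uniformly at random.
            row = [1.0 / n] * n
        else:
            # Teleport with probability `teleport`, else follow a random link.
            row = [teleport / n] * n
            for j in targets:
                row[j] += (1.0 - teleport) / len(targets)
        P.append(row)
    return P


def visit_rates(P, steps=100):
    """Apply P repeatedly to a uniform start distribution."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x


# Pages 0..3; page 3 is a dead end that no other page links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: []}
P = transition_matrix(out_links, n=4)
rates = visit_rates(P)
```

Every row of `P` sums to 1, and page 3, reachable only by teleporting, ends up with a small but nonzero visit rate.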