
Information Retrieval



  1. Information Retrieval CSE 8337 (Part B) Spring 2009 Some material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/ Data Mining Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, http://informationretrieval.org

  2. CSE 8337 Outline • Introduction • Simple Text Processing • Boolean Queries • Web Searching/Crawling • Indexes • Vector Space Model • Matching • Evaluation

  3. Web Searching TOC • Web Overview • Searching • Ranking • Crawling

  4. Web Overview • Size • >11.5 billion pages (2005) • Grows at more than 1 million pages a day • Google indexes over 3 billion documents • Diverse types of data • http://www.google.com/support/websearch/bin/topic.py?topic=8996

  5. Web Data • Web pages • Intra-page structures • Inter-page structures • Usage data • Supplemental data • Profiles • Registration information • Cookies

  6. Zipf’s Law Applied to Web • Distribution of frequency of occurrence of words in text. • “Frequency of the i-th most frequent word is 1/i^q times that of the most frequent word” • http://www.nslij-genetics.org/wli/zipf/
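
A quick way to sanity-check Zipf’s law on any text is to compare freq × rank across the most frequent words; with exponent q ≈ 1 the product should stay roughly constant. A minimal sketch in Python (the function name and input are mine, not from the slides):

    from collections import Counter

    def zipf_table(text, top=10):
        words = text.lower().split()
        ranked = Counter(words).most_common(top)
        for rank, (word, freq) in enumerate(ranked, start=1):
            # Under Zipf's law with exponent q ~ 1, freq * rank should be
            # roughly constant across ranks.
            print(f"{rank:2d}  {word:<12}  freq={freq:<6d}  freq*rank={freq * rank}")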

  7. Heaps’ Law Applied to Web • Measures size of vocabulary in a text of size n: O(n^b) • b normally less than 1
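
As an illustration of the sublinear growth Heaps’ law predicts, here is a sketch with typical constants K ≈ 44 and b ≈ 0.49 (illustrative values, not from the slides):

    def heaps_vocabulary(n, K=44.0, b=0.49):
        """Predicted number of distinct words in a text of n total words."""
        return K * n ** b

    # Doubling the text grows the vocabulary by only ~2**0.49, about 1.4x:
    print(round(heaps_vocabulary(1_000_000)))   # ~38,000 distinct words
    print(round(heaps_vocabulary(2_000_000)))   # ~54,000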

  8. Web search basics [diagram: the user sends queries to the search engine, which answers them from its Web indexes and ad indexes; the indexer builds those indexes from pages the Web spider fetches from the Web]

  9. How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

  10. Users’ empirical evaluation of results • Quality of pages varies widely • Relevance is not enough • Other desirable qualities (non IR!!) • Content: Trustworthy, diverse, non-duplicated, well maintained • Web readability: display correctly & fast • No annoyances: pop-ups, etc • Precision vs. recall • On the web, recall seldom matters • What matters • Precision at 1? Precision above the fold? • Comprehensiveness – must be able to deal with obscure queries • Recall matters when the number of matches is very small • User perceptions may be unscientific, but are significant over a large aggregate

  11. Users’ empirical evaluation of engines • Relevance and validity of results • UI – Simple, no clutter, error tolerant • Trust – Results are objective • Coverage of topics for polysemic queries • Pre/Post process tools provided • Mitigate user errors (auto spell check, search assist,…) • Explicit: Search within results, more like this, refine ... • Anticipative: related searches • Deal with idiosyncrasies • Web specific vocabulary • Impact on stemming, spell-check, etc • Web addresses typed in the search box • …

  12. Simplest forms • First generation engines relied heavily on tf/idf • The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s • SEOs (Search Engine Optimizers) responded with dense repetitions of chosen terms • e.g., maui resort maui resort maui resort • Often, the repetitions would be in the same color as the background of the web page • Repeated terms got indexed by crawlers • But not visible to humans on browsers • → Pure word density cannot be trusted as an IR signal

  13. Term frequency tf • The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d. • Raw term frequency is not what we want: • A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term. • But not 10 times more relevant. • Relevance does not increase proportionally with term frequency.

  14. Log-frequency weighting • The log frequency weight of term t in d is w(t,d) = 1 + log10 tf(t,d) if tf(t,d) > 0, and 0 otherwise • 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. • Score for a document-query pair: sum over terms t in both q and d: • score(q,d) = Σ t∈q∩d (1 + log10 tf(t,d)) • The score is 0 if none of the query terms is present in the document.
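
A minimal sketch of this weighting in Python (function names are mine); it reproduces the slide’s mapping 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4:

    import math

    def log_tf_weight(tf):
        # w(t,d) = 1 + log10(tf) if tf > 0, else 0
        return 1 + math.log10(tf) if tf > 0 else 0

    def score(query_terms, doc_tf):
        # doc_tf: dict mapping term -> raw count of that term in the document
        return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

    print([round(log_tf_weight(tf), 1) for tf in (0, 1, 2, 10, 1000)])
    # [0, 1.0, 1.3, 2.0, 4.0]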

  15. Document frequency • Rare terms are more informative than frequent terms • Recall stop words • Consider a term in the query that is rare in the collection (e.g., arachnocentric) • A document containing this term is very likely to be relevant to the query arachnocentric • → We want a high weight for rare terms like arachnocentric.

  16. Document frequency, continued • Consider a query term that is frequent in the collection (e.g., high, increase, line) • For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms. • We will use document frequency (df) to capture this in the score. • df(t) (≤ N, the number of documents in the collection) is the number of documents that contain the term

  17. idf weight • df(t) is the document frequency of t: the number of documents that contain t • df(t) is an inverse measure of the informativeness of t • We define the idf (inverse document frequency) of t by idf(t) = log10 (N/df(t)) • We use log N/df(t) instead of N/df(t) to “dampen” the effect of idf. It will turn out that the base of the log is immaterial.

  18. idf example, suppose N = 1 million • There is one idf value for each term t in a collection.
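
A sketch of the computation idf(t) = log10(N/df(t)) with N = 1 million; the terms and df values below are illustrative stand-ins, not the slide’s actual examples:

    import math

    N = 1_000_000
    dfs = {"ultra-rare": 1, "rare": 100, "common": 10_000, "ubiquitous": 1_000_000}
    for term, df in dfs.items():
        idf = math.log10(N / df)   # one idf value per term, for the whole collection
        print(f"{term:<12}  df={df:>9,}  idf={idf:.0f}")
    # idf runs from 6 (df = 1) down to 0 (a term occurring in every document)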

  19. Collection vs. Document frequency • The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences. • Example: two words with nearly equal collection frequency, one concentrated in a few documents, the other spread across many • Which word is a better search term (and should get a higher weight)? The one with the lower document frequency: its occurrences are concentrated, so it discriminates better between documents.

  20. tf-idf weighting • The tf-idf weight of a term is the product of its tf weight and its idf weight: w(t,d) = (1 + log10 tf(t,d)) × log10 (N/df(t)) • Best known weighting scheme in information retrieval • Note: the “-” in tf-idf is a hyphen, not a minus sign! • Alternative names: tf.idf, tf x idf, tfidf, tf/idf • Increases with the number of occurrences within a document • Increases with the rarity of the term in the collection
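
A minimal sketch combining the two weights into a query-document score (names are mine; df and N would come from the index):

    import math

    def tf_idf(tf, df, N):
        if tf == 0 or df == 0:
            return 0.0
        # w(t,d) = (1 + log10 tf) * log10(N / df)
        return (1 + math.log10(tf)) * math.log10(N / df)

    def score(query_terms, doc_tf, df, N):
        # Sum tf-idf over the query terms present in the document
        return sum(tf_idf(doc_tf.get(t, 0), df.get(t, 0), N) for t in query_terms)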

  21. Search engine optimization (Spam) • Motives • Commercial, political, religious, lobbies • Promotion funded by advertising budget • Operators • Search Engine Optimizers for lobbies, companies • Web masters • Hosting services • Forums • E.g., Web master world (www.webmasterworld.com) • Search engine specific tricks • Discussions about academic papers 

  22. Cloaking • Serve fake content to search engine spider • DNS cloaking: switch IP address, impersonate • How do you identify a spider? [diagram: if the request comes from a search engine spider (Y), serve the SPAM page; if not (N), serve the real document]

  23. More spam techniques • Doorway pages • Pages optimized for a single keyword that re-direct to the real target page • Link spamming • Mutual admiration societies, hidden links, awards – more on these later • Domain flooding: numerous domains that point or re-direct to a target page • Robots • Fake query stream – rank checking programs

  24. The war against spam • Quality signals – prefer authoritative pages based on: • Votes from authors (linkage signals) • Votes from users (usage signals) • Policing of URL submissions • Anti-robot test • Limits on meta-keywords • Robust link analysis • Ignore statistically implausible linkage (or text) • Use link analysis to detect spammers (guilt by association) • Spam recognition by machine learning • Training set based on known spam • Family-friendly filters • Linguistic analysis, general classification techniques, etc. • For images: flesh-tone detectors, source text analysis, etc. • Editorial intervention • Blacklists • Top queries audited • Complaints addressed • Suspect pattern detection

  25. More on spam • Web search engines have policies on SEO practices they tolerate/block • http://help.yahoo.com/help/us/ysearch/index.html • http://www.google.com/intl/en/webmasters/ • Adversarial IR: the unending (technical) battle between SEOs and web search engines • Research http://airweb.cse.lehigh.edu

  26. Ranking • Order documents based on relevance to query (similarity measure) • Ranking has to be performed without accessing the text, just the index • Details of commercial ranking algorithms are “top secret”; it is almost impossible to measure recall, as the number of relevant pages can be quite large even for simple queries

  27. Ranking • Some of the new ranking algorithms also use hyperlink information • An important difference between the Web and traditional IR collections: the number of hyperlinks that point to a page provides a measure of its popularity and quality. • Links in common between pages often indicate a relationship between those pages.

  28. Ranking • Three examples of ranking techniques based on link analysis: • WebQuery • HITS (Hub/Authority pages) • PageRank

  29. WebQuery • WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is • http://www.cgl.uwaterloo.ca/Projects/Vanish/webquery-1.html

  30. HITS • Kleinberg’s ranking scheme depends on the query and considers the set of pages S that point to, or are pointed to by, pages in the answer • Pages in S that have many links pointing to them are called authorities • Pages that have many outgoing links are called hubs • Better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities
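
A rough sketch of this mutual-reinforcement iteration (my own simplification of Kleinberg’s algorithm, assuming `links` maps each page in S to the set of pages it points to within S):

    def hits(links, iters=50):
        pages = list(links)
        auth = {p: 1.0 for p in pages}
        hub = {p: 1.0 for p in pages}
        for _ in range(iters):
            # authority score: sum of hub scores of the pages that link to p
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            # hub score: sum of authority scores of the pages p links to
            hub = {p: sum(auth[q] for q in links[p] if q in auth) for p in pages}
            # normalize so scores stay bounded
            na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / na for p, v in auth.items()}
            hub = {p: v / nh for p, v in hub.items()}
        return auth, hub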

  31. Ranking [figure]

  32. PageRank • Used in Google • PageRank simulates a user navigating randomly in the Web, who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 − q • This process can be modeled with a Markov chain, from which the stationary probability of being at each page can be computed • Let C(a) be the number of outgoing links of page a and suppose that page a is pointed to by pages p1 to pn

  33. PageRank (cont’d) • PR(a) = c (PR(p1)/C(p1) + … + PR(pn)/C(pn)) • PR(pi): PageRank of page pi, which points to target page a • C(pi): number of links going out of page pi, as defined on the previous slide
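
A power-iteration sketch of the process described on slides 32–33, including the random-jump term with probability q (q = 0.15 is an illustrative value; `out` maps each page to its outgoing links, so C(a) = len(out[a]), and every page is assumed to have at least one outgoing link):

    def pagerank(out, q=0.15, iters=50):
        pages = list(out)
        N = len(pages)
        pr = {p: 1.0 / N for p in pages}
        for _ in range(iters):
            new = {}
            for p in pages:
                # each page pi pointing to p contributes PR(pi) / C(pi)
                incoming = sum(pr[s] / len(out[s]) for s in pages if p in out[s])
                # jump to a random page with probability q, follow a link with 1 - q
                new[p] = q / N + (1 - q) * incoming
            pr = new
        return pr

    print(pagerank({"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}))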

  34. Conclusion • Nowadays search engines basically use Boolean or vector models and their variations • Link analysis techniques seem to be the “next generation” of search engines • Indexes: compression and distributed architecture are key

  35. Crawlers • A robot (spider) traverses the hypertext structure of the Web • Collects information from visited pages • Used to construct indexes for search engines • Traditional Crawler – visits entire Web (?) and replaces index • Periodic Crawler – visits portions of the Web and updates subset of index • Incremental Crawler – selectively searches the Web and incrementally modifies index • Focused Crawler – visits pages related to a particular subject

  36. Crawling the Web • The order in which the URLs are traversed is important • Using a breadth-first policy, we first look at all the pages linked from the current page, and so on. This matches well with Web sites structured by related topics; on the other hand, the coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests • In the depth-first case, we follow the first link of a page and do the same on that page, until we cannot go deeper, returning recursively • Good ordering schemes can make a difference if better pages are crawled first (e.g., by PageRank)

  37. Crawling the Web • Because robots can overwhelm a server with rapid requests and can consume significant Internet bandwidth, a set of guidelines for robot behavior has been developed • Crawlers can also have problems with HTML pages that use frames or image maps; in addition, dynamically generated pages and password-protected pages cannot be indexed

  38. Focused Crawler • Only visit links from a page if that page is determined to be relevant • Components: • Classifier, which assigns a relevance score to each page based on the crawl topic • Distiller, to identify hub pages • Crawler visits pages based on classifier and distiller scores • Classifier also determines how useful outgoing links are • Hub pages contain links to many relevant pages; they must be visited even if they do not have a high relevance score
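
A rough sketch of this control loop (the `fetch`, `classify`, `is_hub`, and `parse_links` callables stand in for the fetcher, classifier, distiller, and parser; all names and the 0.5 threshold are my assumptions, not components specified on the slide):

    import heapq

    def focused_crawl(seeds, fetch, classify, is_hub, parse_links, budget=100):
        frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
        heapq.heapify(frontier)
        seen = set(seeds)
        while frontier and budget > 0:
            _, url = heapq.heappop(frontier)
            page = fetch(url)
            budget -= 1
            relevance = classify(page)   # relevance to the crawl topic, in [0, 1]
            # Follow outlinks if the page is relevant, or if the distiller marks
            # it as a hub (hubs are expanded even with a low relevance score)
            if relevance > 0.5 or is_hub(page):
                for link in parse_links(page):
                    if link not in seen:
                        seen.add(link)
                        heapq.heappush(frontier, (-relevance, link))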

  39. Focused Crawler [figure]

  40. Basic crawler operation • Begin with known “seed” pages • Fetch and parse them • Extract URLs they point to • Place the extracted URLs on a queue • Fetch each URL on the queue and repeat
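
A minimal sketch of this loop in Python, using a FIFO queue (which gives the breadth-first ordering of slide 36); a real crawler would add politeness delays, robots.txt checks, and a proper HTML parser instead of a regex:

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)            # seed URLs go on the queue
        seen = set(seeds)
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()       # fetch each URL on the queue
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                   # fetch failed; skip this URL
            fetched += 1
            # extract the URLs the page points to
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)  # resolve relative URLs
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)  # place extracted URLs on the queue
        return seen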

  41. Crawling picture [diagram: seed pages feed the URL frontier; the crawler fetches, crawls, and parses URLs from the frontier, and newly discovered links lead into the unseen Web]

  42. Simple picture – complications • Web crawling isn’t feasible with one machine • All of the above steps distributed • Even non-malicious pages pose challenges • Latency/bandwidth to remote servers vary • Webmasters’ stipulations • How “deep” should you crawl a site’s URL hierarchy? • Site mirrors and duplicate pages • Malicious pages • Spam pages • Spider traps • Politeness – don’t hit a server too often

  43. What any crawler must do • Be Polite: Respect implicit and explicit politeness considerations • Only crawl allowed pages • Respect robots.txt (more on this shortly) • Be Robust: Be immune to spider traps and other malicious behavior from web servers

  44. What any crawler should do • Be capable of distributed operation: designed to run on multiple distributed machines • Be scalable: designed to increase the crawl rate by adding more machines • Performance/efficiency: permit full use of available processing and network resources

  45. What any crawler should do • Fetch pages of “higher quality” first • Continuous operation: Continue fetching fresh copies of a previously fetched page • Extensible: Adapt to new data formats, protocols

  46. Updated crawling picture [diagram: multiple crawling threads fetch URLs from the URL frontier; crawled and parsed URLs feed new links back into the frontier; seed pages start the process, and the unseen Web shrinks as crawling proceeds]

  47. URL frontier • Can include multiple pages from the same host • Must avoid trying to fetch them all at the same time • Must try to keep all crawling threads busy

  48. Explicit and implicit politeness • Explicit politeness: specifications from webmasters on what portions of site can be crawled • robots.txt • Implicit politeness: even with no specification, avoid hitting any site too often

  49. Robots.txt • Protocol for giving spiders (“robots”) limited access to a website, originally from 1994 • www.robotstxt.org/wc/norobots.html • Website announces its request on what can(not) be crawled • For a server, create a file /robots.txt at the document root • This file specifies access restrictions

  50. Robots.txt example • No robot should visit any URL starting with "/yoursite/temp/", except the robot called “searchengine”:

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:
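
A crawler can honor this file with Python’s standard-library parser; a sketch assuming the file above is served from an illustrative host:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()   # fetches and parses the robots.txt above
    print(rp.can_fetch("*", "http://www.example.com/yoursite/temp/page.html"))
    # False: the catch-all rule disallows /yoursite/temp/
    print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))
    # True: the empty Disallow for "searchengine" permits everything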
