
Searching the Web


Presentation Transcript


  1. Searching the Web Representation and Management of Data on the Internet

  2. Goal • To better understand Web search engines: • Fundamental concepts • Main challenges • Design issues • Implementation techniques and algorithms

  3. What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • Is it really that simple? • Is creating a search engine much more difficult than ex1 + ex2?

  4. Motivation • The web is • Used by millions • Contains lots of information • Link based • Incoherent • Changes rapidly • Distributed • Traditional information retrieval was built with the exact opposite in mind

  5. The Web’s Characteristics • Size • Over a billion pages available (Google is a spelling of googol = 10^100) • 5-10K per page => tens of terabytes • Size doubles every 2 years • Change • 23% change daily • About half of the pages do not exist after 10 days • Bowtie structure

  6. Bowtie Structure • Core: strongly connected component (28%) • Pages that reach the core (22%) • Pages reachable from the core (22%)

  7. Search Engine Components • User Interface • Crawler • Indexer • Ranker

  8. HTML Forms on One Foot

  9. HTML Forms • Search engines usually use an HTML form. How are forms defined?

  10. HTML Behind the Form • Defines an HTML form that: • uses the HTTP method GET (you could use POST instead) • will send form info to http://search.dbi.com/search <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>

  11. HTML Behind the Form • Defines a text box • name=“query” defines the parameter “query” which will get the value of this text box when the data is submitted <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>

  12. HTML Behind the Form • The submit button, labeled with “Search” • When this button is pressed, an HTTP request of the following form will be generated: • GET http://search.dbi.com/search?query=encode(text_box) HTTP/1.1 • If additional parameters were defined, they would be added to the URL with the & sign dividing parameters <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>

  13. Example • Typing “bananas apples” in the text box produces the request URL http://search.dbi.com/search?query=bananas+apples
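A minimal Python sketch of the same encoding, using the standard urllib.parse module; the parameter name query and the host search.dbi.com are taken from the form above.

from urllib.parse import urlencode

# Encode the form data the way a GET submission would
params = {"query": "bananas apples"}
print("http://search.dbi.com/search?" + urlencode(params))
# http://search.dbi.com/search?query=bananas+apples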

  14. Post Versus Get • Suppose we had the line <form method="post" action="http://search.dbi.com/search"> • Then, pressing submit would cause a POST HTTP request to be sent. • The values of the parameters would be sent in the body of the request, instead of as part of the URL
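Roughly, submitting the same form with POST would produce a request along these lines (headers abbreviated), with the encoded parameters in the body rather than in the URL:

POST /search HTTP/1.1
Host: search.dbi.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 20

query=bananas+apples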

  15. HTML Behind the Form • The reset button, labeled with “Clear” • Clears the form <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>

  16. Crawling the Web

  17. Basic Crawler (Spider) • A crawler finds web pages to download into a search engine cache • It maintains a queue of pages and repeats: removeBestPage( ), findLinksInPage( ), insertIntoQueue( )
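A minimal Python sketch of this loop, assuming some importance(url) scoring function and ignoring politeness, robustness, and parallelism; all names are illustrative.

import heapq
import re
import urllib.request

cache = {}   # url -> page contents (the search engine cache)
seen = set() # urls that have ever been queued
queue = []   # priority queue of (-importance, url); heapq pops the best page first

def insert_into_queue(url, score):
    if url not in seen:
        seen.add(url)
        heapq.heappush(queue, (-score, url))

def find_links_in_page(html):
    # Naive link extraction; a real crawler would use an HTML parser
    return re.findall(r'href="(http[^"]+)"', html)

def crawl(seed_urls, importance, limit=100):
    for url in seed_urls:
        insert_into_queue(url, importance(url))
    while queue and len(cache) < limit:
        _, url = heapq.heappop(queue)                     # removeBestPage()
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        cache[url] = html
        for link in find_links_in_page(html):             # findLinksInPage()
            insert_into_queue(link, importance(link))     # insertIntoQueue()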

  18. Choosing Pages to Download • Q: Which pages should be downloaded? • A: It is usually not possible to download all pages because of space limitations. Try to get the most important pages • Q: When is a page important? • A: Use a metric – by interest, by popularity, by location, or a combination of these

  19. Interest Driven • Suppose that there is a query Q that contains the words we are interested in • Define the importance of a page P by its textual similarity to Q • Example: TF-IDF(P, Q) = Sum w in Q (TF(P,w)/DF(w)) • Problem: We must decide if a page is important while crawling. However, we don’t know DF until the crawl is complete • Solution: Use an estimate This is what you are using in Ex2!
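A Python sketch of this metric, with DF estimated from counts over the pages crawled so far; the function and variable names are illustrative.

from collections import Counter

def interest_score(page_words, query_words, doc_freq):
    # TF(P, w): number of occurrences of w in the page
    # DF(w): number of crawled pages containing w; only an estimate,
    # since the true DF is unknown until the crawl is complete
    tf = Counter(page_words)
    return sum(tf[w] / max(doc_freq.get(w, 1), 1) for w in query_words)

# Hypothetical usage: score a page against the query "evil saddam war"
page = "saddam war news war report".split()
print(interest_score(page, ["evil", "saddam", "war"],
                     {"evil": 40, "saddam": 10, "war": 25}))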

  20. Popularity Driven • The importance of a page P is proportional to the number of pages with a link to P • This is also called the number of back links of P • As before, need to estimate this amount • There is a more sophisticated metric, called PageRank (will be taught later in the course)

  21. Location Driven • The importance of P is a function of its URL • Example: • Words appearing in the URL (e.g. com) • Number of “/” in the URL • Easily evaluated, requires no data from previous crawls • Note: We can also use a combination of all three metrics
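A Python sketch of a location-driven score; the particular features and weights are only illustrative.

from urllib.parse import urlparse

def location_score(url):
    # Importance computed from the URL alone
    parsed = urlparse(url)
    score = 0.0
    if parsed.hostname and parsed.hostname.endswith(".com"):
        score += 1.0                          # a "word" appearing in the URL
    score -= 0.5 * parsed.path.count("/")     # fewer "/" = closer to the site root
    return score

print(location_score("http://www.cnn.com/world/news/story.html"))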

  22. Refreshing Web Pages • Pages that have been downloaded must be refreshed periodically • Q: Which pages should be refreshed? • Q: How often should we refresh a page? In Ex2, you never refresh pages 

  23. Freshness Metric • A cached page is fresh if it is identical to the version on the web • Suppose that S is a set of pages (i.e., a cache) • Freshness(S) = (number of fresh pages in S) / (number of pages in S)

  24. Age Metric • The age of a page is the number of days since it was refreshed • Suppose that S is a set of pages (i.e., a cache) Age(S) = Average age of pages in S
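A Python sketch of both metrics over a cache S, assuming each cached page records an is_fresh flag and a last_refreshed timestamp (the field names are illustrative).

from datetime import datetime

def freshness(cache):
    # Fraction of cached pages identical to the live web version
    return sum(1 for page in cache if page["is_fresh"]) / len(cache)

def age(cache):
    # Average number of days since each cached page was refreshed
    now = datetime.now()
    return sum((now - page["last_refreshed"]).days for page in cache) / len(cache)

S = [
    {"is_fresh": True,  "last_refreshed": datetime(2023, 3, 1)},
    {"is_fresh": False, "last_refreshed": datetime(2023, 2, 20)},
]
print(freshness(S), age(S))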

  25. Refresh Goal • Goal: Minimize the age of a cache and maximize the freshness of a cache • Crawlers can refresh only a certain number of pages in a period of time • The page download resource can be allocated in many ways • We need a refresh strategy

  26. Refresh Strategies • Uniform Refresh: The crawler revisits all pages with the same frequency, regardless of how often they change • Proportional Refresh: The crawler revisits a page with frequency proportional to the page’s change rate (i.e., if it changes more often, we visit it more often) Which do you think is better?

  27. Trick Question • Two page database • e1 changes daily • e2 changes once a week • Can visit one page per week • How should we visit pages? • e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform] • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional] • e1 e1 e1 e1 e1 e1 ... • e2 e2 e2 e2 e2 e2 ... • ?

  28. Proportional Often Not Good! • Visit fast-changing e1 → get 1/2 day of freshness • Visit slow-changing e2 → get 1/2 week of freshness • Visiting e2 is a better deal!

  29. Another Example • The collection contains 2 pages: e1 changes 9 times a day, e2 changes once a day • Simplified change model: • Day is split into 9 equal intervals: e1 changes once on each interval, and e2 changes once during the day • Don’t know when the pages change within the intervals • The crawler can download a page a day. • Our goal is to maximize the freshness

  30. Which Page Do We Refresh? • Suppose we refresh e2 in midday • If e2 changes in first half of the day, it remains fresh for the rest (half) of the day. • 50% for 0.5 day freshness increase • 50% for no increase • Expectancy of 0.25 day freshness increase

  31. Which Page Do We Refresh? • Suppose we refresh e1 in midday • If e1 changes in the first half of the interval, and we refresh in midday (which is the middle of the interval), it remains fresh for the remaining half of the interval = 1/18 of a day • 50% for 1/18 day freshness increase • 50% for no increase • Expectancy of 1/36 day freshness increase
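The same expected-gain computation written out in Python, under the slides' simplified change model (one change per interval at an unknown time, refresh in the middle of an interval):

def expected_freshness_gain(interval_days):
    # With probability 1/2 the change falls in the first half of the interval,
    # and the refreshed copy then stays fresh for the remaining half interval
    return 0.5 * (interval_days / 2)

print(expected_freshness_gain(1))      # e2: one interval per day   -> 0.25 days
print(expected_freshness_gain(1 / 9))  # e1: nine intervals per day -> 1/36 of a day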

  32. Not Every Page is Equal! • Suppose that e1 is accessed twice as often as e2 • Then, it is twice as important to us that e1 is fresh than it is that e2 is fresh

  33. Politeness Issues • When a crawler crawls a site, it uses the site’s resources: • the web server needs to find the file in the file system • the web server needs to send the file over the network • If a crawler asks for many of the pages, and at a high speed, it may • crash the site’s web server or • be banned from the site • Solution: Ask for pages “slowly”

  34. Politeness Issues (cont) • A site may identify pages that it doesn’t want to be crawled • A polite crawler will not crawl these pages (although nothing stops a crawler from being impolite) • Put a file called robots.txt in the main directory to identify pages that should not be crawled (e.g., http://www.cnn.com/robots.txt)

  35. robots.txt • Use the header User-Agent to identify programs whose access should be restricted • Use the header Disallow to identify pages that should be restricted • Example:
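A minimal sketch using Python's standard urllib.robotparser; the robots.txt contents and the crawler name are hypothetical.

from urllib import robotparser

# A hypothetical robots.txt: restrict all agents from two directories
robots_lines = """User-Agent: *
Disallow: /private/
Disallow: /cgi-bin/""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(robots_lines)
print(rp.can_fetch("MyCrawler", "http://www.example.com/private/data.html"))  # False
print(rp.can_fetch("MyCrawler", "http://www.example.com/index.html"))         # True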

  36. Other Issues • Suppose that a search engine uses several crawlers at the same time (in parallel) • How can we make sure that they are not doing the same work?

  37. Index Repository

  38. Storage Challenges • Scalability: Should be able to store huge amounts of data (data spans disks or computers) • Dual Access Mode: Random access (find specific pages) and Streaming access (find large subsets of pages) • Large Batch Updates: Reclaim old space, avoid access/update conflicts • Obsolete Pages: Remove pages no longer on the web (how do we find these pages?)

  39. Update Strategies • Updates are generated by the crawler • Several characteristics • Time in which the crawl occurs and the repository receives information • Whether the crawl’s information replaces the entire database or modifies parts of it

  40. Batch Crawler vs. Steady Crawler • Batch mode • Periodically executed • Allocated a certain amount of time • Steady mode • Run all the time • Always send results back to the repository

  41. Partial vs. Complete Crawls • A batch mode crawler can either do • A complete crawl every run, and replace entire cache • A partial crawl and replace only a subset of the cache • The repository can implement • In place update: Replaces the data in the cache, thus, quickly refreshes pages • Shadowing: Create a new index with updates, and later replace the previous, thus, avoiding refresh-access conflicts

  42. Partial vs. Complete Crawls • Shadowing resolves the conflicts between updates and reads for queries • Batch mode fits well with shadowing • A steady crawler fits well with in-place updates

  43. Types of Indices • Content index: Allow us to easily find pages with certain words • Links index: Allow us to easily find links between pages • Utility index: Allow us to easily find pages in a certain domain, or of a certain type, etc. • Q: What do we need these for?

  44. Is the Content Index From Ex1 Good? • In Ex1, most of you had a table: • We want to quickly find pages with a specific word • Is this a good way of storing a content index?

  45. Is the Content Index From Ex1 Good? NO • If a word appears in a thousand documents, then the word will be in a thousand rows. Why waste the space? • If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents • Does not easily support queries that require multiple words

  46. Inverted Keyword Index • A hashtable with words as keys and lists of matching documents as values • Lists are sorted by urlId • evil: (1, 5, 11, 17) • saddam: (3, 5, 11, 17) • war: (3, 5, 17, 28) • butterfly: (4, 22)
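A minimal in-memory Python version of this structure; the urlIds and words follow the example above.

from collections import defaultdict

index = defaultdict(list)  # word -> list of urlIds, kept sorted by urlId

def add_page(url_id, words):
    # Assumes pages are indexed in increasing urlId order,
    # so appending keeps every posting list sorted
    for w in sorted(set(words)):
        index[w].append(url_id)

add_page(1, ["evil"])
add_page(3, ["saddam", "war"])
add_page(4, ["butterfly"])
add_page(5, ["evil", "saddam", "war"])
print(index["evil"])  # [1, 5]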

  47. Query: “evil saddam war” • Algorithm: Always advance the pointer(s) with the lowest urlId • evil: (1, 5, 11, 17) • saddam: (3, 5, 11, 17) • war: (3, 5, 17, 28) • Answers: 5, 17
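A Python sketch of this merge over the posting lists above: one pointer per sorted list, and the pointer(s) at the lowest urlId are advanced until all lists agree.

def intersect(posting_lists):
    pointers = [0] * len(posting_lists)
    answers = []
    while all(p < len(lst) for p, lst in zip(pointers, posting_lists)):
        current = [lst[p] for p, lst in zip(pointers, posting_lists)]
        if len(set(current)) == 1:
            answers.append(current[0])               # all pointers agree: a match
            pointers = [p + 1 for p in pointers]
        else:
            lowest = min(current)                    # advance pointer(s) with lowest urlId
            pointers = [p + (1 if c == lowest else 0)
                        for p, c in zip(pointers, current)]
    return answers

print(intersect([[1, 5, 11, 17], [3, 5, 11, 17], [3, 5, 17, 28]]))  # [5, 17]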

  48. Challenges • Index build must be: • Fast • Economical • Incremental indexing must be supported • Tradeoff when using compression: memory is saved, but time is lost compressing and uncompressing

  49. How do we distribute the indices between files? • Local inverted file • Each file indexes a disjoint (e.g., random) subset of the pages • The query is broadcast to all files • The result is the merge of the per-file answers • Global inverted file • Each file is responsible for a subset of the terms in the collection • The query is sent only to the appropriate files

  50. Ranking
