1 / 21

WWW Search Engines

WWW Search Engines. Indexing for unstructured resources Nils Rooijmans 25-4-2006 @ TU Eindhoven. www.ilse.nl. Founded as first dutch SE in 1996 Number 1 SE in the Netherlands, nowadays nr 2 next to Google 80 mio web url’s in index 1 mio queries per day

trapper
Download Presentation

WWW Search Engines

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. WWW Search Engines Indexing for unstructured resources Nils Rooijmans 25-4-2006 @ TU Eindhoven

  2. www.ilse.nl • Founded as first dutch SE in 1996 • Number 1 SE in the Netherlands, nowadays nr 2 next to Google • 80 mio web url’s in index • 1 mio queries per day • 7 developpers in Eindhoven (most formely TUE) • BusinessModel: Paid Listings

  3. Global Architecture • Functional Architecture • SE_Architecture.htm • System architecture complex because of scalability issues

  4. Spider/Crawler • Initial seed • Link extraction • Spider fence • URL DB (prioritising, scheduling) • HTTP response handlers • Robots.txt

  5. Parser • HTML is a mess • Webpages have different formats • PDF • Word/Powerpoints/txt • Flash • Javascript! • CSS

  6. Indexing • Vector Space model • Indexfiles • Precompute Ranking Properties • Title/URL/Body flags • Anchor texts • TF/IDF

  7. Retrieval • Term based ranking: • Position and proximity of queryterms • TF*IDF • Achor texts • Link Based ranking: • The web is not just a collection of documents – its hyperlinks are important!

  8. PageRank - Motivation • A link from page A to page B is a vote of the author of A for B, or a recommendation of the page. • The number incoming links to a page is a measure of importance and authority of the page. • Also take into account the quality of recommendation, so a page is more important if the sources of its incomoing llinks are important.

  9. What is a Markov Chain? • A Markov chain has two components: • A network structure much like a web site, where each node is called a state. So the complete web is the set of all possible states. • A transition probability of traversing a link given that the chain is in a state. • For each state the sum of outgoing probabilities is one. • A sequence of steps through the chain is called a random walk.

  10. The Random Surfer • Assume the web is a Markov chain. • Surfers randomly click on links, where the probability of an outlink from page A is 1/m, where m is the number of outlinks from A. • The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page. • Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.

  11. Dangling Pages • Problem: A and B have no outlinks. Solution: Assume A and B have links to all web pages with equal probability.

  12. Rank Sink • Problem: Pages in a loop accumulate rank but do not distribute it. • Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

  13. PageRank (PR) - Definition • P is a web page • Pi are the web pages that have a link to P • O(Pi) is the number of outlinks from Pi • d is the teleportation probability • N is the size of the web

  14. Example Web Graph

  15. Iteratively Computing PageRank • Replace d/N in the def. of PR(P) by d, so PR will take values between 1 and N. • d is normally set to 0.15, but for simplicity lets set it to 0.5 • Set initial PR values to 1 • Solve the following equations iteratively:

  16. Example Computation of PR

  17. Large Matrix Computation • Computing PageRank can be done via matrix multiplication, where the matrix has 30 million rows and columns. • The matrix is sparse as average number of outlinks is between 7 and 8. • Setting d = 0.15 or above requires at most 100 iterations to convergence. • Researchers still trying to speed-up the computation.

  18. Link Spamming to Improve PageRank • Spam is the act of trying unfairly to gain a high ranking on a search engine for a web page without improving the user experience. • Link farms - join the farm by copying a hub page which links to all members. • Selling links from sites with high PageRank.

  19. Result Optimization by Machine Learning • Future: personalized search by learning from user feedback • Which ranking features work well? • LEARN IT!

  20. Assignment • Build a meta-search • Learn which algorithm works best for which type of queries: • single VS multi-word queries • general queries VS specific queries(roughly based on the number of results in the resultset) • Improve ranking

  21. Step by step approach • Build website based on API’s • Ilse API (mail searchengine@ilse.net for documentation) • Google API: http://www.google.com/apis/index.html • Yahoo API: http://developer.yahoo.net/web/V1/webSearch.html • Merge resultsets (handle duplicates, hide originating API(s)) • Build feedback mechanism • Score different API’s for different query types • Use score in merging of results

More Related