
How Search Engines Work?

Ziv Bar-Yossef, Department of Electrical Engineering, Technion


Presentation Transcript


  1. How Search Engines Work? Ziv Bar-Yossef Department of Electrical Engineering Technion

  2. What is the Internet? • A global network of computers connected to each other • Computers “talk” to each other using standard protocols • TCP/IP

  3. What is the World-Wide Web (WWW)? • Collection of pages available via the Internet • Internet users can view pages with web browsers • WWW is only one application of the Internet • Other applications: email, messengers, VoIP, newsgroups, FTP

  4. Web Pages • Various formats • PDF, Word, Excel, images, MP3, video, text • Most popular format: HTML • HTML pages point to each other using hyperlinks • Users “surf the web” by clicking hyperlinks

  5. What are Search Engines? • Users have “information needs” • Where can I find solutions to my math homework problem? • Where can I find MP3s of Miri Messika’s latest album? • What is the weather in Eilat during Hanukkah? • Which Sharons are famous, other than our prime minister? • Search engines enable us to find web pages that match our information needs

  6. Search Engines • Information need: “What other Sharons are famous, except for our prime minister?” • User query: sharon -ariel • The search engine matches the query against web pages • Ranked list of matching pages: Sharon Creech; Sharon Stone; Sharon, Massachusetts

  7. How Search Engines (don’t) Work? • Common misconception: when a user submits a query, the search engine scans all web pages to find the relevant matches • Diagram: the user query sharon -ariel goes to the search engine, which reads web pages and returns a ranked list of matching pages: Sharon Creech; Sharon Stone; Sharon, Massachusetts

  8. How Search Engines Work? • What do you do when you look for a term in an encyclopedia? • Use the index! • Diagram: the user query sharon -ariel goes to the search engine, which looks it up in its index of web pages and returns a ranked list of matching pages: Sharon Creech; Sharon Stone; Sharon, Massachusetts

  9. Search Engine Architecture • Crawler • Index • Query Processor • Ranking Algorithm

  10. Web Crawler (a.k.a. Spider) • Fetches web pages and stores them in a local repository • Tries to get as many web pages as possible • Follows hyperlinks to learn about new pages • Refetches pages that change frequently
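
The crawler's "fetch, store, follow hyperlinks" loop can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `TOY_WEB` is a hypothetical in-memory stand-in for the web (no network access), the `*.example` URLs are made up, and re-fetching of frequently changing pages is left out.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the web: URL -> HTML source.
TOY_WEB = {
    "a.example": '<a href="b.example">B</a> <a href="c.example">C</a>',
    "b.example": '<a href="c.example">C</a>',
    "c.example": '<a href="a.example">A</a> <a href="missing.example">?</a>',
}


def crawl(seed, fetch=TOY_WEB.get, max_pages=100):
    """Breadth-first crawl from seed; returns the local repository (URL -> HTML)."""
    repository = {}
    frontier = deque([seed])
    seen = {seed}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:  # the page could not be fetched; skip it
            continue
        repository[url] = html  # store the page locally
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:  # follow hyperlinks to learn about new pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


print(list(crawl("a.example")))
```

The `seen` set keeps the crawler from queueing the same URL twice, which matters because pages commonly link back to each other.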

  11. The Index • Pages (word positions in parentheses) • www.cnn.com: Ariel(1) Sharon(2), the(3) prime(4) minister(5) of(6) Israel(7) founded(8) a(9) new(10) political(11) party(12). • www.hollywood.com: Sharon(1) Stone(2) dressed(3) a(4) new(5) Jean(6) Paul(7) Gaultier(8) gown(9) at(10) the(11) Oscars(12) after(13) party(14). • Index entries, in the form term: (page, position) • ariel: (cnn.com,1) • dress: (hollywood.com,3) • found: (cnn.com,8) • gaultier: (hollywood.com,8) • gown: (hollywood.com,9) • israel: (cnn.com,7) • jean: (hollywood.com,6) • minister: (cnn.com,5) • new: (cnn.com,10), (hollywood.com,5) • oscar: (hollywood.com,12) • party: (cnn.com,12), (hollywood.com,14) • paul: (hollywood.com,7) • political: (cnn.com,11) • prime: (cnn.com,4) • sharon: (cnn.com,2), (hollywood.com,1) • stone: (hollywood.com,2)
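
A positional index like this one can be built in a few lines. The sketch below is a simplification, not the slide's exact procedure: the stopword list is an assumption inferred from the words the slide's index leaves out, and no stemming is applied, so "founded", "dressed" and "oscars" stay in their surface form rather than becoming "found", "dress" and "oscar".

```python
import re
from collections import defaultdict

# Assumed stopword list: the words missing from the slide's index.
STOPWORDS = {"the", "of", "a", "at", "after"}

# The two example pages from the slide.
DOCS = {
    "cnn.com": "Ariel Sharon, the prime minister of Israel founded a new political party.",
    "hollywood.com": "Sharon Stone dressed a new Jean Paul Gaultier gown at the Oscars after party.",
}


def build_index(docs):
    """Map each term to its posting list of (page, position) pairs."""
    index = defaultdict(list)
    for page, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        # Positions are 1-based and count every word, including stopwords.
        for position, word in enumerate(words, start=1):
            if word not in STOPWORDS:
                index[word].append((page, position))
    return dict(index)


index = build_index(DOCS)
print(index["sharon"])  # [('cnn.com', 2), ('hollywood.com', 1)]
print(index["party"])   # [('cnn.com', 12), ('hollywood.com', 14)]
```

Keeping the position next to each page is what later allows phrase or proximity matching; a plain term-to-page mapping would be enough for the simple queries in this talk.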

  12. Index by “Anchor Text” • Anchor text: what’s written inside a link • Example: Ariel Sharon, the prime minister… • Usually succinctly describes what’s written in the linked page • Under which terms is a page listed in the index? • Terms that appear in the page • Terms that appear in the anchor text of links to the page
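
Crediting anchor text to the linked page can be sketched as follows. The page content and the URLs `news.example` and `pm.example` are hypothetical, and the tokenization (lowercase, split on whitespace) is a deliberate simplification.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Records (target URL, anchor text) for every link in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:  # only keep text that sits inside a link
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def anchor_terms(pages):
    """Map each linked-to URL to the terms used in links that point at it."""
    terms = defaultdict(set)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        for target, text in collector.anchors:
            terms[target].update(text.lower().split())
    return dict(terms)


# Hypothetical page linking to another page with the anchor text "Ariel Sharon".
pages = {"news.example": 'Read about <a href="pm.example">Ariel Sharon</a>, the prime minister.'}
print(sorted(anchor_terms(pages)["pm.example"]))  # ['ariel', 'sharon']
```

Here `pm.example` would be listed under "ariel" and "sharon" even if those words never appeared in its own text.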

  13. Query Processor • Gets a user query • Fetches relevant posting lists from index • Extracts relevant matches from lists • Example: Query = “sharon -ariel” • L1 = posting list of sharon • sharon: (cnn.com,2), (hollywood.com,1) • L2 = posting list of ariel • ariel: (cnn.com,1) • Return all pages in L1 that do not occur in L2 • hollywood.com
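
The posting-list lookup for a query such as "sharon -ariel" amounts to set operations on pages. The sketch below assumes a minimal query syntax (plain terms are required, terms prefixed with "-" are excluded); the function names are illustrative. With L1 = {cnn.com, hollywood.com} and L2 = {cnn.com}, the difference leaves hollywood.com.

```python
# Toy index using posting lists from the slides: term -> [(page, position), ...].
INDEX = {
    "sharon": [("cnn.com", 2), ("hollywood.com", 1)],
    "ariel": [("cnn.com", 1)],
    "stone": [("hollywood.com", 2)],
}


def pages_for(term, index):
    """Set of pages that appear in the term's posting list."""
    return {page for page, _ in index.get(term, [])}


def process_query(query, index):
    """Pages that contain every plain term and none of the '-' terms."""
    required, excluded = [], []
    for token in query.lower().split():
        if token.startswith("-"):
            excluded.append(token[1:])
        else:
            required.append(token)
    if not required:
        return []
    matches = pages_for(required[0], index)
    for term in required[1:]:
        matches &= pages_for(term, index)  # keep pages present in every list
    for term in excluded:
        matches -= pages_for(term, index)  # drop pages present in an excluded list
    return sorted(matches)


print(process_query("sharon -ariel", INDEX))  # ['hollywood.com']
```

Only the posting lists named in the query are read, which is why the engine never has to scan the pages themselves at query time.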

  14. Ranking Algorithm • Many queries have many matching pages • 472 million matches for “London” in Google • Cannot return all of them to the user • User needs the most relevant results anyway • Need to order results by relevance • Most relevant results are at the top • Ranking algorithm: a method of ordering matches • The “heart” of a search engine • The reason why Google is the most preferred search engine today

  15. Google’s PageRank • Ranking as an election • Candidates: all web pages • Voters: all web pages • p votes for q if p has a hyperlink to q • Favorites(p) = all the pages p votes for • Fans(p) = all the pages that vote for p • PageRank(p) = 1 if p has no fans

  16. Google’s PageRank • Underlying principles: • A page is “important” if it has important fans • A page splits its “importance” evenly among its favorite pages • Diagram: a link graph whose pages carry the example PageRank values 1, 1.5, 1, 4, 1, 2.5, 1
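
One way to read the two principles, together with the rule that a page with no fans gets rank 1, is: the rank of p is the sum, over every fan q of p, of rank(q) divided by the number of q's favorites. That formula is a reconstruction from the slide text, not a quotation of it. The sketch below applies it to a hypothetical acyclic link graph chosen so that the results match the seven values on the diagram; the actual graph on the slide is not recoverable. The published PageRank algorithm additionally uses a damping factor and iterates to handle cycles, both of which are omitted here.

```python
from functools import lru_cache

# Hypothetical link graph (page -> its favorites), picked so that the
# computed ranks come out as 1, 1, 1, 1, 1.5, 2.5 and 4.
FAVORITES = {
    "a": ["e", "f"],
    "b": ["e"],
    "c": ["f"],
    "d": ["f"],
    "e": ["g"],
    "f": ["g"],
    "g": [],
}


def simple_pagerank(favorites):
    """Simplified PageRank; assumes the link graph has no cycles."""
    # Invert the graph: fans[p] lists the pages that link to p.
    fans = {page: [] for page in favorites}
    for page, targets in favorites.items():
        for target in targets:
            fans[target].append(page)

    @lru_cache(maxsize=None)
    def rank(page):
        if not fans[page]:
            return 1.0  # a page with no fans gets rank 1
        # Each fan q splits its rank evenly among its favorites.
        return sum(rank(q) / len(favorites[q]) for q in fans[page])

    return {page: rank(page) for page in favorites}


print(simple_pagerank(FAVORITES))
```

Page "a" has two favorites, so it passes 0.5 to each; "e" therefore ends at 0.5 + 1 = 1.5, "f" at 0.5 + 1 + 1 = 2.5, and "g" collects both in full for a rank of 4.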

  17. Google’s PageRank • Ranking algorithm: • Find pages that match the given query • Order them by their PageRank • Return top 10 matches
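
The three steps reduce to a sort followed by a cut-off. In this sketch the PageRank scores are made-up numbers for illustration, and `rank_results` is a hypothetical helper name.

```python
def rank_results(matches, pagerank, k=10):
    """Order matching pages by PageRank, highest first, and keep the top k."""
    ordered = sorted(matches, key=lambda page: pagerank.get(page, 0.0), reverse=True)
    return ordered[:k]


# Made-up scores for three pages that matched a query.
scores = {"a.example": 1.0, "b.example": 4.0, "c.example": 2.5}
print(rank_results(["a.example", "b.example", "c.example"], scores, k=2))
# ['b.example', 'c.example']
```

Because PageRank does not depend on the query, the scores can be computed once ahead of time and merely looked up when a query arrives.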

  18. But… PageRank Does Not Always Work • Spam

  19. Conclusions • Search engines use an index to answer user queries • Ranking is the most important component • Spam is a problem

  20. Thank You
