
Behind the Plug


alena


Presentation Transcript


  1. Behind the Plug 1000: Search Engines

  2. How Do Search Engines Work? • Trying to navigate the Web is like trying to run across an African plain with no map… [Image: gazelle]

  3. How Do Search Engines Work? • First we need to find every web page on the Internet… • This isn’t a simple feat, of course! • How can you do this? • Start with a set of “seed” web sites • These should cover a wide range of topics and types of information • Follow every link on every one of those web sites • Treat the sites you just found as new “seed” sites • Rinse and repeat until you’ve run out of new links to follow
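
The seed-and-follow loop on this slide can be sketched as a breadth-first walk. This is only an illustration under stated assumptions: `LINKS` is a made-up, in-memory stand-in for the Web, and `crawl` is a hypothetical helper name. A real spider would fetch pages over the network.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def crawl(seeds, links):
    """Visit every page reachable from the seeds, following each link once."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # newly found pages become new "seeds"
                seen.add(target)
                queue.append(target)
    return order

print(crawl(["a.example"], LINKS))
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other.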

  4. Mapping the WWW Links

  5. Mapping the WWW Links

  6. Mapping the WWW Links Bottom of page…

  7. Mapping the WWW

  8. Mapping the WWW • Virtually everything on the Web is connected in some way • If you follow the links far enough, you’ll get to just about every domain • Walking the web is the job of a spider • Sometimes called a robot • But once I’ve found all these web sites, what do I do with them? • First: Does the web site administrator want the web page to be indexed?

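  9. Robots.txt • A web site can publish a file that tells the spider whether or not the site wants to be crawled and indexed • This file is robots.txt, served from the top level of the site • A “Disallow” rule in the file asks spiders to stay out of the listed paths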
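
A spider can check these rules with Python’s standard `urllib.robotparser` module. The sketch below assumes a made-up robots.txt and a hypothetical spider name, `ExampleSpider`. It parses the rules from a string rather than downloading them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="ExampleSpider"):
    """Return True if the parsed rules allow this spider to fetch the URL."""
    return parser.can_fetch(agent, url)

print(may_crawl("https://www.example.com/index.html"))
print(may_crawl("https://www.example.com/private/notes.html"))
```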

  10. Mapping Sites • If the site does want to be indexed, what’s the next step? • Make a copy of the entire web site in a local “cache” • This keeps the search engine from using the web site’s computing power to do the rest of the work… • Build a map file! • The map file identifies every “page” on the web site • This allows the search engine to “understand” the way the web pages are connected on this site [Image: colonial map]

  11. Mapping Sites • Now you know part of what’s on all those servers Google has • A copy of every web site indexed by Google • This is a huge database! • What does a spider record about a web site?

  12. Mapping Sites What Spiders Record • The Title • The title listed on the file index.html is used for the site • All the other pages in the web site can also have titles

  13. Mapping Sites What Spiders Record • The Description • The description contained in “index.html” is used for the site itself • Each web page in the site might also have separate descriptions

  14. Mapping Sites What Spiders Record • The Keywords • There are keywords for the entire site and for each page on the site • These are used as an indicator of what the site and individual pages relate to

  15. Mapping Sites What Spiders Record • The words actually used in the text of the web page • The number of times each word is used • The “alt” information for each image on the web site • This is the text normally displayed when an image can’t be displayed • This is also the text a screen reader speaks aloud when someone who is blind has the page read to them
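
The items a spider records (title, description, keywords, word counts, and “alt” text) can be pulled out of a page with Python’s standard `html.parser`. This is a minimal sketch: the `PageRecorder` class and the sample `PAGE` are invented for illustration, and a real spider would handle far messier HTML.

```python
from collections import Counter
from html.parser import HTMLParser
import re

class PageRecorder(HTMLParser):
    """Collects title, meta description/keywords, alt text and word counts."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alts = []
        self.words = Counter()
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            # Count every word that appears in the visible text.
            self.words.update(re.findall(r"[a-z']+", data.lower()))

# Hypothetical sample page.
PAGE = """<html><head><title>History of the Statue of Liberty</title>
<meta name="description" content="A short history of the statue">
<meta name="keywords" content="history, Statue of Liberty">
</head><body><p>The statue was a gift. The statue stands on an island.</p>
<img src="statue.jpg" alt="The Statue of Liberty"></body></html>"""

recorder = PageRecorder()
recorder.feed(PAGE)
print(recorder.title)
print(recorder.meta)
print(recorder.alts)
print(recorder.words["statue"])
```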

  16. Mapping Sites • Keywords might be added or removed by the search engine based on their analysis of the site…

  17. Finding Pages • So we have this huge database… • …but how do we use it? • Suppose you search for the history of the Statue of Liberty • It would be easy to just return a list of every site that says “Statue of Liberty” in it • But you’d get a lot of “false hits” • Sites that don’t really have anything to do with the history of the Statue of Liberty [Image: Bing search overload]

  18. Finding Pages • The first thing a search engine will do is toss out the articles and other “useless” words (often called “stop words”) • “The,” “of,” “and,” and other such words generally don’t add value to the search • For the most part, it’s better just to strip them out
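
Stripping these words from a query takes only a few lines. The word list here is a tiny, hypothetical one chosen for the example. Real search engines maintain their own lists.

```python
# A tiny, hypothetical stop-word list; real engines use their own.
STOP_WORDS = {"a", "an", "and", "of", "the"}

def strip_stop_words(query):
    """Lower-case the query and drop words that add no value to the search."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(strip_stop_words("the history of the Statue of Liberty"))
# → ['history', 'statue', 'liberty']
```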

  19. Finding Pages

  20. Finding Pages • Most search engines then use “secret sauce” to determine how to rank the pages • The page ranking is the order in which pages are displayed • The page ranking algorithm is really the heart of the search engine [Image: Bing search overload]

  21. Common Page Ranking Elements • Are all the keywords in the actual title of the page? • “History of the Statue of Liberty” would rank higher • “History of Liberty” would rank lower • Are all the keywords in the “meta keywords” for the site or page? • “History, Statue of Liberty, Ellis Island” would rank higher • “History, statues, Easter island” would rank lower

  22. Common Page Ranking Elements • How often do the keywords show up in the actual text of the web page? • How close together are the keywords when they appear in the web page? • “History of the Statue of Liberty” would rank higher • “History of the technique used in building the Statue of Liberty” would rank lower • How often do the keywords show up in the “alt” image information? • Are the keywords in the URL of the web site itself?
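
Keyword frequency and closeness can both be measured directly. The sketch below is one simple way to do it, not any search engine’s actual formula: `keyword_frequency` and `keyword_span` are hypothetical helper names, and “closeness” is taken to mean the shortest run of words that contains every keyword.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def keyword_frequency(keywords, text):
    """Total number of times any keyword appears in the text."""
    words = tokenize(text)
    return sum(words.count(keyword) for keyword in keywords)

def keyword_span(keywords, text):
    """Length in words of the shortest window holding every keyword, or None."""
    words = tokenize(text)
    wanted = set(keywords)
    best = None
    for start, word in enumerate(words):
        if word not in wanted:
            continue
        found = set()
        for end in range(start, len(words)):
            if words[end] in wanted:
                found.add(words[end])
            if found == wanted:
                length = end - start + 1
                if best is None or length < best:
                    best = length
                break
    return best

KEYWORDS = ["history", "statue", "liberty"]
print(keyword_span(KEYWORDS, "History of the Statue of Liberty"))
# → 6
print(keyword_span(KEYWORDS, "History of the technique used in building the Statue of Liberty"))
# → 11
```

A smaller span means the keywords sit closer together, which matches the slide’s two examples: the first phrase would rank higher than the second.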

  23. Common Page Ranking Elements • The number of sites and pages that link to the web site/page • External links matter more than internal links • The “quality” of a link matters • Is the link from a web site that appears to be on the same topic? • Is the link from a web site that appears to be “spam,” or hosting virus software, illegal “stuff,” etc.? • Google’s link-based score of this kind is called “PageRank”
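
The idea that a link counts for more when it comes from a well-linked page can be sketched with a simplified, textbook-style PageRank iteration. This is not Google’s production ranking: the `link_scores` name, the toy `LINKS` graph, and the damping value of 0.85 are assumptions for the example, and every link target is assumed to appear as a key in the graph.

```python
def link_scores(links, damping=0.85, rounds=50):
    """Repeatedly pass each page's score along its outgoing links."""
    pages = list(links)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(rounds):
        new = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new[target] += damping * score[page] / n
        score = new
    return score

# Hypothetical toy site graph: page -> pages it links to.
LINKS = {
    "home": ["history", "tickets"],
    "history": ["home"],
    "tickets": ["home"],
    "blog": ["home", "history"],
}

scores = link_scores(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph, "home" ends up with the highest score because every other page links to it, and "blog" ends up lowest because nothing links to it.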

  24. Common Page Ranking Elements • Is the page linked from any social networking sites? • Facebook, Twitter, etc.? • How often is the site where the page resides linked to? • The number and quality of clickthroughs • How often do others who search for this set of keywords click on this site or page? • How long did others spend on this page once they clicked through to it? • Social factors == “Crowdsourcing”

  25. Common Page Ranking Elements • How fast is the server? • Slower servers tend to mean “edge content” • How well built is the page? • Does all the HTML validate? • Are there broken links to internal or external locations? • How much duplicate information is there? • And a ton of other factors… • This is not a simple problem to solve • In fact, most people don’t consider it “solved…”

  26. Building Better Searches • Try to make the keywords as unique to the topic as possible • Try to use a combination of keywords that will uniquely identify the topic • Pay attention to spelling, etc. • Use advanced searches when you need specific information

  27. Building Better Searches
