
Set 10 Search Engines & SEO




  1. Set 10 Search Engines & SEO IT452 Advanced Web and Internet

  2. Outline • How do search engines work? • Basic operation • What makes a good one? • What makes it difficult? • Web Design with search engines in mind

  3. Search Engines – Basic Operation • Crawler • Indexer • Query Engine • These are separate steps because of the delay in downloading pages

  4. Crawler • How does it find the pages? Early spiders: the DMOZ directory, links from pages, submitted URLs, site indexes • Does it crawl everything? No – the “bow-tie” structure of the web (see next slide) • How fast does it crawl? Courtesy wait; robots.txt
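A polite crawler checks robots.txt before fetching. A minimal sketch using Python's standard-library `urllib.robotparser`; the rules and the crawler name `MyCrawler` are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a polite crawler would honor before fetching pages.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
print(rp.crawl_delay("MyCrawler"))                                  # 10
```

The `Crawl-delay` value is one way a site asks for the courtesy wait mentioned above.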

  5. The Web is a Bow-Tie • Early study of 200 million web pages and their links (Broder et al. 2000) • Structure of the web: a bow-tie shape • http://www9.org/w9cdrom/160/160.html • SCC: strongly connected component • IN: every node has a path into the SCC • OUT: every node is reachable from the SCC, but has no path back • A spider starting in OUT won’t find most of the Web!
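The last point is easy to demonstrate on a toy graph. A sketch (the graph is made up, with nodes labeled by the bow-tie region they sit in) showing a breadth-first crawl from OUT reaching only OUT, while a crawl from IN finds everything:

```python
from collections import deque

# Toy web graph (hypothetical): IN -> SCC -> OUT, as in the bow-tie model.
links = {
    "in1":  ["scc1"],          # IN: can reach the SCC
    "scc1": ["scc2", "out1"],  # SCC: strongly connected core
    "scc2": ["scc1"],
    "out1": ["out2"],          # OUT: reachable from the SCC, no way back
    "out2": [],
}

def crawl(seed):
    """BFS along out-links only, like a simple spider."""
    seen, queue = {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(crawl("in1")))   # finds all five pages
print(sorted(crawl("out1")))  # finds only out1 and out2
```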

  6. Indexer • Parse document • Remember: whole text (often), words, phrases, link text • Builds an “inverted index”: maps words to document IDs
  barista → 531235, 4324, 6981, 125793, 41009, …
  burrito → 344, 7173, 574527, 14513, 2451245, …
  burro → 8375, 75346, 345231, 5123523, 52388, …
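Building a word → document-ID mapping like the one above takes only a few lines. A minimal sketch; the documents and IDs are made up:

```python
from collections import defaultdict

# Hypothetical documents, keyed by document ID.
docs = {
    1: "burrito with carnitas",
    2: "the barista made a latte",
    3: "burro and burrito",
}

# Inverted index: each word maps to the set of doc IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Store posting lists sorted, as real indexes do, to enable fast merging.
inverted = {word: sorted(ids) for word, ids in index.items()}
print(inverted["burrito"])  # [1, 3]
```

A real indexer would also strip punctuation, stem words, and record positions for phrase queries.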

  7. Query Engine • Process text query from user • Merge document ID lists from the inverted index • Return a ranked set of (hopefully) relevant pages • Ranking factors: 1. Query-specific – contains all query words? frequency of query words? 2. Page-specific – reputable page? popular page? “PageRank” 3. Page genre – show some blogs, some news, some photos 4. Hacks! – er, proprietary intelligent query features
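The "merge" step for an AND query intersects the sorted posting lists of the query words. A sketch of the classic two-pointer merge; the posting lists are made up:

```python
def merge_and(p1, p2):
    """Intersect two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])  # doc contains both words
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1             # advance the pointer at the smaller ID
        else:
            j += 1
    return out

# Hypothetical posting lists for two query words:
print(merge_and([3, 8, 12, 40], [8, 12, 99]))  # [8, 12]
```

Sorted lists are what make this linear-time walk possible; with unsorted lists you would need hashing or repeated searches.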

  8. PageRank • Original basis of Google – still important • Developed in 1998 • http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427 • Basic model: R(w) = c · Σ_{v ∈ Bw} R(v) / |Fv|, where Fw = links pointing out of w, Bw = pages pointing to w, c = a normalizing factor • Two interpretations: random walk; pages voting (pages with higher PageRank count more) • (Draw graph on board: edges, walk)
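The model can be computed by power iteration: start with uniform ranks and repeatedly apply the voting rule until it converges. A sketch of the common damped variant (the damping factor replaces the slide's normalizing constant c; the three-page graph is made up):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration on R(w) = (1-d)/n + d * sum over v in Bw of R(v)/|Fv|."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for v, outs in links.items():
            for w in outs:
                # v splits its rank evenly among its out-links: "pages voting".
                new[w] += damping * rank[v] / len(outs)
        rank = new
    return rank

# Hypothetical graph: a links to b and c; b links to c; c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(graph)
print(max(r, key=r.get))  # "c" collects the most votes
```

In the random-walk reading, damping is the probability the surfer follows a link rather than jumping to a random page.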

  9. PageRank • Two interpretations: • Random walk • Pages voting

  10. PageRank • Who owns the PageRank patent? (hint: not Google) – Stanford University • Stanford gave Google exclusive rights to use it • Stanford received 1.8 million Google shares in exchange • It sold them in 2005 for $336 million • By 2012 they would have been worth $1.1 billion

  11. SEO • Goal: improve the type/quantity of traffic to a site from search engines • What does it consider? 1. Search algorithm specifics 2. What people search for • Types: 1. White hat 2. Black hat – deception

  12. SEO 0.1 • Early search engines depended heavily on meta tags • What to do? • White hat: provide relevant tags • Black hat: stuff tags • Key issue: easy to manipulate with a single site

  13. SEO 1.0 • Modern search engines depend heavily on links • What to do? • White hat: get good, relevant sites to link to you • Black hat: link farms

  14. SEO 2.0 • Machine learning: you search for “cats” – which result do you click first? • Learn from user clicks which results people prefer • Smarter algorithms cluster words that “mean” the same thing • What to do? • White hat: nothing? • Black hat: automate clicks on search engine result pages?

  15. Good principles • Clear hierarchy • Static text links to all pages, not image links • Useful content • Links from relevant sites • Good title / alt / meta • Limit dynamically generated pages (or the number of URL arguments) • No broken links; fewer than 100 links per page • Use robots.txt – exclude internal search results • Fresh content

  16. Bad principles • Stuff with lots of irrelevant content • Show different version of content to crawler • Link schemes, farms • Hidden text and links • Pages designed just for search engines, not users • Automated querying • Deception in general
