Set 10 Search Engines & SEO - PowerPoint PPT Presentation

set 10 search engines seo n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Set 10 Search Engines & SEO PowerPoint Presentation
Download Presentation
Set 10 Search Engines & SEO

play fullscreen
1 / 16
Set 10 Search Engines & SEO
160 Views
Download Presentation
southern-olson
Download Presentation

Set 10 Search Engines & SEO

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Set 10 Search Engines & SEO IT452 Advanced Web and Internet

  2. Outline • How do search engines work? • Basic operation • What makes a good one? • What makes it difficult? • Web Design with search engines in mind

  3. Search Engines – Basic Operation • Crawler • Indexer • Query Engine • Separate step b/c delay in downloading

  4. Crawler Early spiders: DMOZ directory Links from pages Submitted URLs Site indexes • How does it find the pages? • Does it crawl everything? • How fast does it crawl? No. “bow-tie” structure of web, see next slide • Courtesy wait • Robots.txt

  5. The Web is a Bow-Tie • Early study of 200 million web pages and links • Broder et al. 2000 • Structure of the web: a bow-tie shape • http://www9.org/w9cdrom/160/160.html SCC: strongly connected component IN: any node has a path to the SCC OUT: any node has a backward path to the SCC A spider starting in OUT won’t find most of the Web!

  6. Indexer • Parse document • Remember • Whole text • Words • Phrases • Link text • Builds an “inverted index” Whole text -- often Maps words to document IDs barista burrito burro 531235, 4324, 6981, 125793, 41009, … 344, 7173, 574527, 14513, 2451245, … 8375, 75346, 345231, 5123523, 52388, …

  7. Query Engine • Process text query from user • Inverse index merges document IDs • Return ranked set of hopefully relevant pages • Ranking factors • 1. Query-specific • 2. Page-specific • 3. Page Genre • 4. Contains all query words? Freq of query words? Reputable page? Popular page? “PageRank” Show some blogs, some news, some photos Hacks! – er proprietary intelligent query features

  8. PageRank • Original basis of Google – still important • Developed in 1998. • http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427 • Basic Model • Two interpretations: • Random walk • Pages voting Fw = links in w pointing out Bw = pages pointing to w c = normalizing factor Draw graph on board, edges, walk. Pages with higher page rank count more

  9. PageRank • Two interpretations: • Random walk • Pages voting

  10. PageRank • Who owns the PageRank patent? • (hint: not Google) Stanford University Stanford gives Google the exclusive rights to use it. • - Stanford received 1.8 million shares for Google to use it. • Sold them in 2005 for $336 million. • 2012 they would be worth $1.1 billion

  11. SEO • Goal • What does it consider? • Types Improve type/quantity of traffic to site from search engines 1.) search algorithm specifics 2.) what people search for 1.) white hat 2.) black hat -- deception

  12. SE0 0.1 • Early search engines heavily dependent on meta tags • What to do? • White hat: • Black hat: • Key issue: easy to _____________________ Provide relevant tags Stuff tags Manipulate with one site

  13. SEO 1.0 • Modern search engines depend heavily on links • What to do? • White hat: • Black hat: Get good, relevant sites to link to you Link farms

  14. SEO 2.0 • Machine Learning • You search for “cats”, which result do you click first? • Learn from user clicks which they prefer • Smarter algorithms cluster words that “mean” the same thing • What to do? • White hat: • Black hat: Nothing? Automate clicks on search engine result pages?

  15. Good principles • Clear hierarchy • Links to all pages (static), not as images • Useful content • Links from relevant sites • Good title / alt / meta • Limit dynamically generated pages (or # args) • No broken links, < 100 links • Use robots.txt – exclude internal search results • Fresh content

  16. Bad principles • Stuff with lots of irrelevant content • Show different version of content to crawler • Link schemes, farms • Hidden text and links • Pages designed just for search engines, not users • Automated querying • Deception in general