Web Science: Searching the web


Presentation Transcript


  1. Web Science: Searching the web

  2. Basic Terms • Search engine • Software that finds information on the Internet or World Wide Web • Web crawler • An automated program that surfs the web and indexes and/or copies the websites it visits • Also known as bots, web spiders, or web robots • Meta-tag • Extra information that describes the HTML document • <meta name="keywords" content="HTML,CSS,XML,JavaScript"> • Hyperlink or link • A reference (link) to another web page

  3. How do you evaluate a search engine? • Time taken to return results • Number of results • Quality of results

  4. How does a web crawler work? • Start at a webpage • Download the HTML content • Search for the HTML link tags <a href="URL"></a> • Repeat steps 2-3 for each of the links • When a website has been completely indexed, load and crawl other websites
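A minimal sketch of those four steps in Python, using only the standard library; the starting URL, page limit, and helper names are illustrative assumptions, not part of the slide:

# Minimal breadth-first crawler sketch.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()                     # step 1: take a page to visit
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")  # step 2: download the HTML
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(html)                         # step 3: find the <a href="..."> tags
        for link in parser.links:                 # step 4: repeat for each discovered link
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Hypothetical usage: crawl("https://example.com", max_pages=5)

A real crawler would also respect robots.txt, throttle its requests, and store the downloaded content for indexing.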

  5. Parallel Web Crawling • Speed up your web crawling by running on multiple computers at the same time (i.e., parallel computing) • How often should you crawl the entire Internet? • How many copies of the Internet should you keep? • What are the different ways to index a webpage? • Meta keywords • Content • PageRank (number of links to the page)
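One simple way to parallelize the download step is a pool of workers that fetches several pages at once; the sketch below uses threads on a single machine, and the URL list and worker count are purely illustrative assumptions:

# Fetch several pages in parallel with a thread pool.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Download one page; return (url, html), or (url, None) on failure."""
    try:
        return url, urlopen(url).read().decode("utf-8", errors="replace")
    except OSError:
        return url, None

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

with ThreadPoolExecutor(max_workers=4) as pool:
    for url, html in pool.map(fetch, urls):
        if html is not None:
            print(url, len(html), "characters")   # hand the page off to the indexer here

Spreading the same idea across many machines rather than threads is what raises the coordination questions above: how often to re-crawl, and how many copies of the crawled web to keep.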

  6. Basic Search Engine Algorithm • Crawl the Internet • Save meta keywords for every page • Save the content and popular words on the page • When somebody needs to find something, search for matching keywords or content words Problem: • Nothing stops you from inserting your own keywords or content that do not relate to the page’s *actual* content
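In code, this basic approach amounts to building an inverted index from words to pages and then intersecting the entries for the query words; a small sketch with hypothetical pages and a hypothetical query:

# Index the words on each page, then match a query against the index.
from collections import defaultdict

pages = {
    "page1.html": "python tutorial for web crawling",
    "page2.html": "cooking recipes and kitchen tips",
    "page3.html": "web search engines and crawling at scale",
}

# Inverted index: word -> set of pages containing that word.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    """Return the pages that contain every word of the query."""
    results = set(pages)
    for word in query.lower().split():
        results &= index.get(word, set())
    return results

print(search("web crawling"))   # {'page1.html', 'page3.html'}

Because the index trusts whatever words the page supplies, stuffing a page (or its meta keywords) with unrelated terms is enough to pollute the results, which is exactly the problem noted above.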

  7. PageRank Algorithm • Crawl the Internet • Save the content and index the contents’ popular words • Identify the links on the page • Each link to an already indexed page increases the PageRank of that linked page • When somebody needs to find something, search for matching keywords or content words, BUT rank the search results according to PageRank Problem: anyone can create a bunch of websites that all link to a single specific page to inflate its rank (http://en.wikipedia.org/wiki/Google_bomb)
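A rough sketch of the iterative PageRank computation on a toy link graph; the graph, damping factor, and iteration count are illustrative assumptions rather than values from the slide:

# Iterative PageRank on a toy link graph.
links = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):   # repeat until the ranks roughly stabilize
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share   # every inbound link raises the target's rank
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))        # C ranks highest: it has the most inbound links

This is also why link farms work as an attack: manufacturing many pages that all link to one target inflates that target's rank.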

  8. Shallow Web vs. Deep Web • Shallow web • Websites and content that are easily visible to “dumb search engines” • Content publicly links to other content • Shallow web content tends to be static content (unchanging) • Deep web • Websites and content that tend to be dynamic and/or unlinked • Private web sites • Unlinked content • Smarter search engines can crawl the deep web

  9. Search Engine Optimization (SEO) • Meta keywords • Words that relate to your content • Human-readable URLs • i.e., avoid complicated dynamically created URLs • Links to your page on other websites • Page visits • Others? • White hat vs. black hat SEO • White hats are the good guys. When would their techniques be used? • Black hats are the bad guys. When would their techniques be used?

  10. Search Engine Design • Assumptions are key to design! • Major problem in older search engines: • People gamed the search results • Results were not tailored to the user • What assumptions does a typical search engine make now? (i.e. what factors influence search today?)
