Internet protocol (IP)
• Two major functions: addressing that identifies hosts and their locations, and identifying the destination of each packet
• Connectionless protocol
• Reliability is not guaranteed:
  • Data corruption, e.g. when messages are broken up incorrectly
  • Lost data packets
  • Duplicate arrival
• IP addresses for hosts:
  • A numerical label assigned to each device participating in a computer network that uses the Internet Protocol for communication
  • Hostnames (Ex: cs.umb.edu)
  • We prefer meaningful names, but behind the scenes hostnames are converted to IP addresses, which are a series of four decimal numbers (each 0 to 255) separated by dots. Ex: 192.0.2.44 (stored in 32 bits). What is a potential problem with 32-bit addresses?
  • Can be split into a network address (type/size of network) and a host number (machine/device number on this network)
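The 32-bit storage and the network/host split can be sketched with Python's standard `ipaddress` module. The address and the /24 split below are assumptions chosen for illustration (192.0.2.0/24 is a range reserved for documentation), not values from the slides.

```python
import ipaddress

# Hypothetical example address from the 192.0.2.0/24 documentation range.
addr = ipaddress.IPv4Address("192.0.2.44")

# The dotted form is a readable view of a single 32-bit integer.
as_int = int(addr)
as_bits = format(as_int, "032b")

# Assumed /24 split: the first 24 bits name the network, the last 8 the host.
network = ipaddress.IPv4Network("192.0.2.0/24")
host_number = as_int & int(network.hostmask)

print(as_int)                         # the address as one integer
print(as_bits)                        # the same value as 32 binary digits
print(addr in network, host_number)   # membership and host part

# 32 bits allow at most 2**32 distinct addresses.
total_addresses = 2 ** 32
print(total_addresses)
```

The last line hints at the problem the slide asks about: 2**32 is roughly 4.3 billion, which is fewer addresses than there are devices wanting one.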
Domain name system
• Example hostname: cs.bu.edu
• Domain name: the part of a hostname that specifies the organization or group
• Top-level domain (TLD): the last section of a domain name, specifying the type of organization or its country of origin
• The Domain Name System (DNS) translates hostnames into numeric IP addresses: when you make a request, a domain name server translates the hostname into an IP address, and that address is then used to reach the host
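As a rough sketch of the translation step, the snippet below splits a hostname into its parts and looks it up in a small table. The table stands in for a domain name server; `NAME_TABLE`, `split_hostname`, `resolve`, and the addresses in it are hypothetical placeholders, not real DNS records.

```python
# Hypothetical lookup table standing in for a domain name server.
# The addresses are placeholders from the 192.0.2.0/24 documentation range.
NAME_TABLE = {
    "cs.bu.edu": "192.0.2.10",
    "cs.umb.edu": "192.0.2.20",
}

def split_hostname(hostname):
    """Return (machine name, domain name, top-level domain)."""
    labels = hostname.lower().rstrip(".").split(".")
    machine = labels[0]
    domain = ".".join(labels[1:])
    tld = labels[-1]
    return machine, domain, tld

def resolve(hostname, table=NAME_TABLE):
    """Translate a hostname to an IP address, or None if it is unknown."""
    return table.get(hostname.lower().rstrip("."))

print(split_hostname("cs.bu.edu"))
print(resolve("cs.bu.edu"))
print(resolve("unknown.example"))
```

On a machine with network access, the standard library call `socket.gethostbyname(hostname)` asks the real resolver instead of a local table.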
World Wide Web
• Good News:
  • Millions of webpages available on a variety of topics
• Bad News:
  • Millions of webpages available on a variety of topics
  • Haphazard labeling
  • Sitting on servers in various locations
How do we search for a specific topic?
Search engines!
• A web search engine is designed to search for information on the World Wide Web and FTP servers
  • Find pages based on keywords (Crawl)
  • Keep an index of useful pages (Index)
  • Present users with information based on the index (Search)
• Search engines operate algorithmically or with a mix of algorithmic and human input
Search engines: History
• Early summer of 1993: no commercial or large-scale search engines existed
• W3Catalog:
  • World's first primitive search engine
  • By Oscar Nierstrasz at the University of Geneva
• Wandex:
  • Index built by a web robot
  • Matthew Gray at MIT
  • Built to measure the size of the World Wide Web
Search engines: History
• Aliweb:
  • Indexed by hand
• JumpStation:
  • Web robot
  • Crawl, Index and Search
• 2000: Google rose to fame!
  • Algorithm called PageRank that ranks a webpage based on the number and PageRank of the other pages that link to it
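The idea behind PageRank (a page matters if important pages link to it) can be sketched as a short iteration over a toy link graph. This is a simplified textbook version, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page graph `WEB` are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: "c" is linked to by three other pages.
WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(WEB)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page "c" comes out on top because three pages link to it, and "d" comes out last because nothing links to it.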
Difference
• Early search engines
  • A few hundred thousand pages
  • One or two thousand inquiries per day
• Top search engines
  • Hundreds of millions of pages
  • Billions of queries per day
Web Crawling
• A web crawler is also known as a spider
• Special software agent that finds web pages and follows the links on them
• Contents are analyzed
  • Words, titles, special fields called meta tags
• Starting point?
  • Popular pages
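The link-following behaviour can be sketched as a breadth-first walk. To keep the example self-contained, it crawls a small in-memory dictionary instead of the real web; `TOY_WEB`, its URLs, and `crawl` are hypothetical.

```python
from collections import deque

# Hypothetical miniature web: each URL maps to the links found on that page.
TOY_WEB = {
    "popular.example/": ["popular.example/news", "other.example/"],
    "popular.example/news": ["popular.example/"],
    "other.example/": ["other.example/about"],
    "other.example/about": [],
    "island.example/": [],
}

def crawl(start, web=TOY_WEB):
    """Visit pages breadth-first from start, following each link once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in web.get(url, []):
            if link not in seen:   # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("popular.example/")
print(visited)
```

Note that `island.example/` is never visited: a spider only finds pages that are reachable by links from its starting points, which is one reason popular pages make good starting points.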
Google: Web Crawling
• At its peak:
  • Used multiple spiders
  • Each spider could keep ~300 connections to pages open at a time
  • Generated about 600 KB of data per second
• Starting points:
  • Dedicated server that feeds URLs to the spiders
  • Instead of relying on an ISP to resolve domain names, Google ran its own DNS server
• The Google spider looks at two things:
  • Significant words within the page
  • Location of the words
Why is location important?
Meta tags
• Specified by the page owner
• Can be helpful
• Problem?
• Robot exclusion protocol
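The robot exclusion protocol is commonly expressed as a `robots.txt` file that tells spiders which parts of a site to skip. A minimal sketch using Python's standard `urllib.robotparser` follows; the rules, the spider name `ExampleSpider`, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every spider must skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite spider asks before fetching each URL.
allowed = parser.can_fetch("ExampleSpider", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleSpider", "http://www.example.com/private/notes.html")

print(allowed)
print(blocked)
```

The protocol is voluntary: it only works for spiders that choose to check the rules.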
Indexing
• Spiders get the data
• Now what?
  • Content analysis
• Indexing: the method by which information is sorted and stored
• One way: storing each word and its associated URL
  • No way to tell if the word is important or trivial
  • How many times was the word used?
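One way to picture "storing the word and associated URL" is an inverted index: a table from each word to the pages that contain it. The sketch below also keeps a per-page count, which addresses the last question on the slide. The page texts and the names `PAGES`, `tokenize`, and `build_index` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "a.example/": "web search engines index the web",
    "b.example/": "spiders crawl the web",
}

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {url: number of times the word appears there}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index

index = build_index(PAGES)
print(index["web"])
print(index["spiders"])
```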
Ranking
• A relationship between items that defines their order
• For more useful information:
  • Number of times a word appears on the page
  • Assign a weight to each word
  • Each search engine has a different formula for assigning weight to the words in its index
• Popular way of indexing: hashing
  • A formula (hash function) assigns each word a numerical value, which is then used to store and retrieve the word's entry
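Both ideas on this slide can be sketched in a few lines. The weighting formula is an invented example (each search engine uses its own), with the assumption that a word in the title counts three times as much as a word in the body. The hash uses CRC-32 from the standard library as one possible formula; the bucket count of 1024 is also an assumption.

```python
import zlib

# Assumed weights: a word in the title counts more than a word in the body.
TITLE_WEIGHT = 3
BODY_WEIGHT = 1

def weight(word, title, body):
    """Score a word for one page from how often and where it appears."""
    word = word.lower()
    in_title = title.lower().split().count(word)
    in_body = body.lower().split().count(word)
    return TITLE_WEIGHT * in_title + BODY_WEIGHT * in_body

def bucket(word, buckets=1024):
    """Turn a word into a bucket number using a fixed formula (CRC-32)."""
    return zlib.crc32(word.lower().encode("utf-8")) % buckets

score = weight("search", "Search engines", "how search engines search the web")
slot = bucket("search")

print(score)   # 3 * 1 (title) + 1 * 2 (body) = 5
print(slot)    # always the same bucket for the same word
```

Because the same word always produces the same number, the index can jump straight to the right bucket instead of scanning every entry.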
Building a search
• Query: a single word or a string of words
• Complex queries require the use of Boolean operators
  • AND: all terms joined by the operator must appear; also written '+'
  • OR
  • NOT
  • FOLLOWED BY
  • NEAR
  • Quotation Marks
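The first three operators map directly onto set operations when the index stores, for each word, the set of pages containing it. The index contents and page names below are hypothetical.

```python
# Hypothetical index: word -> set of pages that contain it.
INDEX = {
    "search": {"p1", "p2", "p3"},
    "engine": {"p1", "p3"},
    "spider": {"p2", "p3"},
}

def pages_with(word):
    """Return the set of pages containing word (empty if unknown)."""
    return INDEX.get(word, set())

both = pages_with("search") & pages_with("engine")      # search AND engine
either = pages_with("engine") | pages_with("spider")    # engine OR spider
without = pages_with("search") - pages_with("spider")   # search NOT spider

print(sorted(both))
print(sorted(either))
print(sorted(without))
```

FOLLOWED BY, NEAR, and quoted phrases need more than sets: the index would also have to record where in each page a word occurs.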
Building a search
• Literal searches: based on Boolean operators
• Concept-based: statistical analysis of pages containing the words or phrases you search for
  • More information is stored about each page
  • Search times may be longer
• Natural language queries
  • Ask a question: AskJeeves.com
  • The engine parses the question for keywords
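A very rough sketch of parsing keywords from a natural language question is to drop common filler words and keep the rest. The stop-word list here is a small invented sample, and real natural language engines do considerably more than this.

```python
import re

# Small, hypothetical list of words that carry little search meaning.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "do", "does", "how", "what",
    "who", "why", "where", "when", "of", "to", "in", "i",
}

def keywords(question):
    """Return the words of a question that are not stop words, in order."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOP_WORDS]

terms = keywords("How does a search engine index the web?")
print(terms)
```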
Money, money, money...
• Beyond selling shares or private investment
• Three main methods:
  • Online purchases
  • Web advertising
    • Keywords relating to a product, service or business
  • Allowing users to integrate ads into their own websites
• Fourth, shady way: selling user information
Google Company Culture
• Sergey Brin and Larry Page began Google with a few networked computers at Stanford
• Multibillion-dollar organization
  • >19,000 employees globally
  • Market capitalization >$145 billion
• Googleplex:
  • Free food: gourmet café stations
  • Snack rooms
  • Exercise rooms
  • Game rooms
  • Grand piano