Data Communications and Computer Networks: A Business User’s Approach

Chapter 11

The Internet

This time

Move up the OSI hierarchy

  • Internet
  • Apps
  • Protocols
    • XXXP

Introduction

Today’s Internet is a vast collection of thousands of networks and their attached devices.

The Internet began as the ARPANET during the 1960s.

One high-speed backbone connected several university, government, and research sites.

The backbone was capable of supporting 56 Kbps transmission speeds and was eventually financed by the National Science Foundation (NSF).

Brief History of the Internet (1)
  • 1964 - Packet switching network paper by Rand Corporation
  • 1969 - The DOD Advanced Research Projects Agency creates an experimental network called ARPANET
  • 1972 - First e-mail programs introduced
  • 1980s - ARPANET splits into two networks: ARPANET and MILNET
  • 1984 - ARPANET shut down and the Internet resulted
  • 1987 - NSFnet Network Service Center (NNSC)
Brief History of the Internet (2)
  • 1993 - InterNIC formed, replacing NNSC
  • 1993 - CERN releases the World Wide Web (WWW), developed by Tim Berners-Lee
  • 1993-1994 - The graphical web browsers Mosaic and Netscape Navigator are introduced
  • 1995 - NSF ends all backbone support, and the Internet becomes commercially supported
  • 1996-present - Internet access increases rapidly among home, education and business users
Brief History of the Internet (3)
  • Internet Growth in Nodes
    • 1969 - only 4
    • 1983 - approximately 500
    • 1989 - approximately 80,000
    • 1997 - over 16 million
    • Now - over 370 million
Internet Growth
  • http://www.netsizer.com/
  • Hosts vs. nodes
    • Hosts – users connected to the Internet: 130 M (2001)
    • Nodes are all connected devices

Internet Services

  • The Internet provides many types of services, including several very common ones:
  • File transfer protocol (FTP)
  • Remote login (Telnet)
  • Internet telephony
  • Electronic mail
  • World Wide Web
  • Streaming Video and Audio

File Transfer Protocol (FTP)

Used to transfer files across the Internet.

User can upload or download a file.

The URL for an FTP site begins with ftp://…

The three most common ways to access an FTP site are:

1. Through a browser

2. Using a canned FTP program

3. Issuing FTP commands at a text-based command prompt.


Remote Login (Telnet)

Allows a user to log in remotely to a distant computer site.

User usually needs a login and password for the remote computer site.

User saves money on long distance telephone charges.


Internet Telephony

The transfer of voice signals using a packet switched network and the IP protocol.

Also known as packet voice, voice over packet, voice over the Internet, and voice over Internet Protocol (VoIP).

VoIP can be internal to a company or can be external using the Internet.

VoIP consumes many resources and may not always work well, but can be cost effective in certain situations.


Internet Telephony (VoIP)

Three basic ways to make a telephone call using VoIP:

1. PC to PC using sound cards and headsets (or speakers and microphone)

2. PC to telephone (need a gateway to convert IP addresses to telephone numbers)

3. Telephone to telephone (need gateways)


Internet Telephony (VoIP)

Three functions necessary to support voice over IP:

1. Voice must be digitized (PCM, 64 Kbps, fairly standard)

2. 64 Kbps voice must be compressed (many standards here - ITU-T G.729A, used by AT&T, Lucent, others; G.723.1, used by Microsoft and Intel)

3. Once the voice is compressed, the data must be transmitted. Many different ways to do this.
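The 64 Kbps figure for PCM follows from simple arithmetic, which the short sketch below works through. The sampling parameters (8,000 samples per second, 8 bits per sample) and the nominal codec rates (8 Kbps for G.729A, 6.3 Kbps for G.723.1) are standard published values rather than numbers taken from the slide, and the names `pcm_bps` and `compression_ratio` are illustrative only.

```python
# PCM telephone voice: 8,000 samples per second, 8 bits per sample.
SAMPLES_PER_SECOND = 8_000
BITS_PER_SAMPLE = 8

pcm_bps = SAMPLES_PER_SECOND * BITS_PER_SAMPLE  # 64,000 bps = 64 Kbps

# Nominal output rates of the two codecs named on the slide, in bits per second.
# These rates are an assumption added for illustration.
codec_bps = {"G.729A": 8_000, "G.723.1": 6_300}


def compression_ratio(codec: str) -> float:
    """Return how many times smaller the codec output is than 64 Kbps PCM."""
    return pcm_bps / codec_bps[codec]


for name, rate in codec_bps.items():
    print(f"{name}: {rate / 1000} Kbps, about {compression_ratio(name):.1f}x smaller than PCM")
```

The ratio shows why compression matters before step 3: each call needs only a fraction of the uncompressed bandwidth.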


Internet Telephony (VoIP)

How can we transport compressed voice?

Streaming audio, such as Real Time Streaming Protocol (RTSP) and Microsoft’s Active Streaming Format (ASF)

Resource Reservation Protocol (RSVP) - carries a specific QoS through the network, reserving bandwidth at every node. Operates at the transport layer.

Internet Stream Protocol version 2 (ST2) - an experimental resource reservation protocol that operates at the same layer as IP


Electronic Mail

  • E-mail programs can create, send, receive, and store e-mails, as well as reply to, forward, and attach non-text files.
  • Multipurpose Internet Mail Extension (MIME) is used to send e-mail attachments.
  • Simple Mail Transfer Protocol (SMTP) is used to transmit e-mail messages (uses TCP port 25).
    • Email daemon always waiting to perform its function
  • Post Office Protocol version 3 (POP3) and Internet Message Access Protocol (IMAP) are used to hold and later retrieve e-mail messages.
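As a rough illustration of how MIME carries a non-text attachment, the sketch below builds a message with Python's standard `email` package. The addresses, subject, and file name are hypothetical, and nothing is actually sent; handing the message to an SMTP server on TCP port 25 would be a separate step.

```python
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str,
                  body: str, attachment: bytes, filename: str) -> EmailMessage:
    """Create a text e-mail and attach one binary file as a MIME part."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)  # becomes the text/plain part
    # Adding an attachment converts the message to multipart/mixed.
    msg.add_attachment(attachment, maintype="application",
                       subtype="octet-stream", filename=filename)
    return msg


# Hypothetical example values.
msg = build_message("alice@example.com", "bob@example.com",
                    "Chapter 11 notes", "Slides attached.\n",
                    b"\x00\x01binary data", "slides.bin")

print(msg.get_content_type())
for part in msg.iter_parts():
    print(" ", part.get_content_type())
```

The binary part is base64-encoded inside the message, which is what lets a text-oriented protocol such as SMTP carry it.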

eMail

Consists of 2 parts:

User Agent: Allows users to create, edit, store and forward messages

Message Transfer Agent: Prepares and transfers email message


Electronic Mail Holders

Post Office Protocol version 3 (POP3) and Internet Message Access Protocol (IMAP) are used to hold and later retrieve e-mail messages.

POP allows you to save messages in your email box

IMAP allows you to view only the message headers without downloading everything. It also permits mailboxes, search, etc.


Listservs

A popular software program used to create and manage Internet mailing lists.

When an individual sends an e-mail to a listserv, the listserv sends a copy of the message to all listserv members.

Listservs can be useful business tools for individuals trying to follow a particular area of study.


Usenet

A voluntary set of rules for passing messages and maintaining newsgroups.

A newsgroup is the Internet equivalent of an electronic bulletin board system.

Thousands of Usenet groups exist on virtually any topic.


Streaming Audio and Video

The continuous download of a compressed audio or video file, which can be heard or viewed on the user’s workstation.

Real-time Transport Protocol (RTP) and Real Time Streaming Protocol (RTSP) support streaming audio and video.

Streaming audio and video consume a large amount of network resources.


World Wide Web

The World Wide Web (WWW) is an immense collection of web pages and other resources that can be downloaded across the Internet and displayed on a workstation via a web browser.

The browser is the user agent.

The most popular service on the Internet.

Basic web pages are created with the HyperText Markup Language (HTML).


World Wide Web

While HTML is the language to display a web page, Hypertext Transfer Protocol (HTTP) is the protocol to transfer a web page.

Many extensions to HTML have been created. Dynamic HTML is a very popular extension to HTML.

Common examples of dynamic HTML include mouse-over techniques, live positioning of elements (layers), data binding, and cascading style sheets.


World Wide Web – XML

Extensible Markup Language (XML) is a description for how to create a document - both the definition of the document and the contents of the document.

The syntax of XML is fairly similar to HTML.

You can define your own tags, such as <CUSTOMER>, which have their own unique properties.
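A small example may make the idea of user-defined tags concrete. The sketch below parses a hypothetical `<CUSTOMER>` document with Python's standard `xml.etree.ElementTree` module; the child tags, the `id` attribute, and the values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical document built from tags we defined ourselves.
document = """
<CUSTOMER id="1001">
  <NAME>Pat Example</NAME>
  <CITY>State College</CITY>
</CUSTOMER>
""".strip()

customer = ET.fromstring(document)

print(customer.tag)               # the root tag we defined
print(customer.get("id"))         # an attribute on that tag
print(customer.findtext("NAME"))  # the text inside a child tag
```

Unlike HTML, where the browser already knows what each tag means, the meaning of `CUSTOMER`, `NAME`, and `CITY` here is entirely up to the applications exchanging the document.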


e-Commerce and e-government

e-commerce: the buying and selling of goods and services via the Internet.

e-government: government transactions via the Internet.

e-commerce major areas:

1. e-retailing

2. Electronic Data Interchange (EDI)

3. Micro-marketing

4. Electronic security

5. Web services

[Figure: Security of Data, Privacy of Data, Business Policies, Transaction Processing Integrity]

Security of Data
  • How secure is the data maintained by the business?
    • Personal/business entity data
    • data stored by a web site that is used by a trading partner to make transaction decision
  • How secure is the data as it is transmitted to and from this business?
Business Policies
  • What are the business policies and practices of this business?
    • billing and payment policies
    • shipping policy
    • return policy
    • tax collection
    • additional policy information
Transaction Processing Integrity
  • What procedures are in place to ensure that the transactions are handled as disclosed?
    • How does the company ensure that it does not lose orders placed?
    • How does the company ensure that it accurately processes bills and account information?
    • What controls exist to ensure that the company accurately posts payment in a timely fashion?
    • Does the company have controls in place to ensure that it ships the right inventory items and quantities?
Privacy of Data
  • What is the privacy policy of the business?
  • What information does it keep?
  • How will the information collected be used by the business?
  • Will this business share or sell customer data without the customer’s permission or knowledge?
  • What ensures that the company’s privacy policies are observed and practiced on a continuous basis?
Security Assurance Systems ensure that...
  • The transacting parties are authenticated (they are who they claim to be) - a security issue
  • Electronic data are protected from unauthorized disclosure - a security issue
Electronic Data Interchange...
  • is the electronic exchange of business documents between trading partners using a standardized format.
  • Traditional EDI
    • High start-up costs
    • Used primarily by large firms
    • Generally, even large firms could only connect with 20% of their trading partners

Cookies and State Information

A cookie is data created by a web server that is stored on the hard drive of a user’s workstation.

This state information is used to track a user’s activity and to predict future needs.

Information on previous viewing habits stored in a cookie can also be used by other web sites to provide customized content.

Many consider cookies to be an invasion of privacy.

www.cookiecentral.com
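To show what the state information in a cookie looks like, here is a minimal sketch using Python's standard `http.cookies` module. The cookie name `last_viewed`, its value, and the one-hour lifetime are assumptions chosen for illustration; a real web server would place the first string in a `Set-Cookie` response header.

```python
from http.cookies import SimpleCookie

# Server side: create the state information to send to the browser.
outgoing = SimpleCookie()
outgoing["last_viewed"] = "chapter11"
outgoing["last_viewed"]["path"] = "/"
outgoing["last_viewed"]["max-age"] = 3600  # browser may keep it for one hour
header_value = outgoing["last_viewed"].OutputString()
print("Set-Cookie:", header_value)

# Server side, on a later request: the browser sends the stored value back
# in a Cookie header, which the server parses to recover the state.
incoming = SimpleCookie()
incoming.load("last_viewed=chapter11")
print("Returned value:", incoming["last_viewed"].value)
```

The server never stores anything on the workstation directly; it only asks the browser to keep the value and return it later, which is why the browser's cookie settings control the behavior.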


Cookie Control

Delete cookies after they are set

Accept no cookies, or only restricted cookies

Change permissions

www.cookiecentral.com


Intranets and Extranets

An intranet is a TCP/IP network inside a company that allows employees to access the company’s information resources through an Internet-like interface.

When an intranet is extended outside the corporate walls to include suppliers, customers, or other external agents, the intranet becomes an extranet.


Internet Protocols

  • To support the Internet and all its services, many protocols are necessary.
  • Some of the protocols that we will look at:
  • Internet Protocol (IP)
  • Transmission Control Protocol (TCP)
  • Address Resolution Protocol (ARP)
  • Domain Name System (DNS)

Internet Protocols

Recall that the Internet with all its protocols follows the Internet model.

An application, such as e-mail, resides at the highest layer.

A transport protocol, such as TCP, resides at the transport layer.

The Internet Protocol (IP) resides at the Internet or network layer.

A particular medium and its framing reside at the interface layer.

Network Layer
  • Responsible for creating, maintaining, and ending network connections.
  • Transfers a data packet from node to node within the network.
  • Message routing
  • Billing
  • Accounting
Transport Layer
  • Provides an end-to-end, error-free network connection.
  • Makes sure the data arrives at the destination exactly as it left the source.
  • Makes sure all information is accounted for:
    • Missing information
    • Duplicated information

The Internet Protocol (IP)

IP prepares a packet called a datagram for transmission across the Internet.

The IP header is added to the front of a transport-layer packet, encapsulating it.

The IP packet is then passed to the next layer, where further network information is added to it.
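The encapsulation step can be sketched by packing a minimal 20-byte IPv4 header in front of some transport data. This is an illustrative sketch only: the source address and payload are hypothetical, the identification and fragment fields are zero, and the header checksum is left at zero rather than computed.

```python
import socket
import struct


def build_ipv4_header(src: str, dst: str, payload_len: int,
                      ttl: int = 64, protocol: int = 6) -> bytes:
    """Pack a minimal IPv4 header (no options). Protocol 6 means TCP."""
    version_ihl = (4 << 4) | 5        # version 4, header length 5 x 32-bit words
    total_length = 20 + payload_len   # header plus encapsulated data
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl,
        0,                 # type of service
        total_length,
        0,                 # identification
        0,                 # flags and fragment offset
        ttl,               # Time to Live
        protocol,
        0,                 # header checksum, left at zero in this sketch
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )


segment = b"transport-layer data"  # stands in for a TCP segment
header = build_ipv4_header("192.168.0.10", "128.118.141.56", len(segment))
datagram = header + segment        # the header encapsulates the segment
print(len(header), len(datagram))
```

The next layer down would repeat the same pattern, wrapping `datagram` in its own frame header and trailer.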


The Internet Protocol (IP)

Using IP, a subnet router:

Makes routing decision based on the destination address.

May have to fragment the datagram into smaller datagrams (very rare) using Fragment Offset.

May determine that the current datagram has been hopping around the network too long and delete it, using the Time to Live (TTL) field.


The Transmission Control Protocol (TCP)

  • The TCP layer creates a connection between sender and receiver using port numbers.
  • The port number identifies a particular application on a particular device (IP address).
    • ftp: 20 (data), 21 (control)
    • smtp: 25
    • http: 80
  • TCP can multiplex multiple connections (using port numbers) over a single IP line.
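The role of port numbers can be seen in a small loopback experiment. The sketch below, which assumes the machine allows local TCP connections on 127.0.0.1, opens a listening socket on a port chosen by the operating system and connects a client to it; each end of the connection is then identified by an (IP address, port) pair.

```python
import socket

# Server: bind to the loopback address; port 0 asks the OS to pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_addr = server.getsockname()

# Client: connect to the server's (IP address, port) pair.
client = socket.create_connection(server_addr)
conn, client_addr = server.accept()
client_local = client.getsockname()

# Send a few bytes over the established connection.
client.sendall(b"hello")
data = conn.recv(1024)

print("server endpoint:", server_addr)
print("client endpoint:", client_addr)
print("received:", data)

for sock in (conn, client, server):
    sock.close()
```

Both endpoints share the same IP address here, so only the port numbers tell the two applications apart, which is the multiplexing described above.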

The Transmission Control Protocol (TCP)

The TCP layer can ensure that the receiver is not overrun with data (end-to-end flow control) using the Window field.

TCP can perform end-to-end error detection using the Checksum field; damaged segments are then corrected by retransmission.

TCP allows for the sending of high priority data (Urgent Pointer).
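The Checksum field uses the 16-bit one's-complement Internet checksum. The function below is a minimal sketch of that arithmetic over an arbitrary byte string; a real TCP checksum also covers a pseudo-header built from the IP addresses, which is omitted here. The name `internet_checksum` is illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
print(hex(checksum))

# The receiver recomputes the checksum over data plus the checksum field;
# a result of zero means no error was detected.
print(internet_checksum(payload + checksum.to_bytes(2, "big")))
```

A mismatch only tells the receiver that something was damaged; recovery comes from the sender retransmitting the segment.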


Internet Control Message Protocol (ICMP)

ICMP, which is used by routers and nodes, performs the error reporting for the Internet Protocol.

ICMP reports errors such as an invalid IP address, an invalid port address, and a packet that has hopped too many times.

Ping – TCP/IP Troubleshooting
  • Ping is the primary tool for troubleshooting IP-level connectivity. Type ping -? at a command prompt to see a complete list of available command-line options. Ping allows you to specify the size of packets to use (the default is 32 bytes), how many to send, whether to record the route used, what Time To Live (TTL) value to use, and whether to set the "don't fragment" flag.
  • When a ping command is issued, the utility sends an ICMP Echo Request to a destination IP address. Try pinging the IP address of the target host to see if it responds. If that succeeds, try pinging the target host using a host name. Ping first attempts to resolve the name to an address through a DNS server, then a WINS server (if one is configured), then attempts a local broadcast. When using DNS for name resolution, if the name entered is not a fully qualified domain name, the DNS name resolver appends the computer's domain name or names to generate a fully qualified domain name.
  • If pinging by address succeeds but pinging by name fails, the problem usually lies in name resolution, not network connectivity. Note that name resolution might fail if you do not use a fully qualified domain name for a remote name. These requests fail because the DNS name resolver is appending the local domain suffixes to a name that resides elsewhere in the domain hierarchy.
tracert command

tracert – trace route

How the TRACERT command works
  • The TRACERT diagnostic utility determines the route taken to a destination by sending Internet Control Message Protocol (ICMP) echo packets with varying IP Time-To-Live (TTL) values to the destination. Each router along the path is required to decrement the TTL on a packet by at least 1 before forwarding it, so the TTL is effectively a hop count. When the TTL on a packet reaches 0, the router should send an ICMP Time Exceeded message back to the source computer.
  • TRACERT determines the route by sending the first echo packet with a TTL of 1 and incrementing the TTL by 1 on each subsequent transmission until the target responds or the maximum TTL is reached. The route is determined by examining the ICMP Time Exceeded messages sent back by intermediate routers. Note that some routers silently drop packets with expired TTLs and are invisible to TRACERT.
  • TRACERT prints out an ordered list of the routers in the path that returned the ICMP Time Exceeded message. If the -d switch is used (telling TRACERT not to perform a DNS lookup on each IP address), the IP address of the near-side interface of the routers is reported.
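The TTL mechanism described above can be simulated without touching the network. In this sketch the path is a hypothetical list of router names ending with the destination host; it sends one probe per TTL value and records which node answers, mirroring the way TRACERT builds its list. It is a simulation of the idea, not of the real utility.

```python
def trace_route(path, max_ttl=30):
    """Simulate TRACERT over path, a list of routers ending with the destination.

    Returns a list of (ttl, responder, message) tuples, one per probe.
    """
    hops = []
    last = len(path) - 1
    for ttl in range(1, max_ttl + 1):   # first probe has TTL 1, then 2, ...
        remaining = ttl
        for position, node in enumerate(path):
            if position == last:
                hops.append((ttl, node, "Echo Reply"))
                return hops             # the target responded, so stop
            remaining -= 1              # each router decrements the TTL
            if remaining == 0:
                hops.append((ttl, node, "Time Exceeded"))
                break                   # packet discarded at this router
    return hops                         # maximum TTL reached


# Hypothetical three-hop path.
for hop in trace_route(["router-a", "router-b", "target-host"]):
    print(hop)
```

Each probe dies one hop farther along than the previous one, so the responders appear in path order.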

User Datagram Protocol (UDP)

A transport layer protocol used in place of TCP.

Where TCP supports a connection-oriented application, UDP is used with connectionless applications.

UDP also adds a header to an application packet, but the header is much simpler than TCP's.


Address Resolution Protocol (ARP)

When an IP packet has traversed the Internet and encounters the destination LAN, how does the packet find the destination workstation?

Even though the destination workstation may have an IP address, a LAN does not use IP addresses to deliver frames. A LAN uses the MAC layer address.

ARP translates an IP address into a MAC layer address so a frame can be delivered to the proper workstation.


Tunneling Protocols

The Internet is not normally a secure system.

If a person wants to use the Internet to access a corporate computer system, how can a secure connection be created?

One possible technique is by creating a virtual private network (VPN).

A VPN creates a secure connection through the Internet by using a tunneling protocol.


Every workstation attached to the Internet needs:

  • Its IP address
  • Its subnet mask (more on this later)
  • The IP address of a router
  • The IP address of a name server

BOOTP (you don’t have an IP address?)

  • Thin client workstations do not have a disk drive, and their ROM does not contain the previous four pieces of information.
  • How do we tell the machine this information? BOOTP (Bootstrap protocol).
  • There are two types of BOOTP operations:
    • REQUEST – A workstation asks a server for the information (source IP address = all 0s, destination IP address = all 1s).
    • REPLY – The server returns the information to the workstation.

Dynamic Host Configuration Protocol (DHCP)

BOOTP is not dynamic (when a client requests its IP address, it is retrieved from a static table).

DHCP is a dynamic extension of BOOTP.

When a DHCP client issues an IP request, the DHCP server looks in its static table. If no entry exists, the server selects an IP address from an available pool.


Dynamic Host Configuration Protocol (DHCP)

The address assigned by the DHCP server is temporary.

Part of the agreement includes a specific period of time.

If no time period is specified, the default is one hour.

DHCP clients may negotiate for a renewal before the time period expires.
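The allocation and lease behavior described on the DHCP slides can be modeled in a few lines. This is a toy simulation under stated assumptions: the class name `DhcpServer`, the MAC-address keys, and the sample addresses are hypothetical, and the real protocol's message exchange is not modeled at all.

```python
class DhcpServer:
    """Toy model of DHCP address assignment with leases."""

    DEFAULT_LEASE_SECONDS = 3600  # the one-hour default stated on the slide

    def __init__(self, static_table, pool):
        self.static_table = dict(static_table)  # client id -> fixed address
        self.pool = list(pool)                  # addresses free to hand out
        self.leases = {}                        # client id -> (address, expiry)

    def request(self, client_id, now, lease_seconds=None):
        """Assign or renew an address; return (address, expiry time)."""
        if lease_seconds is None:
            lease_seconds = self.DEFAULT_LEASE_SECONDS
        if client_id in self.leases:
            address = self.leases[client_id][0]     # renewal keeps the address
        elif client_id in self.static_table:
            address = self.static_table[client_id]  # static entry wins
        elif self.pool:
            address = self.pool.pop(0)              # otherwise take from the pool
        else:
            raise RuntimeError("no addresses available")
        self.leases[client_id] = (address, now + lease_seconds)
        return self.leases[client_id]


server = DhcpServer({"aa:aa": "10.0.0.5"}, ["10.0.0.100", "10.0.0.101"])
print(server.request("aa:aa", now=0))      # found in the static table
print(server.request("bb:bb", now=0))      # drawn from the pool
print(server.request("bb:bb", now=1800))   # renewal before expiry
```

Renewing before the lease expires simply pushes the expiry time forward while the client keeps the same address.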


Network Address Translation (NAT)

NAT lets a router represent an entire local area network to the Internet as a single IP address.

Thus all traffic leaving this LAN appears to originate from a single global IP address.

All traffic coming into this LAN uses this global IP address.

This security feature allows a LAN to hide all the workstation IP addresses from the Internet.


NAT

  • Since the outside world cannot see into the LAN, you do not need to use registered IP addresses on the inside LAN.
  • We can use the following blocks of addresses for private use:
  • 10.0.0.0 – 10.255.255.255
  • 172.16.0.0 – 172.31.255.255
  • 192.168.0.0 – 192.168.255.255
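The three private ranges listed above correspond to the CIDR blocks 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. The sketch below uses Python's standard `ipaddress` module to check whether an address falls inside one of them; the helper name `is_private` and the sample addresses are illustrative.

```python
import ipaddress

# The three private blocks from the slide, written in CIDR notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private(address: str) -> bool:
    """Return True if address lies in one of the three private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)


for block in PRIVATE_BLOCKS:
    print(block, "covers", block[0], "-", block[-1])

print(is_private("192.168.1.20"))    # inside a private block
print(is_private("128.118.141.56"))  # a registered, public address
```

A NAT router can use a test like this to decide which source addresses must be rewritten before a packet leaves the LAN.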

NAT

When a user on the inside sends a packet to the outside, the NAT interface changes the user’s inside address to the global IP address. This change is stored in a cache.

When the response comes back, the NAT looks in the cache and switches the addresses back.

No cache entry? The packet is dropped, unless NAT has a service table of fixed IP address mappings. This service table allows packets to originate from the outside.
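The cache and service-table behavior can be sketched as a small class. One assumption goes beyond the slide: to tell several inside users apart behind a single global address, this sketch keys the cache on a port number chosen by the NAT, which is how port-based NAT commonly works. The class name, addresses, and port numbers are all hypothetical.

```python
class Nat:
    """Toy NAT: one global IP address in front of many inside addresses."""

    def __init__(self, global_ip, service_table=None):
        self.global_ip = global_ip
        self.cache = {}                                 # global port -> inside (ip, port)
        self.service_table = dict(service_table or {})  # fixed inbound mappings
        self.next_port = 50000                          # arbitrary starting port

    def outbound(self, inside_ip, inside_port):
        """Rewrite the source of an outgoing packet and remember the change."""
        global_port = self.next_port
        self.next_port += 1
        self.cache[global_port] = (inside_ip, inside_port)
        return self.global_ip, global_port

    def inbound(self, global_port):
        """Return the inside destination, or None if the packet is dropped."""
        if global_port in self.cache:
            return self.cache[global_port]          # switch the addresses back
        return self.service_table.get(global_port)  # fixed mapping, if any


nat = Nat("203.0.113.5", service_table={80: ("192.168.0.2", 8080)})
source = nat.outbound("192.168.0.10", 40000)
print(source)                  # what the outside world sees
print(nat.inbound(source[1]))  # the response is mapped back inside
print(nat.inbound(80))         # allowed by the service table
print(nat.inbound(12345))      # no entry, so the packet is dropped
```

The service-table entry is what lets an outside client reach an inside web server even though no inside user started the conversation.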


Locating a Document on the Internet

Every document on the Internet has a uniform resource locator (URL) (not necessarily unique) and an IP address (not necessarily unique).

All URLs consist of four parts:

1. Service type

2. Host or domain name

3. Directory or subdirectory information

4. Filename
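The four parts listed above can be pulled out of a URL with Python's standard `urllib.parse` module. The helper name `url_parts` and the example URL are hypothetical; the dictionary keys simply mirror the four items in the list.

```python
import posixpath
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Split a URL into service type, host, directory, and filename."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "service": parts.scheme,    # 1. service type
        "host": parts.hostname,     # 2. host or domain name
        "directory": directory,     # 3. directory or subdirectory information
        "filename": filename,       # 4. filename
    }


print(url_parts("http://www.example.com/docs/chapter11/index.html"))
print(url_parts("http://psu.edu/stuff"))
```

In practice a URL may omit the directory or filename, in which case the web server decides which document to return.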


The Parts of a Uniform Resource Locator (URL)

http://psu.edu/stuff

  • http: service type
  • edu: top level domain – type of organization (often followed by a country code, e.g., .uk)
  • psu: mid level domain – name of organization
  • stuff, www.psu.edu: domains generated by organization
  • Top and mid levels are determined by assignment boards


The Parts of a Uniform Resource Locator (URL)

New domains:

.biz

.zzz

.xxx

.dog

Who controls this?

http://www.icann.org/


Locating a Document on the Internet

  • When a user, running a web browser, enters a URL, how is the URL translated into an IP address?
  • The Domain Name System (DNS) is a large, distributed database of URLs and IP addresses.
    • The ping and tracert commands perform this lookup for you when given a host name.
  • The first operation performed by DNS is to query a local database for URL/IP address information.
  • If the local server does not recognize the address, the server at the next level will be queried.

Locating a Document on the Internet

Eventually the root server for URL/IP addresses will be queried.

If the root server has the answer, the results are returned.

If the root server recognizes the domain name but not the extension in front of the domain name, the root server will query the server at the domain name’s location.

When the domain’s server returns the results, they are passed back through the chain of servers (and their caches).
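The query chain described on these two slides (local server first, then each higher level, with results cached on the way back) can be sketched with plain dictionaries. This is a deliberately simplified model: the server tables, the host name, and the address are hypothetical, and real DNS involves referrals between name servers that are not modeled here.

```python
def resolve(name, servers):
    """Look up name in a chain of server tables ordered from local to root.

    Returns (address, level_that_answered), or (None, None) if unknown.
    """
    for level, table in enumerate(servers):
        if name in table:
            address = table[name]
            # Pass the answer back down the chain, caching it at each level.
            for lower in servers[:level]:
                lower[name] = address
            return address, level
    return None, None


# Hypothetical chain: local server, next-level server, root server.
local, regional, root = {}, {}, {"www.example.com": "192.0.2.10"}
chain = [local, regional, root]

print(resolve("www.example.com", chain))  # answered at the top of the chain
print(resolve("www.example.com", chain))  # now answered from the local cache
```

The second lookup never leaves the local server, which is why caching keeps most queries away from the root.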


IP Addresses

Every device connected to the Internet has a 32-bit IP (IPv4) address associated with it. 2^32 = total addresses?

Think of the IP address as a logical address (possibly temporary), while the 48-bit address on every NIC is the physical, or permanent address.

Computers, networks and routers use the 32-bit binary address, but a more readable form is the dotted decimal notation.


IP Addresses

  • For example, the 32-bit binary address
  • 10000000 10011100 00001110 00000111 (4 octets)
  • translates to
  • 128.156.14.7 (called dotted decimal notation)
  • Range of each octet is 0-255 (2^8 = 256 values)
  • There are basically four types of IP addresses:
    • Classes A, B, C and D.
  • A particular class address has a unique network address size and a unique host address size.
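The conversion between the 32-bit binary form and dotted decimal notation is mechanical, as this short sketch shows using the example address from the slide. The function names are illustrative.

```python
def to_dotted_decimal(bits: str) -> str:
    """Convert four space-separated 8-bit groups to dotted decimal."""
    octets = bits.split()
    if len(octets) != 4 or any(len(octet) != 8 for octet in octets):
        raise ValueError("expected four 8-bit groups")
    return ".".join(str(int(octet, 2)) for octet in octets)


def to_binary(dotted: str) -> str:
    """Convert dotted decimal back to four 8-bit binary groups."""
    return " ".join(format(int(octet), "08b") for octet in dotted.split("."))


print(to_dotted_decimal("10000000 10011100 00001110 00000111"))
print(to_binary("128.156.14.7"))
```

Each octet is converted independently, which is why no decimal value in the notation can exceed 255.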

Four Basic Forms of an IP 32-bit Address

What is psu’s IP address?

Ping: psu.edu 128.118.141.56

Ping ist.psu.edu?


IP Addresses

When you examine the first decimal value in the dotted decimal notation:

All Class A addresses are in the range 0 - 127

All Class B addresses are in the range 128 - 191

All Class C addresses are in the range 192 - 223
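The ranges above translate directly into a lookup on the first octet. In this sketch the Class D range (224 - 239) and the catch-all Class E label are standard values added for completeness rather than taken from the slide, and the function name is illustrative.

```python
def address_class(address: str) -> str:
    """Return the classful address class based on the first octet."""
    first_octet = int(address.split(".")[0])
    if first_octet <= 127:
        return "A"
    if first_octet <= 191:
        return "B"
    if first_octet <= 223:
        return "C"
    if first_octet <= 239:
        return "D"   # assumption: 224 - 239, not listed on the slide
    return "E"       # assumption: 240 - 255, not listed on the slide


for sample in ("10.0.0.1", "128.118.141.56", "192.168.1.1"):
    print(sample, "is Class", address_class(sample))
```

The psu.edu address from the earlier slide starts with 128, so it falls in Class B.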


IP Subnet Masking

Sometimes you have a large number of IP addresses to manage.

By using subnet masking, you can break the host ID portion of the address into a subnet ID and host ID.

Each subnet supports a number of other hosts.

For example, the subnet mask 255.255.255.0 applied to a class B address will break the host ID (normally 16 bits) into an 8-bit subnet ID and an 8-bit host ID.
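The Class B example can be worked through with bit operations. The sketch below assumes a Class B address (16-bit network ID) and a contiguous subnet mask; the function name and the second mask used in the usage lines are illustrative, while the sample address is the psu.edu address shown earlier.

```python
import ipaddress


def split_class_b(address: str, mask: str = "255.255.255.0"):
    """Split a Class B address into (network ID, subnet ID, host ID)."""
    addr = int(ipaddress.IPv4Address(address))
    mask_bits = int(ipaddress.IPv4Address(mask))
    # Mask bits that fall inside the original 16-bit host field form the subnet ID.
    subnet_bits = bin(mask_bits & 0xFFFF).count("1")
    host_bits = 16 - subnet_bits
    network = addr >> 16
    subnet_id = (addr & 0xFFFF) >> host_bits
    host_id = addr & ((1 << host_bits) - 1)
    return f"{network >> 8}.{network & 0xFF}", subnet_id, host_id


print(split_class_b("128.118.141.56"))                   # 8-bit subnet, 8-bit host
print(split_class_b("128.118.141.56", "255.255.240.0"))  # 4-bit subnet, 12-bit host
```

Moving the mask boundary trades subnets for hosts: fewer subnet bits leave more bits for hosts within each subnet.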


The Future of the Internet

  • Various Internet committees are constantly working on new and improved protocols.
  • Examples include:
  • Internet Printing Protocol
  • Internet fax
  • Extensions to FTP
  • Common Name Resolution Protocol
  • WWW Distributed Authoring and Versioning
  • Web Services

IPv6

http://www.ipv6.org/

  • The next version of the Internet Protocol.
  • Main features include:
  • Simpler header
  • 128-bit IP addresses: 2^128 = (2^10)^12 x 2^8 ≈ (10^3)^12 x 2^8 ≈ 2 x 10^38 (the exact value is about 3.4 x 10^38)
  • Priority levels and quality of service parameters
  • No fragmentation (datagram is big!)

Internet2

http://www.internet2.edu/

  • A new form of the Internet is being developed by a number of businesses and universities.
  • Internet2 will support very high speed data streams (Gigs).
  • Applications might include:
  • Digital library services
  • Tele-immersion
  • Virtual laboratories

The Internet In Action: A Company Creates a VPN

A fictitious company wants to allow 3500 of its workers to work from home.

If all 3500 users used a dial-in service, the telephone costs would be very high.

The Internet In Action: A Company Creates a VPN

Instead, the company will require each user to access the Internet via their local Internet service provider.

This local access will help keep telephone costs low.

Then, once on the Internet, the company will provide software to support virtual private networks.

The virtual private networks will create secure connections from the users’ homes into the corporate computer system.


Your old web pages!!! Internet Archive (www.archive.org)

  • Founded in 1996 by Brewster Kahle.
  • Maintains many, many terabytes of Internet data, including snapshots of
    • World Wide Web
    • Usenet
    • Gopher
    • FTP archives
  • Goals:
    • Accumulate and preserve digital information for the long term that would otherwise be lost.
    • Provide access to researchers, journalists, historians and others.
Bow-tie Theory of the Web

200 million URLs (billion links) explored - Broder et al., WWW9 ’00

How Big is the Publicly Indexable Web?
  • Feb’99: estimate 16 million total web servers reduces to about 2.8 million servers for the publicly indexable web
  • Average number of pages per site was 289
  • Estimated total number of pages on the web about 800 million
  • Current estimate – 3 to 5 billion pages

From a random sample of IP addresses (address space 256^4 or about 4.3 billion)

Volume of Information on Web - Feb, ‘99
  • Mean page size was 19k (median 4k)
  • Total amount of data: about 15 terabytes of pages
  • About 6 terabytes after removing comments, extra whitespace, and HTML tags
  • About 63 images per server, mean image size 15k (median 6k)
  • About 180 million images on the publicly indexable web, about 3 terabytes of image data
Distribution of the content of WWW Information
  • Percentages from manually classified home pages of the first 2,500 randomly found web servers
  • 83% of sites commercial
    • Off scale for this chart ->
  • Percentage of sites in areas like science, health,and government relatively small
    • Would be feasible and very valuable to create specialized search services that are very comprehensive and up to date
  • 65% of sites have a majority of pages in English
Web Search Techniques

- 85% of users use search engines to locate information (GVU survey)

- Several search engines consistently rank in the top 10 sites accessed on the web

  • Full-text indexes
  • Hierarchical directories
  • Specialized or niche search services
  • What’s related (Alexa/Netscape)
  • Collaborative filtering
  • Notification systems
  • Softbots
Search Engines
  • Lots: over 3000? - 20 make up 98% of all searches done on the web
  • Business models are often not just search!
  • AltaVista (summer, 1998):
    • Indexes about 0.8 Tb (index about 30% of the size of the grabbed data)
    • Every word indexed
    • About 37 million queries on weekdays
    • Mean response time of 0.6 seconds
    • About 20 64-bit machines
      • 10 CPU, 625 MHz, 12Gb RAM, 300 Gb RAID (each)
  • Google (spring, 2000):
    • 2500 PCs, buy 30 a day, discard them when they break
Search Engine Architecture
  • Web crawler that crawls the web and harvests data – html, text, etc.
  • Indexer that indexes some of the crawled pages
  • Query engine that queries the index and presents results
  • Query interface
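The components listed above can be sketched end to end on a toy scale. In the example below a small dictionary stands in for the Web, and the page names, text, and links are all hypothetical; the crawler follows links, the indexer builds an inverted index of words to pages, and the query engine returns pages containing every query word. Real engines add ranking, which this sketch leaves out.

```python
from collections import defaultdict

# A hypothetical three-page "web": page name -> (text, outgoing links).
PAGES = {
    "a.html": ("Internet protocols such as TCP and IP", ["b.html"]),
    "b.html": ("Search engines index the web", ["c.html"]),
    "c.html": ("TCP ports identify applications", []),
}


def crawl(start):
    """Crawler: follow links breadth-first and return the pages found."""
    seen, queue = [], [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.append(url)
        queue.extend(PAGES[url][1])
    return seen


def build_index(urls):
    """Indexer: map each lowercase word to the set of pages containing it."""
    index = defaultdict(set)
    for url in urls:
        for word in PAGES[url][0].lower().split():
            index[word].add(url)
    return index


def query(index, text):
    """Query engine: return pages that contain every word in the query."""
    words = text.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


index = build_index(crawl("a.html"))
print(query(index, "tcp"))
print(query(index, "tcp ports"))
```

The query interface would be whatever collects the search text from the user and displays the returned list.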
[Figure: A Typical Web Search Engine. Components: Web, Crawler, Indexer, Index, Query Engine, Interface, Users]


Ways to compare search engines

  • Relevance ranking
  • Coverage (comments once seen in the press)
    • “If you can’t find it using XXX search, it’s probably not out there”
    • “HotBot is the first search robot capable of indexing and searching the entire web”
  • Recency (comment once seen in the press)
    • “[With XXX] you can find new information just about as quickly as it's available on the Web”
  • Functionality (e.g. query syntax)
  • Speed
  • Availability
  • Usability
  • Time/ability to satisfy user requests
Ranking Options

Special factors

• Conventional methods (e.g., tf.idf) were developed for homogenous collections, e.g., items of similar length

• Some items are deliberately constructed to distort indexing

Options

• Vector space ranking with corrections for document length

• Extra weighting for specific fields, e.g., title, anchors, etc.

• Link structure, e.g., Google's PageRank, Kleinberg's Hubs and Authorities

Google (Page, Brin)
  • 2nd Generation Search Engine!
  • Makes greater use of HTML structure and the graph formed by hyperlinks between pages
  • PageRank
    • Iteratively uses information about the number of pages pointing to a page in order to estimate the popularity of a page
    • Links from more popular pages count more
  • Uses the text in links to a page
    • Link descriptions may describe a page better than the page itself
  • Yahoo’s search engine

www.google.com

PageRank and Google
  • Prestige of a page is proportional to sum of prestige of citing pages
  • Standard bibliometric measure of influence
  • Simulate a random walk on the Web to precompute prestige of all pages (follow a random outlink from each page)
  • Sort keyword-matched responses by decreasing prestige
  • Example: pages p1, p2 and p3 link to p4, so p4 ∝ p1 + p2 + p3; i.e., p = Ep
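The relation p = Ep can be approximated by repeatedly redistributing prestige along links until the values settle. The sketch below is a minimal power-iteration version on the four-page example (p1, p2, p3 all linking to p4). The damping factor of 0.85 and the handling of pages with no outlinks are assumptions that do not appear on the slide, and the function is an illustration of the idea rather than Google's implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page prestige by simulating a random walk over links.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random page.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks is treated as linking to every page.
            targets = outlinks or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share  # follow a random outlink
        rank = new_rank
    return rank


links = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": []}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

Because three pages cite p4 and nothing cites the others, p4 ends up with the highest prestige and would sort first among keyword-matched results.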

Google Architecture
  • Perl with C/C++
  • Linux
  • Module-based architecture
  • Multi-machine
  • Multi-thread
Metasearch Engines or Tools

[Figure: Information Need, Query, Search Engine #1, Search Engine #2, Search Engine #3, etc., Fusion Policy, Result Set]
  • Single search engine coverage is low, maximum of 16%
    • Querying multiple can significantly improve coverage
  • Query is sent to several search engines simultaneously
    • Policies?
  • Results are fused by a fusion policy
    • Similar, but slightly different from an ordering policy
  • Fusion at many levels
Search Engine Coverage - 11 engines Feb ‘99
  • Combined coverage with respect to each other
  • With respect to each other compared to total web size
  • Combined coverage - 42%
Search Engines Sizes

searchenginewatch.com

Covered
  • Protocols – XXXP
  • URL and DNS
  • IP addresses
  • Search engines