Search and Discovery: Searching the Web

PowerPoint Slideshow about 'Search and Discovery: Searching the Web' - bernad

Presentation Transcript
Stages of a transaction
  • Discovery
    • Find what you’re interested in
      • Locate sellers
      • Locate buyers
      • Compare products
  • Negotiation
  • Exchange
Discovery
  • Encompasses:
    • Search engines
    • Recommender systems
    • Price comparison/shopping agents
    • Description languages
    • Data sources
      • Generic sources: portals, web directories
      • Domain-specific sources: catalogs, guides, etc.
    • Advertising
  • More than just finding a resource
    • Need to be able to estimate value, likelihood of successful negotiation
    • An evaluative infrastructure is required
  • Least formalized of e-commerce subareas.
  • Unlikely to have a general-purpose solution soon
    • Too complex
A Brief History of the Web
  • Prehistory:
    • Hypertext as an idea has been around since the 1940s.
      • Vannevar Bush: Memex
      • Engelbart: 1960s
    • 1987: HyperCard
      • Graphical tool allowing users to create hyperlinked documents.
    • Late 80s/early 90s: WAIS, Gopher
A Brief History of the Web
  • 1989/90: Tim Berners-Lee proposes the WWW at CERN
    • A new global information retrieval system
    • Develops HTML, a simple markup language
  • 1993: Mosaic developed at NCSA
    • Marc Andreessen then founds Netscape
  • 1993/94: NCSA httpd released
    • Open-source web server, supported CGI
    • Precursor to Apache
A Brief History of the Web
  • 1994: Banner ads appear on HotWired
    • Beginning of the commercial web
  • 1994: Yahoo founded
    • Appearance of the portal, search engine
  • 1995: NSF backbone privatized
    • AT&T, Sprint, etc. take over traffic
    • Network Solutions given a monopoly on domain names
  • 1995: Microsoft releases Internet Explorer
    • In 7 years, Netscape goes from 100% market share to 20% (2001).
A Brief History of the Web
  • 1995: AltaVista started
    • Full-text Web search
  • 1995: Andreessen first WWW billionaire
  • 1995: Sun introduces Java
    • Able to ship code and text across networks
  • 1995: eBay founded
    • First online auction
  • 1995-98: Explosive growth
    • Many new formats, applications, companies
  • 1998: Akamai founded (web caching)
A Brief History of the Web
  • 1998: ICANN governs names & addresses
  • 1998: MP3 format popularized
    • MP3 files are small enough to make audio distribution practical
    • WinAmp released
  • 1998: Google founded.
  • 2000: Napster appears
    • Beginnings of peer-to-peer technology, file sharing
  • 2000(ish): End of the boom
    • Consolidation, reduction in growth
Lessons from Radio
  • Radio was popularized in the 1920s
    • Originally intended as a one-to-one messaging system.
    • Fee-for-use pay structure.
  • 1922: Explosive growth begins
    • RCA’s revenues from sales of receivers doubled each year
    • Broadcast model becomes prevalent
    • Thousands of broadcasters emerge
Lessons From Radio
  • 1922-1924: Transition
    • How to make money broadcasting?
      • Support sale of receivers
      • Goodwill (sponsors)
      • Public good – supported as a non-profit
      • Advertising
      • Tube tax/set tax (a la BBC)
    • By 1924, stations are failing as quickly as they start.
Lessons From Radio
  • Affordable content driven by audience size
  • “Rich-get-richer” for large stations
  • 1926: RCA launches NBC
    • First nationwide broadcast
    • Creates the network system
      • National content, local broadcasting
    • Advertising the dominant revenue generator
  • WWW questions:
    • Who will be NBC?
    • What will the revenue model be?
      • Advertising? Competition with TV, radio for this revenue.
      • Micropayments? Subscriptions? Content aggregation?
Searching the Web
  • Web growth estimated at 1000% in late 90s.
  • Can search engines keep up with this growth?
  • How to deal with the dynamic nature of the web?
    • Page contents change
    • Pages appear, disappear, move
    • Link structure changes
Search Engines
  • Most common form of discovery
  • Crawl the web to collect pages
  • Stored and indexed for easy retrieval
  • Query languages simple
  • Goals:
    • Fast retrieval (Google gets 150 million queries per day)
    • Accurate (no dead links)
    • Precise (pages match user’s needs)
  • Outward link
    • Object that a page links to
  • Outdegree: number of outward links
  • Inward link
    • Pages that link to an object
  • Indegree: number of inward links
  • Path
    • Series of outward links from A to B
The Web as a Directed Graph
  • We can represent the web as a directed graph.
    • Sites are nodes
    • Links are edges.
  • Outward link
    • Object that a page links to
  • Inward link
    • Pages that link to an object
Adjacency Matrix
  • We can also represent the Web as a very large adjacency matrix.
  • The eigenvector of this matrix illustrates the clusteredness of the Web
    • Distribution of in-degree and out-degree
    • Connectedness
    • Some ranking algorithms (HITS) use this measure.
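As a rough illustration of the matrix view, the sketch below builds an adjacency matrix for a hypothetical four-page web and reads the degree counts off its rows and columns. The page names and link lists are invented for the example; a real web-scale matrix would need a sparse representation.

```python
# Hypothetical four-page web; pages[i] links to pages[j] when matrix[i][j] == 1.
pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

index = {page: i for i, page in enumerate(pages)}
n = len(pages)

# Build the n x n adjacency matrix from the link lists.
matrix = [[0] * n for _ in range(n)]
for source, targets in links.items():
    for target in targets:
        matrix[index[source]][index[target]] = 1

# Row sums give out-degrees; column sums give in-degrees.
out_degree = {p: sum(matrix[index[p]]) for p in pages}
in_degree = {p: sum(row[index[p]] for row in matrix) for p in pages}
```

Here page C has the highest in-degree and page D has none, which is the kind of distribution the slide refers to.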
Web structure
  • Web can be broken into four areas (Kleinberg/Lawrence)
    • Core: a path exists between any two pages.
    • Upstream: can reach the core, but cannot be reached from it.
    • Downstream: can be reached from the core, but cannot reach it.
    • Tendrils/islands: disconnected from the core.
  • Areas (allegedly) have roughly equal size.
  • Search engines claim they index a large fraction of the web.
  • How to verify this?
  • Run queries on many engines and compare number of hits.
    • May return irrelevant documents
    • Documents may no longer exist
    • Documents may have changed
  • NEC (1998) – Estimate size of web, coverage for major search engines.
    • Query each engine, retrieve and compare all results (only exact matches).
  • Coverage estimates:
    • HotBot: 57%, AltaVista: 46%
    • NorthernLight: 33%, Excite: 23%
    • Infoseek: 16%, Lycos: 4%
Estimating the size of the indexable web
  • Overlap in coverage was used to estimate size.
  • Let A and B be the sets of pages indexed by two engines. $|A \cap B| / |B|$ serves as an estimate of $|A| / N$, where N is the size of the Web.
  • 1998: AltaVista/HotBot estimate: 320 million pages.
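The overlap argument can be turned into a small calculation. The sketch below assumes two engines index the Web independently, so the fraction of engine B's pages that engine A also holds approximates A's share of the whole Web. The function names and the page counts are made up for illustration; they are not the figures from the 1998 study.

```python
def estimate_web_size(size_a, size_b, overlap):
    """Estimate N from |A|, |B| and |A ∩ B|, assuming independent indexes."""
    if overlap == 0:
        raise ValueError("no overlap: cannot estimate N")
    # overlap / size_b estimates size_a / N, so N is about size_a * size_b / overlap.
    return size_a * size_b / overlap


def coverage(size, n):
    """Fraction of the estimated Web that an engine of the given size covers."""
    return size / n


# Invented counts: A indexes 100 pages, B indexes 80, and 40 pages are in both.
n_estimate = estimate_web_size(100, 80, 40)
coverage_a = coverage(100, n_estimate)
```

With these invented counts, half of B's pages are also in A, so A is taken to cover half the Web and the estimate of N is twice A's size.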

Using size to refine coverage estimates (1997)
  • This value can then be used to determine a coverage estimate for each engine.
  • For each pair, solve for N.
  • Assume real N is largest found.
  • Updated coverage estimates:
    • HotBot: 34%, AltaVista: 28%
    • NorthernLight: 20%, Excite: 14%
    • Infoseek: 10%, Lycos: 3%
Updates (1999)
  • Web growth ahead of indexing
    • No search engine covers more than 16% of the Web.
    • Union of all engines: ~50% coverage
    • Estimated size: 800 million pages
    • Search engines more likely to index authorities
    • More likely to index US, commercial sites.
Updates (12/2001)
  • Self-reported number of pages indexed:
    • Google: 2 billion (3 billion+ today)
    • FAST: 625 million (claimed 2.1 billion in 2002)
    • AltaVista: 550 million
    • Inktomi: 500 million
    • NorthernLight: 390 million
Indexing the web
  • Spiders are used to crawl the web and collect pages.
    • A page is downloaded and its outward links are found.
    • Each outward link is then downloaded.
    • Exceptions:
      • Links from CGI interfaces
      • Robot Exclusion Standard
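The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: `FAKE_WEB`, `ROBOTS_TXT`, `LinkExtractor`, and `crawl` are hypothetical names, and an in-memory dictionary stands in for real HTTP fetches so the example runs offline. It honors robots.txt rules through the standard library's `urllib.robotparser`, and makes no attempt to handle links generated by CGI interfaces.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/private/x.html">X</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.test/b.html": "<p>no links here</p>",
    "http://example.test/private/x.html": '<a href="/secret.html">secret</a>',
}
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def crawl(seed, fetch=FAKE_WEB.get, agent="demo-spider"):
    """Breadth-first crawl from seed; returns {url: html} for fetched pages."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)  # Robot Exclusion Standard rules
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue  # dead link
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


crawled = crawl("http://example.test/")
```

Because the page under `/private/` is never fetched, the link it contains is never discovered either, which is one reason crawler coverage falls short of the full web.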
Indexing the Web
  • “Stop words” stripped from page
  • Forward index created
    • Bundles words
    • Maps each document to the words it contains.
  • Can use TF-IDF to map only “significant” keywords
    • Term Frequency × Inverse Document Frequency
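To make the weighting concrete, here is a small sketch of stop-word removal followed by one common TF-IDF variant (term frequency normalized by document length, times the natural log of N over document frequency). The stop-word list, the three sample documents, and the function names are all invented for illustration; production systems use other variants and much larger stop lists.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on whitespace, and strip stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(docs):
    """Return {doc_id: {word: score}} using tf * log(N / df)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for words in tokens.values() for word in set(words))
    scores = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        scores[doc_id] = {
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores


docs = {
    "d1": "the history of the web",
    "d2": "searching the web",
    "d3": "the history of radio",
}
scores = tf_idf(docs)
```

In document d2, "searching" scores higher than "web" because "web" also appears in d1, so it says less about what makes d2 distinctive.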
Indexing the web
  • An inverted index is created
    • Forward index sorted according to word
    • Maps keywords to URLs
  • Some wrinkles:
    • Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding
    • Semantic similarity
      • Words with similar meanings share an index.
  • Issue: trading coverage (number of hits) for precision (how closely hits match request)
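The step from forward index to inverted index can be sketched as follows. The `stem` function is a deliberately crude stand-in for a real stemmer, and the URLs and page text are hypothetical; the point is only to show how normalizing words makes more pages match a keyword, which raises coverage at some cost in precision.

```python
from collections import defaultdict


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    word = word.lower()  # case folding
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def forward_index(pages):
    """Map each URL to the list of normalized words it contains."""
    return {url: [stem(w) for w in text.split()] for url, text in pages.items()}


def invert(forward):
    """Re-key the forward index by word: word -> sorted list of URLs."""
    inverted = defaultdict(set)
    for url, words in forward.items():
        for word in words:
            inverted[word].add(url)
    return {word: sorted(urls) for word, urls in sorted(inverted.items())}


pages = {
    "http://example.test/1": "Crawling spiders",
    "http://example.test/2": "spider crawls links",
}
index = invert(forward_index(pages))
```

After stemming, both pages are listed under "crawl" and "spider", even though neither page contains both of those exact strings.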
Indexing Issues
  • Indexing techniques were designed for static collections
  • How to deal with pages that change?
    • Periodic crawls, rebuild index.
    • Varied frequency crawls
      • Records need a way to be “purged”
      • Hash of page stored
  • Can use the text of a link to a page to help label that page.
    • Helps eliminate the addition of spurious keywords.
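One plausible reading of "hash of page stored" is change detection: on each recrawl, a page is re-indexed only if its hash differs from the stored one, and its record is purged if the page has disappeared. The sketch below assumes that reading; `fingerprint` and `recrawl` are hypothetical helpers, and SHA-256 is simply a convenient choice of hash.

```python
import hashlib


def fingerprint(content):
    """Hash of the page text; equal hashes are treated as 'unchanged'."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def recrawl(stored_hashes, fetched):
    """Compare a fresh crawl against stored hashes.

    stored_hashes: {url: hash} from the previous crawl (updated in place)
    fetched: {url: page text}; a value of None means the page is gone
    Returns (urls to re-index, urls to purge).
    """
    changed, purged = [], []
    for url, content in fetched.items():
        if content is None:
            purged.append(url)
            stored_hashes.pop(url, None)
        elif stored_hashes.get(url) != fingerprint(content):
            changed.append(url)
            stored_hashes[url] = fingerprint(content)
    return changed, purged


stored = {"u1": fingerprint("old"), "u2": fingerprint("same"), "u3": fingerprint("x")}
changed, purged = recrawl(stored, {"u1": "new", "u2": "same", "u3": None, "u4": "fresh"})
```

Only the modified page and the newly seen page need re-indexing; the unchanged page is skipped and the vanished page is dropped.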
Indexing Issues
  • Availability and speed
    • Most search engines will cache the page being referenced.
  • Multiple search terms
    • OR: separate searches concatenated
    • AND: intersection of searches computed.
    • Regular expressions not typically handled.
  • Parsing
    • Must be able to handle malformed HTML, partial documents
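The handling of multiple search terms can be illustrated with set operations over an inverted index. This is a sketch under simple assumptions: the index is a plain dictionary from keyword to a list of page identifiers, and the page names are invented.

```python
def search_or(index, terms):
    """OR: union of the result lists for each term."""
    result = set()
    for term in terms:
        result |= set(index.get(term, []))
    return sorted(result)


def search_and(index, terms):
    """AND: intersection of the result lists for each term."""
    postings = [set(index.get(term, [])) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical inverted index: keyword -> pages containing it.
index = {"web": ["p1", "p2", "p3"], "search": ["p2", "p3"], "radio": ["p4"]}

both = search_and(index, ["web", "search"])
either = search_or(index, ["search", "radio"])
```

AND narrows the result set and OR widens it, which is the same coverage-versus-precision trade-off that shows up in indexing.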
  • Google uses PageRank to determine relevance.
  • Based on the “quality” of a page’s inward links.
  • Sum the PageRanks of the pages that point to a given page, each divided by its outdegree.
  • Let p be a page, with pages $T_1, \dots, T_n$ linking to p.
  • $PR(p) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{\mathrm{out}(T_i)}$
  • d is a ‘damping’ factor.
  • PR ‘propagates’ through a graph.
  • Justification:
    • Imagine a random surfer who keeps clicking through links.
      • With probability d she follows a link; with probability 1 − d she starts a new search.
    • Or …
    • A page has a high ranking if highly ranked pages point to it.
    • Pros: difficult to game the system
    • Cons: Creates a “rich get richer” web structure where highly popular sites grow in popularity.
  • HITS is also commonly used for document ranking.
  • Gives each page a hub score and an authority score
    • A good authority is pointed to by many good hubs.
    • A good hub points to many good authorities.
    • Users want good authorities.
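The PageRank formula given earlier in this list can be evaluated by simple iteration: start every page at 1.0 and reapply the formula until the values settle. The sketch below does this for a hypothetical four-page graph. The damping value d = 0.85 and the fixed iteration count are assumptions for the example, and the code assumes every page has at least one outward link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(T) / out(T)) over pages T linking to p."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Each page T that links to p passes on an equal share of its rank.
            incoming = sum(
                pr[t] / len(links[t]) for t in pages if p in links.get(t, [])
            )
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: each key links to the pages in its list.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

Page C, which three pages link to, ends up with the highest rank, while page D, with no inward links, keeps only the baseline 1 − d. This shows rank propagating through the graph.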
Issues with Ranking Algorithms
  • Spurious keywords and META tags
  • Users reinforcing each other
    • Increases “authority” measure
  • Topic drift
    • Many hubs link to more than one topic
Web structure
  • Structure is important for:
    • Predicting traffic patterns
      • Who will visit a site?
      • Where will visitors arrive from?
      • How many visitors can you expect?
    • Estimating coverage
      • Is a site likely to be indexed?
  • Compact
    • Short paths between sites
    • “Small world” phenomenon
      • Average path length is small relative to the size of the Web
    • Number of inward and outward links follows a power law.
  • Mechanism: preferential attachment
    • As new sites arrive, the probability of gaining an inward link is proportional to in-degree.
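A toy simulation can show how preferential attachment produces a skewed in-degree distribution. In this sketch each new site links to one existing site chosen with probability proportional to in-degree plus one. The "+1" is an assumption added so that sites with no inward links can still be chosen; `grow_web` and its parameters are hypothetical.

```python
import random
from collections import Counter


def grow_web(n_sites, seed=0):
    """Add sites one at a time; each links to one existing site chosen
    with probability proportional to (in-degree + 1)."""
    rng = random.Random(seed)
    in_degree = [0]  # site 0 starts alone
    # 'tickets' holds one entry per site plus one per inward link, so a
    # uniform draw from it is proportional to in-degree + 1.
    tickets = [0]
    for new_site in range(1, n_sites):
        target = rng.choice(tickets)
        in_degree[target] += 1
        tickets.append(target)    # extra ticket for the new inward link
        in_degree.append(0)
        tickets.append(new_site)  # baseline ticket for the new site
    return in_degree


degrees = grow_web(2000)
histogram = Counter(degrees)  # in-degree -> number of sites
```

Most sites end up with no inward links at all while a few accumulate many, which is the heavy-tailed shape a power law describes.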
Power laws and small worlds
  • Power laws occur everywhere in nature
    • Distribution of site sizes, city sizes, incomes, word frequencies
    • Networks that grow by preferential attachment tend to evolve according to a power law.
  • Small-world phenomenon
    • “Neighborhoods” will be joined by a common member
    • Hubs serve to connect neighborhoods
    • Linkage is closer than one might expect
    • Six Degrees of Separation, Kevin Bacon
Local structure
  • More diverse than a power law
  • Pages with similar topics self-organize into communities
    • Short average path length
    • High link density
    • Webrings
    • Inverse: Does a high link density imply the existence of a community?
    • Can this be used to study the emergence and growth of web communities?
Hubs and Authorities
  • Common community structure
    • Hubs
      • Many outward links
      • Lists of resources
    • Authorities
      • Many inward links
      • Provide resources, content
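The mutual reinforcement between hubs and authorities can be sketched as an iterative computation in the style of HITS: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The graph, the page names, and the fixed iteration count below are assumptions for illustration.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {
            p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages
        }
        # Hub score: sum of authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical community: two resource lists pointing at three content sites.
links = {
    "list1": ["siteA", "siteB"],
    "list2": ["siteA", "siteB", "siteC"],
    "siteA": [],
    "siteB": [],
    "siteC": [],
}
hub, auth = hits(links)
```

The resource lists come out as hubs and the content sites as authorities, with the sites listed by both hubs scoring above the site listed by only one.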
Hubs and Authorities
  • Analysis of link structure suggests over 100,000 Web communities
  • Often not categorized by portals

Web Communities
  • Alternate definition
    • Each member has more links to community members than to non-community members.
    • Extension of a clique.
    • Can be discovered with network flow algorithms.
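The definition above can be checked directly for a candidate set of pages. The sketch below does only that check; it does not implement the network-flow discovery step. It assumes links are treated as undirected and counts distinct neighbors, which is one possible reading of "links to". The graph and the function name are hypothetical.

```python
def is_community(links, members):
    """True if every member has more links (in either direction) to
    members than to non-members."""
    members = set(members)
    for m in members:
        # Neighbors: pages m links to, plus pages that link to m.
        neighbors = set(links.get(m, []))
        neighbors |= {p for p, targets in links.items() if m in targets}
        neighbors.discard(m)  # ignore self-links
        inside = len(neighbors & members)
        outside = len(neighbors - members)
        if inside <= outside:
            return False
    return True


# Hypothetical graph: a, b, c are tightly linked; c also links out to x.
links = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "x"],
    "x": ["y"],
    "y": ["x"],
}
tight = is_community(links, {"a", "b", "c"})
loose = is_community(links, {"c", "x"})
```

The set {a, b, c} qualifies even though c has one outside link, while {c, x} does not, since c has more neighbors outside that set than inside it.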