CS 430: Information Discovery

Lecture 20

Web Search 2


Course Administration

• Outstanding queries on Assignment 2 have been answered.

• Wording change made to Assignment 3: output need not be in Web format.


Effective Information Retrieval

1. Comprehensive metadata with Boolean retrieval (e.g., monograph catalog).

Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.

2. Full text indexing with ranked retrieval (e.g., news articles).

Excellent for relatively homogeneous material, but requires available full text.

Neither of these methods is very effective when applied directly to the Web.


New concepts in Web Searching

  • Goal of search is redefined.

  • Concept of relevance is changed.

  • Browsing is tightly connected to searching.

  • Contextual information is used as an integral part of the search.


Indexing Goals: Precision

Short queries applied to very large numbers of items lead to large numbers of hits.

• Goal is that the first 10-100 hits presented should satisfy the user's information need

-- requires ranking hits in order that fits user's requirements

• Recall is not an important criterion

Completeness of index is not an important factor.

• Comprehensive crawling is unnecessary


Concept of Relevance

Document measures

Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document.

Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity.

Web search engines rank documents by a combination of relevance and importance. The goal is to present the user with the most important of the relevant documents.


Ranking Options

1. Paid advertisers

2. Manually created classification

3. Vector space ranking with corrections for document length

4. Extra weighting for specific fields, e.g., title, anchors, etc.

5. Popularity, e.g., PageRank

The balance between 3, 4, and 5 is not made public.


Browsing and Searching

Searching is followed by browsing.

Browsing the hit list:

helpful summary records (snippets)

removal of duplicates

grouping results from a single site

Browsing the Web pages themselves:

direct links from the snippets to the pages

cache with highlights

translation in same format


Browsing and Searching

Query: Cornell sports

LII: Law about...Sports...sports law: an overview. Sports Law encompasses a multitude areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ...www.law.cornell.edu/topics/sports.html

Query: NCAA Tarkanian

LII: Law about...Sports... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ...www.law.cornell.edu/topics/sports.html


Contextual information

Information about a document:

  • Content (terms, formatting, etc.)

  • Metadata (externally created following rules)

  • Context (citations and links, reviews, annotations, etc.)

    Context has many uses:

  • Selecting documents to index

  • Retrieval clues (e.g., href text)

  • Ranking


Effective Information Retrieval (cont)

3. Full text indexing with contextual information and ranked retrieval (e.g., Google, Teoma).

Excellent for mixed textual information with rich structure.

4. Contextual information with non-textual materials and ranked retrieval (e.g., Google image retrieval).

Promising, but still experimental.


Scalability

[Figure: The growth of the web, 1994 to 2000. The vertical axis is logarithmic, from 1 to 10,000,000,000.]


Scalability

Web search services are centralized systems

• Over the past 9 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function.

• Will this continue?

• Possible areas for concern are: staff costs, telecommunications costs, disk access rates.


Cost Example (Google)

85 people

50% technical, 14 Ph.D. in Computer Science

Equipment

2,500 Linux machines

80 terabytes of spinning disks

30 new machines installed daily

Reported by Larry Page, Google, March 2000

At that time, Google was handling 5.5 million searches per day

Increase rate was 20% per month

By fall 2002, Google had grown to over 400 people.
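
As a rough illustration of what a 20% monthly increase implies, the calculation below assumes (the talk does not claim this) that the rate stays constant at r = 0.20 per month. It gives the doubling time and the growth factor over twelve months:

```latex
\[
t_{\text{double}} = \frac{\ln 2}{\ln(1 + r)} = \frac{\ln 2}{\ln 1.2} \approx 3.8 \text{ months},
\qquad
(1 + r)^{12} = 1.2^{12} \approx 8.9 .
\]
```

Under that assumption, 5.5 million searches per day would grow to roughly 49 million per day one year later.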


Scalability: Staff

Programming: Have very well trained staff. Isolate complex code. Most coding is single image.

System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers).

Customer service: Automate everything possible, but complaints, large collections, etc. require staff.


Scalability: Performance

Very large numbers of commodity computers

Algorithms and data structures scale linearly

  • Storage

    • Scale with the size of the Web

    • Compression/decompression

  • System

    • Crawling, indexing, sorting simultaneously

  • Searching

    • Bounded by disk I/O


Bibliometrics

Techniques that use citation analysis to measure the similarity of journal articles or their importance

Bibliographic coupling: two papers that cite many of the same papers

Co-citation: two papers that were cited by many of the same papers

Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period
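
The two pairwise measures above can be sketched in a few lines. This is a minimal illustration on made-up citation data; the paper names and the `cites` dictionary are hypothetical and not taken from the lecture.

```python
# Hypothetical citation data: each paper maps to the set of papers it cites.
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Z"},
    "X": set(),
    "Y": set(),
    "Z": set(),
}


def bibliographic_coupling(p, q):
    """Number of papers that both p and q cite."""
    return len(cites[p] & cites[q])


def co_citation(p, q):
    """Number of papers that cite both p and q."""
    return sum(1 for refs in cites.values() if p in refs and q in refs)


print(bibliographic_coupling("A", "B"))  # A and B both cite X and Y
print(co_citation("X", "Y"))             # X and Y are both cited by A and B
```

Higher counts suggest that two papers are more closely related; how the counts are normalized is a design choice that this sketch leaves out.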


Citation Graph

[Figure: citation graph showing a paper, the papers it cites, and the papers it is cited by.]

Note that journal citations always refer to earlier work.


Graphical Analysis of Hyperlinks on the Web

[Figure: directed graph of six pages, numbered 1 to 6. One page is labeled "This page links to many other pages (hub)" and another is labeled "Many pages link to this page (authority)".]


PageRank Algorithm

Used to estimate importance of documents.

Concept:

The rank of a web page is higher if many pages link to it.

Links from highly ranked pages are given greater weight than links from less highly ranked pages.


Intuitive Model (Basic Concept)

A user:

1. Starts at a random page on the web

2. Selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Step 2 a very large number of times

Pages are ranked according to the relative frequency with which they are visited.
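
The random-surfer model can be simulated directly. The sketch below uses a hypothetical three-page web (the `links` dictionary is an assumption, not the lecture's example) in which every page has at least one outgoing link, so the walk never gets stuck.

```python
import random
from collections import Counter

# Hypothetical three-page web: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def random_surfer(links, steps, seed=0):
    """Follow random hyperlinks and return each page's visit frequency."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))      # Step 1: start at a random page
    visits = Counter()
    for _ in range(steps):                # Step 3: repeat many times
        page = rng.choice(links[page])    # Step 2: follow a random link
        visits[page] += 1
    return {p: visits[p] / steps for p in links}


freq = random_surfer(links, steps=100_000)
for page in sorted(freq, key=freq.get, reverse=True):
    print(page, round(freq[page], 3))
```

For this particular graph the frequencies should approach 0.4 for A, 0.2 for B, and 0.4 for C, which is the stationary distribution of the walk.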


Matrix Representation

Rows are cited pages (to); columns are citing pages (from).

          P1  P2  P3  P4  P5  P6   Number
P1         0   0   0   0   1   0      1
P2         1   0   1   0   0   0      2
P3         1   1   0   1   0   0      3
P4         1   1   0   0   1   1      4
P5         1   0   0   0   0   0      1
P6         0   0   0   0   1   0      1
Number     4   2   1   1   3   1
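
A link matrix of this kind can be built from a list of (citing, cited) pairs. In the sketch below the edge list is an assumption: it is reconstructed from the row and column totals on this slide together with the normalized values and weight vectors on the following slides, since the original figure is not available.

```python
# Edges as (citing page, cited page), reconstructed from the slides (assumption).
edges = [
    (1, 2), (1, 3), (1, 4), (1, 5),
    (2, 3), (2, 4),
    (3, 2),
    (4, 3),
    (5, 1), (5, 4), (5, 6),
    (6, 4),
]
n = 6

# A[i][j] = 1 if page j+1 links to page i+1 (rows: cited, columns: citing).
A = [[0] * n for _ in range(n)]
for src, dst in edges:
    A[dst - 1][src - 1] = 1

links_in = [sum(row) for row in A]                               # row totals
links_out = [sum(A[i][j] for i in range(n)) for j in range(n)]   # column totals

print(links_in)   # [1, 2, 3, 4, 1, 1]
print(links_out)  # [4, 2, 1, 1, 3, 1]
```

The row totals count links into each page and the column totals count links out of each page, matching the two "Number" margins of the matrix.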


Basic Algorithm: Normalize by Number of Links from Page

Normalized link matrix B (rows are cited pages; columns are citing pages):

          P1    P2    P3    P4    P5    P6
P1        0     0     0     0     0.33  0
P2        0.25  0     1     0     0     0
P3        0.25  0.5   0     1     0     0
P4        0.25  0.5   0     0     0.33  1
P5        0.25  0     0     0     0     0
P6        0     0     0     0     0.33  0
Number    4     2     1     1     3     1
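
The normalization step divides each column by the number of links out of the citing page, so that every column sums to 1. A minimal sketch, assuming the same reconstructed edge list for the six-page example (the list is inferred from the slides, not given in them):

```python
# Edges as (citing page, cited page), reconstructed from the slides (assumption).
edges = [
    (1, 2), (1, 3), (1, 4), (1, 5),
    (2, 3), (2, 4),
    (3, 2),
    (4, 3),
    (5, 1), (5, 4), (5, 6),
    (6, 4),
]
n = 6

# Count the links out of each citing page.
out_count = [0] * n
for src, _ in edges:
    out_count[src - 1] += 1

# B[i][j] = 1 / (links out of page j+1) if page j+1 links to page i+1, else 0.
B = [[0.0] * n for _ in range(n)]
for src, dst in edges:
    B[dst - 1][src - 1] = 1.0 / out_count[src - 1]

for row in B:
    print("  ".join(f"{x:.2f}" for x in row))

column_sums = [sum(B[i][j] for i in range(n)) for j in range(n)]
```

Because each column sums to 1, multiplying a weight vector by B redistributes the weights without changing their total.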


Basic Algorithm: Weighting of Pages

Initially all pages have weight 1:

w_1 = (1, 1, 1, 1, 1, 1)

Recalculate weights:

w_2 = B w_1 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)


Basic Algorithm: Iterate

Iterate: w_k = B w_{k-1}

        w_1    w_2    w_3    w_4    ...  converges to  w
P1      1      0.33   0.08   0.03                      0.00
P2      1      1.25   1.83   2.80                      2.39
P3      1      1.75   2.79   2.06                      2.39
P4      1      2.08   1.12   1.05                      1.19
P5      1      0.25   0.08   0.02                      0.00
P6      1      0.33   0.08   0.03                      0.00
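
The iteration can be reproduced with a short program. This is a sketch under the assumption that the six-page example has the edge list below, which is reconstructed from the slides rather than stated in them. It uses exact fractions such as 1/3 instead of the rounded 0.33, so later values can differ from the slide's in the last digit (the limit comes out as 2.4, 2.4, 1.2 rather than 2.39, 2.39, 1.19).

```python
# Edges as (citing page, cited page), reconstructed from the slides (assumption).
EDGES = [
    (1, 2), (1, 3), (1, 4), (1, 5),
    (2, 3), (2, 4),
    (3, 2),
    (4, 3),
    (5, 1), (5, 4), (5, 6),
    (6, 4),
]


def normalized_link_matrix(edges, n):
    """B[i][j] = 1 / (links out of page j+1) if page j+1 links to page i+1."""
    out_count = [0] * n
    for src, _ in edges:
        out_count[src - 1] += 1
    B = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        B[dst - 1][src - 1] = 1.0 / out_count[src - 1]
    return B


def step(B, w):
    """One iteration: return the matrix-vector product B w."""
    n = len(w)
    return [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]


B = normalized_link_matrix(EDGES, 6)
w1 = [1.0] * 6
w2 = step(B, w1)
w3 = step(B, w2)

# Keep iterating; 200 further steps is far more than this small graph needs.
w = w3
for _ in range(200):
    w = step(B, w)

print([round(x, 2) for x in w2])  # [0.33, 1.25, 1.75, 2.08, 0.25, 0.33]
print([round(x, 2) for x in w])
```

All of the weight ends up on pages 2, 3, and 4, while pages 1, 5, and 6 fall to zero.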


Graphical Analysis of Hyperlinks on the Web

[Figure: the same directed graph of six pages, numbered 1 to 6.]

There is no link out of {2, 3, 4}


Google PageRank with Damping

A user:

1. Starts at a random page on the web

2a. With probability p, selects any random page and jumps to it

2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Steps 2a and 2b a very large number of times

Pages are ranked according to the relative frequency with which they are visited.


The PageRank Iteration

The basic method iterates using the normalized link matrix, B.

w_k = B w_{k-1}

The limit, w, is the principal eigenvector of B (the eigenvector belonging to its largest eigenvalue).

Google iterates using a damping factor. The method iterates using a matrix B', where:

B' = dN + (1 - d)B

N is the matrix with every element equal to 1/n.

d is a constant found by experiment.
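
A minimal sketch of the damped iteration with B' = dN + (1 - d)B follows. Two things are assumptions: the edge list, which is reconstructed from the six-page example in the earlier slides, and the value d = 0.15, which is illustrative only, since the lecture says just that d is found by experiment. Because every row of N is identical, the product N w is the same for every page, namely the sum of the weights divided by n.

```python
# Edges as (citing page, cited page), reconstructed from the slides (assumption).
EDGES = [
    (1, 2), (1, 3), (1, 4), (1, 5),
    (2, 3), (2, 4),
    (3, 2),
    (4, 3),
    (5, 1), (5, 4), (5, 6),
    (6, 4),
]
D = 0.15  # illustrative damping constant, not a value from the lecture


def normalized_link_matrix(edges, n):
    """B[i][j] = 1 / (links out of page j+1) if page j+1 links to page i+1."""
    out_count = [0] * n
    for src, _ in edges:
        out_count[src - 1] += 1
    B = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        B[dst - 1][src - 1] = 1.0 / out_count[src - 1]
    return B


def damped_step(B, w, d):
    """One iteration with B' = dN + (1 - d)B, where every entry of N is 1/n."""
    n = len(w)
    jump = d * sum(w) / n  # contribution of dN, identical for every page
    return [jump + (1 - d) * sum(B[i][j] * w[j] for j in range(n))
            for i in range(n)]


B = normalized_link_matrix(EDGES, 6)
w = [1.0] * 6
for _ in range(200):
    w = damped_step(B, w, D)

print([round(x, 2) for x in w])
```

With damping, pages 1, 5, and 6 keep a small positive weight instead of dropping to zero, because the random jump can always reach them.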


Google: PageRank

The Google PageRank algorithm is usually written with the following notation.

Page A has pages T_i pointing to it.

  • d: damping factor

  • C(A): number of links out of A

    Iterate until:
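
The formula itself is missing from this transcript. The form usually quoted from Brin and Page's description of PageRank, given here as an assumption about what the slide showed, is:

```latex
\[
PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}
\]
```

In this notation d multiplies the link-following term, so it plays the role that (1 - d) plays in B' = dN + (1 - d)B. The iteration stops when the values PR(A) no longer change between passes.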


Information Retrieval Using PageRank

Simple Method

Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal.

Display the hits ranked by PageRank.

The disadvantage of this method is that it pays no attention to how closely a document matches a query.
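
The simple method can be sketched as a filter followed by a sort. The documents and PageRank values below are hypothetical, chosen to show the disadvantage: the only document that matches both query terms ends up last.

```python
# Hypothetical documents and precomputed PageRank values (not from the lecture).
docs = {
    "d1": "cornell sports news",
    "d2": "sports law overview",
    "d3": "cornell library catalog",
}
pagerank = {"d1": 0.2, "d2": 0.5, "d3": 0.3}


def simple_rank(query, docs, pagerank):
    """Treat every document sharing a term with the query as an equal hit,
    then order the hits by PageRank alone."""
    query_terms = set(query.lower().split())
    hits = [d for d, text in docs.items()
            if query_terms & set(text.lower().split())]
    return sorted(hits, key=lambda d: pagerank[d], reverse=True)


print(simple_rank("cornell sports", docs, pagerank))  # ['d2', 'd3', 'd1']
```

Here d1 contains both "cornell" and "sports" but is ranked below d2 and d3, which each contain only one of the terms.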


Reference Pattern Ranking using Dynamic Document Sets

PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries.

Concept of dynamic document sets. Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections.

With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.


Reference Pattern Ranking using Dynamic Document Sets

Teoma Dynamic Ranking Algorithm (used in Ask Jeeves)

1. Search using conventional term weighting. Rank the hits using similarity between query and documents.

2. Select the highest ranking hits (e.g., top 5,000 hits).

3. Carry out PageRank or similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.

4. Display the results ranked in the order of the reference patterns calculated.
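
The four steps can be sketched as follows. This is not Teoma's actual algorithm, whose details are not given here; it is an illustration that uses a damped PageRank on the subgraph of top hits. The similarity scores, the links, the jump probability d = 0.15, and the rule of spreading a page's weight over the whole set when it has no links inside the set are all assumptions.

```python
def pagerank(links, d=0.15, iterations=100):
    """Damped PageRank on a small graph; links maps page -> pages it links to.
    d is the probability of a random jump (illustrative value)."""
    pages = sorted(links)
    n = len(pages)
    w = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: d / n for p in pages}
        for p in pages:
            # A page with no links inside the set spreads its weight evenly.
            targets = links[p] or pages
            share = (1 - d) * w[p] / len(targets)
            for t in targets:
                new[t] += share
        w = new
    return w


def dynamic_rank(similarity, links, top_k):
    """Steps 1-4: take the top_k hits by similarity, rank them by link structure."""
    top = sorted(similarity, key=similarity.get, reverse=True)[:top_k]
    keep = set(top)
    # Keep only the links that stay inside the selected set of hits.
    sub = {p: [t for t in links.get(p, []) if t in keep] for p in top}
    ranks = pagerank(sub)
    return sorted(top, key=ranks.get, reverse=True)


# Hypothetical query results and links among them.
similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["b"]}

result = dynamic_rank(similarity, links, top_k=3)
print(result)  # ['c', 'a', 'b']
```

Document d is dropped in step 2, so its link to b is ignored, and c moves to the top because both other hits link to it.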


Combining Term Weighting with Reference Pattern Ranking

Combined Method

1. Find all documents that share a term with the query vector.

2. The similarity, using conventional term weighting, between the query and document j is s_j.

3. The rank of document j using PageRank or other reference pattern ranking is p_j.

4. Calculate a combined rank c_j = λ s_j + (1 - λ) p_j, where λ is a constant.

5. Display the hits ranked by c_j.

This method is used in several commercial systems, but the details have not been published.
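
A sketch of the combined method follows, with the weighting constant called `lam` in the code. The scores are hypothetical, and both are assumed to lie on a comparable scale (0 to 1); how commercial systems scale and weight them is, as noted, unpublished.

```python
def combined_rank(similarity, reference_rank, lam):
    """Order documents by c_j = lam * s_j + (1 - lam) * p_j, highest first."""
    scores = {j: lam * similarity[j] + (1 - lam) * reference_rank[j]
              for j in similarity}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical similarity scores s_j and reference-pattern ranks p_j.
similarity = {"d1": 0.9, "d2": 0.4, "d3": 0.6}
reference_rank = {"d1": 0.2, "d2": 0.9, "d3": 0.3}

print(combined_rank(similarity, reference_rank, lam=0.8))  # ['d1', 'd3', 'd2']
print(combined_rank(similarity, reference_rank, lam=0.2))  # ['d2', 'd3', 'd1']
```

A value of `lam` near 1 favors the match with the query, while a value near 0 favors link-based importance, which is why the order reverses between the two calls.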


Cornell Note

Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, covering both theory and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).




