Text and Web Search
Presentation Transcript
Text Databases and IR
  • Text databases (document databases)
    • Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
  • Information retrieval
    • A field developed in parallel with database systems
    • Information is organized into (a large number of) documents
    • Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Information Retrieval
  • Typical IR systems
    • Online library catalogs
    • Online document management systems
  • Information retrieval vs. database systems
    • Some DB problems are not present in IR, e.g., update, transaction management, complex objects
    • Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
Basic Measures for Text Retrieval

[Figure: Venn diagram of all documents, with the set of retrieved documents overlapping the set of relevant documents in the region “relevant & retrieved”.]

  • Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
  • Recall: the percentage of relevant documents that were, in fact, retrieved
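
In set notation, following the Venn diagram above:

precision = |relevant ∩ retrieved| / |retrieved|

recall = |relevant ∩ retrieved| / |relevant|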
Information Retrieval Techniques
  • Index Terms (Attribute) Selection:
    • Stop list
    • Word stem
    • Index terms weighting methods
  • Terms × Documents frequency matrices
  • Information Retrieval Models:
    • Boolean Model
    • Vector Model
    • Probabilistic Model
Problem - Motivation
  • Given a database of documents, find documents containing “data”, “retrieval”
  • Applications:
    • Web
    • law + patent offices
    • digital libraries
    • information filtering
Problem - Motivation
  • Types of queries:
    • boolean (‘data’ AND ‘retrieval’ AND NOT ...)
    • additional features (‘data’ ADJACENT ‘retrieval’)
    • keyword queries (‘data’, ‘retrieval’)
  • How to search a large collection of documents?
Full-text scanning
  • for single term:
    • (naive: O(N*M))

[Example: text = ABRACADABRA, pattern = CAB]
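
A minimal sketch of the naive scan on this example (illustrative Python, not from the slides):

```python
def naive_scan(text: str, pattern: str) -> list[int]:
    """Return all starting positions of `pattern` in `text` (naive O(N*M) scan)."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):          # try every alignment of the pattern
        if text[i:i + m] == pattern:    # compare up to M characters
            hits.append(i)
    return hits

print(naive_scan("ABRACADABRA", "CAB"))   # [] -- 'CAB' does not occur
print(naive_scan("ABRACADABRA", "ABRA"))  # [0, 7]
```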

Full-text scanning
  • for single term:
    • (naive: O(N*M))
    • Knuth, Morris and Pratt (‘77)
      • build a small FSA; visit every text letter once only, by carefully shifting more than one step


Full-text scanning

[Figure: successive alignments of the pattern CAB against the text ABRACADABRA, shifting by more than one position at a time.]

Full-text scanning
  • for single term:
    • (naive: O(N*M))
    • Knuth Morris and Pratt (‘77)
    • Boyer and Moore (‘77)
      • preprocess pattern; start from right to left & skip!


Text - Detailed outline
  • text
    • problem
    • full text scanning
    • inversion
    • signature files
    • clustering
    • information filtering and LSI
Text – Inverted Files

Q: space overhead?

A: mainly, the postings lists

Text – Inverted Files

  • how to organize dictionary?
  • stemming – Y/N?
    • keep only the root of each word, e.g., inverted, inversion → invert
  • insertions?

Text – Inverted Files

  • how to organize dictionary?
    • B-tree, hashing, TRIEs, PATRICIA trees, ...
  • stemming – Y/N?
  • insertions?

Text – Inverted Files

  • postings lists – again a Zipf distribution, e.g., the rank-frequency plot of the ‘Bible’: freq ≈ 1 / (rank · ln(1.78 V)), a straight line on log(freq) vs. log(rank) axes


Text – Inverted Files

  • postings lists
    • Cutting+Pedersen
      • (keep first 4 in B-tree leaves)
    • how to allocate space: [Faloutsos+92]
      • geometric progression
    • compression (Elias codes) [Zobel+] – down to 2% overhead!
    • Conclusions: needs space overhead (2%-300%), but it is the fastest
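
A minimal sketch of an inverted file in Python (illustrative only; a real system would add the stop list, stemming, compression, and the B-tree/hash dictionaries discussed above):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Dictionary: term -> postings list (sorted ids of documents containing the term)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "data retrieval systems", 2: "text data bases", 3: "information retrieval"}
index = build_inverted_index(docs)
# AND query = intersection of postings lists
result = set(index.get("data", [])) & set(index.get("retrieval", []))
print(index["data"], result)   # [1, 2] {1}
```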
Vector Space Model and Clustering
  • Keyword (free-text) queries (vs Boolean)
  • each document: -> vector (HOW?)
  • each query: -> vector
  • search for ‘similar’ vectors
Vector Space Model and Clustering
  • main idea: each document is a vector of size d: d is the number of different terms in the database

[Figure: ‘indexing’ maps a document containing ‘...data...’ to a vector of length d (= vocabulary size), with one position for every term in the vocabulary, from ‘aaron’ to ‘zoo’.]

Document Vectors
  • Documents are represented as “bags of words”
  • Represented as vectors when used computationally
    • A vector is like an array of floating-point numbers
    • Has direction and magnitude
    • Each vector holds a place for every term in the collection
    • Therefore, most vectors are sparse
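
A minimal sketch of such a sparse document vector, stored as a term-to-count mapping rather than a dense array (illustrative Python):

```python
from collections import Counter

def doc_vector(text: str) -> Counter:
    """Bag of words: term -> raw frequency; terms not present are implicitly 0."""
    return Counter(text.lower().split())

v = doc_vector("data retrieval of data")
print(v["data"], v["retrieval"], v["zoo"])   # 2 1 0  (missing term counts as 0)
```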
Document Vectors – one location for each word

[Table: rows are the documents A–I and columns are the terms nova, galaxy, heat, h’wood, film, role, diet, fur; each cell holds the raw term frequency, and a blank cell means 0 occurrences. For example, “nova” occurs 10 times in text A, “galaxy” 5 times, and “heat” 3 times; “h’wood” occurs 7 times in text I, “film” 5 times, “diet” once, and “fur” 3 times.]


We Can Plot the Vectors

[Figure: documents plotted in a 2-D space with axes “star” and “diet”: a document about astronomy, a document about movie stars, and a document about mammal behavior fall in different regions of the plot.]

Vector Space Model and Clustering

Then, group nearby vectors together

  • Q1: cluster search?
  • Q2: cluster generation?

Two significant contributions

  • ranked output
  • relevance feedback
Vector Space Model and Clustering
  • cluster search: visit the (k) closest superclusters; continue recursively

[Figure: document clusters labeled CS TRs and MD TRs, grouped into superclusters.]

Vector Space Model and Clustering
  • ranked output: easy!


Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio ’73]

Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio ’73]
  • How?

Vector Space Model and Clustering
  • How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ ones (see the sketch below)

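
A sketch of this update in the spirit of the standard Rocchio formula; the weights alpha, beta, gamma below are illustrative defaults, not values from the slides:

```python
import numpy as np

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query vector toward the 'good' documents and away from the 'bad' ones."""
    q_new = alpha * np.asarray(query, dtype=float)
    if relevant:
        q_new += beta * np.mean(relevant, axis=0)     # add the 'good' vectors (averaged)
    if non_relevant:
        q_new -= gamma * np.mean(non_relevant, axis=0)  # subtract the 'bad' ones
    return q_new

q = rocchio([1.0, 0.0], relevant=[[0.8, 0.3]], non_relevant=[[0.1, 0.9]])
print(q)   # the query is pulled toward the relevant doc and pushed away from the other
```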

Cluster generation
  • Problem:
    • given N points in V dimensions,
    • group them
Cluster generation
  • Problem:
    • given N points in V dimensions,
    • group them (typically k-means or AGNES is used; a minimal sketch follows below)
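
A minimal k-means sketch (using scikit-learn; the points below are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# N points in V dimensions (here: 6 points in 2-D, forming two obvious groups)
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
              [0.9, 0.8], [0.85, 0.9], [0.95, 0.85]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g., [0 0 0 1 1 1] -- each point assigned to one of the 2 clusters
```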
Assigning Weights to Terms
  • Binary Weights
  • Raw term frequency
  • tf x idf
    • Recall the Zipf distribution
    • Want to weight terms highly if they are
      • frequent in relevant documents … BUT
      • infrequent in the collection as a whole
Binary Weights
  • Only the presence (1) or absence (0) of a term is included in the vector
Raw Term Weights
  • The frequency of occurrence for the term in each document is included in the vector
Assigning Weights
  • tf x idf measure:
    • term frequency (tf)
    • inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
  • Goal: assign a tf * idf weight to each term in each document
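
The standard form of this weight (the exact variant used in the slides is not shown in the transcript) is:

w(t, d) = tf(t, d) × idf(t), with idf(t) = log( N / df(t) )

where N is the number of documents in the collection and df(t) is the number of documents containing term t.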
Inverse Document Frequency
  • IDF provides high values for rare words and low values for common words

[Example: IDF values for a collection of 10,000 documents.]
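
For instance, assuming log base 10 (the slide’s own numbers are not in the transcript): with N = 10,000 documents, a term that appears in 10 documents gets idf = log(10000/10) = 3, while a term that appears in 5,000 documents gets idf = log(10000/5000) ≈ 0.3, so rare words score high and common words score low.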

Similarity Measures for document vectors

  • Simple matching (coordination level match)
  • Dice’s Coefficient
  • Jaccard’s Coefficient
  • Cosine Coefficient
  • Overlap Coefficient
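
The formulas themselves did not survive the transcript; the standard definitions, for two documents X and Y (as term sets, or weighted vectors in the cosine case), are:

  • Simple matching: |X ∩ Y|
  • Dice: 2 |X ∩ Y| / (|X| + |Y|)
  • Jaccard: |X ∩ Y| / |X ∪ Y|
  • Cosine: (X · Y) / (||X|| ||Y||)
  • Overlap: |X ∩ Y| / min(|X|, |Y|)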

tf x idf normalization
  • Normalize the term weights (so longer documents are not unfairly given more weight)
    • normalize usually means forcing all values to fall within a certain range, usually between 0 and 1 inclusive
Computing Similarity Scores


Vector Space with Term Weights and Cosine Matching

Di = (di1, wdi1; di2, wdi2; …; dit, wdit)

Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)

[Figure: the query Q = (0.4, 0.8) and the documents D1 = (0.8, 0.3) and D2 = (0.2, 0.7) plotted on axes Term A and Term B, each running from 0 to 1.0; the cosine of the angle between Q and each document vector gives the match score.]
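
A quick check of this example (illustrative Python; cosine of the query against each document):

```python
import numpy as np

def cosine(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(round(cosine(Q, D1), 2), round(cosine(Q, D2), 2))  # 0.73 0.98 -> D2 matches Q better
```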

Text - Detailed outline
  • Text databases
    • problem
    • full text scanning
    • inversion
    • signature files (a.k.a. Bloom Filters)
    • Vector model and clustering
    • information filtering and LSI
Information Filtering + LSI
  • [Foltz+,’92] Goal:
    • users specify interests (= keywords)
    • system alerts them about suitable news documents
  • Major contribution: LSI = Latent Semantic Indexing
    • latent (‘hidden’) concepts
Information Filtering + LSI

Main idea

  • map each document into some ‘concepts’
  • map each term into some ‘concepts’

‘Concept’:~ a set of terms, with weights, e.g.

    • “data” (0.8), “system” (0.5), “retrieval” (0.6) -> DBMS_concept
Information Filtering + LSI

Pictorially: term-document matrix (BEFORE)

Information Filtering + LSI

Pictorially: concept-document matrix and...

Information Filtering + LSI

... and concept-term matrix

Information Filtering + LSI

Q: How to search, e.g., for ‘system’?

Information Filtering + LSI

A: find the corresponding concept(s); and the corresponding documents


Information Filtering + LSI

Thus it works like an (automatically constructed) thesaurus:

we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)

SVD
  • LSI: find ‘concepts’
SVD - Definition

A[n × m] = U[n × r] L[r × r] (V[m × r])^T

  • A: n x m matrix (eg., n documents, m terms)
  • U: n x r matrix (n documents, r concepts)
  • L: r x r diagonal matrix (strength of each ‘concept’) (r : rank of the matrix)
  • V: m x r matrix (m terms, r concepts)
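
A minimal sketch with NumPy (the toy documents-by-terms matrix below is made up, not the example from the slides):

```python
import numpy as np

# toy documents-x-terms matrix A (rows: documents, columns: terms)
A = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
k = 2                                              # keep only the strongest k concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k (LSI-style) approximation of A
print(s.round(2))           # singular values = 'strength' of each concept
print(np.allclose(A, A_k))  # True here, since this toy A really has rank 2
```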
SVD - Example

  • A = ULVT - example:

[Figure: a document-term matrix A, with columns for the terms data, inf., retrieval, brain, lung and rows for CS and MD documents, written as the product of three matrices U x L x VT.]

SVD - Example

  • A = ULVT - the same example, with the factors annotated:

[Figure: the two columns of U correspond to the CS-concept and the MD-concept; U is the doc-to-concept similarity matrix; the diagonal of L gives the ‘strength’ of each concept (e.g., of the CS-concept); and V is the term-to-concept similarity matrix.]

SVD for LSI

‘documents’, ‘terms’ and ‘concepts’:

  • U: document-to-concept similarity matrix
  • V: term-to-concept sim. matrix
  • L: its diagonal elements: ‘strength’ of each concept
SVD for LSI
  • Need to keep all the eigenvectors?
  • NO, just keep the first k (concepts)
Web Search
  • What about web search?
    • First you need to get all the documents of the web…. Crawlers.
    • Then you have to index them (inverted files, etc)
    • Find the web pages that are relevant to the query
    • Report the pages with their links in a sorted order
  • Main difference from IR: web pages have links
    • it may be possible to exploit the link structure for ranking the relevant documents…
Kleinberg’s Algorithm (HITS)
  • Main idea: In many cases, when you search the web using some terms, the most relevant pages may not contain these terms (or may contain them only a few times)
    • Harvard : www.harvard.edu
    • Search Engines: yahoo, google, altavista
  • Authorities and hubs
Kleinberg’s algorithm
  • Problem definition: given the web and a query
  • find the most ‘authoritative’ web pages for this query

Step 0: find all pages containing the query terms (root set)

Step 1: expand by one move forward and backward (base set)

Kleinberg’s algorithm
  • Step 1: expand by one move forward and backward
Kleinberg’s algorithm
  • on the resulting graph, give high score (= ‘authorities’) to nodes that many important nodes point to
  • give high importance score (‘hubs’) to nodes that point to good ‘authorities’

[Figure: hubs on the left pointing to authorities on the right.]

Kleinberg’s algorithm

observations

  • recursive definition!
  • each node (say, ‘i’-th node) has both an authoritativeness score ai and a hubness score hi
Kleinberg’s algorithm

Let E be the set of edges and A be the adjacency matrix:

the (i, j) entry is 1 if the edge from i to j exists

Let h and a be [n x 1] vectors with the ‘hubness’ and ‘authoritativeness’ scores.

Then:

Kleinberg’s algorithm

Then:

ai = hk + hl + hm

that is,

ai = Sum (hj) over all j such that the edge (j, i) exists

or

a = ATh

[Figure: nodes k, l, m each pointing to node i.]

Kleinberg’s algorithm

symmetrically, for the ‘hubness’:

hi = an + ap + aq

that is,

hi = Sum (aj) over all j such that the edge (i, j) exists

or

h = Aa

[Figure: node i pointing to nodes n, p, q.]

Kleinberg’s algorithm

In conclusion, we want vectors h and a such that:

h = Aa

a = ATh

Start with a and h set to all 1s. Then apply the following trick:

h = Aa = A(ATh) = (AAT)h = … = (AAT)2 h = … = (AAT)k h

a = (ATA)k a
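
A sketch of this iteration in Python, with normalization at each step so the scores stay bounded (the 4-node graph is made up):

```python
import numpy as np

# adjacency matrix: A[i, j] = 1 if there is an edge from node i to node j
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

h = np.ones(A.shape[0])           # hubness scores, start from all 1s
a = np.ones(A.shape[0])           # authoritativeness scores, start from all 1s
for _ in range(50):
    a = A.T @ h                   # a = ATh : good authorities are pointed to by good hubs
    h = A @ a                     # h = Aa  : good hubs point to good authorities
    a /= np.linalg.norm(a)        # normalize so the scores do not blow up
    h /= np.linalg.norm(h)

print(a.round(3), h.round(3))     # node 2 (0-indexed) gets the highest authority score
```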

Kleinberg’s algorithm

In short, the solutions to

h = Aa

a = ATh

are the left and right singular vectors of the adjacency matrix A (equivalently, the eigenvectors of AAT and ATA).

Starting from random a’ and iterating, we’ll eventually converge

(Q: to which of all the eigenvectors? why?)

Kleinberg’s algorithm

(Q: to which of all the eigenvectors? why?)

A: to the ones with the strongest eigenvalue, because of the property:

(ATA)k v’ ~ (constant) v1

where v1 is the eigenvector with the largest eigenvalue.

So we can find the a and h vectors, and the pages with the highest a values are reported!

Kleinberg’s algorithm - results

E.g., for the query ‘java’:

  • 0.328 www.gamelan.com
  • 0.251 java.sun.com
  • 0.190 www.digitalfocus.com (“the java developer”)

Kleinberg’s algorithm - discussion
  • the ‘authority’ score can be used to find pages ‘similar’ to page p
  • closely related to ‘citation analysis’, social networks, and the ‘small world’ phenomenon
google/page-rank algorithm
  • closely related: The Web is a directed graph of connected nodes
  • imagine a particle randomly moving along the edges (*)
  • compute its steady-state probabilities; these give the PageRank of each page (the importance of the page)

(*) with occasional random jumps

PageRank Definition
  • Assume a page A and pages T1, T2, …, Tm that point to A. Let d be a damping factor, PR(A) the PageRank of A, and C(A) the out-degree of A. Then:
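
The formula itself did not survive the transcript; in the form given in the Brin and Page paper cited in the references:

PR(A) = (1 - d) + d · ( PR(T1)/C(T1) + … + PR(Tm)/C(Tm) )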
google/page-rank algorithm
  • Computing the PR of each page is the identical problem: given a Markov chain, compute the steady-state probabilities p1 … p5

[Figure: an example web graph with five pages, numbered 1–5.]

Computing PageRank
  • Iterative procedure
  • Also: navigate the web by randomly following links or, with probability d, jumping to a random page. Let A be the adjacency matrix (n x n) and ci the out-degree of page i

Prob(Ai -> Aj) = d · n^-1 + (1 - d) · ci^-1 · Aij

A’[i,j] = Prob(Ai->Aj)
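
A sketch of the corresponding power iteration (illustrative Python; the 5-page graph and the value d = 0.15 are made up):

```python
import numpy as np

# adjacency matrix of a small made-up 5-page graph: A[i, j] = 1 if page i links to page j
A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]], dtype=float)
n = A.shape[0]
d = 0.15                                   # probability of a random jump (illustrative value)
c = A.sum(axis=1)                          # out-degree c_i of each page
Aprime = d / n + (1 - d) * A / c[:, None]  # A'[i, j] = Prob(Ai -> Aj), as in the slide above

p = np.ones(n) / n                         # start from the uniform distribution
for _ in range(100):
    p = Aprime.T @ p                       # steady state of the random walk: p = A'^T p
print(p.round(3))                          # PageRank (steady-state probability) of each page
```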

google/page-rank algorithm
  • Let A’ be the transition matrix (= adjacency matrix, row-normalized : sum of each row = 1)

[Figure: the five-page example graph and its row-normalized transition matrix A’.]

google/page-rank algorithm
  • A’T p = p (equivalently, pT A’ = pT)

[Figure: the five-page example graph next to the matrix equation A’T p = p.]

google/page-rank algorithm
  • A’T p = p
  • thus, p is the eigenvector of A’T that corresponds to the highest eigenvalue (= 1, since A’ is row-normalized)
Kleinberg/google - conclusions

SVD helps in graph analysis:

hub/authority scores: strongest left and right singular vectors of the adjacency matrix (eigenvectors of AAT and ATA)

random walk on a graph: steady-state probabilities are given by the strongest (left) eigenvector of the transition matrix

References

  • Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl. World Wide Web Conf.