## Text and Web Search (Albert_Lan)

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Text Databases and IR

- Text databases (document databases)
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
- Information retrieval
- A field developed in parallel with database systems
- Information is organized into (a large number of) documents
- Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Information Retrieval

- Typical IR systems
- Online library catalogs
- Online document management systems
- Information retrieval vs. database systems
- Some DB problems are not present in IR, e.g., update, transaction management, complex objects
- Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance

Basic Measures for Text Retrieval

[Venn diagram: within the set of All Documents, the Retrieved and Relevant subsets overlap in ‘Relevant & Retrieved’]

- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)
- Recall: the percentage of documents relevant to the query that were, in fact, retrieved
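The two measures can be sketched in a few lines; the document-ID sets below are hypothetical:

```python
# Precision/recall sketch over hypothetical sets of document IDs.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document IDs."""
    hits = len(retrieved & relevant)                   # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 retrieved docs, 5 relevant docs, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 9})
print(p, r)   # 0.75 0.6
```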

Information Retrieval Techniques

- Index term (attribute) selection:
- Stop lists
- Word stemming
- Index-term weighting methods
- Term-document frequency matrices
- Information Retrieval Models:
- Boolean Model
- Vector Model
- Probabilistic Model

Problem - Motivation

- Given a database of documents, find documents containing “data”, “retrieval”
- Applications:
- Web
- law + patent offices
- digital libraries
- information filtering

Problem - Motivation

- Types of queries:
- boolean (‘data’ AND ‘retrieval’ AND NOT ...)
- additional features (‘data’ ADJACENT ‘retrieval’)
- keyword queries (‘data’, ‘retrieval’)
- How to search a large collection of documents?

Full-text scanning

- for single term:
- (naive: O(N*M))
- Knuth, Morris and Pratt (‘77)
- build a small FSA; visit every text letter once only, by carefully shifting more than one step

Example: text = ABRACADABRA, pattern = CAB
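The KMP idea can be sketched as follows; a minimal, self-contained version (the function name is ours, and we also try the pattern ‘CAD’, which does occur in the slide’s text):

```python
def kmp_search(text, pattern):
    """Return index of first occurrence of pattern in text, or -1.
    Visits each text character once, shifting via the failure table."""
    if not pattern:
        return 0
    # failure[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # scan the text, never moving backwards in it
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1          # match ends at position i
    return -1

print(kmp_search("ABRACADABRA", "CAD"))   # 4
print(kmp_search("ABRACADABRA", "CAB"))   # -1 (the slide's pattern never occurs)
```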

Full-text scanning

- for single term:
- (naive: O(N*M))
- Knuth Morris and Pratt (‘77)
- Boyer and Moore (‘77)
- preprocess pattern; start from right to left & skip!

Example: text = ABRACADABRA, pattern = CAB
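The right-to-left-with-skips idea can be sketched with the simpler Boyer-Moore-Horspool variant (one bad-character table instead of full Boyer-Moore); names and examples are ours:

```python
def horspool_search(text, pattern):
    """Horspool simplification of Boyer-Moore: compare the pattern
    right-to-left and skip ahead using a bad-character shift table."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1 if m > n else 0
    # shift for a character = distance from its last occurrence
    # (excluding the final pattern position) to the end of the pattern
    shift = {c: m for c in set(text)}
    for j, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - j
    i = m - 1                          # index of the window's last char
    while i < n:
        k = 0
        while k < m and text[i - k] == pattern[m - 1 - k]:
            k += 1                     # compare right to left
        if k == m:
            return i - m + 1
        i += shift.get(text[i], m)     # skip!
    return -1

print(horspool_search("ABRACADABRA", "CAD"))   # 4
print(horspool_search("ABRACADABRA", "CAB"))   # -1
```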

Text - Detailed outline

- text
- problem
- full text scanning
- inversion
- signature files
- clustering
- information filtering and LSI

Text – Inverted Files

- how to organize the dictionary?
- stemming – Y/N? Keep only the root of each word, e.g., ‘inverted’, ‘inversion’ -> ‘invert’
- insertions?

Text – Inverted Files

how to organize dictionary?

B-tree, hashing, TRIEs, PATRICIA trees, ...

stemming – Y/N?

insertions?

Text – Inverted Files

- postings lists follow a Zipf distribution, e.g., the rank-frequency plot of the ‘Bible’:

[log-log plot of log(freq) vs. log(rank); empirically freq ~ 1 / (rank * ln(1.78 V)), where V is the vocabulary size]

- postings lists
- Cutting+Pedersen: keep the first 4 postings in the B-tree leaves
- how to allocate space: [Faloutsos+92]
- geometric progression
- compression (Elias codes) [Zobel+] – down to 2% overhead!
- Conclusions: inversion needs space overhead (2%–300%), but it is the fastest
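The inverted-file idea can be sketched as a plain dictionary of postings lists; stemming, stop lists, and compression are omitted, and all names and documents are illustrative:

```python
# Minimal inverted-file sketch: dictionary of terms -> postings lists
# (sorted document IDs).
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc_ids}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "data retrieval systems",
        2: "text data bases",
        3: "signature files"}
index = build_index(docs)
print(index["data"])                       # [1, 2]

# boolean AND query ('data' AND 'retrieval'): intersect postings lists
hits = set(index["data"]) & set(index["retrieval"])
print(sorted(hits))                        # [1]
```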

Vector Space Model and Clustering

- Keyword (free-text) queries (vs Boolean)
- each document: -> vector (HOW?)
- each query: -> vector
- search for ‘similar’ vectors

Vector Space Model and Clustering

- main idea: each document is a vector of size d: d is the number of different terms in the database

[figure: a document containing ‘...data...’ is mapped (‘indexing’) to a vector of length d (= vocabulary size), with one coordinate per term: ‘aaron’, ..., ‘data’, ..., ‘zoo’]

Document Vectors

- Documents are represented as “bags of words”
- Represented as vectors when used computationally
- A vector is like an array of floating-point numbers
- Has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
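The bag-of-words representation above can be sketched as follows (toy documents, raw counts, no weighting):

```python
# Sketch: turn documents into term-count vectors over a shared vocabulary.
def to_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, count vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)          # one slot per vocabulary term
        for t in doc:
            v[index[t]] += 1
        vectors.append(v)
    return vocab, vectors

vocab, vecs = to_vectors([["nova", "galaxy", "nova"], ["film", "role"]])
print(vocab)   # ['film', 'galaxy', 'nova', 'role']
print(vecs)    # [[0, 1, 2, 0], [1, 0, 0, 1]] -- mostly zeros: sparse
```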

Document Vectors: one location for each word

| doc | nova | galaxy | heat | h’wood | film | role | diet | fur |
|-----|------|--------|------|--------|------|------|------|-----|
| A   | 10   | 5      | 3    |        |      |      |      |     |
| B   | 5    | 10     |      |        |      |      |      |     |
| C   | 10   | 8      | 7    |        |      |      |      |     |
| D   |      |        |      | 9      | 10   | 5    |      |     |
| E   |      |        |      | 10     | 10   |      |      |     |
| F   |      |        |      |        | 9    | 10   |      |     |
| G   |      |        |      |        | 5    | 7    | 9    |     |
| H   |      |        |      |        | 6    | 10   | 2    | 8   |
| I   |      |        |      | 7      | 5    |      | 1    | 3   |

- “Nova” occurs 10 times in text A, “galaxy” 5 times, “heat” 3 times
- (Blank means 0 occurrences.)

Document Vectors: one location for each word

(same term-document matrix as above)

- “Hollywood” occurs 7 times in text I, “film” 5 times, “diet” 1 time, “fur” 3 times

Document Vectors

(same term-document matrix as above, with document ids A–I labeling the rows)

Vector Space Model and Clustering

Then, group nearby vectors together

- Q1: cluster search?
- Q2: cluster generation?

Two significant contributions

- ranked output
- relevance feedback

Vector Space Model and Clustering

- cluster search: visit the (k) closest superclusters; continue recursively

[figure: two document clusters, ‘MD TRs’ and ‘CS TRs’]

Vector Space Model and Clustering

- relevance feedback – How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ ones

[figure: the ‘MD TRs’ and ‘CS TRs’ clusters]


Cluster generation

- Problem:
- given N points in V dimensions,
- group them (typically k-means or AGNES, i.e., agglomerative nesting, is used)
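The grouping step can be sketched with a naive k-means; to keep the example deterministic we pass fixed initial centroids instead of random ones, and the 2-D points are hypothetical:

```python
# Naive k-means sketch (fixed initial centroids for clarity).
def kmeans(points, centroids, iters=10):
    """points, centroids: lists of equal-length coordinate tuples."""
    for _ in range(iters):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
cents, groups = kmeans(pts, [(0, 0), (10, 10)])
print(cents)    # [(0.0, 0.5), (10.0, 10.5)]
```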

Assigning Weights to Terms

- Binary Weights
- Raw term frequency
- tf x idf
- Recall the Zipf distribution
- Want to weight terms highly if they are
- frequent in relevant documents … BUT
- infrequent in the collection as a whole

Binary Weights

- Only the presence (1) or absence (0) of a term is included in the vector

Raw Term Weights

- The frequency of occurrence for the term in each document is included in the vector

Assigning Weights

- tf x idf measure:
- term frequency (tf)
- inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: assign a tf * idf weight to each term in each document

Inverse Document Frequency

- IDF provides high values for rare words and low values for common words

[table: example idf values for a collection of 10,000 documents]
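Assuming the common idf = log10(N / df) form (the slide’s table of values is not reproduced here), the 10,000-document example works out as:

```python
import math

# idf sketch; the base-10 log and N = 10000 mirror the slide's example,
# but the exact formula used in the original table is an assumption.
def idf(n_docs, doc_freq):
    """Inverse document frequency of a term appearing in doc_freq docs."""
    return math.log10(n_docs / doc_freq)

N = 10000
print(idf(N, 1))       # 4.0 -- a very rare word scores high
print(idf(N, 100))     # 2.0
print(idf(N, 10000))   # 0.0 -- a word in every document is worthless
```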

Similarity Measures for document vectors

Simple matching (coordination level match)

Dice’s Coefficient

Jaccard’s Coefficient

Cosine Coefficient

Overlap Coefficient

tf x idf normalization

- Normalize the term weights (so longer documents are not unfairly given more weight)
- to normalize usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive

Vector Space with Term Weights and Cosine Matching

Di = (di1, wdi1; di2, wdi2; …; dit, wdit)

Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)

[figure: Term A on the x-axis and Term B on the y-axis, both from 0 to 1.0, showing Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)]
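Using the slide’s vectors, cosine matching ranks D2 above D1, since the angle between Q and D2 is smaller:

```python
import math

# Cosine matching on the slide's vectors (weights for Term A, Term B).
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine(Q, D1), 3))   # 0.733
print(round(cosine(Q, D2), 3))   # 0.983 -> D2 ranks above D1
```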

Text - Detailed outline

- Text databases
- problem
- full text scanning
- inversion
- signature files (a.k.a. Bloom Filters)
- Vector model and clustering
- information filtering and LSI

Information Filtering + LSI

- [Foltz+,’92] Goal:
- users specify interests (= keywords)
- system alerts them to suitable news documents
- Major contribution: LSI = Latent Semantic Indexing
- latent (‘hidden’) concepts

Information Filtering + LSI

Main idea

- map each document into some ‘concepts’
- map each term into some ‘concepts’

‘Concept’:~ a set of terms, with weights, e.g.

- “data” (0.8), “system” (0.5), “retrieval” (0.6) -> DBMS_concept

Information Filtering + LSI

Pictorially: term-document matrix (BEFORE)

Information Filtering + LSI

Pictorially: concept-document matrix and...

Information Filtering + LSI

... and concept-term matrix

Information Filtering + LSI

Q: How to search, eg., for ‘system’?

Information Filtering + LSI

A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI

Thus it works like an (automatically constructed) thesaurus:

we may retrieve documents that DON’T contain the term ‘system’ but do contain almost everything else (‘data’, ‘retrieval’)

SVD

- LSI: find ‘concepts’

SVD - Definition

A [n x m] = U [n x r] L [r x r] (V [m x r])^T

- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- L: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
- V: m x r matrix (m terms, r concepts)

SVD - Example

- A = U L V^T - example:

[figure: term-document matrix A (terms: data, inf., retrieval, brain, lung; rows: CS and MD documents) = U x L x V^T, with U highlighted as the doc-to-concept similarity matrix (columns: CS-concept, MD-concept)]

SVD - Example

- A = U L V^T - example:

[same figure, now highlighting V^T: the term-to-concept similarity matrix]


SVD for LSI

‘documents’, ‘terms’ and ‘concepts’:

- U: document-to-concept similarity matrix
- V: term-to-concept sim. matrix
- L: its diagonal elements: ‘strength’ of each concept

SVD for LSI

- Need to keep all the eigenvectors?
- NO, just keep the first k (concepts)
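The rank-k truncation can be sketched with numpy; the 4x4 matrix below is a toy stand-in for the slide’s CS/MD example, not its exact numbers:

```python
import numpy as np

# LSI sketch: rank-k truncated SVD of a tiny document-term matrix.
# Rows = documents, columns = terms; the first two docs are "CS-like",
# the last two "MD-like" (illustrative values, not the slide's matrix).
A = np.array([[1, 1, 0, 0],
              [2, 2, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                            # keep only the k strongest concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Here rank(A) = 2, so two concepts reconstruct A exactly.
print(np.allclose(A, A_k))       # True
```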

Web Search

- What about web search?
- First you need to get all the documents of the web…. Crawlers.
- Then you have to index them (inverted files, etc)
- Find the web pages that are relevant to the query
- Report the pages with their links in a sorted order
- Main difference with IR: web pages have links
- may be possible to exploit the link structure for sorting the relevant documents…

Kleinberg’s Algorithm (HITS)

- Main idea: in many cases, when you search the web using some terms, the most relevant pages may not contain these terms (or contain them only a few times)
- e.g., Harvard: www.harvard.edu barely mentions ‘Harvard’
- e.g., search engines: the yahoo, google, altavista home pages do not say ‘search engine’
- Authorities and hubs

Kleinberg’s algorithm

- Problem definition: given the web and a query
- find the most ‘authoritative’ web pages for this query

Step 0: find all pages containing the query terms (root set)

Step 1: expand by one move forward and backward (base set)

Kleinberg’s algorithm

- Step 1: expand by one move forward and backward

Kleinberg’s algorithm

- on the resulting graph, give a high score (= ‘authority’) to nodes that many important nodes point to
- give a high importance score (‘hub’) to nodes that point to good ‘authorities’

hubs

authorities

Kleinberg’s algorithm

observations

- recursive definition!
- each node (say, ‘i’-th node) has both an authoritativeness score ai and a hubness score hi

Kleinberg’s algorithm

Let E be the set of edges and A the adjacency matrix:

the (i, j) entry is 1 if the edge from i to j exists

Let h and a be [n x 1] vectors of the ‘hubness’ and ‘authoritativeness’ scores.

Then:

Kleinberg’s algorithm

Then (if nodes k, l, m point to node i):

a_i = h_k + h_l + h_m

that is

a_i = sum of h_j over all j such that the (j, i) edge exists

or

a = A^T h

Kleinberg’s algorithm

Symmetrically, for the ‘hubness’ (if node i points to nodes n, p, q):

h_i = a_n + a_p + a_q

that is

h_i = sum of a_j over all j such that the (i, j) edge exists

or

h = A a

Kleinberg’s algorithm

In conclusion, we want vectors h and a such that:

h = A a

a = A^T h

Start with a and h set to all 1s, then apply the following trick:

h = A a = A (A^T h) = (A A^T) h = … = (A A^T)^2 h = … = (A A^T)^k h

a = (A^T A)^k a
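The iteration above can be sketched directly; the 3-page link graph is hypothetical:

```python
import numpy as np

# HITS power-iteration sketch on a tiny hypothetical link graph.
# A[i, j] = 1 iff page i links to page j.
A = np.array([[0, 1, 1],    # page 0 links to 1 and 2 (a hub)
              [0, 0, 1],    # page 1 links to 2
              [0, 0, 0]], dtype=float)

h = np.ones(3)
a = np.ones(3)
for _ in range(50):
    a = A.T @ h              # a = A^T h
    h = A @ a                # h = A a
    a /= np.linalg.norm(a)   # normalize to keep the values bounded
    h /= np.linalg.norm(h)

print(a.argmax())   # 2 -- pointed to by both other pages: top authority
print(h.argmax())   # 0 -- points to both good authorities: top hub
```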

Kleinberg’s algorithm

In short, the solutions to

h = A a

a = A^T h

are the principal left and right singular vectors of the adjacency matrix A (equivalently, the top eigenvectors of A A^T and A^T A).

Starting from random a and h and iterating, we’ll eventually converge

(Q: to which of all the eigenvectors? why?)

Kleinberg’s algorithm

(Q: to which of all the eigenvectors? why?)

A: to the ones of the strongest eigenvalue, because of the power-iteration property:

(A^T A)^k v’ ~ (constant) v1

So we can find the a and h vectors, and the pages with the highest a values are reported!

Kleinberg’s algorithm - results

Eg., for the query ‘java’:

0.328 www.gamelan.com

0.251 java.sun.com

0.190 www.digitalfocus.com (“the java developer”)

Kleinberg’s algorithm - discussion

- ‘authority’ score can be used to find ‘similar pages’ to page p
- closely related to ‘citation analysis’, social networks / ‘small world’ phenomena

google/page-rank algorithm

- closely related: The Web is a directed graph of connected nodes
- imagine a particle randomly moving along the edges (*)
- compute its steady-state probabilities; that gives the PageRank of each page (the importance of this page)

(*) with occasional random jumps

PageRank Definition

- Assume a page A and pages T1, T2, …, Tm that point to A. Let d be a damping factor, PR(A) the PageRank of A, and C(A) the out-degree of A. Then:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tm)/C(Tm))

google/page-rank algorithm

- Computing the PR of each page is an identical problem: given a Markov Chain, compute the steady-state probabilities p1 ... p5

[figure: a 5-node Markov chain, nodes 1–5]

Computing PageRank

- Iterative procedure
- Equivalently: navigate the web by randomly following links, or with probability d jump to a random page. Let A be the adjacency matrix (n x n) and ci the out-degree of page i:

Prob(Ai -> Aj) = d / n + (1 - d) Aij / ci

A’[i, j] = Prob(Ai -> Aj)

google/page-rank algorithm

- Let A’ be the transition matrix (= adjacency matrix, row-normalized: each row sums to 1)

[figure: the 5-node graph and its row-normalized transition matrix A’]

google/page-rank algorithm

- (A’)^T p = p
- thus, p is the eigenvector of (A’)^T that corresponds to the highest eigenvalue (= 1, since the matrix is row-normalized)
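The steady-state computation can be sketched by power iteration; the 3-page graph and d = 0.85 are illustrative choices:

```python
import numpy as np

# PageRank power-iteration sketch on a toy 3-page graph,
# with damping d = 0.85 as in the Brin-Page paper.
links = {0: [1, 2], 1: [2], 2: [0]}    # page -> pages it links to
n, d = 3, 0.85

# row-stochastic transition matrix with random jumps mixed in
P = np.full((n, n), (1 - d) / n)
for i, outs in links.items():
    for j in outs:
        P[i, j] += d / len(outs)

p = np.full(n, 1 / n)                  # start from the uniform distribution
for _ in range(100):
    p = p @ P                          # one step of the random surfer

print(np.round(p, 3))                  # page 2 gets the highest rank
```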

Kleinberg/google - conclusions

SVD helps in graph analysis:

hub/authority scores: the strongest left and right singular vectors of the adjacency matrix

random walk on a graph: steady-state probabilities are given by the strongest eigenvector of the transition matrix

References

- Brin, S. and Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl. World Wide Web Conf.
