INFORMATION SEARCH

Presenter: Pham Kim Son

Saint Petersburg State University

Overview
  • Introduction
  • Overview of IR systems and basic terminology
  • Classical models of Information Retrieval
    • Boolean Model
    • Vector Model
  • Modern models of Information Retrieval
    • Latent Semantic Indexing
    • Correlation Method
  • Conclusion
Introduction
  • People need information to solve problems
    • Very simple things
    • Complex things
  • The “perfect search engine”, as defined by Larry Page, is something that understands exactly what you mean and gives you back exactly what you want.
Challenges to IR
  • The introduction of the WWW
    • The WWW is large
    • and heterogeneous
  • This poses new challenges to IR
Overview of IR and basic terminology
  • An IR system can be divided into 3 components
    • Input
    • Processor
    • Output

[Diagram: the Query and the Documents feed the Input, which feeds the Processor, which produces the Output]

Input
  • The main task: obtain a good representation of each document and query for the computer to use
    • A document representative: a list of extracted keywords
  • Idea: let the computer process the natural language in the documents
Input (cont)
  • Obstacles
    • The theory of human language has not been sufficiently developed for every language
    • It is not clear how to use such a theory to enhance information retrieval
Input (cont)
  • Representing documents by keywords
    • Step 1: Removing common words
    • Step 2: Stemming
Step 1: Removing common words
  • Very-high-frequency words are very common words
  • They should be removed by comparing them against a stop-list
  • As a result:
    • Non-significant words do not interfere with the IR process
    • The size of a document is reduced by between 30 and 50 per cent (C.J. van Rijsbergen)
  • Example stop words: are, be, by, will, of, the

Step 2: Stemming
  • Definition: Stemming is the process of chopping the ending off a term, e.g. removing “ed”, “ing”
  • Porter’s algorithm (see the sketch below)
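
A minimal sketch of this preprocessing pipeline in Python. The stop-list and the suffix rules here are toy examples for illustration, not the real Porter rules:

```python
# Minimal keyword-extraction sketch: stop-word removal + naive suffix stripping.
# The stop-list and suffix rules are toy stand-ins, not the actual Porter stemmer.

STOP_WORDS = {"are", "be", "by", "will", "of", "the", "a", "an", "and", "in", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def naive_stem(term: str) -> str:
    """Chop a common ending off a term (a stand-in for Porter's algorithm)."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def extract_keywords(text: str) -> list[str]:
    """Lowercase, drop stop words, and stem what remains."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(extract_keywords("The documents are indexed by removing common words"))
# -> ['document', 'index', 'remov', 'common', 'word']
```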
Processor
  • This part of the IR system is concerned with the retrieval process
    • Structuring the documents in an appropriate way
    • Performing the actual retrieval function, using a predefined model
Output
  • The output is a set of documents, assumed to be relevant to the user.
  • Purpose of IR:
    • Retrieve all relevant documents
    • Retrieve as few irrelevant documents as possible
Definition
  • Precision = (relevant docs retrieved) / (all docs retrieved)
  • Recall = (relevant docs retrieved) / (all relevant docs) (see the sketch below)
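
A small sketch of the two measures over sets of document ids (the ids are invented):

```python
# Precision and recall from sets of document ids.
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    hits = retrieved & relevant                 # relevant docs retrieved
    precision = len(hits) / len(retrieved)      # how much of the answer is relevant
    recall = len(hits) / len(relevant)          # how much of the relevant set was found
    return precision, recall

print(precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 6}))
# -> (0.5, 0.6666666666666666)
```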

[Figure: the precision-recall trade-off, with both axes running from 0 to 1. One corner returns only relevant documents but misses many useful ones too (high precision, low recall); the opposite corner returns most relevant documents but includes lots of junk (high recall, low precision); the ideal is high precision and high recall.]
Information Retrieval Models
  • Classical models
    • Boolean model
    • Vector model
  • Novel models
    • Latent semantic indexing model
    • Correlation method
Boolean model
  • The earliest and simplest method, still widely used in IR systems today
  • Based on set theory and Boolean algebra.
  • Queries are Boolean expressions of keywords, connected by the operators AND, OR, AND NOT
  • Ex: (Saint Petersburg AND Russia) OR (beautiful city AND Russia)
Inverted files are widely used
  • Ex:
    • Term1: doc2, doc5, doc6
    • Term2: doc2, doc4, doc5
    • Query: q = (term1 AND term2)
    • Result: doc2, doc5
  • A term-document matrix can be used as well (the sketch below shows the inverted-index version)
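
A small Python sketch of an inverted index answering the AND query above (the postings mirror the example):

```python
# Inverted index: each term maps to the set of documents containing it.
inverted_index = {
    "term1": {2, 5, 6},
    "term2": {2, 4, 5},
}

def boolean_and(index: dict[str, set], *terms: str) -> set:
    """Evaluate (t1 AND t2 AND ...) by intersecting posting sets."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(boolean_and(inverted_index, "term1", "term2")))  # -> [2, 5]
```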
Thinking about the Boolean model
  • Advantages:
    • A very simple model based on set theory
    • Easy to understand and implement
    • Supports exact queries
  • Disadvantages:
    • Retrieval is based on a binary decision criterion, with no notion of partial matching
    • Sets are easy, but complex Boolean expressions are not
    • The Boolean queries formulated by users are most often too simplistic
    • Tends to retrieve either too many or too few documents
    • Gives unranked results
Vector Model
  • Why the Vector Model?
    • The Boolean model
      • only takes into account the existence or non-existence of terms in a document
      • has no sense of their different contributions to documents
Overview of the theory of the vector model
  • Documents and queries are represented as vectors in index-term space
  • The dimension of the space is equal to the vocabulary size
  • The components of these vectors are the weights of the corresponding index terms, reflecting their significance in terms of representative and discriminative power
  • Retrieval is based on whether the “query vector” and a “document vector” are close enough.
  • A set of documents: D = {d1, d2, …, dm}
  • A finite set of terms: T = {t1, t2, …, tn}
  • Every document can be represented as a vector: dj = (w1j, w2j, …, wnj)
  • and the same for the query: q = (w1q, w2q, …, wnq)
[Figure: a document vector dj and a query vector q drawn in term space, with axes i and j]

  • Similarity of query q and document dj: the cosine of the angle between their vectors, sim(q, dj) = (dj · q) / (|dj| |q|)
  • Given a threshold, all documents with similarity > threshold are retrieved (see the sketch below)
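A small numpy sketch of cosine ranking; the weight values are invented for illustration:

```python
import numpy as np

# Toy term-space vectors: rows are documents, entries are term weights.
docs = np.array([
    [0.0, 1.2, 0.8],   # d1
    [0.9, 0.0, 0.4],   # d2
])
query = np.array([0.5, 0.9, 0.0])

# Cosine similarity: dot product divided by the product of the vector norms.
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

threshold = 0.3
retrieved = np.where(sims > threshold)[0]   # documents passing the threshold
print(np.round(sims, 3), retrieved)         # -> [0.727 0.444] [0 1]
```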

Computing a good weight
  • A variety of weighting schemes are available
  • They are based on three proven principles:
    • Terms that occur in only a few documents are more valuable than ones that appear in many
    • The more often a term occurs in a document, the more likely it is to be important to that document
    • A term that appears the same number of times in a short document and in a long one is likely to be more valuable for the former
The tf-idf-dl (tf-idf) scheme
  • The term frequency of a term ti in document dj: tfij = the number of occurrences of ti in dj
  • The length of document dj: DLj = total number of term occurrences in document dj
  • Inverted document frequency: in a collection of N documents, the inverted document frequency of a term ti that appears in n documents is idfi = log(N/n)
  • Weight: wij = (tfij / DLj) × log(N/n) (see the sketch below)
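
A sketch of this weighting in Python; how exactly the three factors combine into the weight is our reading of the scheme above, so treat the last line as an assumption:

```python
import math

# Toy collection: each document is a list of already-preprocessed terms.
docs = [
    ["information", "search", "search"],
    ["information", "retrieval", "model"],
    ["boolean", "model"],
]
N = len(docs)

def tf_idf_dl_weight(term: str, doc: list[str]) -> float:
    tf = doc.count(term)                        # term frequency in this document
    dl = len(doc)                               # document length: total term occurrences
    n = sum(1 for d in docs if term in d)       # documents containing the term
    if tf == 0 or n == 0:
        return 0.0
    return (tf / dl) * math.log(N / n)          # tf scaled by length, times idf

print(tf_idf_dl_weight("search", docs[0]))      # high: frequent here, rare elsewhere
print(tf_idf_dl_weight("model", docs[1]))       # low: occurs in two of the three docs
```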
Thinking about the Vector model
  • Advantages:
    • Term weighting improves the quality of the answer
    • Partial matching allows retrieving documents that approximate the query conditions
    • The cosine ranking formula sorts the answer
  • Disadvantages:
    • Assumes the independence of terms
    • The polysemy and synonymy problems remain unsolved
Modern models of IR
  • Why?
    • Problems with polysemy:
      • Bass (fish or music?)
    • Problems with synonymy:
      • Car or automobile?
    • These failures can be traced to:
      • The way index terms are identified is incomplete
      • The lack of efficient methods to deal with polysemy
  • Idea to solve this problem: take advantage of the implicit higher-order (latent) structure in the association of terms with documents.
Latent Semantic Indexing (LSI)
  • LSI overview
    • Representing documents directly by terms suffers from unreliability, ambiguity and redundancy
    • We should find a method in which:
      • Documents and terms are represented as vectors in a k-dimensional concept space, with weights indicating the strength of association with each of these concepts
      • The method is flexible enough to remove the weak concepts, which are considered noise
  • The term-document matrix A (n terms × m documents) is built:

         d1   d2   …   dm
    t1   w11  w12  …   w1m
    t2   w21  w22  …   w2m
    :     :    :        :
    tn   wn1  wn2  …   wnm

  • Matrix A is factored into 3 matrices using the singular value decomposition (SVD): A = U Σ V^T
  • U and V are orthogonal matrices; Σ is the diagonal matrix of singular values
These special matrices show a breakdown of the original relationship (doc-term) into linearly independent components (factors)
  • Many of these components are very small (considered noise) and should be ignored
Criteria for choosing k:
    • Ignore noise
    • Important information is not lost
  • Documents and terms are represented as vectors in the k-dim space via the rank-k approximation Ak = Uk Σk Vk^T
  • The Eckart & Young theorem (Ak is the best rank-k approximation of A) assures us that important information is not lost (see the sketch below)
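
A compact numpy sketch of the truncated SVD step; the matrix entries are invented, and k is picked by hand where a real system would inspect the singular-value spectrum:

```python
import numpy as np

# Toy term-document matrix A: rows = terms, columns = documents.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # keep the k largest factors
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
Ak = Uk @ np.diag(sk) @ Vtk                        # best rank-k approximation (Eckart & Young)

term_coords = Uk * sk      # rows: term coordinates in the k-dim concept space
doc_coords = Vtk.T * sk    # rows: document coordinates in the concept space
print(np.round(Ak, 2))
```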
Query
  • We need a method to map a query into the k-dim space
  • Query q can be seen as a (pseudo-)document.
  • From the equation Ak = Uk Σk Vk^T
  • we have the query’s coordinates in the concept space: q̂ = q^T Uk Σk^(-1) (see the sketch below)
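Continuing the numpy sketch, folding a query into the concept space (the query vector is a made-up bag of term weights):

```python
# Fold a query (a vector of term weights) into the k-dim concept space:
# q_hat = q^T @ Uk @ inv(Sigma_k).
q = np.array([1.0, 0.0, 1.0, 0.0])   # toy query over the same 4 terms
q_hat = q @ Uk / sk                   # dividing by sk applies inv(Sigma_k)

# Rank documents by cosine similarity against the document coordinates.
norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_hat)
print((doc_coords @ q_hat) / norms)
```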
Similarity between objects
  • Term-Term:
    • The dot product between two row vectors of the matrix Ak reflects the similarity between two terms
    • Term-term similarity matrix: Ak Ak^T = (Uk Σk)(Uk Σk)^T
    • We can therefore consider the rows of the matrix Uk Σk as coordinates of terms.
    • The relation between taking the rows of Uk as coordinates and the rows of Uk Σk as coordinates is simple: each axis is stretched by the corresponding singular value
  • Document-Document:
    • The dot product between two column vectors of the matrix Ak reflects the similarity between two documents: Ak^T Ak = (Vk Σk)(Vk Σk)^T
    • We can consider the rows of the matrix Vk Σk as coordinates of documents.
  • Term-Document:
    • This value can be obtained by looking at the corresponding element of the matrix Ak (see the sketch below)
    • Drawback: between-object and within-object comparisons cannot be done simultaneously without rescaling the coordinates.
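
In the numpy sketch from before, the three kinds of similarity are then simple matrix products (the identities follow from Ak = Uk Σk Vk^T):

```python
# Term-term similarities: dot products between rows of Ak.
term_sims = term_coords @ term_coords.T   # equals Ak @ Ak.T

# Document-document similarities: dot products between columns of Ak.
doc_sims = doc_coords @ doc_coords.T      # equals Ak.T @ Ak

# Term-document associations: simply the entries of Ak itself.
print(np.round(term_sims, 2), np.round(doc_sims, 2), np.round(Ak, 2), sep="\n")
```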
Example
  • q1: Human machine interface for Lab ABC computer applications
  • q2: A survey of user opinion of computer system response time
  • q3: The EPS user interface management system
  • q4: System and human system engineering testing of EPS
  • q5: Relation of user-perceived response time to error measurement
  • q6: The generation of random, binary, unordered trees
  • q7: The intersection graph of paths in trees
  • q8: Graph minors IV: Widths of trees and well-quasi-ordering
  • q9: Graph minors: A survey
Updating
  • Folding in
    • New documents and terms are folded into the existing space
    • Folding in new terms and docs has no effect on the representation of pre-existing docs and terms
  • Re-computing the SVD
    • The SVD is re-computed from scratch
    • Requires time and memory
  • Choose between these two methods depending on the situation
Thinking about LSI
  • Advantages:
    • The synonymy problem is solved
    • Documents are represented in a more reliable space: the concept space
  • Disadvantages:
    • The polysemy problem is still unsolved
    • Special algorithms for handling large matrices have to be implemented
Correlation Method
  • Idea: if a keyword is present in the document, correlated keywords should be taken into account as well. This way, the concepts contained in the document are not obscured by the choice of a specific vocabulary.
  • In the vector space model the similarity vector is: s = A^T q

Starting from the user query q, we now build the best query, taking the correlated keywords into account as well.
The correlation matrix is built from the term-document matrix
    • Let A be the term-document matrix with columns a1, …, aD
    • D: the number of documents; the mean vector is ā = (1/D) Σd ad
  • The covariance matrix is computed: C = (1/D) Σd (ad − ā)(ad − ā)^T
  • The correlation matrix S: Sij = Cij / √(Cii Cjj)
A better query:
  • We now use the SVD to reduce noise in the correlation of keywords: S = U Σ U^T (S is symmetric)
  • We choose the first k largest factors to obtain the k-dimensional approximation Sk of S
  • Generate our best query: q* = Sk q
  • Vector of similarity: s = A^T q*
  • Sk defines a projection of the query onto the k strongest correlation factors (see the sketch below)
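
A numpy sketch of the whole pipeline under the reconstruction above (the formulas q* = Sk q and s = A^T q* are our reading of the slide, and the data is random):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 40))        # toy term-document matrix: 5 terms, 40 documents

# Correlation matrix of the terms across documents.
C = np.cov(A)                  # covariance of the rows (terms)
d = np.sqrt(np.diag(C))
S = C / np.outer(d, d)         # Sij = Cij / sqrt(Cii * Cjj)

# Denoise S by keeping its k largest factors (S is symmetric, so SVD works).
U, sig, Ut = np.linalg.svd(S)
k = 2
Sk = U[:, :k] @ np.diag(sig[:k]) @ Ut[:k, :]

q = np.array([1.0, 0.0, 0.0, 1.0, 0.0])   # toy query over the 5 terms
q_best = Sk @ q                            # expand the query with correlated keywords
scores = A.T @ q_best                      # similarity of every document to the best query
print(np.argsort(scores)[::-1][:5])        # the 5 best-scoring document ids
```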
Strengths of the correlation method
    • In the real world, correlations between words are fairly static.
    • The number of terms is more stable than the number of documents
    • The number of documents is many times larger than the number of keywords.
  • The method can therefore handle databases with a very large number of documents, and the correlation matrix does not have to be updated every time new documents are added.
  • This makes it important for electronic networks.
Conclusion
  • IR overview
  • Classical IR models:
    • Boolean model
    • Vector model
  • Modern IR models:
    • LSI
    • Correlation methods
References
  • Gheorghe Muresan: Using Document Clustering and Language Modeling in Mediated IR
  • Georges Dupret: Latent Concepts and the Number of Orthogonal Factors in Latent Semantic Indexing
  • C.J. van Rijsbergen: Information Retrieval
  • Ricardo Baeza-Yates: Modern Information Retrieval
  • Sandor Dominich: Mathematical Foundations of Information Retrieval
  • IR lecture notes from www.cs.utexas.edu
  • Scott Deerwester, Susan T. Dumais: Indexing by Latent Semantic Analysis
Desktop Search
  • Design a desktop search tool that satisfies the following:
    • Does not affect the computer’s performance
    • Privacy is protected
    • Ability to search as many types of files as possible (consider music files)
    • Multi-language search
    • User-friendly
    • Supports semantic search if possible