information retrieval models

Information Retrieval Models

School of Informatics

Dept. of Library and Information Studies

Dr. Miguel E. Ruiz

what is information retrieval
What is Information Retrieval
  • Information Retrieval deals with information items in terms of
    • Representation
    • Storage
    • Organization
    • Access
  • An IR system should provide access to the information that the user is interested in.
example of an information need
Example of an Information Need

“Find all documents containing information on college tennis teams which: (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. To be relevant, the document must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.”

slide4
The user must translate his/her information needs into a query.
  • Most commonly a query is a set of keywords that summarizes the information needs.
information retrieval vs data retrieval
Information Retrieval vs. Data Retrieval
  • Data retrieval consists of determining which documents of the collection contain the keywords in the user query.
  • Information retrieval should “interpret” the contents of the documents in the collection and retrieve all the documents that are relevant to the user query while retrieving as few non-relevant documents as possible.
basic concepts
Basic Concepts
  • Effective retrieval of relevant information is affected by:
    • the user task
    • the logical view of the documents
the user task
The User Task

(Diagram: the user performs two kinds of tasks, Retrieval and Browsing, both over the document database.)

logical view of a document
Logical View of a Document
  • Documents can be represented as:
    • A set of keywords or indexing terms
    • Full text
logical view of the document
Logical View of the Document

(Diagram: text operations produce the logical view of the document.)

Document (text + structure) → structure recognition → structure + full text

Full text → accents, spacing, etc. → stopwords → noun groups → stemming → automatic or manual indexing → index terms

The logical view thus ranges from the full text down to a set of index terms.
the retrieval process
The Retrieval Process

(Diagram: the retrieval process.)

The user's need enters through the user interface as text. Text operations turn both the user's text and the documents into a logical view. Query operations build the query from the logical view, while indexing (handled by the DB manager module) builds an inverted-file index over the text database. Searching runs the query against the index, the retrieved docs are ranked, and the ranked docs are returned to the user, whose feedback can be used to refine the query.

ir models
IR Models
  • An IR model is a quadruple [D, Q, F, R(q_i, d_j)]
    • D: set of logical representations of the documents
    • Q: set of logical representations of the queries
    • F: framework for modeling document representations, queries, and their relationships
    • R(q_i, d_j): ranking function that defines an association between a query and a document. This ranking defines an ordering among the documents with regard to the query.
slide12

(Diagram: a taxonomy of IR models organized by user task.)

  • Retrieval (ad hoc, filtering):
    • Classic models: Boolean, Vector space, Probabilistic
    • Set theoretic: Fuzzy, Extended Boolean
    • Algebraic: Generalized vector, Latent semantic indexing, Neural networks
    • Probabilistic: Inference network, Belief network
    • Structured models: Non-overlapping lists, Proximal nodes
  • Browsing: Flat, Structure guided, Hypertext
retrieval models and logical view of documents

Retrieval Models and Logical View of Documents

             Index Terms                Full Text                  Full Text + Structure
Retrieval    Classic, Set theoretic,    Classic, Set theoretic,    Structured
             Algebraic, Probabilistic   Algebraic, Probabilistic
Browsing     Flat                       Flat, Hypertext            Structure guided, Hypertext
ir models1
IR Models
  • Basic concepts:
    • Each document is described as a set of representative keywords called index terms.
    • An index term is a word (often one occurring in the document itself) whose semantics helps in remembering the document's main themes.
    • Index terms are used to index and summarize the document contents.
ir models2
IR Models
  • Basic concepts (cont.)
    • Index terms have varying relevance when used to describe the document contents. This effect is captured by assigning numerical weights to each index term in the document.
    • A weight is a positive value associated with each index term in the document.
ir models3
IR Models
  • The Boolean Model is a simple retrieval model based on set theory and Boolean algebra.
    • Documents are represented by the index terms assigned to them. There is no indication of which terms are more important than others (weights are binary: 0 or 1).
ir models4
IR Models
  • Boolean Model (cont.)
    • The Boolean operators used are
      • Conjunction (AND, ∧)
      • Disjunction (OR, ∨)
      • Negation (NOT, ¬)
    • Queries are specified as conventional Boolean expressions, which can be represented as a disjunction of conjunctive vectors (disjunctive normal form, DNF)
ir models5
IR Models
  • Boolean model

Example (with index-term order Safety, Car, Industry):

Q = Safety ∧ (Car ∨ ¬Industry)

Q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

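A minimal sketch of the Boolean model in Python (the four-document collection and the helper having are illustrative, not from the slides):

# Each document is the set of index terms assigned to it.
docs = {
    1: {"safety", "car", "industry"},
    2: {"safety", "car"},
    3: {"safety"},
    4: {"car", "industry"},
}
all_ids = set(docs)

def having(term):
    # IDs of documents whose index-term set contains `term`.
    return {d for d, terms in docs.items() if term in terms}

# Q = Safety AND (Car OR NOT Industry)
result = having("safety") & (having("car") | (all_ids - having("industry")))
print(sorted(result))  # -> [1, 2, 3]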
ir models6
IR Models
  • Boolean Model
ir models7
IR Models
  • Vector Space Model
    • Documents and queries are expressed as vectors whose components correspond to all possible index terms (t of them). Each index term has an associated weight that indicates its importance in the document (or query).
ir models8
IR Models
  • In other words, the document d_j and the query q are represented as t-dimensional vectors:

d_j = (w_{1,j}, w_{2,j}, …, w_{t,j})

q = (w_{1,q}, w_{2,q}, …, w_{t,q})

ir model
IR Model
  • The vector space model proposes to evaluate the degree of similarity of document dj with regard to the query q as the correlation between the two vectors dj and q.
ir models9
IR Models
  • This correlation can be quantified in different ways, for example by the cosine of the angle between the two vectors:

sim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_i w_{i,j} w_{i,q} / ( √(Σ_i w_{i,j}²) √(Σ_i w_{i,q}²) )
ir models10
IR Models
  • Since w_{i,j} ≥ 0 and w_{i,q} ≥ 0, sim(d_j, q) varies from 0 to +1. The vector space model assumes that the similarity value is an indication of the relevance of the document to the given query, and thus ranks the retrieved documents by their similarity value.
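A short sketch of the cosine computation above, with vectors as sparse {term: weight} dicts (the weights are assumed precomputed):

import math

def cosine_sim(doc_w, query_w):
    # Cosine of the angle between two t-dimensional weight vectors.
    dot = sum(w * query_w.get(term, 0.0) for term, w in doc_w.items())
    norm_d = math.sqrt(sum(w * w for w in doc_w.values()))
    norm_q = math.sqrt(sum(w * w for w in query_w.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(cosine_sim({"car": 0.8, "safety": 0.5}, {"car": 1.0}))  # ~0.85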
ir models11
IR Models
  • How can we compute the values of the weights w_{i,j}?
    • One of the most popular methods is based on combining two factors:
      • The importance of each index term in the document
      • The importance of the index term in the collection
ir models12
IR Models
  • Importance of the index term in the document:
    • This can be measured by the number of times that the term appears in the document: the more often a term is mentioned in a document, the better it describes that document. This is called the term frequency, denoted by the symbol tf.
ir models13
IR Models
  • The importance of the index term in the collection:
    • An index term that appears in every document in the collection is not very useful, but a term that occurs in only a few documents may indicate that these few documents could be relevant to a query that uses this term.
ir model1
IR Model

In other words, the importance of an index term in the collection is quantified by the inverse of the frequency with which the term occurs among the documents of the collection. This factor is usually called the inverse document frequency, or the idf factor.

ir models14
IR Models

Mathematically this can be expressed as:

idf_i = log(N / n_i)

Where:

N = number of documents in the collection

n_i = number of documents that contain the term i

ir models15
IR Models
  • Combining these two factors we can obtain the weight of an index term i in document j as:

w_{i,j} = tf_{i,j} × log(N / n_i)

Also called the tf-idf weighting scheme

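A toy tf-idf computation following the scheme above (raw term counts as tf; the three-document corpus is illustrative):

import math

corpus = {
    "d1": ["car", "safety", "car"],
    "d2": ["car", "industry"],
    "d3": ["tennis", "coach"],
}
N = len(corpus)

def tf_idf(term, doc_id):
    # w_{i,j} = tf_{i,j} * log(N / n_i)
    tf = corpus[doc_id].count(term)
    n_i = sum(1 for tokens in corpus.values() if term in tokens)
    return tf * math.log(N / n_i) if tf and n_i else 0.0

print(tf_idf("car", "d1"))     # 2 * log(3/2) ≈ 0.81
print(tf_idf("tennis", "d3"))  # 1 * log(3)   ≈ 1.10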
ir models16
IR Models
  • Vector Space Model
ir models17
IR Models
  • Probabilistic model
    • Assumption: given a query q and a document dj in the collection, this model tries to estimate the probability that the user will find the document dj relevant.
    • The model assumes that there exists an ideal subset R of the document collection that contains exactly the relevant documents.
ir models18
IR Models
  • Similarity in the probabilistic model is computed as:

sim(d_j, q) ∝ Σ_{k_i ∈ q ∩ d_j} [ log( P(k_i|R) / (1 − P(k_i|R)) ) + log( (1 − P(k_i|R̄)) / P(k_i|R̄) ) ]

Where:

P(k_i|R) is the probability that index term k_i is present in a document randomly selected from the set R, and P(k_i|R̄) is the same probability for a document selected from outside R
ir models19
IR Models

  • Estimation of P(k_i|R) and P(k_i|R̄):
    • Initial constant values: P(k_i|R) = 0.5 and P(k_i|R̄) = n_i / N
    • Iterative process to improve the estimates: P(k_i|R) = |V_i| / |V| and P(k_i|R̄) = (n_i − |V_i|) / (N − |V|)

V is the set of documents retrieved so far

V_i is the set of documents in V that contain k_i

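One iteration of this estimation, sketched in Python (the counts N, n_i, |V|, and |V_i| are illustrative):

import math

N, n_i = 1000, 50   # collection size; docs containing k_i
V, V_i = 20, 8      # retrieved docs; retrieved docs containing k_i

# Initial estimates: P(k_i|R) = 0.5, P(k_i|R_bar) = n_i / N
p_rel, p_nonrel = 0.5, n_i / N

# One refinement step using the retrieved set V
p_rel = V_i / V
p_nonrel = (n_i - V_i) / (N - V)

# Contribution of k_i to sim(d_j, q) when k_i appears in both q and d_j
weight = math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
print(round(weight, 2))  # ≈ 2.7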
ir models20
IR Models
  • Probabilistic Model
slide36
(The taxonomy diagram of IR models by user task, shown earlier on slide 12, is repeated here.)
retrieval evaluation
Retrieval Evaluation

(Diagram: within the collection, the set of relevant docs R, of size |R|, overlaps the answer set A, of size |A|; their intersection Ra, of size |Ra|, is the set of relevant docs in the answer set.)

Recall = |Ra| / |R|

Precision = |Ra| / |A|
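The two measures implied by the diagram, as a quick sketch (the sets are illustrative):

relevant = {1, 2, 3, 4, 5}       # R: all relevant docs in the collection
answer = {3, 4, 5, 6}            # A: docs returned by the system
ra = relevant & answer           # Ra: relevant docs in the answer set

recall = len(ra) / len(relevant)   # 3/5 = 0.6
precision = len(ra) / len(answer)  # 3/4 = 0.75
print(recall, precision)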
retrieval evaluation1
Retrieval Evaluation
  • Pooling

(Diagram: pooling. Systems 1, 2, and 3 each produce a ranked run; the top K documents from each run are merged into a pool of combined results, which is then passed to users for relevance evaluation.)

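A minimal pooling sketch (the runs and K are illustrative):

runs = [
    ["d1", "d2", "d3", "d4"],  # system 1
    ["d3", "d5", "d1", "d6"],  # system 2
    ["d7", "d2", "d8", "d9"],  # system 3
]
K = 3

# Pool = union of the top-K documents of every run; duplicates collapse.
pool = sorted({doc for run in runs for doc in run[:K]})
print(pool)  # these pooled documents go to the human assessors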
retrieval evaluation2
Retrieval Evaluation
  • Relevance:
    • Strength: it involves people (users) as the judges of retrieval effectiveness.
    • Weakness: it involves judgments by people, with all the associated problems of subjectivity and variability.
retrieval evaluation3
Retrieval Evaluation
  • Types of relevance: (Saracevic, 1999)
    • System or algorithmic relevance: relation between the query and information objects retrieved.
    • Topical or Subject relevance: relation between the topic expressed in the query and the topic covered in the retrieved documents.
    • Cognitive relevance or pertinence: relation between the state of knowledge and cognitive information need of a user and documents retrieved.
    • Situational relevance or utility: relation between the task or problem at hand and the documents retrieved.
    • Motivational or affective relevance: relation between the intents, goals and motivations of a user and the documents retrieved.
retrieval evaluation4
Retrieval Evaluation
  • T. Joachims. Evaluating retrieval performance using clickthrough data. In Text Mining: Theoretical Aspects and Applications, Physica-Verlag, 79-96, 2003.
  • Available online: http://www.cs.cornell.edu/People/tj/publications/joachims_02b.pdf
clickthrough for ir performance
Clickthrough for IR Performance
  • Main idea:
    • Use a unified interface that submits each query to two search engines.
    • Estimate relevance information from the links the user visits among the results returned by a search engine.
clickthrough for ir performance1
Clickthrough for IR Performance
  • Regular clickthrough data:
    • User types a query into a unified interface.
    • The query is sent to both search engines A & B.
    • One list of results is randomly selected and presented to the user.
    • The ranks of the links clicked by the user are recorded.
clickthrough for ir performance2
Clickthrough for IR Performance
  • Unbiased clickthrough data for comparing retrieval functions:
    • Blind test: the interface should hide the random variables underlying the hypothesis test to avoid biasing the user's response (placebo effect).
    • Click ⇒ preference: design the interface so that interaction with the system demonstrates a particular judgement by the user.
    • Low usability impact.
clickthrough for ir performance3
Clickthrough for IR Performance

Input: ranking A = (a1, a2, …), ranking B = (b1, b2, …)

Call: combine(A, B, 0, 0, ∅)

Output: combined ranking D

combine(A, B, ka, kb, D)
    if ka = kb {
        if A[ka+1] ∉ D { D := D + A[ka+1] }
        combine(A, B, ka+1, kb, D)
    } else {
        if B[kb+1] ∉ D { D := D + B[kb+1] }
        combine(A, B, ka, kb+1, D)
    }

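A runnable rendering of this interleaving in Python (iterative, with the list-end checks the slide omits):

def combine(a, b):
    # Merge rankings a and b so that any prefix of the result contains
    # roughly the same number of top links from each input.
    d, ka, kb = [], 0, 0
    while ka < len(a) or kb < len(b):
        if (ka == kb and ka < len(a)) or kb >= len(b):
            if a[ka] not in d:
                d.append(a[ka])
            ka += 1
        else:
            if b[kb] not in d:
                d.append(b[kb])
            kb += 1
    return d

print(combine(["u1", "u2", "u3"], ["u2", "u4", "u1"]))
# -> ['u1', 'u2', 'u4', 'u3']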
clickthrough for ir performance4
Clickthrough for IR Performance
  • Hypothesis test:
    • Use a binomial sign test (i.e., McNemar’s test) to detect significant deviations from the median
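A sketch of such a sign test with scipy (assumed available; the win counts are illustrative):

from scipy.stats import binomtest

# Out of 34 queries where the user expressed a preference,
# system A won 21 times (illustrative counts).
result = binomtest(21, 34, p=0.5, alternative="two-sided")
print(result.pvalue)  # small p-value -> a real preference, not chance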
clickthrough for ir performance5
Clickthrough for IR Performance
  • Experimental design:
    • Interface that sends query to two systems:
      • Google and MSNsearch.
      • Google and Default (50 links from MSNsearch in reverse order)
      • MSNsearch and Default
clickthrough for ir performance6
Clickthrough for IR Performance
  • Does the clickthrough evaluation agree with the relevance judgements?
    • Their conclusion from this small experiment is that there is a strong correlation between relevance judgements and clickthrough data.
    • Can this conclusion be generalized?