Modern Information Retrieval, Chapter 1: Introduction

Ricardo Baeza-Yates

Berthier Ribeiro-Neto

Motivation



  • Example of the user information need

    • Topic: NCAA college tennis team

    • Description: Find all the pages (documents) containing information on college tennis teams that (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament.

    • Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.

IR Research

  • Information retrieval vs Data retrieval

  • Research

    • information search

    • information filtering (routing)

    • document classification and categorization

    • user interfaces and data visualization

    • cross-language retrieval

IR History

  • 1970

  • 1990, WWW

The User Task

  • Retrieval (Searching)

    • classic information search process where clear objectives are defined

  • Browsing

    • a process where one’s main objectives are not clearly defined and might change during the interaction with the system

Logical View of the Documents

  • Text Operations

    • reduce the complexity of the document representation

    • a full text → a set of index terms

  • Steps

    1. Stopword removal

    2. Stemming

    3. Noun groups

    4. ...
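
The first two steps above can be sketched as follows. This is a minimal illustration, not the book's procedure: the stopword list, the suffix rules, and the names `naive_stem` and `index_terms` are all assumptions made for the example, and the suffix stripper is only a crude stand-in for a real stemming algorithm.

```python
# Hypothetical stopword list and suffix rules, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to"}
SUFFIXES = ("ing", "es", "s")  # crude stand-in for a real stemmer


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> set[str]:
    """Reduce a full text to a set of index terms."""
    # Tokenize on whitespace, trim punctuation, and lowercase.
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    # Step 1: stopword removal.
    kept = [t for t in tokens if t and t not in STOPWORDS]
    # Step 2: stemming.
    return {naive_stem(t) for t in kept}


terms = index_terms("Searching the indexes of documents.")
```

Here the five-word text is reduced to three index terms, which is the reduction in representation complexity the slide refers to.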

Past, Present, and Future

  • Early Development

    • Index

  • Library

    • Author name, title, subject headings, keywords

  • The Web and Digital Libraries

    • Hyperlinks

Conventional Text-Retrieval Systems

Automatic Text Processing, G. Salton, Addison-Wesley, 1989 (Chapter 9)

Data Retrieval

  • A specified set of attributes is used to characterize each record.

    EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)

  • Exact match between the attributes used in query formulations and those attached to the document.

    SELECT BDATE, ADDR
    FROM EMPLOYEE
    WHERE NAME = 'John Smith'
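
To see the exact-match behaviour concretely, the query can be run against an in-memory SQLite database. This is only a sketch: the column types and both sample rows are made up for the example, and only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are an assumption; the slide gives only the attribute names.
conn.execute(
    "CREATE TABLE EMPLOYEE (NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, "
    "SEX TEXT, SALARY REAL, DNO INTEGER)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1965-01-09", "1 Example St", "M", 30000.0, 5),
        ("John Smithe", "000-00-0002", "1970-03-02", "2 Example St", "M", 28000.0, 5),
    ],
)
# Exact match: only the record whose NAME equals the query value is returned.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
conn.close()
```

The near-miss name "John Smithe" is not returned: data retrieval has no notion of partial or ranked matches.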

Text-Retrieval Systems

  • Content identifiers (keywords, index terms, descriptors) characterize the stored texts.

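  • Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries and those attached to documents.

  • Identifiers are attached to documents by content analysis and to queries by query formulation.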

Possible Representation

  • Document representation (Text operation)

    • unweighted index terms (term vectors)

    • weighted index terms

  • Query (Query operation)

    • unweighted or weighted index terms

    • Boolean combinations (or, and, not)

  • Search operation must be effective

    • (Indexing)

File Structures

  • Main requirements

    • fast access for various kinds of searches

    • large number of indices

  • Alternatives

    • Inverted Files

    • Signature Files

    • PAT trees

Inverted Files

  • File is represented as an array of indexed documents.

Inverted-file process

  • The document-term array is inverted (transposed).
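
A minimal sketch of the inversion, assuming the array is stored as a list of binary rows. The four-document, three-term collection and the function name `invert` are hypothetical.

```python
def invert(doc_term):
    """Transpose a document-term array: rows become terms, columns become documents."""
    return [list(column) for column in zip(*doc_term)]


# Hypothetical collection: doc_term[d][t] is 1 when document d contains term t.
doc_term = [
    [1, 1, 0],  # D1
    [0, 1, 1],  # D2
    [1, 0, 1],  # D3
    [0, 0, 1],  # D4
]
term_doc = invert(doc_term)  # term_doc[t][d] is 1 when term t occurs in document d
```

After inversion, each row lists the documents in which one term occurs, so a term lookup no longer has to scan every document.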

Inverted-file process (Continued)

  • Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers.

  • Ex: Query= (term2 and term3)

    term2   1 1 0 0
    term3   0 1 1 1
    ----------------
    AND     0 1 0 0   <-- D2
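
The same query can be sketched in code using the two rows from the example. The helper names `and_rows` and `matching_docs` are made up for illustration, and documents are assumed to be numbered from D1 in column order.

```python
def and_rows(*rows):
    """Combine rows of the inverted array with a bitwise AND."""
    return [int(all(bits)) for bits in zip(*rows)]


def matching_docs(row):
    """Turn a combined row into a list of document identifiers (D1, D2, ...)."""
    return [f"D{i}" for i, bit in enumerate(row, start=1) if bit]


term2 = [1, 1, 0, 0]
term3 = [0, 1, 1, 1]
combined = and_rows(term2, term3)
answer = matching_docs(combined)
```

An `or` query would use `any` in place of `all`; the rest of the procedure is unchanged.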

List-merging for two ordered lists

  • The inverted-index operations used to obtain answers are based on a list-merging process.

  • Example

    T1: {D1, D3}
    T2: {D1, D2}
    Merged(T1, T2): {D1, D1, D2, D3}
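
The merge can be sketched as a standard two-pointer pass over both lists. Document identifiers are written as integers here (1 for D1, and so on) so that they sort numerically. Reading the merged list as described in the comments (a repeated identifier means the document is in both lists) is one common way to derive `and` and `or` answers and is offered as an assumption, not as the slide's exact procedure.

```python
def merge(a, b):
    """Merge two ordered posting lists into one ordered list, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def and_from_merged(merged):
    """Identifiers that appear twice in a row occur in both input lists."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]


def or_from_merged(merged):
    """Dropping adjacent repeats leaves every identifier from either list."""
    result = []
    for doc_id in merged:
        if not result or result[-1] != doc_id:
            result.append(doc_id)
    return result


t1 = [1, 3]  # T1: {D1, D3}
t2 = [1, 2]  # T2: {D1, D2}
merged = merge(t1, t2)
```

Because both inputs are already ordered, the merge takes a single pass whose cost is linear in the combined list length.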

Extensions of Inverted Index Operations (Distance Constraints)

  • Distance Constraints

    • (A within sentence B): terms A and B must co-occur in a common sentence

    • (A adjacent B): terms A and B must occur adjacently in the text

Extensions of Inverted Index Operations (Distance Constraints)

  • Implementation

    • include term-location in the inverted indexes

      information: {P345, P348, P350, …}
      retrieval: {P123, P128, P345, …}

    • include sentence-location in the indexes

      information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
      retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}

Extensions of Inverted Index Operations (Distance Constraints)

  • Include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences

    information: {P345, 2, 3, 5; …}
    retrieval: {P345, 2, 3, 6; …}

  • Query examples

    (information adjacent retrieval)
    (information within five words retrieval)

  • Cost: the size of indexes
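
A minimal sketch of how such a positional index could answer the two example queries. The dictionary layout and the function names are assumptions for illustration. The P345 entries come from the slide, while the P350 entries are made up to show a non-match. The check accepts the terms in either order, which is also an assumption.

```python
# term -> document -> list of (paragraph, sentence, word) positions
index = {
    "information": {"P345": [(2, 3, 5)], "P350": [(1, 1, 2)]},  # P350 is hypothetical
    "retrieval": {"P345": [(2, 3, 6)], "P350": [(4, 2, 9)]},    # P350 is hypothetical
}


def within_words(index, a, b, max_gap):
    """Documents where a and b occur in the same sentence, at most max_gap words apart."""
    hits = set()
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    # Only documents that contain both terms can satisfy the constraint.
    for doc in postings_a.keys() & postings_b.keys():
        for pa, sa, wa in postings_a[doc]:
            for pb, sb, wb in postings_b[doc]:
                if (pa, sa) == (pb, sb) and 0 < abs(wa - wb) <= max_gap:
                    hits.add(doc)
    return hits


def adjacent(index, a, b):
    """Adjacency is the special case of a one-word gap."""
    return within_words(index, a, b, 1)


adjacent_hits = adjacent(index, "information", "retrieval")
nearby_hits = within_words(index, "information", "retrieval", 5)
```

Every occurrence now carries three extra numbers, which is the index-size cost noted above.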

Retrieval Models

  • Classic Models

  • Set Theoretic

    • Extended Boolean

  • Generalized Vector

  • Latent Semantic Index

  • Neural Networks

  • Inference Network

  • Belief Network

Classic IR Model

  • Basic concept: each document is described by a set of representative keywords called index terms.

  • Numerical weights are assigned to index terms to capture their differing relevance for describing a document.

Boolean model

  • Binary decision criterion

  • Data retrieval model

  • Advantage

    • clean formalism, simplicity

  • Disadvantage

    • It is not simple to translate an information need into a Boolean expression.

    • exact matching may lead to retrieval of too few or too many documents

Vector model

  • Assign non-binary weights to index terms in queries and in documents. => TFxIDF

  • Compute the similarity between documents and query. => Sim(Dj, Q)

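  • Because documents are ranked by similarity, partial matches are possible, and the answers are more precise than with the Boolean model.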
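
A minimal sketch of the two steps, under stated assumptions: weights are raw term frequency times log(N / document frequency), and Sim(Dj, Q) is taken to be the cosine of the angle between the weight vectors. Both are common choices rather than something the slide specifies, and the tiny collection and the function names are made up for the example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return (similarity, document index) pairs, best match first."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) for t in df}

    def weigh(tokens):
        # TF x IDF weights; terms unseen in the collection are ignored.
        tf = Counter(t for t in tokens if t in idf)
        return {t: tf[t] * idf[t] for t in tf}

    q = weigh(query)
    scores = [(cosine(weigh(d), q), i) for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)


# Hypothetical collection of already-tokenized documents.
docs = [
    ["college", "tennis", "team"],
    ["college", "football", "team"],
    ["tennis", "ranking"],
]
ranking = rank(docs, ["tennis", "ranking"])
```

The first document shares only one query term and still receives a non-zero score, a partial match that a Boolean `and` query would reject outright.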

Term Weights

  • Term weights: Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}

  • Issues

    • How to generate the term weights?

    • How to apply the term weights?

      • Sum the weights of all document terms that match the given query.

      • Rank the output documents in descending order of their summed term weights.
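
The two steps for applying term weights can be sketched as below. The weights for D1 follow the example vector above, while D2, the query, and the function names are hypothetical.

```python
def summed_weight(doc_weights, query_terms):
    """Sum the weights of all document terms that match the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)


def rank_by_weight_sum(docs, query_terms):
    """Return (document, score) pairs in descending order of score."""
    scored = [(name, summed_weight(weights, query_terms)) for name, weights in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},  # hypothetical second document
}
ranked = rank_by_weight_sum(docs, {"T1", "T2"})
```

For the query {T1, T2}, D2 scores about 0.9 and D1 scores about 0.7, so D2 is listed first.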

Boolean Query with Term Weights

  • Transform a Boolean expression into disjunctive normal form.

    T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)

  • For each conjunct, compute the minimum term weight of any document term in that conjunct.

  • The document weight is the maximum of all the conjunct weights.
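
A minimal sketch of this min/max rule, assuming the query is already in disjunctive normal form and is stored as a list of conjuncts, each a list of terms. Treating a term that is absent from the document as weight 0 is an assumption, and the function names are made up.

```python
def conjunct_weights(doc_weights, dnf):
    """Minimum term weight within each conjunct (missing terms count as 0.0)."""
    return [min(doc_weights.get(t, 0.0) for t in conjunct) for conjunct in dnf]


def document_weight(doc_weights, dnf):
    """The document weight is the maximum over all conjunct weights."""
    return max(conjunct_weights(doc_weights, dnf))


# Q = (T1 and T2) or T3, already in disjunctive normal form.
query = [["T1", "T2"], ["T3"]]
d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d1_conjuncts = conjunct_weights(d1, query)
d1_weight = document_weight(d1, query)
```

For this document the conjunct weights are 0.2 and 0.6, so the document weight is 0.6.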

Boolean Query with Term Weights

  • Example: Q = (T1 and T2) or T3

    Document vector                      (T1 and T2)   (T3)   (T1 and T2) or T3
    D1 = (T1, 0.2; T2, 0.5; T3, 0.6)     0.2           0.6    0.6
    D2 = (T1, 0.7; T2, 0.2; T3, 0.1)     0.2           0.1    0.2

    D1 is preferred.



  • Conventional IR systems

  • Evaluation

  • Text operations (Term selection)

  • Query operations (Pattern matching, Relevance feedback)

  • Indexing (File structure)

  • Modeling



  • Journals

    • Journal of the American Society for Information Science

    • ACM Transactions on Information Systems

    • Information Processing and Management

    • Information Systems (Elsevier)

    • Knowledge and Information Systems (Springer)

  • Conferences

    • ACM SIGIR, DL, CIKM, CHI, etc.

    • Text Retrieval Conference (TREC)
