Modern Information Retrieval Chapter 1: Introduction

Ricardo Baeza-Yates

Berthier Ribeiro-Neto


Motivation

  • Example of the user information need

    • Topic: NCAA college tennis team

    • Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament.

    • Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.

IR Research

  • Information retrieval vs Data retrieval

  • Research

    • information search

    • information filtering (routing)

    • document classification and categorization

    • user interfaces and data visualization

    • cross-language retrieval

IR History

  • 1970

  • 1990, WWW

The User Task

  • Retrieval (Searching)

    • classic information search process where clear objectives are defined

  • Browsing

    • a process where one’s main objectives are not clearly defined and might change during the interaction with the system

Logical View of the Documents

  • Text Operations

    • reduce the complexity of the document representation

    • a full text → a set of index terms

  • Steps

    1. Stopword removal

    2. Stemming

    3. Noun groups

    4. ...
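The first two steps above can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the stopword list, the suffix rules in `naive_stem`, and the sample sentence are all assumptions made for the example, and the suffix stripping is a crude stand-in for a real stemmer. Noun-group detection is not covered.

```python
import re

# Illustrative stopword list (an assumption; real systems use longer lists).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "which"}

# Crude suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    if word.endswith(("ss", "is")):  # leave words such as "tennis" alone
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    """Reduce a full text to a sorted list of distinct index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize
    kept = [t for t in tokens if t not in STOPWORDS]      # 1. stopword removal
    return sorted({naive_stem(t) for t in kept})          # 2. stemming


terms = index_terms("The teams are participating in the NCAA tennis tournaments")
print(terms)
```

The result is a much smaller representation than the full text, which is the point of the text operations: the document is reduced to a handful of index terms.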

Past, Present, and Future

  • Early Development

    • Index

  • Library

    • Author name, title, subject headings, keywords

  • The Web and Digital Libraries

    • Hyperlinks

Conventional Text-Retrieval Systems

Automatic Text Processing, G. Salton, Addison-Wesley, 1989 (Chapter 9)

Data Retrieval

  • A specified set of attributes is used to characterize each record.

    EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)

  • Exact match between the attributes used in query formulations and those attached to the document.

    SELECT BDATE, ADDR
    FROM EMPLOYEE
    WHERE NAME = 'John Smith'
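To make the exact-match behaviour concrete, the query above can be run against an in-memory SQLite table using Python's standard `sqlite3` module. The two rows are made-up data for illustration only; the column types are also an assumption, since the slide gives only the attribute names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, "
    "SEX TEXT, SALARY REAL, DNO INTEGER)"
)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1965-01-09", "1 Example St", "M", 30000.0, 5),
        ("John Smithson", "000-00-0002", "1970-03-02", "2 Example St", "M", 28000.0, 4),
    ],
)

# Exact match on NAME: "John Smithson" is not returned, however similar it looks.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
print(rows)
conn.close()
```

A record either satisfies the condition or it does not; there is no notion of a partially matching record or of ranking the answers.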

Text-Retrieval Systems

  • Content identifiers (keywords, index terms, descriptors) characterize the stored texts.

  • Degrees of coincidence between the sets of identifiers attached to queries and documents

[Figure labels: content analysis; query formulation]

Possible Representation

  • Document representation (Text operation)

    • unweighted index terms (term vectors)

    • weighted index terms

  • Query (Query operation)

    • unweighted or weighted index terms

    • Boolean combinations (or, and, not)

  • Search operation must be effective

    • (Indexing)

File Structures

  • Main requirements

    • fast access for various kinds of searches

    • large number of indices

  • Alternatives

    • Inverted Files

    • Signature Files

    • PAT trees

Inverted Files

  • File is represented as an array of indexed documents.

Inverted-file process

  • The document-term array is inverted (transposed).

Inverted-file process (Continued)

  • Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers.

  • Ex: Query= (term2 and term3)

    term2   1 1 0 0
    term3   0 1 1 1
    ---------------
              1       <-- D2
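The inversion and the combination of rows can be sketched as follows. The rows for term2 and term3 reproduce the example (1100 and 0111 over documents D1 to D4); the term1 column and the helper name `boolean_and` are assumptions added for illustration.

```python
# Document-term array: one row per document, one column per term.
# The term2 and term3 columns match the example; term1 is arbitrary.
terms = ["term1", "term2", "term3"]
doc_term = {
    "D1": [1, 1, 0],
    "D2": [0, 1, 1],
    "D3": [1, 0, 1],
    "D4": [0, 0, 1],
}
docs = list(doc_term)

# Invert (transpose) the array: one row per term, one column per document.
term_doc = {t: [doc_term[d][i] for d in docs] for i, t in enumerate(terms)}


def boolean_and(term_a: str, term_b: str) -> list[str]:
    """Combine two rows of the inverted array into one list of document ids."""
    row_a, row_b = term_doc[term_a], term_doc[term_b]
    return [d for d, a, b in zip(docs, row_a, row_b) if a and b]


result = boolean_and("term2", "term3")
print(term_doc["term2"], term_doc["term3"], result)
```

Only D2 has a 1 in both rows, so it is the single answer to (term2 and term3).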

List-merging for two ordered lists

  • The inverted-index operations to obtain answers are based on list-merging process.

  • Example

    T1: {D1, D3}
    T2: {D1, D2}
    Merged(T1, T2): {D1, D1, D2, D3}
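A minimal sketch of the merge for two ordered lists follows. The function names are hypothetical. Reading the merged list so that a repeated identifier means "in both lists" (and) and the distinct identifiers give "in either list" (or) is one plausible interpretation of the example, and it assumes each input list contains no duplicates.

```python
def merge_ordered(a: list[str], b: list[str]) -> list[str]:
    """Merge two ordered lists into one ordered list, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def conjunction(merged: list[str]) -> list[str]:
    """Ids that appear twice in a row occur in both input lists (T1 and T2)."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]


def disjunction(merged: list[str]) -> list[str]:
    """Distinct ids of the merged list (T1 or T2)."""
    out = []
    for x in merged:
        if not out or out[-1] != x:
            out.append(x)
    return out


# String comparison orders these single-digit ids correctly;
# integer ids would be safer for larger collections.
t1 = ["D1", "D3"]
t2 = ["D1", "D2"]
merged = merge_ordered(t1, t2)
print(merged, conjunction(merged), disjunction(merged))
```

Because both inputs are already ordered, the merge visits each entry once, so its cost grows with the combined length of the two lists.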

Extensions of Inverted Index Operations (Distance Constraints)

  • Distance Constraints

    • (A within sentence B): terms A and B must co-occur in a common sentence

    • (A adjacent B): terms A and B must occur adjacently in the text

Extensions of Inverted Index Operations (Distance Constraints)

  • Implementation

    • include term-location in the inverted indexes

      information: {P345, P348, P350, …}
      retrieval: {P123, P128, P345, …}

    • include sentence-location in the indexes

      information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
      retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}

Extensions of Inverted Index Operations (Distance Constraints)

  • Include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences

    information: {P345, 2, 3, 5; …}
    retrieval: {P345, 2, 3, 6; …}

  • Query examples

    (information adjacent retrieval)
    (information within five words retrieval)

  • Cost: the size of indexes
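The two query examples can be evaluated against such an index roughly as follows. Each posting is read here as (document, paragraph number, sentence number, word number), which is an assumption about the layout of the entries; the P345 postings come from the example, while the P348 postings and the function names are made up for illustration.

```python
# Postings: (document, paragraph, sentence within paragraph, word within sentence).
index = {
    "information": [("P345", 2, 3, 5), ("P348", 1, 2, 4)],  # P348 entry is made up
    "retrieval": [("P345", 2, 3, 6), ("P348", 1, 2, 8)],    # P348 entry is made up
}


def within_words(term_a: str, term_b: str, max_gap: int) -> list[str]:
    """Documents where both terms occur in one sentence, at most max_gap words apart."""
    hits = set()
    for doc_a, par_a, sen_a, word_a in index.get(term_a, []):
        for doc_b, par_b, sen_b, word_b in index.get(term_b, []):
            same_sentence = (doc_a, par_a, sen_a) == (doc_b, par_b, sen_b)
            if same_sentence and 0 < abs(word_a - word_b) <= max_gap:
                hits.add(doc_a)
    return sorted(hits)


def adjacent(term_a: str, term_b: str) -> list[str]:
    """Documents where the two terms occur next to each other (in either order)."""
    return within_words(term_a, term_b, 1)


adjacent_docs = adjacent("information", "retrieval")
near_docs = within_words("information", "retrieval", 5)
print(adjacent_docs, near_docs)
```

Every occurrence of every term now carries three extra numbers, which is where the index-size cost noted above comes from.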

Retrieval models

[Figure: taxonomy of retrieval models. Surviving labels: Set Theoretic; Extended Boolean; Classic Models; Generalized Vector; Latent Semantic Index; Neural Networks; Inference Network; Belief Network]

Classic IR Model

  • Basic concepts: Each document is described by a set of representative keywords called index terms.

  • Assign numerical weights to index terms to capture their differing relevance.

Boolean model

  • Binary decision criterion

  • Data retrieval model

  • Advantage

    • clean formalism, simplicity

  • Disadvantage

    • It is not simple to translate an information need into a Boolean expression.

    • exact matching may lead to retrieval of too few or too many documents

Vector model

  • Assign non-binary weights to index terms in queries and in documents. => TFxIDF

  • Compute the similarity between documents and query. => Sim(Dj, Q)

  • More precise than the Boolean model.
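A small sketch of the vector model follows. It uses one common TF×IDF variant (raw term frequency times log(N/df)) and cosine similarity for Sim(Dj, Q); both choices are assumptions, since the slide does not fix the exact formulas, and the three documents are made-up examples.

```python
import math
from collections import Counter

# Made-up collection for illustration.
docs = {
    "D1": "college tennis team ranking",
    "D2": "college football team",
    "D3": "tennis racket review",
}


def tokenize(text: str) -> list[str]:
    return text.lower().split()


N = len(docs)
doc_tf = {d: Counter(tokenize(t)) for d, t in docs.items()}
# Document frequency: number of documents containing each term.
df = Counter(term for tf in doc_tf.values() for term in tf)


def tfidf(tf: Counter) -> dict[str, float]:
    """Weight each term by tf * log(N / df); unseen terms get weight 0."""
    return {t: f * math.log(N / df[t]) if df[t] else 0.0 for t, f in tf.items()}


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query: str) -> list[tuple[str, float]]:
    """Return (document, Sim(Dj, Q)) pairs, most similar first."""
    q_vec = tfidf(Counter(tokenize(query)))
    scores = {d: cosine(tfidf(tf), q_vec) for d, tf in doc_tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank("college tennis team")
print(ranking)
```

Unlike the Boolean model, every document receives a graded score, so partial matches such as D2 and D3 are still returned, ranked below the document that matches all three query terms.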

Term Weights

  • Term weights: Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}

  • Issues

    • How to generate the term weights?

    • How to apply the term weights?

      • Sum the weights of all document terms that match the given query.

      • Rank the output documents in descending order of the summed weight.
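The two steps above amount to a few lines of code. In this sketch the document vectors and the query are made-up values, and documents with no matching term are left out of the output, which is an assumption about the intended behaviour.

```python
# Made-up weighted document vectors for illustration.
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
    "D3": {"T2": 0.4},
}


def score(doc_weights: dict[str, float], query_terms: list[str]) -> float:
    """Sum the weights of the document terms that match the query."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)


def rank(query_terms: list[str]) -> list[tuple[str, float]]:
    """Rank matching documents in descending order of summed weight."""
    scored = [(d, score(w, query_terms)) for d, w in docs.items()]
    matching = [(d, s) for d, s in scored if s > 0.0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


ranking = rank(["T2", "T3"])
print(ranking)
```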

Boolean Query with Term Weights

  • Transform a Boolean expression into disjunctive normal form.

    T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)

  • For each conjunct, compute the minimum term weight of any document term in that conjunct.

  • The document weight is the maximum of all the conjunct weights.

Boolean Query with Term Weights

  • Example: Q = (T1 and T2) or T3

    Document vectors               Conjunct weights       Query weight
                                   (T1 and T2)   (T3)     (T1 and T2) or T3
    D1=(T1,0.2; T2,0.5; T3,0.6)    0.2           0.6      0.6
    D2=(T1,0.7; T2,0.2; T3,0.1)    0.2           0.1      0.2

  • D1 is preferred.
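The min/max rule can be checked with a short sketch. The query is given directly in disjunctive normal form as a list of conjuncts; the function names are hypothetical, and the second document vector is read as (T1, 0.7; T2, 0.2; T3, 0.1), which is an assumption about its first term label.

```python
def conjunct_weight(doc: dict[str, float], conjunct: tuple[str, ...]) -> float:
    """Minimum weight of the document terms in one conjunct (absent term = 0)."""
    return min(doc.get(term, 0.0) for term in conjunct)


def query_weight(doc: dict[str, float], dnf: list[tuple[str, ...]]) -> float:
    """Document weight: maximum over the weights of all conjuncts."""
    return max(conjunct_weight(doc, conjunct) for conjunct in dnf)


# Q = (T1 and T2) or T3, already in disjunctive normal form.
query = [("T1", "T2"), ("T3",)]

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}  # first term label assumed to be T1

w1 = query_weight(d1, query)
w2 = query_weight(d2, query)
print(w1, w2)
```

For D1 the conjunct weights are min(0.2, 0.5) = 0.2 and 0.6, giving 0.6; for D2 they are min(0.7, 0.2) = 0.2 and 0.1, giving 0.2, so D1 ranks first.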


  • Conventional IR systems

  • Evaluation

  • Text operations (Term selection)

  • Query operations (Pattern matching, Relevance feedback)

  • Indexing (File structure)

  • Modeling


  • Journals

    • Journal of the American Society for Information Science

    • ACM Transactions on Information Systems

    • Information Processing and Management

    • Information Systems (Elsevier)

    • Knowledge and Information Systems (Springer)

  • Conferences

    • ACM SIGIR, DL, CIKM, CHI, etc.

    • Text Retrieval Conference (TREC)
