

Natural Language Processing Applications

Lecture 7

Fabienne Venant

Université Nancy2 / Loria

What is Information Retrieval?
  • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
  • Applications:
    • Many universities and public libraries use IR systems to provide access to books, journals and other documents.
    • Web search
      • Large volumes of unstable, unstructured data
      • Speed is important
    • Cross-language IR
      • Finding documents written in another language
      • Touches on Machine translation
    • ....
  • The set of texts can be very large, hence efficiency is a concern
  • Textual data is noisy, incomplete and untrustworthy, hence robustness is a concern
  • Information may be hidden:
    • Need to derive information from raw data
    • Need to derive information from vaguely expressed needs
IR Basic concepts
  • Information needs : queries and relevance
  • Indexing: helps speed up retrieval
  • Retrieval models: describe how to search and recover relevant documents
  • Evaluation: IR systems are large and convincing evaluation is tricky
Information needs
  • INFORMATION NEED : the topic about which the user desires to know more
  • QUERY : what the user conveys to the computer in an attempt to communicate the information need
  • RELEVANCE : a document is relevant if the user perceives it as containing information of value with respect to their personal information need

Example:

    • topic: “pipeline leaks”
    • relevant documents: it doesn’t matter whether they use those words or express the concept with other words, such as “pipeline rupture”.
Capturing information needs
  • Information needs can be hard to capture
  • One possibility : use natural language
    • Advantage: expressive enough to allow all needs to be described
    • Drawbacks:
      • Semantic analysis of arbitrary NL is very hard
      • Users may not want to type full-blown sentences into a search engine
  • Information needs are typically expressed as a query :
    • Where shall I go on holiday? → holiday destinations
  • Two main types of possible queries, illustrated for the information need “How much blood does the human heart pump in one minute?”
    • Boolean queries :

 heart AND blood AND minutes

    • Web-type queries :

 human biology

  • A query :
    • is usually quite short and incomplete;
    • may contain misspelled or poorly selected words
    • may contain too many or too few words
  • The information need :
    • may be difficult to describe precisely, especially when the user isn't familiar with the topic
  • Precise understanding of the document content is difficult.
Persistent vs one-off Queries

Queries may or may not evolve over time

  • Persistent queries :
    • predefined and routinely performed :
      • Top ten performing shares today
      • Continuous queries : persistent queries that allow users to receive new results when they become available
    • typical of Information extraction and News Routing systems
  • One-off (or ad-hoc) queries
    • created to obtain information as the need arises
    • typical of Web searching
  • Relevance is subjective
    • “python” : ambiguous in general, but not for the user who issues the query
    • Topicality vs. utility: a document is relevant with respect to a specific goal

 A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.

  • Relevance is a gradual concept (a document is not just relevant or not; it is more or less relevant to a query)
  • IR systems usually rank retrieved documents by relevance
    • But many algorithms use a binary decision of relevance.
  • An IR system looks for data matching some criteria defined by the users in their queries.
  • The language used to ask a question is called the query language.
  • These queries use keywords (atomic items characterizing some data).
  • The basic unit of data is a document (can be a file, an article, a paragraph, etc.).
  • A document corresponds to free text (may be unstructured).
  • All the documents are gathered into a collection (or corpus).
Searching for a given word in a document
  • One way to do that is to start at the beginning and to read through all the text
    • Pattern matching (regular expressions) plus the speed of modern computers: grepping through text can be very effective
  • Enough for simple querying of modest collections (millions of words)
  • But for many purposes, you do need more:
    • To process large document collections (billions or trillions of words) quickly.
    • To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”.
    • To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words

→ You need an index

Motivation for Indexing
  • Extremely large dataset
  • Only a tiny fraction of the dataset is relevant to a given query
  • Speed is essential (0.25 second for web searching)
  • Indexing helps speed up retrieval
Indexing documents
  • How to relate the user’s information need with some documents’ content ?
  • Idea : using an index to refer to documents
  • Usually an index is a list of the terms that appear in a document. It can be represented mathematically as:

$\mathrm{index} : doc_i \mapsto \bigcup_j \{ keyword_j \}$

  • Here, the kind of index we use maps keywords to the list of documents they appear in:

$\mathrm{index}' : keyword_j \mapsto \bigcup_i \{ doc_i \}$

  • We call this an inverted index.
Indexing documents
  • The set of keywords is usually called the dictionary (or vocabulary)
  • A document identifier appearing in the list associated with a keyword is called a posting
  • The list of document identifiers associated with a given keyword is called a posting list
Inverted files

The most common indexing technique

  • Source file: collection organised by documents
  • Inverted file: collection organised by terms
Inverted Index
  • Given a dictionary of terms (also called vocabulary or lexicon)
  • For each term, record in a list which documents the term occurs in
  • Each item in the list:
    • records that a term appeared in a document
    • and, later, often, the positions in the document
    • is conventionally called a posting
  • The list is then called a postings list (or inverted list).
Inverted Index

From “Introduction to Information Retrieval”, C. D. Manning, P. Raghavan and H. Schütze


Draw the inverted index that would be built for the following document collection

  • Doc 1 breakthrough drug for schizophrenia
  • Doc 2 new schizophrenia drug
  • Doc 3 new approach for treatment of schizophrenia
  • Doc 4 new hopes for schizophrenia patients

For this document collection, what are the returned results for these queries:

    • schizophrenia AND drug
    • schizophrenia AND NOT(drug OR approach)
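The exercise above can be checked with a minimal Python sketch. It assumes whitespace tokenization, lowercasing and no stop-word removal; the names `build_inverted_index` and `docs_for` are illustrative and not part of the lecture.

```python
from collections import defaultdict

docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

def build_inverted_index(collection):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
all_docs = set(docs)

def docs_for(term):
    """Postings of a term as a set (empty if the term is not in the dictionary)."""
    return set(index.get(term, []))

# schizophrenia AND drug
q1 = docs_for("schizophrenia") & docs_for("drug")
# schizophrenia AND NOT (drug OR approach)
q2 = docs_for("schizophrenia") & (all_docs - (docs_for("drug") | docs_for("approach")))

print(sorted(q1))  # [1, 2]
print(sorted(q2))  # [4]
```

Python sets are used here only to keep the Boolean operators readable; they stand in for the sorted postings lists of a real index.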
Indexing documents
  • Arising questions: how to build an index automatically ? What are the relevant keywords ?
  • Some additional desiderata:
    • fast processing of large collections of documents,
    • having flexible matching operations (robust retrieval),
    • having the possibility to rank the retrieved documents in terms of relevance
  • To ensure these requirements (especially fast processing) are fulfilled, the indexes are computed in advance
  • Note that the format of the index has a huge impact on the performance of the system
Indexing documents

NB: an index is built in 4 steps:

  • Gathering of the collection (each document is given a unique identifier)
  • Segmentation of each document into a list of atomic tokens → tokenization
  • Linguistic processing of the tokens in order to normalize them → lemmatization
  • Indexing the documents by computing the dictionary and lists of postings
Manual indexing
  • Advantages
    • Human judgements are the most reliable
    • Retrieval is better
  • Drawbacks
    • Time consuming
    • Not always consistent
      • different people build different indexes for the same document.
Automatic indexing
  • Using NLU?
    • Not fast enough in real world settings (e.g., web search)
    • Not robust enough (low coverage)
    • Difficulty : what to include and what to exclude.
      • Indexes should not contain headings for topics for which there is no information in the document
      • Can a machine parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text?
Stop list
  • A stop list is a list of words that are discarded during indexing
    • some extremely common words, which appear to be of little value in helping select documents matching a user need, are excluded from the vocabulary entirely.
  • These words are called STOP WORDS
  • Collection strategy :
    • Sort the terms by collection frequency (the total number of times each term appears in the document collection),
    • Take the most frequent terms
      • often hand-filtered for their semantic content relative to the domain of the documents being indexed
    • What counts as a stop word depends on the collection
      • in a collection of legal articles, “law” can be considered a stop word
  • Ex:
    • a an and are as at be by for from has he in is it its of on that the to was were will with
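The collection strategy above (sort terms by collection frequency, take the most frequent ones) can be sketched as follows. The sample documents and the function names are hypothetical, tokenization is plain whitespace splitting, and the hand-filtering step mentioned on the slide is left to the reader.

```python
from collections import Counter

def collection_frequencies(documents):
    """Total number of occurrences of each term over the whole collection."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts

def candidate_stop_words(documents, k):
    """The k most frequent terms: candidates to be hand-filtered afterwards."""
    return [term for term, _ in collection_frequencies(documents).most_common(k)]

# Hypothetical toy collection.
sample_docs = [
    "the drug is new and the trial is over",
    "the patients are in the trial",
    "a new approach to the treatment of the disease",
]

print(candidate_stop_words(sample_docs, 3))
```

On a collection this small the ranking after the first term is decided by ties, which is one reason the slide recommends filtering the list by hand.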
Why eliminate stop words?
  • Efficiency
    • Eliminating stop words reduces the size of the index considerably
    • Eliminating stop words reduces retrieval time considerably
  • “Quality of results”
    • Most of the time not indexing stop words does little harm
      • keyword searches with terms like the and by don’t seem very useful
    • BUT, this is not true for phrase searches.
      • The phrase query “President of the United States” is more precise than President AND “United States”.
      • The meaning of “flights to London” is likely to be lost if the word “to” is stopped out.
      • .....
Building the vocabulary
  • Processing a stream of characters to extract keywords
  • 1st task: tokenization, main difficulties:
    • token delimiters (ex: Chinese)
    • apostrophes (ex: O’Neill, Finland’s capital)
    • hyphens (ex: Hewlett-Packard, state-of-the-art)
    • segmented compound nouns (ex: Los Angeles)
    • unsegmented compound nouns (icecream, breadknife)
    • numerical data (dates, IP addresses)
    • word order (ex: Arabic wrt nouns and numbers)
Solutions for tokenization issues:
  • Using a pre-defined dictionary with largest matches and heuristics for unknown words
  • Using learning algorithms trained over hand-segmented words
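The first solution (dictionary lookup with largest matches) can be sketched as a greedy left-to-right segmenter. The toy dictionary and the fallback for unknown material (emit a single character) are assumptions made for illustration; real systems use richer heuristics.

```python
def max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary entry.

    Unknown material falls back to single-character tokens (a simple heuristic).
    """
    longest = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical dictionary for the unsegmented compounds mentioned above.
toy_dictionary = {"ice", "cream", "icecream", "bread", "knife", "breadknife"}

print(max_match("icecream", toy_dictionary))     # ['icecream']
print(max_match("breadknifes", toy_dictionary))  # ['breadknife', 's']
```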
Choosing keywords
  • Selecting the words that are most likely to appear in a query
    • These words characterize the documents they appear in
    • Which are they?
The bag of words approach
  • Extreme interpretation of the principle of compositional semantics
  • The meaning of documents resides solely in the words that are contained within them
  • The exact ordering of the terms in a document is ignored but the number of occurrences of each term is material

“Not the same thing a bit!” said the Hatter.

“You might just as well say that ‘I see what I eat’ is the same thing as ‘I eat what I see’!”

“You might just as well say,” added the March Hare, “that ‘I like what I get’ is the same thing as ‘I get what I like’!”

“You might just as well say,” added the Dormouse, who seemed to be talking in its sleep, “that ‘I breathe when I sleep’ is the same thing as ‘I sleep when I breathe’!”

Bags of words
  • Nevertheless, it seems intuitive that two documents with similar bag of words representations are similar in content.
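A short Python sketch makes the Hatter's objection concrete: under a bag of words representation, the two sentences he contrasts become identical. The tokenizer (lowercased runs of letters) is an assumption.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Term -> number of occurrences; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag_of_words("I see what I eat")
b = bag_of_words("I eat what I see")

print(a == b)   # True: same bag, different meaning
print(a["i"])   # 2: the number of occurrences is kept
```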
What’s in a bag of words?
  • Are all words in a document equally important?
    • stop words do not contribute in any way to retrieval and scoring
    • A bag of words contains terms
      • What should count as a term?
        • Words
        • Phrases (e.g., president of the US)
Morphological normalization
  • Should index terms be word forms, lemmas or stems?
    • Matching morphological variants increases recall
    • Example morphological variants :
      • anticipate, anticipating, anticipated, anticipation
      • Company/Companies, sell/sold
      • USA vs U.S.A.,
      • 22/10/2007 vs 10/22/2007 vs 2007/10/22
      • university vs University
    • Idea: using equivalence classes of terms,
      • ex: { Opel, OPEL, opel } → opel
    • Two techniques:
      • stemming : refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time
      • Lemmatisation : refers to doing things properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return a dictionary form of a word, which is known as the lemma.
  • NB: documents and queries have to be processed using the same tokenization process !
Stemming and Lemmatization
  • Role: reducing inflectional forms to common base forms,
  • Example:
    • car, cars, car’s, cars’ → car
    • am, are, is → be
  • Stemming removes suffixes (surface markers) to produce root forms
  • Lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser)
  • Illustration of the difficulty:
    • plurals (woman/women, crisis/crises)
    • derivational morphology (automatize/automate)
  • English → Porter stemming algorithm (University of Cambridge, UK, 1980)
Porter stemmer
  • Algorithm based on a set of context-sensitive rewriting rules

  • Rules are composed of a pattern (left-hand-side) and a string (right-hand-side), example:

(.*)sses → \1ss     sses → ss : caresses → caress

(.*)ies → \1i     ies → i : ponies → poni, ties → ti

(.*)ss → \1ss     ss → ss : caress → caress

  • Rules may be constrained by conditions on the word’s measure, example:

(m > 1) (.*)ement → \1     replacement → replac but not cement → c

(m > 0) (.*)eed → \1ee     feed → feed but agreed → agree

(*v*) (.*)ed → \1     plastered → plaster but bled → bled

(*v*) (.*)ing → \1     motoring → motor but sing → sing
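A few of these rules can be tried out with a minimal Python sketch. It covers only the plural rules and the (*v*) rules shown above, treats only a, e, i, o, u as vowels (a simplification), and is not the full Porter algorithm; the function names are illustrative.

```python
import re

def step_plural(word):
    """Plural rules from the slide: sses -> ss, ies -> i, ss -> ss."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    return word  # covers the identity rule ss -> ss

def strip_if_vowel_in_stem(word, suffix):
    """Condition (*v*): drop the suffix only if the remaining stem contains a vowel."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if re.search(r"[aeiou]", stem):
            return stem
    return word

for w in ("caresses", "ponies", "ties", "caress"):
    print(w, "->", step_plural(w))
for w, suffix in (("plastered", "ed"), ("bled", "ed"), ("motoring", "ing"), ("sing", "ing")):
    print(w, "->", strip_if_vowel_in_stem(w, suffix))
```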

Porter stemmer: word measure
  • Assume that a sequence of one or more consonants is denoted by C, and a sequence of one or more vowels by V
  • Any word, or part of a word has one of the four forms:
    • CVCV ... C
    • CVCV ... V
    • VCVC ... C
    • VCVC ... V
  • These may all be represented by the single form
    • [C]VCVC ... [V] where the square brackets denote arbitrary presence of their contents.
  • Using (VC)^m to denote VC repeated m times, this may again be written as
    • [C](VC)^m[V].
  • m will be called the measure of any word or word part when represented in this form.
  • Here are some examples:
    • m=0 TR,   EE,   TREE,   Y,   BY
    • m=1 TROUBLE,   OATS,   TREES,   IVY
    • Example rule: (m > 1) EMENT →
      • This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.
  • What is the Porter measure of the following words (give your computation) ?
    • crepuscular
    • rigorous
    • placement

cr ep usc ul ar → m = 4

r ig or ous → m = 3

pl ac em ent → m = 3
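The measure can also be computed mechanically, which is a convenient way to check the answers above. In this sketch the treatment of Y (a consonant at the start of a word or after a vowel, a vowel after a consonant) follows Porter's original definition, which the slide does not spell out; the function names are illustrative.

```python
import re

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Number m of VC groups when the word is written as [C](VC)^m[V]."""
    word = word.lower()
    shape = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    shape = re.sub(r"C+", "C", re.sub(r"V+", "V", shape))  # collapse runs
    return shape.count("VC")

def strip_ement(word):
    """Rule (m > 1) EMENT -> (empty): applies only if the stem has measure > 1."""
    stem = word[:-5]
    if word.endswith("ement") and measure(stem) > 1:
        return stem
    return word

for w in ("crepuscular", "rigorous", "placement"):
    print(w, measure(w))
print(strip_ement("replacement"), strip_ement("cement"))  # replac cement
```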

  • Most stemmers also remove suffixes such as ed, ing, ational, ation, able, ism...
    • Relational → relate
  • Most stemmers don’t use lexical look up
  • There are shortcomings:
    • Stemming can result in non-words
      • Organization → Organ
      • Doing → doe
    • Unrelated words can be reduced to the same stem
      • police, policy → polic
  • Popular stemmers
    • Porter’s
    • Lovins’
    • Iterated Lovins’
    • Kstem
  • In lemmatization, exceptions need to be handled:
    • sought → seek, sheep → sheep, feet → foot
  • Lemmatization is computationally more expensive than stemming, as it looks up words in a dictionary
  • Lemmatizer for French
    • FLEMM (F. Namer)
  • POS taggers with lemmatization: TreeTagger, LT-POS
What is actually used?
  • Most retrieval systems use stemming/lemmatising and stop word lists
    • Stemming increases recall while harming precision
  • Most web search engines do use stop word lists but not stemming/lemmatising because
    • the text collection is extremely large, so the chance of matching morphological variants is higher
    • recall is not an issue
    • stemming is imperfect and the size and diversity of the web increase the chance of a mismatch
    • stemming/tokenising tools are available for few languages
Example Text Representations

Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

Web search: scientists, found, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystals, meteorite, fell, earth, red, planet, NASA, announced, Monday

Information service or library search: scientist, find, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystal, meteorite, fall, earth, red, planet, NASA, announce, Monday

  • Document unit :
    • An index can map terms
      • ... to documents
      • ... to paragraphs in documents
      • ... to sentences in document
      • ... to positions in documents
  • An IR system should be designed to offer choices of granularity.
  • From now on, we will assume that a suitable size document unit has been chosen, together with an appropriate way of dividing or aggregating files, if needed.
Index Content
  • The index usually stores some or all of the following information:
    • For each term:
      • Document count. How many documents the term occurs in.
      • Total frequency count. How many times the term occurs across all documents → “popularity measure”
    • For each term and for each document:
      • Frequency : How often the term occurs in that document.
      • Position. The offsets at which the term occurs in that document.
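One possible in-memory layout for this information is sketched below. The class and function names are hypothetical; storing the positions per document is enough to derive the three counts listed above.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    # doc_id -> list of token offsets at which the term occurs
    positions: dict = field(default_factory=lambda: defaultdict(list))

    @property
    def document_count(self):
        """How many documents the term occurs in."""
        return len(self.positions)

    @property
    def total_frequency(self):
        """How many times the term occurs across all documents."""
        return sum(len(offsets) for offsets in self.positions.values())

    def frequency_in(self, doc_id):
        """How often the term occurs in one document."""
        return len(self.positions.get(doc_id, []))

def build_index(collection):
    index = defaultdict(TermEntry)
    for doc_id, text in collection.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].positions[doc_id].append(offset)
    return index

# Hypothetical two-document collection.
toy = {1: "new drug new hopes", 2: "new approach"}
idx = build_index(toy)
entry = idx["new"]
print(entry.document_count, entry.total_frequency, entry.frequency_in(1), entry.positions[1])
# 2 3 2 [0, 2]
```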
What is a retrieval model
  • A model is an abstraction of a process – here: retrieval
  • Conclusions derived by the model are good if the model provides a good approximation of the retrieval process
  • IR Model variables: queries, documents, terms, relevance, users, information needs
  • Existing types of retrieval models :
    • Boolean models
    • Vector space models
    • Probabilistic models
    • Models based on Belief nets
    • Models based on language models
Retrieval Models: the general intuition
  • Documents and user information needs are represented using index terms
    • Index terms serve as links to documents
    • Queries consist of index terms
  • Relevance can be measured in terms of a match between queries and document index
Exact vs. Best Match
  • Exact Match
    • A query specifies precise retrieval criteria
    • Each document either matches or fails to match the query
    • The result is a set of documents (no ranking)
  • Best match
    • A query describes good or best matching documents
    • The result is a ranked list of documents
Statistical Models

A document is typically represented by a bag of words (unordered words with frequencies)

User specifies a set of desired terms with optional weights:

Weighted query terms:

Q = < database 0.5; text 0.8; information 0.2 >

Unweighted query terms:

Q = < database; text; information >

No Boolean conditions specified in the query.
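To make the weighted and unweighted queries concrete, here is a sketch that scores a document as the weighted sum of the frequencies of the query terms in its bag of words. This scoring function is an assumption chosen for simplicity, not a formula given in the lecture, and the two documents are hypothetical.

```python
from collections import Counter

def score(query_weights, document_text):
    """Weighted sum of query-term frequencies in the document's bag of words."""
    bag = Counter(document_text.lower().split())
    return sum(weight * bag[term] for term, weight in query_weights.items())

weighted_query = {"database": 0.5, "text": 0.8, "information": 0.2}
unweighted_query = dict.fromkeys(weighted_query, 1.0)  # every term counts equally

docs = {
    "d1": "text retrieval finds text in a database",
    "d2": "information about a database of information",
}

ranking = sorted(docs, key=lambda d: score(weighted_query, docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2']
```

With the unweighted query both documents receive the same score, which shows how the weights alone can decide the ranking.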


Statistical Retrieval

Retrieval based on similarity between query and documents.

Output documents are ranked according to similarity to query

Similarity based on occurrence frequencies of keywords in query and document

Automatic relevance feedback can be supported

The user issues a (short, simple) query.

The system returns an initial set of retrieval results.

The user marks some returned documents as relevant or nonrelevant.

The system computes a better representation of the information need based on the user feedback.

The system displays a revised set of retrieval results.


The boolean model

Most common exact-match model

  • Basic assumptions:
    • An index term is either present or absent in a document
    • All index terms provide equal evidence wrt information needs
  • Queries are boolean combinations of index terms
    • x AND y: documents that contain both x and y (intersection of addresses)
    • x OR y: documents that contain x, y or both (union of addresses)
    • NOT x: documents that do not contain x (complement set of addresses)
  • Additionally:
    • proximity operator
    • simple regular expressions
    • spelling variants
Boolean queries: example
  • User information need:

 interested in learning about vitamins that are antioxidant

  • User boolean query:

 antioxidant AND vitamin

The boolean model

Example of input collection (Shakespeare’s plays):

  • Doc1

I did enact Julius Caesar:

I was killed in the Capitol;

Brutus killed me.

  • Doc2

So let it be with Caesar. The

noble Brutus hath told you Caesar

was ambitious

The boolean model index construction
  • First we build the list of pairs (keyword, docID).
  • Then the lists are sorted by keywords, and frequency information is added.
  • Multiple occurrences of keywords are then merged to create a dictionary file and a postings file.
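The three construction steps can be sketched in Python on the two Shakespeare excerpts given earlier. The tokenizer (split on whitespace, strip punctuation, lowercase) is an assumption, and the frequency kept in the dictionary here is the document frequency of each term.

```python
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    return [tok.strip(".,;:").lower() for tok in text.split()]

# Step 1: list of (keyword, docID) pairs, in order of appearance.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort by keyword, then by docID.
pairs.sort()

# Step 3: merge duplicates into a dictionary (term -> document frequency)
# and a postings file (term -> sorted list of docIDs).
dictionary, postings = {}, {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    dictionary[term] = len(doc_ids)
    postings[term] = doc_ids

print(postings["brutus"], postings["capitol"], dictionary["caesar"])
# [1, 2] [1] 2
```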
Processing Boolean queries
  • User boolean query: Brutus AND Calpurnia
    • over the inverted index :
      • Locate Brutus in the Dictionary
      • Retrieve its postings
      • Locate Calpurnia in the Dictionary
      • Retrieve its postings
      • Intersect the two postings lists
  • The intersection operation is the crucial one. It has to be efficient so as to be able to quickly find documents that contain both terms.
    • sometimes referred to as merging postings lists because it uses a merge algorithm
    • Merge algorithm : general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each list
Extended boolean queries

Merging algorithm (from Manning et al., 07)

NB: the posting lists HAVE to be sorted.
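As a reminder of how the merge works, here is a sketch of the intersection of two sorted postings lists with two pointers. The postings used in the example are hypothetical values chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that lags behind
        else:
            j += 1
    return answer

# Hypothetical postings lists (sorted docIDs).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

If either list is not sorted, a pointer can skip past a matching docID, which is why the sorting requirement is not optional.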

Extended boolean queries
  • Generalisation of the merging process:
    • Imagine more than 2 keywords appear in the query:
      • (Brutus AND Caesar) AND NOT (Capitol)
      • Brutus AND Caesar AND Capitol
      • (Brutus OR Caesar) AND (Capitol)
      • ...
  • Ideas:
    • consider keywords with shorter posting lists first (to reduce the number of operations).
    • use the frequency information stored in the dictionary

 See Manning et al., 07 for the algorithm
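The first idea (start with the shortest postings lists) can be sketched as follows for a conjunction of several terms. Python sets replace the pointer-based merge to keep the example short, and the postings are hypothetical.

```python
def intersect_many(postings_lists):
    """AND of several terms, processing the shortest postings lists first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)  # list length = document frequency
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= set(postings)
        if not result:  # nothing left to match: stop early
            break
    return sorted(result)

# Hypothetical postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
capitol = [2, 31]

# Brutus AND Caesar AND Capitol
print(intersect_many([brutus, caesar, capitol]))  # [2]
```

Starting with the shortest list keeps every intermediate result no larger than that list, so later intersections have less work to do.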

Extended boolean queries

retrieved docs : D7, D5, D2

  • How would you process the following queries (main steps)
  • Brutus AND NOT Caesar
  • Try your algorithm on
  • How would you process the following query (main steps)

Brutus OR NOT Caesar
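One possible answer to these two exercises is sketched below. The function names, the postings and the collection size are hypothetical. AND NOT can be processed with the same two-pointer walk as AND, whereas OR NOT needs the set of all document IDs, so its cost depends on the size of the collection.

```python
def and_not(p1, p2):
    """DocIDs of sorted list p1 that do not occur in sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot occur later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer

def or_not(p1, p2, all_doc_ids):
    """DocIDs that are in p1 or absent from p2 (requires the full ID list)."""
    return sorted(set(p1) | (set(all_doc_ids) - set(p2)))

# Hypothetical postings for a collection of 12 documents.
brutus = [1, 2, 4, 11]
caesar = [2, 4, 5]
all_ids = range(1, 13)

print(and_not(brutus, caesar))          # [1, 11]
print(or_not(brutus, caesar, all_ids))  # every docID except 5
```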

Remarks on the boolean model
  • The boolean model allows precise queries to be expressed (you know what you get, BUT you do not have flexibility → exact matches)
  • Boolean queries can be processed efficiently (time complexity of the merge algorithm is linear in the sum of the lengths of the lists to be merged)
  • Has been a reference model in IR for a long time
Advantages of exact-match retrieval
  • Predictable, easy to explain
  • Structured queries
  • Works well when information need is clear and precise
Drawbacks of exact-match retrieval
  • Unintuitive for non-experts: adequate query formulation is difficult for most users
  • no ranking of retrieved documents
  • exact matching may lead to too few or too many retrieved documents
    • too few: if not using synonyms
    • difficulty increases with collection size
    • large results sets need to be compensated by interactive query refinement
  • No notion of partial relevance (useful if query is overrestrictive)
  • All terms have equal importance (no term weighting)
  • Ranking models are consistently better
Boolean model: the story so far
  • An inverted index associates keywords with posting lists
  • The postings lists contain document identifiers (and other useful information, such as total frequencies, number of documents, etc.)
  • Boolean queries are processed by merging posting lists in order to find the documents satisfying the query
  • The cost of this list merging is linear in the total number of document IDs: O(m + n)
  • Question: how to process phrase queries (i.e. taking the word’s context into account) ?
Dealing with phrase queries
  • Many complex or technical concepts and many organization and product names are multiword compounds or phrases.
    • Stanford University
    • Graph Theory
    • Natural Language Processing
    • ...
  • The user wants documents where the whole phrase appears, and not only some parts of it (i.e. “The inventor Stanford Ovshinsky never went to university” is not a match)
  • About 10 % of the web queries are phrase queries (songs’ names, institutions...)
  • Such queries need either more complex dictionary terms, or more complex index (critical parameter: size of the index)
Biword indexes

Use key-phrases of length 2, example :

  • Text : Natural Language Processing
  • Dictionary:
    • Natural Language
    • Language Processing
    • The dictionary is made of biwords (notion of context)
  • Query : Information retrieval in Natural Language Processing
    • (Information retrieval) AND (retrieval Natural) AND (Natural Language) AND (Language Processing)
    • It might seem a better query to omit the middle biword.
    • Better results can be obtained by using more precise part-of-speech patterns that define which extended biwords should be indexed
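A biword index can be sketched in a few lines of Python. The three toy documents and the function names are hypothetical, and tokenization is plain whitespace splitting. The result is a list of candidates because, for phrases longer than two words, matching every biword does not guarantee that the whole phrase occurs contiguously.

```python
from collections import defaultdict

def biwords(text):
    """All pairs of consecutive tokens, each joined into one dictionary term."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def build_biword_index(collection):
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Documents containing every biword of the phrase (AND of the biwords)."""
    postings = [index.get(b, set()) for b in biwords(phrase)]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical toy collection.
docs = {
    1: "natural language processing",
    2: "processing natural language",
    3: "language of natural processing",
}
index = build_biword_index(docs)

print(biwords("Natural Language Processing"))
# ['natural language', 'language processing']
print(phrase_candidates(index, "natural language processing"))  # [1]
```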
Positional indexes
  • Store positions in the inverted indexes, example:

termID ::=
    doc1: position1, position2, ...
    doc2: position1, position2, ...


  • Processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists)
  • NB: such indexes can be used to process proximity queries (i.e. using constraints on proximity between words)
  • Positional indexes need an entry per occurrence (NB: classic inverted indexes need an entry per document ID)
  • The size of such an index grows with the number of token occurrences, so it depends on the average document size
  • The size of a positional index depends on the language being indexed and the type of document (books, articles, etc)
  • On average, a positional index is 2-4 times bigger than a non-positional inverted index; it can reach 35 to 50% of the size of the original text (for English)
  • Positional indexes can be used in combination with classic indexes to save time and space (see [Williams et al, 2005]).
  • Which documents can contain the sentence “to be or not to be” considering the following (incomplete) indexes ?

be ::=

1: 7, 18, 33, 72, 86, 231

2: 3, 149

4: 17, 191, 291, 430, 434

5: 363, 367

to ::=

2: 1, 17, 74, 222, 551

4: 8, 16, 190, 429, 433

7: 13, 23, 191

  • Given the following positional indexes, give the documents Ids corresponding to the query “world wide web” :
  • world ::=
  • 1: 7, 18, 33, 70, 85, 131
  • 2: 3, 149
  • 4: 17, 190, 291, 430, 434
  • wide ::=
  • 1: 12, 19, 40, 72, 86, 231
  • 2: 2, 17, 74, 150, 551
  • 3: 8, 16, 191, 429, 435
  • web ::=
  • 1: 20, 22, 41, 75, 87, 200
  • 2: 18, 32, 45, 56, 77, 151
  • 4: 25, 192, 300, 332, 440
The postings lists to access are: to, be, or, not.
  • We will examine intersecting the postings lists for to and be.
  • We first look for documents that contain both terms.
  • Then, we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to
  • and then we look for another occurrence of each word with token index 4 higher than the first occurrence.
  • In the above lists, the pattern of occurrences that is a possible match is:
  • to: <...;4:<...,429,433>...>
  • be: <...;4:<...,430,434>...>
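Both exercises can be checked with a small sketch over the positional postings given above. The helper names are illustrative. A pattern is a list of (term, offset) pairs with the first offset equal to 0; for “to be or not to be”, only the postings of to and be are available, so the pattern checks to, be, and then to, be again 4 positions later. Membership tests on Python lists keep the code short where a real system would merge the sorted position lists.

```python
def match_pattern(index, pattern):
    """Return {doc_id: [start positions]} where every (term, offset) is satisfied."""
    terms = {term for term, _ in pattern}
    if not all(term in index for term in terms):
        return {}
    common = set.intersection(*(set(index[term]) for term in terms))
    first_term = pattern[0][0]
    matches = {}
    for doc_id in sorted(common):
        starts = [
            start
            for start in index[first_term][doc_id]
            if all(start + offset in index[term][doc_id] for term, offset in pattern)
        ]
        if starts:
            matches[doc_id] = starts
    return matches

def phrase_pattern(phrase):
    """Consecutive positions: the k-th word of the phrase at offset k."""
    return [(term, k) for k, term in enumerate(phrase.lower().split())]

to_be = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
}
www = {
    "world": {1: [7, 18, 33, 70, 85, 131], 2: [3, 149], 4: [17, 190, 291, 430, 434]},
    "wide": {1: [12, 19, 40, 72, 86, 231], 2: [2, 17, 74, 150, 551],
             3: [8, 16, 191, 429, 435]},
    "web": {1: [20, 22, 41, 75, 87, 200], 2: [18, 32, 45, 56, 77, 151],
            4: [25, 192, 300, 332, 440]},
}

print(match_pattern(to_be, [("to", 0), ("be", 1), ("to", 4), ("be", 5)]))  # {4: [429]}
print(match_pattern(www, phrase_pattern("world wide web")))  # {1: [18, 85], 2: [149]}
```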

Consider the following index:

    • Language:
    • Loria:

Where dI refers to the document I, the other numbers being positions.

The infix operator NEAR/x refers to the proximity x between two terms:

  • Give the solutions to the query language NEAR/2 Loria
  • Give the pairs (x,docids) for each x such that language NEAR/x Loria has at least one solution
  • Propose an algorithm for retrieving matching documents for this operator
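One possible algorithm for the last question is sketched below. It assumes that NEAR/x means “at most x positions apart, in either order” and that the position lists are sorted. Since the postings of the exercise are not reproduced here, the index in the example is hypothetical.

```python
def near(index, term1, term2, x):
    """Return {doc_id: [(pos1, pos2), ...]} with |pos1 - pos2| <= x."""
    results = {}
    shared = set(index.get(term1, {})) & set(index.get(term2, {}))
    for doc_id in sorted(shared):
        p1, p2 = index[term1][doc_id], index[term2][doc_id]
        pairs, start = [], 0
        for a in p1:
            # skip positions of term2 that are too far to the left of a
            while start < len(p2) and p2[start] < a - x:
                start += 1
            j = start
            while j < len(p2) and p2[j] <= a + x:
                pairs.append((a, p2[j]))
                j += 1
        if pairs:
            results[doc_id] = pairs
    return results

# Hypothetical positional postings: term -> {doc_id: sorted positions}.
toy_index = {
    "language": {1: [3, 20], 2: [5], 3: [40]},
    "loria": {1: [22, 60], 2: [9], 4: [1]},
}

print(near(toy_index, "language", "loria", 2))  # {1: [(20, 22)]}
```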
Example: WESTLAW
  • Large commercial system that has served the legal and professional market since 1974
    • legal materials (court opinions, statutes, regulations, ...)
    • news (newspapers, magazines, journals, ...)
    • financial (stock quotes, financial analyses, ...)
  • Total collection size: 5-7 Terabytes
  • 700 000 users (they claim 56% of legal searchers as of 2002)
  • Best match added in 1992
WESTLAW query language features
  • Boolean and proximity operators
    • Phrases : West Publishing
    • Word Proximity : West /5 Publishing
    • Same sentence : Massachusetts /s technology
    • Same paragraph - information retrieval /p
  • Restrictions : DATE(AFTER 1992 & BEFORE 1995)
  • Term expansion
    • wildcard (THOM*SON); truncation (THOM!); automatic expansion of plurals, possessive
  • Document structure (fields)
WESTLAW query example
  • Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.

 Query: "trade secret" /s disclos! /s prevent /s employe!

  • Information need: Requirements for disabled people to be able to access a workplace.

 Query: disab! /p access! /s work-site work-place (employment /3 place)

  • Information need: Cases about a host’s responsibility for drunk guests.

 Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

boolean query languages are not dead
Boolean query languages are not dead
  • Exact match still prevalent in the commercial market (but then includes some type of ranking)
  • Many users prefer Boolean
  • For some queries/collections, Boolean may work better
  • Boolean and free text queries find different documents

 Need retrieval models that support both

best match retrieval
Best-Match retrieval
  • Boolean retrieval is the archetypal example of exact-match retrieval
  • Best-match or ranking models are now more common
  • Advantages
    • easier to use
    • similar efficiency
    • provides ranking
    • best match generally has better retrieval performance
    • most relevant documents appear at the top of the ranking
  • But: comparing best-match and exact-match retrieval is difficult
Boolean model: all documents matching the query are retrieved
  • The matching is binary: yes or no
  • Extreme cases: the list of retrieved documents can be empty, or huge
  • A ranking of the documents matching a query is needed
  • A score is computed for each pair (query, document)
vector space retrieval
Vector-space Retrieval
  • By far the most common retrieval systems
  • Key idea: Everything (document, queries) is a vector in a high dimensional space
  • Vector coefficients for an object (document, query, term) represent the degree to which this object embodies each of the basic dimensions
  • Relevance is measured using vector similarity: a document is relevant to a query if their representing vectors are similar
vector space representation
Vector-space Representation
  • Documents are vectors of terms
  • Terms are vectors of documents
  • A query is a vector of terms
graphic representation
Graphic Representation


[Figure: D1, D2 and Q drawn as vectors in the term space spanned by T1, T2 and T3]

D1 = 2T1 + 3T2 + 5T3

D2 = 3T1 + 7T2 + T3

Q = 0T1 + 0T2 + 2T3

similarity in the vector space
Similarity in the Vector-space
  • Vector can contain binary terms or weighted terms
    • Binary term vector: 1 = term present, 0 = term absent
    • Weighted term vector: indicates relative importance of terms in a document
  • Vector similarity can be measured in several ways:
    • Inner product (measure of overlap)
    • Cosine coefficient
    • Jaccard coefficient
    • Dice coefficient
    • Minkowski metric (dissimilarity)
    • Euclidean distance (dissimilarity)
using the inner product similarity measure
Using the inner product similarity measure
  • Given a query vector q and a document vector d, both of length n,
  • similarity between q and d is defined by the inner product q • d of q and d:
      $q \cdot d = \sum_{i=1}^{n} q_i d_i$
  • where q_i (d_i) is the value of the i-th position of q (d)
  • With binary values this amounts to counting the matching terms between q and d
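
A minimal Python sketch of the inner product measure follows. The function name `inner_product` and the binary example vectors are illustrative assumptions; the weighted vectors reuse Q, D1 and D2 from the Graphic Representation slide.

```python
def inner_product(q, d):
    """Sum of the products of matching positions of q and d."""
    if len(q) != len(d):
        raise ValueError("q and d must have the same length")
    return sum(qi * di for qi, di in zip(q, d))


# Binary vectors (hypothetical): the score counts the terms present in both.
q_binary = [1, 0, 1, 1]
d_binary = [1, 1, 0, 1]
print(inner_product(q_binary, d_binary))  # 2

# Weighted vectors from the Graphic Representation slide.
Q = [0, 0, 2]
D1 = [2, 3, 5]
D2 = [3, 7, 1]
print(inner_product(Q, D1), inner_product(Q, D2))  # 10 2
```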
the effect of varying document lengths
The effect of varying document lengths
  • Problem :
    • Longer documents will be represented with longer vectors, but that does not mean they are more important
    • If two documents have the same score, the shorter one should be preferred
  • Solution : the length of a document must be taken into account when computing the similarity score
document length normalization
Document length normalization
  • The length of a document: Euclidean length
  • If d = (x1, x2, ..., xn) then $|d| = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$
  • To normalize a document, we divide it by its own length: d/|d|
  • Similarity given by the cosine measure between normalized vectors:
      $\cos(q, d) = \frac{q \cdot d}{|q|\,|d|}$
  • One problem is solved: shorter, more focused documents receive a higher score than longer documents with the same matching terms
  • But: shorter documents are now generally preferred over longer ones!
  • More sophisticated weighting schemes are generally used
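
The cosine measure can be sketched in Python as follows. The helper names are assumptions made for illustration; the vectors are Q, D1 and D2 from the Graphic Representation slide.

```python
import math


def euclidean_length(v):
    """Euclidean length |v| of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))


def cosine(q, d):
    """Cosine of the angle between q and d; 0.0 if either vector is all zeros."""
    denominator = euclidean_length(q) * euclidean_length(d)
    if denominator == 0:
        return 0.0
    return sum(qi * di for qi, di in zip(q, d)) / denominator


# Vectors from the Graphic Representation slide, as (T1, T2, T3) coefficients.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(round(cosine(Q, D1), 3))  # 0.811
print(round(cosine(Q, D2), 3))  # 0.13
```

Multiplying every coefficient of a document by the same constant leaves its cosine score unchanged, which is exactly the effect of dividing the document by its own length.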
term weights
Term weights
  • qi is the weight of the term i in q
  • Up to now, we only considered binary term weights
    • 0: term absent
    • 1: term present
  • Two shortcomings:
    • Does not reflect how often a term occurs
    • All terms are equally important (president vs. the)
  • Remedy: use non binary term weights
    • tf-score: store the frequency of a term in the vector (e.g., 4 if the term occurs 4 times in the document)
    • idf-score: to distinguish meaningful terms, i.e., terms that occur only in a few documents
term frequency
Term frequency
  • A document is treated as a bag of words (a set in which the number of occurrences of each word is kept)
  • Each word characterizes that document to some extent
  • When we have eliminated stop words, the most frequent words tend to be what the document is about
  • Therefore: fkd (Nb of occurrences of word k in document d) will be an important measure.

 Also called the term frequency (tf)

document frequency
Document frequency
  • What makes this document distinct from others in the corpus?
  • The terms which discriminate best are not those which occur with high document frequency!
  • Therefore: dk (number of documents in which word k occurs) will also be an important measure.

 Also called the document frequency (df); its inverse is the basis of the idf score

tf idf
TF.IDF
  • This can all be summarized as:
    • Words are best discriminators when :
      • they occur often in this document (term frequency)
      • they do not occur in many documents (low document frequency)
    • One very common measure of the importance of a word to a document is :
      • TF.IDF: term frequency x inverse document frequency
    • There are multiple formulas for actually computing this. The underlying concept is the same in all of them.
term weights
Term weights
  • tf-score : tfi,j = frequency of term i in document j
  • idf-score : idfi = inverse document frequency of term i
    • idfi = log(N/ni) with
      • N, the size of the document collection (number of documents)
      • ni, the number of documents in which term i occurs
    • ni/N = proportion of the document collection in which term i occurs
    • idfi therefore measures the rarity of a term in the document collection
  • Term weight of term i in document j (TF-IDF):
    • tfi,j · idfi
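
The weighting above can be sketched in Python as follows. This is one of the many possible variants, not a reference implementation: the function name `tf_idf`, the toy documents and the use of the natural logarithm are assumptions (the slide does not fix the base of the log).

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N, the size of the collection
    # n_i: number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_{i,j}: raw frequency of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Hypothetical three-document collection.
docs = [
    ["the", "president", "met", "the", "press"],
    ["the", "press", "reported"],
    ["the", "weather", "report"],
]
w = tf_idf(docs)
print(w[0]["the"])                  # 0.0 ("the" occurs in every document)
print(round(w[0]["president"], 3))  # 1.099 (log 3: occurs in one document of three)
```

A term that occurs in every document gets weight 0 whatever its frequency, which addresses the "president vs. the" problem raised earlier.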
boolean retrieval vs vector space retrieval
Boolean retrieval vs. Vector Space Retrieval
  • Boolean retrieval
    • Documents are not ranked
    • Boolean queries are not easy to manipulate
  • Vector space retrieval
    • Documents can be ranked
    • Issue 1: choice of comparison function. Usually cosine comparison.
    • Issue 2: choice of weighting scheme. Usually variations on tfi,j · idfi
  • Issues
  • User-based evaluation
  • System-based evaluation
  • TREC
  • Precision and recall
evaluation methods
Evaluation methods
  • Two types of evaluation methods:
    • User-based: measures the user satisfaction
    • System-based: focuses on how well the system ranks the documents
user based evaluation
User based evaluation
  • More direct
  • Expensive
  • Difficult to do correctly
  • Need sufficiently large, representative sample of users
  • The compared systems must be equally well developed (complete with fully functional user interface)
  • Each user must be trained to control learning effects
  • Information, information needs, relevance are intangible concepts
system based evaluation
System based evaluation
  • Good system performance = good document rankings
  • Allows for fair comparative testing
  • Less expensive; can be reused
  • Test collection = Topics, Documents, Relevance judgments
  • System based evaluation goes back to the Cranfield experiments (1960)
    • Rate relevance of retrieved bibliographic reference on a scale from 1 to 4
recall and precision
Recall and Precision
  • Important performance metrics:
    • Precision : Proportion of retrieved documents that are relevant

 No penalty for selecting too few items

    • Recall : Proportion of relevant documents that have been retrieved

 No penalty for selecting too many items (e.g., everything)

standard text collections
Standard Text Collections
  • Relevant documents must be identified
  • Given a document collection D and a set of queries Q, RELq is the set of documents relevant to q
  • Whether a document d is relevant to a query q is decided by human judgement
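
Once RELq is known, precision and recall of a result list follow directly from their definitions. The sketch below is illustrative only: the function name `precision_recall` and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and the relevant set RELq."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection of 10 documents, 4 of them relevant to the query.
collection = ["d%d" % i for i in range(1, 11)]
rel_q = {"d1", "d2", "d3", "d4"}

print(precision_recall(["d1"], rel_q))     # (1.0, 0.25)
print(precision_recall(collection, rel_q)) # (0.4, 1.0)
```

The two calls show the extreme cases mentioned with the definitions: returning a single relevant document gives perfect precision but low recall, and returning everything gives perfect recall but low precision.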
standard text collections
Standard Text Collections
  • CACM (computer science): 3204 abstracts, 64 queries
  • CF (medicine): 1239 abstracts, 100 queries
  • CISI (library science): 1460 abstracts, 112 queries
  • CRANFIELD (aeronautics): 1400 abstracts, 225 queries
  • LISA (library science): 6004 abstracts, 35 queries
  • TIME (newspaper): 423 abstracts, 83 queries
  • Ohsumed (medicine): 348 566 abstracts, 106 queries
building test collections
Building Test Collections
  • How to identify relevant documents?
  • How to assess relevance? (binary or finer-grained)
  • One vs several judges
  • Text REtrieval Conference
  • Proceedings at
  • Established in 1991 to evaluate large scale IR
  • Retrieving documents from a gigabyte collection
  • Organised by NIST and run continuously since 1991
  • Best known IR evaluation setting
    • 25 participants in 1992
    • 109 participants from 4 continents in 2004
    • European (CLEF) and Asian counterparts (NTCIR)
trec format
TREC Format
  • Several IR research tracks
    • ad-hoc retrieval
    • routing/filtering
    • cross-language
    • scanned document
    • spoken document
    • Video
    • Web
    • question answering
    • ...
trec notion of relevance
TREC notion of relevance

If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant

  • Pooling is used for identifying relevant documents:
    • A set of possibly relevant documents is created automatically for each information need
    • The top 100 documents returned by each system are kept and inspected by judges who determine which documents are relevant
    • Inter-judge agreement is about 80%
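
The pooling step can be sketched in Python as below. The function name `build_pool`, the system names and the rankings are hypothetical; the depth of 100 is the value given above, and a depth of 2 is used in the example only to keep it small.

```python
def build_pool(runs, depth=100):
    """runs: {system_name: ranked list of doc ids}. Returns the set to be judged."""
    pool = set()
    for ranking in runs.values():
        # Keep only the top `depth` documents of each system.
        pool.update(ranking[:depth])
    return pool


# Hypothetical rankings returned by two systems for the same topic.
runs = {
    "sysA": ["d3", "d1", "d7"],
    "sysB": ["d1", "d9", "d3"],
}
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd3', 'd9']
```

Documents that no system ranks within the chosen depth (d7 here) are never shown to the judges and are therefore treated as non-relevant.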
improving recall and precision
Improving Recall and Precision
  • The two big problems with short queries are:
    • Synonymy: Poor recall results from missing documents that contain synonyms of search terms, but not the terms themselves
    • Polysemy/Homonymy: Poor precision results from search terms that have multiple meanings leading to the retrieval of non-relevant documents.
query expansion
Query Expansion
  • Find a way to expand a user's query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall
    • Use a dictionary/thesaurus
    • Use relevance feedback
  • A thesaurus contains information about words (e.g., violin) such as :
    • Synonyms: similar words e.g., fiddle
    • Hyperonyms: more general words e.g., instrument
    • Hyponyms: more specific words e.g., Stradivari
    • Meronyms: parts, e.g., strings
  • A very popular machine-readable thesaurus is WordNet
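
Thesaurus-based expansion can be sketched in Python as follows. The thesaurus here is a hand-written toy dictionary, not WordNet or its API, and the function name `expand_query` is an assumption made for illustration.

```python
def expand_query(terms, thesaurus):
    """Append the related words of each query term, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


# Hypothetical thesaurus: term -> list of synonyms.
toy_thesaurus = {
    "violin": ["fiddle"],
    "car": ["automobile", "auto"],
}
print(expand_query(["violin", "concert"], toy_thesaurus))
# ['violin', 'concert', 'fiddle']
```

The original terms come first so that a ranking model can still weight them more heavily than the added ones.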
problems of thesauri
Problems of Thesauri
  • Language dependent
  • Available only for a couple of languages
cooccurence models
Cooccurrence models
  • Semantically or syntactically related terms
  • Cooccurrence vs. thesauri
    • Easy to adapt to other languages/domains
    • Also covers relations not expressed in thesauri
    • Not as reliable as manually edited thesauri
    • Can introduce considerable noise
  • Selection criteria: mutual information, expected mutual information
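
One common way to apply a mutual information criterion is pointwise mutual information (PMI) over document-level cooccurrence. The sketch below assumes that reading: the function name `pmi_scores`, the toy documents, the choice of the document as cooccurrence window and the natural logarithm are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(docs):
    """docs: list of token lists. Returns {(term_a, term_b): PMI} for cooccurring pairs."""
    n = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    # PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ), probabilities estimated per document.
    return {
        (a, b): math.log((c / n) / ((term_df[a] / n) * (term_df[b] / n)))
        for (a, b), c in pair_df.items()
    }


# Hypothetical four-document collection.
docs = [
    ["violin", "fiddle", "music"],
    ["violin", "fiddle"],
    ["music", "piano"],
    ["piano", "concert"],
]
scores = pmi_scores(docs)
print(round(scores[("fiddle", "violin")], 3))  # 0.693
print(round(scores[("fiddle", "music")], 3))   # 0.0
```

Pairs with a high score ("fiddle" and "violin" always appear together here) are candidates for expansion; pairs that cooccur no more often than chance score around 0.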
relevance feedback
Relevance feedback
  • Ask user to identify a few documents which appear to be related to their information need
  • Extract terms from those documents and add them to the original query.
  • Run the new query and present those results to the user.
  • Typically converges quickly
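
A minimal Python sketch of the term-extraction step is given below. It uses a simple frequency heuristic chosen for illustration, not a specific published method; the function name `feedback_expand`, its parameters and the example documents are hypothetical.

```python
from collections import Counter


def feedback_expand(query, relevant_docs, n_terms=2, stop_words=frozenset()):
    """Add the n_terms most frequent new terms of the relevant documents to the query."""
    counts = Counter(
        term
        for doc in relevant_docs
        for term in doc
        if term not in query and term not in stop_words
    )
    # Sort by descending count, then alphabetically, so that ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query) + ranked[:n_terms]


# Hypothetical documents the user marked as related to the need behind "jaguar".
marked = [
    ["jaguar", "cat", "jungle", "cat"],
    ["jaguar", "cat", "prey"],
]
print(feedback_expand(["jaguar"], marked))  # ['jaguar', 'cat', 'jungle']
```

The expanded query is then run again and the new results are shown to the user, who can repeat the step.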
blind feedback
Blind feedback
  • Assume that first few documents returned are most relevant rather than having users identify them
  • Proceed as for relevance feedback
  • Tends to improve recall at the expense of precision
post hoc analysis
Post-Hoc Analysis
  • When a set of documents has been returned, they can be analyzed to improve usefulness in addressing information need
    • Grouped by meaning for polysemic queries (using N-Gram-type approaches)
    • Grouped by extracted information (Named entities, for instance)
    • Group into existing hierarchy if structured fields available
    • Filtering (e.g., eliminate spam)
  • Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. To appear at Cambridge University Press (chapters available at the book website).
  • Information Retrieval, Second Edition, by C.J. van Rijsbergen, Butterworths, London, 1979.