Natural Language Processing Applications

Lecture 7

Fabienne Venant

Université Nancy2 / Loria

What is Information Retrieval?

  • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)

  • Applications:

    • Many universities and public libraries use IR systems to provide access to books, journals and other documents.

    • Web search

      • Large volumes of unstable, unstructured data

      • Speed is important

    • Cross-language IR

      • Finding documents written in another language

      • Touches on Machine translation

    • ....

Concerns

  • The set of texts can be very large, hence efficiency is a concern

  • Textual data is noisy, incomplete and untrustworthy, hence robustness is a concern

  • Information may be hidden:

    • Need to derive information from raw data

    • Need to derive information from vaguely expressed needs

IR Basic concepts

  • Information needs : queries and relevance

  • Indexing: helps speed up retrieval

  • Retrieval models: describe how to search and recover relevant documents

  • Evaluation: IR systems are large and convincing evaluation is tricky

Information needs

  • INFORMATION NEED : the topic about which the user desires to know more

  • QUERY : what the user conveys to the computer in an attempt to communicate the information need

  • RELEVANCE : a document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need

    Ex :

    • topic “pipeline leaks”

    • relevant documents : it doesn’t matter whether they use those words or express the concept with other words such as « pipeline rupture ».

Capturing information needs

  • Information needs can be hard to capture

  • One possibility : use natural language

    • Advantage: expressive enough to allow all needs to be described

    • Drawbacks:

      • Semantic analysis of arbitrary NL is very hard

      • Users may not want to type full-blown sentences into a search engine

Queries

  • Information needs are typically expressed as a query :

    • Where shall I go on holiday? → holiday destinations

  • Two main types of possible queries

    • How much blood does the human heart pump in one minute?

      • Boolean queries :

         heart AND blood AND minutes

      • Web types queries :

         human biology

Remarks

  • A query :

    • is usually quite short and incomplete;

    • may contain misspelled or poorly selected words

    • may contain too many or too few words

  • The information need :

    • may be difficult to describe precisely, especially when the user isn't familiar with the topic

  • Precise understanding of the document content is difficult.

Persistent vs one-off Queries

Queries may or may not evolve over time

  • Persistent queries :

    • predefined and routinely performed :

      • Top ten performing shares today

      • Continuous queries : persistent queries that allow users to receive new results when they become available

    • typical of Information extraction and News Routing systems

  • One-off (or ad-hoc) queries

    • created to obtain information as the need arises

    • typical of Web searching

Relevance

  • Relevance is subjective

    • ’python’ : ambiguous in general, but not for the user who issued the query

    • Topicality vs. Utility: a document is relevant with respect to a specific goal

       A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.

  • Relevance is a gradual concept (a document is not just relevant or not; it is more or less relevant to a query)

  • IR systems usually rank retrieved documents by relevance

    • But many algorithms use a binary decision of relevance.

Terminology

  • An IR system looks for data matching some criteria defined by the users in their queries.

  • The language used to ask a question is called the query language.

  • These queries use keywords (atomic items characterizing some data).

  • The basic unit of data is a document (can be a file, an article, a paragraph, etc.).

  • A document corresponds to free text (may be unstructured).

  • All the documents are gathered into a collection (or corpus).

Searching for a given word in a document

  • One way to do that is to start at the beginning and to read through all the text

    • With pattern matching (regular expressions) and the speed of modern computers, grepping through text can be very effective

  • Enough for simple querying of modest collections (millions of words)

  • But for many purposes, you do need more:

    • To process large document collections (billions or trillions of words) quickly.

    • To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”.

    • To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words

      → You need an index

Motivation for Indexing

  • Extremely large dataset

  • Only a tiny fraction of the dataset is relevant to a given query

  • Speed is essential (0.25 seconds for web searching)

  • Indexing helps speed up retrieval

Indexing documents

  • How to relate the user’s information need with some documents’ content ?

  • Idea : using an index to refer to documents

  • Usually an index is a list of the terms that appear in a document. It can be represented mathematically as:

    $\mathrm{index} : \mathrm{doc}_i \mapsto \bigcup_j \{\mathrm{keyword}_j\}$

  • Here, the kind of index we use maps keywords to the list of documents they appear in:

    $\mathrm{index}' : \mathrm{keyword}_j \mapsto \bigcup_i \{\mathrm{doc}_i\}$

  • We call this an inverted index.
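
A minimal sketch of the two mappings above, assuming a hypothetical three-document collection that has already been reduced to keywords. The names `forward_index` and `invert` are illustrative only and do not come from the lecture.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> keywords (the "index" mapping).
forward_index = {
    1: ["new", "home", "sales"],
    2: ["home", "prices", "rise"],
    3: ["new", "prices"],
}

def invert(forward):
    """Turn doc -> keywords into keyword -> sorted list of doc ids."""
    inverted = defaultdict(set)
    for doc_id, keywords in forward.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    return {keyword: sorted(doc_ids) for keyword, doc_ids in inverted.items()}

# The "index'" mapping: keyword -> documents it appears in.
inverted_index = invert(forward_index)
print(inverted_index["home"])    # documents containing "home"
print(inverted_index["prices"])  # documents containing "prices"
```

Looking up a keyword is now a single dictionary access instead of a scan over every document.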

Indexing documents

  • The set of keywords is usually called the dictionary (or vocabulary)

  • A document identifier appearing in the list associated with a keyword is called a posting

  • The list of document identifiers associated with a given keyword is called a posting list

Inverted files

The most common indexing technique

  • Source file: collection organised by documents

  • Inverted file: collection organised by terms

Inverted Index

  • Given a dictionary of terms (also called vocabulary or lexicon)

  • For each term, record in a list which documents the term occurs in

  • Each item in the list:

    • records that a term appeared in a document

    • and, later, often, the positions in the document

    • is conventionally called a posting

  • The list is then called a postings list (or inverted list).

Inverted Index

From « An Introduction to Information Retrieval », C.D. Manning, P. Raghavan and H. Schütze

Exercise

Draw the inverted index that would be built for the following document collection

  • Doc 1 breakthrough drug for schizophrenia

  • Doc 2 new schizophrenia drug

  • Doc 3 new approach for treatment of schizophrenia

  • Doc 4 new hopes for schizophrenia patients

    For this document collection, what are the returned results for these queries:

    • schizophrenia AND drug

    • schizophrenia AND NOT(drug OR approach)
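
One possible way to check an answer to this exercise is sketched below. It assumes plain whitespace tokenization, no stop list and no stemming, and it models postings as Python sets so that AND, OR and NOT become intersection, union and complement. The helper name `postings` is an assumption made for the example.

```python
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

# Inverted index: term -> set of document ids (whitespace tokenization only).
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())

# schizophrenia AND drug: set intersection.
q1 = postings("schizophrenia") & postings("drug")

# schizophrenia AND NOT (drug OR approach): intersection with a complement.
q2 = postings("schizophrenia") & (all_docs - (postings("drug") | postings("approach")))

for term in sorted(index):
    print(term, sorted(index[term]))
print(sorted(q1), sorted(q2))
```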

Indexing documents

  • Questions that arise: how can an index be built automatically? What are the relevant keywords?

  • Some additional desiderata:

    • fast processing of large collections of documents,

    • having flexible matching operations (robust retrieval),

    • having the possibility to rank the retrieved documents in terms of relevance

  • To ensure these requirements (especially fast processing) are fulfilled, the indexes are computed in advance

  • Note that the format of the index has a huge impact on the performance of the system

Indexing documents

NB: an index is built in 4 steps:

  • Gathering of the collection (each document is given a unique identifier)

  • Segmentation of each document into a list of atomic tokens → tokenization

  • Linguistic processing of the tokens in order to normalize them → lemmatization

  • Indexing the documents by computing the dictionary and lists of postings

Manual indexing

  • Advantages

    • Human judgements are the most reliable

    • Retrieval is better

  • Drawbacks

    • Time consuming

    • Not always consistent

      • different people build different indexes for the same document.

Automatic indexing

  • Using NLU?

    • Not fast enough in real world settings (e.g., web search)

    • Not robust enough (low coverage)

    • Difficulty : what to include and what to exclude.

      • Indexes should not contain headings for topics for which there is no information in the document

      • Can a machine parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text?

Stop list

  • A stop list is a list of words that are discarded during indexing

    • some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely.

  • These words are called STOP WORDS

  • Collection strategy :

    • Sort the terms by collection frequency (the total number of times each term appears in the document collection),

    • Take the most frequent terms

      • often hand-filtered for their semantic content relative to the domain of the documents being indexed

    • What counts as a stop word depends on the collection

      • in a collection of legal articles, « law » can be considered a stop word

  • Ex:

    • a an and are as at be by for from has he in is it its of on that the to was were will with
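
The collection strategy above (sort by collection frequency, take the most frequent terms) can be sketched as follows. The function name `stop_list_candidates`, the lowercasing step and the toy documents are assumptions made for illustration; the hand-filtering step is not modelled.

```python
from collections import Counter

def stop_list_candidates(documents, k):
    """Return the k terms with the highest collection frequency."""
    collection_freq = Counter()
    for text in documents:
        # Collection frequency counts every occurrence, not just one per document.
        collection_freq.update(text.lower().split())
    return [term for term, _ in collection_freq.most_common(k)]

# Hypothetical toy collection.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
print(stop_list_candidates(documents, 1))
```

In practice the candidate list would then be reviewed by hand, since a frequent term may still carry meaning in a given domain.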

Why eliminate stop words?

  • Efficiency

    • Eliminating stop words reduces the size of the index considerably

    • Eliminating stop words reduces retrieval time considerably

  • “Quality of results”

    • Most of the time not indexing stop words does little harm

      • keyword searches with terms like the and by don’t seem very useful

    • BUT, this is not true for phrase searches.

      • The phrase query “President of the United States” is more precise than President AND “United States”.

      • The meaning of “flights to London” is likely to be lost if the word to is stopped out.

      • .....

Building the vocabulary

  • Processing a stream of characters to extract keywords

  • 1st task: tokenization, main difficulties:

    • token delimiters (ex: Chinese)

    • apostrophes (ex: O’Neill, Finland’s capital)

    • hyphens (ex: Hewlett-Packard, state-of-the-art)

    • segmented compound nouns (ex: Los Angeles)

    • unsegmented compound nouns (icecream, breadknife)

    • numerical data (dates, IP addresses)

    • word order (ex: Arabic wrt nouns and numbers)

Solutions for tokenization issues:

  • Using a pre-defined dictionary with largest matches and heuristics for unknown words

  • Using learning algorithms trained over hand-segmented words

Choosing keywords

  • Selecting the words that are most likely to appear in a query

    • These words characterize the documents they appear in

    • Which are they?

The bag of words approach

  • Extreme interpretation of the principle of compositional semantics

  • The meaning of documents resides solely in the words that are contained within them

  • The exact ordering of the terms in a document is ignored but the number of occurrences of each term is material


“Not the same thing a bit!” said the Hatter.

“You might just as well say that ‘I see what I eat’ is the same thing as ‘I eat what I see’!”

“You might just as well say,” added the March Hare, “that ‘I like what I get’ is the same thing as ‘I get what I like’!”

“You might just as well say,” added the Dormouse, who seemed to be talking in its sleep, “that ‘I breathe when I sleep’ is the same thing as ‘I sleep when I breathe’!”

Bags of words

  • Nevertheless, it seems intuitive that two documents with similar bag of words representations are similar in content.
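
The Hatter's objection can be made concrete with a small sketch. It assumes lowercasing and whitespace tokenization, and `bag_of_words` is a name chosen for the example.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its term counts, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("I see what I eat")
b = bag_of_words("I eat what I see")

print(a == b)   # word order is lost, so the two bags compare equal
print(a["i"])   # the number of occurrences of each term is kept
```

The two sentences mean different things, yet their bags are identical: this is exactly the information the model throws away.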

What’s in a bag of words?

  • Are all words in a document equally important?

    • stop words do not contribute in any way to retrieval and scoring

    • A BoW contains terms

      • What should count as a term?

        • Words

        • Phrases (e.g., president of the US)

Morphological normalization

  • Should index terms be word forms, lemmas or stems?

    • Matching morphological variants increases recall

    • Example morphological variants :

      • anticipate, anticipating, anticipated, anticipation

      • Company/Companies, sell/sold

      • USA vs U.S.A.,

      • 22/10/2007 vs 10/22/2007 vs 2007/10/22

      • university vs University

    • Idea: using equivalence classes of terms,

      • ex: { Opel, OPEL, opel } → opel

    • Two techniques:

      • stemming : refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time

      • Lemmatisation : refers to doing things properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the dictionary form of a word, which is known as the lemma.

  • NB: documents and queries have to be processed using the same tokenization process !

Stemming and Lemmatization

  • Role: reducing inflectional forms to common base forms,

  • Example:

    • car, cars, car’s, cars’ → car

    • am, are, is → be

  • Stemming removes suffixes (surface markers) to produce root forms

  • Lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser)

  • Illustration of the difficulty:

    • plurals (woman/women, crisis/crises)

    • derivational morphology (automatize/automate)

  • English → Porter stemming algorithm (University of Cambridge, UK, 1980)

Porter stemmer

  • Algorithm based on a set of context-sensitive rewriting rules

  • Rules are composed of a pattern (left-hand-side) and a string (right-hand-side), example:

    (.*)sses -> \1ss            sses -> ss : caresses -> caress

    (.*[aeiou].*)ies -> \1i     ies -> i : ponies -> poni, ties -> ti

    (.*[aeiou].*)ss -> \1ss     ss -> ss : caress -> caress

  • Rules may be constrained by conditions on the word’s measure, example:

    (m > 1) (.*)ement -> \1 : replacement -> replac, but not cement -> c

    (m > 0) (.*)eed -> \1ee : feed -> feed, but agreed -> agree

    (*v*) ed -> \1 : plastered -> plaster, but bled -> bled

    (*v*) ing -> \1 : motoring -> motor, but sing -> sing

Porter Stemmer: Word measure

  • Assume that a sequence of one or more consonants is denoted by C, and a sequence of one or more vowels by V

  • Any word, or part of a word has one of the four forms:

    • CVCV ... C

    • CVCV ... V

    • VCVC ... C

    • VCVC ... V

  • These may all be represented by the single form

    • [C]VCVC ... [V] where the square brackets denote arbitrary presence of their contents.

  • Using (VC)^m to denote VC repeated m times, this may again be written as

    • [C](VC)^m[V].

  • m will be called the measure of any word or word part when represented in this form.

  • Here are some examples:

    • m=0 TR,   EE,   TREE,   Y,   BY

    • m=1 TROUBLE,   OATS,   TREES,   IVY


    • (m > 1) EMENT ->

      • This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.
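
The measure can be computed mechanically, as in the sketch below. It assumes Porter's usual treatment of the letter y (a consonant unless it follows a consonant), which the slides do not spell out; the function names `is_consonant` and `measure` are chosen for the example.

```python
import re

VOWELS = "aeiou"

def is_consonant(word, i):
    """Assumed rule: 'y' counts as a vowel only when preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Return m such that the word has the form [C](VC)^m[V]."""
    word = word.lower()
    # One letter per character: C for consonant, V for vowel.
    shape = "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    # Collapse runs so that each C or V stands for a whole sequence.
    collapsed = re.sub(r"C+", "C", re.sub(r"V+", "V", shape))
    return collapsed.count("VC")

for w in ["tree", "by", "trouble", "ivy", "replac"]:
    print(w, measure(w))
```

With this definition, the condition (m > 1) on the EMENT rule is checked on the part of the word that precedes the suffix, for example `measure("replac")` for REPLACEMENT.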

Exercise

  • What is the Porter measure of the following words (give your computation) ?

    • crepuscular

    • rigorous

    • placement

      cr ep usc ul ar

      C VC VC VC VC

      m = 4

      r ig or ous

      C VC VC VC

      m = 3

      pl ac em ent

      C VC VC VC

      m = 3

Stemming

  • Most stemmers also remove suffixes such as ed, ing, ational, ation, able, ism...

    • Relational → relate

  • Most stemmers don’t use lexical look up

  • There are shortcomings:

    • Stemming can result in non-words

      • Organization → Organ

      • Doing → doe

    • Unrelated words can be reduced to the same stem

      • police, policy → polic

Stemming

  • Popular stemmers

    • Porter’s

    • Lovins’

    • Iterated Lovins’

    • Kstem

Lemmatization

  • Exceptions need to be handled:

    • sought → seek, sheep → sheep, feet → foot

  • Computationally more expensive than stemming, as it looks up words in a dictionary

  • Lemmatizer for French


    • FLEMM (F. Namer)

  • POS taggers with lemmatization: TreeTagger, LT-POS

What is actually used?

  • Most retrieval systems use stemming/lemmatising and stop word lists

    • Stemming increases recall while harming precision

  • Most web search engines do use stop word lists but not stemming/lemmatising because

    • the text collection is extremely large, so the chance of matching morphological variants is higher

    • recall is not an issue

    • stemming is imperfect and the size and diversity of the web increase the chance of a mismatch

    • stemming/tokenising tools are available for only a few languages

Example Text Representations

Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

Web search: scientists, found, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystals, meteorite, fell, earth, red, planet, NASA, announced, Monday

Information service or library search: scientist, find, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystal, meteorite, fall, earth, red, planet, NASA, announce, Monday

Granularity

  • Document unit :

    • An index can map terms

      • ... to documents

      • ... to paragraphs in documents

      • ... to sentences in documents

      • ... to positions in documents

  • An IR system should be designed to offer choices of granularity.

  • From now on, we will assume that a suitable size document unit has been chosen, together with an appropriate way of dividing or aggregating files, if needed.

Index Content

  • The index usually stores some or all of the following information:

    • For each term:

      • Document count. How many documents the term occurs in.

      • Total Frequency count. How many times the term occurs across all documents → “popularity measure”

    • For each term and for each document:

      • Frequency : How often the term occurs in that document.

      • Position. The offsets at which the term occurs in that document.
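
One way to hold these four pieces of information is sketched below. The class names `Posting` and `TermEntry`, the `build_index` helper and the use of token offsets (rather than character offsets) are assumptions made for the example, not the layout used by any particular system.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    """Per term and per document: frequency and positions."""
    doc_id: int
    positions: list = field(default_factory=list)  # token offsets in the document

    @property
    def frequency(self):
        return len(self.positions)

@dataclass
class TermEntry:
    """Per term: document count and total frequency, derived from the postings."""
    postings: dict = field(default_factory=dict)  # doc_id -> Posting

    @property
    def document_count(self):
        return len(self.postings)

    @property
    def total_frequency(self):
        return sum(p.frequency for p in self.postings.values())

def build_index(docs):
    """Build term -> TermEntry from doc_id -> text (whitespace tokens, lowercased)."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            entry = index.setdefault(term, TermEntry())
            entry.postings.setdefault(doc_id, Posting(doc_id)).positions.append(pos)
    return index

index = build_index({1: "to be or not to be", 2: "to do"})
to = index["to"]
print(to.document_count, to.total_frequency, to.postings[1].positions)
```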

What is a retrieval model

  • A model is an abstraction of a process – here: retrieval

  • Conclusions derived by the model are good if the model provides a good approximation of the retrieval process

  • IR Model variables: queries, documents, terms, relevance, users, information needs

  • Existing types of retrieval models :

    • Boolean models

    • Vector space models

    • Probabilistic models

    • Models based on Belief nets

    • Models based on language models

Retrieval Models: the general intuition

  • Documents and user information needs are represented using index terms

    • Index terms serve as links to documents

    • Queries consist of index terms

  • Relevance can be measured in terms of a match between queries and document index

Exact vs. Best Match

  • Exact Match

    • A query specifies precise retrieval criteria

    • Each document either matches or fails to match the query

    • The result is a set of documents (no ranking)

  • Best match

    • A query describes good or best matching documents

    • The result is a ranked list of documents

Statistical Models

A document is typically represented by a bag of words (unordered words with frequencies)

User specifies a set of desired terms with optional weights:

Weighted query terms:

Q = < database 0.5; text 0.8; information 0.2 >

Unweighted query terms:

Q = < database; text; information >

No Boolean conditions specified in the query.


Statistical Retrieval

Retrieval based on similarity between query and documents.

Output documents are ranked according to similarity to query

Similarity based on occurrence frequencies of keywords in query and document

Automatic relevance feedback can be supported

The user issues a (short, simple) query.

The system returns an initial set of retrieval results.

The user marks some returned documents as relevant or nonrelevant.

The system computes a better representation of the information need based on the user feedback.

The system displays a revised set of retrieval results.
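
As an illustration of ranking by similarity, the sketch below scores each document by a dot product between the query term weights and the document's term frequencies. This particular similarity function is an assumption chosen for simplicity; the slides do not commit to one, and the relevance feedback loop is not modelled. The documents are hypothetical.

```python
from collections import Counter

def score(query_weights, document):
    """Dot product between query term weights and document term frequencies."""
    tf = Counter(document.lower().split())
    return sum(weight * tf[term] for term, weight in query_weights.items())

def rank(query_weights, documents):
    """Return (doc_id, score) pairs, best match first."""
    scored = [(doc_id, score(query_weights, text)) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Weighted query in the style Q = < database 0.5; text 0.8; information 0.2 >.
query = {"database": 0.5, "text": 0.8, "information": 0.2}

# Hypothetical documents.
documents = {
    "d1": "text retrieval finds text in a database",
    "d2": "information about database systems",
    "d3": "cooking recipes",
}
ranking = rank(query, documents)
for doc_id, s in ranking:
    print(doc_id, round(s, 2))
```

An unweighted query corresponds to giving every term the same weight, for example 1.0.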


The boolean model

Most common exact-match model

  • Basic assumptions:

    • An index term is either present or absent in a document

    • All index terms provide equal evidence with respect to information needs

  • Queries are boolean combinations of index terms

    • x AND y: documents that contain both x and y (intersection of addresses)

    • x OR y: documents that contain x, y or both (union of addresses)

    • NOT x: documents that do not contain x (complement set of addresses)

  • Additionally,

    • proximity operator

    • simple regular expressions

    • spelling variants

Boolean queries: Example

  • User information need:

     interested in learning about vitamins that are antioxidant

  • User boolean query:

     antioxidant AND vitamin

The boolean model

Example of input collection (Shakespeare’s plays):

  • Doc1

    I did enact Julius Caesar:

    I was killed in the Capitol;

    Brutus killed me.

  • Doc2

    So let it be with Caesar. The

    noble Brutus hath told you Caesar

    was ambitious

The boolean model index construction

  • First we build the list of pairs (keyword, docID):

The boolean model index construction

  • Then the lists are sorted by keywords, frequency information is added:

The boolean model index construction

  • Multiple occurrences of keywords are then merged to create a dictionary file and a postings file:
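
The three construction steps (pairs, sort, merge) can be sketched on the two Shakespeare excerpts as follows. The tables shown on the original slides are not available here, so this is an illustrative reconstruction. It assumes that tokens are runs of letters and are lowercased, that the dictionary stores the document frequency of each term, and that each posting stores the within-document frequency.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: list of (keyword, docID) pairs, in order of appearance.
pairs = [(tok.lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in re.findall(r"[A-Za-z]+", text)]

# Step 2: sort by keyword, then by docID.
pairs.sort()

# Step 3: merge duplicates into a dictionary (term -> document frequency)
# and a postings file (term -> [(docID, term frequency), ...]).
dictionary, postings = {}, {}
for term, group in groupby(pairs, key=lambda p: p[0]):
    doc_ids = [doc_id for _, doc_id in group]
    postings[term] = [(d, doc_ids.count(d)) for d in sorted(set(doc_ids))]
    dictionary[term] = len(postings[term])

print(dictionary["brutus"], postings["brutus"])
print(dictionary["caesar"], postings["caesar"])
print(dictionary["killed"], postings["killed"])
```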

Processing Boolean queries

  • User boolean query: Brutus AND Calpurnia

    • over the inverted index :

      • Locate Brutus in the Dictionary

      • Retrieve its postings

      • Locate Calpurnia in the Dictionary

      • Retrieve its postings

      • Intersect the two postings lists

  • The intersection operation is the crucial one. It has to be efficient so as to be able to quickly find documents that contain both terms.

    • sometimes referred to as merging postings lists because it uses a merge algorithm

    • Merge algorithm : general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each list

Extended boolean queries

Merging algorithm (from Manning et al., 07)

NB: the posting lists HAVE to be sorted.
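
A sketch of the standard two-pointer intersection of two sorted postings lists is given below. It is written from the general description of merge algorithms rather than copied from the referenced figure, and the postings lists used in the example are illustrative values only.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by increasing document id."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, so only p1 can catch up
        else:
            j += 1
    return answer

# Illustrative postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, so the loop performs at most len(p1) + len(p2) iterations. The argument only holds because both lists are sorted.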

Extended boolean queries

  • Generalisation of the merging process:

    • Imagine more than 2 keywords appear in the query:

      • (Brutus AND Caesar) AND NOT (Capitol)

      • Brutus AND Caesar AND Capitol

      • (Brutus OR Caesar) AND (Capitol)

      • ...

  • Ideas:

    • consider keywords with shorter posting lists first (to reduce the number of operations).

    • use the frequency information stored in the dictionary

       See Manning et al., 07 for the algorithm
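
For a pure conjunction such as Brutus AND Caesar AND Capitol, the "shortest list first" idea can be sketched as below. The function names and the postings lists are hypothetical, and the length of each list stands in for the frequency information stored in the dictionary.

```python
def intersect(p1, p2):
    """Two-pointer intersection of sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(index, terms):
    """AND together several terms, processing the shortest postings list first."""
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result can never grow again
        result = intersect(result, index.get(term, []))
    return result

# Hypothetical postings lists.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "capitol": [2, 31],
}
print(intersect_many(index, ["brutus", "caesar", "capitol"]))
```

Starting with the shortest list keeps every intermediate result no longer than that list, which bounds the work done in the later intersections.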

Extended boolean queries

retrieved docs : D7, D5, D2

Exercise

  • How would you process the following query (main steps)?

  • Brutus AND NOT Caesar

  • Try your algorithm on

Exercise

  • How would you process the following query (main steps)

    Brutus OR NOT Caesar

Remarks on the boolean model

  • The boolean model allows precise queries to be expressed (you know what you get, BUT you do not have flexibility → exact matches)

  • Boolean queries can be processed efficiently (time complexity of the merge algorithm is linear in the sum of the lengths of the lists to be merged)

  • Has been a reference model in IR for a long time

Advantages of exact-match retrieval

  • Predictable, easy to explain

  • Structured queries

  • Works well when information need is clear and precise

Drawbacks of exact-match retrieval

  • Unintuitive for non experts: adequate query formulation difficult for most users

  • no ranking of retrieved documents

  • exact matching may lead to too few or too many retrieved documents

    • too few: if not using synonyms

    • difficulty increases with collection size

    • large results sets need to be compensated by interactive query refinement

  • No notion of partial relevance (useful if query is overrestrictive)

  • All terms have equal importance (no term weighting)

  • Ranking models are consistently better

Boolean model: The story so far

  • An inverted index associates keywords with posting lists

  • The postings lists contain document identifiers (and other useful information, such as total frequencies, number of documents, etc.)

  • Boolean queries are processed by merging posting lists in order to find the documents satisfying the query

  • The cost of this list merging is linear in the total number of document Ids: O(m + n)

  • Question: how to process phrase queries (i.e. taking the word’s context into account) ?

Dealing with phrase queries

  • Many complex or technical concepts and many organization and product names are multiword compounds or phrases.

    • Stanford University

    • Graph Theory

    • Natural Language Processing

    • ...

  • The user wants documents where the whole phrase appears, and not only some parts of it (e.g. « The inventor Stanford Ovshinsky never went to university » is not a match)

  • About 10 % of the web queries are phrase queries (songs’ names, institutions...)

  • Such queries need either more complex dictionary terms, or more complex index (critical parameter: size of the index)

Biword indexes

Use key-phrases of length 2, example :

  • Text : Natural Language Processing

  • Dictionary:

    • Natural Language

    • Language Processing

    • The dictionary is made of biwords (notion of context)

  • Query : Information retrieval in Natural Language Processing

    • (Information retrieval) AND (retrieval Natural) AND (Natural Language) AND (Language Processing)

    • It might seem a better query to omit the middle biword.

    • Better results can be obtained by using more precise part-of-speech patterns that define which extended biwords should be indexed
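
A small sketch of a biword index and of phrase lookup through it follows. The three documents and the function names are hypothetical, and tokenization is plain lowercased whitespace splitting.

```python
from collections import defaultdict

def biwords(text):
    """Return the list of consecutive word pairs in the text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def build_biword_index(docs):
    """Map each biword to the set of documents it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index

def phrase_query(index, phrase):
    """AND together the postings of every biword in the phrase."""
    pairs = biwords(phrase)
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

# Hypothetical collection.
docs = {
    1: "natural language processing applications",
    2: "processing natural language is hard",
    3: "language processing of natural language",
}
index = build_biword_index(docs)
print(sorted(phrase_query(index, "natural language processing")))
```

Document 3 is returned although it never contains the three words in a row: it contains both biwords, but in different places. For phrases longer than two words, a biword index can therefore return false positives.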

Positional indexes

  • Store positions in the inverted indexes, example:

    termID ::=

    doc1: position1, position2, ...

    doc2: position1, position2, ...


  • Processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists)

  • NB: such indexes can be used to process proximity queries (i.e. using constraints on proximity between words)


  • Positional indexes need an entry per occurrence (NB: classic inverted indexes need an entry per document Id)

  • The size of such indexes grows with the number of term occurrences, so it depends on the size of the documents

  • The size of a positional index depends on the language being indexed and the type of document (books, articles, etc)

  • On average, a positional index is 2-4 times bigger than a non-positional inverted index; it can reach 35 to 50 % of the size of the original text (for English)

  • Positional indexes can be used in combination with classic indexes to save time and space (see [Williams et al, 2005]).

Exercise

  • Which documents can contain the sentence “to be or not to be” considering the following (incomplete) indexes ?

    be ::=

    1: 7, 18, 33, 72, 86, 231

    2: 3, 149

    4: 17, 191, 291, 430, 434

    5: 363, 367

    to ::=

    2: 1, 17, 74, 222, 551

    4: 8, 16, 190, 429, 433

    7: 13, 23, 191

Exercise

  • Given the following positional indexes, give the documents Ids corresponding to the query “world wide web” :

  • world ::=

  • 1: 7, 18, 33, 70, 85, 131

  • 2: 3, 149

  • 4: 17, 190, 291, 430, 434

  • wide ::=

  • 1: 12, 19, 40, 72, 86, 231

  • 2: 2, 17, 74, 150, 551

  • 3: 8, 16, 191, 429, 435

  • web ::=

  • 1: 20, 22, 41, 75, 87, 200

  • 2: 18, 32, 45, 56, 77, 151

  • 4: 25, 192, 300, 332, 440

Solution for the “to be or not to be” exercise

  • The postings lists to access are: to, be, or, not.

  • We will examine intersecting the postings lists for to and be.

  • We first look for documents that contain both terms.

  • Then, we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to,

  • and then we look for another occurrence of each word with a token index 4 higher than the first occurrence.

  • In the lists given for this exercise, the pattern of occurrences that is a possible match is:

  • to: <...; 4: <..., 429, 433>; ...>

  • be: <...; 4: <..., 430, 434>; ...>
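
The check described in these steps can be sketched as follows, using the partial postings for to and be from the “to be or not to be” exercise. Since or and not are not indexed there, the result is only a set of candidates. The function name `phrase_candidates` and the (term, offset) encoding of the phrase are assumptions made for the example, and membership tests on Python lists replace the linear merge for brevity.

```python
# Positional postings from the exercise: term -> {doc_id: [positions]}.
index = {
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433],
           7: [13, 23, 191]},
}

def phrase_candidates(index, pattern):
    """pattern: list of (term, offset) pairs; returns {doc_id: [start positions]}."""
    terms = {term for term, _ in pattern}
    # Only documents containing every term can match.
    common_docs = set.intersection(*(set(index[t]) for t in terms))
    first_term, first_offset = pattern[0]
    matches = {}
    for doc_id in sorted(common_docs):
        starts = []
        for p in index[first_term][doc_id]:
            start = p - first_offset
            # Every term must occur at its expected offset from the start.
            if all(start + off in index[t][doc_id] for t, off in pattern):
                starts.append(start)
        if starts:
            matches[doc_id] = starts
    return matches

# "to be or not to be": to at offsets 0 and 4, be at offsets 1 and 5.
pattern = [("to", 0), ("be", 1), ("to", 4), ("be", 5)]
print(phrase_candidates(index, pattern))
```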

Exercise

Consider the following index:

  • Language: <d1,12><d2,23-32-43><d3,53><d5,36-42-48>

  • Loria: <d1,25> <d2,34-40> <d5,38-51>

    Where dI refers to the document I, the other numbers being positions.

    The infix operator NEAR/x refers to the proximity x between two terms:

  • Give the solutions to the query language NEAR/2 Loria

  • Give the pairs (x,docids) for each x such that language NEAR/x Loria has at least one solution

  • Propose an algorithm for retrieving matching document for this operator
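
One possible approach to the last question is sketched below. It assumes that NEAR/x holds when some position of the first term and some position of the second differ by at most x, in either order; other readings of "proximity" are possible. The function name `near` and the dictionary layout are illustrative choices. Because position lists are sorted, two pointers suffice: the pointer at the smaller position is advanced until a close enough pair is found or one list is exhausted.

```python
def near(postings_a, postings_b, x):
    """Documents in which some position of a and some position of b differ by at most x.

    postings_a, postings_b : {doc_id: sorted list of positions}  (assumed layout)
    """
    result = []
    for doc in sorted(set(postings_a) & set(postings_b)):
        pa, pb = postings_a[doc], postings_b[doc]
        i = j = 0
        while i < len(pa) and j < len(pb):
            if abs(pa[i] - pb[j]) <= x:
                result.append(doc)
                break
            # Advance the pointer that sits at the smaller position.
            if pa[i] < pb[j]:
                i += 1
            else:
                j += 1
    return result


language = {1: [12], 2: [23, 32, 43], 3: [53], 5: [36, 42, 48]}
loria = {1: [25], 2: [34, 40], 5: [38, 51]}

print(near(language, loria, 2))   # [2, 5]
print(near(language, loria, 13))  # [1, 2, 5]
```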

    Example: WESTLAW

    • Large commercial system that has served the legal and professional market since 1974

      • legal materials (court opinions, statutes, regulations, ...)

      • news (newspapers, magazines, journals, ...)

      • financial (stock quotes, financial analyses, ...)

    • Total collection size: 5-7 Terabytes

    • 700 000 users (they claim 56% of legal searchers as of 2002)

    • Best match added in 1992

    WESTLAW query language features

    • Boolean and proximity operators

      • Phrases : West Publishing

      • Word Proximity : West /5 Publishing

      • Same sentence : Massachusetts /s technology

      • Same paragraph : information retrieval /p

    • Restrictions : DATE(AFTER 1992 & BEFORE 1995)

    • Term expansion

      • wildcard (THOM*SON); truncation (THOM!); automatic expansion of plurals and possessives

    • Document structure (fields)

    WESTLAW query example

    • Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.

      Query: "trade secret" /s disclos! /s prevent /s employe!

    • Information need: Requirements for disabled people to be able to access a workplace.

      Query: disab! /p access! /s work-site work-place (employment /3 place)

    • Information need: Cases about a host’s responsibility for drunk guests.

      Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

    Boolean query languages are not dead

    • Exact match still prevalent in the commercial market (but then includes some type of ranking)

    • Many users prefer Boolean

    • For some queries/collections, Boolean may work better

    • Boolean and free text queries find different documents

      ⇒ Need retrieval models that support both

    The Vector Space Model

    Best-Match retrieval

    • Boolean retrieval is the archetypal example of exact-match retrieval

    • Best-match or ranking models are now more common

    • Advantages

      • easier to use

      • similar efficiency

      • provides ranking

      • best match generally has better retrieval performance

      • most relevant documents appear at the top of the ranking

    • But: comparing best-match and exact-match retrieval is difficult


    • Boolean model: all documents matching the query are retrieved

    • The matching is binary: yes or no

    • Extreme cases: the list of retrieved documents can be empty, or huge

    • A ranking of the documents matching a query is needed

    • A score is computed for each pair (query, document)

    Vector-space Retrieval

    • By far the most common retrieval systems

    • Key idea: Everything (documents, queries) is a vector in a high dimensional space

    • Vector coefficients for an object (document, query, term) represent the degree to which this object embodies each of the basic dimensions

    • Relevance is measured using vector similarity: a document is relevant to a query if their representing vectors are similar

    Vector-space Representation

    • Documents are vectors of terms

    • Terms are vectors of documents

    • A query is a vector of terms

    Graphic Representation

    Example:

    D1 = 2T1 + 3T2 + 5T3

    D2 = 3T1 + 7T2 + T3

    Q = 0T1 + 0T2 + 2T3

    [Figure: D1, D2 and Q drawn as vectors in the space spanned by the term axes T1, T2 and T3]

    Similarity in the Vector-space

    • Vectors can contain binary terms or weighted terms

      • Binary term vector: 1 = term present, 0 = term absent

      • Weighted term vector: indicates the relative importance of terms in a document

    • Vector similarity can be measured in several ways:

      • Inner product (measure of overlap)

      • Cosine coefficient

      • Jaccard coefficient

      • Dice coefficient

      • Minkowski metric (dissimilarity)

      • Euclidean distance (dissimilarity)
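
As an illustration of two of the measures listed above, here is a minimal sketch of the Jaccard and Dice coefficients for binary term vectors, represented as Python sets of terms. The function names and the sample query and document are made up for the example; the formulas are the usual set-based definitions, |A ∩ B| / |A ∪ B| and 2|A ∩ B| / (|A| + |B|).

```python
def jaccard(a, b):
    """Jaccard coefficient: shared terms divided by all distinct terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient: twice the shared terms divided by the sum of the set sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


query = {"information", "retrieval"}
doc = {"information", "retrieval", "system", "evaluation"}

print(jaccard(query, doc))         # 0.5
print(round(dice(query, doc), 3))  # 0.667
```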

    Using the inner product similarity measure

    • Given a query vector q and a document vector d, both of length n,

    • the similarity between q and d is defined by the inner product q • d of q and d:

      $q \cdot d = \sum_{i=1}^{n} q_i d_i$

    • where qi (di) is the value of the i-th position of q (d)

    • With binary values this amounts to counting the matching terms between q and d
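
A small sketch of this measure, using the vectors D1, D2 and Q from the graphic representation slide (coordinates over T1, T2, T3). The function name `inner_product` is an illustrative choice.

```python
def inner_product(q, d):
    """Sum of q_i * d_i over the n positions of two equal-length vectors."""
    if len(q) != len(d):
        raise ValueError("q and d must have the same length")
    return sum(qi * di for qi, di in zip(q, d))


D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(inner_product(Q, D1))  # 10
print(inner_product(Q, D2))  # 2

# With binary weights, the inner product counts the terms shared by q and d.
print(inner_product([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```

With this measure D1 is ranked above D2 for the query Q.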

    The effect of varying document lengths

    • Problem :

      • Longer documents will be represented with longer vectors, but that does not mean they are more important

      • If two documents have the same score, the shorter one should be preferred

    • Solution : the length of a document must be taken into account when computing the similarity score

    Document length normalization

    • The length of a document: Euclidean length

    • If d = (x1, x2, ..., xn) then dw = $\sqrt{\sum_{i=1}^{n} x_i^2}$

    • To normalize a document, we divide it by its own length: d/dw

    • Similarity is given by the cosine measure between normalized vectors:

      $\cos(q, d) = \frac{q \cdot d}{q_w \, d_w} = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}$

    • One problem is solved: shorter, more focused documents receive a higher score than longer documents with the same matching terms

    • But: shorter documents are generally preferred over longer ones!

    • More sophisticated weighting schemes are generally used
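
A minimal sketch of length normalization and the cosine measure, again on the example vectors D1, D2 and Q (coordinates over T1, T2, T3). The function names are illustrative, and returning 0.0 for a zero-length vector is a convention chosen here, not something the slides prescribe.

```python
import math


def euclidean_length(v):
    """Square root of the sum of squared coordinates."""
    return math.sqrt(sum(x * x for x in v))


def cosine(q, d):
    """Inner product of q and d divided by the product of their lengths."""
    lq, ld = euclidean_length(q), euclidean_length(d)
    if lq == 0 or ld == 0:
        return 0.0  # convention chosen for this sketch
    return sum(qi * di for qi, di in zip(q, d)) / (lq * ld)


D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(round(cosine(Q, D1), 2))  # 0.81
print(round(cosine(Q, D2), 2))  # 0.13
```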

    Term weights

    • qi is the weight of the term i in q

    • Up to now, we only considered binary term weights

      • 0: term absent

      • 1: term present

    • Two shortcomings:

      • Does not reflect how often a term occurs

      • All terms are equally important (president vs. the)

    • Remedy: use non binary term weights

      • tf-score: store the frequency of a term in the vector (e.g., 4 if the term occurs 4 times in the document)

      • idf-score: to distinguish meaningful terms, i.e., terms that occur only in a few documents

    Term frequency

    • A document is treated as a bag of words (a set of words with their numbers of occurrences)

    • Each word characterizes that document to some extent

    • When we have eliminated stop words, the most frequent words tend to be what the document is about

    • Therefore: fkd (nb of occurrences of word k in document d) will be an important measure.

      → Also called the term frequency (tf)

    Document frequency

    • What makes this document distinct from others in the corpus?

    • The terms which discriminate best are not those which occur with high document frequency!

    • Therefore: dk (nb of documents in which word k occurs) will also be an important measure.

      → Also called the document frequency (df); the idf score is based on its inverse

    TF.IDF

    • This can all be summarized as:

      • Words are best discriminators when :

        • they occur often in this document (term frequency)

        • do not occur in a lot of documents (document frequency)

      • One very common measure of the importance of a word to a document is :

        • TF.IDF: term frequency x inverse document frequency

      • There are multiple formulas for actually computing this. The underlying concept is the same in all of them.

    Term weights

    • tf-score : tfi,j = frequency of term i in document j

    • idf-score : idfi = inverse document frequency of term i

      • idfi = log(N/ni) with

        • N, the size of the document collection (nb of documents)

        • ni, the number of documents in which the term i occurs

      • ni/N is the proportion of the document collection in which term i occurs, so idfi measures the rarity of a term in the document collection

    • Term weight of term i in document j (TF-IDF):

      • tfi,j . idfi
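
The weighting tfi,j . idfi with idfi = log(N/ni) can be sketched as follows. This is one of the many variants and only an illustration: the function name `tf_idf`, the natural logarithm, the raw (unnormalized) term frequency and the three toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N, the size of the collection
    # n_i: number of documents in which term i occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw frequency of term i in document j
        weights.append(
            {term: freq * math.log(n_docs / doc_freq[term]) for term, freq in tf.items()}
        )
    return weights


docs = [
    "the president spoke to the press".split(),
    "the press asked the president".split(),
    "the weather was mild".split(),
]
w = tf_idf(docs)

print(w[0]["the"])                  # 0.0
print(round(w[0]["president"], 3))  # 0.405
print(round(w[2]["weather"], 3))    # 1.099
```

A term such as "the", which occurs in every document, gets weight 0 however frequent it is, while a term found in a single document gets the highest idf.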

    Boolean retrieval vs. Vector Space Retrieval

    • Boolean retrieval

      • Documents are not ranked

      • Boolean queries are not easy to manipulate

    • Vector space retrieval

      • Documents can be ranked

      • Issue 1: choice of comparison function. Usually cosine comparison.

      • Issue 2: choice of weighting scheme. Usually variations on tfi,j . idfi

    Evaluation

    Evaluation

    • Issues

    • User-based evaluation

    • System-based evaluation

    • TREC

    • Precision and recall

    Evaluation methods

    • Two types of evaluation methods:

      • User-based: measures the user satisfaction

      • System-based: focuses on how well the system ranks the documents

    User based evaluation

    • More direct

    • Expensive

    • Difficult to do correctly

    • Need sufficiently large, representative sample of users

    • The compared systems must be equally well developed (complete with fully functional user interface)

    • Each user must be trained to control learning effects

    • Information, information needs, relevance are intangible concepts

    System based evaluation

    • Good system performance = good document rankings

    • Allows for fair comparative testing

    • Less expensive; can be reused

    • Test collection = Topics, Documents, Relevance judgments

    • System based evaluation goes back to Cranfield’s experiments (1960)

      • Rate the relevance of retrieved bibliographic references on a scale from 1 to 4

    Recall and Precision

    • Three important performance metrics:

      • Precision : Proportion of retrieved documents that are relevant

        → No penalty for selecting too few items

      • Recall : Proportion of relevant documents that have been retrieved

        → No penalty for selecting too many items (e.g., everything)

    F-Measure
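
A minimal sketch of the three metrics, assuming the balanced F-measure, i.e. the harmonic mean of precision and recall, F = 2PR / (P + R), which is one common choice. The function name and the sample document IDs are made up for the example.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, balanced F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r, f = precision_recall_f1(retrieved, relevant)

print(p, r, round(f, 3))  # 0.5 0.4 0.444
```

Because it is a harmonic mean, the F-measure stays low unless both precision and recall are reasonably high, so it penalizes both selecting too few and selecting too many items.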

    Standard Text Collections

    • Relevant documents must be identified

    • Given a document collection D and a set of queries Q, RELq is the set of documents relevant to q

    • Whether a document d is relevant to a query q is decided by human judgement

    Standard Text Collections

    • CACM (computer science): 3024 abstracts, 64 queries

    • CF (medicine): 1239 abstracts, 100 queries

    • CISI (library science): 1460 abstracts, 112 queries

    • CRANFIELD (aeronautics): 1400 abstracts, 225 queries

    • LISA (library science): 6004 abstracts, 35 queries

    • TIME (newspaper): 423 abstracts, 83 queries

    • Ohsumed (medicine): 348 566 abstracts, 106 queries

    Building Test Collections

    • How to identify relevant documents?

    • How to assess relevance? (binary or finer-grained)

    • One vs several judges

    TREC

    • Text REtrieval Conference

    • Proceedings at

    • Established in 1991 to evaluate large scale IR

    • Retrieving documents from a gigabyte collection

    • Organised by NIST and run continuously since 1991

    • Best known IR evaluation setting

      • 25 participants in 1992

      • 109 participants from 4 continents in 2004

      • European (CLEF) and Asian (NTCIR) counterparts

    TREC Format

    • Several IR research tracks

      • ad-hoc retrieval

      • routing/filtering

      • cross-language

      • scanned document

      • spoken document

      • Video

      • Web

      • question answering

      • ...

    TREC notion of relevance

    If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant

    • Pooling is used for identifying relevant documents:

      • A set of possibly relevant documents is created automatically for each information need

      • The top 100 documents returned by each system are kept and inspected by judges who determine which documents are relevant

      • Inter-judge agreement is about 80%
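
The pooling step can be sketched as the union of the top-ranked documents of every participating system. The function name `build_pool`, the run names and the document IDs are hypothetical; the default depth of 100 follows the slide, and the example uses a depth of 2 to stay readable.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents of every system's ranked list.

    runs : {system name: list of document IDs, best first}  (assumed layout)
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d9", "d3", "d4"],
}

print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd3', 'd9']
```

Only the pooled documents are shown to the judges; documents outside the pool are never inspected.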

    Improving Recall and Precision

    • The two big problems with short queries are:

      • Synonymy: Poor recall results from missing documents that contain synonyms of search terms, but not the terms themselves

      • Polysemy/Homonymy: Poor precision results from search terms that have multiple meanings leading to the retrieval of non-relevant documents.

    Query Expansion

    • Find a way to expand a user's query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall

      • Use a dictionary/thesaurus

      • Use relevance feedback

    Thesauri

    • A thesaurus contains information about words (e.g., violin) such as :

      • Synonyms: similar words e.g., fiddle

      • Hyperonyms: more general words e.g., instrument

      • Hyponyms: more specific words e.g., Stradivari

      • Meronyms: parts, e.g., strings

    • A very popular machine readable thesaurus is WordNet
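
Thesaurus-based query expansion can be sketched with a toy dictionary built from the violin example above. The dictionary `THESAURUS`, its layout and the function `expand_query` are hypothetical and do not reflect the interface of WordNet or of any real thesaurus.

```python
# Hypothetical toy thesaurus holding only the violin example.
THESAURUS = {
    "violin": {
        "synonyms": ["fiddle"],
        "hypernyms": ["instrument"],
        "hyponyms": ["stradivari"],
        "meronyms": ["strings"],
    },
}


def expand_query(terms, thesaurus, relations=("synonyms",)):
    """Append the related words of each query term, without duplicates."""
    expanded = list(terms)
    for term in terms:
        entry = thesaurus.get(term, {})
        for relation in relations:
            for related in entry.get(relation, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded


print(expand_query(["violin", "concerto"], THESAURUS))
# ['violin', 'concerto', 'fiddle']
print(expand_query(["violin"], THESAURUS, relations=("synonyms", "hypernyms")))
# ['violin', 'fiddle', 'instrument']
```

Adding only synonyms is the most conservative choice; more general words tend to raise recall while letting in less relevant documents.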

    Problems of Thesauri

    • Language dependent

    • Available only for a couple of languages

    Co-occurrence models

    • Semantically or syntactically related terms

    • Co-occurrence vs. Thesauri

      • Easy to adapt to other languages/domains

      • Also covers relations not expressed in thesauri

      • Not as reliable as manually edited thesauri

      • Can introduce considerable noise

    • Selection criteria: mutual information, expected mutual information
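
As an illustration of a co-occurrence criterion, the sketch below computes pointwise mutual information (PMI), the per-pair variant of mutual information, log2(P(x, y) / (P(x) P(y))). It assumes that two terms co-occur when they appear in the same document and estimates probabilities by document counts; the function name `pmi` and the four toy documents are made up for the example.

```python
import math


def pmi(docs, x, y):
    """Pointwise mutual information of terms x and y over a list of term sets."""
    n = len(docs)
    n_x = sum(1 for doc in docs if x in doc)
    n_y = sum(1 for doc in docs if y in doc)
    n_xy = sum(1 for doc in docs if x in doc and y in doc)
    if n_xy == 0:
        return float("-inf")  # the two terms never co-occur
    # P(x, y) / (P(x) * P(y)) = (n_xy / n) / ((n_x / n) * (n_y / n))
    return math.log2(n_xy * n / (n_x * n_y))


docs = [
    {"violin", "bow", "concert"},
    {"violin", "bow"},
    {"concert", "ticket"},
    {"ticket", "price"},
]

print(pmi(docs, "violin", "bow"))      # 1.0
print(pmi(docs, "concert", "ticket"))  # 0.0
print(pmi(docs, "violin", "ticket"))   # -inf
```

Pairs with a high score are candidates for expansion terms; a score near 0 means the terms co-occur about as often as chance would predict.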

    Relevance feedback

    • Ask user to identify a few documents which appear to be related to their information need

    • Extract terms from those documents and add them to the original query.

    • Run the new query and present those results to the user.

    • Typically converges quickly
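
One round of this loop can be sketched as follows. The selection rule used here, adding the most frequent new terms of the documents marked as relevant, is a deliberately simple assumption for illustration; the function name, the `n_terms` parameter and the sample documents are hypothetical.

```python
from collections import Counter


def expand_with_feedback(query_terms, feedback_docs, n_terms=2, stop_words=frozenset()):
    """Add to the query the n_terms most frequent new terms of the feedback documents.

    feedback_docs : token lists of the documents the user marked as relevant
    """
    counts = Counter(
        term
        for doc in feedback_docs
        for term in doc
        if term not in query_terms and term not in stop_words
    )
    new_terms = [term for term, _ in counts.most_common(n_terms)]
    return list(query_terms) + new_terms


query = ["jaguar"]
feedback_docs = [
    ["jaguar", "car", "engine", "car"],
    ["jaguar", "car", "engine", "dealer"],
]

print(expand_with_feedback(query, feedback_docs))  # ['jaguar', 'car', 'engine']
```

The expanded query is then run again and the new results are shown to the user.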

    Blind feedback

    • Assume that first few documents returned are most relevant rather than having users identify them

    • Proceed as for relevance feedback

    • Tends to improve recall at the expense of precision

    Post-Hoc Analysis

    • When a set of documents has been returned, they can be analyzed to improve usefulness in addressing the information need

      • Grouped by meaning for polysemic queries (using N-Gram-type approaches)

      • Grouped by extracted information (Named entities, for instance)

      • Grouped into an existing hierarchy if structured fields are available

      • Filtering (e.g., eliminate spam)

    References

    • Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. To appear at Cambridge University Press (chapters available at the book website).

    • Information Retrieval, Second Edition, by C.J. van Rijsbergen, Butterworths, London, 1979.