Introduction to Information Retrieval and Web-based Searching Methods

Mark Sanderson, University of Sheffield

[email protected], dis.shef.ac.uk/mark/

©Mark Sanderson, Sheffield University

Contents
  • Introduction
  • Ranked retrieval
  • Models
  • Evaluation
  • Advanced ranking
  • Future
  • Sources

Aims
  • To introduce you to basic notions in the field of Information Retrieval with a focus on Web based retrieval issues.
    • To squeeze it all into 4 hours, including coffee breaks
    • If it’s not covered in here, hopefully there will at least be a reference

Objectives
  • At the end of this you will be able to…
    • Demonstrate the workings of document ranking
      • Remove suffixes from words.
    • Explain how recall and precision are calculated.
    • Exploit Web specific information when searching.
    • Outline the means of automatically expanding users’ queries.
    • List IR publications.

Introduction
  • What is IR?
    • General definition
      • Retrieval of unstructured data
    • Most often it is
      • Retrieval of text documents
        • Searching newspaper articles
        • Searching on the Web
    • Other types
      • Image retrieval

Typical interaction
  • User has information need.
    • Expresses it as a query
      • in their natural language?
  • IR system finds documents relevant to the query.

Text
  • No computer understanding of document or query text
  • Use “bag of words” approach
    • Pay no heed to inter-word relations:
      • syntax, semantics
    • Bag does characterise document
    • Not perfect, words are
      • ambiguous
      • used in different forms or synonymously

To recap

[Figure: flow diagram of an IR system. Labels: Documents, Process, Store; User, Query, Process; IR System; Retrieved relevant(?) documents; Retrieval part.]

Processing
  • “The destruction of the amazon rain forests”
  • Case normalisation
  • Stop word removal.
    • From fixed list
    • “destruction amazon rain forests”
  • Suffix removal, also known as stemming.
    • “destruct amazon rain forest”
  • Documents processed as well
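
The first two processing steps above can be sketched in a few lines. This is a minimal illustration only: the stop list is a tiny hypothetical one (real systems use a much longer fixed list), and tokens are split on whitespace.

```python
# Hypothetical stop list for illustration; a real fixed list is far longer.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to"}


def preprocess(text):
    """Case-normalise the text, split it on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The destruction of the amazon rain forests"))
# ['destruction', 'amazon', 'rain', 'forests']
```

Suffix removal would then be applied to each remaining token, and the same function would be run over the documents as well as the query.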

Different forms - stemming
  • Matching the query term “forests”
    • to “forest” and “forested”
  • Stemmers remove affixes
    • removal of suffixes - worker
    • prefixes? - megavolt
    • infixes? - un-bloody-likely
  • Stick with suffixes

Plural stemmer
  • Plurals in English
    • If word ends in “ies” but not “eies”, “aies”
      • “ies” -> “y”
    • if word ends in “es” but not “aes”, “ees”, “oes”
      • “es” -> “e”
    • if word ends in “s” but not “us” or “ss”
      • “s” -> “”
    • First applicable rule is the one used
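
The three rules above translate almost directly into code. The sketch below assumes that a rule is "applicable" only when its suffix matches and none of its exceptions do, so a word blocked by an exception falls through to the next rule; other readings of the rule order are possible.

```python
def plural_stem(word):
    """Apply the three plural rules in order; only the first applicable rule fires."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


print(plural_stem("forests"), plural_stem("queries"))  # forest query
```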

Plural stemmer reference
  • Good review of stemming
        • Frakes, W. (1992): Stemming algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms: 131-160

Plural stemmer
  • Examples
    • Forests - ?
    • Statistics - ?
    • Queries - ?
    • Foes - ?
    • Does - ?
    • Is - ?
    • Plus - ?
    • Plusses - ?

Take more off?
  • What about
    • “ed”, “ing”, “ational”, “ation”, “able”, “ism”, etc.
        • Porter, M.F. (1980): An algorithm for suffix stripping, in Program - automated library and information systems, 14(3): 130-137
        • Three pages of rules
  • What about
    • “bring”, “table”, “prism”, “bed”, “thing”?
  • When to strip, when to stop

CVCs
  • Porter used pattern of letters
    • [C*](VC)^m[V*]
    • Tree - m=?
    • Trouble - m=?
    • Troubles - m=?
    • m = 0 or sometimes 1
      • stop
    • Syllables?
        • Pinker, S. (1994): The Language Instinct
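
The value of m in the pattern above can be computed by counting how often a run of vowels is followed by a consonant. The sketch below assumes Porter's convention that "y" counts as a vowel when it follows a consonant; the function names are illustrative, not taken from Porter's paper.

```python
def is_consonant(word, i):
    """True if the letter at position i acts as a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the vowel-run-to-consonant transitions, i.e. m in [C*](VC)^m[V*]."""
    word = word.lower()
    m = 0
    previous_was_vowel = False
    for i in range(len(word)):
        consonant = is_consonant(word, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


for example in ("tree", "trouble", "troubles"):
    print(example, measure(example))
```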

Problems
  • Porter doesn’t always return words
    • “query”, “queries”, “querying”, etc
      • -> “queri”
        • Krovetz, R. (1993): Viewing morphology as an inference process, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 191-202
        • Xu, J., Croft, W.B. (1998): Corpus-Based Stemming using Co-occurrence of Word Variants, in ACM Transactions on Information Systems, 16(1): 61-81

Is it used?
  • Research says it is useful
        • Hull, D.A. (1996): Stemming algorithms: A case study for detailed evaluation, in Journal of the American Society for Information Science, 47(1): 70-84
  • Web search engines hardly use it
    • Why?
      • Unexpected results
        • computer, computation, computing, computational, etc.
      • User expectation?
      • Foreign languages?

Ranked retrieval
  • Everything processed into a bag…
  • …calculate relevance score between query and every document
  • Sort documents by their score
  • Present top scoring documents to user.

The scoring
  • For each document
    • Term frequency (tf)
      • t: Number of times term occurs in document
      • dl: Length of document (number of terms)
    • Inverse document frequency (idf)
      • n: Number of documents term occurs in
      • N: Number of documents in collection
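
The transcript does not preserve the scoring formula itself, so the following is only a sketch under stated assumptions: tf is taken as t / dl, idf as log(N / n), and a document's score as the sum of tf × idf over the query terms. The toy collection is made up for illustration.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Score every document against the query and sort, best first.

    docs maps a document id to its bag of words (a list of processed terms).
    """
    N = len(docs)  # number of documents in the collection
    # n for each term: the number of documents the term occurs in
    doc_freq = Counter(term for bag in docs.values() for term in set(bag))
    scores = {}
    for doc_id, bag in docs.items():
        counts = Counter(bag)
        dl = len(bag)  # document length in terms
        score = 0.0
        for term in query_terms:
            t = counts[term]  # times the term occurs in this document
            if t == 0:
                continue
            n = doc_freq[term]
            score += (t / dl) * math.log(N / n)  # assumed tf * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "d1": ["amazon", "rain", "forest", "destruct"],
    "d2": ["rain", "rain", "weather"],
    "d3": ["amazon", "book", "store"],
    "d4": ["football", "result"],
}
ranking = rank(["destruct", "amazon", "rain", "forest"], docs)
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd2', 'd3', 'd4']
```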

TF
  • More often a term is used in a document
    • More likely document is about that term
    • Depends on document length?
        • Harman, D. (1992): Ranking algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms: 363-392
          • Watch out for mistake: not unique terms.
      • Problems with spamming

Spamming the tf weight
  • Searching for Jennifer Anniston?

SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK

IDF
  • Some query terms better than others?
  • In general, fair to say that…
    • “amazon” > “forest”  “destruction” > “rain”

To illustrate

[Figures: within “All documents”, the sets “Relevant documents”, “amazon” and “rain” are shown in turn.]

IDF and collection context
  • IDF sensitive to the document collection content
    • General newspapers
      • “amazon” > “forest”  “destruction” > “rain”
    • Amazon book store press releases
      • “forest”  “destruction” > “rain” > “amazon”

Very successful
  • Simple, but effective
  • Core of most weighting functions
    • tf (term frequency)
    • idf (inverse document frequency)
    • dl (document length)

Robertson’s BM25
  • Q is a query containing terms T
  • w is a form of IDF
  • k1, b, k2, k3 are parameters.
  • tf is the document term frequency.
  • qtf is the query term frequency.
  • dl is the document length (arbitrary units).
  • avdl is the average document length.
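
The formula these symbols belong to is not preserved in the transcript. As a hedged reconstruction, the form usually quoted from the Okapi TREC papers, written with the symbols listed above, is shown below. Here |Q| is the number of query terms, and the IDF-like weight w is assumed to be the Robertson/Sparck Jones weight without relevance information, with N and n as defined on the scoring slide.

```latex
\sum_{T \in Q} w \,
  \frac{(k_1 + 1)\, tf}{K + tf} \,
  \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
  \;+\; k_2 \cdot |Q| \cdot \frac{avdl - dl}{avdl + dl},
\qquad
K = k_1 \left( (1 - b) + b \, \frac{dl}{avdl} \right),
\qquad
w = \log \frac{N - n + 0.5}{n + 0.5}
```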

Reference for BM25
  • Popular weighting scheme
        • Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M., Payne, A. (1995): Okapi at TREC-4, in NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4): 73-96

Getting the balance
  • Documents with all the query terms?
  • Just those with high tf•idf terms?
    • What sorts of documents are these?
  • Search for a picture of Arbour Low
    • Stone circle near Sheffield
    • Try Google and AltaVista

Very short “arbour” only

Longer, lots of “arbour”, no “low”

“arbour low”

Arbour Low documents do exist

Lots of Arbour Low documents

Disambiguation?

Result
  • From Google
    • “The Stonehenge of the north”

Caveat
  • Search engines don’t say much
    • Hard to know how they work

Boolean searching?
  • Start with query
    • “amazon” & “rain forest*” & (“destroy” | “destruction”)
  • Break collection into two unordered sets
    • Documents that match the query
    • Documents that don’t
  • User has complete control but…
    • …not easy to use.
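
Boolean matching maps naturally onto set operations over an inverted index. The sketch below simplifies the query above: it treats “rain forest*” as the two separate terms "rain" and "forest" (no phrase or wildcard matching), and the four documents are invented for illustration.

```python
def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


docs = {
    1: "destruction of the amazon rain forest",
    2: "amazon opens a new book store",
    3: "loggers destroy amazon rain forest",
    4: "rain forest weather report",
}
index = build_index(docs)


def posting(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


# amazon & rain & forest & (destroy | destruction)
matched = (posting("amazon") & posting("rain") & posting("forest")
           & (posting("destroy") | posting("destruction")))
unmatched = set(docs) - matched
print(sorted(matched), sorted(unmatched))  # [1, 3] [2, 4]
```

Both result sets are unordered: nothing in the matching says that document 1 is a better answer than document 3.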

Boolean
  • Two forms of query/retrieval system
    • Ranked retrieval
      • Long championed by academics
    • Boolean
      • Rooted in commercial systems from 1970s
        • Koenig, M.E. (1992): How close we came, in Information Processing and Management, 28(3): 433-436
  • Modern systems
    • Hybrid of both

Don’t need Boolean?
  • Ranking found to be better than Boolean
  • But lack of specificity in ranking
    • destruction AND (amazon OR south american) AND rain forest
    • destruction, amazon, south american, rain forest
        • Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998): Real Life Information Retrieval: A Study Of User Queries On The Web, in SIGIR Forum: A Publication of the Special Interest Group on Information Retrieval, 32(1): 5-17

Models
  • Mathematically modelling the retrieval process
    • So as to better understand it
    • Draw on work of others
  • Vector space
  • Probabilistic

Vector Space
  • Document/query is a vector in N space
    • N = number of unique terms in collection
  • If term in doc/qry, set that element of its vector
  • Angle between vectors = similarity measure
    • Cosine of angle (cos(0) = 1)
  • Doesn’t model term dependence
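
A minimal sketch of the cosine measure, assuming binary weights (an element is 1 if the term is present and 0 otherwise, as the slide describes). The four-term vocabulary is a made-up stand-in for the N unique terms of a collection.

```python
import math


def cosine(doc_terms, query_terms, vocabulary):
    """Cosine of the angle between binary term vectors over the vocabulary."""
    d = [1 if term in doc_terms else 0 for term in vocabulary]
    q = [1 if term in query_terms else 0 for term in vocabulary]
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


vocabulary = ["amazon", "rain", "forest", "book"]
doc = {"amazon", "rain", "forest"}
query = {"amazon", "forest"}
print(round(cosine(doc, query, vocabulary), 3))  # 0.816
```

Replacing the 0/1 entries with tf·idf weights changes only how `d` and `q` are filled in; the angle calculation stays the same.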

[Figure: document vector D and query vector Q.]

Model references
        • wx,y - weight of vector element
  • Vector space
        • Salton, G. & Lesk, M.E. (1968): Computer evaluation of indexing and text processing. Journal of the ACM, 15(1): 8-36
        • Any of the Salton SMART books

Modelling dependence
  • Latent Semantic Indexing (LSI)
    • Reduce dimensionality of N space
      • Bring related terms together.
        • Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E. (1988): Information retrieval using a singular value decomposition model of latent semantic structure, in Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 465-480
        • Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing: 554-566

Probabilistic
  • Assume independence

Model references
  • Probabilistic
    • Original papers
        • Robertson, S.E. & Sparck Jones, K. (1976): Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146.
        • Van Rijsbergen, C.J. (1979): Information Retrieval
          • Chapter 6
    • Survey
        • Crestani, F., Lalmas, M., van Rijsbergen, C.J., Campbell, I. (1998): “Is This Document Relevant? ...Probably”: A Survey of Probabilistic Models in Information Retrieval, in ACM Computing Surveys, 30(4): 528-552

Recent developments
  • Probabilistic language models
        • Ponte, J., Croft, W.B. (1998): A Language Modelling Approach to Information Retrieval, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 275-281

Evaluation
  • Measure how well an IR system is doing
    • Effectiveness
      • Number of relevant documents retrieved
    • Also
      • Speed
      • Storage requirements
      • Usability

Effectiveness
  • Two main measures
  • Precision is easy
    • P at rank 10.
  • Recall is hard
    • Total number of relevant documents?
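
The two measures are not spelled out in the transcript; the sketch below uses the standard definitions. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of all relevant documents that were retrieved. The ranking and relevance set are invented for illustration.

```python
def precision_at(ranking, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k


def recall_at(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / len(relevant)


ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}  # requires knowing every relevant document
print(precision_at(ranking, relevant, 5))  # 0.4
print(recall_at(ranking, relevant, 5))
```

Precision needs judgements only for the documents actually retrieved, whereas recall needs the complete `relevant` set, which is why it is the harder measure to obtain.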

Test collections
  • Test collection
    • Set of documents (few thousand-few million)
    • Set of queries (50-400)
    • Set of relevance judgements
      • Humans check all documents!
      • Use pooling
        • Take top 100 from every submission
        • Remove duplicates
        • Manually assess these only.
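
The pooling steps above amount to a set union over truncated rankings. A minimal sketch, with `build_pool` and the sample submissions as illustrative names only:

```python
def build_pool(submissions, depth=100):
    """Union of the top `depth` documents from every submitted ranking.

    Using a set removes documents that several submissions have in common.
    """
    pool = set()
    for ranking in submissions:
        pool.update(ranking[:depth])
    return pool


submissions = [["a", "b", "c"], ["b", "d", "e"]]
print(sorted(build_pool(submissions, depth=2)))  # ['a', 'b', 'd']
```

Only the documents in the pool go to the human assessors; anything outside it is left unjudged.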

Test collections
  • Small collections (~3 MB)
    • Cranfield, NPL, CACM - title (& abstract)
  • Medium (~4 GB)
    • TREC - full text
  • Large (~100 GB)
    • VLC track of TREC
  • Compare with reality (~10 TB)
    • CIA, GCHQ, Large search services

Where to get them
  • Cranfield, NPL, CACM
    • www.dcs.gla.ac.uk/idom/
  • TREC, VLC
    • trec.nist.gov

How to get r/p figures

[Figures: “Relevant documents” set against a “Document ranking”; “Another ranking”; “Graph these”.]

How to average queries?
  • Macro evaluation
    • Interpolate to compute p at key recall points
      • Four
        • 0.25, 0.5, 0.75, 1.0
      • Ten
        • 0.1, 0.2, 0.3, … …, 0.9, 1.0
      • Eleven (most popular)
        • 0, 0.1, 0.2, 0.3, … …, 0.9, 1.0
    • Use a pessimistic interpolation
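
A sketch of interpolating to the eleven standard recall points and then averaging across queries. It uses the widely used rule that precision at a recall level is the highest precision observed at that recall or beyond; whether this is exactly the interpolation the slide calls pessimistic is an assumption.

```python
LEVELS = [i / 10 for i in range(11)]  # 0, 0.1, ..., 1.0


def recall_precision_points(ranking, relevant):
    """(recall, precision) measured at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolate(points, levels=LEVELS):
    """Precision at each level: the best precision at any recall >= that level."""
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level - 1e-9]
        result.append(max(candidates) if candidates else 0.0)
    return result


def average_over_queries(per_query):
    """Macro average: mean precision at each recall level across queries."""
    return [sum(column) / len(per_query) for column in zip(*per_query)]


points = recall_precision_points(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3"})
curve = interpolate(points)
print(curve[0], curve[5])  # 1.0 1.0
```

Once every query has a value at the same recall levels, the curves can be averaged point by point with `average_over_queries`.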

[Figures: “Graph this”; “Graph the other”; “Can now average”.]

Homework
  • Random retrieval
    • For each query, randomly sort documents - relevant documents found (evenly) across ranking.
      • Measure precision at standard recall
    • Your assignment,
      • why does it look like this?

Why?

Papers on evaluation
  • Discusses a variety of measures
        • Hull, D. (1993): Using Statistical Testing in the Evaluation of Retrieval Experiments, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 329-338
  • Discusses accuracy of measures
        • Buckley, C., Voorhees, E. (2000): Evaluating evaluation measure stability, in Proceedings of the 23rd annual international ACM SIGIR conference on Research and Development in Information Retrieval

Other test collection papers
  • Validity of test collections
        • Voorhees, E. (1998): Variations in Relevance Judgements and the Measurement of Retrieval Effectiveness, in Proceedings of the 21st annual international ACM-SIGIR conference on Research and development in information retrieval: 315-323
  • Overviews of TREC
    • Any of the introductory Harman/Voorhees papers for any TREC.
      • trec.nist.gov

Advanced ranking
  • Re-cap
    • Document ranking based on
      • Document frequency - idf
      • Term frequency - tf
      • Document length - dl

Anything else?
  • How to make ranking better?
    • Term location
    • Web link analysis
    • Popularity
    • Others?

Term location
  • Prefer documents where terms are closer together?
    • Passage retrieval
        • Callan, J. (1994): Passage-Level Evidence in Document Retrieval, in Proceedings of the 17th annual international ACM-SIGIR conference on Research and development in information retrieval: 302-310
        • Hearst, M.A., Plaunt, C. (1993): Subtopic structuring for full-length document access, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 59-68

[Figure: sample document text used to illustrate passages:]

Research interests/publications/lecturing/supervising

My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below.

I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising.

Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences.

My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system searching over two years of the Financial Times newspaper. The system, called NRT (News Retrieval Tool), was described in a paper by Donna Harman, "User Friendly Systems Instead of User-Friendly Front Ends". Donna's paper appears in JASIS and the "Readings in IR" book by Sparck-Jones and Willett.

[Figure labels: the page text above is marked as four passages: 1st, 2nd, 3rd, 4th.]

Callan
  • Split document into passages
  • Rank a document based on score of its highest ranking passage
  • What is a passage?
    • Paragraph?
      • Bounded paragraph
    • (overlapping) Fixed window?
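
A sketch of the overlapping fixed-window variant. The passage score used here (the share of passage tokens that are query terms) is a deliberately crude stand-in for a real weighting function, and the window size and step are arbitrary illustrative values, not those used by Callan.

```python
def windows(tokens, size, step):
    """Split tokens into fixed-size windows; a step smaller than size gives overlap."""
    passages = []
    for start in range(0, len(tokens), step):
        passages.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return passages


def passage_score(passage, query_terms):
    """Stand-in score: fraction of the passage's tokens that are query terms."""
    return sum(1 for token in passage if token in query_terms) / len(passage)


def document_score(tokens, query_terms, size=4, step=2):
    """Rank a document by the score of its highest-scoring passage."""
    passages = windows(tokens, size, step)
    return max((passage_score(p, query_terms) for p in passages), default=0.0)


query = {"arbour", "low"}
far_apart = ["arbour", "x", "x", "x", "x", "x", "low", "x"]
close_together = ["x", "x", "arbour", "low", "x", "x", "x", "x"]
print(document_score(far_apart, query), document_score(close_together, query))
# 0.25 0.5
```

Both documents contain each query term once, but the one where the terms fall inside the same window scores higher.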

Results?
  • Document ranking better with passages than without.
    • More disambiguation
  • Overlapping passage better than paragraph, bounded or otherwise.

Other location information?
  • On the web
    • Title
    • First few lines…
      • Alta Vista, query help information
        • www.altavista.com
          • http://doc.altavista.com/adv_search/ast_i_index.shtml

Authority
  • In classic IR
    • authority not so important
  • On the web
    • very important
      • Query “Harvard”
        • Dwane’s Harvard home page
        • The Harvard University home page

Simple methods
  • URL length
  • Domain name

Authority
  • Hubs and authorities
        • Brin, S., Page, L. (1998): The Anatomy of a Large-Scale Hypertextual Web Search Engine, in 7th International World Wide Web Conference
        • Gibson, D., Kleinberg, J., Raghavan, P. (1998): Inferring Web Communities from Link Topology, in Proceedings of The 9th ACM Conference on Hypertext and Hypermedia: links, objects, time and space—structure in hypermedia systems: 225-234
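The hub and authority idea can be sketched as an iteration usually attributed to Kleinberg (HITS): a good authority is pointed to by good hubs, and a good hub points to good authorities. This is a minimal sketch, not the method of either paper cited above; the function name `hits`, the toy link graph and the fixed iteration count are all assumptions made for illustration.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) scores for a link graph.

    links maps each page to the list of pages it links to.
    The graph and the iteration count are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy graph: pages a and b both point at c, and c points at d.
toy_links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub_scores, auth_scores = hits(toy_links)
```

In the toy graph, page c ends up as the strongest authority because two hubs point at it, while a and b end up as the strongest hubs.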

Popularity?
  • Query IMDB (www.imdb.org) for “Titanic”

Titanic (1915)

Titanic (1997)

Titanic (1943)

Titanic (1953)

Titanic 2000 (1999)

Titanic: Anatomy of a Disaster (1997)

Titanic: Answers from the Abyss (1999)

Titanic Chronicles, The (1999)

Titanic in a Tub: The Golden Age of Toy Boats (1981)

Titanic Too: It Missed the Iceberg (2000)

Titanic Town (1998)

Titanic vals (1964)...aka Titanic Waltz (1964) (USA)

Atlantic (1929)...aka Titanic: Disaster in the Atlantic (1999) (USA: video title)

Night to Remember, A (1958)...aka Titanic latitudine 41 Nord (1958) (Italy)

Gigantic (2000)...aka Untitled Titanic Spoof (1998) (USA: working title)

Raise the Titanic (1980)

Saved From the Titanic (1912)

Search for the Titanic (1981)

Femme de chambre du Titanic, La (1997)...aka Camarera del Titanic, La (1997) (Spain) ...aka Chambermaid on the Titanic, The (1998) (USA) ...aka Chambermaid, The (1998) (USA: promotional title)

Doomed Sisters of the Titanic (1999)

Use popularity
  • Query “titanic” on IMDB
    • Titanic (1997)
    • Titanic Too: It Missed the Iceberg (2000)
  • On the Web
    • www.directhit.com
    • Increasingly popular
  • Why might popularity work on the Web?
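One possible reading of "use popularity" is to combine the text-match score with a count of how often users picked each result. The sketch below assumes exactly that; the function name `rerank_by_popularity`, the log damping, the `weight` parameter and the click counts are all hypothetical, and nothing here describes how IMDB or Direct Hit actually rank.

```python
import math


def rerank_by_popularity(results, clicks, weight=1.0):
    """Re-rank (title, text_score) pairs using click counts.

    clicks maps a title to a (hypothetical) number of user selections.
    log1p damps very large counts so they do not swamp the text score.
    """
    def combined(item):
        title, text_score = item
        return text_score + weight * math.log1p(clicks.get(title, 0))

    return [title for title, _ in sorted(results, key=combined, reverse=True)]


# All three titles match the query "titanic" equally well on text alone.
results = [("Titanic (1915)", 1.0), ("Titanic (1997)", 1.0), ("Titanic (1943)", 1.0)]
# Made-up click counts, for illustration only.
clicks = {"Titanic (1997)": 5000, "Titanic (1943)": 40}
reranked = rerank_by_popularity(results, clicks)
```

With these made-up counts the 1997 film moves to the top, which is the behaviour the slide is hinting at.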

Spamming
  • Harder to spam a page to make it an authority?
  • Harder to spam a popularity system

Figured out why yet?
  • Why does popularity work well on the Web?

Advertising
  • www.goto.com
    • Pay for higher ranking
      • price shown in result list
    • No spamming?
      • More honest than spamming?
        • www.1stplaceranking.com

Relevance feedback
  • User types in query
    • “The destruction of the amazon rain forests”
    • Gets back documents
    • Tells system some of them are relevant
    • Modify query to user’s wishes
      • How?

Which terms do you pick?
  • From newspaper articles, user selects 4
    • the, fire, Brazil, hard, wood
    • the, Brazil, fire, greenhouse
    • the, greenhouse, warming
    • the, clearance, mahogany

IDF differences
  • Compute each term's IDF in the non-relevant documents
    • Approximate these by the main collection
  • Compute each term's IDF in the relevant documents
    • Rank terms by the difference between the two
    • Add the top n terms to the query
    • Really works
        • Harman, D. (1992): Relevance feedback revisited, in Proceedings of the 15th Annual International ACM SIGIR conference on Research and development in information retrieval: 1-10
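The term selection step can be sketched as follows: compute a smoothed IDF for each term in the whole collection and in the relevant set, then keep the terms whose IDF drops most in the relevant set. This is a minimal sketch under stated assumptions, not necessarily the weighting Harman evaluated; the smoothing formula, the function names and the six extra collection documents are invented for illustration, while the four relevant term lists come from the earlier slide.

```python
import math
from collections import Counter


def idf(doc_freq, n_docs):
    """Smoothed inverse document frequency (the smoothing is an assumption)."""
    return math.log((n_docs + 1) / (doc_freq + 1))


def expansion_terms(relevant_docs, collection_docs, n=3):
    """Rank terms by IDF in the collection minus IDF in the relevant set."""
    rel_df = Counter(t for doc in relevant_docs for t in set(doc))
    col_df = Counter(t for doc in collection_docs for t in set(doc))
    diff = {
        t: idf(col_df[t], len(collection_docs)) - idf(rel_df[t], len(relevant_docs))
        for t in rel_df
    }
    # Largest difference first; ties are broken alphabetically.
    return sorted(diff, key=lambda t: (-diff[t], t))[:n]


# The four articles the user marked as relevant (terms from the slide).
relevant = [
    ["the", "fire", "brazil", "hard", "wood"],
    ["the", "brazil", "fire", "greenhouse"],
    ["the", "greenhouse", "warming"],
    ["the", "clearance", "mahogany"],
]
# Hypothetical rest of the collection, standing in for the non-relevant documents.
collection = relevant + [
    ["the", "hard", "rock", "concert"],
    ["the", "wood", "furniture", "sale"],
    ["the", "fire", "station", "opens"],
    ["the", "hard", "election", "result"],
    ["the", "football", "match"],
    ["the", "stock", "market"],
]
top_terms = expansion_terms(relevant, collection, n=5)
```

With this toy collection, "the" gets a difference of zero because it is equally common everywhere, and terms such as "hard" that also occur in non-relevant documents are pushed down the ranking.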

Expansion before retrieval?
  • Query expansion good, use it other times?
    • Local analysis
      • (Pseudo|Local) relevance feedback
      • Local Context Analysis (LCA)
    • See also global analysis
        • Qiu, Y., Frei, H.P. (1993): Concept based query expansion, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval, ACM Press: 160-170

Pseudo-relevance feedback
  • Assume top ranked documents relevant
  • Automatically mark as relevant
    • Maybe others as non-relevant
  • Expand query
  • Do another retrieval
  • Use top ranked passages (LCA)
        • Xu, J., Croft, W.B. (1996): Query Expansion Using Local and Global Document Analysis, in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval: 4-11
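The steps above can be sketched as a loop: retrieve, assume the top k documents are relevant, pick expansion terms from them, and retrieve again. This is a minimal sketch using whole documents and a plain tf·idf score, not the passage-based LCA of Xu and Croft; the function names, the scoring formula, the parameters `top_k` and `n_terms`, and the six toy documents are all assumptions.

```python
import math
from collections import Counter


def doc_freqs(docs):
    """Number of documents each term occurs in."""
    return Counter(t for doc in docs for t in set(doc))


def search(query, docs):
    """Return indices of matching documents, best tf*idf score first."""
    df, n = doc_freqs(docs), len(docs)
    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = sum(tf[t] * math.log(n / df[t]) for t in set(query) if t in tf)
        scored.append((score, i))
    return [i for score, i in sorted(scored, key=lambda p: -p[0]) if score > 0]


def pseudo_relevance_feedback(query, docs, top_k=2, n_terms=3):
    """Expand the query with terms from the top_k documents, then search again."""
    df, n = doc_freqs(docs), len(docs)
    first = search(query, docs)
    weights = Counter()
    for i in first[:top_k]:  # assumed relevant, with no user judgement
        for t in docs[i]:
            if t not in query:
                weights[t] += math.log(n / df[t])
    ranked_terms = sorted(weights, key=lambda t: (-weights[t], t))
    expanded = list(query) + ranked_terms[:n_terms]
    return first, expanded, search(expanded, docs)


# Hypothetical toy collection; document 2 is on topic but never mentions "seti".
docs = [
    ["seti", "radio", "telescope", "signal"],
    ["seti", "telescope", "signal", "sagan"],
    ["telescope", "signal", "astronomer"],
    ["football", "match", "result"],
    ["radio", "station", "music"],
    ["stock", "market", "result"],
]
first, expanded, second = pseudo_relevance_feedback(["seti"], docs)
```

Here the first pass finds only documents 0 and 1; the expanded query also retrieves document 2, which shares no term with the original query. If the top ranked documents were not in fact relevant, the same mechanism would pull the query off topic.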

Example
  • “Reporting on possibility of and search for extra-terrestrial life/intelligence.”
  • extraterrestrials, planetary society, universe, civilization, planet, radio signal, seti, sagan, search, earth, extraterrestrial intelligence, alien, astronomer, star, radio receiver, nasa, earthlings, e.t., galaxy, life, intelligence, meta receiver, radio search, discovery, northern hemisphere, national aeronautics, jet propulsion laboratory, soup, space, radio frequency, radio wave, klein, receiver, comet, steven spielberg, telescope, scientist, signal, mars, moises bermudez, extra terrestrial, harvard university, water hole, space administration, message, creature, astronomer carl sagan, intelligent life, meta ii, radioastronomy, meta project, cosmos, argentina, trillions, raul colomb, ufos, meta, evidence, ames research center, california institute, history, hydrogen atom, columbus discovery, hypothesis, third kind, institute, mop, chance, film, signs

Does it work?
  • Works (mostly)
    • Synonymy being dealt with
    • Remember assumption
      • Query drift?
  • Equivalent to LSI?
    • Quicker

What have I missed?
  • NLP?
    • Phrases?
  • Filtering
    • TREC
  • User “stuff”
    • trec.nist.gov, issue of IP&M
  • Implementation
    • Web crawling strategies, index files

The Future
  • Good place to look, TREC tracks
    • Cross language
    • Speech retrieval
    • Question answering
  • Image retrieval

Future on the Web
  • Specialisation of search engines
    • citeseer.nj.nec.com/cs/
  • Index all the Web?
        • Lawrence, S., Giles, C.L. (1999): Accessibility of information on the web, in Nature, 400: 107-109
    • Google, 1 billion pages

Distributed searching
    • Tens of search engines
      • Meta Searching
        • Lawrence, S., Giles, C.L. (1998): Context and Page Analysis for Improved Web Search, in IEEE Internet Computing, 2(4): 38-46
    • Millions of search engines?
      • Gnutella
        • Organised like terrorist cells.
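For the meta searching case above, one simple way to combine the ranked lists returned by several engines is to sum reciprocal ranks. This is a hedged sketch of generic rank fusion, not the page-analysis approach of the Lawrence and Giles paper; the function name `merge_rankings` and the example result lists are hypothetical.

```python
from collections import defaultdict


def merge_rankings(rankings):
    """Merge ranked result lists by summing 1/rank for each result."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / rank
    # Highest combined score first; ties are broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical result lists from three engines for the same query.
engine_results = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c"],
]
merged = merge_rankings(engine_results)
```

A result that several engines rank highly rises to the top, while one returned low down by a single engine sinks.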

Conferences
  • ACM
    • SIGIR, CIKM, DL
  • TREC
  • BCS
    • IRSG
  • EuroDL
  • Less related
    • ACM CHI, ACL, AAAI

Web sites
  • Organisations
    • sigir.org
      • SIGIR Forum, IRList
    • irsg.eu.org
      • Good list of web sites
  • Groups
    • ir.shef.ac.uk?
    • ir.dcs.gla.ac.uk, ciir.cs.umass.edu

Journals
  • Information Processing and Management
  • Journal of the American Society for Information Science
  • ACM Transactions on Information Systems
  • Information Retrieval
  • Journal of Documentation

Good books
  • Van Rijsbergen
    • “Information Retrieval”, ir.dcs.gla.ac.uk
  • Sparck Jones & Willett
    • “Readings in Information Retrieval”
  • Baeza-Yates & Ribeiro-Neto
    • “Modern Information Retrieval”
  • Witten, Moffat & Bell
    • “Managing Gigabytes”
