
Introduction to Information Retrieval and Web-based Searching Methods

Mark Sanderson, University of Sheffield

m.sanderson@shef.ac.uk, dis.shef.ac.uk/mark/


Contents

  • Introduction

  • Ranked retrieval

  • Models

  • Evaluation

  • Advanced ranking

  • Future

  • Sources


Aims

  • To introduce you to basic notions in the field of Information Retrieval with a focus on Web-based retrieval issues.

    • To squeeze it all into 4 hours, including coffee breaks

    • If it’s not covered in here, hopefully there will at least be a reference


Objectives

  • At the end of this you will be able to…

    • Demonstrate the workings of document ranking

      • Remove suffixes from words.

    • Explain how recall and precision are calculated.

    • Exploit Web specific information when searching.

    • Outline the means of automatically expanding users’ queries.

    • List IR publications.


Introduction

  • What is IR?

    • General definition

      • Retrieval of unstructured data

    • Most often it is

      • Retrieval of text documents

        • Searching newspaper articles

        • Searching on the Web

    • Other types

      • Image retrieval


Typical interaction

  • User has information need.

    • Expresses it as a query

      • in their natural language?

  • IR system finds documents relevant to the query.


Text

  • No computer understanding of document or query text

  • Use “bag of words” approach

    • Pay no heed to inter-word relations:

      • syntax, semantics

    • Bag does characterise document

    • Not perfect, words are

      • ambiguous

      • used in different forms or synonymously
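
A minimal Python sketch of the bag-of-words idea (illustration only, not code from the lecture):

    from collections import Counter

    def bag_of_words(text):
        # Order and syntax are discarded; only term counts remain.
        return Counter(text.lower().split())

    # "dog bites man" and "man bites dog" reduce to the same bag:
    print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True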


To recap

[Diagram: the user's information need is expressed as a query; the query and the stored documents are each put through the same processing, and the IR system (the retrieval part) matches the processed query against the document store to return relevant(?) documents.]

Processing

  • “The destruction of the amazon rain forests”

  • Case normalisation

  • Stop word removal.

    • From fixed list

    • “destruction amazon rain forests”

  • Suffix removal, also known as stemming.

    • “destruct amazon rain forest”

  • Documents processed as well
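
A sketch of the pipeline in Python; the stop list here is a toy stand-in, and plural_stem previews the stemmer described on the following slides:

    STOP_WORDS = {"the", "of", "a", "an", "in", "to"}   # toy stop list

    def plural_stem(word):
        # Stand-in for the suffix removal described on the next slides.
        if word.endswith("s") and not word.endswith(("us", "ss")):
            return word[:-1]
        return word

    def process(text):
        terms = text.lower().split()                        # case normalisation
        terms = [t for t in terms if t not in STOP_WORDS]   # stop word removal
        return [plural_stem(t) for t in terms]              # suffix removal (stemming)

    print(process("The destruction of the amazon rain forests"))
    # ['destruction', 'amazon', 'rain', 'forest']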


Different forms - stemming

  • Matching the query term “forests”

    • to “forest” and “forested”

  • Stemmers remove affixes

    • removal of suffixes - worker

    • prefixes? - megavolt

    • infixes? - un-bloody-likely

  • Stick with suffixes


Plural stemmer

  • Plurals in English

    • If word ends in “ies” but not “eies”, “aies”

      • “ies” -> “y”

    • if word ends in “es” but not “aes”, “ees”, “oes”

      • “es” -> “e”

    • if word ends in “s” but not “us” or “ss”

      • “s” -> “”

    • First applicable rule is the one used (see the sketch below)
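
The three rules rendered directly in Python (a sketch; words no rule covers are left unchanged):

    def plural_stem(word):
        if word.endswith("ies") and not word.endswith(("eies", "aies")):
            return word[:-3] + "y"    # "ies" -> "y"
        if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
            return word[:-2] + "e"    # "es" -> "e"
        if word.endswith("s") and not word.endswith(("us", "ss")):
            return word[:-1]          # "s" -> ""
        return word                   # no rule applies

    # Try it on the examples from the slide after next:
    for w in ["forests", "statistics", "queries", "foes", "does", "is", "plus", "plusses"]:
        print(w, "->", plural_stem(w))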


Plural stemmer reference

  • Good review of stemming

    • Frakes, W. (1992): Stemming algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms: 131-160


Plural stemmer

  • Examples

    • Forests - ?

    • Statistics - ?

    • Queries - ?

    • Foes - ?

    • Does - ?

    • Is - ?

    • Plus - ?

    • Plusses - ?


Take more off?

  • What about

    • “ed”, “ing”, “ational”, “ation”, “able”, “ism”, etc, etc.

      • Porter, M.F. (1980): An algorithm for suffix stripping, in Program - automated library and information systems, 14(3): 130-137

      • Three pages of rules

      • What about

        • “bring”, “table”, “prism”, “bed”, “thing”?

      • When to strip, when to stop


CVCs

  • Porter used pattern of letters

    • [C](VC)^m[V], where C is a run of consonants and V a run of vowels

    • Tree - m=?

    • Trouble - m=?

    • Troubles - m=?

    • m = 0 or sometimes 1

      • stop

    • Syllables?

      • Pinker, S. (1994): The Language Instinct
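
A sketch of computing m; it treats "y" as a consonant throughout, where Porter's actual rules treat "y" context-sensitively:

    def measure(word):
        # m in [C](VC)^m[V] = the number of vowel-run to consonant-run transitions.
        vowels = set("aeiou")
        pattern = []
        for ch in word.lower():
            kind = "V" if ch in vowels else "C"
            if not pattern or pattern[-1] != kind:
                pattern.append(kind)         # collapse runs: "troubles" -> C,V,C,V,C
        return "".join(pattern).count("VC")  # each VC pair adds one to m

    for w in ["tree", "trouble", "troubles"]:
        print(w, "m =", measure(w))          # tree: 0, trouble: 1, troubles: 2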


Problems

  • Porter doesn’t always return words

    • “query”, “queries”, “querying”, etc

      • -> “queri”

        • Krovetz, R. (1993): Viewing morphology as an inference process, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 191-202

        • Xu, J., Croft, W.B. (1998): Corpus-Based Stemming using Co-occurrence of Word Variants, in ACM Transactions on Information Systems, 16(1): 61-81


Is it used?

  • Research says it is useful

    • Hull, D.A. (1996): Stemming algorithms: A case study for detailed evaluation, in Journal of the American Society for Information Science, 47(1): 70-84

  • Web search engines hardly use it

    • Why?

      • Unexpected results

        • computer, computation, computing, computational, etc.

      • User expectation?

      • Foreign languages?


    Ranked retrieval

    • Everything processed into a bag…

    • …calculate relevance score between query and every document

    • Sort documents by their score

    • Present top scoring documents to user.


    The scoring

    • For each document

      • Term frequency (tf)

        • t: Number of times term occurs in document

        • dl: Length of document (number of terms)

      • Inverse document frequency (idf)

        • n: Number of documents term occurs in

        • N: Number of documents in collection
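
    The combining formula was an image on the original slide; a common weight built from these quantities (an assumed form, not necessarily the slide's exact one) is:

        w_{t,d} = \frac{t}{dl} \cdot \log\frac{N}{n}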


    TF

    • More often a term is used in a document

      • More likely document is about that term

      • Depends on document length?

        • Harman, D. (1992): Ranking algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms: 363-392

          • Watch out for a mistake in it: tf counts term occurrences, not unique terms.

      • Problems with spamming


    Spamming the tf weight

    • Searching for Jennifer Anniston?

    SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK [… the same block repeated four times over]


    IDF

    • Some query terms better than others?

    • In general, fair to say that…

      • “amazon” > “forest” ≈ “destruction” > “rain”


    To illustrate

    [Diagram: the set of all documents, with the relevant documents as a small subset]


    To illustrate

    [Diagram: the few documents containing “amazon” within the set of all documents]


    To illustrate

    [Diagram: the far larger set of documents containing “rain”]


    IDF and collection context

    • IDF sensitive to the document collection content

      • General newspapers

        • “amazon” > “forest” ≈ “destruction” > “rain”

      • Amazon book store press releases

        • “forest” ≈ “destruction” > “rain” > “amazon”


    Very successful

    • Simple, but effective

    • Core of most weighting functions

      • tf (term frequency)

      • idf (inverse document frequency)

      • dl (document length)


    Robertson’s BM25

    • Q is a query containing terms T

    • w is a form of IDF

    • k1, b, k2, k3 are parameters.

    • tf is the document term frequency.

    • qtf is the query term frequency.

    • dl is the document length (arbitrary units).

    • avdl is the average document length.
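
    The equation itself was an image on the original slide; in these symbols, the BM25 function from the TREC-4 paper cited below is:

        \sum_{T \in Q} w \cdot \frac{(k_1 + 1)\,tf}{K + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf} \;+\; k_2 \cdot |Q| \cdot \frac{avdl - dl}{avdl + dl}

        \text{where } K = k_1 \left( (1 - b) + b \, \frac{dl}{avdl} \right)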


    Reference for BM25

    • Popular weighting scheme

      • Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M., Payne, A. (1995): Okapi at TREC-4, in NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4): 73-96


    Getting the balance

    • Documents with all the query terms?

    • Just those with high tf•idf terms?

      • What sorts of documents are these?

    • Search for a picture of Arbour Low

      • Stone circle near Sheffield

      • Try Google and AltaVista


    [Screenshots of top results: one very short page containing only “arbour”; another longer page with lots of “arbour” but no “low”]


    [Screenshot: results for the phrase query “arbour low”; Arbour Low documents do exist]


    [Screenshot: lots of Arbour Low documents retrieved]

    Disambiguation?

    Result

    • From Google

      • “The Stonehenge of the north”


    Caveat

    • Search engines don’t say much

      • Hard to know how they work


    Boolean searching?

    • Start with query

      • “amazon” & “rain forest*” & (“destroy” | “destruction”)

    • Break collection into two unordered sets

      • Documents that match the query

      • Documents that don’t

    • User has complete control but…

      • …not easy to use.


    Boolean

    • Two forms of query/retrieval system

      • Ranked retrieval

        • Long championed by academics

      • Boolean

        • Rooted in commercial systems from 1970s

          • Koenig, M.E. (1992): How close we came, in Information Processing and Management, 28(3): 433-436

    • Modern systems

      • Hybrid of both


    Don’t need Boolean?

    • Ranking found to be better than Boolean

    • But lack of specificity in ranking

      • destruction AND (amazon OR south american) AND rain forest

      • destruction, amazon, south american, rain forest

        • Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998): Real Life Information Retrieval: A Study Of User Queries On The Web, in SIGIR Forum: A Publication of the Special Interest Group on Information Retrieval, 32(1): 5-17


    Models

    • Mathematically modelling the retrieval process

      • So as to better understand it

      • Draw on work of others

    • Vector space

    • Probabilistic


    Vector Space

    • Document/query is a vector in N space

      • N = number of unique terms in collection

    • If term in doc/qry, set that element of its vector

    • Angle between vectors = similarity measure

      • Cosine of angle (cos(0) = 1)

    • Doesn’t model term dependence

    [Diagram: document vector D and query vector Q, separated by an angle]
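
    Writing w_{x,y} for the weight of term x in vector y (as the next slide does), the cosine measure is:

        \cos(D, Q) = \frac{\sum_i w_{i,D} \, w_{i,Q}}{\sqrt{\sum_i w_{i,D}^2} \, \sqrt{\sum_i w_{i,Q}^2}}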


    Model references

    • w_{x,y}: the weight of element x of vector y

  • Vector space

    • Salton, G. & Lesk, M.E. (1968): Computer evaluation of indexing and text processing. Journal of the ACM, 15(1): 8-36

    • Any of the Salton SMART books


    Modelling dependence

    • Latent Semantic Indexing (LSI)

      • Reduce dimensionality of N space

        • Bring related terms together.

          • Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E. (1988): Information retrieval using a singular value decomposition model of latent semantic structure, in Proceeding of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 465-480

          • Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing: 554-566


    Probabilistic

    • Assume independence
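
    The formula was an image on the original slide; the term relevance weight from the Robertson & Sparck Jones paper cited on the next slide (N documents of which n contain the term; R relevant documents of which r contain it) is commonly given as:

        w = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}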


    Model references

    • Probabilistic

      • Original papers

        • Robertson, S.E. & Sparck Jones, K. (1976): Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146.

        • Van Rijsbergen, C.J. (1979): Information Retrieval

          • Chapter 6

    • Survey

      • Crestani, F., Lalmas, M., van Rijsbergen, C.J., Campbell, I. (1998): “Is This Document Relevant? ...Probably”: A Survey of Probabilistic Models in Information Retrieval, in ACM Computing Surveys, 30(4): 528-552


    Recent developments

    • Probabilistic language models

      • Ponte, J., Croft, W.B. (1998): A Language Modelling Approach to Information Retrieval, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 275-281


    Evaluation

    • Measure how well an IR system is doing

      • Effectiveness

        • Number of relevant documents retrieved

      • Also

        • Speed

        • Storage requirements

        • Usability


    Effectiveness

    • Two main measures

    • Precision is easy

      • P at rank 10.

    • Recall is hard

      • Total number of relevant documents?
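
    For a ranking cut at some rank:

        \text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} \qquad \text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}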


    Test collections

    • Test collection

      • Set of documents (few thousand-few million)

      • Set of queries (50-400)

      • Set of relevance judgements

        • Humans can't check all the documents!

        • Use pooling

          • Take top 100 from every submission

          • Remove duplicates

          • Manually assess these only.


    Test collections

    • Small collections (~3Mb)

      • Cranfield, NPL, CACM - title (& abstract)

    • Medium (~4 Gb)

      • TREC - full text

    • Large (~100Gb)

      • VLC track of TREC

    • Compare with reality (~10Tb)

      • CIA, GCHQ, Large search services


    Where to get them

    • Cranfield, NPL, CACM

      • www.dcs.gla.ac.uk/idom/

    • TREC, VLC

      • trec.nist.gov


    How to get r/p figures

    [Worked example: a document ranking with the relevant documents marked, from which recall and precision are read off at each rank]


    Another ranking


    Graph these


    How to average queries?

    • Macro evaluation

      • Interpolate to compute p at key recall points

        • Four

          • 0.25, 0.5, 0.75, 1.0

        • Ten

          • 0.1, 0.2, 0.3, … …, 0.9, 1.0

        • Eleven (most popular)

          • 0, 0.1, 0.2, 0.3, … …, 0.9, 1.0

      • Use a pessimistic interpolation
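
    A sketch of one standard interpolation, where precision at recall level r is the best precision at any recall of r or beyond; whether this is exactly the "pessimistic" rule meant here is an assumption:

        def interpolated_precision(points, levels):
            # points: (recall, precision) pairs observed walking down one ranking.
            # At each standard recall level, take the maximum precision achieved
            # at that recall or beyond (0.0 if that recall is never reached).
            return [max((p for r, p in points if r >= level), default=0.0)
                    for level in levels]

        eleven = [i / 10 for i in range(11)]   # the popular eleven-point scale
        run = [(0.25, 1.0), (0.5, 0.4), (0.75, 0.3), (1.0, 0.2)]
        print(interpolated_precision(run, eleven))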


    Graph this


    Graph the other


    Can now average


    Homework

    • Random retrieval

      • For each query, randomly sort documents - relevant documents found (evenly) across ranking.

        • Measure precision at standard recall

      • Your assignment:

        • why does it look like this?


    Why?


    Papers on evaluation

    • Discusses a variety of measures

      • Hull, D. (1993) Using Statistical Testing in the Evaluation of Retrieval Experiments, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 329-338

  • Discusses the accuracy of measures

    • Buckley, C., Voorhees, E. (2000): Evaluating evaluation measure stability, in Proceedings of the 23rd annual international ACM SIGIR conference on Research and Development in Information Retrieval


    Other test collection papers

    • Validity of test collections

      • Voorhees, E. (1998): Variations in Relevance Judgements and the Measurement of Retrieval Effectiveness, in Proceedings of the 21st annual international ACM-SIGIR conference on Research and development in information retrieval: 315-323

  • Overviews of TREC

    • Any of the introductory Harman/Voorhees papers for any TREC.

      • trec.nist.gov


    Advanced ranking

    • Re-cap

      • Document ranking based on

        • Inverse document frequency - idf

        • Term frequency - tf

        • Document length - dl


    Anything else?

    • How to make ranking better?

      • Term location

      • Web link analysis

      • Popularity

      • Others?


    Term location

    • Prefer documents where terms are closer together?

      • Passage retrieval

        • Callan, J. (1994): Passage-Level Evidence in Document Retrieval, in Proceedings of the 17th annual international ACM-SIGIR conference on Research and development in information retrieval: 302-310

        • Hearst, M.A., Plaunt, C. (1993): Subtopic structuring for full-length document access, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 59-68


    [Figure: an example web page, a researcher's “Research interests/publications/lecturing/supervising” home page, divided into four passages labelled 1st to 4th]

    Callan

    • Split document into passages

    • Rank a document based on score of its highest ranking passage

    • What is a passage?

      • Paragraph?

        • Bounded paragraph

      • (overlapping) Fixed window?
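
    A sketch of best-passage ranking with overlapping fixed windows; the window size, step, and passage scorer are placeholders rather than Callan's exact settings:

        def passages(terms, size=200, step=100):
            # Overlapping fixed-size windows over the document's term sequence.
            return [terms[i:i + size] for i in range(0, max(len(terms), 1), step)]

        def document_score(doc_terms, query_terms, score_passage):
            # A document scores as well as its best passage does.
            return max(score_passage(p, query_terms) for p in passages(doc_terms))

        # e.g. score a passage by how many query terms it contains:
        # document_score(doc, {"arbour", "low"}, lambda p, q: len(q & set(p)))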


    Results?

    • Document ranking better with passages than without.

      • More disambiguation

    • Overlapping passage better than paragraph, bounded or otherwise.


    Other location information?

    • On the web

      • Title

      • First few lines…

        • Alta Vista, query help information

          • www.altavista.com

            • http://doc.altavista.com/adv_search/ast_i_index.shtml


    Authority

    • In classic IR

      • authority not so important

    • On the web

      • very important

        • Query “Harvard”

          • Dwane’s Harvard home page

          • The Harvard University home page


    Simple methods

    • URL length

    • Domain name


    [Figure: the same personal home page shown repeated many times over]

    Authority

    • Link analysis: PageRank, hubs and authorities

      • Brin, S., Page, L. (1998): The Anatomy of a Large-Scale Hypertextual Web Search Engine, in 7th International World Wide Web Conference

      • Gibson, D., Kleinberg, J., Raghavan, P. (1998): Inferring Web Communities from Link Topology, in Proceedings of The 9th ACM Conference on Hypertext and Hypermedia: links, objects, time and space—structure in hypermedia systems: 225-234


    Popularity?

    • Query IMDB (www.imdb.org) for “Titanic”

    Titanic (1915)

    Titanic (1997)

    Titanic (1943)

    Titanic (1953)

    Titanic 2000 (1999)

    Titanic: Anatomy of a Disaster (1997)

    Titanic: Answers from the Abyss (1999)

    Titanic Chronicles, The (1999)

    Titanic in a Tub: The Golden Age of Toy Boats (1981)

    Titanic Too: It Missed the Iceberg (2000)

    Titanic Town (1998)

    Titanic vals (1964)...aka Titanic Waltz (1964) (USA)

    Atlantic (1929)...aka Titanic: Disaster in the Atlantic (1999) (USA: video title)

    Night to Remember, A (1958)...aka Titanic latitudine 41 Nord (1958) (Italy)

    Gigantic (2000)...aka Untitled Titanic Spoof (1998) (USA: working title)

    Raise the Titanic (1980)

    Saved From the Titanic (1912)

    Search for the Titanic (1981)

    Femme de chambre du Titanic, La (1997)...aka Camarera del Titanic, La (1997) (Spain) ...aka Chambermaid on the Titanic, The (1998) (USA) ...aka Chambermaid, The (1998) (USA: promotional title)

    Doomed Sisters of the Titanic (1999)


    Use popularity

    • Query “titanic” on IMDB

      • Titanic (1997)

      • Titanic Too: It Missed the Iceberg (2000)

    • On the Web

      • www.directhit.com

      • Increasingly popular

    • Why might popularity work on the Web?


    Spamming

    • Harder to spam a page to make it an authority?

    • Harder to spam a popularity system


    Figure out why yet?

    • Why does popularity work well on the Web?


    Advertising

    • www.goto.com

      • Pay for higher ranking

        • price shown in result list

      • No spamming?

        • More honest than spamming?

          • www.1stplaceranking.com


    Relevance feedback

    • User types in query

      • “The destruction of the amazon rain forests”

      • Gets back documents

      • Tells system some of them are relevant

      • Modify query to user’s wishes

        • How?


    Which terms do you pick?

    • From a set of newspaper articles, the user marks 4 as relevant; their terms:

    the, fire, Brazil, hard, wood

    the, Brazil, fire, greenhouse

    the, greenhouse, warming

    the, clearance, mahogany


    IDF differences

    • Compute term weights in the non-relevant documents

      • Approximated by the whole collection

    • Compute term weights in the relevant documents

      • Rank terms on their difference

      • Add top n terms to query

      • Really works

        • Harman, D. (1992): Relevance feedback revisited, in Proceedings of the 15th Annual International ACM SIGIR conference on Research and development in information retrieval: 1-10
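
    A sketch of that selection rule; documents are modelled as sets of terms, and the whole collection stands in for the non-relevant set:

        import math

        def doc_freq(term, docs):
            # Number of documents (term sets) containing the term.
            return sum(term in d for d in docs)

        def expansion_terms(relevant, collection, n=10):
            # Score each term by how much rarer it is in the collection than in
            # the relevant documents; add the top n to the query.
            def idf(term, docs):
                return math.log(len(docs) / (doc_freq(term, docs) + 0.5))
            vocab = set().union(*relevant)
            ranked = sorted(vocab,
                            key=lambda t: idf(t, collection) - idf(t, relevant),
                            reverse=True)
            return ranked[:n]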


    Expansion before retrieval?

    • Query expansion good, use it other times?

      • Local analysis

        • (Pseudo|Local) relevance feedback

        • Local Content Analysis (LCA)

      • See also global analysis

        • Qiu, Y., Frei, H.P. (1993): Concept based query expansion, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval, ACM Press: 160-170


    Pseudo-relevance feedback

    • Assume top ranked documents relevant

    • Automatically mark as relevant

      • Maybe others as non-relevant

    • Expand query

    • Do another retrieval

    • Use top ranked passages (LCA)

      • Xu, J., Croft, W.B. (1996): Query Expansion Using Local and Global Document Analysis, in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval: 4-11
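
    The whole loop as a sketch; search() and pick_terms() stand for whatever retrieval engine and term-selection method (for instance, the idf-difference ranking above) are in use:

        def pseudo_relevance_feedback(query, search, pick_terms, k=10):
            # Assume the top k documents are relevant (the assumption behind
            # query drift), expand the query from them, then search again.
            initial = search(query)
            assumed_relevant = initial[:k]          # automatically marked relevant
            expanded = list(query) + pick_terms(assumed_relevant)
            return search(expanded)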


    Example

    • “Reporting on possibility of and search for extra-terrestrial life/intelligence.”

    • extraterrestrials, planetary society, universe, civilization, planet, radio signal, seti, sagan, search, earth, extraterrestrial intelligence, alien, astronomer, star, radio receiver, nasa, earthlings, e.t., galaxy, life, intelligence, meta receiver, radio search, discovery, northern hemisphere, national aeronautics, jet propulsion laboratory, soup, space, radio frequency, radio wave, klein, receiver, comet, steven spielberg, telescope, scientist, signal, mars, moises bermudez, extra terrestrial, harvard university, water hole, space administration, message, creature, astronomer carl sagan, intelligent life, meta ii, radioastronomy, meta project, cosmos, argentina, trillions, raul colomb, ufos, meta, evidence, ames research center, california institute, history, hydrogen atom, columbus discovery, hypothesis, third kind, institute, mop, chance, film, signs


    Does it work?

    • Works (mostly)

      • Synonymy being dealt with

      • Remember assumption

        • Query drift?

    • Equivalent to LSI?

      • Quicker


    What have I missed?

    • NLP?

      • Phrases?

    • Filtering

      • TREC

    • User “stuff”

      • trec.nist.gov, issue of IP&M

    • Implementation

      • Web crawling strategies, index files


    The Future

    • Good place to look, TREC tracks

      • Cross language

      • Speech retrieval

      • Question answering

    • Image retrieval


    Future on the Web

    • Specialisation of search engines

      • citeseer.nj.nec.com/cs/

    • Index all the Web?

      • Lawrence, S., Giles, C.L. (1999): Accessibility of information on the web, in Nature, 400: 107-109

  • Google, 1 billion pages


    • Distributed searching

      • Tens of search engines

        • Meta Searching

          • Lawrence, S., Giles, C.L. (1998): Context and Page Analysis for Improved Web Search, in IEEE Internet Computing, 2(4): 38-46

      • Millions of search engines?

        • Gnutella

          • Organised like terrorist cells.


    Conferences

    • ACM

      • SIGIR, CIKM, DL

    • TREC

    • BCS

      • IRSG

    • EuroDL

    • Less related

      • ACM CHI, ACL, AAAI


    Web sites

    • Organisations

      • sigir.org

        • SIGIR Forum, IRList

      • irsg.eu.org

        • Good list of web sites

    • Groups

      • ir.shef.ac.uk?

      • ir.dcs.gla.ac.uk, ciir.cs.umass.edu


    Journals

    • Information Processing and Management

    • Journal of the American Society for Information Science

    • ACM Transactions on Information Systems

    • Information Retrieval

    • Journal of Documentation


    Good books

    • Van Rijsbergen

      • “Information Retrieval”, ir.dcs.gla.ac.uk

    • Sparck Jones & Willett

      • “Readings in Information Retrieval”

    • Baeza-Yates & Ribeiro-Neto

      • “Modern Information Retrieval”

    • Witten, Moffat & Bell

      • “Managing Gigabytes”
