Introduction to Information Retrieval and Web-based Searching Methods

Mark Sanderson, University of Sheffield

dis.shef.ac.uk/mark/

©Mark Sanderson, Sheffield University


Contents

  • Introduction

  • Ranked retrieval

  • Models

  • Evaluation

  • Advanced ranking

  • Future

  • Sources


Aims

  • To introduce you to basic notions in the field of Information Retrieval with a focus on Web-based retrieval issues.

    • To squeeze it all into 4 hours, including coffee breaks

    • If it’s not covered in here, hopefully there will at least be a reference


Objectives

  • At the end of this you will be able to…

    • Demonstrate the workings of document ranking

      • Remove suffixes from words.

    • Explain how recall and precision are calculated.

    • Exploit Web specific information when searching.

    • Outline the means of automatically expanding users’ queries.

    • List IR publications.


Introduction

  • What is IR?

    • General definition

      • Retrieval of unstructured data

    • Most often it is

      • Retrieval of text documents

        • Searching newspaper articles

        • Searching on the Web

    • Other types

      • Image retrieval


Typical interaction

  • User has information need.

    • Expresses it as a query

      • in their natural language?

  • IR system finds documents relevant to the query.


Text

  • No computer understanding of document or query text

  • Use “bag of words” approach

    • Pay no heed to inter-word relations:

      • syntax, semantics

    • Bag does characterise document

    • Not perfect, words are

      • ambiguous

      • used in different forms or synonymously


To recap

[Diagram of an IR system. Labels: User, Query, Process, IR System, Documents, Process, Store, Retrieval part, Retrieved relevant(?) documents]


Processing

  • “The destruction of the amazon rain forests”

  • Case normalisation

  • Stop word removal.

    • From fixed list

    • “destruction amazon rain forests”

  • Suffix removal, also known as stemming.

    • “destruct amazon rain forest”

  • Documents processed as well
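
The processing steps above can be sketched in Python as follows. This is a minimal illustration rather than the system behind the slides: the stop list `STOP_WORDS` is a tiny hypothetical one, and `stem` defaults to the identity function, standing in for a real stemmer.

```python
# Hypothetical, deliberately tiny stop list; real systems use a longer fixed list.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to"}


def process(text, stem=lambda word: word):
    """Turn raw text into a list of index terms."""
    # Case normalisation
    words = text.lower().split()
    # Stop word removal, from the fixed list
    words = [w for w in words if w not in STOP_WORDS]
    # Suffix removal (stemming); identity unless a stemmer is supplied
    return [stem(w) for w in words]


print(process("The destruction of the amazon rain forests"))
```

The same function would be applied to documents as well as to queries, so that both end up in the same vocabulary.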


Different forms - stemming

  • Matching the query term “forests”

    • to “forest” and “forested”

  • Stemmers remove affixes

    • removal of suffixes - worker

    • prefixes? - megavolt

    • infixes? - un-bloody-likely

  • Stick with suffixes


Plural stemmer

  • Plurals in English

    • If word ends in “ies” but not “eies”, “aies”

      • “ies” -> “y”

    • if word ends in “es” but not “aes”, “ees”, “oes”

      • “es” -> “e”

    • if word ends in “s” but not “us” or “ss”

      • “s” -> “”

    • First applicable rule is the one used
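
The three rules can be written down directly. The sketch below assumes that a rule counts as applicable only when the word has the suffix and none of the listed exceptions, and that otherwise the next rule is tried; the function name `plural_stem` is hypothetical.

```python
def plural_stem(word):
    """Apply the first applicable plural rule; return the word unchanged if none applies."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies"
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: "s" -> "", unless the word ends in "us" or "ss"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w
```

Because the rules look only at word endings, the output is not always a dictionary word, which is part of what makes the examples on the later slide worth trying by hand.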


Plural stemmer reference

  • Good review of stemming

    • Frakes, W. (1992): Stemming algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms: 131-160


Plural stemmer

  • Examples

    • Forests - ?

    • Statistics - ?

    • Queries - ?

    • Foes - ?

    • Does - ?

    • Is - ?

    • Plus - ?

    • Plusses - ?


Take more off?

  • What about

    • “ed”, “ing”, “ational”, “ation”, “able”, “ism”, etc, etc.

      • Porter, M.F. (1980): An algorithm for suffix stripping, in Program - automated library and information systems, 14(3): 130-137

      • Three pages of rules

      • What about

        • “bring”, “table”, “prism”, “bed”, “thing”?

      • When to strip, when to stop


CVCs

  • Porter used pattern of letters

    • [C*](VC)^m[V*]

    • Tree - m=?

    • Trouble - m=?

    • Troubles - m=?

    • m = 0 or sometimes 1

      • stop

    • Syllables?

      • Pinker, S. (1994): The Language Instinct
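
The measure m can be computed by counting how often a run of vowels is followed by a run of consonants. The sketch below is a minimal illustration, assuming Porter's convention that "y" counts as a vowel when it follows a consonant; the function names are hypothetical.

```python
def is_consonant(word, i):
    """True if the letter at position i acts as a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the vowel-run to consonant-run transitions, i.e. m in [C*](VC)^m[V*]."""
    word = word.lower()
    m = 0
    previous_was_vowel = False
    for i in range(len(word)):
        consonant = is_consonant(word, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m
```

A stemmer can then refuse to strip a suffix when the remaining stem has too small a measure, which is the stopping condition the slide refers to.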


Problems

  • Porter doesn’t always return words

    • “query”, “queries”, “querying”, etc

      • -> “queri”

        • Krovetz, R. (1993): Viewing morphology as an inference process, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 191-202

        • Xu, J., Croft, W.B. (1998): Corpus-Based Stemming using Co-occurrence of Word Variants, in ACM Transactions on Information Systems, 16(1): 61-81


Is it used?

  • Research says it is useful

    • Hull, D.A. (1996): Stemming algorithms: A case study for detailed evaluation, in Journal of the American Society for Information Science, 47(1): 70-84

  • Web search engines hardly use it

    • Why?

      • Unexpected results

        • computer, computation, computing, computational, etc.

      • User expectation?

      • Foreign languages?


    Ranked retrieval

    • Everything processed into a bag…

    • …calculate relevance score between query and every document

    • Sort documents by their score

    • Present top scoring documents to user.


    The scoring

    • For each document

      • Term frequency (tf)

        • t: Number of times term occurs in document

        • dl: Length of document (number of terms)

      • Inverse document frequency (idf)

        • n: Number of documents term occurs in

        • N: Number of documents in collection
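
The slide names the ingredients but not how they are combined. One common combination, used here purely as an assumption, is tf = t / dl and idf = log(N / n), summed over the query terms. The sketch below scores every document that way and sorts by score; the function name `rank` and the input layout are hypothetical.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Return (doc_id, score) pairs sorted by descending score.

    docs maps a document id to its list of processed terms.
    """
    N = len(docs)  # number of documents in the collection
    # n: number of documents each term occurs in
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        dl = len(terms)  # document length in terms
        score = 0.0
        for term in query_terms:
            t = counts[term]  # occurrences of the term in this document
            if t == 0:
                continue
            n = doc_freq[term]
            score += (t / dl) * math.log(N / n)  # assumed tf * idf form
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Only the top of the returned list would normally be shown to the user.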


    TF

    • More often a term is used in a document

      • More likely document is about that term

      • Depends on document length?

        • Harman, D. (1992): Ranking algorithms, in Frakes, W. & Baeza-Yates, R. (eds.), Information Retrieval: Data Structures & Algorithms: 363-392

          • Watch out for mistake: not unique terms.

      • Problems with spamming


    Spamming the tf weight

    • Searching for Jennifer Anniston?

    SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK


    IDF

    • Some query terms better than others?

    • In general, fair to say that…

      • “amazon” > “forest”  “destruction” > “rain”


    To illustrate

    [Three diagrams of the set “All documents”, with one region labelled in each: “Relevant documents”, “amazon”, “rain”]


    IDF and collection context

    • IDF sensitive to the document collection content

      • General newspapers

        • “amazon” > “forest”  “destruction” > “rain”

      • Amazon book store press releases

        • “forest”  “destruction” > “rain” > “amazon”


    Very successful

    • Simple, but effective

    • Core of most weighting functions

      • tf (term frequency)

      • idf (inverse document frequency)

      • dl (document length)


    Robertson’s BM25

    • Q is a query containing terms T

    • w is a form of IDF

    • k1, b, k2, k3 are parameters.

    • tf is the document term frequency.

    • qtf is the query term frequency.

    • dl is the document length (arbitrary units).

    • avdl is the average document length.
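
The formula itself is not present in this transcript. The sketch below follows the form in which BM25 is commonly written, using the symbols defined above. Several things are assumptions: the IDF-like weight w is taken to be log((N - n + 0.5) / (n + 0.5)), the k2 correction term is left out (k2 = 0), and the default values for k1, b and k3 are only typical choices. The function name `bm25_score` is hypothetical.

```python
import math


def bm25_score(query_tf, doc_tf, dl, avdl, df, N, k1=1.2, b=0.75, k3=1000.0):
    """Score one document against a query (k2 term omitted).

    query_tf: term -> qtf, doc_tf: term -> tf, df: term -> n,
    N: number of documents in the collection.
    """
    # Length normalisation: longer-than-average documents get a larger K
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        n = df.get(term, 0)
        if tf == 0 or n == 0:
            continue
        w = math.log((N - n + 0.5) / (n + 0.5))  # assumed form of the IDF weight
        doc_part = (k1 + 1) * tf / (K + tf)
        query_part = (k3 + 1) * qtf / (k3 + qtf)
        score += w * doc_part * query_part
    return score
```

With these choices a short document and a long document containing a term equally often are not scored the same: the shorter one gets the higher score.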


    Reference for BM25

    • Popular weighting scheme

      • Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M., Payne, A. (1995): Okapi at TREC-4, in NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4): 73-96


    Getting the balance

    • Documents with all the query terms?

    • Just those with high tf•idf terms?

      • What sorts of documents are these?

    • Search for a picture of Arbour Low

      • Stone circle near Sheffield

      • Try Google and AltaVista


    [Search result screenshots, with the following annotations]

    • Very short, “arbour” only

    • Longer, lots of “arbour”, no “low”

    • “arbour low”

    • Arbour Low documents do exist

    • Lots of Arbour Low documents

    • Disambiguation?


    Result

    • From Google

      • “The Stonehenge of the north”


    Caveat

    • Search engines don’t say much

      • Hard to know how they work


    Boolean searching?

    • Start with query

      • “amazon” & “rain forest*” & (“destroy” | “destruction”)

    • Break collection into two unordered sets

      • Documents that match the query

      • Documents that don’t

    • User has complete control but…

      • …not easy to use.
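
Boolean matching maps naturally onto set operations over an inverted index. The sketch below is a simplified illustration: the four documents are made up, and the query is reduced to single terms, leaving out the phrase and wildcard in “rain forest*”.

```python
def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical toy collection
docs = {
    1: "destruction of the amazon rain forest",
    2: "amazon opens a new book store",
    3: "loggers destroy amazon habitat",
    4: "rain expected in sheffield",
}
index = build_index(docs)


def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


# "amazon" & ("destroy" | "destruction")
matching = postings("amazon") & (postings("destroy") | postings("destruction"))
non_matching = set(docs) - matching
print(sorted(matching), sorted(non_matching))
```

Both result sets are unordered: nothing in the query says which matching document should be shown first.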


    Boolean

    • Two forms of query/retrieval system

      • Ranked retrieval

        • Long championed by academics

      • Boolean

        • Rooted in commercial systems from 1970s

          • Koenig, M.E. (1992): How close we came, in Information Processing and Management, 28(3): 433-436

    • Modern systems

      • Hybrid of both


    Don’t need Boolean?

    • Ranking found to be better than Boolean

    • But lack of specificity in ranking

      • destruction AND (amazon OR south american) AND rain forest

      • destruction, amazon, south american, rain forest

        • Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998): Real Life Information Retrieval: A Study Of User Queries On The Web, in SIGIR Forum: A Publication of the Special Interest Group on Information Retrieval, 32(1): 5-17


    Models

    • Mathematically modelling the retrieval process

      • So as to better understand it

      • Draw on work of others

    • Vector space

    • Probabilistic


    Vector Space

    • Document/query is a vector in N space

      • N = number of unique terms in collection

    • If term in doc/qry, set that element of its vector

    • Angle between vectors = similarity measure

      • Cosine of angle (cos(0) = 1)

    • Doesn’t model term dependence
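
The cosine measure can be sketched as follows. Vectors are stored sparsely as dictionaries from term to weight, which is an implementation choice made here rather than something the slides prescribe; the function name `cosine` is hypothetical.

```python
import math


def cosine(d, q):
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(weight * q.get(term, 0.0) for term, weight in d.items())
    norm_d = math.sqrt(sum(weight * weight for weight in d.values()))
    norm_q = math.sqrt(sum(weight * weight for weight in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_d * norm_q)
```

Each term is its own dimension, so two documents that use different but related words share no dimension and score zero, which is the term-dependence limitation noted above.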

    [Figure: vectors labelled D and Q]


    Model references

    • w_{x,y} - weight of vector element

  • Vector space

    • Salton, G. & Lesk, M.E. (1968): Computer evaluation of indexing and text processing. Journal of the ACM, 15(1): 8-36

    • Any of the Salton SMART books


    Modelling dependence

    • Latent Semantic Indexing (LSI)

      • Reduce dimensionality of N space

        • Bring related terms together.

          • Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E. (1988): Information retrieval using a singular value decomposition model of latent semantic structure, in Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 465-480

          • Manning, C.D., Schütze, H. (1999): Foundations of Statistical Natural Language Processing: 554-566


    Probabilistic

    • Assume independence


    Model references

    • Probabilistic

      • Original papers

        • Robertson, S.E. & Sparck Jones, K. (1976): Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146.

        • Van Rijsbergen, C.J. (1979): Information Retrieval

          • Chapter 6

    • Survey

      • Crestani, F., Lalmas, M., van Rijsbergen, C.J., Campbell, I. (1998): “Is This Document Relevant? ...Probably”: A Survey of Probabilistic Models in Information Retrieval, in ACM Computing Surveys, 30(4): 528-552


    Recent developments

    • Probabilistic language models

      • Ponte, J., Croft, W.B. (1998): A Language Modelling Approach to Information Retrieval, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 275-281


    Evaluation

    • Measure how well an IR system is doing

      • Effectiveness

        • Number of relevant documents retrieved

      • Also

        • Speed

        • Storage requirements

        • Usability


    Effectiveness

    • Two main measures

    • Precision is easy

      • P at rank 10.

    • Recall is hard

      • Total number of relevant documents?
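
The two measures can be sketched using their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name `precision_recall` and the optional cutoff `k` are illustrative choices.

```python
def precision_recall(ranking, relevant, k=None):
    """Precision and recall of a ranking, optionally cut off at rank k."""
    retrieved = ranking if k is None else ranking[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Precision at rank 10 needs judgements for only ten documents, whereas recall needs `relevant` to hold every relevant document in the collection, which is why it is the harder of the two to measure.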


    Test collections

    • Test collection

      • Set of documents (few thousand-few million)

      • Set of queries (50-400)

      • Set of relevance judgements

        • Humans check all documents!

        • Use pooling

          • Take top 100 from every submission

          • Remove duplicates

          • Manually assess these only.
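
The pooling steps can be sketched in a few lines. The function name `build_pool` is hypothetical; each submission is assumed to be a ranked list of document ids for one query.

```python
def build_pool(submissions, depth=100):
    """Union of the top `depth` documents from every submitted ranking."""
    pool = set()  # a set removes duplicates across submissions
    for ranking in submissions:
        pool.update(ranking[:depth])
    return pool
```

Only the documents in the returned pool are passed to the human assessors. Documents outside the pool are never judged.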


    Test collections

    • Small collections (~3 MB)

      • Cranfield, NPL, CACM - title (& abstract)

    • Medium (~4 GB)

      • TREC - full text

    • Large (~100 GB)

      • VLC track of TREC

    • Compare with reality (~10 TB)

      • CIA, GCHQ, large search services


    Where to get them

    • Cranfield, NPL, CACM

      • www.dcs.gla.ac.uk/idom/

    • TREC, VLC

      • trec.nist.gov


    How to get r/p figures

    [Figure labels: Relevant documents, Document ranking]


    Another ranking


    Graph these


    How to average queries?

    • Macro evaluation

      • Interpolate to compute p at key recall points

        • Four

          • 0.25, 0.5, 0.75, 1.0

        • Ten

          • 0.1, 0.2, 0.3, … …, 0.9, 1.0

        • Eleven (most popular)

          • 0, 0.1, 0.2, 0.3, … …, 0.9, 1.0

      • Use a pessimistic interpolation
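
The averaging procedure can be sketched as below. The interpolation rule is an assumption: precision at a recall level is taken to be the highest precision observed at that recall or any higher recall, which is a widely used convention but may not be exactly the interpolation the slide has in mind. All function names are hypothetical.

```python
def recall_precision_points(ranking, relevant):
    """(recall, precision) measured at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_precision(points, levels):
    """Precision at each recall level: the best precision at that recall or beyond."""
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        result.append(max(candidates) if candidates else 0.0)
    return result


# The eleven standard recall points: 0, 0.1, ..., 1.0
ELEVEN_POINTS = [i / 10 for i in range(11)]


def macro_average(per_query):
    """Average the interpolated precision values across queries, level by level."""
    return [sum(values) / len(per_query) for values in zip(*per_query)]
```

Interpolating first gives every query a precision value at the same recall levels, so the values can be averaged even though queries have different numbers of relevant documents.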


    Graph this


    Graph the other


    Can now average


    Homework

    • Random retrieval

      • For each query, randomly sort documents - relevant documents found (evenly) across ranking.

        • Measure precision at standard recall

      • Your assignment,

        • why does it look like this?


    Why?


    Papers on evaluation

    • Discusses a variety of measures

      • Hull, D. (1993): Using Statistical Testing in the Evaluation of Retrieval Experiments, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 329-338

  • Discusses accuracy of measures

    • Buckley, C., Voorhees, E. (2000): Evaluating evaluation measure stability, in Proceedings of the 23rd annual international ACM SIGIR conference on Research and Development in Information Retrieval


    Other test collection papers

    • Validity of test collections

      • Voorhees, E. (1998): Variations in Relevance Judgements and the Measurement of Retrieval Effectiveness, in Proceedings of the 21st annual international ACM-SIGIR conference on Research and development in information retrieval: 315-323

  • Overviews of TREC

    • Any of the introductory Harman/Voorhees papers for any TREC.

      • trec.nist.gov


    Advanced ranking

    • Re-cap

      • Document ranking based on

        • Document frequency - idf

        • Term frequency - tf

        • Document length - dl


    Anything else?

    • How to make ranking better?

      • Term location

      • Web link analysis

      • Popularity

      • Others?


    Term location

    • Prefer documents where terms are closer together?

      • Passage retrieval

        • Callan, J. (1994): Passage-Level Evidence in Document Retrieval, in Proceedings of the 17th annual international ACM-SIGIR conference on Research and development in information retrieval: 302-310

        • Hearst, M.A., Plaunt, C. (1993): Subtopic structuring for full-length document access, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval: 59-68


    Callan

    [Figure: an example document divided into four passages, labelled 1st, 2nd, 3rd and 4th]

    • Split document into passages

    • Rank a document based on score of its highest ranking passage

    • What is a passage?

      • Paragraph?

        • Bounded paragraph

      • (overlapping) Fixed window?
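
The overlapping fixed-window variant can be sketched as follows. The passage scorer here, a plain count of query-term occurrences, is a hypothetical stand-in for a real weighting function, and the window size and step are arbitrary illustrative values.

```python
def overlapping_windows(tokens, size, step):
    """Split a token list into fixed-size windows that start every `step` tokens."""
    passages = []
    start = 0
    while True:
        passages.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window has reached the end of the document
        start += step
    return passages


def passage_score(passage, query_terms):
    """Hypothetical scorer: number of query-term occurrences in the passage."""
    return sum(1 for token in passage if token in query_terms)


def document_score(tokens, query_terms, size=4, step=2):
    """Score a document by its highest-scoring passage."""
    return max(passage_score(p, query_terms)
               for p in overlapping_windows(tokens, size, step))
```

Under this scheme a document in which the query terms occur close together outscores one in which the same terms are scattered, because only the former has them all in a single window.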


    Results?

    • Document ranking better with passages than without.

      • More disambiguation

    • Overlapping passage better than paragraph, bounded or otherwise.


    Other location information?

    • On the web

      • Title

      • First few lines…

        • Alta Vista, query help information

          • www.altavista.com

            • http://doc.altavista.com/adv_search/ast_i_index.shtml


    Authority

    • In classic IR

      • authority not so important

    • On the web

      • very important

        • Query “Harvard”

          • Dwane’s Harvard home page

          • The Harvard University home page


    Simple methods

    • URL length

    • Domain name
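
As a rough illustration of the URL-length idea, candidate pages can be ordered so that shorter URLs come first, on the assumption that an institution's home page tends to have a shorter address than a personal page about it. The URLs below are made up and the function name `by_url_length` is hypothetical.

```python
def by_url_length(urls):
    """Order candidate URLs so that shorter ones come first."""
    return sorted(urls, key=len)


# Hypothetical candidates for the same query
candidates = [
    "http://www.example.edu/~dwane/harvard.html",
    "http://www.example.edu/",
]
print(by_url_length(candidates))
```

In practice such a signal would be one input among several rather than a ranking on its own.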

    ©Mark Sanderson, Sheffield University


    Authority70 l.jpg

    Research interests/publications/lecturing/supervising Searching Methods

    My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below.

    I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising.

    Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences.

    My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system

    Research interests/publications/lecturing/supervising

    My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below.

    I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising.

    Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences.

    My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system

    Research interests/publications/lecturing/supervising

    My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below.

    I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising.

    Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences.

    My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system

    Research interests/publications/lecturing/supervising

    My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below.

    I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising.

    Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences.

    My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system

    Authority

    • Hubs and authorities

      • Brin, S., Page, L. (1998): The Anatomy of a Large-Scale Hypertextual Web Search Engine, in 7th International World Wide Web Conference

      • Gibson, D., Kleinberg, J., Raghavan, P. (1998): Inferring Web Communities from Link Topology, in Proceedings of The 9th ACM Conference on Hypertext and Hypermedia: links, objects, time and space—structure in hypermedia systems: 225-234
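The slide only names the hubs-and-authorities idea, so the following is a minimal sketch of the iterative update usually associated with Kleinberg's work, not the method of either cited paper. A page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function name `hits`, the fixed iteration count and the toy link graph are all assumptions made for illustration.

```python
def hits(links, iterations=20):
    """links maps each page to the list of pages it links to (toy format, assumed)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]

        # Hub: sum of the authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}

        # Normalise both score vectors so the values do not grow without bound.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth


# Hypothetical four-page graph: a and b both point at c, b also points at d.
toy_links = {"a": ["c"], "b": ["c", "d"], "c": [], "d": []}
hub, auth = hits(toy_links)
```

In this toy graph, page c collects links from both a and b and so ends up with the highest authority score, while b, which links to both c and d, ends up as the strongest hub.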


    Popularity?

    • Query IMDB (www.imdb.org) for “Titanic”

    Titanic (1915)

    Titanic (1997)

    Titanic (1943)

    Titanic (1953)

    Titanic 2000 (1999)

    Titanic: Anatomy of a Disaster (1997)

    Titanic: Answers from the Abyss (1999)

    Titanic Chronicles, The (1999)

    Titanic in a Tub: The Golden Age of Toy Boats (1981)

    Titanic Too: It Missed the Iceberg (2000)

    Titanic Town (1998)

    Titanic vals (1964)...aka Titanic Waltz (1964) (USA)

    Atlantic (1929)...aka Titanic: Disaster in the Atlantic (1999) (USA: video title)

    Night to Remember, A (1958)...aka Titanic latitudine 41 Nord (1958) (Italy)

    Gigantic (2000)...aka Untitled Titanic Spoof (1998) (USA: working title)

    Raise the Titanic (1980)

    Saved From the Titanic (1912)

    Search for the Titanic (1981)

    Femme de chambre du Titanic, La (1997)...aka Camarera del Titanic, La (1997) (Spain) ...aka Chambermaid on the Titanic, The (1998) (USA) ...aka Chambermaid, The (1998) (USA: promotional title)

    Doomed Sisters of the Titanic (1999)


    Use popularity

    • Query “titanic” on IMDB

      • Titanic (1997)

      • Titanic Too: It Missed the Iceberg (2000)

    • On the Web

      • www.directhit.com

      • Increasingly popular

    • Why might popularity work on the Web?


    Spamming

    • Harder to spam a page to make it an authority?

    • Harder to spam a popularity system


    Figured out why yet?

    • Why does popularity work well on the Web?


    Advertising

    • www.goto.com

      • Pay for higher ranking

        • price shown in result list

      • No spamming?

        • More honest than spamming?

          • www.1stplaceranking.com


    Relevance feedback

    • User types in query

      • “The destruction of the Amazon rain forests”

      • Gets back documents

      • Tells system some of them are relevant

      • Modify query to user’s wishes

        • How?


    Which terms do you pick?

    • From newspaper articles, user selects 4

    the, fire, Brazil, hard, wood

    the, Brazil, fire, greenhouse

    the, greenhouse, warming

    the, clearance, mahogany


    IDF differences

    • Compute IDF in non-relevant documents

      • Approximate with the main collection

    • Compute IDF in relevant documents

      • Rank terms on their difference

      • Add top n terms to query

      • Really works

        • Harman, D. (1992): Relevance feedback revisited, in Proceedings of the 15th Annual International ACM SIGIR conference on Research and development in information retrieval: 1-10
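The term-selection steps above can be sketched as follows. This is a rough illustration under stated assumptions, not Harman's exact formulation: the smoothed IDF formula, the function names and the six extra "collection" documents are all hypothetical, while the four relevant documents reuse the term sets from the "Which terms do you pick?" slide. A term scores highly when it is rare in the collection (high IDF there) but common among the relevant documents (low IDF there).

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency over a list of term sets (assumed formula)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 0.5))


def rank_by_idf_difference(relevant, collection):
    """Rank candidate terms by IDF in the collection minus IDF in the relevant set."""
    candidates = set().union(*relevant)
    scored = [(idf(t, collection) - idf(t, relevant), t) for t in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(term, score) for score, term in scored]


# The four documents the user marked as relevant (term sets from the slide).
relevant = [
    {"the", "fire", "Brazil", "hard", "wood"},
    {"the", "Brazil", "fire", "greenhouse"},
    {"the", "greenhouse", "warming"},
    {"the", "clearance", "mahogany"},
]

# Hypothetical remainder of the collection, standing in for the non-relevant documents.
collection = relevant + [
    {"the", "election"},
    {"the", "football"},
    {"the", "budget", "fire"},
    {"the", "market"},
    {"the", "weather", "warming"},
    {"the", "film"},
]

ranked = rank_by_idf_difference(relevant, collection)
scores = dict(ranked)
```

With this toy data, "the" falls to the bottom of the ranking because it is as common in the collection as in the relevant set, and "fire" scores below "Brazil" because it also occurs in a non-relevant document. The top n entries of `ranked` would then be added to the query.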


    Expansion before retrieval?

    • Query expansion good, use it other times?

      • Local analysis

        • (Pseudo|Local) relevance feedback

        • Local Context Analysis (LCA)

      • See also global analysis

        • Qiu, Y., Frei, H.P. (1993): Concept based query expansion, in Proceedings of the 16th annual international ACM SIGIR conference on Research and Development in Information Retrieval, ACM Press: 160-170


    Pseudo-relevance feedback

    • Assume top ranked documents relevant

    • Automatically mark as relevant

      • Maybe others as non-relevant

    • Expand query

    • Do another retrieval

    • Use top ranked passages (LCA)

      • Xu, J., Croft, W.B. (1996): Query Expansion Using Local and Global Document Analysis, in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval: 4-11
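The loop described above (retrieve, assume the top-ranked documents are relevant, expand, retrieve again) can be sketched as below. It is only an illustration: the toy ranker that counts matching query terms, the choice of expansion terms by how many top documents contain them, the parameter names `k` and `n_terms`, and the four tiny documents are all assumptions, and it works on whole documents rather than the passages LCA uses.

```python
from collections import Counter


def rank(query, docs):
    """Toy ranker (assumed): score a document by how many distinct query terms it contains."""
    scores = {
        doc_id: sum(1 for term in set(query) if term in tokens)
        for doc_id, tokens in docs.items()
    }
    matched = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(matched, key=lambda doc_id: (-scores[doc_id], doc_id))


def pseudo_relevance_feedback(query, docs, k=2, n_terms=1):
    """Expand the query with terms from the top k documents, then retrieve again."""
    first_pass = rank(query, docs)
    assumed_relevant = first_pass[:k]  # no user judgement: the top k are taken as relevant

    # Count, for each new term, how many of the assumed-relevant documents contain it.
    counts = Counter(
        term
        for doc_id in assumed_relevant
        for term in set(docs[doc_id])
        if term not in query
    )
    best = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n_terms]
    expanded = list(query) + [term for term, _ in best]
    return expanded, rank(expanded, docs)


# Hypothetical mini collection.
docs = {
    "d1": ["seti", "radio", "signal", "search"],
    "d2": ["seti", "radio", "telescope"],
    "d3": ["radio", "telescope", "astronomer"],
    "d4": ["football", "match"],
}

expanded_query, second_pass = pseudo_relevance_feedback(["seti"], docs)
```

Here the original query "seti" retrieves only d1 and d2; after "radio" is added, d3 is also retrieved even though it never mentions "seti". The same mechanism causes query drift if the top-ranked documents were not in fact relevant.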


    Example

    • “Reporting on possibility of and search for extra-terrestrial life/intelligence.”

    • extraterrestrials, planetary society, universe, civilization, planet, radio signal, seti, sagan, search, earth, extraterrestrial intelligence, alien, astronomer, star, radio receiver, nasa, earthlings, e.t., galaxy, life, intelligence, meta receiver, radio search, discovery, northern hemisphere, national aeronautics, jet propulsion laboratory, soup, space, radio frequency, radio wave, klein, receiver, comet, steven spielberg, telescope, scientist, signal, mars, moises bermudez, extra terrestrial, harvard university, water hole, space administration, message, creature, astronomer carl sagan, intelligent life, meta ii, radioastronomy, meta project, cosmos, argentina, trillions, raul colomb, ufos, meta, evidence, ames research center, california institute, history, hydrogen atom, columbus discovery, hypothesis, third kind, institute, mop, chance, film, signs


    Does it work?

    • Works (mostly)

      • Synonymy being dealt with

      • Remember assumption

        • Query drift?

    • Equivalent to LSI?

      • Quicker


    What have I missed?

    • NLP?

      • Phrases?

    • Filtering

      • TREC

    • User “stuff”

      • trec.nist.gov, issue of IP&M

    • Implementation

      • Web crawling strategies, index files


    The Future

    • Good place to look, TREC tracks

      • Cross language

      • Speech retrieval

      • Question answering

    • Image retrieval


    Future on the Web

    • Specialisation of search engines

      • citeseer.nj.nec.com/cs/

    • Index all the Web?

      • Lawrence, S., Giles, C.L. (1999): Accessibility of information on the web, in Nature, 400: 107-109

      • Google, 1 billion pages


    • Distributed searching

      • Tens of search engines

        • Meta Searching

          • Lawrence, S., Giles, C.L. (1998): Context and Page Analysis for Improved Web Search, in IEEE Internet Computing, 2(4): 38-46

      • Millions of search engines?

        • GNutella

          • Organised like terrorist cells.


    Conferences

    • ACM

      • SIGIR, CIKM, DL

    • TREC

    • BCS

      • IRSG

    • EuroDL

    • Less related

      • ACM CHI, ACL, AAAI


    Web sites

    • Organisations

      • sigir.org

        • SIGIR Forum, IRList

      • irsg.eu.org

        • Good list of web sites

    • Groups

      • ir.shef.ac.uk?

      • ir.dcs.gla.ac.uk, ciir.cs.umass.edu


    Journals

    • Information Processing and Management

    • Journal of the American Society for Information Science

    • ACM Transactions on Information Systems

    • Information Retrieval

    • Journal of Documentation


    Good books

    • Van Rijsbergen

      • “Information Retrieval”, ir.dcs.gla.ac.uk

    • Sparck Jones & Willett

      • “Readings in Information Retrieval”

    • Baeza-Yates & Ribeiro-Neto

      • “Modern Information Retrieval”

    • Witten, Moffat & Bell

      • “Managing Gigabytes”
