

CS798: Information Retrieval

  • Charlie Clarke

  • claclark@plg.uwaterloo.ca

    • Information retrieval is concerned with representing, searching, and manipulating large collections of human-language data.


Housekeeping

Web page: http://plg.uwaterloo.ca/~claclark/cs798

Area: “Applications/Databases”

Meeting times: Mondays, 2:00-5:00, MC2036


[Figure: diagram labeled NLP, DB, IR, ML]


Topics

  • Basic techniques

  • Searching, browsing, ranking, retrieval

  • Indexing algorithms and data structures

  • Evaluation

  • Application areas


1. Basic Techniques

  • Text representation & Tokenization

  • Inverted indices

  • Phrase searching example

  • Vector space model

  • Boolean retrieval

  • Simple proximity ranking

  • Test collections & Evaluation


2. Retrieval and Ranking

  • Probabilistic retrieval and Okapi BM25F

  • Language modeling

  • Divergence from randomness

  • Passage retrieval

  • Classification

  • Learning to rank

  • Implicit user feedback


3. Indexing

  • Algorithms and data structures

  • Index creation

  • Dynamic update

  • Index compression

  • Query processing

  • Query optimization


4. Evaluation

  • Statistical foundations of evaluation

  • Measuring Efficiency

  • Measuring Effectiveness

    • Recall/Precision

    • NDCG

    • Other measures

  • Building a test collection


5. Application Areas

  • Parallel retrieval architectures

  • Web search (Link analysis/PageRank)

  • XML retrieval

  • Filesystem search

  • Spam filtering


Other Topics (student projects)

  • Image/video/speech retrieval

  • Web spam

  • Cross- and multi-lingual IR

  • Clustering

  • Advertising/Recommendation

  • Distributed IR/Meta-search

  • Question answering

  • etc.


Resources

Textbook (partial draft on Website):

Büttcher, Clarke & Cormack. Information Retrieval: Data Structures, Algorithms and Evaluation.

(start reading ch. 1-3)

Wumpus:

www.wumpus-search.org


Grading

  • Short homework exercises from text (10%)

  • A literature review based on a topic area selected by the student with the agreement of the instructor (30%)

  • 30-minute presentation on your selected topic (20%)

  • Class project (40%) – details coming up.


“Documents”

  • Documents are the basic units of retrieval in an IR system.

  • In practice they might be: Web pages, email messages, LaTeX files, news articles, phone messages, etc.

  • Update: add, delete, append(?), modify(?)

  • Passages and XML elements are other possible units of retrieval.


Probability Ranking Principle

If an IR system’s response to a query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its users will be maximized.
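
As a minimal sketch of what the principle asks of a system, the snippet below sorts documents by decreasing estimated probability of relevance. The function name, the document identifiers, and the probability values are hypothetical; how the probabilities are estimated is the hard part and is not shown here.

```python
def rank_by_relevance_probability(probabilities):
    """Return document ids ordered by decreasing probability of relevance.

    `probabilities` maps a document id to an estimate of
    P(relevant | document, query). The estimates are assumed to be given.
    """
    return sorted(probabilities, key=lambda doc_id: probabilities[doc_id], reverse=True)


# Hypothetical estimates for three documents and one query.
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
ranking = rank_by_relevance_probability(estimates)
print(ranking)
```

Ties keep their original relative order because Python's sort is stable; a real system would need an explicit tie-breaking rule.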


Evaluating IR systems

  • Efficiency vs. effectiveness

  • Manual evaluation

    • Topic creation and judging

    • TREC (Text REtrieval Conference)

    • Google Has 10,000 Human Evaluators?

  • Evaluation through implicit user feedback

  • Specificity vs. exhaustivity
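
To make the idea of measuring effectiveness from manual judgments concrete, here is a small sketch of precision and recall over the top k results of one ranked run. The function name, document ids, and judgments are hypothetical, and the exact definitions used in the course may differ in detail.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Compute precision and recall over the top k documents of a ranking.

    ranked:   list of document ids, best first (one experimental run)
    relevant: set of document ids judged relevant for the topic
    k:        cutoff depth (assumed positive)
    """
    top_k = ranked[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run and judgments for a single topic.
run = ["d2", "d7", "d1", "d4"]
judged_relevant = {"d1", "d2", "d9"}
precision, recall = precision_recall_at_k(run, judged_relevant, k=4)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.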


<topic>
<title> shark attacks </title>
<desc>
Where do shark attacks occur in the world?
</desc>
<narr>
Are there beaches or other areas that are particularly prone to shark attacks? Documents comparing areas and providing statistics are relevant. Documents describing shark attacks at a single location are not relevant.
</narr>
</topic>


Class Project: Wikipedia Search

  • Can we outperform Google on the Wikipedia?

  • Basic project: Build a search engine for the Wikipedia (using any tools you can find).

  • Ideas: PageRank, spelling, structure, element retrieval, summarization, external information, user interfaces


Class Project: Evaluation

  • Each student will create and judge n topics.

  • The value of n depends on the number of students. (But workload stays the same.)

  • Quantitative measure of effectiveness.

  • Qualitative assessment of user interfaces.

  • Volunteer needed to operate the judging interface (for credit).


Class Project: Organization

  • You may work in groups (check with me).

  • You may work individually (check with me).

  • You may create and share tools with other students. You get the credit. (e.g. Volunteer needed to set up a class wiki.)

  • Programming can’t be avoided, but can be minimized. ☺

  • Programming can also be maximized.


Class Project: Grading

  • Topic creation and judging: 10%

  • Other project work: 30%

    • You are responsible for submitting one experimental run for evaluation.

    • Other activities are up to you.


One line?


Tokenization

  • For English text: Treat each string of alphanumeric characters as a token.

  • Number sequentially from the start of the text collection.

  • For non-English text: Depends on the language (possible student projects)

  • Other considerations: Stemming, stopwords, etc.
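
The English-text rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's or Wumpus's tokenizer: the function name is hypothetical, lowercasing is an added assumption, numbering is assumed to start at 1, and only ASCII letters and digits count as alphanumeric.

```python
import re

# Assumption: a token is a maximal run of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize_collection(documents, first_position=1):
    """Return (position, token) pairs for a whole text collection.

    Positions run sequentially from the start of the collection, so they
    do not reset at document boundaries.
    """
    pairs = []
    position = first_position
    for text in documents:
        for token in TOKEN_PATTERN.findall(text):
            pairs.append((position, token.lower()))  # lowercasing is an assumption
            position += 1
    return pairs


collection = ["Shark attacks!", "Where do shark attacks occur?"]
for position, token in tokenize_collection(collection):
    print(position, token)
```

Stemming and stopword removal are deliberately left out; they would be applied to the tokens after this step.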


Inverted Indices

  • Basic data structure

  • More next day…
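
Ahead of the fuller treatment, the basic data structure can be sketched as a mapping from each term to the sorted list of positions where it occurs. The names below are hypothetical, and the tokenization (ASCII alphanumeric runs, lowercased, numbered from 1 across the collection) is an assumption; a real index would also be compressed and stored on disk.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the list of collection-wide positions where it occurs."""
    index = defaultdict(list)
    position = 1  # assumption: positions start at 1
    for text in documents:
        for token in re.findall(r"[A-Za-z0-9]+", text):
            index[token.lower()].append(position)
            position += 1
    return dict(index)


index = build_inverted_index(["shark attacks", "Shark attacks occur"])
print(index["shark"])          # positions of the term "shark"
print(index.get("whale", []))  # a term that never occurs has no postings
```

Because positions are appended in increasing order, each postings list is already sorted, which is what makes merging lists for phrase and Boolean queries efficient.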


Plan

  • Sept 17:

    • Inverted indices (from Chapter 3)

    • Index construction/Wumpus (Stefan)

  • Sept 24:

    • Vector space model, Boolean retrieval, proximity

    • Basic evaluation methods

  • October 1:

    • Probabilistic retrieval, language modeling

    • Start topic creation for class project

  • October 8: Web search

