Reference Collections: Task Characteristics


Reference Collections: Task Characteristics


TREC Collection

  • Text REtrieval Conference (TREC)

    • sponsored by NIST and DARPA (1992-present)

  • Comparing approaches for information retrieval from large text collections:

    • Uniform scoring procedures

    • Large corpus of news and technical texts

    • Texts tagged in SGML (includes some metadata and document structure)

    • Specified tasks


Example Task

  • <top>

  • <num> Number: 168

  • <title> Topic: Financing AMTRAK

  • <desc> Description:

    A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).

  • <narr> Narrative:

    A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuous government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

  • </top>
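TREC topics like the one above use SGML-style tags that are not strictly nested XML, so runs are often processed with a simple tag scan. Below is a minimal sketch in Python; the `parse_topic` helper and the abbreviated topic text are illustrative, not part of any official TREC tooling.

```python
import re

# The topic above, abbreviated, as raw SGML-style text.
raw_topic = """<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
A document will address the role of the Federal Government in
financing the operation of the National Railroad Transportation
Corporation (AMTRAK).
<narr> Narrative:
A relevant document must provide information on the government's
responsibility to make AMTRAK an economically viable entity.
</top>"""

def parse_topic(text):
    """Extract the num/title/desc/narr fields from one <top> block."""
    fields = {}
    # Each field runs from its tag to the next tag or </top>.
    pattern = r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|</top>)"
    for tag, body in re.findall(pattern, text, flags=re.S):
        # Drop the leading label ("Number:", "Topic:", ...) and whitespace.
        fields[tag] = re.sub(r"^\s*\w+:\s*", "", body, count=1).strip()
    return fields

topic = parse_topic(raw_topic)
print(topic["num"])    # prints: 168
print(topic["title"])  # prints: Financing AMTRAK
```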


Deciding What is Relevant

  • Pooling method

    • Set (pool) of potentially relevant documents is obtained by combining top N results from various retrieval systems.

    • Humans then examine these to determine which are truly relevant

    • Assumes relevant documents will be in the pool and that documents not in the pool are not relevant.

    • Assumptions have been verified (at least for evaluation purposes)
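The pooling step itself can be sketched in a few lines: take the union of the top-N documents from each system's ranked run, and judge only that pool. The system names and document IDs below are hypothetical.

```python
def build_pool(runs, n=3):
    """Union of the top-n documents from each system's ranked run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:n])
    return pool

# Hypothetical ranked runs from three retrieval systems.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
    "system_c": ["d7", "d1", "d2", "d8"],
}

pool = build_pool(runs, n=3)
print(sorted(pool))  # prints: ['d1', 'd2', 'd3', 'd5', 'd7']
```

Documents outside the pool (here d4, d6, d8) are assumed non-relevant, which is exactly the assumption noted above.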


Types of TREC Tasks

  • Ad hoc tasks:

    • New queries against static collection

    • IR systems return ranked results

    • Systems get task and collection

  • Routing tasks:

    • Standing queries for changing collection

    • Basically a batch-mode filtering task

    • Example: identifying topic from AP newswire

    • Results must be ranked

    • Systems get task and two collections, one for training and one for evaluation


Secondary Tasks at TREC

  • Chinese

    • Documents and queries in Chinese

  • Filtering

    • Determine whether each new document is relevant (no rank order)

  • Interactive

    • Human searcher interacts with the system to determine relevance (no rank order)

  • NLP

    • Examining value of NLP in IR


Secondary Tasks at TREC (continued)

  • Cross-Language

    • Documents in one language, queries in another

  • High Precision

    • Retrieve 10 documents that answer a given information request within 5 minutes

  • Spoken Document Retrieval

    • Documents are transcripts of radio broadcasts

  • Very Large Corpus

    • > 20 GB collection


Evaluation Measures

  • Summary Table Statistics

    • # of requests in task, # of documents retrieved, # of relevant docs retrieved, total # of relevant docs

  • Recall-Precision Averages

    • 11 standard recall levels

  • Document Level Averages

    • Avg. precision for specified # of retrieved docs (R)

  • Average Precision Histogram

    • Graph showing how algorithm did for each request compared to average of all algorithms
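The recall-precision average at the 11 standard recall levels (0.0, 0.1, ..., 1.0) uses standard interpolation: at each level, take the maximum precision observed at any recall greater than or equal to that level. A minimal sketch, with a hypothetical ranked run:

```python
def eleven_point_precision(ranked, relevant):
    """Interpolated precision at the 11 standard recall levels."""
    total_rel = len(relevant)
    # (recall, precision) after each retrieved document.
    points = []
    hits = 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / total_rel, hits / i))
    levels = [i / 10 for i in range(11)]
    # Standard interpolation: max precision at any recall >= the level.
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]

# Hypothetical run: 10 retrieved docs, 3 of which are relevant.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d6"}
print(eleven_point_precision(ranked, relevant))
```

Averaging these 11 values over all requests in a task gives the single recall-precision curve reported in the TREC summary tables.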


Reference Collections: Collection Characteristics


CACM Collection

  • 3204 Communications of the ACM articles

  • Focus of collection: computer science

  • Structured subfields:

    • Author names

    • Date information

    • Word stems from title and abstract

    • Categories from hierarchical classification

    • Direct references between articles

    • Bibliographic coupling connections

    • Number of co-citations for each pair of articles


CACM Collection

  • 3204 Communications of the ACM articles

  • Test information requests:

    • 52 information requests in natural language with two Boolean query expressions

    • Average of 11.4 terms per query

    • Requests are rather specific with an average of about 15 relevant documents

    • Result in relatively low precision and recall


ISI Collection

  • 1460 documents from the Institute of Scientific Information

  • Focus of collection: information science

  • Structured subfields:

    • Author names

    • Word stems from title and abstract

    • Number of co-citations for each pair of articles


ISI Collection

  • 1460 documents from the Institute of Scientific Information

  • Test information requests:

    • 35 information requests in natural language with Boolean query expressions

    • Average of 8.1 terms per query

    • 41 information requests in NL without Boolean query expression

    • Requests are fairly general with an average of about 50 relevant documents

    • Higher precision and recall


Observation

Number of terms increases slowly with number of documents
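This observation is commonly formalized as Heaps' law: vocabulary size V grows as V = k * n^beta with beta < 1 (often around 0.4-0.6 for English text). A minimal sketch; the constants k = 40 and beta = 0.5 are illustrative, not measured from these collections.

```python
def heaps_vocab(n_words, k=40, beta=0.5):
    """Predicted number of distinct terms after n_words running words
    under Heaps' law (illustrative constants)."""
    return k * n_words ** beta

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} words -> ~{heaps_vocab(n):,.0f} distinct terms")
```

With these constants, a collection 100 times larger yields only about 10 times as many distinct terms, which is the sublinear growth the slide describes.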


Cystic Fibrosis Collection

  • 1239 MEDLINE articles indexed under “Cystic Fibrosis”

  • Structured subfields:

    • MEDLINE accession number

    • Author

    • Title

    • Source

    • Major subjects

    • Minor subjects

    • Abstract (or extract)

    • References in the document

    • Citations to the document


Cystic Fibrosis Collection

  • 1239 MEDLINE articles indexed under “Cystic Fibrosis”

  • Test information requests:

    • 100 information requests

    • Relevance assessed by four experts with a scale of 0 (not relevant), 1 (marginal relevance), and 2 (high relevance)

    • Overall relevance is the sum of the four assessments (0-8)
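The graded scheme is simple enough to sketch directly: four experts each assign 0 (not relevant), 1 (marginal relevance), or 2 (high relevance), and the overall score is their sum. The helper below is illustrative.

```python
def overall_relevance(judgments):
    """Sum the four experts' 0/1/2 judgments into a 0-8 overall score."""
    assert len(judgments) == 4, "the collection uses exactly four assessors"
    assert all(j in (0, 1, 2) for j in judgments), "scale is 0, 1, or 2"
    return sum(judgments)

print(overall_relevance([2, 1, 2, 0]))  # prints: 5
print(overall_relevance([2, 2, 2, 2]))  # prints: 8 (the maximum)
```

One design consequence: unlike TREC's binary judgments, this 0-8 scale lets an evaluation distinguish marginally relevant documents from highly relevant ones.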


Discussion Questions

  • In developing a search engine:

    • How would you use metadata (e.g. author, title, abstract)?

    • How would you use document structure?

    • How would you use references, citations, co-citations?

    • How would you use hyperlinks?
