ARISTOTLE UNIVERSITY OF THESSALONIKI

Baseline Document Retrieval Component. N. Bassiou, C. Kotropoulos, I. Pitas. 20/07/2000.

Presentation Transcript


  1. ARISTOTLE UNIVERSITY OF THESSALONIKI. Baseline Document Retrieval Component. N. Bassiou, C. Kotropoulos, I. Pitas. HYPERGEO 1st technical verification, 20/07/2000

  2. Introduction
     • Information Retrieval: development of algorithms and models for retrieving information from document repositories (speech, image, video)
     • Ad-hoc retrieval problem: the user submits a query describing the desired information
     • The system returns a list of documents: either exact matches or a ranking by estimated relevance to the query
     • Relevance Feedback – Text Categorization

  3. Common design features of IR Systems
     • Techniques introduced by Robertson and Sparck Jones
       • use of simple terms for indexing both request and document texts
       • term weighting exploiting statistical information about term occurrences
       • scoring of request-document matches using these weights, or term sets in iterative searching

  4. Common design features of IR Systems (cont.)
     • Techniques introduced by Robertson and Sparck Jones (cont.)
       • Normal implementation: an inverted file organization, using a term list with linked document identifiers plus counting data, and pointers to the actual text
     • Basic Features:
       • Terms and matching:
         • stemmed content words → terms used for indexing
         • stop words are excluded
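The inverted file organization described on this slide can be sketched roughly as follows. This is a minimal illustration, not the HyperGeo implementation: the stop list, the suffix-stripping function `simple_stem`, and the sample documents are all hypothetical, and document identifiers stand in for the "pointers to the actual text".

```python
from collections import defaultdict

# Hypothetical stop list and suffix stripper, for illustration only;
# the slides do not give the actual stop list or stemming rules.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}


def simple_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, split on whitespace, drop stop words, stem what remains."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [simple_stem(w) for w in words if w and w not in STOP_WORDS]


def build_inverted_file(docs):
    """Map each term to {doc_id: occurrences of the term in that document}."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            inverted[term][doc_id] = inverted[term].get(doc_id, 0) + 1
    return dict(inverted)


# Hypothetical two-document corpus.
docs = {
    "d1": "The museum is open to visitors.",
    "d2": "Hotels and museums in the city.",
}
inverted = build_inverted_file(docs)
```

Here `inverted["museum"]` lists both documents, because "museums" and "museum" reduce to the same indexing term, while stop words such as "the" never enter the term list.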

  5. Basic Features (cont.):
     • Weights = selectivity
     • Weighting Measures:
       • a. Collection Frequency:
         • n : the number of documents term t(i) occurs in
         • N : the number of documents in the collection
       • b. Term Frequency: a term occurring more often in a document is more likely to be important for that document
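The formula for the Collection Frequency weight did not survive in this transcript. In the Robertson and Sparck Jones scheme the slides refer to, it is the logarithm of the collection size minus the logarithm of the number of documents containing the term; the sketch below assumes that form. The function and parameter names are illustrative, not taken from the slides.

```python
import math


def collection_frequency_weight(num_docs, doc_freq):
    """Assumed form: log(collection size) - log(documents containing the term).

    Terms that occur in few documents are more selective and get larger weights.
    """
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be between 1 and num_docs")
    return math.log(num_docs) - math.log(doc_freq)


def term_frequency(term, doc_terms):
    """Number of occurrences of a term in one document's list of index terms."""
    return doc_terms.count(term)
```

A term that appears in every document gets weight 0, so it contributes nothing to a match; a term found in 10 of 1000 documents weighs more than one found in 100 of them.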

  6. Basic Features (cont.):
     • Weighting Measures (cont.):
       • c. Document Length: used to put Term Frequency in context (a term with the same Term Frequency in a short document and in a long one is more valuable for the short one)
       • d. Combined Weight: combination of the weighting measures described above
         • k1 (= 2) : controls the extent of the influence of Term Frequency
         • b (= 0.75) : controls the extent of Document Length’s influence
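The Combined Weight formula itself is missing from the transcript. The sketch below assumes the form published by Robertson and Sparck Jones, which matches the k1 and b parameters named on the slide: CW = CFW * TF * (k1 + 1) / (k1 * ((1 - b) + b * NDL) + TF), where NDL is the document length divided by the average document length in the collection. Whether the HyperGeo component used exactly this expression is an assumption.

```python
def combined_weight(cfw, tf, doc_len, avg_doc_len, k1=2.0, b=0.75):
    """Combined weight of one term in one document (assumed formula, see above).

    cfw          collection frequency weight of the term
    tf           occurrences of the term in the document
    doc_len      length of the document (for example, number of index terms)
    avg_doc_len  average document length over the collection
    """
    ndl = doc_len / avg_doc_len  # normalized document length
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)
```

With b = 0 the document length is ignored entirely; with b = 1 the term frequency is fully scaled by the normalized length. For a document of average length, a single occurrence of a term yields exactly its collection frequency weight.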

  7. Implementation of the IR Component on the HyperGeo Corpus
     • Based on all the statistical measures described above
     • Basic Characteristics:
       • First Part: Training → calculation of all the necessary statistics for each document in the corpus and for each term appearing in these documents
         • Term-dependent measures (CFW(i))
         • Document-dependent measures (DL(j))
         • Term-document-dependent measures (TF(i,j), CW(i,j))
         • Storage of the statistics in files
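A rough sketch of such a training pass is given below. It assumes documents that are already stemmed and stop-word filtered, measures document length as the number of index terms, uses the Robertson and Sparck Jones formulas for CFW and CW (the slide formulas are not in the transcript), and writes everything to a single JSON file. The file format and all names are hypothetical; the slides only say that the statistics are stored in files.

```python
import json
import math
from collections import Counter


def train(tokenized_docs, k1=2.0, b=0.75):
    """Compute DL(j), TF(i,j), CFW(i) and CW(i,j) for {doc_id: [index terms]}."""
    if not tokenized_docs:
        raise ValueError("the corpus is empty")
    num_docs = len(tokenized_docs)
    # Document-dependent measure: length in index terms.
    dl = {doc_id: len(terms) for doc_id, terms in tokenized_docs.items()}
    avg_dl = sum(dl.values()) / num_docs
    # Term-document measure: raw term frequency.
    tf = {doc_id: Counter(terms) for doc_id, terms in tokenized_docs.items()}
    # Term-dependent measure: weight from the number of documents per term.
    doc_freq = Counter(term for counts in tf.values() for term in counts)
    cfw = {t: math.log(num_docs) - math.log(n) for t, n in doc_freq.items()}
    # Term-document measure: combined weight, stored per term for fast lookup.
    cw = {}
    for doc_id, counts in tf.items():
        ndl = dl[doc_id] / avg_dl
        for term, freq in counts.items():
            weight = cfw[term] * freq * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + freq)
            cw.setdefault(term, {})[doc_id] = weight
    return {"DL": dl, "TF": {d: dict(c) for d, c in tf.items()}, "CFW": cfw, "CW": cw}


def save_statistics(stats, path):
    """Write the statistics to a JSON file (hypothetical storage format)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stats, handle, indent=2, sort_keys=True)


# Hypothetical two-document corpus of stemmed terms.
stats = train({"d1": ["museum", "art"], "d2": ["museum", "hotel", "room", "hotel"]})
```

Keeping CW indexed by term first, then by document, mirrors the retrieval step, which looks up each query term and then walks the documents it occurs in.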

  8. Basic Characteristics (cont.):
     • Second Part: Document Retrieval
       • Query terms are given by the user
       • Stemming of the query terms (Simple and Porter Stemmer)
       • Lookup of each query term in the structure that holds the term-document combined weights
       • Document score calculation: sum of the combined weights of all the query terms in the specific document
       • Document Ranking: chosen by the user
         • a. according to the estimated score
         • b. according to i) the number of query terms that appear in the document and ii) the estimated score
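The scoring and the two ranking options can be sketched as follows. The nested dictionary `cw` (term, then document, then combined weight) is an assumed layout for "the structure that holds the term-document combined weights", the query terms are assumed to be stemmed already, and the weights in the example are made up.

```python
def score_documents(query_terms, cw):
    """Return {doc_id: (matched_terms, score)} for documents matching the query.

    The score is the sum of the combined weights of the distinct query terms
    that occur in the document.
    """
    results = {}
    for term in set(query_terms):  # a repeated query term is counted once
        for doc_id, weight in cw.get(term, {}).items():
            matched, score = results.get(doc_id, (0, 0.0))
            results[doc_id] = (matched + 1, score + weight)
    return results


def rank(results, by_matches_first=False):
    """Order document ids by score, or by matched-term count and then score."""
    if by_matches_first:
        key = lambda item: (-item[1][0], -item[1][1], item[0])
    else:
        key = lambda item: (-item[1][1], item[0])
    return [doc_id for doc_id, _ in sorted(results.items(), key=key)]


# Hypothetical combined weights.
cw = {
    "museum": {"d1": 0.5, "d2": 0.4, "d3": 3.0},
    "art": {"d1": 0.6, "d2": 0.3},
}
results = score_documents(["museum", "art"], cw)
by_score = rank(results)
by_matches = rank(results, by_matches_first=True)
```

In this example the two options disagree: ranking by score puts `d3` first because of its single heavy term, while ranking by matched terms first prefers `d1` and `d2`, which contain both query terms.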

  9. Output
     • Output files:
       • TermFrequency file
       • Combined Weight file
       • Idf file (number and names of the documents each term occurs in)
       • QueryResult file (contains the ranked documents returned by the query)

  10. Results
      • Frequencies of the first 20 terms of the corpus

        museum   2050    collect  766    home    582    book   534
        hotel    1348    citi     758    page    573    build  483
        room      781    town     653    hous    556    new    481
        open      779    art      650    servic  548    place  479
        centuri   775    reserv   591    year    548    work   477

      • Number of documents the first 20 terms occur in

        museum   298    includ  238    open    230    collect  217
        centuri  263    room    235    hous    229    hotel    215
        year     258    offer   234    place   229    new      214
        citi     251    inform  232    build   226    visit    206
        time     247    locat   230    servic  225    dai      203

  11. Recall – Precision Graph for the query “museum”
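The graph itself is not part of this transcript. As a reminder of how the points of such a graph are usually obtained, the sketch below walks down a ranked result list and records recall and precision each time a relevant document is found. The ranking and the relevance judgments in the example are hypothetical, not HyperGeo data.

```python
def recall_precision_points(ranked_doc_ids, relevant_doc_ids):
    """Return (recall, precision) pairs, one per relevant document retrieved."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []
    hits = 0
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # recall: share of relevant documents found so far
            # precision: share of documents retrieved so far that are relevant
            points.append((hits / len(relevant), hits / position))
    return points


# Hypothetical ranking and relevance judgments.
points = recall_precision_points(["d3", "d1", "d4", "d2"], {"d1", "d2"})
```

Plotting precision against recall for these pairs gives the curve; a ranking that places relevant documents early keeps precision high as recall grows.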

  12. Future Development
      • Iterative Searching
        • Relevance Weighting: modification of the request term weights
        • Query Expansion: modification of the request composition by adding more terms (reweighting of original terms)
      • Probabilistic Approaches
