Using TF-IDF to Determine Word Relevance in Document Queries

Juan Ramos (juramos@cs.rutgers.edu), Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855.

Presentation Transcript


  1. Using TF-IDF to Determine Word Relevance in Document Queries Juan Ramos juramos@cs.rutgers.edu Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855

  2. Information Retrieval Problem • Given a corpus $D$ and a query $q = w_1, w_2, \ldots, w_n$, return the documents $d$ that maximize $\Pr(d \mid q, D)$. • Easy to dismiss given the widespread use of query retrieval today (web searches, database management, etc.)

  3. Approaches to Ad Hoc Retrieval • Probability and Statistics • Naïve Bayes • Approaches include the user’s mindset. • Vector Models • Latent Semantic Indexing • Reduce n-dimensional vector space of documents • Return documents whose distance to query is small
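The vector-model idea on this slide, returning the documents whose distance to the query is small, can be sketched as follows. This is a minimal illustration, not the method from the talk: it assumes whitespace tokenization, raw term-count vectors, and cosine distance, and it skips the dimensionality reduction that Latent Semantic Indexing would apply first. The function names `cosine_distance` and `nearest_document` are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_document(corpus, query):
    """Return the index of the document closest to the query."""
    # Assumption: documents and query are split on whitespace, case-sensitive.
    query_vec = Counter(query.split())
    doc_vecs = [Counter(doc.split()) for doc in corpus]
    return min(range(len(corpus)),
               key=lambda i: cosine_distance(doc_vecs[i], query_vec))


corpus = ["apple banana apple", "car engine wheel"]
print(nearest_document(corpus, "banana apple"))
```

A document that shares no words with the query gets distance 1.0 here, so it can only be returned if nothing in the corpus matches at all.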

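  4. TF-IDF Weighting Scheme • Given corpus $D$, word $w$, and document $d$, calculate $w_d = f_{w,d} \cdot \log(|D| / f_{w,D})$, where $f_{w,d}$ is the number of times $w$ appears in $d$ and $f_{w,D}$ is the number of documents in $D$ that contain $w$ • Many varieties of the basic mathematical scheme • Procedure • Scan each $d$, compute each $w_{i,d}$, return the set $D'$ that maximizes $\sum_i w_{i,d}$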
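The procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: tokens are split on whitespace and compared case-sensitively, the logarithm is natural, and the returned set $D'$ is taken to be the top `k` documents. The names `tf_idf_scores` and `top_documents` and the parameter `k` are hypothetical and do not come from the talk.

```python
import math
from collections import Counter


def tf_idf_scores(corpus, query):
    """Return, for each document d, the sum over query words of w_{i,d}."""
    tokenized = [doc.split() for doc in corpus]  # assumption: whitespace tokens
    n_docs = len(tokenized)                      # |D|

    # f_{w,D}: number of documents in which w appears
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)                 # f_{w,d} for this document
        score = 0.0
        for w in query.split():
            if doc_freq[w] > 0:                  # words absent from D contribute 0
                score += counts[w] * math.log(n_docs / doc_freq[w])
        scores.append(score)
    return scores


def top_documents(corpus, query, k=1):
    """Return the indices of the k highest-scoring documents."""
    scores = tf_idf_scores(corpus, query)
    order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return order[:k]


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
print(top_documents(corpus, "the dog"))
```

In this toy corpus "the" appears in every document, so its weight is $\log(3/3) = 0$ and only "dog" contributes to the ranking.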

  5. Experiment • Documents from Linguistic Data Consortium’s United Nations Parallel Text Corpus • Support noise by enforcing case-sensitivity, no parsing of SGML symbols • Brute-force approach: consider only $f_{w,d}$
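To see why a brute-force ranking that considers only $f_{w,d}$ behaves differently from TF-IDF, the sketch below scores the same toy corpus both ways. It is an illustration only and does not reproduce the experiment: the corpus, whitespace tokenization, and the function names `best_by_raw_count` and `best_by_tf_idf` are all assumptions.

```python
import math
from collections import Counter


def best_by_raw_count(corpus, query):
    """Brute-force baseline: rank by the sum of f_{w,d} over query words."""
    def score(doc):
        counts = Counter(doc.split())
        return sum(counts[w] for w in query.split())
    return max(range(len(corpus)), key=lambda i: score(corpus[i]))


def best_by_tf_idf(corpus, query):
    """Rank by the sum of f_{w,d} * log(|D| / f_{w,D}) over query words."""
    tokenized = [doc.split() for doc in corpus]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # f_{w,D}: documents containing w

    def score(tokens):
        counts = Counter(tokens)
        return sum(counts[w] * math.log(len(tokenized) / doc_freq[w])
                   for w in query.split() if doc_freq[w] > 0)
    return max(range(len(tokenized)), key=lambda i: score(tokenized[i]))


corpus = [
    "the the the the cat",
    "the dog barked",
    "the quiet street",
]
print(best_by_raw_count(corpus, "the dog"))  # favors the document full of "the"
print(best_by_tf_idf(corpus, "the dog"))     # favors the document with "dog"
```

The baseline rewards the first document for repeating a word that every document contains, while the inverse document frequency term zeroes that word out.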

  6. Results

  7. Extensions and Further Research • Genetic TF-IDF: evolve weighting schemes that compete with TF-IDF. • Hill-climbing, gradient-descent TF-IDF. • Cross-language settings: return documents in a different language than the query.

  8. References • Berger, A & Lafferty, J. (1999). Information Retrieval as Statistical Translation. In Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (SIGIR’99), 222-229. • Berger, A et al (2000). Bridging the Lexical Chasm: Statistical Approaches to Answer Finding. In Proc. Int. Conf. Research and Development in Information Retrieval, 192-199.

  9. References pt. 2 • Berry, Michael W. et al. (1995). Using Linear Algebra for Intelligent Information Retrieval. SIAM Review, 37(4):177-196. • Brown, Peter F. et al. (1990). A Statistical Approach to Machine Translation. In Computational Linguistics 16(2): 79-85.

  10. References pt. 3 • Oren, Nir. (2002). Reexamining tf.idf based information retrieval with Genetic Programming. In Proceedings of SAICSIT 2002, 1-10. • Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. In Information Processing & Management, 24(5): 513-523.
