Automatically Building a Stopword List for an Information Retrieval System

  1. Automatically Building a Stopword List for an Information Retrieval System • Rachel Tsz-Wai Lo, Ben He, Iadh Ounis • University of Glasgow

  2. Outline • Stopwords • Investigation of two approaches • Approach based on Zipf’s Law • New Term-based random sampling approach • Experimental Setup • Results and Analysis • Conclusion

  3. What is a Stopword? • A common word in a document • e.g. the, is, and, am, to, it • Carries no information about the document • Low discrimination value in terms of IR • meaningless, no contribution • Searching with stopwords will usually retrieve irrelevant documents

  4. Objective • Different collections contain different content and word patterns • Different collections may require a different set of stopwords • Given a collection of documents • Investigate ways to automatically create a stopword list

  5. Objective (cont.) • Baseline approach (benchmark) • 4 variants inspired by Zipf’s Law: TF, Normalised TF, IDF, Normalised IDF • New proposed approach • Based on how informative a term is

  6. Fox’s Classical Stopword List and Its Weakness • Contains 733 stopwords • > 20 years old • Lacks potentially new words • Defined for General Purpose • different collections require different stopword lists • Outdated

  7. Zipf’s Law • Rank terms by their term frequency (TF) • term with highest TF will have rank = 1, next highest term with rank = 2 etc • Zipf’s Law: frequency is roughly inversely proportional to rank, $F(r) \approx C / r^{\alpha}$ with $\alpha \approx 1$

  8. Zipf’s Law

  9. Baseline Approach Algorithm • Generate a list of frequencies vs terms based on corpus • Sort the frequencies in descending order • Rank the terms according to their frequencies. Highest frequencies would have rank=1 and next highest would have rank=2 etc. • Draw a graph of frequencies vs rank
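
The first three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rank_terms`, the whitespace tokenisation and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def rank_terms(documents):
    """Return (rank, term, frequency) tuples; rank 1 is the most frequent term."""
    counts = Counter()
    for doc in documents:
        # Naive lower-casing and whitespace tokenisation (an assumption).
        counts.update(doc.lower().split())
    # Sort by descending frequency; ties are broken alphabetically so the
    # output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ordered, start=1)]


docs = ["the cat sat on the mat", "the dog is on the log"]
ranking = rank_terms(docs)
for rank, term, freq in ranking[:3]:
    print(rank, term, freq)
```

Plotting the frequency column against the rank column of `ranking` gives the frequency vs rank graph named in the last step.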

  10. Baseline Approach Algorithm (cont.)

  11. Baseline Approach Algorithm (cont.) • Choose a threshold; any words that appear above the threshold are treated as stopwords • Run the queries with this stopword list; all stopwords in the queries will be removed • Evaluate system with Average Precision
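
As a rough sketch of the thresholding and query-cleaning steps, assuming that "above the threshold" means a frequency strictly greater than the chosen value: the helper names, the toy frequencies and the threshold of 50 are hypothetical.

```python
def stopwords_above_threshold(term_freqs, threshold):
    """Treat every term whose frequency exceeds the threshold as a stopword."""
    return {term for term, freq in term_freqs.items() if freq > threshold}


def strip_stopwords(query, stopwords):
    """Remove stopwords from a query (naive whitespace tokenisation)."""
    return [tok for tok in query.lower().split() if tok not in stopwords]


# Toy collection statistics, invented for illustration only.
term_freqs = {"the": 120, "of": 95, "retrieval": 7, "history": 4}
stopwords = stopwords_above_threshold(term_freqs, threshold=50)
cleaned = strip_stopwords("The history of retrieval", stopwords)
print(sorted(stopwords), cleaned)
```

The cleaned queries would then be run against the index and scored with Average Precision.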

  12. Baseline Approach - Variants • Term Frequency • Normalised Term Frequency • Inverse Document Frequency (IDF) • Normalised IDF
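
The slide names the four variants without defining them. The sketch below computes one plausible version of each statistic. The formulas are assumptions chosen for illustration: normalised TF as `-log(TF / total tokens)`, IDF as `log(N / df)`, and normalised IDF as `log((N - df + 0.5) / (df + 0.5))`. The function name `term_statistics` is also hypothetical.

```python
import math


def term_statistics(documents):
    """Compute TF, normalised TF, IDF and normalised IDF for every term."""
    tokenised = [doc.lower().split() for doc in documents]
    total_tokens = sum(len(doc) for doc in tokenised)
    n_docs = len(tokenised)
    tf, df = {}, {}
    for doc in tokenised:
        for tok in doc:
            tf[tok] = tf.get(tok, 0) + 1      # collection term frequency
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1      # document frequency
    stats = {}
    for term in tf:
        stats[term] = {
            "tf": tf[term],
            # Assumed definition: negative log of the relative frequency.
            "norm_tf": -math.log(tf[term] / total_tokens),
            # Assumed definition: classical IDF.
            "idf": math.log(n_docs / df[term]),
            # Assumed definition: smoothed (Robertson/Sparck Jones style) IDF.
            "norm_idf": math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5)),
        }
    return stats


docs = ["the cat sat", "the dog ran", "the cat and the dog"]
stats = term_statistics(docs)
print(stats["the"])
```

Under these definitions a stopword candidate has a high TF but a low normalised TF, IDF and normalised IDF, so the ranking direction differs between the first variant and the other three.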

  13. Baseline Approach – Choosing Threshold • Aim: produce the best set of stopwords • More than 50 stopword lists generated for each variant • Investigate the frequency difference between two consecutive ranks • look for a big difference (i.e. sudden jump) • Important to choose an appropriate threshold
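
One simple way to look for a "sudden jump" is to find the largest drop between consecutive ranks. The sketch below is only a heuristic under that assumption; the slide describes examining many candidate thresholds rather than a single automatic rule, and the function name and frequencies are invented.

```python
def largest_gap_cutoff(frequencies):
    """Given frequencies sorted in descending order, return how many top-ranked
    terms sit before the largest drop between two consecutive ranks."""
    if len(frequencies) < 2:
        return len(frequencies)
    gaps = [frequencies[i] - frequencies[i + 1] for i in range(len(frequencies) - 1)]
    # index() returns the first occurrence, so ties favour the shorter list.
    return gaps.index(max(gaps)) + 1


freqs = [980, 950, 910, 300, 280, 120]   # toy values
cutoff = largest_gap_cutoff(freqs)
print(cutoff)
```

On a real Zipfian distribution the largest absolute gap often falls between ranks 1 and 2, so in practice relative gaps or several candidate cut-offs would probably need to be compared.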

  14. Term-Based Random Sampling Approach (TBRSA) • Our proposed new approach • Depends on how informative a term is • Based on the Kullback-Leibler divergence measure • Similar to the idea of query expansion

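  15. Kullback-Leibler Divergence Measure • Used to measure the distance between two distributions • In our case, distribution of two terms, one of which is a random term • The weight of a term t in the sampled document set is given by: $w(t) = P_x \cdot \log_2 \frac{P_x}{P_c}$ • where $P_x = tf_x / l_x$ and $P_c = F / token_c$ • $tf_x$: frequency of t in the sampled document set • $l_x$: total length of the sampled document set • $F$: frequency of t in the whole collection • $token_c$: total number of tokens in the collection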
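
A small sketch of a KL-style term weight follows. It assumes the weight takes the form `P_x * log2(P_x / P_c)`, with `P_x` the term's relative frequency in the sampled documents and `P_c` its relative frequency in the whole collection. The parameter names `tf_x`, `l_x`, `F` and `token_c` are labels chosen for this example.

```python
import math


def kl_weight(tf_x, l_x, F, token_c):
    """KL-style weight of a term in a sampled document set.

    tf_x    -- frequency of the term in the sampled documents
    l_x     -- total number of tokens in the sampled documents
    F       -- frequency of the term in the whole collection
    token_c -- total number of tokens in the whole collection
    """
    p_x = tf_x / l_x
    p_c = F / token_c
    if p_x == 0:
        return 0.0  # a term absent from the sample contributes nothing
    return p_x * math.log2(p_x / p_c)


# A term distributed in the sample exactly as in the collection gets weight 0,
# i.e. it tells us nothing about the sample.
uninformative = kl_weight(tf_x=50, l_x=1000, F=5000, token_c=100000)
# A term twice as dense in the sample as in the collection gets a positive weight.
informative = kl_weight(tf_x=100, l_x=1000, F=5000, token_c=100000)
print(uninformative, informative)
```

Terms with weights near zero are the stopword candidates, which is why the later steps rank in ascending order.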

  16. TBRSA Algorithm • Select a random term • Retrieve the documents containing it • Weight the terms in the retrieved documents with the KL divergence measure • Normalise weights by max weight • Rank in ascending order • Keep the top X ranked • Repeat Y times

  17. TBRSA Algorithm (cont.) • Merge the Y lists of top X ranked terms • Sort the merged list by weight • Extract top L ranked as stopwords
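
The two TBRSA slides can be sketched end to end on an in-memory toy corpus. Several details here are assumptions rather than the authors' method: "retrieve" is taken to mean "collect every document containing the random term", the weight is `P_x * log2(P_x / P_c)`, and a term that appears in several of the Y lists has its weights averaged on merge. The function name and the corpus are hypothetical.

```python
import math
import random
from collections import Counter


def tbrsa(documents, X, Y, L, seed=0):
    """Return the L least informative terms found over Y random samples."""
    rng = random.Random(seed)
    docs = [doc.lower().split() for doc in documents]
    coll_tf = Counter(tok for doc in docs for tok in doc)
    token_c = sum(coll_tf.values())
    vocab = sorted(coll_tf)
    collected = {}  # term -> normalised weights from each run it appeared in
    for _ in range(Y):
        seed_term = rng.choice(vocab)                       # random term
        sample = [doc for doc in docs if seed_term in doc]  # "retrieve"
        sample_tf = Counter(tok for doc in sample for tok in doc)
        l_x = sum(sample_tf.values())
        weights = {}
        for term, tf_x in sample_tf.items():
            p_x = tf_x / l_x
            p_c = coll_tf[term] / token_c
            weights[term] = p_x * math.log2(p_x / p_c)
        max_w = max(weights.values())
        if max_w > 0:                                       # normalise by max weight
            weights = {t: w / max_w for t, w in weights.items()}
        # Rank ascending (least informative first) and keep the top X.
        top_x = sorted(weights.items(), key=lambda kv: (kv[1], kv[0]))[:X]
        for term, w in top_x:
            collected.setdefault(term, []).append(w)
    # Merge: average duplicate terms (an assumption), then sort ascending.
    averaged = {t: sum(ws) / len(ws) for t, ws in collected.items()}
    ranked = sorted(averaged.items(), key=lambda kv: (kv[1], kv[0]))
    return [term for term, _ in ranked[:L]]


docs = [
    "the cat sat on the mat with a hat",
    "the dog ran to the park with a ball",
    "a bird flew over the house and the tree",
    "the fish swam in a pond near the farm",
]
stopwords = tbrsa(docs, X=5, Y=10, L=3, seed=42)
print(stopwords)
```

The fixed `seed` only makes the sketch reproducible; the random choice of the first term is what the slides list as a weakness, since an unlucky term can retrieve a very small sample.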

  18. Advantages / Disadvantages • Advantages • based on how informative a term is • computational effort minimal, compared to baselines • better coverage of collection • No need to monitor progress • Disadvantages • Generates first term randomly, could retrieve a small data set • Repeat experiments Y times

  19. Experimental Setup • Four TREC collections • http://trec.nist.gov/data/docs_eng.html • Each collection is indexed and stemmed with no pre-defined stopwords removed • No assumption of stopwords in the beginning • Long queries were used • Title, Description and Narrative • Maximise our chances of using the new stopword lists

  20. Experimental Platform • Terrier - TERabyte RetrIEveR • IR Group, University of Glasgow • Based on Divergence From Randomness (DFR) framework • Deriving parameter-free probabilistic models • PL2 model • http://ir.dcs.gla.ac.uk/terrier/

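  21. PL2 Model • One of the DFR document weighting models • Relevance score of a document d for query Q is: $score(d, Q) = \sum_{t \in Q} qtw \cdot \frac{1}{tfn + 1} \left( tfn \cdot \log_2 \frac{tfn}{\lambda} + \left( \lambda + \frac{1}{12 \cdot tfn} - tfn \right) \cdot \log_2 e + 0.5 \cdot \log_2 (2\pi \cdot tfn) \right)$ • where $tfn = tf \cdot \log_2 \left( 1 + c \cdot \frac{avg\_l}{l} \right)$, $c > 0$ • $\lambda = F / N$ is the mean and variance of the Poisson distribution • $qtw$: query term weight • $tf$: within-document frequency of t • $l$: document length • $avg\_l$: average document length • $F$: frequency of t in the whole collection • $N$: number of documents in the collection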
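
For orientation, here is a sketch of a per-term PL2 score using the form in which the model is commonly published (Poisson model, Laplace after-effect, normalisation 2). Treat the formula in the code as an assumption about what the slide showed. The function name and argument names are hypothetical and this is not Terrier's API.

```python
import math


def pl2_term_score(tf, doc_len, avg_doc_len, F, N, c=1.0, qtw=1.0):
    """Score contribution of one query term to one document under PL2.

    tf          -- frequency of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    F           -- frequency of the term in the whole collection
    N           -- number of documents in the collection
    c           -- term frequency normalisation parameter (c > 0)
    qtw         -- query term weight
    """
    if tf == 0:
        return 0.0
    # Normalisation 2: adjust tf for document length.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    lam = F / N  # mean (and variance) of the assumed Poisson distribution
    return qtw * (1.0 / (tfn + 1.0)) * (
        tfn * math.log2(tfn / lam)
        + (lam + 1.0 / (12.0 * tfn) - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2.0 * math.pi * tfn)
    )


rare = pl2_term_score(tf=3, doc_len=100, avg_doc_len=100, F=100, N=1000)
common = pl2_term_score(tf=3, doc_len=100, avg_doc_len=100, F=900, N=1000)
print(rare, common)
```

A document's score for a query would be the sum of this value over the query terms. The comparison shows why stopwords matter little to the score: a term that is frequent across the collection contributes much less than a rare one.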

  22. Collections • disk45, WT2G, WT10G and DOTGOV

  23. Queries

  24. Merging Stopword Lists • Merging classical with best generated using baseline and novel approach respectively • Adding 2 lists together, removing duplicates • Might be stronger in terms of effectiveness • Follows from classical IR technique of combining evidence
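
Merging two lists and removing duplicates amounts to an order-preserving union. The sketch below uses invented word lists and a hypothetical function name.

```python
def merge_stopword_lists(classical, generated):
    """Concatenate two stopword lists, dropping duplicates but keeping order."""
    merged, seen = [], set()
    for word in list(classical) + list(generated):
        w = word.lower()
        if w not in seen:
            seen.add(w)
            merged.append(w)
    return merged


classical = ["the", "is", "and"]             # stand-in for a classical list
generated = ["is", "html", "the", "page"]    # stand-in for a generated list
merged = merge_stopword_lists(classical, generated)
print(merged)
```

Keeping the classical entries first is only a presentation choice; for filtering queries a plain `set` union would behave the same.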

  25. Results and Analysis • Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach) • Compare the results obtained with those of Fox’s classical stopword list, based on average precision
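
Average precision for a single query can be computed as below. This is the standard non-interpolated definition, given as a sketch with invented document identifiers; the slides do not say which evaluation tool was used.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant hit
    # Relevant documents that were never retrieved count as zero precision.
    return precision_sum / len(relevant)


ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(round(ap, 4))
```

Averaging this value over all queries of a collection gives the mean figure that would be compared across stopword lists.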

  26. Baseline Approach – Overall Results • * indicates significant difference at 0.05 level • Normalised IDF performed best, and did so for every collection

  27. Baseline Approach – Additional Terms Produced

  28. TBRSA – Overall Results • * indicates significant difference at 0.05 level • disk45 and WT2G both show improvements

  29. TBRSA – Additional Terms Produced

  30. Refinement - Merging • New approach (TBRSA) gives comparable results • Computation effort is less • Fox’s classical stopword list was very effective, despite its old age • Worth using • Queries were quite “conservative”

  31. Merging – Baseline Approach • * indicates significant difference at 0.05 level • Produced a more effective stopword list

  32. Merging – TBRSA • * indicates significant difference at 0.05 level • Produced an improved stopword list with less computational effort 

  33. Conclusion & Future Work • Proposed a novel approach for automatically generating a stopword list • Effectiveness and robustness • Compared to 4 baseline variants, based on Zipf’s Law • Merge classical stopword list with best found result to produce a more effective stopword list

  34. Conclusion & Future Work (cont.) • Investigate other divergence metrics • Poisson-based approach • Verb vs Noun • “I can open a can of tuna with a can opener” • “to be or not to be” • Detect nature of context • Might have to keep some of the terms but remove others

  35. Thank you! • Any questions? • Thank you for your attention 

