
INFORMATION RETRIEVAL AND WEB SEARCH

INFORMATION RETRIEVAL AND WEB SEARCH. CC437. (Based on original material by Udo Kruschwitz). INFORMATION RETRIEVAL. GOAL: Find the documents most relevant to a certain QUERY. Latest development: WEB SEARCH (use the Web as the collection of documents). Related: QUESTION-ANSWERING.

Pat_Xavi

Presentation Transcript


  1. INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)

  2. INFORMATION RETRIEVAL • GOAL: Find the documents most relevant to a certain QUERY • Latest development: WEB SEARCH • Use the Web as the collection of documents • Related: • QUESTION-ANSWERING • DOCUMENT CLASSIFICATION

  3. INFORMATION RETRIEVAL: SUBTASKS • INDEX the documents in the collection (offline) • PROCESS the query • EVALUATE SIMILARITY and RANK the documents • Find documents most closely matching the query • DISPLAY results / enter a DIALOGUE • E.g., user may refine the query

  4. DOCUMENTS AS BAGS OF WORDS • DOCUMENT: broad tech stock rally may signal trend - traders. • DOCUMENT: technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums. • INDEX: broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, traders, trend
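
A minimal Python sketch of the bag-of-words idea on this slide: word order is discarded and only the terms and their counts are kept. The function name `bag_of_words` and the crude whitespace tokenizer are assumptions for illustration, not part of the original slides.

```python
from collections import Counter

def bag_of_words(text):
    """Return a term -> count mapping, ignoring word order."""
    # Crude tokenizer (an assumption): split on whitespace, strip a few
    # punctuation characters from the edges, and lowercase.
    tokens = [tok.strip(".,-").lower() for tok in text.split()]
    return Counter(tok for tok in tokens if tok)

doc1 = "broad tech stock rally may signal trend - traders."
bag = bag_of_words(doc1)
print(sorted(bag))
# ['broad', 'may', 'rally', 'signal', 'stock', 'tech', 'traders', 'trend']
```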

  5. SUBTASKS I: INDEXING • PREPROCESSING • Deletion of STOPWORDS • STEMMING • Selection of INDEX TERMS

  6. INDEXING I: PREPROCESSING • PUNCTUATION REMOVAL • (Crestani et al.) • CASE FOLDING • London → london • LONDON → london • DIGIT REMOVAL • But: SPARCStation 5
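
The three preprocessing steps on this slide can be sketched as follows. The function name `preprocess` and the `remove_digits` switch are assumptions; the switch is there because of the slide's caveat that digits sometimes carry meaning (SPARCStation 5).

```python
import re
import string

def preprocess(text, remove_digits=True):
    """Apply punctuation removal, case folding and (optionally) digit removal."""
    # Punctuation removal: map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Case folding: London and LONDON both become london.
    text = text.lower()
    # Digit removal is optional because it can destroy meaningful tokens.
    if remove_digits:
        text = re.sub(r"\d+", " ", text)
    # Collapse the runs of whitespace left behind by the steps above.
    return " ".join(text.split())

print(preprocess("London, LONDON!"))                       # london london
print(preprocess("SPARCStation 5"))                        # sparcstation
print(preprocess("SPARCStation 5", remove_digits=False))   # sparcstation 5
```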

  7. INDEXING II: STOPWORD REMOVAL • Very frequent words are not good discriminators • Many of these are CLOSED CLASS words • INQUERY’s list of stop words beginning with letter “a”: • a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at • Domain-specific stopwords • search, webmaster, copyright, www
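
A small sketch of stopword filtering, assuming the tokens have already been produced by some tokenizer. `STOPWORDS` holds only a hand-picked subset of the INQUERY words quoted on the slide, and the names `STOPWORDS`, `DOMAIN_STOPWORDS` and `remove_stopwords` are illustrative.

```python
# Subset of the general stopwords listed on the slide (not the full INQUERY list).
STOPWORDS = {"a", "about", "above", "across", "after", "all", "also",
             "an", "and", "are", "as", "at"}
# Domain-specific stopwords from the slide, relevant for web pages.
DOMAIN_STOPWORDS = {"search", "webmaster", "copyright", "www"}

def remove_stopwords(tokens, stopwords=STOPWORDS | DOMAIN_STOPWORDS):
    """Drop every token that appears in the stopword set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

print(remove_stopwords("a rally across all sectors".split()))
# ['rally', 'sectors']
```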

  8. INDEXING III: STEMMING • Simplest: suffix stripping • PORTER STEMMER: inflectional & derivational morphology • develop → develop • developing → develop • development → develop • developments → develop • BUT: photography → photographi • The effectiveness of stemming: • For English: the increase in recall doesn’t compensate for the loss in precision • For other languages: necessary • E.g., Abdul Goweder’s dissertation
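
To make "suffix stripping" concrete, here is a deliberately naive sketch. It is not the Porter stemmer, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the `min_stem` threshold and the function name are assumptions chosen only to reproduce the "develop" examples above.

```python
# Longest suffixes first, so "developments" loses "ments" rather than just "s".
SUFFIXES = ("ments", "ment", "ing", "ed", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("develop", "developing", "development", "developments"):
    print(w, "->", strip_suffix(w))  # every form maps to "develop"
```

Even this toy version produces non-words: "rallied" becomes "ralli", the same kind of artefact as the slide's "photographi".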

  9. STORAGE • Requirements • Huge amounts of data • Lots of redundancy • Quick random access necessary • Indexing techniques: • Inverted index files • Suffix trees / suffix arrays • Signature files

  10. STORAGE TECHNIQUES: INVERTED INDEX • DOCUMENT1: broad tech stock rally may signal trend - traders. • DOCUMENT2: technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums. • INVERTED INDEX: broad → {1}, gain → {2}, rally → {1,2}, score → {2}, signal → {1}, stock → {1,2}, tech → {1}, technology → {2}, traders → {1,2}, trend → {1}, tuesday → {2}
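
A sketch of how the inverted index on this slide could be built. It assumes the two documents have already been reduced to index terms (stopwords removed, words stemmed); the term lists below are read off the slide's index, and `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> list of index terms; returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)

# Index terms per document, as they appear in the slide's index.
docs = {
    1: ["broad", "tech", "stock", "rally", "signal", "trend", "traders"],
    2: ["technology", "stock", "rally", "tuesday", "gain", "score", "traders"],
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, "->", sorted(index[term]))
```

Looking up a term is then a single dictionary access, which is what makes this structure suit the "quick random access" requirement.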

  11. SIMILARITY MODELS • Boolean model • Probabilistic model • Vector-space model

  12. THE BOOLEAN MODEL • Each index term is either present or absent • Documents are either RELEVANT or NOT RELEVANT (no grading of results) • Advantages • Clean formalism, simple to implement • Disadvantages • Exact matching only • All index terms equal weight
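
A minimal sketch of Boolean retrieval over an inverted index, assuming the index maps each term to the set of documents containing it. The tiny index and the function names are illustrative. Note that the result is an unordered set: a document either matches or it does not, with no grading.

```python
def boolean_and(index, terms):
    """Documents containing ALL of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

def boolean_or(index, terms):
    """Documents containing AT LEAST ONE of the given terms."""
    return set().union(*(index.get(term, set()) for term in terms))

index = {"rally": {1, 2}, "tech": {1}, "technology": {2}}
print(boolean_and(index, ["rally", "tech"]))       # {1}
print(boolean_and(index, ["tech", "technology"]))  # set()
print(boolean_or(index, ["tech", "technology"]))   # {1, 2}
```

The second query shows the "exact matching only" disadvantage: "tech" and "technology" are unrelated terms as far as the model is concerned.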

  13. THE VECTOR SPACE MODEL • Query and documents represented as vectors of index terms, assigned non-binary WEIGHTS • Similarity calculated using vector algebra: COSINE (cf. lexical similarity models) • RANKED similarity • Most popular of all models (cf. Salton and Lesk’s SMART)

  14. SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE • (figure: document vector dj and query vector qk separated by angle θ)
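
The cosine measure is the cosine of the angle θ between the document vector and the query vector: their dot product divided by the product of their lengths. A sketch follows, assuming sparse vectors stored as term-to-weight dictionaries; that representation and the function name are assumptions.

```python
import math

def cosine(d, q):
    """Cosine similarity of two sparse vectors given as term -> weight dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_d * norm_q)

doc = {"stock": 1.0, "rally": 1.0}
print(cosine(doc, {"bond": 1.0}))   # 0.0, no shared terms
print(cosine(doc, {"stock": 1.0}))  # about 0.707
```

Because the lengths are divided out, a long document is not favoured merely for containing more words.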

  15. TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE • (formula not preserved; labels: FREQUENCY of term i in document k, number of documents with term i)
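
The formula on this slide did not survive transcription. The sketch below assumes one common variant of tf.idf: the frequency of term i in document k multiplied by log(N / n_i), where n_i is the number of documents containing term i and N (an added symbol) is the total number of documents. Other variants normalise or smooth these quantities differently.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps doc_id -> list of terms; returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw frequency of each term in this document
        weights[doc_id] = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    return weights

docs = {1: ["stock", "rally", "rally"], 2: ["stock", "trend"]}
w = tf_idf(docs)
print(w[1])  # "stock" is in every document, so its weight is 0.0
```

A term that occurs in every document gets weight zero, which is the same intuition as stopword removal: very frequent words are poor discriminators.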

  16. EVALUATION • One of the most important contributions of IR to NLE has been the development of better ways of evaluating systems than simple accuracy

  17. Simplest quantitative evaluation metrics • ACCURACY: percentage correct (against some gold standard), e.g., a tagger gets 96.7% of tags correct when evaluated using the Penn Treebank • Problem with accuracy: only really useful when classes are of approximately equal size (not the case in IR) • ERROR: percentage wrong • ERROR REDUCTION is the most typical metric in ASR
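
A small sketch of why accuracy misleads when classes are very unequal, as in IR. The collection size (1000 documents, 10 relevant) is invented for illustration.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Hypothetical collection: 10 relevant documents out of 1000.
gold = [True] * 10 + [False] * 990
# A useless system that labels every document as not relevant.
predicted = [False] * 1000

print(accuracy(gold, predicted))      # 0.99
print(1 - accuracy(gold, predicted))  # error rate, about 0.01
```

The system retrieves nothing at all, yet scores 99% accuracy, because almost every document is irrelevant to any given query.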

  18. A more general form of evaluation: precision & recall • (figure of placeholder document text; content not preserved)

  19. Positives and negatives • (figure labels: FALSE NEGATIVES, TP (true positives), FP (false positives), TRUE NEGATIVES)

  20. Precision and recall • PRECISION: proportion correct AMONG SELECTED ITEMS, i.e., TP / (TP + FP) • RECALL: proportion of correct items selected, i.e., TP / (TP + FN)
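
Precision and recall can be computed directly from two sets: the items the system selected and the items that are actually correct (relevant). A sketch, with illustrative names and document ids; returning 0.0 when a denominator is zero is a convention chosen here, since the ratio is strictly undefined in that case.

```python
def precision_recall(selected, relevant):
    """selected, relevant: sets of item ids. Returns (precision, recall)."""
    tp = len(selected & relevant)  # true positives: selected and correct
    # 0.0 for an empty denominator is a convention, not a mathematical fact.
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

selected = {1, 2, 3, 4}  # documents returned by the system
relevant = {2, 4, 5}     # documents judged relevant in the gold standard
p, r = precision_recall(selected, relevant)
print(p, r)  # 0.5 and about 0.667
```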

  21. The tradeoff between precision and recall • Easy to get high precision: classify hardly anything (select only the surest items) • Easy to get high recall: return everything • Really need to report BOTH, or the F-measure
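
The F-measure combines both numbers into one score. The slide does not give its formula; the sketch below assumes the usual weighted harmonic mean, where `beta` = 1 (the F1 score) weights precision and recall equally.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero; both scores are already worst-case
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# "Return everything": perfect recall, but precision collapses.
print(f_measure(0.01, 1.0))  # about 0.0198
# A balanced system scores much higher.
print(f_measure(0.5, 0.5))   # 0.5
```

Because the harmonic mean is dominated by the smaller value, neither of the two trivial strategies can score well.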

  22. WEB SEARCH • In many senses, just a form of IR • But: • Further information one has to take into account • Markup • Hyperlinks • Meta tags • Extra problems • Documents highly heterogeneous • Multimedia • Quality of data

  23. GOOGLE • Key aspects of Google’s search algorithm (as far as we know!) • Analyze link structure: PAGE RANK • Exploit visual presentation • Page Rank used to rank retrieved documents in addition to similarity measures • Page Rank motivations: • Most important papers are those cited most often • Not all sources of citations are equally reliable

  24. PAGE RANK • (formula not preserved; labels: probability q of randomly jumping to that page, page p, pages pointing to p)
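
The formula on this slide did not survive transcription. The sketch below assumes one common normalised formulation: with probability q a surfer jumps to a random page, and otherwise follows one of the outgoing links of the current page, so each page passes its rank in equal shares to the pages it points to. The default q = 0.15, the fixed iteration count, and the handling of pages without outgoing links are all assumptions.

```python
def page_rank(links, q=0.15, iterations=100):
    """links maps page -> list of pages it points to; returns page -> rank."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the random-jump share q / n.
        new_rank = {p: q / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass the remaining rank in equal shares along outgoing links.
                share = (1 - q) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = (1 - q) * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = page_rank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['c', 'a', 'b']
```

Page "a" has a single incoming link but still outranks "b", because that link comes from the highly ranked "c": not all sources of citations count equally.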

  25. READINGS AND REFERENCES • Jurafsky and Martin, chapter 10.1-10.4 • Other references • Brin, S. and Page, L., 1998, “The anatomy of a large-scale hypertextual web search engine”, in Proc. of the 7th WWW Conference (WWW7), Brisbane • Crestani, F. et al., 1998, “Is this document relevant? …probably”, ACM Computing Surveys, 30(4):528-552 • Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex • Porter, M. F., 1980, “An algorithm for suffix stripping”, Program, 14(3):130-137 • Salton, G. and Lesk, M. E., 1968, “Computer evaluation of indexing and text processing”, Journal of the ACM, 15(1):8-36
