
EE3J2 Data Mining, Lecture 5: More on the Index. Martin Russell


Presentation Transcript


  1. EE3J2 Data Mining 2008 – lecture 5

  2. EE3J2 Data Mining, Lecture 5: More on the Index. Martin Russell

  3. Objectives
• Revise TF-IDF weighting
• Example practical calculation of a TF-IDF-based similarity measure
• More about the function of the document index
• Assessing the retrieval

  4. Summary of the IR process
• Text query → Stop Word Removal → Stemming → Match
• Document → Stop Word Removal → Stemming → Match

  5. IDF weighting
• One commonly used measure of the significance of a token for discriminating between documents is the Inverse Document Frequency (IDF)
• For a token t define: IDF(t) = log(ND / NDt)
• ND is the total number of documents
• NDt is the number of documents which include t

  6. Query and document weights
• Now suppose t is a term
• If d is a document or a long query, define wtd = ftd × IDF(t), where ftd is the (document) term frequency: the number of times the term t occurs in the document d
• If d is a short query, define wtd = IDF(t) (that is, ftd is taken to be 1, as in the worked example later)
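As a concrete illustration of these two definitions, here is a minimal Python sketch, assuming documents and queries arrive as lists of stemmed tokens; the function names are mine, and the natural logarithm is an assumption since the slides do not fix the base.

```python
import math
from collections import Counter

def idf(term, documents):
    """IDF(t) = log(ND / NDt): ND documents in total, NDt containing t."""
    n_with_term = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n_with_term)

def tf_idf_weights(tokens, documents, short_query=False):
    """w_td = f_td * IDF(t); for a short query f_td is taken to be 1."""
    counts = Counter(tokens)
    return {t: (1 if short_query else f) * idf(t, documents)
            for t, f in counts.items()}
```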

  7. TF-IDF Similarity
• Define the similarity between query q and document d as:
Sim(q, d) = ( Σt wtq · wtd ) / ( ||q|| · ||d|| )
• The sum is over all terms t in both q and d; ||q|| is the 'length' of query q and ||d|| is the 'length' of document d

  8. Document length
• Suppose d is a document
• For each term t in d we can define the TF-IDF weight wtd
• The length of document d is defined by: ||d|| = √( Σt wtd² )
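Together, these two slides define a cosine-style score over the terms shared by query and document. A minimal sketch, assuming weights are held in term-to-weight dictionaries; the helper names are my own:

```python
import math

def length(weights):
    """||d|| = sqrt of the sum of w_td squared over the terms t in d."""
    return math.sqrt(sum(w * w for w in weights.values()))

def sim(q_weights, d_weights):
    """Sim(q,d): dot product over shared terms, divided by ||q||*||d||."""
    shared = q_weights.keys() & d_weights.keys()
    dot = sum(q_weights[t] * d_weights[t] for t in shared)
    return dot / (length(q_weights) * length(d_weights))
```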

  9. Practical Considerations
• Given a query q:
  • Calculate ||q|| and wtq for each term t in q
  • Not too much computation!
• For each document d:
  • ||d|| can be computed in advance
  • wtd can be computed in advance for each term t in d
• Potential number of documents is huge
• Potential time to compute all values of Sim(q, d) is huge!

  10. The Document Index
[Diagram: the index sits between the corpus of documents and the terms of a query q; for each term t it stores, per document d, the weight wtd together with the document length ||d||.]

  11. The Document Index
Example entry: text 'magnesium', num docs = 20, num occ. = 65
Document list: doc name1 (12 occurrences), doc name2 (8), doc name3 (3), ..., doc nameN (1)

  12. The Document Index
Each index entry holds the term's text ('magnesium'), its num docs (20), num occ. (65) and IDF, plus a document list pairing each document with its count and TF-IDF weight: doc name1 (12, weight1), doc name2 (8, weight2), doc name3 (3, weight3), ..., doc nameN (1, weightN)
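One plausible in-memory realisation of such an index is a dictionary mapping each term to its IDF and a postings list; this is a sketch under my own naming assumptions, not the lecture's actual implementation:

```python
import math
from collections import Counter

def build_index(corpus):
    """corpus: doc_id -> list of stemmed tokens.
    Returns term -> (idf, postings), where each posting is a tuple
    (doc_id, f_td, w_td) and postings are sorted by decreasing weight,
    as slide 13 recommends."""
    doc_freq = Counter(t for tokens in corpus.values() for t in set(tokens))
    index = {}
    for doc_id, tokens in corpus.items():
        for term, f in Counter(tokens).items():
            term_idf = math.log(len(corpus) / doc_freq[term])
            index.setdefault(term, (term_idf, []))[1].append(
                (doc_id, f, f * term_idf))
    for _, postings in index.values():
        postings.sort(key=lambda p: -p[2])  # decreasing w_td
    return index
```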

  13. Practical considerations
• Order terms according to decreasing IDF
• For each term, order documents according to decreasing weight
• For each term in the query:
  • Identify the term in the index
  • Increment similarity scores for the documents in the list for this term
  • Stop when the weight falls below some threshold
(A code sketch of this scheme follows below.)
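The sketch below shows one way those steps could look, assuming an index of the shape built above and a hypothetical weight threshold; the early `break` relies on each postings list being sorted by decreasing weight.

```python
def retrieve(query_weights, index, doc_lengths, q_length, threshold=0.5):
    """Accumulate Sim(q,d) numerators term by term, skipping weak postings."""
    scores = {}
    # Visit query terms in decreasing w_tq (= IDF for a short query),
    # approximating the slide's "decreasing IDF" ordering.
    for term, w_tq in sorted(query_weights.items(), key=lambda x: -x[1]):
        _, postings = index.get(term, (0.0, []))
        for doc_id, _, w_td in postings:
            if w_td < threshold:
                break  # postings are sorted, so the rest are smaller
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    # Normalise by the precomputed lengths to obtain Sim(q, d).
    return {d: s / (q_length * doc_lengths[d]) for d, s in scores.items()}
```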

  14. Example
• Text: "The data mining course describes a set of methods for data mining and information retrieval"
• Text with stop-words removed (stopList50): "data mining course describes set methods data mining information retrieval"
• Stemmed text (Porter Stemmer): "data mine cours describ set method data mine inform retriev"
(The sketch below reproduces this preprocessing.)
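Here is a hedged sketch of that preprocessing using NLTK's PorterStemmer; the lecture's stopList50 is not available, so a small illustrative stop set stands in for it.

```python
from nltk.stem import PorterStemmer  # requires: pip install nltk

STOP_WORDS = {"the", "a", "of", "for", "and", "is", "there", "on", "or"}
stemmer = PorterStemmer()

def preprocess(text):
    """Lower-case, drop stop words, then Porter-stem each token."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [stemmer.stem(w) for w in tokens]

print(preprocess("The data mining course describes a set of methods "
                 "for data mining and information retrieval"))
# Expected to match the slide:
# ['data', 'mine', 'cours', 'describ', 'set', 'method',
#  'data', 'mine', 'inform', 'retriev']
```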

  15. Example: query
• Question: "Is there a module on data mining or information retrieval?"
• Question, stop words removed: "module data mining text retrieval"
• Question, stemmed: "modul data mine text retriev"

  16. Example: terms

Document terms:
term      f   IDF
data      2   1.5
mine      2   2.5
cours     1   1.2
describ   1   0.8
set       1   0.6
method    1   0.8
inform    1   1.1
retriev   1   2.6

Query terms:
term      f   IDF
modul     1   1.6
data      1   1.5
mine      1   2.5
text      1   1.2
retriev   1   2.6

  17. Weight calculation: document

term      f   IDF   weight = f × IDF
data      2   1.5   3.0
mine      2   2.5   5.0
cours     1   1.2   1.2
describ   1   0.8   0.8
set       1   0.6   0.6
method    1   0.8   0.8
inform    1   1.1   1.1
retriev   1   2.6   2.6

  18. Weight calculation: query

term      f   IDF   weight = f × IDF
modul     1   1.6   1.6
data      1   1.5   1.5
mine      1   2.5   2.5
text      1   1.2   1.2
retriev   1   2.6   2.6

  19. Document length (recap)
• Suppose d is a document
• For each term t in d we can define the TF-IDF weight wtd
• The length of document d is defined by: ||d|| = √( Σt wtd² )

  20. Length calculation: document

term      f   IDF   weight   weight²
data      2   1.5   3.0      9.0
mine      2   2.5   5.0      25.0
cours     1   1.2   1.2      1.44
describ   1   0.8   0.8      0.64
set       1   0.6   0.6      0.36
method    1   0.8   0.8      0.64
inform    1   1.1   1.1      1.21
retriev   1   2.6   2.6      6.76
SUM of weight² = 45.05
Document length = √45.05 = 6.71

  21. Length calculation: query

term      f   IDF   weight   weight²
modul     1   1.6   1.6      2.56
data      1   1.5   1.5      2.25
mine      1   2.5   2.5      6.25
text      1   1.2   1.2      1.44
retriev   1   2.6   2.6      6.76
SUM of weight² = 19.26
Query length = √19.26 = 4.39

  22. TF-IDF Similarity
• Recall the definition of the similarity between query q and document d:
Sim(q, d) = ( Σt wtq · wtd ) / ( ||q|| · ||d|| )
• For this example, the 'length' of q is 4.39 and the 'length' of d is 6.71

  23. Example: common terms
• Terms which occur in both the document and the query
• Query: modul data mine text retriev
• Document: data mine cours describ set method data mine inform retriev
• Common terms: data, mine, retriev

  24. Example: common terms

term      wtd × wtq
data      3.0 × 1.5 = 4.5
mine      5.0 × 2.5 = 12.5
retriev   2.6 × 2.6 = 6.76
SUM = 23.76

  25. TF-IDF Similarity
• Sim(q, d) = ( Σt wtq · wtd ) / ( ||q|| · ||d|| )
• Sum over all terms in both q and d = 23.76; 'length' of q = 4.39; 'length' of d = 6.71

  26. Example: final calculation
• Sim(q, d) = 23.76 / (4.39 × 6.71) = 23.76 / 29.46 ≈ 0.81
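The whole worked example can be checked in a few lines of Python; the figures below are copied from the slides, and the script reproduces the two lengths, the 23.76 numerator and the final score of roughly 0.81.

```python
import math

doc   = {"data": (2, 1.5), "mine": (2, 2.5), "cours": (1, 1.2),
         "describ": (1, 0.8), "set": (1, 0.6), "method": (1, 0.8),
         "inform": (1, 1.1), "retriev": (1, 2.6)}   # term: (f, IDF)
query = {"modul": (1, 1.6), "data": (1, 1.5), "mine": (1, 2.5),
         "text": (1, 1.2), "retriev": (1, 2.6)}

w_d = {t: f * i for t, (f, i) in doc.items()}
w_q = {t: f * i for t, (f, i) in query.items()}
len_d = math.sqrt(sum(w * w for w in w_d.values()))          # 6.71
len_q = math.sqrt(sum(w * w for w in w_q.values()))          # 4.39
dot = sum(w_q[t] * w_d[t] for t in w_q.keys() & w_d.keys())  # 23.76
print(round(dot / (len_q * len_d), 2))                       # 0.81
```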

  27. Assessing the Retrieval
• Two measures typically used:
  • Recall: the proportion of the relevant documents that are actually retrieved
  • Precision: the proportion of the retrieved documents that are actually relevant
[Venn diagram: the Retrieved and Relevant document sets overlap.]

  28. Recall
[Venn diagram: high-recall retrieval; the Retrieved set covers most of the Relevant set.]

  29. Precision
[Venn diagram: high-precision retrieval; most of the Retrieved set lies within the Relevant set.]

  30. Example 1
• 20 documents (Doc1, ..., Doc20), 2 of them 'about' Birmingham
• System 1 retrieves all 20 documents
• Recall = 2/2 = 1
• Precision = 2/20 = 0.1
• System 1 has perfect recall, but low precision

  31. Example 2
• System 2 retrieves Doc5 and Doc7 (the two relevant documents)
• Recall = 2/2 = 1
• Precision = 2/2 = 1
• System 2 has perfect recall and precision

  32. Example 3
• System 3 retrieves Doc5
• Recall = 1/2 = 0.5, Precision = 1/1 = 1
• System 3 has poor recall but perfect precision

  33. Example 4
• System 4 retrieves Doc14
• Recall = 0/2 = 0, Precision = 0/1 = 0
• System 4 has poor recall and precision

  34. Example 5
• System 5 retrieves Doc5, Doc8 and Doc1
• Recall = 1/2 = 0.5, Precision = 1/3 = 0.33
(All five systems are checked in the sketch below.)
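All five example systems can be verified with a single helper; Doc5 and Doc7 are the relevant documents throughout, and only the function name is my own.

```python
def recall_precision(retrieved, relevant):
    """Recall = hits / |relevant|; Precision = hits / |retrieved|."""
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)

relevant = {"Doc5", "Doc7"}
print(recall_precision({f"Doc{i}" for i in range(1, 21)}, relevant))  # (1.0, 0.1)
print(recall_precision({"Doc5", "Doc7"}, relevant))          # (1.0, 1.0)
print(recall_precision({"Doc5"}, relevant))                  # (0.5, 1.0)
print(recall_precision({"Doc14"}, relevant))                 # (0.0, 0.0)
print(recall_precision({"Doc5", "Doc8", "Doc1"}, relevant))  # (0.5, ~0.33)
```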

  35. Precision & Recall
• In general, as the number of documents retrieved increases:
  • Recall increases
  • Precision decreases
• In many systems:
  • Each query q and document d is assigned a similarity score Sim(q, d)
  • d is retrieved if Sim(q, d) is bigger than some threshold T
• By changing T we can trade Recall against Precision (see the sketch below)
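A small sketch of that trade-off; the similarity scores and document names here are hypothetical, chosen only to make the threshold's effect visible.

```python
def retrieve_at(scores, T):
    """Retrieve every document whose similarity score exceeds T."""
    return {d for d, s in scores.items() if s > T}

scores = {"Doc5": 0.81, "Doc7": 0.64, "Doc8": 0.40, "Doc14": 0.15}
relevant = {"Doc5", "Doc7"}
for T in (0.0, 0.5, 0.7):
    retrieved = retrieve_at(scores, T)
    hits = len(retrieved & relevant)
    print(T, hits / len(relevant), hits / len(retrieved))
# T = 0.0: recall 1.0, precision 0.5   (everything retrieved)
# T = 0.5: recall 1.0, precision 1.0
# T = 0.7: recall 0.5, precision 1.0   (only Doc5 retrieved)
```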

  36. Precision / Recall Tradeoff
• If the threshold is 0, all documents will be accepted:
  • High recall
  • Low precision
• As the threshold increases, the system becomes more 'discerning':
  • Fewer documents are retrieved
  • Retrieved documents tend to be relevant, but lots are missed
  • Low recall
  • High precision

  37. ROC Curves (Receiver Operating Characteristic)
[Plot: Recall and Precision scores (from 0 to 1) against the threshold T; recall falls and precision rises as T increases.]

  38. 'Precision-Recall' graph
• Also called a DET Curve
[Plot: Precision (0 to 1) against Recall (0 to 1). A better system's curve lies towards high precision and high recall; a worse system's lies nearer the origin. High recall with low precision: many irrelevant documents retrieved. High precision with low recall: a 'selective' system retrieves few, relevant documents.]

  39. Summary
• Revision of TF-IDF similarity
• Example TF-IDF similarity calculation
• More detail about the Document Index
• Assessment:
  • Precision
  • Recall
  • DET curves
