CSE 450 – Web Mining Seminar Professor Brian D. Davison Fall 2005

CSE 450 – Web Mining Seminar Professor Brian D. Davison Fall 2005. A Project Presentation on Identifying most descriptive terms by Osama Ahmed Khan 12/16/2005. Problem. Finding the most descriptive terms for a particular document in a collection of documents (webpages)

Presentation Transcript


  1. CSE 450 – Web Mining Seminar • Professor Brian D. Davison • Fall 2005 • A Project Presentation on Identifying most descriptive terms by Osama Ahmed Khan • 12/16/2005

  2. Problem • Finding the most descriptive terms for a particular document in a collection of documents (webpages) • Estimating the best description for a new location in a higher-dimensional space

  3. Terminology • Term: Adjective Noun (bi-gram) -- ti • Document: Content -- di

  4. Algorithm • Creates a 2-D matrix A (t x d), representing the frequency of each term ti in each document di • Creates a 3-D matrix B (d x t x t), representing the frequency of co-occurrence of each term ti with every other term tj in each document di • Sorts the pairs titj for each document di in descending order of frequency, where the pairs titj represent the descriptive terms for that document di • Extracts the first n pairs in the sorted index for each document di, where n is supplied by the user
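The four steps on the Algorithm slide can be sketched as follows. This is a minimal illustration, not the presenter's TMI/C++ code. The slides do not say how co-occurrence is counted, so the sketch assumes a document arrives as a list of sentences, each a list of already-extracted terms, and that a pair co-occurs once for every sentence containing both terms. The function names `term_frequencies`, `pair_frequencies`, and `top_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(sentences):
    """One column of matrix A: frequency of each term in a single document."""
    return Counter(term for sentence in sentences for term in sentence)


def pair_frequencies(sentences):
    """One slice of matrix B: co-occurrence count of each unordered term pair.

    Assumption (not stated on the slides): a pair co-occurs once per
    sentence in which both terms appear.
    """
    counts = Counter()
    for sentence in sentences:
        for ti, tj in combinations(sorted(set(sentence)), 2):
            counts[(ti, tj)] += 1
    return counts


def top_pairs(sentences, n):
    """Sort pairs by descending frequency and keep the first n."""
    counts = pair_frequencies(sentences)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:n]]


# A toy document made of three "sentences" of adjective-noun bi-grams.
doc = [
    ["fast car", "red car"],
    ["fast car", "red car", "old road"],
    ["old road", "fast car"],
]
print(term_frequencies(doc))
print(top_pairs(doc, 2))
```

Storing only the pairs that actually occur, rather than a dense d x t x t array, keeps memory proportional to the observed pairs, which matters because most term pairs never co-occur.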

  5. Algorithm (contd.) • A document is represented in a higher-dimensional space by plotting its t(t-1)/2 coordinates, where each dimension is a titj pair • Any missing coordinate for a document di is assigned a value of zero • A new document dj located in the t(t-1)/2-dimensional space is best described by using the Mahalanobis distance metric to find the minimum distance between dj and the other (d-1) documents • A new document dj identified in the t(t-1)/2-dimensional space without its coordinates being known is best described by using the k-Nearest Neighbors approach
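The pair-space representation and the minimum-distance lookup can be illustrated with a short sketch. It covers only the case where the new document's coordinates are known. One simplification is an assumption of this sketch and not something the slides specify: the covariance matrix is taken to be diagonal (per-dimension variances only), because a full covariance matrix over t(t-1)/2 dimensions is usually singular when there are few documents. All function names are hypothetical.

```python
import math
from itertools import combinations


def to_vector(pair_counts, dimensions):
    """Map a {pair: count} dict onto fixed dimensions; missing pairs become 0."""
    return [float(pair_counts.get(dim, 0)) for dim in dimensions]


def diagonal_variances(vectors, eps=1e-9):
    """Per-dimension variance; eps avoids division by zero on constant columns."""
    n = len(vectors)
    variances = []
    for k in range(len(vectors[0])):
        column = [v[k] for v in vectors]
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n + eps)
    return variances


def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance under the assumed diagonal covariance."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(x, y, variances)))


def nearest_document(new_vec, doc_vecs, variances):
    """Index of the known document at minimum distance from new_vec."""
    return min(
        range(len(doc_vecs)),
        key=lambda i: mahalanobis_diag(new_vec, doc_vecs[i], variances),
    )


terms = ["a", "b", "c"]
dimensions = list(combinations(terms, 2))  # t(t-1)/2 = 3 pair dimensions
known = [
    {("a", "b"): 3, ("a", "c"): 1},
    {("b", "c"): 4},
    {("a", "b"): 1, ("b", "c"): 2},
]
doc_vecs = [to_vector(p, dimensions) for p in known]
variances = diagonal_variances(doc_vecs)
new_vec = to_vector({("a", "b"): 2, ("a", "c"): 1}, dimensions)
closest = nearest_document(new_vec, doc_vecs, variances)
print(closest)  # index of the closest known document
```

The descriptive pairs of the closest known document could then serve as a description of the new one. Extending `nearest_document` to return the k smallest distances would give a starting point for the k-Nearest Neighbors variant.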

  6. Dataset • Xiaoguang Qi provided pre-processed data http://wume.cse.lehigh.edu/~xiq204/topics/

  7. Implementation • Code: Text Mining Infrastructure (TMI) http://hddi.cse.lehigh.edu; C++ • Metrics: Precision; Recall
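For reference, the two metrics named on the Implementation slide can be computed as below. The slides do not say what the extracted terms were compared against, so the `relevant` set here is a hypothetical stand-in for whatever reference description was used, and `precision_recall` is an illustrative helper rather than part of TMI.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted terms against a reference set.

    precision = |extracted ∩ relevant| / |extracted|
    recall    = |extracted ∩ relevant| / |relevant|
    """
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 extracted terms, 3 reference terms, 2 in common.
p, r = precision_recall(
    ["fast car", "red car", "old road", "big house"],
    ["fast car", "red car", "new bridge"],
)
print(p, r)
```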

  8. Applications • Topic detection through search engines • Finding document representation in different domains

  9. Open Problems • Finding an approximate transformation from t-dimensional space to a new k-dimensional space (if any exists), when the set of documents D is also represented in k-dimensional space, where k is equal to t(t-1)/2 • Estimating the best description of a document in either of the two spaces when one set of space coordinates is missing

  10. References • Improved Automatic Keyword Extraction Given More Linguistic Knowledge, Annette Hulth, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing • Using Web Structure for Classifying and Describing Web Pages. E.J.Glover, K.Tsioutsiouliklis, S.Lawrence, D.M.Pennock & G.W.Flake, WWW2002, Hawaii, USA • Lexically-Generated Subject Hierarchies for Browsing Large Collections, C.G.Nevill-Manning, I.H.Witten & G.W.Paynter

  11. Thank You