CSE 450 – Web Mining Seminar Professor Brian D. Davison Fall 2005

A Project Presentation on Identifying Most Descriptive Terms
by Osama Ahmed Khan
12/16/2005

Problem
• Finding the most descriptive terms for a particular document in a collection of documents (webpages)
• Estimating the best description for a new location in a higher-dimensional space

Terminology
• Term: Adjective-Noun pair (bi-gram), denoted ti
• Document: Content, denoted di
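A minimal sketch of how adjective-noun terms could be pulled from text that has already been part-of-speech tagged. The function name and the "ADJ"/"NOUN" tag labels are assumptions made for illustration; the slides do not say which tagger or tag set was used.

```python
def adjective_noun_bigrams(tagged_tokens):
    """Return every adjacent (adjective, noun) pair as a lower-cased term.

    tagged_tokens: list of (word, tag) tuples; the tag names "ADJ" and
    "NOUN" are hypothetical labels, not taken from the presentation.
    """
    terms = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            terms.append(f"{w1.lower()} {w2.lower()}")
    return terms


# Toy input, tagged by hand for the example.
tagged = [
    ("Fast", "ADJ"), ("crawlers", "NOUN"), ("index", "VERB"),
    ("dynamic", "ADJ"), ("pages", "NOUN"),
]
print(adjective_noun_bigrams(tagged))
```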

Algorithm
• Creates a 2-D matrix A (t x d), representing the frequency of each term ti for each document di
• Creates a 3-D matrix B (d x t x t), representing the frequency of co-occurrence of each term ti with every other term tj for each document di
• Sorts the pairs titj for each document di in descending order of frequency, where titj represents the descriptive terms for that document di
• Extracts the first n pairs in the sorted index for each document di, where n represents the user input
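The steps above can be sketched for a single document as follows. This is an illustration in Python, not the project's code (the slides report a C++ implementation). It assumes that two terms co-occur once for every sentence containing both, which the slides do not specify, and it stores one row of A and one slice of B as sparse counters rather than dense matrices. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(doc):
    """One column of matrix A: term -> frequency in this document."""
    return Counter(term for sentence in doc for term in sentence)


def pair_frequencies(doc):
    """One slice of matrix B: (ti, tj) -> co-occurrence count.

    Assumption: a pair co-occurs once per sentence that contains both terms.
    Pairs are stored with the two terms in alphabetical order.
    """
    counts = Counter()
    for sentence in doc:
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return counts


def top_pairs(doc, n):
    """First n pairs in descending order of co-occurrence frequency.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(pair_frequencies(doc).items(),
                    key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:n]]


# Toy document: each sentence is a list of already-extracted terms.
doc = [
    ["web mining", "search engine"],
    ["web mining", "search engine", "descriptive term"],
    ["web mining", "descriptive term"],
]
print(term_frequencies(doc)["web mining"])
print(top_pairs(doc, 2))
```

Here n plays the role of the user input from the last step; a different definition of co-occurrence (for example a sliding window) would only change `pair_frequencies`.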

Algorithm (contd.)
• A document is represented in a higher-dimensional space by plotting its t(t-1)/2 coordinates, where each dimension is a titj pair
• Any missing coordinate for a document di is assigned a value of zero
• A new document dj located in t(t-1)/2-dimensional space is best described by using the Mahalanobis distance metric to find the minimum distance between dj and the other (d-1) documents
• A new document dj identified in t(t-1)/2-dimensional space without its coordinates being known is best described by using the k-Nearest Neighbors approach
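A hedged sketch of the first three points: mapping pair counts to t(t-1)/2 coordinates, filling missing coordinates with zero, and finding the closest known document under the Mahalanobis distance. The covariance is estimated from the known documents, and a small ridge term is added to its diagonal so that it stays invertible; that regularisation, the helper names, and the toy data are assumptions rather than details from the slides. The k-Nearest Neighbors case is not covered here.

```python
from itertools import combinations


def pair_index(terms):
    """Map each unordered term pair to one of the t(t-1)/2 dimensions."""
    return {pair: k for k, pair in enumerate(combinations(sorted(terms), 2))}


def to_vector(pair_counts, index):
    """Coordinates of one document; pairs it lacks stay at zero."""
    vec = [0.0] * len(index)
    for pair, count in pair_counts.items():
        vec[index[pair]] = float(count)
    return vec


def covariance(vectors, ridge=1e-3):
    """Sample covariance of the document vectors plus a diagonal ridge.

    The ridge is an assumption: with few documents and many dimensions
    the plain covariance matrix is singular.
    """
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += (v[a] - mean[a]) * (v[b] - mean[b]) / (n - 1)
    for a in range(dim):
        cov[a][a] += ridge
    return cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def mahalanobis(u, v, cov):
    """sqrt((u - v)^T cov^-1 (u - v)), computed without forming cov^-1."""
    diff = [a - b for a, b in zip(u, v)]
    squared = sum(d * s for d, s in zip(diff, solve(cov, diff)))
    return max(squared, 0.0) ** 0.5


def nearest_document(new_vec, vectors, cov):
    """Index of the known document closest to new_vec."""
    return min(range(len(vectors)),
               key=lambda i: mahalanobis(new_vec, vectors[i], cov))


# Toy data: t = 3 terms, so 3 * 2 / 2 = 3 dimensions.
terms = ["big data", "search engine", "web mining"]
index = pair_index(terms)
docs = [
    {("big data", "search engine"): 3},
    {("search engine", "web mining"): 2, ("big data", "web mining"): 1},
    {("big data", "web mining"): 4},
]
vectors = [to_vector(d, index) for d in docs]
cov = covariance(vectors)
new_vec = to_vector({("search engine", "web mining"): 2}, index)
print(nearest_document(new_vec, vectors, cov))
```

The descriptive pairs of the returned document can then be used as the description of the new one.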

Dataset
• Xiaoguang Qi provided pre-processed data: http://wume.cse.lehigh.edu/~xiq204/topics/

Implementation
• Code
  • Text Mining Infrastructure (TMI): http://hddi.cse.lehigh.edu
  • C++
• Metrics
  • Precision
  • Recall
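As a reminder of how the two metrics are computed, here is a small sketch that scores a set of extracted terms against a reference set. It assumes evaluation against a hand-labelled list of relevant terms, which the slides do not confirm; the function name and the example terms are hypothetical.

```python
def precision_recall(extracted, relevant):
    """Precision = hits / |extracted|, recall = hits / |relevant|."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: one of two extracted terms is among four relevant ones.
extracted = ["web mining", "fast crawler"]
relevant = ["web mining", "search engine", "descriptive term", "dynamic page"]
print(precision_recall(extracted, relevant))  # (0.5, 0.25)
```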

Applications
• Topic detection through search engines
• Finding document representation in different domains

Open Problems
• Finding an approximate transformation from t-dimensional space to a new k-dimensional space (if any exists), when the set of documents D is also represented in k-dimensional space, where k = t(t-1)/2
• Estimating the best description of a document in either of the two spaces when one set of coordinates is missing


Thank You
