Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence

Mike Thelwall, Professor of Information Science, University of Wolverhampton

Presentation Transcript

  1. Mike Thelwall Professor of Information Science University of Wolverhampton Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence

  2. Contents • Introduction to Scientific Web Intelligence • Introduction to the Vector Space Model • Vocabulary Spectral Analysis • Low frequency words

  3. Part 1 Scientific Web Intelligence

  4. Scientific Web Intelligence • Applying web mining and web intelligence techniques to collections of academic/scientific web sites • Uses links and text • Objective: to identify patterns and visualize relationships between web sites and subsites • Objective: to report to users causal information about relationships and patterns

  5. Academic Web Mining • Step 1: Cluster domains by subject content, using text and links • Step 2: Identify patterns and create visualizations for relationships • Step 3: Incorporate user feedback and reason reporting into visualization This presentation deals with Step 1, deriving subject-based clusters of academic webs from text analysis

  6. Part 2 Introduction to the Vector Space Model

  7. Overview • The Vector Space Model (VSM) is a way of representing documents through the words that they contain • It is a standard technique in Information Retrieval • The VSM allows decisions to be made about which documents are similar to each other and to keyword queries

  8. How it works: Overview • Each document is broken down into a word frequency table • The tables are called vectors and can be stored as arrays • A vocabulary is built from all the words in all documents in the system • Each document is represented as a vector with one entry for each word in the vocabulary

  9. Example • Document A • “A dog and a cat.” • Document B • “A frog.”

  10. Example, continued • The vocabulary contains all words used • a, dog, and, cat, frog • The vocabulary needs to be sorted • a, and, cat, dog, frog

  11. Example, continued • Document A: “A dog and a cat.” • Vector: (2,1,1,1,0) • Document B: “A frog.” • Vector: (1,0,0,0,1)
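The two-document example can be reproduced with a minimal Python sketch. The function names (`tokenize`, `build_vocabulary`, `to_vector`) are illustrative and not from the presentation, and the tokenizer simply lowercases the text and keeps alphabetic runs, which is an assumption about how words are split.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for text in documents for word in tokenize(text)})

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

doc_a = "A dog and a cat."
doc_b = "A frog."
vocabulary = build_vocabulary([doc_a, doc_b])
vector_a = to_vector(doc_a, vocabulary)
vector_b = to_vector(doc_b, vocabulary)

print(vocabulary)  # ['a', 'and', 'cat', 'dog', 'frog']
print(vector_a)    # [2, 1, 1, 1, 0]
print(vector_b)    # [1, 0, 0, 0, 1]
```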

  12. Measuring inter-document similarity • For two vectors d and d' the cosine similarity between d and d' is given by: $\cos(d, d') = \frac{d \cdot d'}{\|d\| \, \|d'\|}$ • Here $d \cdot d'$ is the dot product of d and d', calculated by multiplying corresponding frequencies together and summing the results, and $\|d\|$ is the length of d • The cosine measure gives the cosine of the angle between the vectors in a high-dimensional virtual space
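As a rough illustration, the cosine measure for the two example vectors can be computed as follows. This is a sketch only; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions, not part of the presentation.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm1 * norm2)

vector_a = [2, 1, 1, 1, 0]  # "A dog and a cat."
vector_b = [1, 0, 0, 0, 1]  # "A frog."
similarity = cosine_similarity(vector_a, vector_b)
print(round(similarity, 3))  # 0.535
```

The only shared word is "a", so the dot product is 2 and the similarity is 2 / sqrt(14).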

  13. Stopword lists • Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing • E.g. “in”, “a”, “the”
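A small sketch of stopword filtering, using only the three example words from the slide as the list. Real stopword lists are much longer; `STOPWORDS` and `remove_stopwords` are hypothetical names.

```python
# Only the three examples given on the slide; a real list would be longer.
STOPWORDS = {"in", "a", "the"}

def remove_stopwords(vocabulary, stopwords=STOPWORDS):
    """Return the vocabulary without the stopwords, preserving order."""
    return [word for word in vocabulary if word not in stopwords]

vocabulary = ["a", "and", "cat", "dog", "frog"]
filtered = remove_stopwords(vocabulary)
print(filtered)  # ['and', 'cat', 'dog', 'frog']
```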

  14. Normalised term frequency (tf) • A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document • This is known as the tf factor. • Document A: raw frequency vector: (2,1,1,1,0), tf vector: (1, 0.5, 0.5, 0.5, 0)
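The tf calculation for Document A can be sketched in a few lines. The function name `normalised_tf` is illustrative, and returning zeros for an empty document is an assumption.

```python
def normalised_tf(raw_frequencies):
    """Divide each raw frequency by the largest frequency in the document."""
    max_frequency = max(raw_frequencies, default=0)
    if max_frequency == 0:
        return [0.0] * len(raw_frequencies)  # assumption: empty document
    return [f / max_frequency for f in raw_frequencies]

tf_a = normalised_tf([2, 1, 1, 1, 0])
print(tf_a)  # [1.0, 0.5, 0.5, 0.5, 0.0]
```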

  15. Inverse document frequency (idf) • A calculation designed to make rare words more important than common words • The idf of word i is given by $\mathrm{idf}_i = \log(N / n_i)$ • Where N is the total number of documents and $n_i$ is the number that contain word i
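A minimal sketch of the idf calculation, assuming the common logarithmic form idf_i = log(N / n_i) with a natural logarithm (the base is an assumption, and other variants exist). The function name is hypothetical.

```python
import math

def inverse_document_frequency(vectors):
    """Return log(N / n_i) for each term i, where n_i counts documents containing it."""
    n_docs = len(vectors)
    idf = []
    for i in range(len(vectors[0])):
        n_i = sum(1 for vector in vectors if vector[i] > 0)
        # assumption: a term that appears in no document gets weight 0
        idf.append(math.log(n_docs / n_i) if n_i else 0.0)
    return idf

# Raw frequency vectors for documents A and B over (a, and, cat, dog, frog)
vectors = [[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
idf = inverse_document_frequency(vectors)
print(idf)
```

Here "a" occurs in both documents, so its idf is log(2/2) = 0, while every other word occurs in one document and gets log(2).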

  16. tf-idf • The tf-idf weighting scheme multiplies the tf and idf factors for each word • Words are important for a document if they are frequent relative to other words in the document and rare in other documents
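Putting the two factors together, a tf-idf weighting might be sketched as below. It assumes tf is the frequency divided by the document's maximum frequency and idf is log(N / n_i) with a natural logarithm; `tf_idf_vectors` is an illustrative name.

```python
import math

def tf_idf_vectors(raw_vectors):
    """Weight each raw frequency by (f / max f in document) * log(N / n_i)."""
    n_docs = len(raw_vectors)
    n_terms = len(raw_vectors[0])
    doc_freq = [sum(1 for v in raw_vectors if v[i] > 0) for i in range(n_terms)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    weighted = []
    for vector in raw_vectors:
        max_frequency = max(vector) or 1  # avoid dividing by zero for empty documents
        weighted.append([(f / max_frequency) * w for f, w in zip(vector, idf)])
    return weighted

raw = [[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]  # documents A and B
weights = tf_idf_vectors(raw)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy case the word "a" ends up with weight 0 in both documents because it appears in every document, which is the intended effect of the idf factor.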

  17. Part 3 Vocabulary Spectral Analysis

  18. Subject-clustering academic webs through text similarity 1 • Create a collection of virtual documents, each consisting of all web pages sharing a common domain name in a university. • Doc. 1 = cs.auckland.ac.nz 14,521 pages • Doc. 2 = www.auckland.ac.nz 3,463 pages • … • Doc. 760 = www.vuw.ac.nz 4,125 pages
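One way to build such virtual documents is to group page text by host name. The sketch below assumes the crawl is available as (URL, text) pairs; the sample pages and the function name `build_virtual_documents` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def build_virtual_documents(pages):
    """Concatenate the text of all pages that share a domain name."""
    grouped = defaultdict(list)
    for url, text in pages:
        domain = urlparse(url).hostname
        grouped[domain].append(text)
    return {domain: " ".join(texts) for domain, texts in grouped.items()}

# Hypothetical crawl output: (URL, extracted page text)
pages = [
    ("http://www.vuw.ac.nz/", "welcome to the university"),
    ("http://www.vuw.ac.nz/about", "about the university"),
    ("http://www.auckland.ac.nz/", "study at auckland"),
]
virtual_docs = build_virtual_documents(pages)
print(sorted(virtual_docs))  # ['www.auckland.ac.nz', 'www.vuw.ac.nz']
```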

  19. Subject-clustering academic webs through text similarity 2 • Convert each virtual document into a tf-idf word vector • Identify clusters using k-means and VSM cosine measures • Rank words for importance in each ‘natural’ cluster using the Cluster Membership Indicator (CMI) • Manually filter out high-ranking words in undesired clusters • This destroys the natural clustering of the data to uncover weaker subject clustering
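The clustering step can be illustrated with a small k-means variant that assigns documents by cosine similarity (often called spherical k-means). This is a sketch under stated assumptions, not the presenter's implementation: centroids are seeded naively with the first k documents, and the toy vectors are invented.

```python
import math

def normalise(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)

def cosine(u, v):
    """For unit-length vectors the cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(u, v))

def kmeans_cosine(vectors, k, iterations=20):
    """Assign each vector to the centroid with the highest cosine similarity."""
    unit = [normalise(v) for v in vectors]
    centroids = [list(u) for u in unit[:k]]  # assumption: seed with first k documents
    labels = None
    for _step in range(iterations):
        new_labels = [
            max(range(k), key=lambda c: cosine(u, centroids[c])) for u in unit
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [u for u, label in zip(unit, labels) if label == c]
            if members:
                mean = [sum(column) / len(members) for column in zip(*members)]
                centroids[c] = normalise(mean)
    return labels

# Invented tf-idf-like vectors: documents 0 and 2 share one dominant term,
# documents 1 and 3 share another.
toy_vectors = [[1, 0, 0], [0, 1, 0.1], [0.9, 0.1, 0], [0, 1, 0]]
labels = kmeans_cosine(toy_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Results from k-means depend on the initial centroids, so a practical run would typically try several seeds.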

  20. Cluster Membership Indicator • For a cluster C of documents and tf-idf weights $w_{ij}$: [formula not preserved in the transcript] • The next slide shows the top CMI weights for an undesired non-subject cluster

  21. Eliminating low frequency words • Can test whether removing low frequency words increases or decreases subject clustering tendency • E.g. are they spelling mistakes? • Need partially correct subject clusters • Compare similarity of documents within cluster to similarity with documents outside cluster
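The comparison described here can be sketched as follows. Several choices are assumptions rather than details from the presentation: "low frequency" is taken to mean appearing in few documents, the clustering tendency is summarised as mean within-cluster cosine similarity minus mean similarity to outside documents, and the toy vectors are contrived so that every rare word is unique to one document.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0

def mean_similarity(vectors, pairs):
    sims = [cosine_similarity(vectors[i], vectors[j]) for i, j in pairs]
    return sum(sims) / len(sims) if sims else 0.0

def clustering_tendency(vectors, cluster):
    """Mean within-cluster similarity minus mean similarity to outside documents."""
    inside = sorted(cluster)
    outside = [i for i in range(len(vectors)) if i not in cluster]
    within_pairs = [(i, j) for i in inside for j in inside if i < j]
    across_pairs = [(i, j) for i in inside for j in outside]
    return mean_similarity(vectors, within_pairs) - mean_similarity(vectors, across_pairs)

def drop_low_frequency(vectors, min_docs):
    """Keep only the terms that occur in at least min_docs documents."""
    n_terms = len(vectors[0])
    keep = [i for i in range(n_terms)
            if sum(1 for v in vectors if v[i] > 0) >= min_docs]
    return [[v[i] for i in keep] for v in vectors]

# Contrived data: terms 0-1 are subject words, terms 2-5 each occur in one document.
vectors = [
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1],
]
cluster = {0, 1}  # a partially correct subject cluster (hypothetical)
before = clustering_tendency(vectors, cluster)
after = clustering_tendency(drop_low_frequency(vectors, min_docs=2), cluster)
print(round(before, 3), round(after, 3))  # 0.5 1.0
```

In this toy case removing the single-document words raises the tendency score; on real data the direction of the change is what the test is meant to discover.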

  22. Eliminating low frequency words

  23. Summary • For text-based academic subject web site clustering: • need to select vocabularies to break natural clustering and allow subject clustering • consider ignoring low frequency words because they do not have high clustering power • Need to automate the manual element as far as possible • The results can then form the basis of a visualization that can give feedback to the user on inter-subject connections