Using TF-IDF Anomalies to Cluster Documents on Subject Matter

Natural Language Processing And Computational Linguistics. Using TF-IDF Anomalies to Cluster Documents on Subject Matter. An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies. Whitney St.Charles Research Alliance in Math and Science 2007 Mentors:


Natural Language Processing and Computational Linguistics

Using TF-IDF Anomalies to Cluster Documents on Subject Matter

An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies

Whitney St.Charles

Research Alliance in Math and Science 2007

Mentors:

Yu (Cathy) Jiao, Ph.D.

Robert Patton, Ph.D.

Computational Sciences and Engineering Division

Purposes of document clustering
  • Data overabundance
    • YouTube generates 200 terabytes of data per day
  • How do we sift through those kinds of quantities?
    • Searching
      • Reduces the set tremendously
    • Document Clustering
      • Is a knowledge discovery technique
      • Categorizes results into meaningful groups
      • Allows the user to browse quickly to the target
Document clustering users
  • Financial analysts
    • Identify certain trends to develop forecasts about a particular company
  • Business Intelligence
    • Identify products that are associated with or dependent upon one another
  • Military
    • Identify terrorist cells from blog activity and movement of materials
  • You!
    • Narrow down hundreds of thousands of internet search results to find the kinds of sites you want
Current document clustering technique
  • A word-by-word comparison of each document is made to determine similarity
  • Unfortunately, this method…
    • Does not handle context very well
    • Compares several hundred to several thousand words for each document
      • Is very computationally expensive
      • Requires expensive SIMD machines
Contributions to the field
  • Identify only those words which are more indicative of the subject matter
    • If “airline” occurs 20% more often than is “normal,” it is likely related to the subject
  • Examine both simple and complex noun phrases to address the context of the document
  • Generate much smaller vectors, containing an average of 82% fewer terms!
  • Cluster more accurately because only “important” words are chosen
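
The selection idea above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the function name `anomalous_terms`, the toy baseline frequencies, and the choice to skip terms with no baseline statistics are all assumptions made for the example.

```python
from collections import Counter

def anomalous_terms(tokens, baseline_rel_freq, threshold=0.20):
    """Return terms whose relative frequency in `tokens` exceeds the
    baseline relative frequency by more than `threshold` (20% by default)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    selected = {}
    for term, count in counts.items():
        observed = count / total
        expected = baseline_rel_freq.get(term)
        if not expected:
            continue  # assumption: ignore terms with no baseline statistics
        excess = (observed - expected) / expected  # relative excess over "normal"
        if excess > threshold:
            selected[term] = excess
    return selected

# Hypothetical document and baseline, for illustration only.
doc = ["airline", "the", "airline", "the", "the", "pilot", "the", "a", "a", "the"]
baseline = {"the": 0.5, "a": 0.2, "airline": 0.01, "pilot": 0.1}
print(anomalous_terms(doc, baseline))
```

Only the terms that survive this filter would go into the document's vector, which is where the reduction in vector size comes from.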
Establishing the baseline
  • Train the program to recognize what is “normal” for a given term
    • Need an entire English language corpus
  • Corpus: a large, structured set of texts compiled to be representative of a language
    • Uses hundreds of thousands of words in every allowable way
  • Using a corpus, the program can
      • Establish usage statistics
      • Learn linguistic rules

Example: The Brown Corpus http://www.edict.com.hk/concordance/WWWConcappE.htm

Part-of-speech tagging
  • Tags every word in the sentence with the correct part-of-speech
  • Achieves an accuracy of 97.24%
    • High accuracy is necessary because every token extraction method depends on correct tagging
  • Passes the tagged sentence to the token extractor

The/dt desperate/adj summer/n intern/n tried/vbd to/to keep/vb everyone/n awake/adj.
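
As a small sketch of how a tagged sentence in this `word/tag` notation might be handed to a token extractor, the snippet below splits it into (word, tag) pairs. The function name `parse_tagged` is hypothetical, and the tag names simply follow the example sentence on this slide.

```python
def parse_tagged(sentence):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")   # split on the last slash
        pairs.append((word, tag.rstrip(".")))  # drop sentence-final period
    return pairs

tagged = parse_tagged(
    "The/dt desperate/adj summer/n intern/n tried/vbd "
    "to/to keep/vb everyone/n awake/adj."
)
print(tagged[:4])
```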

Token extractor
  • Extracts
    • Words
    • Simple noun phrases
    • Complex noun phrases

[Figure: a document enters the token extractor, which outputs words and noun phrases]

Word extraction
  • Uses POS tagged data to identify only adjectives, verbs, and nouns
  • Uses the Porter stemmer to identify unique words
    • Cuts common suffixes such as -ing, -tion, -e, -es, -s
      • Example: “recreation” and “recreational” are both identified as “recreat”
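
A rough sketch of the word extraction step, assuming (word, tag) pairs as input and the tag names used in the tagging example. The stemmer is passed in as a function so that a real Porter stemmer could be plugged in; the `toy_stem` shown here is only a stand-in that strips a trailing "s" and is not the Porter algorithm.

```python
# Assumed tag names for nouns, verbs, and adjectives.
CONTENT_TAGS = {"n", "vb", "vbd", "adj"}

def extract_words(tagged, stem=lambda w: w):
    """Keep nouns, verbs, and adjectives; lowercase and stem them;
    return the set of unique stems."""
    return {stem(word.lower()) for word, tag in tagged if tag in CONTENT_TAGS}

def toy_stem(word):
    """Stand-in stemmer (NOT Porter): strips a single trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

sample = [("The", "dt"), ("desperate", "adj"), ("summer", "n"),
          ("interns", "n"), ("tried", "vbd"), ("to", "to"), ("keep", "vb")]
print(sorted(extract_words(sample, stem=toy_stem)))
```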
Why nouns?
  • Are named entities
  • Answer the question “What”
  • Are less ambiguous than verbs
    • Example: “cook up a good meal” or “cook up a new solution”
Simple noun phrase extraction
  • Accepts only consecutive nouns
    • Example: summer intern, union representative
  • Provides a set of short, highly descriptive phrases
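
Accepting only consecutive nouns can be sketched as a single pass over the tagged tokens. The function name, the `"n"` noun tag, and the minimum run length of two are assumptions for illustration.

```python
def simple_noun_phrases(tagged, noun_tag="n", min_len=2):
    """Return runs of at least `min_len` consecutive nouns as phrases."""
    phrases, run = [], []
    # A sentinel non-noun at the end flushes any run still in progress.
    for word, tag in list(tagged) + [("", None)]:
        if tag == noun_tag:
            run.append(word)
        else:
            if len(run) >= min_len:
                phrases.append(" ".join(run))
            run = []
    return phrases

sentence = [("The", "dt"), ("desperate", "adj"), ("summer", "n"), ("intern", "n"),
            ("tried", "vbd"), ("to", "to"), ("keep", "vb"),
            ("everyone", "n"), ("awake", "adj")]
print(simple_noun_phrases(sentence))
```

With `min_len=2`, a lone noun such as "everyone" is left to the word extractor rather than reported as a phrase.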
Complex noun phrase extraction techniques
  • Static rule-based / finite-state automata
    • Rely on the aptitude of the linguist formulating the rule set
  • Machine Learning
    • Rely on the “completeness” of the training set
Static rule-based extraction

[Figure: finite-state automaton for noun phrase extraction with states S0, S1, and NP; transition labels: noun/pronoun/determiner; determiner/adjective; noun/pronoun; adjective; relative clause/prepositional phrase/noun]
  • Establishes a list of linguistic rules
    • A determiner preceding a noun marks the beginning of a noun phrase
    • A determiner may not precede a noun phrase
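
To make the rule-based idea concrete, here is a minimal finite-state chunker. The state names S0, S1, and NP are borrowed from the slide's diagram, but the exact transitions of that diagram are not recoverable from the transcript, so the ones below are assumptions: a determiner or adjective opens a candidate phrase (S1), a noun completes it (NP), and any other tag closes or discards it.

```python
def chunk_noun_phrases(tagged):
    """Tiny finite-state noun phrase chunker over (word, tag) pairs."""
    phrases, current, state = [], [], "S0"
    for word, tag in list(tagged) + [("", "END")]:
        # First leave the current state if this tag cannot continue it.
        if state == "NP" and tag != "n":
            phrases.append(" ".join(current))  # accept the finished phrase
            current, state = [], "S0"
        elif state == "S1" and tag not in ("adj", "n"):
            current, state = [], "S0"          # no noun followed: discard
        # Then take the transition allowed from the (possibly reset) state.
        if state == "S0" and tag in ("dt", "adj"):
            current, state = [word], "S1"
        elif state == "S0" and tag in ("n", "pron"):
            current, state = [word], "NP"
        elif state == "S1" and tag == "adj":
            current.append(word)
        elif state in ("S1", "NP") and tag == "n":
            current.append(word)
            state = "NP"
    return phrases

sentence = [("The", "dt"), ("desperate", "adj"), ("summer", "n"), ("intern", "n"),
            ("tried", "vbd"), ("to", "to"), ("keep", "vb"),
            ("everyone", "n"), ("awake", "adj")]
print(chunk_noun_phrases(sentence))
```

Every rule here had to be anticipated by whoever wrote the automaton, which is the dependence on the rule author that the slides point out.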
Static extraction shortcomings
  • Unanticipated rules
    • The subjective nature of language
  • Difficulty finding non-recursive, base NPs
    • [The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC]NP lives [next door]NP.
    • [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.
  • Structural ambiguity
Structural ambiguity example

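“I saw the man with the telescope.”

  • Reading 1: “with the telescope” modifies “saw” (the telescope was used to see the man)
  • Reading 2: “with the telescope” modifies “the man” (the man has the telescope)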

Machine learning extraction

  • Is all about TRAINING
    • Uses a corpus
  • Is based on statistics
    • The more it sees a particular occurrence, the more likely it is to prefer it
      • Makes better educated guesses about structural ambiguity
      • Discovers thousands of unanticipated rules
Transformation-based complex noun phrase extraction

An ‘error-driven’ approach for learning an ordered set of rules

1. Generate all rules that correct at least one error.

2. For each rule:

(a) Apply to a copy of the most recent state of the training set.

(b) Score result

3. Select rule with best score.

4. Update training set by applying selected rule.

5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
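
The five steps above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the presenter's system: each token carries a part-of-speech tag and a chunk label ("I" for inside a noun phrase, "O" for outside), and the only rule template is "if the POS tag is X and the label is A, change it to B". The score is the net number of errors a rule removes, and the threshold check is done before the rule is applied so that a rule scoring below T is never kept.

```python
def apply_rule(rule, pos_tags, labels):
    """Apply a (pos, old_label, new_label) rule to every matching token."""
    pos, old, new = rule
    return [new if (p == pos and l == old) else l
            for p, l in zip(pos_tags, labels)]

def count_errors(labels, gold):
    return sum(l != g for l, g in zip(labels, gold))

def learn_rules(pos_tags, gold, initial, threshold=1):
    labels, learned = list(initial), []
    while True:
        # Step 1: generate every rule that corrects at least one error.
        candidates = {(p, l, g)
                      for p, l, g in zip(pos_tags, labels, gold) if l != g}
        # Steps 2-3: score each rule on a copy and keep the best one.
        errors_before = count_errors(labels, gold)
        best_rule, best_score = None, 0
        for rule in sorted(candidates):
            trial = apply_rule(rule, pos_tags, labels)
            score = errors_before - count_errors(trial, gold)
            if score > best_score:
                best_rule, best_score = rule, score
        # Step 5: stop when no rule scores at least the threshold T.
        if best_rule is None or best_score < threshold:
            return learned
        # Step 4: update the training set with the selected rule.
        labels = apply_rule(best_rule, pos_tags, labels)
        learned.append(best_rule)

pos_tags = ["dt", "adj", "n", "n", "vbd", "to", "vb", "n", "adj"]
gold     = ["I",  "I",   "I", "I", "O",   "O",  "O",  "I", "O"]
rules = learn_rules(pos_tags, gold, initial=["O"] * len(pos_tags))
print(rules)
```

In this toy run the rule for adjectives is never learned, because it would fix one label and break another; richer rule templates that look at neighbouring tags would be needed to resolve that.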

Determining anomaly sets
  • TF-IDF: Term Frequency – Inverse Document Frequency
    • Number of local occurrences of term multiplied by uniqueness measure of term in document set
  • TF-ICF: Term Frequency – Inverse Corpus Frequency
    • Average number of corpus occurrences of term multiplied by uniqueness measure of term in the corpus
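
As a worked illustration of the TF-IDF definition, the sketch below multiplies each term's local count by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is one common weighting variant and is an assumption here; the slides do not state the exact formula used.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.
    score = local term count * log(N / document frequency)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({term: count * math.log(n_docs / doc_freq[term])
                       for term, count in counts.items()})
    return scores

# Hypothetical toy documents, already tokenized.
docs = [["airline", "pilot", "airline"],
        ["pilot", "stock"],
        ["stock", "pilot", "market"]]
scores = tf_idf(docs)
print(scores[0])
```

A term that appears in every document ("pilot" here) scores zero. Taking the document frequencies from a fixed reference corpus instead of from the document set itself would give a TF-ICF-style score under the same assumptions.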
Clustering the data
  • Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
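
UPGMA is agglomerative clustering with average linkage: at each step the two clusters with the smallest mean pairwise distance are merged. The sketch below is a naive version written for readability; the function name, the `distance(i, j)` callback, and stopping at a requested number of clusters are assumptions for illustration.

```python
from itertools import combinations

def upgma(distance, n_items, n_clusters):
    """Merge items 0..n_items-1 by average linkage until
    `n_clusters` clusters remain. `distance(i, j)` returns a float."""
    clusters = [[i] for i in range(n_items)]

    def avg_dist(a, b):
        # Mean distance over all cross-cluster pairs (the UPGMA criterion).
        return sum(distance(i, j) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(n_clusters, 1):
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)

# Hypothetical one-dimensional "documents" for illustration.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
result = upgma(lambda i, j: abs(points[i] - points[j]), len(points), 2)
print(result)
```

For documents, `distance` would typically compare the term vectors, and smaller vectors make each of these comparisons cheaper.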
Performance Metrics Used
  • Precision (P) = (number of correct responses) / (number of responses)
  • Recall (R) = (number of correct responses) / (number correct in key)
  • F-measure = 2RP / (R + P)
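
A short sketch of the three metrics, treating the system's responses and the answer key as sets. The function name and the sample document IDs are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall_f(responses, key):
    """Compute precision, recall, and F-measure from two collections."""
    responses, key = set(responses), set(key)
    correct = len(responses & key)
    precision = correct / len(responses) if responses else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r, f)
```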


RESULTS

80%

89%

With 82% fewer comparisons!

Future Work
  • Determine clustering results for both simple and complex noun phrases
  • Could be applied to other clustering techniques, such as swarming
Acknowledgements
  • The Research Alliance in Math and Science program
  • Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy.
  • Dr. Cathy Jiao
  • Dr. Robert Patton
  • Dr. Thomas Potok