
Natural Language Processing

And Computational Linguistics

Using TF-IDF Anomalies to Cluster Documents on Subject Matter

An Analysis using Word, Simple Noun Phrase, and Complex Noun Phrase Frequencies

Whitney St.Charles

Research Alliance in Math and Science 2007

Mentors:

Yu (Cathy) Jiao, Ph.D.

Robert Patton, Ph.D.

Computational Sciences and Engineering Division


Purposes of document clustering

  • Data overabundance

    • YouTube generates 200 terabytes of data per day

  • How do we sift through those kinds of quantities?

    • Searching

      • Reduces the set tremendously

    • Document Clustering

      • Is a knowledge discovery technique

      • Categorizes results into meaningful groups

      • Allows the user to browse quickly to the target


Document clustering users

  • Financial analysts

    • Identify certain trends to develop forecasts about a particular company

  • Business Intelligence

    • Identify products that are associated with or dependent upon one another

  • Military

    • Identify terrorist cells from blog activity and movement of materials

  • You!

    • Narrow down hundreds of thousands of internet search results to find the kinds of sites you want


Current document clustering technique

  • A word-by-word comparison of each document is made to determine similarity

  • Unfortunately, this method…

    • Does not handle context very well

    • Compares several hundred to several thousand words per document

      • Is very computationally expensive

      • Requires expensive SIMD machines
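The word-by-word comparison described above is typically a cosine similarity over full term-count vectors, which is what makes it so expensive at scale. A minimal sketch (the document strings are illustrative):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Word-by-word comparison: cosine of the angle between term-count vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Every unique word in either document contributes a dimension, so documents of several thousand words mean several thousand multiplications for every document pair.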


Contributions to the field

  • Identify only those words which are more indicative of the subject matter

    • If “airline” occurs 20% more often than is “normal,” it has something to do with the subject

  • Examine both simple and complex noun phrases to address the context of the document

  • Generate much smaller vectors, containing an average of 82% fewer terms!

  • Cluster more accurately because only “important” words are chosen



Establishing the baseline

  • Train the program to recognize what is “normal” for a given term

    • Need an entire English language corpus

  • Corpus: a large, structured set of texts compiled to be representative of a language

    • samples hundreds of thousands of words across their allowable uses

  • Using a corpus, the program can

    • Establish usage statistics

    • Learn linguistic rules

      Example: The Brown Corpus http://www.edict.com.hk/concordance/WWWConcappE.htm
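With such a corpus in hand, the “normal” rate for a term is just its relative frequency over all corpus text, and a document's term can be flagged when its local rate exceeds that baseline. A minimal sketch (the tiny corpus and the 1.2× threshold are illustrative assumptions, not the authors' actual setup):

```python
from collections import Counter

def baseline_frequencies(corpus_docs):
    """Relative frequency of each term across the whole corpus."""
    counts, total = Counter(), 0
    for doc in corpus_docs:
        words = doc.lower().split()
        counts.update(words)
        total += len(words)
    return {term: n / total for term, n in counts.items()}

def anomalous_terms(doc, baseline, threshold=1.2):
    """Terms occurring at least `threshold` times their corpus-wide rate.
    Terms unseen in the corpus are treated as maximally anomalous."""
    words = doc.lower().split()
    n = len(words)
    return {t for t, c in Counter(words).items()
            if c / n >= threshold * baseline.get(t, 0.0)}
```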



    Part-of-speech tagging

    • Tags every word in the sentence with the correct part-of-speech

    • Achieves an accuracy of 97.24%

      • Is necessary because token extraction methods are each dependent upon correct tagging

    • Passes the tagged sentence to the token extractor

    The/dt desperate/adj summer/n intern/n

    tried/vbd to/to keep/vb everyone/n awake/adj.
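A tagger reaching 97.24% accuracy is statistical, but the mechanics can be sketched with a lexicon-lookup tagger using the tag set from the example above (the lexicon is a made-up fragment; real taggers resolve ambiguous words from context):

```python
# Hypothetical mini-lexicon; a real tagger is trained on an annotated corpus.
LEXICON = {"the": "dt", "desperate": "adj", "summer": "n", "intern": "n",
           "tried": "vbd", "to": "to", "keep": "vb", "everyone": "n",
           "awake": "adj"}

def tag(sentence):
    """Tag each word via lexicon lookup, defaulting unknown words to noun."""
    return [(w, LEXICON.get(w.lower(), "n")) for w in sentence.split()]
```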


    Token extractor

    • Extracts

      • Words

      • Simple noun phrases

      • Complex noun phrases

    Document

    Words

    Noun phrases


    Word extraction

    • Uses POS tagged data to identify only adjectives, verbs, and nouns

    • Uses the Porter stemmer to identify unique words

      • Cuts common suffixes such as -ing, -tion, -e, -es, -s

        • Example: “recreation” and “recreational” are both identified as “recreat”
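The Porter stemmer itself applies ordered phases of suffix rules with conditions on the remaining stem; the idea can be sketched with a crude longest-suffix stripper. This is NOT the real algorithm (the suffix list here is chosen only to reproduce the “recreat” example above):

```python
# Illustrative suffix list, tried longest-first -- not Porter's actual rules.
SUFFIXES = ["ional", "ion", "ing", "es", "s", "e"]

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```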


    Why nouns?

    • Are named entities

    • Answer the question “What”

    • Are less ambiguous than verbs

      • Example: “cook up a good meal” or “cook up a new solution”


    Simple noun phrase extraction

    • Accepts only consecutive nouns

      • Example: summer intern, union representative

    • Provides a set of short, highly descriptive phrases
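Given POS-tagged tokens, simple noun phrase extraction is a single pass that collects maximal runs of consecutive nouns (tag names follow the earlier example; the two-noun minimum is an assumption):

```python
def simple_noun_phrases(tagged):
    """Collect runs of two or more consecutive nouns from (word, tag) pairs."""
    phrases, run = [], []
    for word, pos in tagged:
        if pos == "n":
            run.append(word)
            continue
        if len(run) >= 2:               # a run just ended; keep it if long enough
            phrases.append(" ".join(run))
        run = []
    if len(run) >= 2:                   # handle a run at the end of the sentence
        phrases.append(" ".join(run))
    return phrases
```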


    Complex noun phrase extraction techniques

    • Static Rule-based/ Finite State Automata

      • Rely on the aptitude of the linguist formulating the rule set

    • Machine Learning

      • Rely on the “completeness” of the training set


    Static rule-based extraction

    [Finite-state automaton diagram: states S0, S1, and NP; transitions labeled with determiner, adjective, noun, pronoun, relative clause, and prepositional phrase]

    • Establishes a list of linguistic rules

      • A determiner preceding a noun marks the beginning of a noun phrase

      • A determiner may not precede a noun phrase
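Rules like these compile into a small automaton over tag sequences. A minimal recognizer for one hypothetical pattern (optional determiner, any adjectives, then one or more nouns), simplifying the diagram's states S0, S1, and NP:

```python
def is_noun_phrase(tags):
    """Accept tag sequences of the form (dt)? (adj)* (n)+ -- a hypothetical rule."""
    i = 0
    if i < len(tags) and tags[i] == "dt":       # optional determiner opens the NP
        i += 1
    while i < len(tags) and tags[i] == "adj":   # any number of adjectives
        i += 1
    start = i
    while i < len(tags) and tags[i] == "n":     # one or more head nouns
        i += 1
    return i > start and i == len(tags)
```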


    Static extraction shortcomings

    • Unanticipated rules

      • The subjective nature of language

    • Difficulty finding non-recursive, base NP’s

      • [The man [whose red hat [I borrowed yesterday]RC ]RC [in the street]PP [that is next to my house]RC ]NP lives [next door]NP.

      • [The man]NP whose [red hat]NP I borrowed [yesterday]NP in [the street]NP that is next to [my house]NP lives [next door]NP.

    • Structural ambiguity


    Structural ambiguity example

    “I saw the man with the telescope.”


    Machine learning extraction

    • Is all about training

      • Uses a corpus

    • Is based on statistics

      • The more it sees a particular occurrence, the more likely it is to prefer it

        • Makes better educated guesses about structural ambiguity

        • Discovers thousands of unanticipated rules


    Transformation-based complex noun phrase extraction

    An ‘error-driven’ approach for learning an ordered set of rules

    1. Generate all rules that correct at least one error.

    2. For each rule:

    (a) Apply to a copy of the most recent state of the training set.

    (b) Score the result.

    3. Select rule with best score.

    4. Update training set by applying selected rule.

    5. Stop if score is smaller than some pre-set threshold T; otherwise repeat from step 1.
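The five steps above can be sketched as a toy learner whose candidate rules all have the form “change tag A to B when the previous tag is C” (the single rule template and the data are illustrative; Brill-style learners use many templates):

```python
def apply_rule(tags, rule):
    """Rule (old, new, prev): retag `old` as `new` where the previous tag is `prev`."""
    old, new, prev = rule
    return [new if i > 0 and t == old and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

def tbl_learn(gold_tags, initial_tags, threshold=1):
    """Toy error-driven transformation-based learner over one training sentence."""
    tags = list(initial_tags)
    learned = []
    while True:
        # 1. Generate candidate rules that correct at least one current error
        #    (the template never touches position 0, which has no left context).
        candidates = {(tags[i], gold_tags[i], tags[i - 1])
                      for i in range(1, len(tags)) if tags[i] != gold_tags[i]}
        # 2. Apply each rule to a copy of the current state and score it.
        def score(rule):
            trial = apply_rule(tags, rule)
            return sum(a == b for a, b in zip(trial, gold_tags))
        current = sum(a == b for a, b in zip(tags, gold_tags))
        best = max(candidates, key=score, default=None)
        # 5. Stop when the best improvement falls below the threshold.
        if best is None or score(best) - current < threshold:
            return learned
        # 3-4. Keep the best rule and update the training set by applying it.
        learned.append(best)
        tags = apply_rule(tags, best)
```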


    Determining anomaly sets

    • TF-IDF: Term Frequency – Inverse Document Frequency

      • Number of local occurrences of a term multiplied by a measure of the term's uniqueness in the document set

    • TF-ICF: Term Frequency – Inverse Corpus Frequency

      • Average number of corpus occurrences of a term multiplied by a measure of the term's uniqueness in the corpus
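The TF-IDF side can be sketched directly from the definition; the document strings are made up, and the authors' TF-ICF variant would replace the document-set statistics with corpus-wide ones:

```python
from collections import Counter
from math import log

def tfidf(docs):
    """TF-IDF weight for each term in each document: local count times
    log(N / number of documents containing the term)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    return [{t: c * log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in tokenized]
```

Terms appearing in every document get weight zero, while rare terms are boosted, which is exactly why they surface as subject-matter anomalies.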



    Clustering the data

    • Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
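UPGMA repeatedly merges the two clusters whose members have the smallest average pairwise distance. A pure-Python sketch over a precomputed distance matrix (stopping at a target cluster count is an assumption; one could instead build the full dendrogram):

```python
def upgma(dist, n_clusters):
    """Average-linkage (UPGMA-style) agglomerative clustering on a
    symmetric distance matrix `dist` (list of lists)."""
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(a, b):
        # Average of all pairwise distances between the two clusters' members.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the pair of clusters with the smallest average distance.
        pairs = [(avg_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```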


    Performance Metrics Used

    • Precision = (number of correct responses) / (total number of responses)

    • Recall = (number of correct responses) / (number correct in the answer key)

    • F-measure = 2RP / (R + P)
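The three formulas in code (the numbers in the usage below are illustrative, not the presentation's results):

```python
def scores(correct, responses, key_size):
    """Precision, recall, and balanced F-measure as defined above."""
    p = correct / responses       # fraction of responses that are right
    r = correct / key_size        # fraction of the answer key recovered
    f = 2 * r * p / (r + p)       # harmonic mean of precision and recall
    return p, r, f
```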



    RESULTS

    [Results chart showing 80% and 89%]

    With 82% fewer comparisons!


    Future Work

    • Determine clustering results for both simple and complex noun phrases

    • Could be applied to other clustering techniques, such as swarming


    Acknowledgements

    • The Research Alliance in Math and Science program

    • Computational Sciences and Engineering Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy.

    • Dr. Cathy Jiao

    • Dr. Robert Patton

    • Dr. Thomas Potok