An Efficient Concept-Based Mining Model for Enhancing Text Clustering. Shady Shehata, Fakhri Karray, and Mohamed S. Kamel. TKDE, 2010. Presented by Wen-Chung Liao, 2010/11/03. Outline: Motivation; Objectives; Thematic Roles Background; Concept-Based Mining Model; Experiments; Conclusions.


An Efficient Concept-Based Mining Model for Enhancing Text Clustering

Shady Shehata, Fakhri Karray, and Mohamed S. Kamel

TKDE, 2010

Presented by Wen-Chung Liao

2010/11/03

Outline
  • Motivation
  • Objectives
  • THEMATIC ROLES BACKGROUND
  • CONCEPT-BASED MINING MODEL
  • Experiments
  • Conclusions
  • Comments
Motivation
  • Vector Space Model (VSM)
    • represents each document as a feature vector of the terms (words or phrases) in the document.
    • Each feature vector contains term weights (usually term frequencies) of the terms in the document.
    • Term frequency captures the importance of a term within a document only.
  • However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other.
  • Thus, the underlying text mining model should identify the terms that capture the semantics of the text.
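To make the limitation concrete, here is a minimal Python sketch of a raw term-frequency vector. The sample sentence, the vocabulary, and the function name `term_frequency_vector` are illustrative assumptions, not material from the paper or the slides.

```python
from collections import Counter

def term_frequency_vector(document, vocabulary):
    """Return raw term counts for `document`, ordered by `vocabulary`."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical document and vocabulary, chosen only for illustration.
doc = "the researchers created sheets and the sheets lead to muscles"
vocab = ["researchers", "sheets", "muscles", "the"]

tf_vector = term_frequency_vector(doc, vocab)
print(tf_vector)  # [1, 2, 1, 2]
```

In this toy case "the" and "sheets" receive the same weight, although only "sheets" says anything about the topic; frequency alone cannot tell them apart.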
Objectives
  • A new concept-based mining model is introduced.
    • It captures the semantic structure of each term within a sentence and a document, rather than only the frequency of the term within a document.
    • It effectively discriminates between unimportant terms and terms that hold the concepts representing the sentence meaning.
    • Three measures are computed to analyze concepts at the sentence, document, and corpus levels.
    • A new concept-based similarity measure is proposed.
      • It is based on a combination of sentence-based, document-based, and corpus-based concept analysis.
    • The measure has a more significant effect on clustering quality because the similarity is insensitive to noisy terms.
THEMATIC ROLES BACKGROUND
  • Verb argument structure (e.g., “John hits the ball”):
    • “hits” is the verb.
    • “John” and “the ball” are the arguments of the verb “hits.”
  • Label: a label is assigned to each argument.
    • e.g., “John” has the subject (or Agent) label; “the ball” has the object (or Theme) label.
  • Term: either an argument or a verb.
    • A term is either a word or a phrase.
  • Concept: a labeled term.
  • Generally, the semantic structure of a sentence can be characterized by a form of verb argument structure.
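The definitions above can be sketched as a small data structure. This is only an illustration: the class names `Concept` and `VerbArgumentStructure`, and the use of a "Verb" label for the verb itself, are assumptions made here and do not come from the paper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concept:
    """A labeled term: the term is a word or a phrase."""
    label: str
    term: str

@dataclass
class VerbArgumentStructure:
    """One verb together with its labeled arguments."""
    verb: str
    arguments: list = field(default_factory=list)

    def concepts(self):
        # The "Verb" label is a hypothetical choice for this sketch.
        return [Concept("Verb", self.verb)] + list(self.arguments)

# "John hits the ball"
structure = VerbArgumentStructure(
    verb="hits",
    arguments=[Concept("Agent", "John"), Concept("Theme", "the ball")],
)
labels = [c.label for c in structure.concepts()]
print(labels)  # ['Verb', 'Agent', 'Theme']
```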
CONCEPT-BASED MINING MODEL
  • Sentence-Based Concept Analysis
    • Calculating ctf of Concept c in Sentence s
      • The conceptual term frequency, ctf:
        • the number of occurrences of concept c in the verb argument structures of sentence s
        • a concept with a high ctf has the principal role of contributing to the meaning of s
        • a local measure on the sentence level
    • Calculating ctf of Concept c in Document d
      • Here ctf reflects the overall importance of concept c to the meaning of its sentences in document d.
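A rough Python sketch of the two ctf calculations follows. The slide does not give the document-level formula, so averaging the sentence-level values over the sentences of d is an assumption of this sketch; the function names and the toy data are likewise hypothetical.

```python
def ctf_sentence(concept, sentence_structures):
    """Count occurrences of `concept` across the verb argument structures of one sentence."""
    return sum(structure.count(concept) for structure in sentence_structures)

def ctf_document(concept, document):
    """Average the sentence-level ctf over all sentences (an assumed aggregation)."""
    if not document:
        return 0.0
    return sum(ctf_sentence(concept, s) for s in document) / len(document)

# Each sentence is a list of verb argument structures;
# each structure is a list of already-cleaned concept strings (toy data).
sentence_1 = [["researcher", "create", "nanotube"], ["nanotube", "lead", "muscle"]]
sentence_2 = [["nanotube", "conduct", "heat"]]
document = [sentence_1, sentence_2]

print(ctf_sentence("nanotube", sentence_1))  # 2
print(ctf_document("nanotube", document))    # 1.5
print(ctf_document("muscle", document))      # 0.5
```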
CONCEPT-BASED MINING MODEL
  • Document-Based Concept Analysis
    • The concept-based term frequency, tf:
      • the number of occurrences of a concept (word or phrase) c in the original document
      • a local measure on the document level
  • Corpus-Based Concept Analysis
    • The concept-based document frequency, df:
      • the number of documents containing concept c
      • used to reward concepts that appear in only a small number of documents
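The document-level and corpus-level measures can be sketched in a few lines of Python. The slide only says that df is used to reward rare concepts; the `log(N / df)` form used in `rarity_reward` is the standard inverse-document-frequency weighting and is assumed here, not quoted from the slides. All names and data are illustrative.

```python
import math

def concept_tf(concept, document_concepts):
    """Number of occurrences of `concept` in one document."""
    return document_concepts.count(concept)

def concept_df(concept, corpus):
    """Number of documents in `corpus` that contain `concept`."""
    return sum(1 for doc in corpus if concept in doc)

def rarity_reward(concept, corpus):
    """Assumed IDF-style reward: larger when fewer documents contain the concept."""
    df = concept_df(concept, corpus)
    return math.log(len(corpus) / df) if df else 0.0

# Toy corpus: each document is a list of cleaned concepts.
corpus = [
    ["nanotube", "muscle", "nanotube"],
    ["nanotube", "sheet"],
    ["cluster", "document"],
]

print(concept_tf("nanotube", corpus[0]))  # 2
print(concept_df("nanotube", corpus))     # 2
print(concept_df("cluster", corpus))      # 1
```

With this weighting, "cluster" (one document) gets a larger reward than "nanotube" (two documents).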
Example of Calculating ctf Measure

Texas and Australia researchers have created industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles.

  • Three verbs (created, made, and lead; shown in red on the original slide) represent the semantic structure of the sentence's meaning.
  • Each verb has its own arguments:
    • [ARG0 Texas and Australia researchers] have [TARGET created] [ARG1 industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles].
    • Texas and Australia researchers have created industry-ready sheets of [ARG1 materials] [TARGET made] [ARG2 from nanotubes that could lead to the development of artificial muscles].
    • Texas and Australia researchers have created industry-ready sheets of materials made from [ARG1 nanotubes] [R-ARG1 that] [ARGM-MOD could] [TARGET lead] [ARG2 to the development of artificial muscles].
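As an illustration of how the bracketed annotations can be read programmatically, the following Python sketch extracts (label, text) pairs from one labeled sentence. It assumes the bracket notation shown on the slide, with no nested brackets; the pattern `LABELED_SPAN` and the function `parse_structure` are hypothetical helpers, not part of any semantic role labeling tool.

```python
import re

# Matches one bracketed span such as "[ARG1 nanotubes]" or "[TARGETcreated]".
LABELED_SPAN = re.compile(r"\[(ARGM-[A-Z]+|R-ARG\d|ARG\d|TARGET)\s*([^\]]+)\]")

def parse_structure(labeled_sentence):
    """Return (label, text) pairs for every bracketed span in the sentence."""
    return [(label, text.strip())
            for label, text in LABELED_SPAN.findall(labeled_sentence)]

# The third verb argument structure from the example sentence.
third_structure = (
    "Texas and Australia researchers have created industry-ready sheets of "
    "materials made from [ARG1 nanotubes] [R-ARG1 that] [ARGM-MOD could] "
    "[TARGET lead] [ARG2 to the development of artificial muscles]."
)

pairs = parse_structure(third_structure)
verb = next(text for label, text in pairs if label == "TARGET")
print(verb)      # lead
print(pairs[0])  # ('ARG1', 'nanotubes')
```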
A cleaning step
  • Remove stop words
  • Stem the remaining words
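A minimal sketch of such a cleaning step in Python is shown below. The slide does not say which stop-word list or stemmer is used, so both are stand-ins: `STOP_WORDS` is a tiny hand-picked set and `naive_stem` is a crude suffix stripper, far simpler than a real stemmer such as Porter's.

```python
# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "from", "that", "have", "could"}

def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Lowercase, drop punctuation and stop words, then stem what is left."""
    tokens = [token.strip(".,;:") for token in text.lower().split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]

cleaned = clean("Researchers have created sheets of materials made from nanotubes.")
print(cleaned)
```

A real pipeline would swap in an established stop-word list and stemmer; the point here is only the order of the two operations.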
A Concept-Based Similarity Measure
  • The concept-based similarity between two documents, d1 and d2, is calculated by:

[Equation image not preserved in the transcript; its slide annotations read "m matching concepts", "d1", and "d2".]

  • The single-term similarity measure is:

(using the TF-IDF weighting scheme)
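The single-term formula itself did not survive in the transcript. As a hedged illustration of a single-term baseline, the sketch below computes the standard cosine similarity between TF-IDF vectors in Python. The weighting `tf * log(N / df)` and all names are assumptions; the paper's exact normalization may differ.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (term -> weight) using tf * log(N / df)."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus of already-cleaned documents.
docs = [
    ["nanotube", "sheet", "muscle"],
    ["nanotube", "sheet", "material"],
    ["cluster", "document", "similarity"],
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01 > sim_02)  # True
```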

Mathematical Framework
  • Assume that the content of document d2 is changed by △.
  • Sensitivity analysis:
  • Assume that each concept consists of one word.
  • In this case, each concept is a word and A = 1. (?)
  • By approximation, the d1c value is larger than the d1w value, and the △d2c value is larger than the △d2w value.
  • Hence, the concept-based similarity is more sensitive than the cosine similarity.
  • This means that the concept-based model analyzes the similarity between two documents more deeply than the traditional approaches.
Concept-Based Analysis Algorithm

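[Figure residue, laid out as a grid over documents d1 to d4; cells below the diagonal are labeled L.]

      d1  d2  d3  d4
  d1
  d2  L
  d3  L   L
  d4  L   L   L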

EXPERIMENTAL RESULTS

Evaluation methods

  • Four data sets
    • 23,115 ACM article abstracts collected from the ACM Digital Library
      • five main categories
    • 12,902 documents from the Reuters-21578 data set
      • five category sets
    • 361 samples from the Brown corpus
      • main categories: press: reportage; press: reviews; religion; skills and hobbies; popular lore; belles-lettres; learned; fiction: science; fiction: romance; and humor
    • 20,000 messages collected from 20 Usenet newsgroups
  • Three standard document clustering techniques:
    • Hierarchical Agglomerative Clustering (HAC)
    • Single-Pass Clustering
    • k-Nearest Neighbor (k-NN)
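To show where a document similarity measure plugs into one of these techniques, here is a minimal Python sketch of single-pass clustering. It is a generic textbook-style version, not the paper's implementation: comparing each document with only the first member of a cluster, the Jaccard similarity used as a stand-in measure, and the threshold value are all assumptions.

```python
def single_pass_cluster(documents, similarity, threshold):
    """Assign each document to the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster is a list of document indices
    for index, doc in enumerate(documents):
        best_cluster, best_score = None, threshold
        for cluster in clusters:
            # Simplification: the first member represents the whole cluster.
            score = similarity(doc, documents[cluster[0]])
            if score >= best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([index])
        else:
            best_cluster.append(index)
    return clusters

def jaccard(a, b):
    """Stand-in similarity: shared terms divided by all distinct terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

docs = [
    ["nanotube", "sheet", "muscle"],
    ["nanotube", "sheet", "material"],
    ["cluster", "document"],
    ["document", "cluster", "similarity"],
]
clusters = single_pass_cluster(docs, jaccard, threshold=0.4)
print(clusters)  # [[0, 1], [2, 3]]
```

Any pairwise measure with the same call shape, such as a concept-based similarity, could be passed in place of `jaccard`.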
Conclusions
  • The model bridges the gap between the natural language processing and text mining disciplines. (?)
  • By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.
  • There are a number of possibilities for extending this work:
    • Link this work to Web document clustering.
    • Apply the same model to text classification.
Comments
  • Advantages
    • A better similarity measure that considers the semantic structure of sentences in documents.
  • Shortcomings
    • The algorithm description is ambiguous.
  • Applications
    • Text clustering
    • Text classification