
Text Mining

Huaizhong KOU

PhD student of Georges Gardarin

PRiSM Laboratory


0. Content

1. Introduction

What happens

What is Text Mining

Text Mining vs Data Mining

Applications

2. Feature Extraction

Task

Indexing

Dimensionality Reduction

3. Document Categorization

Task

Architecture

Categorization Classifiers

Application: Trend Analysis

4. Document Clustering

Task

Algorithms

Application

5. Product

6. Reference



1. Text Mining: Introduction

1.1 What happens

1.2 What’s Text Mining

1.3 Text Mining vs Data Mining

1.4 Application


1.1 Introduction: What happens (1)

  • Information explosion
  • 80% of information is stored in text
  • documents: journals, web pages, emails...
  • Difficult to extract specific information
  • Current technologies...


1.1 Introduction: What happens (2)

  • It is necessary to automatically analyze, organize, summarize...

Text Mining turns raw documents into knowledge, e.g. "The value of the shares of XML companies will rise."


1.2 Introduction: What's Text Mining (1)

  • Text Mining::= the procedure of synthesizing information by analyzing the relations, the patterns, and the rules among textual data - semi-structured or unstructured text.

  • Techniques::=

    data mining

    machine learning

    information retrieval

    statistics

    natural-language understanding

    case-based reasoning

Ref:[1]-[4]


1.2 Introduction: What's Text Mining (2)

[Diagram: a learning phase builds representation models and domain-specific templates/models from sample documents; a working phase applies them to transformed text documents to produce knowledge and visualizations.]

1.3 Introduction: TM vs DM

                        Data Mining                         Text Mining
  Data object           numerical & categorical data        textual data
  Data structure        structured                          unstructured & semi-structured
  Data representation   straightforward                     complex
  Space dimension       < tens of thousands                 > tens of thousands
  Methods               data analysis, machine learning,    data mining, information retrieval,
                        statistics, neural networks         NLP, ...
  Maturity              broad implementation since 1994     broad implementation starting 2000
  Market                10^5 analysts at large and           10^8 analysts, corporate workers
                        mid-size companies                  and individual users

Ref:[12][13]


1.4 Introduction: Applications

  • The potential applications are countless.

    • Customer profile analysis
    • Trend analysis
    • Information filtering and routing
    • Event tracking
    • News story classification
    • Web search
    • ...

Ref:[12][13]


2. Feature Extraction

2.1 Task

2.2 Indexing

2.3 Weighting Model

2.4 Dimensionality Reduction

Ref:[7][11][14][18][20][22]


2.1 Feature Extraction: Task (1)

Task: extract a good subset of words to represent documents.

Document collection → Feature Extraction: from all unique words/phrases to all good words/phrases

Ref:[7][11][14][18][20][22]


2.1 Feature Extraction: Task (2)

Example: the sentence

  "While more and more textual information is available online, effective retrieval is difficult without good indexing of text content."

contains 16 distinct content words:

  while-more-and-textual-information-is-available-online-effective-retrieval-difficult-without-good-indexing-text-content

After feature extraction only 5 features remain, with their frequencies:

  text(2) - information(1) - online(1) - retrieval(1) - index(1)

Ref:[7][11][14][18][20][22]


2.2 Feature Extraction: Indexing (1)

Training documents

Identification of all unique words

Removal of stop words
  • non-informative words
  • ex. {the, and, when, more}

Word stemming
  • removal of suffixes to generate word stems
  • grouping words increases the relevance
  • ex. {walker, walking} → walk

Term weighting
  • naive terms
  • importance of a term in a document

Ref:[7][11][14][18][20][22]
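
A minimal Python sketch of the indexing chain above (the stop-word list, the crude suffix rules and the stem/index helper names are illustrative assumptions, not the stemmer used in the slides):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "when", "more", "is", "of", "a", "to", "in"}  # toy stop list
SUFFIXES = ("ing", "er", "ed", "s")  # crude suffix stripping, not a real Porter stemmer

def stem(word):
    # Strip the first matching suffix; e.g. "walker" and "walking" both map to "walk".
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index(document):
    # identification of unique words -> stop-word removal -> stemming -> term counts
    words = re.findall(r"[a-z]+", document.lower())
    stems = [stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)

print(index("While more and more textual information is available online, "
            "effective retrieval is difficult without good indexing of text content."))
```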


2.2 Feature Extraction: Indexing (2)

  • Document representation: vector space model

    d = (w1, w2, ..., wt) ∈ Rt

    wi is the weight of the ith term in document d.

Ref:[7][11][14][18][20][22]


2.3 Feature Extraction: Weighting Model (1)

  • tf - Term Frequency weighting
    • wij = Freqij
    • Freqij ::= the number of times the jth term occurs in document Di.
  • Drawback: no reflection of the importance factor for document discrimination.
  • Ex. D1 = ABRTSAQWAXAO, D2 = RTABBAXAQSAK

        A  B  K  O  Q  R  S  T  W  X
    D1  3  1  0  1  1  1  1  1  1  1
    D2  3  2  1  0  1  1  1  1  0  1

Ref:[11][22]


2.3 Feature Extraction: Weighting Model (2)

  • tfidf - Inverse Document Frequency weighting
    • wij = Freqij * log(N / DocFreqj)
    • N ::= the number of documents in the training document collection.
    • DocFreqj ::= the number of documents in which the jth term occurs.
  • Advantage: reflects the importance factor for document discrimination.
  • Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.
  • Ex.

        A  B   K    O   Q  R  S  T   W   X
    D1  0  0   0   0.3  0  0  0  0  0.3  0
    D2  0  0  0.3   0   0  0  0  0   0   0

Ref:[11][13][22]
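
A small Python sketch of tf and tfidf weighting as defined above (base-10 logarithm, to match the 0.3 values in the example; helper names are illustrative):

```python
import math
from collections import Counter

def tf_idf(documents):
    # documents: list of token lists; returns one {term: weight} dict per document,
    # using w_ij = Freq_ij * log10(N / DocFreq_j) as defined above.
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        freq = Counter(doc)  # tf weighting: w_ij = Freq_ij
        weights.append({t: f * math.log10(n / doc_freq[t]) for t, f in freq.items()})
    return weights

d1 = list("ABRTSAQWAXAO")
d2 = list("RTABBAXAQSAK")
for w in tf_idf([d1, d2]):
    print(w)  # terms occurring in both documents get weight 0 (log10(2/2) = 0)
```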


2.3 Feature Extraction: Weighting Model (3)

  • Entropy weighting

    wij = log(Freqij + 1) * (1 + Ej)

    where Ej = (1 / log N) * Σi (Freqij / nj) * log(Freqij / nj) is the average entropy of the jth term, and nj = Σi Freqij is its total frequency in the collection.

    Ej = -1: if the word occurs exactly once in every document
    Ej =  0: if the word occurs in only one document

Ref:[11][13][22]
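
A corresponding Python sketch of entropy weighting, following the reconstruction above (assumes a collection of at least two documents; names are illustrative):

```python
import math
from collections import Counter

def entropy_weights(documents):
    # documents: list of token lists; w_ij = log(Freq_ij + 1) * (1 + E_j),
    # where E_j is the normalized average entropy of term j over the collection.
    n = len(documents)  # assumed >= 2
    freqs = [Counter(doc) for doc in documents]
    totals = Counter()
    for f in freqs:
        totals.update(f)
    entropy = {}
    for term, n_j in totals.items():
        e = sum((f[term] / n_j) * math.log(f[term] / n_j)
                for f in freqs if f[term] > 0)
        entropy[term] = e / math.log(n)  # -1 if evenly spread, 0 if in a single document
    return [{t: math.log(f + 1) * (1 + entropy[t]) for t, f in fr.items()} for fr in freqs]
```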


2.4 Feature Extraction: Dimension Reduction

  • 2.4.1 Document Frequency Thresholding
  • 2.4.2 χ²-statistic
  • 2.4.3 Latent Semantic Indexing

    Ref:[11][20][21][27]


2.4.1 Dimension Reduction: DocFreq Thresholding

  • Document Frequency Thresholding

    Training documents D → naive terms
    Calculate DocFreq(w) for each naive term w
    Set a threshold θ
    Remove all words with DocFreq(w) < θ
    The remaining words are the feature terms

    Ref:[11][20][21][27]
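
A minimal sketch of document frequency thresholding as described above (the function name is illustrative):

```python
from collections import Counter

def docfreq_threshold(documents, theta):
    # documents: list of token lists; keep only terms occurring in at least theta documents.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return {term for term, df in doc_freq.items() if df >= theta}
```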


2.4.2 Dimension Reduction: χ²-statistic

  • Assumption: a pre-defined category set C = {c1, c2, ..., cm} for a training collection D
  • Goal: estimate the independence between a term and a category

    For a term w and a category cj:
    A := |{d | d ∈ cj ∧ w ∈ d}|
    B := |{d | d ∉ cj ∧ w ∈ d}|
    C := |{d | d ∈ cj ∧ w ∉ d}|
    D := |{d | d ∉ cj ∧ w ∉ d}|
    N := |{d | d ∈ D}|

    χ²(w, cj) = N * (A*D - C*B)² / ((A + C) * (B + D) * (A + B) * (C + D))
    χ²max(w) = max over all cj of χ²(w, cj)

    Naive terms → compute the term categorical score χ²max(w) → set a threshold θ →
    remove all words with χ²max(w) < θ → the remaining words are the FEATURE TERMS

    Ref:[11][20][21][27]
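
A Python sketch of the χ²-based selection above (the counts follow the A, B, C, D definitions; function and parameter names are illustrative):

```python
def chi2_scores(documents, labels, categories):
    # documents: list of term sets; labels: category of each document.
    # Returns chi2_max(w) for every term, following the A/B/C/D counts above.
    n = len(documents)
    terms = set().union(*documents)
    scores = {}
    for w in terms:
        best = 0.0
        for c in categories:
            a = sum(1 for d, l in zip(documents, labels) if l == c and w in d)
            b = sum(1 for d, l in zip(documents, labels) if l != c and w in d)
            cc = sum(1 for d, l in zip(documents, labels) if l == c and w not in d)
            dd = sum(1 for d, l in zip(documents, labels) if l != c and w not in d)
            denom = (a + cc) * (b + dd) * (a + b) * (cc + dd)
            if denom:
                best = max(best, n * (a * dd - cc * b) ** 2 / denom)
        scores[w] = best
    return scores

def select_features(scores, theta):
    # keep only the terms whose chi2_max score reaches the threshold
    return {w for w, s in scores.items() if s >= theta}
```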


2.4.3 Dimension Reduction: LSI (1)

  • LSI = Latent Semantic Indexing.

  1. SVD Model: Singular Value Decomposition of the term-document matrix X

     X   =  T0  *  S0  *  D0'
    (t,d)  (t,m)  (m,m)  (m,d)

     with T0' * T0 = I and D0' * D0 = I.

     T0: t x m orthogonal matrix; its columns are eigenvectors of X * X'.
     D0: d x m orthogonal matrix; its columns are eigenvectors of X' * X.
     S0: m x m diagonal matrix of singular values (square roots of the eigenvalues), in decreasing order of importance.
     m: rank of matrix X, m <= min(t, d).

    Ref:[11][20][21][27]


2.4.3 Dimension Reduction: LSI (2)

  2. Approximate Model

     Select k <= m and keep only the k largest singular values:

     X  ≈  appr(X)  =  T   *   S   *  D'
    (t,d)             (t,k)  (k,k)  (k,d)

  • Every column of matrix appr(X) approximately represents one document.
  • Given a document vector xi (a row of appr(X)') and its corresponding row di of D, the following holds:

     xi = di * S * T'   and   di = xi * T * S^-1

    Ref:[11][20][21][27]


2.4.3 Dimension Reduction: LSI (3)

  3. Document Representation Model:

     A new document d over the t naive terms, d = (w1, w2, ..., wt) ∈ Rt, is mapped to

     appr(d) = d  *  T  *  S^-1  ∈ Rk
      (1,k)  (1,t) (t,k) (k,k)

  • There is no good method to determine k; it depends on the application domain.
  • Some experiments suggest 100 <= k <= 300.

    Ref:[11][20][21][27]
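
A numpy sketch of the truncated SVD and the fold-in of a new document, following the formulas above (matrix names follow the slides; the toy data is random):

```python
import numpy as np

def lsi_fit(X, k):
    # X: (t, d) term-document matrix. Truncated SVD: X ~ T S D'.
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    return T[:, :k], np.diag(s[:k]), Dt[:k, :].T  # T (t,k), S (k,k), D (d,k)

def lsi_fold_in(d, T, S):
    # Map a new document vector d (length t) into the k-dimensional LSI space:
    # appr(d) = d * T * S^-1
    return d @ T @ np.linalg.inv(S)

# toy usage on a random term-document matrix
X = np.random.rand(50, 12)
T, S, D = lsi_fit(X, k=3)
print(lsi_fold_in(X[:, 0], T, S))  # 3-dimensional representation of the first document
```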


3. Document Categorization

    3.1 Task

    3.2 Architecture

    3.3 Categorization Classifiers

    3.4 Application

    Ref:[1][2][4][5][11][18][23][24]


3.1 Categorization: Task

  • Task: assignment of one or more predefined categories (topics, themes) to one document.


3.2 Categorization: Architecture

  Training documents → preprocessing → weighting → selecting features → feature space

  The feature space and the predefined categories are used to train the classifier.

  New document d → classifier → category(ies) assigned to d


3.3 Categorization Classifiers

  3.3.1 Centroid-Based Classifier
  3.3.2 k-Nearest Neighbor Classifier
  3.3.3 Naive Bayes Classifier


3.3.1 Model: Centroid-Based Classifier (1)

  1. Input: new document d = (w1, w2, ..., wn);
  2. Predefined categories: C = {c1, c2, ..., cl};
  3. // Compute the centroid vector of each category ci ∈ C
     // (the average of the vectors of the training documents belonging to ci)
  4. // Similarity model - cosine function
  5. // Compute the similarity between d and each centroid
  6. // Output: assign to document d the category cmax with the highest similarity
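
A minimal Python sketch of the centroid-based classifier described above (array shapes and helper names are assumptions of this sketch):

```python
import numpy as np

def train_centroids(X, y):
    # X: (n_docs, n_terms) array of document vectors; y: category label per document.
    # Centroid of a category = mean of its document vectors.
    return {c: X[np.array(y) == c].mean(axis=0) for c in set(y)}

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def classify_centroid(d, centroids):
    # Assign the category whose centroid is most similar (cosine) to d.
    return max(centroids, key=lambda c: cosine(d, centroids[c]))
```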


3.3.1 Model: Centroid-Based Classifier (2)

  • Let α be the angle between d1 and d3, and β the angle between d1 and d2, with α > β.
  • Then cos(α) < cos(β),
  • so d2 is closer to d1 than d3.
  • The cosine-based similarity model can reflect the relations between features.

3.3.2 Model: k-Nearest Neighbor Classifier

  1. Input: new document d;
  2. Training collection: D = {d1, d2, ..., dn};
  3. Predefined categories: C = {c1, c2, ..., cl};
  4. // Compute similarities
     for (di ∈ D) { Simil(d, di) = cos(d, di); }
  5. // Select the k nearest neighbors
     Construct a k-document subset Dk so that
     Simil(d, di) <= min(Simil(d, doc) | doc ∈ Dk) for all di ∈ D - Dk.
  6. // Compute a score for each category
     for (ci ∈ C) {
       score(ci) = 0;
       for (doc ∈ Dk) { score(ci) += ((doc ∈ ci) == true ? 1 : 0) }
     }
  7. // Output: assign to d the category c with the highest score:
     score(c) >= score(ci), for all ci ∈ C - {c}
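
A corresponding Python sketch of the k-nearest-neighbor classifier (cosine similarity, one vote per neighbor; the helper names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(d, X, y, k):
    # d: query document vector; X: (n, t) training vectors; y: training labels.
    sims = X @ d / (np.linalg.norm(X, axis=1) * np.linalg.norm(d) + 1e-12)  # cosine
    nearest = np.argsort(sims)[-k:]         # indices of the k most similar documents
    votes = Counter(y[i] for i in nearest)  # one vote per neighbor
    return votes.most_common(1)[0][0]
```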


3.3.3 Model: Naive Bayes Classifier

  Basic assumption: all terms are distributed independently in documents.

  1. Input: new document d;
  2. Predefined categories: C = {c1, c2, ..., cl};
  3. // Compute the probability that d belongs to each class ci ∈ C
     for (ci ∈ C) {
       // since the terms wj of the document are assumed independent of each other:
       P(ci | d) ∝ P(ci) * Πj P(wj | ci)
     }
  4. // Output: assign to d the category c with the highest probability.
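
A minimal multinomial Naive Bayes sketch following the slide; the Laplace smoothing and log-space computation are assumptions added to keep the toy code numerically sane:

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels):
    # documents: list of token lists; labels: category per document.
    prior = Counter(labels)                 # class counts for P(ci)
    term_counts = defaultdict(Counter)      # per-class term counts for P(wj | ci)
    vocab = set()
    for doc, c in zip(documents, labels):
        term_counts[c].update(doc)
        vocab.update(doc)
    return prior, term_counts, vocab

def classify_nb(d, prior, term_counts, vocab):
    n = sum(prior.values())
    best, best_logp = None, -math.inf
    for c, count_c in prior.items():
        total = sum(term_counts[c].values())
        logp = math.log(count_c / n)
        for w in d:
            # Laplace-smoothed P(w | c)
            logp += math.log((term_counts[c][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best, best_logp = c, logp
    return best
```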


3.4 Categorization Application: Trend Analysis - EAnalyst System

  Goal: predict the trends in stock prices based on news stories.

  Learning process:
  • Time-series data → find trends by piecewise linear fitting → trend clusters.
    Trend ::= (slope, confidence)
  • Sample textual data (news documents) → retrieve docs → align trends with docs → sample relevant docs.
  • Build language models from the aligned documents → Bayes classifier.

  Categorization (working phase): new documents are classified by the Bayes classifier to predict a new trend.

    Ref:[28]


4. Document Clustering

    4.1 Task

    4.2 Algorithms

    4.3 Application

    Ref:[5][7][8][9][10][15][16][29]


4.1 Document Clustering: Task

  • Task: group all documents so that the documents in the same group are more similar to each other than to documents in other groups.
  • Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.

    Ref:[5][7][8][9][10][15][16][29]


4.2 Document Clustering: Algorithms

  • 4.2.1 k-means
  • 4.2.2 Hierarchic Agglomerative Clustering (HAC)
  • 4.2.3 Association Rule Hypergraph Partitioning (ARHP)

    Ref:[5][7][8][9][10][15][16][29]


4.2.1 Document Clustering: k-means

  • k-means: distance-based flat clustering

    0. Input: D ::= {d1, d2, ..., dn}; k ::= the number of clusters;
    1. Select k document vectors as the initial centroids of the k clusters
    2. Repeat
    3.   Select one vector d among the remaining documents
    4.   Compute the similarities between d and the k centroids
    5.   Put d in the closest cluster and recompute that cluster's centroid
    6. Until the centroids no longer change
    7. Output: k clusters of documents

  • Advantages:
    • linear time complexity
    • works relatively well in low-dimensional spaces
  • Drawbacks:
    • distance computation in high-dimensional spaces is expensive
    • the centroid vector may not summarize the cluster's documents well
    • the initial k clusters affect the quality of the final clusters

  • Ref:[5][7][8][9][10][15][16][29]
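
A batch (Lloyd-style) k-means sketch in Python; the slide describes an online variant that recomputes the centroid after every assignment, so this is an approximation of that procedure (function and parameter names are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # X: (n_docs, n_terms) document vectors; L2-normalize rows first for a
    # cosine-like behaviour, otherwise plain Euclidean distance is used.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each document to the closest centroid
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids
```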


4.2.2 Document Clustering: HAC

  • Hierarchic agglomerative clustering (HAC): distance-based hierarchic clustering

    0. Input: D ::= {d1, d2, ..., dn};
    1. Calculate the similarity matrix SIM[i, j]
    2. Repeat
    3.   Merge the two most similar clusters, K and L, to form a new cluster KL
    4.   Compute the similarities between KL and each of the remaining clusters and update SIM[i, j]
    5. Until there is a single cluster (or a specified number of clusters)
    6. Output: dendrogram of clusters

  • Advantages:
    • produces better-quality clusters
    • works relatively well in low-dimensional spaces
  • Drawbacks:
    • distance computation in high-dimensional spaces is expensive
    • quadratic time complexity

  • Ref:[5][7][8][9][10][15][16][29]
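
A single-linkage HAC sketch over cosine similarities (the linkage criterion is an assumption of this sketch, since the slide does not fix one):

```python
import numpy as np

def hac(X, n_clusters=1):
    # X: (n_docs, n_terms) document vectors; returns lists of document indices.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                      # cosine similarity matrix SIM[i, j]
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # find the pair of clusters containing the most similar pair of members
        best = (-1.0, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sim[i, j] for i in clusters[a] for j in clusters[b])
                if s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] += clusters[b]       # merge the two most similar clusters
        del clusters[b]
    return clusters
```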


4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (1)

  • Hypergraph
    H = (V, E)
    V :: a set of vertices
    E :: a set of hyperedges; each hyperedge connects an arbitrary subset of the vertices.

  [Figure: example hypergraph over vertices {A, B, C, D, E, F, G} with weighted hyperedges
   a = 1, b = 1/3, c = 1, e = 3/4, f = 3/2, g = 1/3, h = 5/6, i = 1/2]

    Ref:[30]-[35]


4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (2)

  • Transactional view of documents and features
    • item ::= document
    • transaction ::= feature

    Transactional database of documents and features (items = documents, transactions = features):

          Doc1  Doc2  Doc3  ...  Docn
    w1     5     5     2    ...   1
    w2     2     4     3    ...   5
    w3     0     0     0    ...   1
    ...
    wt     6     0     0    ...   3

    Ref:[30]-[35]


4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (3)

  • Clustering

    Document-feature transaction database
      → discover association rules (Apriori algorithm)
      → construct the hypergraph (association rule hypergraph)
      → partition the hypergraph (hypergraph partitioning algorithm)
      → k partitions

  • Hyperedges ::= frequent item sets
  • Hyperedge weight ::= average of the confidences of all rules derived from the item set
  • Assumption: documents occurring in the same frequent item set are more similar

    Ref:[30]-[35]
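
A rough sketch of the first two ARHP stages only: frequent document-sets become hyperedges, weighted by the average confidence of their association rules. It uses naive itemset enumeration rather than full Apriori pruning, and the final step would hand the weighted hyperedges to a hypergraph partitioner such as hMETIS [34]; thresholds and names are illustrative assumptions:

```python
from itertools import combinations

def document_hyperedges(transactions, min_support=2, max_size=3):
    # transactions: list of document-id sets (one per feature, as in the table above).
    # Frequent document-sets become hyperedges; the weight is the average confidence
    # of the rules "subset -> remaining documents of the item set".
    counts = {}
    for size in range(1, max_size + 1):
        for t in transactions:
            for itemset in combinations(sorted(t), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support and len(s) >= 2}
    edges = {}
    for itemset, support in frequent.items():
        confs = [support / counts[sub]
                 for r in range(1, len(itemset))
                 for sub in combinations(itemset, r)]
        edges[itemset] = sum(confs) / len(confs)  # hyperedge weight
    return edges
```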


4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (4)

  • Advantages
    • No calculation of cluster means is needed.
    • Linear time complexity.
    • The quality of the clusters is not affected by the space dimensionality.
    • Performs much better than traditional clustering in high-dimensional spaces, in terms of both cluster quality and runtime.

    Ref:[30]-[35]


4.3 Document Clustering: Application

    • Summarization of documents

    • Navigation of large document collections

    • Organization of Web search results

    Ref:[10][15]-[17]


5. Product: Intelligent Miner for Text (IMT) (1)

  • Text Analysis Tools
    • Feature extraction
      • Name extraction
      • Term extraction
      • Abbreviation extraction
      • Relationship extraction
    • Categorization
    • Summarization
    • Clustering
      • Hierarchical clustering
      • Binary relational clustering
  • Web Searching Tools
    • Text search engine
    • NetQuestion Solution
    • Web Crawler

    Ref:[5][36]


5. Product: Intelligent Miner for Text (IMT) (2)

  • 1. Feature extraction tools
    • 1.1 Information extraction
      • Extract linguistic items that represent document contents
    • 1.2 Feature extraction
      • Assign different categories to the vocabulary in documents,
      • Measure their importance to the document content.
    • 1.3 Name extraction
      • Locate names in text,
      • Determine what type of entity the name refers to
    • 1.4 Term extraction
      • Discover terms in text, including multiword technical terms
      • Recognize variants of the same concept
    • 1.5 Abbreviation recognition
      • Find abbreviations and match them with their full forms.
    • 1.6 Relation extraction


5. Product: Intelligent Miner for Text (IMT) (3)

    Feature extraction demo.


5. Product: Intelligent Miner for Text (IMT) (4)

  • 2. Clustering tools
    • 2.1 Applications
      • Provide an overview of the content of a large document collection
      • Identify hidden structures between groups of objects
      • Improve the browsing process to find similar or related information
      • Find outstanding documents within a collection
    • 2.2 Hierarchical clustering
      • Clusters are organized in a clustering tree; related clusters occur in the same branch of the tree.
    • 2.3 Binary relational clustering
      • Relationship of topics: document → cluster → topic.
      • NB: preprocessing step for the categorization tool


5. Product: Intelligent Miner for Text (IMT) (5)

    • Clustering demo: navigation of a document collection


5. Product: Intelligent Miner for Text (IMT) (6)

  • 3. Summarization tools
    • 3.1 Steps
      • Compute the relevancy of each sentence to the document → select the most relevant sentences → produce a summary of the document, with its length set by the user
    • 3.2 Applications
      • Judge the relevancy of a full text
      • Easily determine whether the document is relevant to read
      • Enrich search results: the results of a query to a search engine can be enriched with a short summary of each document
      • Get a fast overview over document collections: read the summary instead of the full document


5. Product: Intelligent Miner for Text (IMT) (7)

  • 4. Categorization tool
    • Applications
      • Organize intranet documents
      • Assign documents to folders
      • Dispatch requests
      • Forward news to subscribers

  [Example: a news article goes through the categorizer into one of the categories
   sports, cultures, health, politics, economics, vacations; a news router forwards
   health news to a subscriber who said "I like health news".]


6. Reference (1)

    Bibliography

    [1] Marti A. Hearst, Untangling Text Data Mining, Proceedings of ACL’99 the 37th

    Annual Meeting of the Association for Computational Linguistics, University of

    Maryland, June 20-26, 1999 (invited paper) http://www.sims.berkeley.edu/~hearst

    [2] Feldman and Dagan 1995 KDT - knowledge discovery in texts.

    In Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal.

    [3] IJCAI-99 Workshop TEXT MINING: FOUNDATIONS, TECHNIQUES AND APPLICATIONS Stockholm,

    Sweden August 2, 1999 http://www.cs.biu.ac.il/~feldman/ijcai-workshop%20cfp.html

    [4] Taeho C. Jo Text Categorization considering Categorical Weights and Substantial Weights of Informative Keywords,

    1999 (http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm)

[5] A White Paper from IBM Technology, Text Mining: Turning Information Into Knowledge, February 17, 1998

    editor: Daniel Tkach IBM Software Solutions ( http://allen.comm.virginia.edu/jtl5t/whiteweb.html)

    [6] http://allen.comm.virginia.edu/jtl5t/index.htm

    [7] G. Salton et al, Introduction to Modern Information Retrieval, McGraw-Hill Book company, 1983

[8] Michael Steinbach and George Karypis and Vipin Kumar, A Comparison of Document Clustering Techniques, KDD-2000

[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, Scatter/Gather :

    A Cluster-based Approach to Browsing large Document Collections, SIGIR ’92,Pages 318 - 329

    [10] Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, Fast and Intuitive Clustering of Web Documents,

    KDD ’97, Pages 287 – 290, 1997


6. Reference (2)

    Bibliography

    [11] Kjersti Aas et al. Text Categorization: Survey, 1999

    [12] Text mining White Paper http://textmining.krdl.org.sg/whiteppr.html

    [13] Gartner Group, Text mining White Paper , June 2000

    http://www.xanalys.com/intelligence_tools/products/text_mining_text.html

    [14] Yiming Y. and Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, In the 14th Int.

    Conf. On Machine Learning, PP. 412- 420, 1997

    [15] C. J. van Rijsbergen, (1979), Information Retrieval, Buttersworth, London.

    [16] Chris Buckley and Alan F. Lewit, Optimizations of inverted vector searches, SIGIR ’85, PP 97 – 110,1985.

    [17] Daphe Koller and Mehran Sahami, Hierarchically classifying documents using very few words, proceedings of

    the 14th International Conference on Machine Learning, Nashville, Tennessee, July 1997, PP170 – 178.

    [18] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, In Int. Conf.

    Machine Learning, 1997.

    [19] K. Lang, NewsWeeder: Learning to Filter Netnews, International conference on Machine learning, 1995,

    http://anther.learning.cs.cmu.edu/ml95.ps

[20] Hart, S.P. A Probabilistic Approach to Automatic Keyword Indexing, Journal of the American Society for

    Information Science, July-August, 1975


6. Reference (3)

    Bibliography

[21] Scott D., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science,

    Vol. 41,No,6, P.P. 391-407, 1990

    [22] S.T. Dumais, Improving the retrieval information from external sources, Behavior Research Methods,

    Instruments and Computers, Vol23, No.2,PP.229-236, 1991

    [23] T. Yavuz and H. A. Guvenir, Application of k-Nearest Neighbor on Feature Projections Classifier

    to Text Categorization, 1998

    [24]Eui-Hong H. and George K., Centroid-Based Document Classification: Analysis & Experimental Results,

    In European Conference on Principles of Data Mining and Knowledge Discovery(PKDD), 2000

[25] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995

[26] Ji He, Ah-Hwee Tan and Chew-Lim Tan, A Comparative Study on Chinese Text Categorization Methods,

PRICAI 2000 Workshop on Text And Web Mining, Melbourne, pp. 24-35, August 2000

    (http://textmining.krdl.org.sg/PRICAI2000/text-categorization.pdf)

    [27] Erik W. and Jan O. Pedersen and Andreas S. Weigend, A Neural Network Approach to Topic Spotting,

In Proc. 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 22-34, 1993

    [28] Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D. and Allan, J. in the Proceedings of KDD 2000 Conference, pp. 37-44.

[29] Peter W., Recent Trends in Hierarchic Document Clustering: A Critical Review, Information Processing & Management, Vol. 24, No. 5, pp. 577-597, 1988


6. Reference (4)

    Bibliography

    [30] J. Larocca Neto, A.D. Santos, C.A.A. Kaestner, A.A. Freitas. Document clustering and text summarization. Proc. 4th Int. Conf. Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), 41-55. London: The Practical Application Company. 2000.

    [31] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, Clustering Based On Association Rule Hypergraphs , SIGMOD'97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.

    [32] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore, Parititioning-Based Clustering for Web Document Categorization,

    Decision Support Systems Journal, Vol 27, No. 3, pp 329-341, 1999.

[34] George K. and Rajat A. and Vipin K. and Shashi S., Multilevel Hypergraph Partitioning: Applications in VLSI Domain, In Proceedings of the Design Automation Conference, 1997.

    [35] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, Clustering In A High-Dimensional Space Using Hypergraph Models, Technical report #97-019, http://www-users.cs.umn.edu/~han/.

    [36] IBM White paper: Information Mining with the IBM Intelligent Miner Family, Daniel S. Tkach, February, 1998

