On Node Classification in Dynamic Content-based Networks

Charu Aggarwal, IBM T. J. Watson Research Center ([email protected])
Nan Li, University of California, Santa Barbara ([email protected])

Presented by Nan Li

Motivation

[Figure: a co-authorship network for the year 2001; authors (Ke Wang, Jian Pei, Jiawei Han, Kenneth A. Ross) are annotated with publication keywords such as "Sequential Pattern", "Data Mining", "Association Rules", and "Algorithms".]
Motivation

[Figure: the same co-authorship network in 2002; authors such as Ke Wang, Jian Pei, Jiawei Han, Marianne Winslett, Xifeng Yan, and Philip S. Yu, annotated with keywords such as "Association Rules", "Data Mining", "Web", "Stream", "Clustering", and "Pattern Mining".]

Motivation

[Figure: the same co-authorship network in 2003; authors such as Ke Wang, Jian Pei, Jiawei Han, Charu Aggarwal, Xifeng Yan, and Philip S. Yu, annotated with keywords such as "Sequential Pattern", "Clustering", "XML", "Graph", and "Association Rules".]

Motivation
  • Networks annotated with an increasing amount of text
    • Citation networks, co-authorship networks, product databases with large amounts of text content, etc.
    • Highly dynamic
  • The node classification problem
    • Arises in many network scenarios in which the underlying nodes are associated with content.
    • A subset of the nodes in the network may be labeled.
      • Can we use these labeled nodes in conjunction with the structure for the classification of nodes which are not currently labeled?
  • Applications
Challenges
  • Information networks are very large
    • Scalable and efficient
  • Many such networks are dynamic
    • Updatable in real time
    • Self-adaptable and robust
  • Such networks are often noisy
    • Intelligent and selective
  • Heterogeneous correlations in such networks

[Figure: a network whose nodes carry class labels A, B, and C, illustrating heterogeneous correlations between linked nodes.]

Outline
  • Related Work
  • DYCOS: DYnamic Classification algorithm with cOntent and Structure
    • Semi-bipartite content-structure transformation
    • Classification using a series of text and link-based random walks
    • Accuracy analysis
  • Experiments
    • NetKit-SRL
  • Conclusion
Related Work
  • Link-based classification (Bhagat et al., WebKDD 2007)
    • Local iterative
    • Global nearest neighbor
  • Content-only classification (Nigam et al., Machine Learning 2000)
    • Each object’s own attributes only
  • Relational classification (Sen et al., Technical Report 2004)
    • Each object’s own attributes
    • Attributes and known labels of the neighbors
  • Collective classification (Macskassy & Provost, JMLR 2007, Sen et al., Technical Report 2004, Chakrabarti, SIGMOD 1998)
    • Local classification
      • Flexible: ranging from a decision tree to an SVM
    • Approximate inference algorithms
      • Iterative classification (sketched after this list)
      • Gibbs sampling
      • Loopy belief propagation
      • Relaxation labeling
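
As a hedged illustration of the collective classification loop above (a generic sketch, not the NetKit implementation): `local_classifier` is any point classifier, and `features` is a hypothetical helper that combines a node's own attributes with the current labels of its neighbors.

def iterative_classification(unlabeled, labeled, local_classifier, features,
                             max_iters=10):
    """Generic iterative classification (ICA) sketch.

    `local_classifier.predict(x)` is assumed to map one feature
    vector to a class label; `features(node, labels)` is a
    hypothetical helper combining a node's own attributes with
    the current labels of its neighbors.
    """
    labels = dict(labeled)  # start from the known labels
    # Bootstrap pass: give every unlabeled node an initial label.
    for node in unlabeled:
        labels[node] = local_classifier.predict(features(node, labels))
    # Re-estimate labels until the assignment stabilizes.
    for _ in range(max_iters):
        changed = False
        for node in unlabeled:
            new = local_classifier.predict(features(node, labels))
            changed |= (new != labels[node])
            labels[node] = new
        if not changed:
            break
    return labels
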
Outline
  • Related Work
  • DYCOS: DYnamic Classification algorithm with cOntent and Structure
    • Semi-bipartite content-structure transformation
    • Classification using a series of text and link-based random walks
    • Accuracy analysis
  • Experiments
    • NetKit-SRL
  • Conclusion
DYCOS in a Nutshell
  • Node classification in a dynamic environment
    • Dynamic network: the entire network at time t is denoted by G_t = (N_t, A_t, T_t), where N_t is the node set, A_t the edge set, and T_t ⊆ N_t the subset of labeled nodes (see the snapshot sketch below)
    • Problem statement:
      • Classify the unlabeled nodes (N_t \ T_t) using both the content and the structure of the network, for all time stamps, in an efficient and accurate manner

[Figure: network snapshots at times t, t+1, and t+2. Both the structure and the content of the network change over time!]
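
For concreteness, a minimal Python sketch of one snapshot G_t = (N_t, A_t, T_t); the class and field names are illustrative, not from the paper.

from dataclasses import dataclass

@dataclass
class Snapshot:
    """One network snapshot G_t = (N_t, A_t, T_t) at time t."""
    nodes: set    # N_t: all node ids
    edges: set    # A_t: edges as frozenset({u, v}) pairs
    labels: dict  # T_t: node id -> class label, for the labeled subset

    def unlabeled(self):
        """The nodes to classify at time t: N_t \\ T_t."""
        return self.nodes - self.labels.keys()
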

Semi-bipartite Transformation
  • Text-augmented representation
    • Leveraged for a random walk-based classification model that uses both links and text
    • Two partitions: structural nodes and word nodes
    • Semi-bipartite: one partition of nodes is allowed to have edges either within the set, or to nodes in the other partition (sketched below).
  • Efficient updates upon dynamic changes
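
A minimal sketch of this transformation, assuming each structural node carries a set of keywords; the function and variable names are illustrative, not from the paper.

from collections import defaultdict

def build_semi_bipartite(edges, node_keywords, discriminative_words):
    """Build the text-augmented, semi-bipartite representation.

    Structural nodes keep their original edges to one another; each
    word node connects only to the structural nodes containing that
    word, so the word partition has no internal edges.
    """
    struct_adj = defaultdict(set)  # structural node -> structural neighbors
    word_adj = defaultdict(set)    # word node -> structural nodes (inverted list)
    node_words = defaultdict(set)  # structural node -> its word nodes

    for u, v in edges:
        struct_adj[u].add(v)
        struct_adj[v].add(u)
    for node, words in node_keywords.items():
        for w in set(words) & set(discriminative_words):  # keep top-m words only
            word_adj[w].add(node)
            node_words[node].add(w)
    return struct_adj, word_adj, node_words

For example, build_semi_bipartite({(1, 2)}, {1: {"mining"}, 2: {"mining", "web"}}, {"mining"}) links nodes 1 and 2 both directly and through the shared word node "mining".
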
Random Walk-Based Classification
  • Random walks over augmented structure
    • Starting node: the unlabeled node to be classified.
    • Structural hop
      • A random jump from a structural node to one of its neighbors
    • Content-based multi-hop
      • A jump from a structural node to another through implicit common word nodes
    • Structural parameter: p_s, the probability of taking a structural hop rather than a content-based multi-hop
  • Classification
    • Classify the starting node with the most frequently encountered class label during the random walks (see the sketch below)
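
A hedged sketch of the walk-based classification, reusing the adjacency structures from the previous sketch; at each step the walk attempts a structural hop with probability p_s and otherwise takes a content-based multi-hop through a shared word node. The defaults echo the experimental settings (l=3, h=5, p_s=70%); this is an illustration, not the exact DYCOS procedure.

import random
from collections import Counter

def classify_node(start, struct_adj, word_adj, node_words, labels,
                  p_s=0.7, num_walks=3, walk_length=5):
    """Label `start` by the most frequent class label seen across
    l = num_walks random walks of h = walk_length steps each."""
    votes = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            if random.random() < p_s and struct_adj[node]:
                # Structural hop: move to a random graph neighbor.
                node = random.choice(sorted(struct_adj[node]))
            elif node_words[node]:
                # Content-based multi-hop: node -> shared word -> node.
                word = random.choice(sorted(node_words[node]))
                node = random.choice(sorted(word_adj[word]))
            if node in labels:
                votes[labels[node]] += 1
    # The starting node gets the most frequently encountered label.
    return votes.most_common(1)[0][0] if votes else None
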
Gini-Index & Inverted Lists
  • Discriminative keywords
    • A set M_t of the top m words with the highest discriminative power is used to construct the word-node partition.
    • Gini-index
      • The value of G(w) lies in the range (0, 1).
      • Words with a higher Gini-index are more discriminative for classification purposes (a keyword-selection sketch follows this slide).
  • Inverted lists
    • Inverted list of keywords for each node
    • Inverted list of nodes for each keyword
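
The slides do not reproduce the Gini-index formula; in the paper, as I understand it, G(w) = Σ_i p_i(w)², where p_i(w) is the fraction of the occurrences of word w that belong to class i, so a word concentrated in one class has G(w) close to 1. A minimal Python sketch of keyword selection under that assumption (function names are illustrative):

from collections import Counter, defaultdict

def gini_index(class_counts):
    """G(w) = sum_i p_i(w)^2, where p_i(w) is the fraction of w's
    occurrences that belong to class i; all occurrences in one
    class gives G(w) = 1 (maximally discriminative)."""
    total = sum(class_counts.values())
    return sum((c / total) ** 2 for c in class_counts.values())

def top_discriminative_words(node_keywords, labels, m):
    """Return the m words with the highest Gini-index, estimated
    from the labeled nodes only."""
    counts = defaultdict(Counter)  # word -> (class -> occurrence count)
    for node, words in node_keywords.items():
        for w in words:
            if node in labels:
                counts[w][labels[node]] += 1
    return sorted(counts, key=lambda w: gini_index(counts[w]), reverse=True)[:m]
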
Analysis
  • Why do we care?
    • DYCOS essentially uses Monte-Carlo sampling to sample various paths from each unlabeled node.
      • Advantage: a fast approach
      • Disadvantage: some loss of accuracy
    • Can we analyze how accurate the DYCOS sampling is?
  • Probabilistic bound: bi-class classification
    • Two classes C_1 and C_2
    • E[Pr[C_1]] = f_1, E[Pr[C_2]] = f_2, with f_1 - f_2 = b ≥ 0
    • Pr[mis-classification] ≤ exp(-l·b²/2), where l is the number of random walks (numeric check below)
  • Probabilistic bound: multi-class classification
    • k classes {C_1, C_2, …, C_k}
    • b-accurate
    • Pr[b-accurate] ≥ 1 - (k-1)·exp(-l·b²/2)
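
The bi-class bound is easy to sanity-check numerically; for instance, with l = 100 random-walk samples and a class-frequency gap b = 0.3 (values chosen purely for illustration):

import math

def misclassification_bound(l, b):
    """Upper bound exp(-l * b^2 / 2) on the probability that
    sampling flips the bi-class majority vote."""
    return math.exp(-l * b * b / 2)

print(misclassification_bound(100, 0.3))  # ~0.0111: 100 samples suffice
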
Outline
  • Related Work
  • DYCOS: DYnamic Classification algorithm with cOntent and Structure
    • Semi-bipartite content-structure transformation
    • Classification using a series of text and link-based random walks
    • Accuracy analysis
  • Experiments
    • NetKit-SRL
  • Conclusion
Experimental Results
  • Data sets
    • CORA: a set of research papers and the citation relations among them.
      • Each node is a paper and each edge is a citation relation.
      • A total of 12,313 English words are extracted from the paper titles.
      • We segment the data into 10 synthetic time periods.
    • DBLP: a set of authors and their collaborations.
      • Each node is an author and each edge is a collaboration.
      • A total of 194 English words in the domain of computer science are used.
      • We segment the data into 36 annual graphs from year 1975 to year 2010.
Experimental Results
  • NetKit-SRL toolkit
    • An open-source and publicly available toolkit for statistical relational learning in networked data (Macskassy and Provost, 2007).
    • Instantiations of previous relational and collective classification algorithms
    • Configuration
      • Local classifier: domain-specific class prior
      • Relational classifier: network-only multinomial Bayes classifier
      • Collective inference: relaxation labeling
  • Parameters (a sample configuration is sketched below)
    • m: the number of most discriminative words
    • a: the size constraint of the inverted list for each keyword
    • q: the number of top content-hop neighbors
    • l: the number of random walks
    • h: the length of each random walk
    • p_s: the structure parameter
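
One plausible way to bundle these six parameters into a single configuration; the values of a, p_s, l, and h echo the sensitivity experiments reported below, while m and q are illustrative placeholders, not values from the paper.

dycos_params = {
    "m": 5,      # number of most discriminative words (placeholder)
    "a": 30,     # size constraint of each keyword's inverted list
    "q": 2,      # number of top content-hop neighbors (placeholder)
    "l": 3,      # number of random walks per unlabeled node
    "h": 5,      # length of each random walk
    "p_s": 0.7,  # structure parameter: probability of a structural hop
}
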

The results demonstrate that DYCOS improves classification accuracy over NetKit by 7.18% to 17.44%, while reducing the runtime to only 14.60% to 18.95% of NetKit's.

Experimental Results

[Figure: DYCOS vs. NetKit on CORA: classification accuracy comparison and classification time comparison.]

Experimental Results

[Figure: parameter sensitivity of DYCOS on the CORA and DBLP data: sensitivity to m, l, and h (a=30, p_s=70%), and sensitivity to a, m, and p_s (l=3, h=5).]

Experimental Results

[Figure: dynamic updating time on CORA and on DBLP.]

Outline
  • Related Work
  • DYCOS: DYnamic Classification algorithm with cOntent and Structure
    • Semi-bipartite content-structure transformation
    • Classification using a series of text and link-based random walks
    • Accuracy analysis
  • Experiments
    • NetKit-SRL
  • Conclusion
Conclusion
  • We propose an efficient, dynamic and scalable method for node classification in dynamic networks.
  • We provide an analysis of how accurate the proposed method is expected to be in practice.
  • We present experimental results on real data sets, and show that our algorithms are more effective and efficient than competing algorithms.
Experimental Results

[Figure: classification accuracy comparison and classification time comparison on DBLP.]

Experimental Results

[Figure: additional parameter sensitivity results: sensitivity to m, l, and h; to a, l, and h; to m, a, and p_s; and to a, m, and p_s.]
