This study explores clustering networked data with active learning techniques, focusing on Newman's Modularity and Collective Classification (ICA). It covers the ALFNET and CLAL frameworks for learning effectively from unlabeled instances. A series of experiments demonstrates improvements in classification accuracy obtained by optimizing the clustering process. The outline covers active learning principles, networked data analysis, and the experimental results that show the effectiveness of the proposed method.
Clustering networked data based on link and similarity in Active Learning
Advisor: Sing Ling Lee
Student: Yi Ming Chang
Speaker: Yi Ming Chang
Outline
• Introduction
  • Active Learning
  • Networked data
• Related Work
  • Newman's Modularity
  • Collective Classification (ICA)
  • ALFNET
• CLAL
• Experimental Results
• Conclusion
Passive Learning
[Figure: a classifier is trained on a randomly labeled training set and then classifies the testing data; in this example it gets 5 test instances wrong. Legend: unlabeled instance, labeled instance.]
Active Learning
[Figure: the learner trains a classifier on the labeled nodes, queries a batch of informative unlabeled nodes (query batch size = 3), retrains, and classifies the testing data; in this example it gets only 2 test instances wrong. Legend: unlabeled node, labeled node.]
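The query-retrain loop in the slide can be sketched in a few lines. This is a minimal illustration, not the thesis's method: the toy "classifier" (a 1-D threshold at the mean of the labeled points) and all names here are our own, and uncertainty sampling stands in for whatever query strategy the deck later develops.

```python
def train(labeled):
    """Fit the toy model: a threshold at the mean of the labeled points."""
    xs = [x for x, _ in labeled]
    return sum(xs) / len(xs)

def predict_proba(threshold, x):
    """Crude P(class '+'), rising with distance above the threshold."""
    return min(max(0.5 + (x - threshold) / 10.0, 0.0), 1.0)

def active_learn(pool, oracle, seed, budget, batch=3):
    """Query the `batch` most uncertain points each round until `budget`
    labels are spent, retraining after every round (as in the slide)."""
    labeled, pool = list(seed), list(pool)
    while len(labeled) < budget and pool:
        model = train(labeled)
        # most uncertain = predicted probability closest to 0.5
        pool.sort(key=lambda x: abs(predict_proba(model, x) - 0.5))
        for x in pool[:batch]:
            labeled.append((x, oracle(x)))  # ask the oracle for a label
        pool = pool[batch:]
    return train(labeled)
```

With an oracle that labels points above 5 as '+', the learned threshold settles near the true boundary while labeling only part of the pool.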
Network data
[Figure: a graph of labeled and unlabeled nodes; the classifier is trained on the labeled nodes and classifies the unlabeled ones, so both node features and link structure are available. Legend: unlabeled node, labeled node.]
Newman’s Modularity for clustering
[Figure: a 5-node example graph with m = 5 edges. Legend: A_ij = real edge, k_i = degree of node, c_i = group of node.]
Each within-group pair contributes A_ij − k_i·k_j / 2m; with 2m = 10:
(1 − 2·2/10)
(0 − 2·2/10)
(1 − 2·3/10)
(0 − 2·1/10)
Newman’s Modularity for clustering
Example (2m = 16), pair terms A_ij − k_i·k_j / 2m for nodes 1, 2, 3:
(1 − 5·2/16) = 0.375
(0 − 5·3/16) = −0.9375
(1 − 2·5/16) = 0.375
(1 − 2·3/16) = 0.625
(0 − 3·5/16) = −0.9375
(1 − 3·2/16) = 0.625
Since 0.625 + 0.625 > 0.375 + 0.375, grouping the pair with the larger terms gives the higher modularity.
Newman’s Modularity for clustering
Maximizing Q = (1/2m) Σ_ij (A_ij − k_i·k_j / 2m) δ(c_i, c_j)
[Figure: candidate groupings scored by their modularity terms, e.g. 1, 1, −1, 0.3, 0.1, −0.5; the grouping with the highest score is taken.]
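The per-pair terms on the slides are instances of Newman's modularity; a small self-contained implementation (our own sketch, using the standard formula) scores a whole partition at once:

```python
def modularity(edges, community):
    """Newman's modularity Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * delta(c_i, c_j)
    for an undirected graph given as a list of edges and a node -> group map."""
    m = len(edges)
    deg, adj = {}, {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    q = 0.0
    for i in deg:
        for j in deg:
            if community[i] != community[j]:
                continue  # delta(c_i, c_j) = 0: pair contributes nothing
            a_ij = 1.0 if j in adj.get(i, ()) else 0.0
            q += a_ij - deg[i] * deg[j] / (2.0 * m)
    return q / (2.0 * m)
```

For two triangles joined by a single bridge edge, splitting along the bridge scores Q = 5/14 ≈ 0.357, while putting everything in one group scores 0, so maximizing Q recovers the natural clusters.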
Collective Classification (ICA)
Iterative Classification Algorithm (ICA):
• Train a Content-Only (CO) learner on the local feature vectors (e.g. 1 0 0 1 0 … 1) of the labeled nodes.
• Use CO to predict the unlabeled nodes, then compute each node's neighbor features (the label proportions among its neighbors, e.g. 3/5, 2/5).
• Train the collective (CC) learner on local + neighbor features.
• Iterate: recompute the neighbor features from the current CC predictions and re-classify (iteration 1, 2, 3, …) until the labels are stable or a threshold of iterations has elapsed.
CC problem
How to set the threshold?
[Figure: a small graph with labeled and unlabeled nodes 1–3; the inferred neighbor features oscillate across iterations 1–5 (e.g. 2/5, 3/5 → 1/1, 0/1 → 0/1, 1/1 → …), so different stopping iterations give different labels.]
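The ICA loop sketched above fits in a short function. This is our simplification: the CO and CC learners are passed in as plain prediction functions rather than trained models, and neighbor features are raw label counts.

```python
def ica(graph, features, labeled, co_predict, cc_predict, max_iter=10):
    """Iterative Classification Algorithm (ICA) sketch.
    graph: {node: [neighbors]}; labeled: {node: label}, kept fixed.
    co_predict(features[n]) -> label                    (content-only bootstrap)
    cc_predict(features[n], neighbor_counts) -> label   (collective step)"""
    labels = dict(labeled)
    unlabeled = [n for n in graph if n not in labeled]
    # bootstrap: content-only predictions for the unlabeled nodes
    for n in unlabeled:
        labels[n] = co_predict(features[n])
    for _ in range(max_iter):
        changed = False
        for n in unlabeled:
            counts = {}
            for nb in graph[n]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            new = cc_predict(features[n], counts)
            if new != labels[n]:
                labels[n] = new
                changed = True
        if not changed:  # stable: stop before the iteration threshold
            break
    return labels
```

On a path graph with one labeled endpoint, a simple majority-vote CC propagates the label down the chain in two iterations and then stops because nothing changes.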
ALFNET
1. Cluster the data into at least k clusters.
2. Pick k clusters based on size and initialize the Content-Only (CO) classifier (an SVM).
ALFNET
3. While (labeled nodes < budget):
3.1 Re-train the CO and CC classifiers on the training set.
3.2 Pick k clusters based on score.
3.2 Pick an item from each cluster, add it to the training set, and train CO and CC.
ALFNET score
The CO, CC, and combined (Main) classifiers each predict a category (Class A–D); a cluster is scored by the entropy of the proportions of predicted classes, where entropy(p) = −p·ln p:
• all three predict differently: entropy(1/3) · 3 = 0.3662 · 3
• two agree: entropy(2/3) + entropy(1/3) = 0.2703 + 0.3662
• all three agree: entropy(3/3) = 0
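The slide's numbers follow from entropy(p) = −p·ln p; a sketch (our own) that reproduces them from the three classifiers' predictions:

```python
import math

def disagreement_score(predictions):
    """Entropy of the class proportions among classifier predictions
    (here: the CO, CC, and combined predictions for one node). Full
    disagreement scores highest; unanimous agreement scores 0."""
    n = len(predictions)
    counts = {}
    for p in predictions:
        counts[p] = counts.get(p, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

disagreement_score(['A', 'B', 'C']) gives 0.3662·3 ≈ 1.0986, ['A', 'A', 'B'] gives 0.2703 + 0.3662 ≈ 0.6365, and ['A', 'A', 'A'] gives 0, matching the slide; clusters where the classifiers disagree most are the most informative to query.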
Modularity and Similarity
Example attribute vectors:
Node 1: 1 1 0 0
Node 2: 1 0 0 0
Node 3: 1 1 0 0
Node 4: 0 0 1 1
Maximum Q
Maximize the modularity objective Q, now taking attribute similarity into account alongside the links.
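The extracted slides do not show the exact link/similarity combination the thesis uses, so the blend below is purely our assumption: it scores a within-cluster pair by mixing the modularity term with the cosine similarity of the attribute vectors from the example above.

```python
def cosine(u, v):
    """Cosine similarity of two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def pair_score(a_ij, ki, kj, two_m, sim, alpha=0.5):
    """Hypothetical blend (our assumption, not the thesis's formula):
    alpha * modularity term + (1 - alpha) * attribute similarity."""
    return alpha * (a_ij - ki * kj / two_m) + (1 - alpha) * sim
```

On the slide's vectors, nodes 1 and 3 (identical attributes) score similarity 1, nodes 1 and 4 (disjoint attributes) score 0, so attribute-similar pairs are pulled into the same cluster even when the link evidence is equal.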
CLAL
[Figure: cluster the data; train a CO classifier per cluster; each CO queries and classifies within its cluster, repeating until the number of labeled nodes exceeds the budget. Legend: labeled node, unlabeled node.]
Tuning and greedy mechanism
[Figure: nodes whose out-links exceed their in-links are moved between clusters, with moving priority given by the out-link − in-link gap (3 → 2 → 1 → 1); after moving, the COs are retrained and the better-performing COs are kept. Clustering priority: low-accuracy clusters before high-accuracy ones. Legend: labeled node, unlabeled node.]
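A sketch of the slide's out-link/in-link move rule, as we read the figure: a node whose links crossing its cluster boundary outnumber its links inside the cluster becomes a move candidate, and candidates are ordered by the size of the gap (the slide's 3 → 2 → 1 priority).

```python
def outlink_inlink_gap(graph, cluster_of, node):
    """Out-links minus in-links of `node` with respect to its own cluster.
    graph: {node: [neighbors]}; cluster_of: {node: cluster id}."""
    inl = sum(1 for nb in graph[node] if cluster_of[nb] == cluster_of[node])
    out = len(graph[node]) - inl
    return out - inl

def move_candidates(graph, cluster_of):
    """Nodes with more out-links than in-links, highest gap first."""
    cands = [n for n in graph if outlink_inlink_gap(graph, cluster_of, n) > 0]
    cands.sort(key=lambda n: outlink_inlink_gap(graph, cluster_of, n), reverse=True)
    return cands
```

In a graph of two triangles where one node is assigned to the wrong side, only the misplaced node has a positive gap, so it is the one proposed for moving.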
Background
Networked data:
• Social network: each node is a person, links connect friends, and the node's attributes are its features (feature 1 … feature n).
• Citation network: each node is a paper (Paper No.), links are citations, and the node's attributes are its words (word 1 … word n).
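Both node types above can be held in plain adjacency and attribute maps; a minimal sketch of the citation case, with made-up paper IDs and words for illustration:

```python
# A citation network as two dicts: citation links and bag-of-words attributes.
cites = {
    "P1": ["P2", "P3"],  # P1 cites P2 and P3
    "P2": ["P3"],
    "P3": [],
}
words = {
    "P1": {"svm": 3, "margin": 1},
    "P2": {"svm": 1, "cluster": 2},
    "P3": {"cluster": 4},
}

def neighbors(paper):
    """Papers linked to `paper` by a citation in either direction,
    since collective methods use links regardless of who cited whom."""
    out = set(cites[paper])
    out |= {p for p, refs in cites.items() if paper in refs}
    return out
```

A collective classifier would combine words[n] (the local features) with the labels of neighbors(n) (the link features).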
SVM
[Figure: training data separated by a hyperplane, with the positive and negative classes on opposite sides; the margin is the distance from the hyperplane to the nearest points on each side.]
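The margin in the figure can be computed directly. This sketch (our own, not an SVM solver) scores a candidate hyperplane w·x + b = 0 by its minimum geometric margin, which is the quantity an SVM maximizes when choosing among separating hyperplanes:

```python
def min_margin(w, b, points):
    """Minimum geometric margin of the hyperplane w.x + b = 0 over
    (x, y) pairs with y in {+1, -1}. Positive iff it separates the data."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in points)
```

Comparing two separating hyperplanes on the same points shows why the axis-aligned one in our example is preferred: it leaves a wider margin to the nearest points.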
Challenge
Query efficiency from discriminative features. Example word counts for a paper:
word      Class 1   Class 2   Sum of 2 classes
word 1      250       260       510
word 2      220       180       400
word 3      100       150       250
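One way to read the table: a word spread almost evenly across classes (like the 250/260 split) says little about the label, while a skewed word is worth querying. A sketch of such a scoring (our own choice of measure, not the thesis's): 1 minus the normalized entropy of each word's class distribution.

```python
import math

def word_discriminativeness(class_counts):
    """Score each word by how unevenly it is spread across classes:
    0 = perfectly uniform, 1 = appears in a single class only.
    class_counts: {class: {word: count}}."""
    words = set()
    for counts in class_counts.values():
        words |= set(counts)
    k = len(class_counts)
    scores = {}
    for w in words:
        cs = [class_counts[c].get(w, 0) for c in class_counts]
        total = sum(cs)
        ent = -sum((c / total) * math.log(c / total) for c in cs if c)
        scores[w] = 1.0 - ent / math.log(k)
    return scores
```

A word occurring only in one class scores 1.0; a 50/50 word scores 0, so querying can favor nodes rich in high-scoring words.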
CC problem: how to set the terminal condition?
Different iterations obtain diverse results.
[Figure: the CC classifier infers neighbor features (NF_A, NF_B) from local features (F1, F2, …, e.g. 0, 1, 0, …); the inferred proportions shift between iterations (e.g. 3/5, 2/5 and 2/3, 1/3 in iteration 1, then 4/5, 1/5 and 1/3, 2/3 in iteration 2), so predicted labels flip between A and B. Legend: CO-predicted label, true label, labeled node.]
ALFNET flow
[Flowchart: query and train the CO classifier; compute scores; query and train the classifiers; repeat while the iteration count is below the threshold; once the number of labeled nodes exceeds the budget, output the result.]
Representation and Challenge
In a citation network, papers are nodes connected by citation links; the challenge is how to use this link information.