KliqueFinder: Identifying Clusters in Network Data

Kenneth A. Frank
Michigan State University

Based on:
• Frank, K.A. (1995). Identifying cohesive subgroups. Social Networks 17: 27-56.
• Frank, K.A. (1996). Mapping interactions within and between cohesive subgroups. Social Networks 18: 93-119.
• *Field, S., *Frank, K.A., Schiller, K., Riegle-Crumb, C., and Muller, C. (2006). Identifying social contexts in affiliation networks: Preserving the duality of people and events. Social Networks 28: 97-123. (* co-first authors)
• https://www.msu.edu/user/k/e/kenfrank/web/research.htm#representation

Overview
• Clustering and Graphical Representations of Networks
• Running KliqueFinder...
• Step 1) Criteria for Determining Group Membership
• Step 2) Maximizing Criterion
• Step 3) Examine evidence of clusters
• Step 4) Evaluating the Performance of the Algorithm: Did...
• Make Sociogram in Netdraw
• Confidentiality/Ethical issues in Collecting Network Data
• Modifying the Image: Adding Node Data or Relations...
• Two mode
• Software Challenge...
• Batch KliqueFinder
• Prepping and Converting data
• A Priori Clusters

Clustering and Graphical Representations of Networks
Video: (26:09-31:41); ID: kenfrank@msu.edu PW: kenfrank2014
Goal: to identify patterns in the network
• Rearrange rows and columns of the social network matrix to reveal clustering
• Plot actors and ties in two dimensions to reveal clustering

Theory for defining cluster membership
• Cohesion (clusters are called subgroups): an actor should be in a cluster if the actor has demonstrated a preference for engaging in ties with members of the cluster.
  Result: ties are concentrated within subgroups.
• Structural equivalence (blocks): an actor should be in a cluster if the actor engages in a similar pattern of ties as members of that cluster.
  Result: blocks represent positions, but ties are not necessarily concentrated within blocks.

Crystallized Sociogram: Friendships Among the French Financial Elite
Lines indicate friendships: solid within subgroups, dotted between subgroups. Numbers represent actors. Rgt, Cen, Soc, Non = political parties; B = banker; T = treasury; E = École Nationale d'Administration.
Frank, K.A. & Yasumoto, J. (1998). "Linking Action to Social Structure within a System: Social Capital Within and Between Subgroups." American Journal of Sociology 104(3): 642-686.

Crystallized Sociogram: Clusters in Food Webs
Krause, A., Frank, K.A., Mason, D.M., Ulanowicz, R.E., and Taylor, W.W. (2003). "Compartments exposed in food-web structure." Nature 426: 282-285.

Data Input
• File name must be fewer than 20 characters. Best if the file name is six characters followed by .list: xxxxxx.list. For example, stanne.list
• Example row: actor 1 interacts with actor 2 at a level of 3
• Extent of relation can be binary or weighted
• IDs should be 6 digits or fewer
• New format: flexible columns. Old format: 10 spaces for each entry. Both give the same results
• See also: Prepping data in Excel; Prepping data in UCINET; Converting data using SAS

Data Edgelist
• The first two rows do not appear in the data; I put them there to show the format: 10 spaces for each entry
• Example row: actor 1 interacts with actor 2 at a level of 3
• Extent of relation can be binary or weighted
• Best if the file name is six characters followed by .list: xxxxxx.list. For example, stanne.list
• The new version of KliqueFinder is more flexible about the 10-column widths
• IDs should be 6 digits or fewer
• See also: Prepping data in Excel; Prepping data in UCINET; Converting data using SAS
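The fixed-width layout described above can be sketched in Python. This is a minimal illustration, not part of KliqueFinder: the function name `format_edgelist` is hypothetical, and right-justifying each entry within its 10 spaces is an assumption, since the slide only says each entry occupies 10 spaces.

```python
def format_edgelist(edges, width=10):
    """Format (sender, receiver, weight) triples as fixed-width rows.

    Assumption: entries are right-justified within `width` spaces.
    The slide states only that each entry takes 10 spaces.
    """
    lines = []
    for sender, receiver, weight in edges:
        # The slides advise IDs of 6 digits or fewer.
        if len(str(sender)) > 6 or len(str(receiver)) > 6:
            raise ValueError("IDs should be 6 digits or fewer")
        lines.append(f"{sender:>{width}}{receiver:>{width}}{weight:>{width}}")
    return "\n".join(lines) + "\n"


# "Actor 1 interacts with actor 2 at a level of 3", plus two more rows.
text = format_edgelist([(1, 2, 3), (2, 1, 4), (15, 24, 1)])
print(text, end="")
```

Writing `text` to a file such as stanne.list would follow the naming advice on the slide (six characters followed by .list).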

Steps for finding clusters
Video: (31:41-43:30)
1) Determine criterion for defining clusters
2) Maximize criterion
3) Examine evidence of clusters
4) Evaluate performance of the algorithm
5) Interpret clusters
   • commonality of attributes
   • focal experiences
   • subsequent behavior

Step 1) Criteria for Determining Group Membership
Structural equivalence:
• Factor analyze the sociomatrix (Katz & Kahn)
• Iteratively rearrange and revalue rows and columns (CONCOR; White et al., 1976)
Cohesion:
• Utilize fixed criteria (e.g., must be connected to at least k others in the cluster, or must be within a minimal path length of k others, etc.)
• Use a flexible criterion: preference relative to group sizes and number of ties

Model-Based Cohesion
• Wii' = 1 if there is a tie between actors i and i', 0 otherwise.
• samegroupii' = 1 if actors i and i' are members of the same subgroup, 0 otherwise.
• Then θ1 represents subgroup salience.
So: maximize θ1 (odds ratio).

Odds Ratio for Association Between Common Subgroup Membership and The Occurrence of Ties Between Actors
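As a rough illustration of this association, the sketch below cross-classifies ordered pairs of actors by common subgroup membership and tie occurrence, then computes the odds ratio AD/(BC). The function names and the small network are hypothetical. Two points are assumptions: weighted ties are treated as present whenever the cell is greater than zero, and θ1 is taken as half the log odds ratio, following the note "θ1=Log odds/2" on the Monte Carlo slide.

```python
import math


def tie_by_group_table(W, groups):
    """Count ordered pairs (i, i') in a 2x2 table.

    A = different subgroup, no tie    B = different subgroup, tie
    C = same subgroup, no tie         D = same subgroup, tie
    Assumption: any positive cell in W counts as a tie.
    """
    n = len(W)
    A = B = C = D = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip the diagonal
            same = groups[i] == groups[j]
            tie = W[i][j] > 0
            if same and tie:
                D += 1
            elif same:
                C += 1
            elif tie:
                B += 1
            else:
                A += 1
    return A, B, C, D


def theta1(W, groups):
    """Return (odds ratio, theta1), with theta1 = log(odds ratio) / 2.

    Raises ZeroDivisionError if cell B or C is empty.
    """
    A, B, C, D = tie_by_group_table(W, groups)
    odds_ratio = (A * D) / (B * C)
    return odds_ratio, math.log(odds_ratio) / 2


# Hypothetical 6-actor network with two subgroups of three.
groups = [0, 0, 0, 1, 1, 1]
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0],
]
table = tie_by_group_table(W, groups)
odds_ratio, theta = theta1(W, groups)
print(table, odds_ratio, round(theta, 4))
```

Here 10 of the 12 within-subgroup pairs have ties but only 2 of the 18 between-subgroup pairs do, so the odds ratio is large.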

Step 2) Maximizing Criterion
1) Find a subgroup seed (3 actors who interact with each other, and with similar others)
2) Add to the cluster to maximize θ1 until you cannot do any more
3) Start a new subgroup with a new seed
4) Shuffle between existing subgroups
5) Make new subgroups as necessary; dissolve existing ones as necessary

KliqueFinder Algorithm: Phase I
Computationally intensive; modify for large networks.
• Initialize: assign each actor to its own subgroup
• Find a subgroup seed of 2 or 3 actors. For finding the best subgroup seed: 1) can only choose from unaffiliated actors; 2) each actor can only be a seed once
• Identify the single move that most increases the objective function θ1
• Does the move increase the function? (No / Yes)
• If yes: reassign the actor that makes the best move. If the assignment moves an actor out of a group of 3, reassign the remaining 2 to their next best groups

KliqueFinder Algorithm: Phases II and III
• Phase II: if the best move does not increase the objective function and there are fewer than 3 actors available for subgroups, then
  • attach all isolated (or singleton) actors to the best existing subgroups, even if this reduces the objective function
• Phase III: shuffle actors between existing subgroups without seeding new ones or disbanding existing ones
  • the number of subgroups is fixed
• This is simple hill climbing and can be cast as an EM algorithm
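The hill-climbing idea in Phase III can be sketched as follows. This is a simplified illustration under stated assumptions, not the KliqueFinder implementation: it omits seeding (Phase I) and the attachment of singletons (Phase II), it adds 0.5 to each cell of the 2x2 table to avoid division by zero, and the names `half_log_odds` and `hill_climb` are hypothetical.

```python
import math


def half_log_odds(W, groups):
    """Objective: half the log odds ratio of ties by common membership.

    Assumption: 0.5 is added to every cell so empty cells do not
    cause a division by zero.
    """
    A = B = C = D = 0.5
    n = len(W)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            same = groups[i] == groups[j]
            tie = W[i][j] > 0
            if same and tie:
                D += 1
            elif same:
                C += 1
            elif tie:
                B += 1
            else:
                A += 1
    return math.log((A * D) / (B * C)) / 2


def hill_climb(W, groups, max_sweeps=100):
    """Repeatedly apply the single reassignment that most increases
    the objective, keeping the number of subgroups fixed."""
    groups = list(groups)
    labels = sorted(set(groups))
    best = half_log_odds(W, groups)
    for _ in range(max_sweeps):
        best_move = None
        for i in range(len(groups)):
            old = groups[i]
            if groups.count(old) <= 1:
                continue  # do not disband an existing subgroup
            for g in labels:
                if g == old:
                    continue
                groups[i] = g  # try the move
                value = half_log_odds(W, groups)
                if value > best + 1e-12:
                    best, best_move = value, (i, g)
            groups[i] = old  # undo the trial moves
        if best_move is None:
            break  # no single move improves the objective
        groups[best_move[0]] = best_move[1]
    return groups, best


# Hypothetical network: two triads (0,1,2) and (3,4,5) joined by tie 2-3.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
start = [0, 0, 1, 1, 1, 1]  # actor 2 starts in the wrong subgroup
final, value = hill_climb(W, start)
print(final, round(value, 4))
```

Because each step only accepts an improving move, the procedure can stop at a local maximum, which is one reason the slides evaluate recovery of known subgroups separately.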

Running KliqueFinder
Video: (43:30-1:01:00)
• Download KliqueFinder at http://hlmsoft.net/wkf/
• Follow the instructions to install. Put it in c:\kliqfind
• Mac users: VMware Fusion, Windows 7, 32 bit: http://store.vmware.com/store/vmware/pd/productID.165310200/Currency.USD/
• Click on the “Browse…” button to specify the directory where the data file is located.

KliqueFinder • Choose “Basic setup” and then click “Run setup file” button.

KliqueFinder • Click on the “Browse” button to choose a data file.

Run Analysis Data file

New Version of Data Input Is More Flexible
• File name must be fewer than 20 characters
• IDs should be 6 digits or fewer
• Example row: actor 1 interacts with actor 2 at a level of 3
• Extent of relation can be binary or weighted
• New format: flexible columns. Old format: 10 spaces for each entry. Both give the same results
• See also: Prepping data in Excel; Prepping data in UCINET; Converting data using SAS

View Clusters Output

Blocked Network Data

 N  Group And Actor Id
          24|AAAA|BBBBBB|CCCCCCCC|DDDDDD|
            |    |      |        |      |
            | 2 1|221  1|  11   2|111122|
    Group ID|7445|612214|98133560|796037|
------------+----+------+--------+------+
   1 A     7|A213|......|........|...1..|
   1 A    24|4A3.|......|.4......|......|
   1 A     4|33A.|......|........|......|
   1 A    15|433A|......|........|......|
------------+----+------+--------+------+
   2 B    26|.2..|B443..|........|......|
   2 B    21|.1..|4B....|...4....|....2.|
   2 B    12|....|4.B...|........|......|
   2 B     2|....|33.B..|........|...1..|
   2 B     1|..3.|3..3B.|........|.3..2.|
   2 B    14|....|....1B|........|......|
------------+----+------+--------+------+
   3 C     9|....|......|C...3.33|.3....|
   3 C     8|.4..|..4...|.C.4..4.|4.....|
   3 C    11|....|......|33C.4.3.|..4...|
   3 C    13|.4..|.4....|444C....|......|
   3 C     3|3...|.4....|4.44C...|......|
   3 C     5|.1..|.....4|3.2.3C..|......|
   3 C     6|....|......|444..4C4|......|
   3 C    20|....|......|3..3.44C|......|
------------+----+------+--------+------+
   4 D    17|.1..|......|.1......|D.1...|
   4 D    19|....|......|4.3.....|3D4...|
   4 D    16|....|......|4..4...4|44D...|
   4 D    10|..3.|...1..|........|...D3.|
   4 D    23|....|.3....|........|.343D.|
   4 D    27|.1..|.1....|........|.3..3D|

θ1 = 1.1738

Step 3) Examine evidence of clusters
1) Randomly redistribute ties
2) Apply the algorithm
3) Record the value of the odds ratio and θ1
4) Repeat 1000 times to generate a distribution
5) Use the mean of the distribution as the baseline for comparison
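The loop in steps 1 through 5 can be sketched in Python. This is an illustration only, and several parts are assumptions: ties are redistributed by shuffling each actor's outgoing ties among the other actors (so each actor keeps the same number and strength of nominations), and `within_share` is a simple stand-in statistic used in place of running the full clustering algorithm and recording θ1. All function names are hypothetical.

```python
import random
import statistics


def redistribute_ties(W, rng):
    """Return a copy of W with each row's off-diagonal entries shuffled.

    Assumption: redistribution preserves each actor's outgoing ties
    and weights; the diagonal is assumed to be zero.
    """
    n = len(W)
    shuffled = []
    for i, row in enumerate(W):
        values = [row[j] for j in range(n) if j != i]
        rng.shuffle(values)
        values.insert(i, 0)  # restore the empty diagonal cell
        shuffled.append(values)
    return shuffled


def baseline_distribution(W, fit, reps=1000, seed=0):
    """Apply `fit` to `reps` randomized copies of W and collect results."""
    rng = random.Random(seed)
    return [fit(redistribute_ties(W, rng)) for _ in range(reps)]


# Hypothetical data and a stand-in for "apply algorithm, record theta1".
groups = [0, 0, 0, 1, 1, 1]
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]


def within_share(M):
    """Share of ties that fall within the fixed subgroups above."""
    ties = [(i, j) for i in range(len(M)) for j in range(len(M)) if M[i][j] > 0]
    return sum(groups[i] == groups[j] for i, j in ties) / len(ties)


dist = baseline_distribution(W, within_share, reps=200, seed=1)
baseline = statistics.mean(dist)
print(round(within_share(W), 3), round(baseline, 3))
```

In an actual analysis, `fit` would rerun the subgroup search on each randomized network, so that the baseline reflects how much clustering the algorithm finds when ties are placed without regard for subgroups.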

Randomly Redistributing Ties

Apply Algorithm to Random Data, θ1=.81822

Monte Carlo Sampling Distribution
Video: (1:06:35-1:18:50)
• Set up sampling
• Indicate simulate data
• Data can include weights
• Output is in sampdist.dat, containing θ1 (θ1 = log odds/2) and the odds ratio
• Remember to do the “new data” setup when done, to prepare for the next analysis

Code for Reading in Sample Distribution Data

SPSS
    GET DATA
      /TYPE=TXT
      /FILE="C:\KLIQFIND\sampdist.dat"
      /FIXCASE=1
      /ARRANGEMENT=FIXED
      /FIRSTCASE=1
      /IMPORTCASE=ALL
      /VARIABLES=
      /1 theta1 0-29 F30.10
      oddsratio 30-59 F30.10
      samplesize 60-89 F30.10.
    CACHE.
    EXECUTE.
    DATASET NAME DataSet9 WINDOW=FRONT.
    DATASET ACTIVATE DataSet9.
    GRAPH /HISTOGRAM=theta1.

SAS
    title "Sampling distribution for theta1";
    data one;
      infile "sampdist.dat" missover;
      input theta1 odds1;
    proc univariate plot;
      var theta1;
    run;

Stata
    *This command imports the data file
    import delimited C:\KLIQFIND\sampdist.dat, delimiter(" ", asstring)
    *These commands perform data management:
    drop v1
    rename v2 theta1
    rename v3 oddsratio
    rename v4 samplesize
    *This command plots histogram for theta1:
    hist theta1, freq

Comparison of Sampling Distributions

Distribution of θ1base From Application of the Algorithm to Data Simulated Without Regard for Subgroup Membership Observed value: 1.1738

Sampling Distribution Parameters
• Edit the simulation parameters. The first element is the number of replications
• The number of replications must stay within the first 5 columns

Approximate p-value Based on Previous Simulations

PREDICTED THETA (1 base) BASED ON SIMULATIONS.
VALUE BASED ON UNWEIGHTED DATA.                    0.76985
ESTIMATE OF THETA (1 subgroup processes)           0.40397
  (total - predicted = evidence of groups: 1.1738 - .76985 = .40397)
THE TOTAL THETA1 IS:                               1.1738

APPROXIMATE TEST OF CONCENTRATION OF TIES WITHIN SUBGROUPS
BASED ON SIZE OF THETA1 subgroup processes:

THETA1    |         |
SUBGROUP  | APPROX  | APPROX
PROCESSES | LRT     | P-VALUE
   0.40      34.82     0.00

Reject the null hypothesis of no clusters: H0: θ1 subgroup processes = 0

Step 4) Evaluating the Performance of the Algorithm : Did the Algorithm Recover the Correct Subgroups? • Many algorithms search for optimal subgroups. KliqueFinder does not, but how different are the subgroups it finds from the optimal or known subgroups?

Output for Recovery of Subgroups

PREDICTED ACCURACY: LOG ODDS OF COMMON SUBGROUP MEMBERSHIP,
+ OR - .5734 (FOR A 95% CI)                        1.4989

The log odds applies to the following table:

                     OBSERVED SUBGROUP
                    DIFFERENT   SAME
                   ___________________
                   |        |        |
         DIFFERENT |   A    |   B    |
KNOWN              |        |        |
SUBGROUP           |--------|--------|
                   |        |        |
         SAME      |   C    |   D    |
                   |        |        |
                   -------------------

THE LOGODDS TRANSLATES TO AN ODDS RATIO OF 4.4766, WHICH INDICATES THE INCREASE IN THE ODDS THAT KLIQUEFINDER WILL ASSIGN TWO ACTORS TO THE SAME SUBGROUP IF THEY ARE TRULY IN THE SAME SUBGROUP.

The specific accuracy for a given data set is not known; the results are predicted from thousands of simulations (see next slide).

Odds of Recovery (Toy Example)
Simulated data with known subgroups, compared with the observed subgroups identified by KliqueFinder.

                     OBSERVED SUBGROUP
                    DIFFERENT   SAME
                   ___________________
                   |        |        |
         DIFFERENT | A (6)  | B (3)  |
KNOWN              |        |        |
SUBGROUP           |--------|--------|
                   |        |        |
         SAME      | C (2)  | D (4)  |
                   |        |        |
                   -------------------

• Cell A: 6 pairs correctly assigned to different subgroups: (1,5; 2,5; 3,5; 1,6; 2,6; 3,6)
• Cell D: 4 pairs correctly assigned to the same subgroup: (1,2; 1,3; 2,3; 5,6)
• Misassignment of actor 4 contributes 3 to cell B and 2 to cell C
Odds of recovery = (AD)/(BC) = 6x4/(3x2) = 4.00
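The toy table can be reproduced by classifying every pair of actors. In the sketch below, the specific assignments (known subgroups {1,2,3} and {4,5,6}; observed subgroups {1,2,3,4} and {5,6}) are inferred from the pairs listed on the slide rather than stated there, and the function name `recovery_table` is hypothetical.

```python
from itertools import combinations


def recovery_table(known, observed):
    """Classify each unordered pair of actors into cells A, B, C, D.

    A = known different, observed different
    B = known different, observed same
    C = known same, observed different
    D = known same, observed same
    """
    A = B = C = D = 0
    for i, j in combinations(sorted(known), 2):
        same_known = known[i] == known[j]
        same_observed = observed[i] == observed[j]
        if same_known and same_observed:
            D += 1
        elif same_known:
            C += 1
        elif same_observed:
            B += 1
        else:
            A += 1
    return A, B, C, D


# Assignments inferred from the slide: actor 4 is the one misassigned.
known = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
observed = {1: "x", 2: "x", 3: "x", 4: "x", 5: "y", 6: "y"}

A, B, C, D = recovery_table(known, observed)
odds_of_recovery = (A * D) / (B * C)
print((A, B, C, D), odds_of_recovery)  # (6, 3, 2, 4) 4.0
```

The labels "x" and "y" are arbitrary: only whether two actors share a label matters, so the table is unaffected by how subgroups are named.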

Make Sociogram in Netdraw
Video: (1:01:00-1:06:22)

Sometimes Netdraw can’t find the file: retrieve it manually

Modifying Image in Netdraw

Density = 4/(4x8) = 1/8
KliqueFinder uses Density = 4/(4x5) = .20 because the maximum number of nominations is 5.

 N  Group And Actor Id
          24|AAAA|BBBBBB|CCCCCCCC|DDDDDD|
            |    |      |        |      |
            | 2 1|221  1|  11   2|111122|
    Group ID|7445|612214|98133560|796037|
------------+----+------+--------+------+
   1 A     7|A213|......|........|...1..|
   1 A    24|4A3.|......|.4......|......|
   1 A     4|33A.|......|........|......|
   1 A    15|433A|......|........|......|
------------+----+------+--------+------+
   2 B    26|.2..|B443..|........|......|
   2 B    21|.1..|4B....|...4....|....2.|
   2 B    12|....|4.B...|........|......|
   2 B     2|....|33.B..|........|...1..|
   2 B     1|..3.|3..3B.|........|.3..2.|
   2 B    14|....|....1B|........|......|
------------+----+------+--------+------+
   3 C     9|....|......|C...3.33|.3....|
   3 C     8|.4..|..4...|.C.4..4.|4.....|
   3 C    11|....|......|33C.4.3.|..4...|
   3 C    13|.4..|.4....|444C....|......|
   3 C     3|3...|.4....|4.44C...|......|
   3 C     5|.1..|.....4|3.2.3C..|......|
   3 C     6|....|......|444..4C4|......|
   3 C    20|....|......|3..3.44C|......|
------------+----+------+--------+------+
   4 D    17|.1..|......|.1......|D.1...|
   4 D    19|....|......|4.3.....|3D4...|
   4 D    16|....|......|4..4...4|44D...|
   4 D    10|..3.|...1..|........|...D3.|
   4 D    23|....|.3....|........|.343D.|
   4 D    27|.1..|.1....|........|.3..3D|

Data used for multidimensional scaling within subgroups:
Distance = maximum value / cell entry. E.g., the maximum value is 4, so a tie of 2 → 4/2 = 2, a distance of 2.

DIRECT ASSOCIATIONS (in xxxxxx.clusters)
GROUP       1     2     3     4
LABEL       A     B     C     D
N           4     6     8     6
GROUP
  1      2.42  0.00  0.20  0.05
  2      0.25  1.07  0.13  0.27
  3      0.38  0.40  2.40  0.28
  4      0.21  0.17  0.67  1.17

Distance in multidimensional scaling between subgroups = maximum value / density
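The density and distance arithmetic on this slide can be checked with a short Python sketch. The function names are hypothetical. The slide states only that the denominator uses 5 because the maximum number of nominations is 5; capping the denominator at the number of available receivers when that is smaller than 5 (for example, 3 others within a subgroup of 4) is an assumption, inferred because it reproduces entries of the DIRECT ASSOCIATIONS table.

```python
def block_density(total, n_senders, n_receivers, max_nominations=5):
    """Density of ties from one subgroup to another.

    total: sum of tie values in the block
    n_receivers: actors available to be nominated (subgroup size,
        or size minus one for a within-subgroup block)
    Assumption: the denominator is capped at min(max_nominations,
    n_receivers), inferred from the table on the slide.
    """
    return total / (n_senders * min(max_nominations, n_receivers))


def tie_distance(value, max_value=4):
    """Distance within subgroups: maximum value / cell entry."""
    return max_value / value


def group_distance(density, max_value=4):
    """Distance between subgroups: maximum value / density."""
    return max_value / density


# Block A -> C: one tie of strength 4, 4 senders, 8 possible receivers.
print(block_density(4, 4, 8))             # 0.2, as on the slide
# Within subgroup A: tie values sum to 29 over 4 actors with 3 others each.
print(round(block_density(29, 4, 3), 2))  # 2.42
# A tie of 2 when the maximum value is 4.
print(tie_distance(2))                    # 2.0
```

Dividing by the density places subgroups with weak connections far apart in the scaling, but the expression is undefined for a density of 0.00 (such as group 1 to group 2), and the slide does not say how that case is handled.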

Cohesion and Structural Similarity
Frank, K.A. (1996). Mapping interactions within and between cohesive subgroups. Social Networks 18: 93-119.
Video: (1:19:15-1:23:40)

Choosing lines: Groups

Confidentiality/Ethical issues in Collecting Network Data
Video: (1:23:41-1:28)
• Need names on the survey
• Data can be confidential but not anonymous (especially for longitudinal studies)
• Breiger, R.L. “Ethical Dilemmas in Social Network Research: Introduction to Special Issue.” Social Networks 27 / 2 (2005): 89-93. Read it online: http://www.u.arizona.edu/~breiger/2005BreigerIntroEthics.pdf
• (All issues of Social Networks are available via ScienceDirect)
• Who benefits from network analysis? Who bears the cost?
  • Kadushin, Charles. “Who benefits from network analysis: ethics of social network research.” Social Networks 27 / 2 (2005): 139-153.
• Issues to raise when dealing with a Human Subjects Board:
  • Klovdahl, Alden S. “Social network research and human subjects protection: Towards more effective infectious disease control.” Social Networks 27 / 2 (2005): 119-137.
• Hint on Human Subjects boards: they like precedents. Once you have one network study accepted, refer to it when submitting others!
• https://www.msu.edu/~kenfrank/social%20network/irb%20with%20network%20data.htm

The SRI/KliqueFinder Solution to Confidentiality: Aggregate to Subgroups
1) Provide information about who is in which cluster as well as information regarding the resources embedded in each cluster. Resources could be information, expertise, material resources, etc.
   • Benefit: reveals the location of resources relative to social structure.
   • Protection: does not reveal specific responses because all information is at the cluster level.
2) Provide each respondent with a unique sociogram indicating where that person is located (“you are here”). The figure does not include the lines from a sociogram, so respondents cannot infer others’ responses.
   • Benefit: respondents can use this as a guide to individual behavior for identifying further resources or information.
   • Protection: specific responses of others are not revealed, so confidentiality is preserved.

Can even include names of actors Using subgroups for feedback to respondents and in a proposal

Choosing Lines: Actor Level Within

Choosing Lines: Actor Level Remove group nodes

Choosing Lines: Actor Level Between

Choosing Lines: Group Level

Modifying the Image: Adding Node Data or Relations
Video: (1:49:35-2:07:48)
• http://www.analytictech.com/ucinet/download.htm
• http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB0QFjAA&url=http%3A%2F%2Fwww.analytictech.com%2FNetdraw%2FNetdrawGuide.doc&ei=6pC4Tp29Men3sQLv99WoCA&usg=AFQjCNHg_NTjlHOclmeJkwQs2xRaiPYgXQ&sig2=WLwXKSjJq_Yinpfkwv0m4w
• http://faculty.ucr.edu/~hanneman/nettext/C4_netdraw.html#data

Files for KliqueFinder
• Input data: xxxxxx.list (network data), xxxxxx.ilabel (node data), xxxxxx.xnet (alternative network data)
• Parameters: Kliqfind.par, Printo, Simulate.par
• KliqueFinder output: xxxxxx.place (data containing actor IDs and subgroup placement); xxxxxx.clusters and xxxxxx.vna (diagnostics and matrix-formatted data for Netdraw)
