Complete Network Analysis Network Connections: Large-Scale network structure

The basic network hypothesis is that the structure of a network affects the likelihood that goods will flow through the network. While direct measures are fine for smaller networks, we often want to generalize to very large-scale network structure. The next section covers large-scale network topography and bridges us to generalized images of the network structure captured by cohesive groups and blockmodels. We focus on 3 such factors today:
1) Basic structure of large-scale networks
2) Cohesive Peer Groups
3) Identifying Role positions (blockmodels)

Based on Milgram’s (1967) famous work, the substantive point is that networks are structured such that even when most of our connections are local, any pair of people can be connected by a fairly small number of relational steps.

Watts says there are 4 conditions that make the small world phenomenon interesting:
1) The network is large: O(billions)
2) The network is sparse: people are connected to a small fraction of the total network
3) The network is decentralized: no single star (or small number of stars)
4) The network is highly clustered: most friendship circles are overlapping

Formally, we can characterize a graph through 2 statistics.
1) The characteristic path length, L: the average length of the shortest paths connecting any two actors. (Note this only works for connected graphs.)
2) The clustering coefficient, C:
• Version 1: the average local density. That is, Cv = the ego-network density of node v, and C = (sum of Cv over all nodes) / n.
• Version 2: the transitivity ratio. Number of closed triads divided by the number of closed and open triads.
A small world graph is any graph with a relatively small L and a relatively large C.
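The two statistics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the graph is assumed to be an undirected adjacency dictionary, the function names and the toy graph are made up for the example, nodes with fewer than two alters are scored as 0 in Version 1, and Version 2 counts each closed triad once per centre node (the common "closed two-paths over all two-paths" convention).

```python
from collections import deque
from itertools import combinations

def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def characteristic_path_length(adj):
    """L: mean shortest-path length over all ordered pairs (connected graphs only)."""
    n = len(adj)
    total = 0
    for s in adj:
        dist = shortest_path_lengths(adj, s)
        if len(dist) < n:
            raise ValueError("graph is not connected, so L is undefined")
        total += sum(dist.values())
    return total / (n * (n - 1))

def local_density_clustering(adj):
    """C, Version 1: average ego-network density."""
    scores = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            scores.append(0.0)  # assumption: too few alters counts as 0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        scores.append(links / (k * (k - 1) / 2))
    return sum(scores) / len(scores)

def transitivity_ratio(adj):
    """C, Version 2: closed two-paths divided by all two-paths."""
    closed = paths = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            paths += 1
            closed += b in adj[a]
    return closed / paths if paths else 0.0

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
L_toy = characteristic_path_length(toy)
C1_toy = local_density_clustering(toy)
C2_toy = transitivity_ratio(toy)
```

The two versions of C generally differ on the same graph, which is worth keeping in mind when comparing published values.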

The most clustered graph is Watts’s “Caveman” graph:

[Figure: C and L as functions of k for a Caveman graph of n=1000. Horizontal axis: Degree (k). Vertical axes: Characteristic Path Length and Clustering Coefficient.]

Compared to random graphs, the caveman graph's C is large and its L is long. The intuition, then, is that clustered graphs tend to have (relatively) long characteristic path lengths. But the small world phenomenon rests on just the opposite: high clustering and short path distances. How is this so?

A model for pair formation, as a function of mutual contacts. Using this equation, varying the parameter α produces networks that range from completely ordered (caveman-like) to random.

C is large and L is small: small-world (SW) graphs

Why does this work? The key is the fraction of shortcuts in the network. In a highly clustered, ordered network, a single random connection will create a shortcut that lowers L dramatically. Watts demonstrates that small world graphs occur in graphs with a small number of shortcuts.
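The effect of a shortcut can be checked on a small ordered graph. The sketch below is only an illustration under stated assumptions: it uses a ring lattice as a stand-in for an ordered network (not Watts's own construction), the shortcut is placed by hand rather than at random, and the helper names are invented for the example.

```python
from collections import deque

def ring_lattice(n, reach):
    """Ordered graph: each node is tied to its `reach` nearest neighbours on either side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, reach + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def mean_path_length(adj):
    """Characteristic path length L, via breadth-first search from every node."""
    n, total = len(adj), 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))

ordered = ring_lattice(100, 2)
L_before = mean_path_length(ordered)

# One illustrative shortcut across the ring (the node choice is arbitrary).
ordered[0].add(50)
ordered[50].add(0)
L_after = mean_path_length(ordered)

print(f"L before shortcut: {L_before:.2f}, after: {L_after:.2f}")
```

Only one tie out of 200 changes, so local clustering is almost untouched while every pair that can route through the shortcut gets closer.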

Observed values relative to random-graph values (Lo/Lr and Co/Cr):
1) Movie network (actors through movies): Lo/Lr = 1.22, Co/Cr = 2925
2) Western Power Grid: Lo/Lr = 1.50, Co/Cr = 16
3) C. elegans: Lo/Lr = 1.17, Co/Cr = 5.6

What are the substantive implications? Return to the initial interest in connectivity: disease diffusion.
1) Diseases move more slowly in highly clustered graphs (fig. 11). This is not a new finding.
2) The dynamics are very non-linear, with no clear pattern based on local connectivity.
Implication: small local changes (shortcuts) can have dramatic global outcomes (disease diffusion).

How do we know if an observed graph fits the SW model? Random expectations: for basic one-mode networks (such as acquaintance nets), we can get approximate random values for L and C as k and n get large:
Lrandom ~ ln(n) / ln(k)
Crandom ~ k / n
Note that Crandom essentially approaches zero as n increases when k is assumed fixed. This formula uses the density-based measure of C, but the substantive implications are similar for the triad formula.
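As a quick worked example of these random expectations, the sketch below computes Lrandom and Crandom and the observed-to-random ratios. The function names are invented for illustration, and the "observed" values passed in are placeholders, not data from any real network.

```python
import math

def random_graph_expectations(n, k):
    """Approximate L and C for a random graph with n nodes and mean degree k."""
    L_random = math.log(n) / math.log(k)
    C_random = k / n
    return L_random, C_random

def small_world_ratios(L_obs, C_obs, n, k):
    """Ratios of observed to random values; a SW graph has L ratio near 1 and C ratio >> 1."""
    L_random, C_random = random_graph_expectations(n, k)
    return L_obs / L_random, C_obs / C_random

# n = 1000 actors with k = 10 ties each.
L_r, C_r = random_graph_expectations(1000, 10)

# Placeholder observed values, chosen only to show the calculation.
L_ratio, C_ratio = small_world_ratios(L_obs=3.6, C_obs=0.5, n=1000, k=10)
```

With n = 1000 and k = 10 the random benchmark is L of about 3 and C of 0.01, so the placeholder graph would have a path-length ratio of 1.2 and a clustering ratio of 50.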

One problem with using the simple formulas for most extant data on large graphs is that, because the data result from people overlapping in groups/movies/publications, necessary clustering results from the assignment to groups.

          G1  G2  G3  G4  G5
Amy        1   0   1   0   0
Billy      0   1   0   1   0
Charlie    0   1   0   1   0
Debbie     1   0   0   0   0
Elaine     1   0   1   0   1
Frank      0   1   0   1   0
George     0   1   0   1   0
. . . . LINES CUT . . . .
William    0   1   0   0   0
Xavier     0   1   0   1   0
Yolanda    1   0   1   0   0
Zanfir     0   1   1   1   1
Total     12  14   9  14   5

Newman, M. E. J.; Strogatz, S. H.; and Watts, D. J. “Random graphs with arbitrary degree distributions and their applications.” Phys. Rev. E, 2001. This paper extends the formulas for expected clustering and path length using a generating functions approach, making it possible to calculate E(C,L) for graphs with any degree distribution. Importantly, this procedure also makes it possible to account for clustering in a two-mode graph caused by the distribution of assignment to groups.

In the Newman, Strogatz, and Watts (2001) formulas, N is the size of the graph, Z1 is the average number of people 1 step away (degree) and Z2 is the average number of people 2 steps away. Theoretically, these formulas can be used to calculate many properties of the network, including largest component size, based on degree distributions. A word of warning: the math in these papers is not simple, so sharpen your calculus pencil before reading the paper.
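The formula itself did not survive on this slide. The sketch below assumes the large-N approximation usually quoted from that paper, L = ln(N / Z1) / ln(Z2 / Z1) + 1, and should be checked against the paper before use. The function name is invented for the example.

```python
import math

def nsw_expected_path_length(n, z1, z2):
    """Expected path length from N, Z1 (mean 1-step neighbours) and Z2 (mean 2-step neighbours).

    Assumed form: ln(N / Z1) / ln(Z2 / Z1) + 1, valid only when Z2 > Z1 > 0.
    """
    if not (z2 > z1 > 0):
        raise ValueError("the approximation needs z2 > z1 > 0")
    return math.log(n / z1) / math.log(z2 / z1) + 1

# In a simple random graph Z2 is roughly Z1 squared, and the expression
# then reduces to ln(N) / ln(Z1), the simple random expectation.
L_simple_case = nsw_expected_path_length(1000, 10, 100)
```

The reduction to ln(N) / ln(k) when Z2 = Z1 squared is a useful sanity check that the two sets of formulas agree where they should.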

Since C is just the transitivity ratio, there are a number of good formulas for calculating the expected value. Using the ratio of complete to (incomplete + complete) triads, we can use the expected values from the triad distribution in PAJEK for a simple graph or we can use the expected value conditional on the dyad types (if we have directed data) using the formulas in SPAN and Wasserman and Faust (1994).

Other extensions of the SW model:
• Searchability on small worlds (Kleinberg) suggests that the ability to walk through the graph requires a structure based on some knowable “distance” features. That is, people can’t search a simple random SW graph.
• Graph Dynamics. The distance-shortening effects of shortcut ties are much less effective when the graph itself changes over time. To shorten the distance, structurally “shortcut” ties must also be temporally sequenced. This means that when relations are changing quickly, the rapid returns to shortcuts drop significantly.

[Figure: Standard result on a static graph]

[Figure: Small World Mechanisms on Dynamic Graphs]

Across a large number of substantive settings, Barabási points out that the distribution of network involvement (degree) is highly and characteristically skewed.

Many large networks are characterized by a highly skewed distribution of the number of partners (degree).

The scale-free model focuses on the distance-reducing capacity of high-degree nodes, as ‘hubs’ create shortcuts that carry network flow. The diffusion implications of mathematical models based on the preferential attachment model are dim, because the carrying capacity of the network comes to depend entirely on a vanishingly small number of stars, who are statistically hard to find. Thus, random treatment of the network does no good, but targeted treatment does.

• The primary mechanism hypothesized to drive a power-law degree distribution is the “preferential attachment” model. This model suggests that new nodes enter the population and connect to current nodes with probability proportional to the current node’s degree.
• This implies that “the rich get richer” and the graph takes on a decidedly star-like shape.
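The preferential attachment mechanism described above can be sketched as a short simulation. This is a bare-bones illustration under assumptions of my own (a two-node seed graph, a fixed number of ties m per entrant, a fixed random seed), not the exact specification used in any particular paper.

```python
import random
from collections import Counter

def preferential_attachment(n, m=1, seed=0):
    """Grow a graph to n nodes; each new node ties to m existing nodes chosen
    with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]   # assumed starting point: a single tie
    # Each node appears in `stubs` once per tie it holds, so a uniform draw
    # from `stubs` selects a node with probability proportional to its degree.
    stubs = [0, 1]
    for new in range(2, n):
        targets = set()
        while len(targets) < min(m, new):
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = preferential_attachment(500, m=1, seed=1)
degree = Counter(node for edge in edges for node in edge)

# The handful of best-connected nodes; early entrants tend to dominate.
top_five = degree.most_common(5)
```

Tabulating `degree.values()` shows the skew: most nodes keep the single tie they entered with, while a few accumulate many.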

Critiques of the scale-free model:
• The insights are not particularly new, having been anticipated in the epidemiology of STDs for some time.
• Many of the empirical claims are over-stated.
• The most common ‘test’ for a scale-free network is to plot the degree histogram on a log-log scale and fit a regression line to it. This is poor statistical practice, and better models for fitting distributions show that most of the sexual networks are not, in fact, scale free (see Jones and Handcock, "Sexual contacts and epidemic thresholds" Nature, 423, 6940, 605-606).
• Theoretically, any degree-based metric has no necessary relation to the arrangement of ties within the network. That is, there are many graphs with identical degree distributions but very different topologies.
• Preferential attachment implies scale free, but not vice versa.
• Finding a power-law degree distribution is really not that useful if there is any kind of blocking structure (focal aspects) to the network.

Colorado Springs High-Risk (Sexual contact only)
• Network is approximately scale-free, with λ = -1.3
• But connectivity does not depend on the hubs.

Complete Network Analysis Network Connections: Social Subgroups

A primary interest in Social Network Analysis is the identification of “significant social subgroups”: some smaller collection of nodes in the graph that can be considered, at least in some senses, as a “unit” based on the pattern, strength, or frequency of ties. There are many ways to identify groups. They all insist on a group being in a connected component, but other than that the variation is wide.

A) Graph theoretical methods: Cliques and extensions of cliques
• Cliques
• k-cores
• k-plexes
• Freeman (1992) Models
• K-components
B) Algorithmic methods: search through a network trying to maximize for a particular pattern. Adjust assignment of actors to groups until a particular pattern of ties (block diagonal, usually) is identified. Standard models:
• Factions (UCINET)
• NEGOPY (Richards)
• KliqueFinder (Frank)
• RNM, JIGGLE (Moody)
• Betweenness Centrality (Newman)
• General Distance & Clustering Methods

Graph Theoretical Models. Start with a clique. A clique is defined as a maximal subgraph in which every member of the subgraph is connected to every other member of the subgraph. Cliques are collections of nodes where density = 1.0.
Properties of cliques:
• Density: 1.0
• Everyone connected to n-1 alters
• Distance between every pair is 1
• Ratio of within group ties to between group ties is infinite
• All triads are transitive
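The clique definition can be turned into a direct check. The sketch below assumes an undirected adjacency dictionary and uses illustrative function names; it tests both parts of the definition, completeness and maximality. Some treatments also require at least three members, which this sketch does not enforce.

```python
from itertools import combinations

def is_complete(adj, members):
    """True if every pair of members is directly tied (density = 1.0)."""
    return all(b in adj[a] for a, b in combinations(members, 2))

def is_clique(adj, members):
    """True if `members` is complete AND maximal (no outsider is tied to all of them)."""
    members = set(members)
    if not is_complete(adj, members):
        return False
    return not any(members <= adj[v] for v in adj if v not in members)

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

triangle_ok = is_clique(toy, {0, 1, 2})      # complete and maximal
pair_ok = is_clique(toy, {0, 1})             # complete, but node 2 could be added
with_pendant_ok = is_clique(toy, {0, 1, 3})  # node 3 is not tied to 0 or 1
```

The middle case is why cliques overlap so heavily in practice: every complete subset belongs to some larger maximal set.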

In practice, complete cliques are not very useful. They tend to overlap heavily and are limited in their size. Graph theorists have thus relaxed the complete connectivity requirement (with varying degrees of success). See Moody & White (2003) for a discussion of these attempts.

k-cores: Every person is connected to at least k other people. Ideally, they would look something like this (here two 3-cores). However, adding a single tie from A to B would make the whole graph a 3-core.
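The k-core idea, and the fragility noted above, can be illustrated with a short pruning routine. The graph below is a made-up example (two four-person groups plus one hanger-on), and the single added tie between nodes 3 and 4 stands in for the A to B tie; the function names are illustrative.

```python
def k_core(adj, k):
    """Nodes of the k-core: repeatedly drop anyone with fewer than k ties inside the set."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive

def components(adj, nodes):
    """Connected pieces of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, parts = set(), []
    for s in nodes:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u in part:
                continue
            part.add(u)
            stack.extend((adj[u] & nodes) - part)
        seen |= part
        parts.append(part)
    return parts

def complete_graph(nodes):
    nodes = set(nodes)
    return {v: nodes - {v} for v in nodes}

# Two fully connected groups of four, plus node 8 hanging off node 0.
adj = {**complete_graph(range(4)), **complete_graph(range(4, 8))}
adj[8] = {0}
adj[0].add(8)

core = k_core(adj, 3)                  # node 8 is pruned
parts_before = components(adj, core)   # two separate 3-cores

# A single bridging tie merges them into one connected 3-core.
adj[3].add(4)
adj[4].add(3)
parts_after = components(adj, k_core(adj, 3))
```

Because the k-core condition only counts ties, one bridge is enough to erase the distinction between two otherwise separate groups.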

Extensions of this idea include:
• K-Core: Every person has ties to at least k other people in the set.
• K-plex: Every member is connected to at least n-k other people in the graph (recall that in a clique everyone is connected to n-1, so this relaxes that condition).
• n-clique: Every person is connected by a path of length n or less (recall that a clique has distance = 1).
• n-clan: Same as an n-clique, but all paths must be inside the group.
I’ve never had much luck with any of these methods empirically. Real data are usually too messy to work well. Since many of the graph-theoretic options seem not to work well, authors have used optimization techniques that attempt to identify groups iteratively.

Algorithmic Approaches to Identifying Primary Groups:
1) Measures of fit. To identify a primary group, we need some measure of how clustered the network is. Usually, this is a function of the number of ties that fall within groups relative to the number of ties that fall between groups.
2.1) Processes designed to maximize (1). Once we have such an index, we need a method for searching through the network to maximize the fit.
2.2) Generalized cluster analysis. In addition to maximizing a group function such as (1), we can use the relational distance directly and look for clusters in the data.

Segregation Index (Freeman, L. C. 1978. "Segregation in Social Networks." Sociological Methods and Research 6:411-30.) Freeman asked how we could identify segregation in a social network. Theoretically, he argues, if a given attribute (group label) does not matter for social relations, then relations should be distributed randomly with respect to the attribute. Thus, the difference between the number of cross-group ties expected by chance and the number observed measures segregation.

Consider the (hypothetical) network below. There are two attributes in this network: people with Blue eyes and Brown eyes and people who are square or not (they must be hip).

Segregation Index, mixing matrices:

Eye color (Seg = -0.25)
          Blue  Brown
Blue         6     17
Brown       17     16

Hip / Square (Seg = 0.78)
          Hip  Square
Hip        20       3
Square      3      30
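The two Seg values can be recovered from the mixing matrices with a few lines of Python. The sketch assumes that the expected number of cross-group ties is computed from the row and column totals of the mixing matrix (as in a chi-square table) and that Seg = (expected - observed) / expected; that choice reproduces the values shown here, but it is an assumption about how they were calculated. The function name is illustrative.

```python
def segregation_index(mixing):
    """(expected cross-group ties - observed cross-group ties) / expected.

    `mixing[i][j]` is the number of ties sent from group i to group j.
    Expected counts are taken from the matrix margins (assumption).
    """
    groups = range(len(mixing))
    total = sum(sum(row) for row in mixing)
    row_tot = [sum(row) for row in mixing]
    col_tot = [sum(mixing[i][j] for i in groups) for j in groups]
    observed = sum(mixing[i][j] for i in groups for j in groups if i != j)
    expected = sum(row_tot[i] * col_tot[j] / total
                   for i in groups for j in groups if i != j)
    return (expected - observed) / expected

eye_color = [[6, 17], [17, 16]]    # Blue, Brown
hip_square = [[20, 3], [3, 30]]    # Hip, Square

seg_eye = segregation_index(eye_color)
seg_hip = segregation_index(hip_square)
```

A negative value, as for eye color, means there are more cross-group ties than chance would produce; a value near 1 means almost all ties stay within groups.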

One problem with the segregation index is that it is not ‘margin free.’ That is, if you were to change the distribution of the category of interest (say race) by a constant but not the core association between race and friendship choice, you can get a different segregation level. One antidote to this problem is to use odds ratios. In this case, an odds ratio tells us the relative likelihood that two people in the same category will choose each other as friends.
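A small sketch makes the 'margin free' property concrete. It assumes dyads have been cross-classified by whether the pair shares a category and whether a tie is present; the counts are hypothetical and the function name is illustrative.

```python
def same_group_odds_ratio(same_ties, same_nonties, cross_ties, cross_nonties):
    """Odds of a tie within a category divided by the odds of a tie across categories."""
    odds_same = same_ties / same_nonties
    odds_cross = cross_ties / cross_nonties
    return odds_same / odds_cross

# Hypothetical dyad counts.
base = same_group_odds_ratio(50, 150, 6, 194)

# Triple the number of cross-category dyads while keeping their tie rate fixed:
# the margins change, but the odds ratio does not.
shifted = same_group_odds_ratio(50, 150, 18, 582)
```

Both calls return the same value (about 10.8), whereas an index built on expected counts from the margins would move when the category distribution changes.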

Segregation index compared to the odds ratio:
[Figure: Friendship Segregation Index plotted against Log(Same-Sex Odds Ratio); r = .95]

The second problem is that the Segregation index has no clear maximum: if every node is assigned to a single group, the value can be higher than if everyone is assigned to the “right” group. This means you can’t just keep adjusting nodes until you see a best fit, but instead have to look for changes in fit. The modularity score solves this problem by re-organizing the expectation in a way that forces the value to 0 if everyone is in a single group.

We can also measure the extent to which ties fall within clusters with the modularity score M, where:
• s indexes clusters in the network
• ls is the number of lines in cluster s
• ds is the sum of the degrees of the nodes in s
• L is the total number of lines
M has the advantage of going to 0 if there is only 1 group, which means maximizing the score is sensible.
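The formula did not survive on this slide. The sketch below assumes the standard form that matches the symbols defined above, M = sum over s of [ ls / L - (ds / 2L)^2 ], and applies it to a made-up graph of two triangles joined by one tie. The function name and data are illustrative.

```python
def modularity(edges, membership):
    """M = sum over clusters s of ls / L - (ds / (2 * L)) ** 2 (assumed standard form).

    edges: list of undirected (u, v) lines; membership: node -> cluster label.
    """
    L = len(edges)
    lines_within = {}   # ls: lines with both ends in cluster s
    degree_sum = {}     # ds: sum of the degrees of the nodes in cluster s
    for u, v in edges:
        gu, gv = membership[u], membership[v]
        degree_sum[gu] = degree_sum.get(gu, 0) + 1
        degree_sum[gv] = degree_sum.get(gv, 0) + 1
        if gu == gv:
            lines_within[gu] = lines_within.get(gu, 0) + 1
    return sum(lines_within.get(s, 0) / L - (d / (2 * L)) ** 2
               for s, d in degree_sum.items())

# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single tie 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

one_group = {node: "all" for node in range(6)}
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

M_one = modularity(edges, one_group)
M_two = modularity(edges, two_groups)
```

With everyone in one group the score is exactly 0; splitting along the two triangles gives 5/14, about 0.36, so the search has a meaningful maximum to aim for.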

Modularity Scores. Comparison to Segregation Index: comparing values for known solutions.
[Figure: Modularity Score plotted against Segregation Index for various nets]

Modularity Scores. Comparison to Segregation Index: comparing the value of multiple solutions on the same network.
[Figure: axes labeled Number of groups and In-group Density]

Other groupedness indices include:
a) The ratio of in-group to out-group ties (NEGOPY, UCINET Factions)
b) Maximizing the probability of in-group contact (KliqueFinder)
c) The Segregation Matrix Index (SMI)
d) The dyadic factor loadings for overlapping groups (akin to a latent class model)
e) Minimizing the within-group distance
Once a metric has been chosen, some algorithm is needed to search through the graph to identify clusters. These algorithms range from very sophisticated “graph-intelligent” algorithms, such as NEGOPY, to simple cluster analysis of distance matrices.
In most cases, you have to pre-set the number of groups to use (the exceptions are NEGOPY and KliqueFinder). Moody’s JIGGLE algorithm also has automatic stopping criteria, but you have to give it starting values.

In practice, the different algorithms will give different results. Here, I compare the NEGOPY results to the RNM results. NEGOPY returned one large group, RNM found many smaller, denser groups. It’s usually a good idea to explore multiple solutions and algorithms.

Gagnon Prison Network. Here, I compare NEGOPY, FACTIONS and RNM. Groups A and B are identical, C is close. F, E and D differ. (All solutions constrained to 6 groups.)

Section of a School Friendship Network. Here, I compare NEGOPY, FACTIONS and RNM with the k-connectivity levels.

Cluster analysis. In addition to tools like FACTIONS, we can use the distance information contained in a network to cluster observations that are ‘close’ to each other. In general, cluster analysis is a set of techniques that allows you to identify collections of objects that are similar to each other in some degree. A very good reference is the SAS/STAT manual section called “Introduction to clustering procedures” (http://wks.uts.ohio-state.edu/sasdoc/8/sashtml/stat/chap8/index.htm). (See also Wasserman and Faust, though the coverage is spotty.) We are going to start with the general problem of hierarchical clustering applied to any set of analytic objects based on similarity, and then transfer that to clustering nodes in a network.

Imagine a set of objects (say people) arrayed in a two dimensional space. You want to identify groups of people based on their position in that space. How do you do it?
[Figure: people plotted in two dimensions; axes: How Smart you are, How Cool you are]
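One simple answer is agglomerative (hierarchical) clustering: start with every person alone and repeatedly merge the two closest groups. The sketch below uses single linkage, which is only one of several merge rules and is chosen here for brevity; the coordinates are invented and the function name is illustrative.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Distance between clusters = smallest distance between any pair of members.
    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        candidates = [(gap(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        _, i, j = min(candidates)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]   # j > i, so index i is unaffected
    return clusters

# Invented (smart, cool) scores for six people.
people = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]

groups = single_linkage(people, n_clusters=2)
```

Recording the distance at which each merge happens, rather than stopping at a fixed number of clusters, gives the full hierarchy that a dendrogram displays.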