
Jerry Scripps


Presentation Transcript


  1. Network Mining • Jerry Scripps

  2. Overview • What is network mining? • Motivation • Preliminaries • definitions • metrics • network types • Network mining techniques

  3. What is Network Mining? (Diagram of related fields) • Statistics • Computer Science • Mathematics • Data Mining • Machine Learning • Pattern Recognition • Social Network Analysis • Graph Theory • Network Mining

  4. What is Network Mining? Border Disciplines: • Sociology • Military • Biology • Medicine • Chemistry • Business • Statistics • Computer Science • Physics • Math • Psychology • Law Enforcement

  5. What is Network Mining? Examples: • Discovering communities within collaboration networks • Finding authoritative web pages on a given topic • Selecting the most influential people in a social network

  6. Network Mining – Motivation: Emerging Data Sets • World wide web • Social networking • Collaboration databases • Customer or employee sets • Genomic data • Terrorist sets • Supply chains • Many more…

  7. Network Mining – Motivation: Direct Applications • What is the community around msu.edu? • What are the authoritative pages? • Who has the most influence? • Who is a likely member of a terrorist cell? • Is this a news story about crime, politics or business?

  8. Network Mining – Motivation: Indirect Applications • Convert ordinary data sets into networks • Integrate network mining techniques into other techniques

  9. Preliminaries • Definitions • Metrics • Network Types

  10. Definitions • Community • Node (vertex, point, object) • Link (edge, arc)

  11. Metrics • Network: characteristic path length, clustering coefficient, min-cut, diameter • Node pair: graph distance, min-cut, common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment, Katz, hitting time, rooted PageRank, SimRank, bibliographic metrics • Node: degree, closeness, betweenness, clustering coefficient
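Two of the node-level metrics listed above, degree and clustering coefficient, can be sketched in a few lines of Python. This is a minimal illustration, assuming an undirected graph stored as a dict of neighbor sets; the toy graph and function names are hypothetical and not from the presentation.

```python
from itertools import combinations

def degree(adj, node):
    """Number of links incident to the node."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of pairs of the node's neighbors that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # no neighbor pairs to check
    linked_pairs = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * linked_pairs / (k * (k - 1))

# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

for node in adj:
    print(node, degree(adj, node), clustering_coefficient(adj, node))
```

Node c has three neighbors but only one linked pair among them, so its coefficient is 1/3, while a and b sit in a closed triangle and score 1.0.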

  12. Network Types – Random

  13. Network Types – Small World • Watts & Strogatz • (Figure: regular, small-world and random networks)

  14. Networks – Scale-free • Barabasi & Bonabeau • Degree follows a power law: P(k) ~ 1/k^n • Can be found in a wide variety of real-world networks
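One mechanism commonly used to generate networks with a heavy-tailed degree distribution is preferential attachment, where new nodes link to existing nodes with probability proportional to degree. The slide does not say which generator is meant, so the sketch below is an assumption; the function name, parameters and seed are hypothetical.

```python
import random
from collections import Counter

def preferential_attachment_graph(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = []
    pool = []  # each node appears once per incident link (degree-weighted)
    # Start from a small clique of m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
            pool += [u, v]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for target in chosen:
            edges.append((new, target))
            pool += [new, target]
    return edges

edges = preferential_attachment_graph(n=200, m=2, seed=1)
degrees = Counter(node for edge in edges for node in edge)
histogram = Counter(degrees.values())  # degree -> number of nodes
print(sorted(histogram.items()))
```

In a run like this, most nodes keep a degree near m while a few early nodes become hubs, which is the qualitative signature of a power-law degree distribution.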

  15. Network recap

  16. Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding

  17. Link-Based Classification • Include features from linked objects: build a single model on all features • Fusion of link and attribute models

  18. Link-Based Classification – Chakrabarti, et al. • Copying data from neighboring web pages actually reduced accuracy • Using the label from neighboring pages improved accuracy • (Figure: linked pages with binary attribute vectors and class labels A/B; one page is unlabeled)

  19. Link-Based Classification – Lu & Getoor • Define vectors for attributes and links • Attribute data OA(X) • Link data LD(X) constructed using • mode (single feature – class of plurality) • count (feature for each class – count for neighbors) • binary (feature for each class – 0/1 if exists) • (Figure: an unlabeled node with neighbors labeled B, A, A; its OA attribute vector feeds Model 1 and its LD link vector, count 2 1 0 … or binary 1 1 0 …, feeds Model 2)
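The three ways of building the link vector LD(X) described above (mode, count, binary) can be sketched as follows. This is an illustrative reading of the slide, not Lu & Getoor's code; the function name and the tie-breaking rule for the mode are assumptions.

```python
from collections import Counter

def link_features(neighbor_labels, classes):
    """Build mode, count and binary link features from neighbor class labels."""
    counts = Counter(neighbor_labels)
    # count: one feature per class, the number of neighbors with that class
    count_vec = [counts.get(c, 0) for c in classes]
    # binary: one feature per class, 1 if any neighbor has that class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]
    # mode: the plurality class (assumption: ties go to the earlier class)
    mode = max(classes, key=lambda c: counts.get(c, 0)) if neighbor_labels else None
    return {"mode": mode, "count": count_vec, "binary": binary_vec}

# Neighbors labeled A, A and B, with a third class C that has no neighbors.
features = link_features(["A", "A", "B"], classes=["A", "B", "C"])
print(features)
```

With two A neighbors and one B neighbor, the count vector is [2, 1, 0] and the binary vector is [1, 1, 0].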

  20. Link-Based Classification – Lu & Getoor • Define probabilities for both • Attribute • Link • Class estimation:
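The class-estimation formula is not preserved in this transcript. One plausible way to combine the two models, assumed here purely for illustration, is to multiply the per-class probability from the attribute model by the per-class probability from the link model and pick the class with the largest product. The function name and the example probabilities are hypothetical.

```python
def combine_and_classify(p_attr, p_link):
    """Combine per-class probabilities from an attribute model and a link
    model by multiplication (an assumed rule), then pick the best class."""
    scores = {c: p_attr[c] * p_link[c] for c in p_attr}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all combined scores are zero")
    posterior = {c: s / total for c, s in scores.items()}
    label = max(posterior, key=posterior.get)
    return label, posterior

# Hypothetical model outputs for one unlabeled node.
label, posterior = combine_and_classify(
    p_attr={"A": 0.4, "B": 0.6},
    p_link={"A": 0.8, "B": 0.2},
)
print(label, posterior)
```

Here the attribute model alone would choose B, but strong link evidence for A outweighs it once the two are combined.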

  21. Collective Classification • Uses both attributes and links • Iteratively update the unlabeled instances • message passing, loopy belief nets, etc.
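The iterative update of unlabeled instances can be sketched as a simple loop. For brevity this version uses only neighbor labels with a majority vote, which is an assumption; a full collective classifier would also use node attributes and a learned model. All names and the toy graph are hypothetical.

```python
from collections import Counter

def iterative_classification(adj, labels, max_iters=10):
    """Repeatedly relabel unlabeled nodes by majority vote of their
    currently labeled neighbors until no label changes."""
    current = dict(labels)  # known labels are never overwritten
    unlabeled = [n for n in adj if n not in labels]
    for _ in range(max_iters):
        changed = False
        for node in unlabeled:
            votes = Counter(current[nb] for nb in adj[node] if nb in current)
            if not votes:
                continue  # no labeled neighbors yet
            best = max(sorted(votes), key=votes.get)  # ties: alphabetical
            if current.get(node) != best:
                current[node] = best
                changed = True
        if not changed:
            break
    return current

adj = {
    "a1": {"x"}, "a2": {"x"}, "b1": {"y"}, "b2": {"y"},
    "x": {"a1", "a2", "y"},
    "y": {"x", "b1", "b2", "w"},
    "w": {"y"},
}
known = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
result = iterative_classification(adj, known)
print(result)
```

Node w has no labeled neighbors at the start and only receives a label after y has been assigned one, which is the propagation effect that collective methods rely on.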

  22. Link-Based Classification – Summary • Using class of neighbors improves accuracy • Using separate models for attribute and link data further improves accuracy • Other considerations: • improvements are possible by using community information • knowledge of network type could also benefit classifier

  23. Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding

  24. Link Prediction

  25. Link Prediction – Liben-Nowell and Kleinberg • Tested node-pair metrics: • Graph distance • Neighborhood: common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment • Ensemble of paths: Katz, hitting time, rooted PageRank, SimRank
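The neighborhood-based scores in this list have short standard definitions, sketched below for an undirected graph stored as a dict of neighbor sets. The toy graph and function names are hypothetical; a higher score suggests a more likely future link between the pair.

```python
import math

def common_neighbors(adj, x, y):
    return len(adj[x] & adj[y])

def jaccard(adj, x, y):
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def adamic_adar(adj, x, y):
    # Rare shared neighbors (low degree) count more than popular ones.
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[x] & adj[y] if len(adj[z]) > 1)

def preferential_attachment(adj, x, y):
    return len(adj[x]) * len(adj[y])

# Hypothetical toy graph with links a-c, a-d, b-c, b-d, b-e, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

for score in (common_neighbors, jaccard, adamic_adar, preferential_attachment):
    print(score.__name__, score(adj, "a", "b"))
```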

  26. Link Prediction - results

  27. Link Prediction – newer methods • maximum likelihood • stochastic block model • probabilistic

  28. Link Prediction – summary • There is room for growth – the best predictor has accuracy of only around 9% • Predicting collaborations is difficult • A new problem could be to predict the direction of the link

  29. Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding • Link Completion

  30. Ranking

  31. Ranking – Markov Chain Based • Random-surfer analogy • Problem with cycles • PageRank uses a random jump vector
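The random-surfer idea can be sketched as a power iteration: with probability `damping` the surfer follows an outgoing link, and otherwise jumps to a random page, which is what keeps the walk from getting trapped in cycles. The damping value of 0.85, the uniform jump vector and the toy graph are assumptions for illustration.

```python
def pagerank(out_links, damping=0.85, iters=100):
    """Power iteration for PageRank with a uniform random-jump vector."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}  # random-jump share
        for u in nodes:
            targets = out_links[u]
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

# Hypothetical web graph: a cycle a -> b -> c -> a, plus d -> a.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```

Page d has no incoming links, so it keeps only the random-jump share, while a collects rank from both c and d and ends up on top.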

  32. Ranking – summary • Other methods such as HITS and SALSA are also based on Markov chains • Ranking has been applied in other areas: • text summarization • anomaly detection

  33. Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding

  34. Influence

  35. Influence Maximization • Problem: find the best nodes to activate • Approaches: • degree – fast but not effective • greedy – effective but slow • improvements to greedy: degree heuristics and Shapley value • use communities • cost-benefit – probabilistic approach

  36. Maximizing influence – model-based • Problem – finding the k best nodes to activate to maximize the number of nodes activated • Models: • independent cascade – when activated, a node has a one-time chance to activate each neighbor j with probability p_ij • linear threshold – a node becomes activated when the fraction of its neighbors that are active crosses a threshold
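Both diffusion models can be simulated in a few lines. The sketch below makes simplifying assumptions that are not in the slides: a single activation probability `p` for every link instead of a per-link p_ij, unweighted neighbors, and one shared threshold for all nodes. Names and the toy path graph are hypothetical.

```python
import random

def independent_cascade(adj, seeds, p, rng):
    """Each newly activated node gets one chance to activate each
    inactive neighbor, succeeding with probability p."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def linear_threshold(adj, seeds, threshold):
    """A node activates once the fraction of its active neighbors
    reaches the threshold; repeat until nothing changes."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v in active or not adj[v]:
                continue
            fraction = sum(1 for u in adj[v] if u in active) / len(adj[v])
            if fraction >= threshold:
                active.add(v)
                changed = True
    return active

# Hypothetical path graph a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(independent_cascade(path, ["a"], p=0.5, rng=random.Random(0)))
print(linear_threshold(path, ["a"], threshold=0.5))
```

Because the independent cascade is random, its expected spread is usually estimated by averaging many simulated runs.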

  37. Maximizing influence – model-based • Models: independent cascade & linear threshold • A function f: S→S* can be created using either model • Functions are evaluated with Monte Carlo simulation and maximized with a hill-climbing solution • For submodular functions (where S ⊆ T), finding the optimum is proven in another work to be NP-hard, but a hill-climbing solution gets to within 1 − 1/e of the optimum
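The hill-climbing (greedy) selection can be sketched independently of the diffusion model: at each step, add the candidate with the largest marginal gain in spread. To keep the example deterministic, the spread function below is a simple set-coverage count, which is submodular and stands in for a Monte Carlo estimate of expected spread. That substitution, the `reach` data and all names are assumptions.

```python
def greedy_select(candidates, k, spread):
    """Pick k seeds one at a time, each time adding the candidate
    with the largest marginal gain in the spread function."""
    chosen = []
    for _ in range(k):
        base = spread(chosen)
        best, best_gain = None, float("-inf")
        for c in candidates:
            if c in chosen:
                continue
            gain = spread(chosen + [c]) - base
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            break  # fewer candidates than k
        chosen.append(best)
    return chosen

# Hypothetical data: the set of nodes each candidate seed would reach.
reach = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 2}}

def coverage(seed_list):
    covered = set()
    for s in seed_list:
        covered |= reach[s]
    return len(covered)

seeds = greedy_select(list(reach), k=2, spread=coverage)
print(seeds, coverage(seeds))
```

After a is chosen, b adds only one new node while c adds two, so the second pick is c even though b reaches more nodes on its own. This diminishing-returns behavior is what the 1 − 1/e guarantee depends on.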

  38. Maximizing influence – cost/benefit • Assumptions: • product x sells for $100 • a discount of 10% can be offered to various prospective customers • If a customer purchases, the profit is: • 90 if the discount is offered • 100 if the discount is not offered • Expected lift in profit (ELP) from offering the discount is: • 90*P(buy|discount) - 100*P(buy|no discount)
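The ELP expression above is easy to check with numbers. The purchase probabilities below are hypothetical and chosen only to show that a discount can either raise or lower expected profit.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount,
                            price=100.0, discount=0.10):
    """ELP = profit_with_discount * P(buy | discount)
           - full_price * P(buy | no discount)."""
    profit_with_discount = price * (1.0 - discount)  # 90 for the slide's numbers
    return profit_with_discount * p_buy_discount - price * p_buy_no_discount

# Hypothetical customer who is swayed by the discount: 90*0.5 - 100*0.3
print(expected_lift_in_profit(0.5, 0.3))
# Hypothetical customer who would buy either way: 90*0.5 - 100*0.5
print(expected_lift_in_profit(0.5, 0.5))
```

The first customer gives a positive lift of about 15, so the discount is worth offering. The second gives a negative lift of about -5, because the discount only gives away margin.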

  39. Maximizing influence – cost/benefit • Goal is to find M that maximizes global ELP • Three approximations used: • single pass • greedy • hill-climbing • X_i is the decision of customer i to buy • Y is the vector of product attributes • M is the vector of marketing decisions • f is a function to set the ith element of M • r0 and r1 are the revenue gained • c is the cost of marketing

  40. Comparison of approaches • An extension would be to spread influence to the greatest number of communities • Improvements can be made in speed

  41. Techniques • Link-Based Classification • Link Prediction • Ranking • Influential Nodes • Community Finding

  42. Communities

  43. Gibson, Kleinberg and Raghavan • Query • Search engine • Root set • Base set: add forward and back links • Use HITS to find top 10 hubs and authorities
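The HITS step at the end of this pipeline can be sketched as alternating updates: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The base set below is hypothetical, and the fixed iteration count is an assumption.

```python
import math

def hits(out_links, iters=50):
    """Iteratively compute hub and authority scores (L2-normalized)."""
    nodes = list(out_links)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # Authority update: sum hub scores over incoming links.
        auth = {v: 0.0 for v in nodes}
        for u in nodes:
            for v in out_links[u]:
                auth[v] += hub[u]
        # Hub update: sum authority scores over outgoing links.
        hub = {u: sum(auth[v] for v in out_links[u]) for u in nodes}
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {u: x / a_norm for u, x in auth.items()}
        hub = {u: x / h_norm for u, x in hub.items()}
    return hub, auth

# Hypothetical base set: three hub-like pages pointing at two content pages.
base_set = {"h1": ["p1", "p2"], "h2": ["p1", "p2"], "h3": ["p1"],
            "p1": [], "p2": []}
hub, auth = hits(base_set)
print("top authority:", max(auth, key=auth.get))
print("top hubs:", sorted(hub, key=hub.get, reverse=True)[:2])
```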

  44. Flake, Lawrence and Giles • Uses Min-cut • Start with seed set • Add linked nodes • Find nodes from outgoing links • Create virtual source node • Add virtual sink linking it to all nodes • Find the min-cut of the virtual source and sink
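The min-cut idea can be illustrated on a toy graph: attach a virtual source to the seed nodes, attach a virtual sink, compute a maximum flow, and take the nodes still reachable from the source as the community. The sketch below uses a plain breadth-first augmenting-path max-flow and is a simplification, not Flake, Lawrence and Giles's exact procedure. In particular, the sink is wired only to two far nodes, and the capacities and node names are hypothetical.

```python
from collections import deque

def min_cut_source_side(capacity, source, sink):
    """Run max-flow (BFS augmenting paths) and return the set of nodes
    still reachable from the source in the residual graph."""
    residual = {}
    for u, nbrs in capacity.items():
        for v, cap in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)  # reverse edge for undoing flow

    def bfs_parents():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs_parents()
        if sink not in parent:
            return set(parent)  # source side of a minimum cut
        # Find the bottleneck on the augmenting path, then push flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u

# Two triangles (a,b,c) and (x,y,z) joined by the single link c - x.
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]
capacity = {}
for u, v in links:
    capacity.setdefault(u, {})[v] = 1
    capacity.setdefault(v, {})[u] = 1
capacity["s"] = {"a": 10, "b": 10}  # virtual source -> seed nodes
capacity["y"]["t"] = 10             # far nodes -> virtual sink
capacity["z"]["t"] = 10

community = min_cut_source_side(capacity, "s", "t") - {"s"}
print(sorted(community))
```

The cheapest cut is the single bridge c - x, so the community grown from seeds a and b is the first triangle.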

  45. Community Finding • Girvan and Newman – divisive, repeatedly removes the link with the highest betweenness • Clauset, et al. – agglomerative, uses modularity • Shi & Malik – spectral clustering
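Modularity, the quality score used by the agglomerative approach, compares the fraction of links that fall inside communities with the fraction expected if links were placed at random with the same node degrees. A minimal sketch for an undirected, unweighted graph follows; the toy graph and function name are hypothetical.

```python
def modularity(edges, community):
    """Q = sum over communities of (links_inside / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    inside = {}      # community -> number of links with both ends inside
    degree_sum = {}  # community -> total degree of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (degree_sum[c] / (2.0 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by one link.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]
split = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
merged = {n: 1 for n in split}

print(modularity(edges, split))   # two communities
print(modularity(edges, merged))  # everything in one community
```

Splitting the graph into its two triangles scores 5/14 (about 0.357), while putting every node in one community scores 0, so an agglomerative method that maximizes modularity would stop merging at the two-triangle partition.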

  46. Communities - summary • There are many options for building communities around a small group of nodes • Possible future directions • finding communities in networks having different link types • impact of network type on community finding techniques
