
Jerry Scripps - PowerPoint PPT Presentation


Presentation Transcript
Link Mining

Jerry Scripps


Overview

  • What is link mining?

  • Motivation

  • Preliminaries

    • definitions

    • metrics

    • network types

  • Link mining techniques


What is Link Mining?

[Diagram: Link Mining shown alongside related fields: Graph Theory, Statistics, Data Mining, Machine Learning, Social Network Analysis, Database]


What is Link Mining?

Examples:

  • Discovering communities within collaboration networks

  • Finding authoritative web pages on a given topic

  • Selecting the most influential people in a social network


Link Mining – Motivation: Emerging Data Sets

  • World wide web

  • Social networking

  • Collaboration databases

  • etc.


Link Mining – Motivation: Direct Applications

  • What is the community around msu.edu?

  • What are the authoritative pages?

  • Who has the most influence?

  • Who is a likely member of a terrorist cell?

  • Is this a news story about crime, politics or business?


Link Mining – Motivation: Indirect Applications

  • Convert ordinary data sets into networks

  • Integrate link mining techniques into other techniques


Preliminaries

  • Definitions

  • Metrics

  • Network Types


Definitions

Community

Node (vertex, point, object)

Link (edge, arc)


Metrics

Network

  • Characteristic path length

  • Clustering coefficient

  • Min-cut

Node Pair

  • Graph distance

  • Min-cut

  • Common neighbors

  • Jaccard’s coefficient

  • Adamic/Adar

  • Preferential attachment

  • Katz

  • Hitting time

  • Rooted PageRank

  • SimRank

  • Bibliographic metrics

Node

  • Degree

  • Closeness

  • Betweenness

  • Clustering coefficient
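Several of the node-pair metrics listed above depend only on the neighborhoods of the two nodes. The sketch below shows common neighbors, Jaccard's coefficient, Adamic/Adar and preferential attachment on a small undirected network. The network `graph` and the function names are hypothetical and chosen for illustration only.

```python
import math

# Hypothetical toy network (not from the slides): node -> set of neighbors.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """Shared neighbors, each weighted by 1 / log(degree); rare neighbors count more."""
    # Degree-1 neighbors are skipped to avoid dividing by log(1) = 0.
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

def preferential_attachment(g, x, y):
    """Product of the two degrees."""
    return len(g[x]) * len(g[y])

print(common_neighbors(graph, "b", "d"), jaccard(graph, "b", "d"))
```

Nodes b and d are not linked but share both of their neighbors, so all four scores rate the pair highly.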


Network Types

[Figure: Watts & Strogatz network types: regular, small world and random]


Networks – Scale-free

  • Barabasi & Bonabeau

  • Degree distribution follows a power law: P(k) ~ 1/k^n

  • Can be found in a wide variety of real-world networks



Techniques

  • Link-Based Classification

  • Link Prediction

  • Ranking

  • Influential Nodes

  • Community Finding

  • Link Completion


Link-Based Classification

Include features from linked objects:

  • building a single model on all features

  • fusion of link and attribute models


Link-Based Classification: Chakrabarti, et al.

  • Copying data from neighboring web pages actually reduced accuracy

  • Using the label from neighboring page improved accuracy

[Figure: a page with attributes 111011 and unknown class (?), linked to neighbors 101011 (class B), 010010 (class A) and 011110 (class A)]


Link-Based Classification: Lu & Getoor

  • Define vectors for attributes and links

    • Attribute data OA(X)

    • Link data LD(X) constructed using

      • mode (single feature – class of plurality)

      • count (feature for each class – count for neighbors)

      • binary (feature for each class – 0/1 if exists)

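[Figure: node 111011 with unknown class (?), linked to neighbors 101011 (B), 010010 (A) and 011110 (A). OA (attr) is 111011; LD (link) is shown as mode A, count 2 1 0 and binary 1 1 0. Labels: Model 1, Model 2, Model.]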
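The three ways of building the link vector LD(X) can be sketched in a few lines. The example uses a node whose neighbors have classes B, A and A. The function name `link_features`, the class list and the third class name "C" are assumptions made for illustration.

```python
from collections import Counter

def link_features(neighbor_labels, classes):
    """Build mode, count and binary link features from the neighbors' classes."""
    counts = Counter(neighbor_labels)
    count_vec = [counts[c] for c in classes]             # one count per class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]  # 1 if the class occurs at all
    # Plurality class; ties go to the class listed first (an arbitrary choice).
    mode = max(classes, key=lambda c: counts[c]) if neighbor_labels else None
    return {"mode": mode, "count": count_vec, "binary": binary_vec}

features = link_features(["B", "A", "A"], classes=["A", "B", "C"])
print(features)
```

With these inputs the count vector is 2 1 0 and the binary vector is 1 1 0, which matches the values in the slide's diagram.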


Link-Based Classification: Lu & Getoor

  • Define probabilities for both

    • Attribute

    • Link

  • Class estimation:
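One way to combine the two probabilities, assuming the attribute model and the link model are treated as independent, is to pick the class that maximizes their product. This is a sketch of that idea using the OA(X) and LD(X) notation from the previous slide, not necessarily the exact formula shown on the original slide:

```latex
\hat{C}(X) = \arg\max_{c} \; P\bigl(c \mid OA(X)\bigr)\, P\bigl(c \mid LD(X)\bigr)
```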


Link-Based Classification: Summary

  • Using class of neighbors improves accuracy

  • Using separate models for attribute and link data further improves accuracy

  • Other considerations:

    • improvements are possible by using community information

    • knowledge of network type could also benefit classifier


Techniques: Link Prediction



Link Prediction: Liben-Nowell and Kleinberg

Tested node-pair metrics:

  • Graph distance

  • Neighborhood-based:

    • Common neighbors

    • Jaccard's coefficient

    • Adamic/Adar

    • Preferential attachment

  • Ensemble of paths:

    • Katz

    • Hitting time

    • Rooted PageRank

    • SimRank
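As an illustration of the path-based metrics, the Katz score sums over all paths between two nodes and damps longer paths by a factor beta per step. The sketch below truncates the sum at a maximum path length and counts paths (strictly, walks) with plain matrix products. The function name, the default `beta` and the cutoff `max_len` are assumptions, not values from the slides.

```python
def katz_scores(adj, beta=0.05, max_len=5):
    """Truncated Katz: sum over l = 1..max_len of beta**l * (number of walks of length l)."""
    n = len(adj)
    walks = [row[:] for row in adj]  # walks of length 1 are the adjacency matrix itself
    scores = [[beta * walks[i][j] for j in range(n)] for i in range(n)]
    for length in range(2, max_len + 1):
        # Extend every walk by one step: walks <- walks x adj
        walks = [
            [sum(walks[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        for i in range(n):
            for j in range(n):
                scores[i][j] += (beta ** length) * walks[i][j]
    return scores

# Hypothetical path network 0 - 1 - 2.
path_graph = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
katz = katz_scores(path_graph, beta=0.05, max_len=3)
print(katz[0][2])
```

Nodes 0 and 2 are not linked, so their score comes only from the single two-step path through node 1.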



Link Prediction – summary

  • There is room for growth – best predictor has accuracy of only around 9%

  • Predicting collaborations is difficult

  • Finding communities could help if most collaborations are intra-community

  • New problem could be to predict the direction of the link


Techniques: Ranking



Ranking – Markov Chain Based

  • Random-surfer analogy

  • Problem with cycles

  • PageRank uses random vector
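The random-surfer idea can be sketched as a power iteration: at each step the surfer follows an outgoing link with probability `damping`, and otherwise jumps to a random page, which is what keeps cycles from trapping the rank. This is a minimal sketch under common assumptions (uniform random jump, damping of 0.85, rank of pages without outgoing links spread evenly); the function name and the example network are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to.

    Assumes every linked-to page also appears as a key of `links`.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-jump share that every page receives.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            # A page with no outgoing links spreads its rank over all pages.
            targets = out if out else nodes
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c collects rank from two pages and comes out on top, while b receives only the random-jump share.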


Ranking – summary

  • Other methods such as HITS and SALSA are also based on Markov chains

  • Ranking has been applied in other areas:

    • text summarization

    • anomaly detection


Techniques: Influential Nodes



Maximizing influence model-based

  • Problem – finding the k best nodes to activate to maximize the number of nodes activated

  • Models:

    • independent cascade – when activated, a node has a one-time chance to activate each neighbor with probability p_ij

    • linear threshold – a node becomes activated when the fraction of its neighbors that are active crosses a threshold


Maximizing influence model-based

  • Models: independent cascade & linear threshold

  • A function f: S → S* can be created using either model

  • Functions use a Monte Carlo, hill-climbing solution

  • Maximizing a submodular function (for S ⊆ T, adding a node to S gains at least as much as adding it to T) is proven in other work to be NP-hard, but a hill-climbing solution gets within a factor of 1-1/e of the optimum.
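A minimal sketch of the Monte Carlo, hill-climbing approach under the independent cascade model is shown below. It assumes a single activation probability `prob` for every link rather than a separate p_ij per link, and all function names and the example network are hypothetical.

```python
import random

def simulate_cascade(graph, seeds, prob, rng):
    """Run one independent-cascade simulation and return the number of active nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets one chance per inactive neighbor.
                if v not in active and rng.random() < prob:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def expected_spread(graph, seeds, prob, runs, rng):
    """Monte Carlo estimate of the expected number of activated nodes."""
    return sum(simulate_cascade(graph, seeds, prob, rng) for _ in range(runs)) / runs

def greedy_seeds(graph, k, prob=0.1, runs=200, seed=0):
    """Hill-climbing: repeatedly add the node with the largest estimated spread."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    chosen = []
    for _ in range(k):
        candidates = sorted(nodes - set(chosen))
        best = max(
            candidates,
            key=lambda v: expected_spread(graph, chosen + [v], prob, runs, rng),
        )
        chosen.append(best)
    return chosen

# Hypothetical directed network with two separate groups.
network = {"a": ["b", "c"], "b": ["d"], "c": [], "d": [], "e": ["f"], "f": []}
print(greedy_seeds(network, k=2, prob=1.0, runs=5))
```

With `prob=1.0` every attempt succeeds, so the example is deterministic: the first pick reaches the larger group and the second pick reaches the other one. For smaller probabilities the estimate is noisy, and more runs give a steadier choice at a higher cost.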


Maximizing influence – cost/benefit

  • Assumptions:

    • product x sells for $100

    • a discount of 10% can be offered to various prospective customers

  • If the customer purchases, the profit is:

    • 90 if discount is offered

    • 100 if discount is not offered

  • Expected lift in profit (ELP) from offering discount is:

    • 90*P(buy|discount) - 100*P(buy|no discount)
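The ELP expression can be checked with a small function. The profit figures of 90 and 100 come from the slide; the function name and the purchase probabilities in the example are made up for illustration.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount,
                            profit_discount=90.0, profit_full=100.0):
    """ELP = 90 * P(buy | discount) - 100 * P(buy | no discount)."""
    return profit_discount * p_buy_discount - profit_full * p_buy_no_discount

# Hypothetical customer whose purchase probability doubles with the discount.
lift = expected_lift_in_profit(0.5, 0.25)
print(lift)
```

A positive value means the discount is expected to pay for itself for that customer. If the discount does not change the purchase probability, the lift is negative, because the discount only gives up profit.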


Maximizing influence – cost/benefit

  • Goal is to find M that maximizes global ELP

  • Three approximations used:

    • single pass

    • greedy

    • hill-climbing

  • X_i is the decision of customer i to buy

  • Y is the vector of product attributes

  • M is the vector of marketing decisions

  • f is a function that sets the ith element of M

  • r_0 and r_1 are the revenue gained

  • c is the cost of marketing


Comparison of approaches

  • An extension would be to spread influence to the largest number of communities

  • Improvements can be made in speed


Techniques: Community Finding



Gibson, Kleinberg and Raghavan

  • Query → Search Engine → Root Set

  • Base Set: add forward and back links

  • Use HITS to find top 10 hubs and authorities


Reddy and Kitsuregawa

  • Bipartite graph

  • Given an initial set of nodes T build I from the nodes pointed to from T

  • Repeat:

    • use relax_cocite to expand T and I

    • prune T and I using dense bipartite graph function DBPG(T,I,α,β)

[Figure: bipartite graph between node sets T and I, with example nodes u, v and w]


Flake, Lawrence and Giles

  • Uses Min-cut

  • Start with seed set

  • Add linked nodes

  • Find nodes from outgoing links

  • Create virtual source node

  • Add virtual sink linking it to all nodes

  • Find the min-cut of the virtual source and sink
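The min-cut steps above can be sketched with a small max-flow routine (Edmonds-Karp), since the minimum cut between the virtual source and sink equals the maximum flow. The capacities are assumptions made for this sketch: each link gets capacity `edge_cap`, each node's link to the virtual sink gets the smaller capacity `sink_cap`, and seed nodes are tied to the virtual source with unlimited capacity. The function name and the example network are hypothetical.

```python
from collections import defaultdict, deque

def min_cut_community(edges, seeds, edge_cap=5, sink_cap=2):
    """Return the nodes on the source side of a minimum cut around the seeds."""
    cap = defaultdict(lambda: defaultdict(int))
    nodes = set()
    for u, v in edges:                       # undirected links
        cap[u][v] += edge_cap
        cap[v][u] += edge_cap
        nodes.update((u, v))
    source, sink = "_source", "_sink"        # virtual nodes (hypothetical names)
    for s in seeds:
        cap[source][s] = float("inf")
    for v in nodes:
        cap[v][sink] += sink_cap

    def reachable_from_source():
        """Breadth-first search over links with remaining capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break                            # no augmenting path left: flow is maximal
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow
    return set(parent) - {source}

# Hypothetical network: two triangles joined by the single link c - d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
community = min_cut_community(toy_edges, seeds=["a"])
print(sorted(community))
```

The ratio of `sink_cap` to `edge_cap` controls how far the community extends: a larger ratio makes it more expensive to keep nodes on the source side and shrinks the community toward the seeds, while a very small ratio lets it grow to cover the whole network.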


Neville, Adler and Jensen

  • Distance based on links and attributes

  • If a link exists, the score is the number of common attributes; zero otherwise

  • Example attribute vectors: A = 0 1 1 0, B = 1 1 0 0, C = 1 1 0 1

  • score(A,B)=2, score(A,C)=1, score(B,C)=0

  • Used with 3 partitioning algorithms:

    • Karger’s Min-Cut

    • MajorClust

    • Spectral partitioning by Shi & Malik
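The scores on this slide can be reproduced with a short function. The sketch assumes that "common attributes" means positions where the two attribute vectors hold the same value, and that A is linked to B and to C while B and C are not linked. Both assumptions are chosen because they reproduce the three scores given on the slide.

```python
# Attribute vectors from the slide.
attributes = {
    "A": [0, 1, 1, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 1, 0, 1],
}
# Assumed links: A-B and A-C only, since score(B,C) is 0.
links = {frozenset(("A", "B")), frozenset(("A", "C"))}

def score(x, y):
    """Number of matching attribute values if x and y are linked, zero otherwise."""
    if frozenset((x, y)) not in links:
        return 0
    return sum(1 for a, b in zip(attributes[x], attributes[y]) if a == b)

print(score("A", "B"), score("A", "C"), score("B", "C"))
```

B and C agree on three of their four attributes, yet their score is zero because no link joins them, which shows how the measure depends on both links and attributes.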


Communities - summary

  • There are many options for building communities around a small group of nodes

  • Possible future directions

    • finding communities in networks having different link types

    • impact of network type on community finding techniques


Techniques: Link Completion



Goldenberg, Kubica and Komarek

Problem: given a network and n-1 members of a community find the nth

  • random

  • counting

  • popular

  • NB

  • NN

  • cGraph

  • BayesNet

  • EBS and LR


Conclusions

  • Link mining is a young, dynamic field of study with problem areas that continue to emerge and morph as techniques continue to evolve

  • Opportunities for improvements exist in

    • using community knowledge

    • using network knowledge

  • We are the living links in a life force that moves and plays around and through us, binding the deepest soils with the farthest stars.

    • Alan Chadwick


Ranking

[Figure: example graph annotated with the values 14, 5, 3, 9, 2, 15, 6, 9, 4]

  • Based on Markov Chain

  • Rank is sum of node weights from incoming links

  • Breaks down when cycles exist


Ranking - continued

  • General approach

    • a_p = authority score for p

    • B_p = backlinks of p

  • PageRank

  • HITS approach

    • a_p = authority score for p

    • h_p = hub score for p

    • B_p = backlinks of p

  • Normalize between iterations
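The HITS approach can be sketched as two alternating updates: a page's authority score is the sum of the hub scores of its backlinks, and its hub score is the sum of the authority scores of the pages it links to, with both vectors normalized between iterations. The function name, the iteration count and the example network are assumptions made for illustration.

```python
import math

def hits(links, iterations=50):
    """Return (authority, hub) scores for a dict mapping pages to the pages they link to."""
    nodes = set(links) | {w for out in links.values() for w in out}
    backlinks = {v: [] for v in nodes}
    for u, out in links.items():
        for w in out:
            backlinks[w].append(u)
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores over the backlinks of p.
        auth = {p: sum(hub[q] for q in backlinks[p]) for p in nodes}
        # Hub: sum of authority scores over the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in nodes}
        # Normalize both vectors to unit length between iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return auth, hub

# Hypothetical network: h1 links to x and y, h2 links to x only.
authority, hubs = hits({"h1": ["x", "y"], "h2": ["x"]})
print(max(authority, key=authority.get), max(hubs, key=hubs.get))
```

Page x is pointed to by both hubs and becomes the top authority, while h1 points to more authorities and becomes the top hub.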

