
Information Flow Prediction and People Mining

Ching-Yung Lin

IBM T. J. Watson Research Center

May 27, 2007


Data Flow through an Internet Gateway

  • 10 Gbit/s Continuous Feed Coming into System

  • Types of Data

    • Speech, text, moving images, still images, coded application data, machine-to-machine binary communication

  • System Mechanisms

    • Telephony: 9.6 Gbit/s (including VoIP)

    • Internet

      • Email: 250 Mbit/s (about 500 pieces per second)

      • Dynamic web pages: 50 Mbit/s

      • Instant Messaging: 200 kbit/s

      • Static web pages: 100 kbit/s

      • Transactional data: TBD

    • TV: 40 Mbit/s (equivalent to about 10 stations)

    • Radio: 2 Mbit/s (equivalent to about 20 stations)
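As a quick sanity check on the figures above, the component rates can be added up and divided by the stated unit counts. The sketch below assumes decimal prefixes (1 Gbit = 10^9 bit) and 8 bits per byte; the dictionary keys and function names are illustrative only. Under those assumptions the listed components sum to roughly 9.94 Gbit/s, which is consistent with the 10 Gbit/s feed, an average email is about 62.5 kB, a TV station is about 4 Mbit/s, and a radio station is about 100 kbit/s.

```python
# Rates copied from the list above, in bit/s. "Transactional data" is TBD on
# the slide, so it is left out of the total.
FEED_BITS_PER_SEC = {
    "telephony": 9.6e9,
    "email": 250e6,
    "dynamic_web": 50e6,
    "instant_messaging": 200e3,
    "static_web": 100e3,
    "tv": 40e6,
    "radio": 2e6,
}


def total_gbit_per_sec(feed):
    """Sum all component rates and convert to Gbit/s."""
    return sum(feed.values()) / 1e9


def per_unit(rate_bits_per_sec, units):
    """Average rate (or size) per unit, e.g. per email or per station."""
    return rate_bits_per_sec / units


if __name__ == "__main__":
    print(f"total feed: {total_gbit_per_sec(FEED_BITS_PER_SEC):.4f} Gbit/s")
    email_bytes = per_unit(FEED_BITS_PER_SEC["email"], 500) / 8
    print(f"average email: {email_bytes / 1e3:.1f} kB")
    print(f"per TV station: {per_unit(FEED_BITS_PER_SEC['tv'], 10) / 1e6:.1f} Mbit/s")
    print(f"per radio station: {per_unit(FEED_BITS_PER_SEC['radio'], 20) / 1e3:.0f} kbit/s")
```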


Network Monitoring and Stream Analysis

[Figure: dataflow graph by the IBM Dense Information Gliding Team. Inputs are network protocol streams (ip, tcp, udp, http, ftp, ntp, rtp, rtsp, sessions). Processing stages shown: packet content analysis, interest filtering, interest routing, and advanced content analysis (keywords, id, audio, video), yielding the interested multimedia streams. Per-PE rates shown: 200-500 MB/s, ~100 MB/s, and 10 MB/s.]



Borrow this from Hoover...


One of the Issues: Speech Recognition, Speaker and Social Network Detection

[Figure: speaker detection on Streams A, B, C, and D, followed by denoising and social network analysis (social network, fusion technique, iterative method). The example network after denoising shows "talks to" links among Olivier, Mihalis, Upendra, Ching-Yung, and Deepak.]

What can be achieved by combining content analysis and social network analysis?



Challenge – every node in the network is unique

Photo Source: New York Times, 3/2/2005



Part I: Dynamic Probabilistic Complex Network and Information Flow


The Most Difficult Challenge: State of the Art?

  • Our Objectives: Find important people, community structures, or information flow in a network that is dynamic, probabilistic, and complex, in order to allocate resources in a large-scale mining system.

  • Social networks in sociological and statistical fields: focus on (1) overall network characteristics, (2) dynamic random graphs, (3) binary edges, etc. They do not consider probabilistic nodes/edges or individual nodes/edges.

  • Epidemic networks and computer virus networks: focus on (1) overall network characteristics (when will an outbreak occur?), (2) regular / random graphs. They do not focus on individual nodes/edges.

  • (Computer) communication networks: focus on (1) packet transmission, where information is not duplicated, or (2) broadcasting, which does not consider individual nodes/edges or complex network topology.

  • WWW: focus on (1) topology description, (2) binary edges and ranked nodes (e.g., Google PageRank). It does not consider probabilistic edges.



What is a Dynamic Probabilistic Complex Network?


Modeling a Dynamic Probabilistic Complex Network

  • [Assumption] A DPCN can be represented by a Dynamic Transition Matrix P(t), a Dynamic Vertex Status Random Vector Q(t), and two dependency functions f_M and g_M.

[Equations not recovered from the slide: the definitions of P(t) and Q(t) in terms of f_M and g_M. Their elements are the status value of vertex i at time t and the status value of edge i → j at time t.]


Information Flow in Dynamic Probabilistic Complex Network (Let’s call it: Behavioral Information Flow (BIF) Model)

  • [Assumption] Each edge can be represented by a four-state S-D-A-R (Susceptible-Dormant-Active-Removed) Markov model. Each node can be represented by a three-state S-A-I (Susceptible-Active-Informed) model.

[Equations not recovered from the slide.]



Major Difference between BIF and Prior Modeling Methods in Epidemic Research and Computer Virus Fields

  • Prior Models:

    • Model Human Nodes as S-I-R (Susceptible, Infected, and Removed).

    • Did not consider how an individual node’s behavior differs with network structure/topology, and therefore did not consider edge status.

  • We propose to model edge status as (autonomous) S-D-A-R Markov Model (Susceptible, Dormant, Active, Removed)

  • We propose to model human node behavior as S-A-I (Susceptible, Active, and Informed).


Edges are Markov State Machines, Nodes are not

  • State transitions of edges: S-D-A-R model (Susceptible, Dormant, Active, and Removed). This describes how the state of an edge changes over time.

  • States of nodes: S-A-I model (Susceptible, Active, and Informed). A trigger occurs when the start node of the edge changes from state S to state I.

[Figure: edge view (state diagram with states S, D, A, R and a trigger), node view (state diagram with states S, A, I and a trigger), and network view.]
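The edge-as-state-machine idea can be sketched as a small simulation. The slide's actual transition probabilities were not recovered, so the rules below are assumptions for illustration only: a trigger moves an edge from S to D, a dormant edge wakes up with a hypothetical per-step probability `p_wake`, an active edge finishes with a hypothetical per-step probability `p_finish`, and R is absorbing.

```python
import random
from enum import Enum


class EdgeState(Enum):
    S = "susceptible"
    D = "dormant"
    A = "active"
    R = "removed"


def step_edge(state, triggered, p_wake, p_finish, rng):
    """Advance one edge by one time step (assumed rules, not the slide's f(.))."""
    if state is EdgeState.S:
        # The edge only leaves S when its start node triggers it.
        return EdgeState.D if triggered else EdgeState.S
    if state is EdgeState.D:
        return EdgeState.A if rng.random() < p_wake else EdgeState.D
    if state is EdgeState.A:
        return EdgeState.R if rng.random() < p_finish else EdgeState.A
    return EdgeState.R  # R is absorbing


def run_edge(trigger_time, steps, p_wake, p_finish, seed=0):
    """Return the state history of one edge, starting in S."""
    rng = random.Random(seed)
    state = EdgeState.S
    history = [state]
    for t in range(steps):
        state = step_edge(state, t == trigger_time, p_wake, p_finish, rng)
        history.append(state)
    return history


if __name__ == "__main__":
    for t, s in enumerate(run_edge(trigger_time=2, steps=10, p_wake=0.5, p_finish=0.5)):
        print(t, s.name)
```

Keeping the state on the edge rather than on the node is what lets two edges leaving the same informed node behave differently, which is the point the slide makes against node-only S-I-R models.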


Edge State Probability and Network Configuration Model

  • Nodes and Edges

  • Network Configuration Model (which is learned by training). It includes the network topology information, long-term edge probability, and delay parameter.

  • α_{i,j} = 0 ⇒ no edge between i and j

  • Our KDD 2005 paper is a special case in which α_{i,j} = 1 or 0, and it did not model (β_{i,j}, γ_{i,j})


Define Edge State Probability Update Function

[Figure: edge state diagram with states S, D, A, and R, and a trigger.]

Edge State Probability

Update function f(.) such that:

  • Given three different cases:

    • On trigger:

    • No trigger – node not informed yet:

    • No trigger – node has been informed:

  • Therefore, considering the probabilities of the node states, we get f(.):


Nodes: State Transitions Determined by Incoming Edges

[Figure: node state diagram with states S, A, and I, and a trigger.]

  • Node State Probability Update Function g(.):

[Equation not recovered from the slide], where W_{V,i} is the set of all source nodes of the incoming edges of Node i.

[Figure: network view.]
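Since the formula for g(.) did not survive, the following is only a sketch of one plausible update, not the slide's definition. It assumes that a node stops being susceptible as soon as at least one incoming edge is active, and that incoming edges act independently, so the probability of remaining uninformed is the product of the "no active edge" probabilities. The helper `incoming_sources` plays the role of W_{V,i}; all names are hypothetical.

```python
def incoming_sources(edges, node):
    """Set of source nodes of the incoming edges of `node` (the role of W_{V,i})."""
    return {src for (src, dst) in edges if dst == node}


def update_node_probability(p_informed, incoming_active_probs):
    """One update step for P(node is no longer susceptible).

    Assumes independent incoming edges: the node stays uninformed only if it
    was uninformed before and none of its incoming edges is active.
    """
    p_stay_uninformed = 1.0 - p_informed
    for p_active in incoming_active_probs:
        p_stay_uninformed *= 1.0 - p_active
    return 1.0 - p_stay_uninformed


if __name__ == "__main__":
    # Probability that each edge (src, dst) is currently active.
    edge_active = {("a", "c"): 0.5, ("b", "c"): 0.5, ("c", "d"): 0.1}
    sources = incoming_sources(edge_active, "c")
    probs = [edge_active[(src, "c")] for src in sorted(sources)]
    print(sorted(sources), update_node_probability(0.0, probs))
```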



An Application of Information Flow Prediction – find important people

  • Who are the most likely people to talk about this information at a specific time given the current observation?

  • For a given concrete observation, the values in the given priors are either 0 or 1.

  • For speaker recognition results, the priors can be confidence values between 0 and 1.


Case Study I – Switchboard data from 679 people

  • Monte Carlo Method: Simulate each DPCN information flow 1000 times.

  • It takes 12 seconds to predict the process with MC simulation. (For a given model, testing all 679 nodes takes a PC 130 minutes to calculate the probabilities when the information flow starts from each of the 679 different seeds.)
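A Monte Carlo estimate of this kind can be sketched as follows. This is a simplified stand-in, not the BIF model itself: it assumes each edge carries the information with a fixed probability and a fixed integer delay (both hypothetical parameters), samples which edges are "live" once per run, and records which nodes are reached within a time horizon. Repeating the run 1000 times gives per-node probabilities. For scale, 12 seconds per seed over 679 seeds comes to roughly 136 minutes, the same order as the 130 minutes reported above.

```python
import random


def simulate_once(edges, seed_node, horizon, rng):
    """One run. edges maps (src, dst) -> (p_transmit, delay_steps).

    Returns {node: earliest time informed} for nodes reached by `horizon`.
    """
    # Sample each edge exactly once per run, then propagate along live edges.
    live = [(a, b, d) for (a, b), (p, d) in edges.items() if rng.random() < p]
    informed_at = {seed_node: 0}
    changed = True
    while changed:
        changed = False
        for a, b, d in live:
            if a in informed_at:
                t = informed_at[a] + d
                if t <= horizon and t < informed_at.get(b, horizon + 1):
                    informed_at[b] = t
                    changed = True
    return informed_at


def estimate_informed_probability(edges, seed_node, horizon, runs=1000, seed=0):
    """Fraction of runs in which each node is informed by `horizon`."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        for node in simulate_once(edges, seed_node, horizon, rng):
            counts[node] = counts.get(node, 0) + 1
    return {node: c / runs for node, c in counts.items()}


if __name__ == "__main__":
    toy_edges = {("a", "b"): (0.8, 1), ("b", "c"): (0.5, 2), ("a", "c"): (0.1, 1)}
    print(estimate_informed_probability(toy_edges, "a", horizon=3))
```

Ranking nodes by the estimated probability at a given horizon is one way to answer the "who is most likely to talk about this at a specific time" question posed earlier.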



The distribution histogram of the alpha values of the edges in the Enron dataset.


Noise Factor I: Impact of Classification Error from Speaker Recognition

[Figure: truth versus detected counts, with labels Z and K scaled by f_i and φ_i.]

  • Assume the classification precision rate on speaker (node) i is f_i, and the false alarm rate on speaker i is φ_i.

  • Then the expected number of times that the node is counted is:

  • And the expected number of times that the link is counted is:

  • Therefore,

  • If we assume a universal precision and false alarm rate for all speakers, then:

  • If we assume that the average waiting time of links and the average transmission duration of links are the same regardless of the links observed, then:

  • If we assume the false alarm rate is small and can be neglected when the number of nodes is large, then:

[The equations on this slide were not recovered.]


Speaker Recognition Accuracy can be Improved by Fusion of Original Speaker Recognition and Predicted Node Probability

  • We can use this fusion method to combine both the speaker recognition result and the estimated node probability:

[Equation not recovered from the slide. The fused result is guaranteed to be increasing under a condition that was also not recovered.]

[Figure: recognizer output for Speaker i before fusion, and after fusion with the BIF prediction.]


Recognition Result from Switchboard-2 Telephone Conversation Set

  • Improvement in recognition accuracy on Node 171. The x-axis is the time at which the model is updated based on the recognition result after fusion. The y-axis represents the recognition accuracy. In the six testing cases, Node 171 is usually confused with Node 218 or Node 164. In the first two cases, there is no false alarm from the classification of Node 218 or 164. In the next two cases, they are usually confused with each other. In the last two cases, the false alarm rate from Node 218 or 164 is 0.3.



Case Study (II) – our experiments on Enron Emails


Modeling and Predicting Topic-Related Personal Information Flow

  • Content-Time-Relation Model: combines content, time, and social relation information with Dirichlet allocations and a causal Bayesian network. [Song et al., KDD, August 2005] (first paper combining content analysis and social network analysis)

Given the sender a_d and the time t of an email:

1. Get the probability of a topic given the sender

2. Get the probability of the receiver given the sender and the topic

3. Get the probability of a word given the topic

[Figure: plate diagram with nodes S, A, z, w, T, r, N, D, and Tm, marking which variables are observations. Boxes represent iteration.]

Legend: a: sender/author; z: topic; S: social network (Exponential Random Graph Model / p* model); D: document/email; r: receivers; w: content words; N: word set; T: topic.
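The three steps above can be chained by summing over topics. The sketch below is an illustration under stated assumptions rather than the published model: it treats the three conditional distributions as already-estimated lookup tables, assumes the receiver and the word are conditionally independent given the sender and the topic, and ignores the time variable. The table contents and names are made up for the example.

```python
def receiver_word_probability(p_topic_given_sender,
                              p_receiver_given_sender_topic,
                              p_word_given_topic,
                              sender, receiver, word):
    """P(receiver, word | sender) = sum_z P(z|a) * P(r|a,z) * P(w|z)."""
    total = 0.0
    for topic, p_z in p_topic_given_sender[sender].items():       # step 1
        p_r = (p_receiver_given_sender_topic
               .get((sender, topic), {}).get(receiver, 0.0))      # step 2
        p_w = p_word_given_topic.get(topic, {}).get(word, 0.0)    # step 3
        total += p_z * p_r * p_w
    return total


if __name__ == "__main__":
    # Toy tables (hypothetical values, for illustration only).
    p_z_a = {"alice": {"power": 0.5, "meeting": 0.5}}
    p_r_az = {
        ("alice", "power"): {"bob": 0.5, "carol": 0.5},
        ("alice", "meeting"): {"bob": 1.0},
    }
    p_w_z = {
        "power": {"california": 0.5, "market": 0.5},
        "meeting": {"schedule": 1.0},
    }
    print(receiver_word_probability(p_z_a, p_r_az, p_w_z, "alice", "bob", "california"))
```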


Corporate Topic Trend Analysis Example: Yearly repeating events

Topic 45, which concerns a scheduling issue, peaks from June to September.

Topic 19 concerns a meeting issue. Its trend repeats from year to year.


Topic Detection and Key People Detection of “California Power” Match Their Real-Life Roles

The event “California Energy Crisis” occurred in exactly this time period. The key people are active in this event, except Vince_Kaminski …


Social Network of Enron Managers

  • Finding social networks from all communications at once is difficult.


Information Flow in Enron – California Market

  • Actor 151 (Rosalee Fleming, assistant to Enron CEO Ken L.) is the key information spreader on this issue.



Information Flow in Enron – Market Opportunities

  • Rosalee Fleming also played an important role at “Market Opportunities.” She received info from Actor 119 (Mike Carson) and Actor 23 (James Steffes – VP of Gov. Affairs of Enron.)

  • Actor 68 (Rod Hayslett -- CFO) is also a major information spreader.



Information Flow in Enron – North American Products

  • Two disjoint communities can be observed. Actor 21 (Keith Holst) and Actor 142 (Dan Hyvl) are the main bridges of the two communities.


This kind of analysis is wonderful, but...

  • We cannot wait until our company has a scandal and goes bankrupt...

  • What kinds of applications can be valuable out of network analysis?



Part II: Small Blue


Social Network -- A key differentiator for corporate performance

  • Informal social networks within formal organizations are a major factor affecting companies’ performance:

    • Krackhardt (CMU, 2005) showed that companies with strong informal networks perform five or six times better than those with weak networks.

    • Brydon (VisiblePath, 2006) showed the performance gains of companies utilizing social networks:

      • 16x at sales

      • 4x at marketing

      • 10x at hiring



We hope social network and expertise mining can dramatically increase our colleagues’ knowledge and collaboration



Social Networks -- Beyond the organizational chart

  • Organization charts are not the best indicator of how work gets done

  • Senior people are not always central; peripheral people can represent untapped knowledge

  • Making the network visible makes it actionable and becomes the basis for a collaboration action plan

Source: Cross, R., Parker, A., Prusak, L. & Borgatti, S.P. 2001. Knowing What We Know: Supporting Knowledge Creation and Sharing in Social Networks. Organizational Dynamics 30(2): 100-120. [pdf]

Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM


Group and Roles

Central people

  • Sam. Could be a bottleneck or could be holding the group together

Peripheral people

  • Earl. Goes to others, but no one goes to him for information. At risk of leaving. Potentially unrealized expertise

Sub-groups

  • Group split by function (Marketing, Finance, Manufacturing). Very little information shared across groups

[Figure: network of Andy, Frank, Indojit, Carl, Karen, Darren, Bob, Sam, Ming, Neo, Leo, Earl, Gerry, Harry, and Jeff, grouped into Marketing, Finance, and Manufacturing.]

This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research


Some Roles are Especially Critical

What happens if Sam leaves the group through layoffs, job reassignment, attrition, merger, or retirement?

[Figure: the same Marketing, Finance, and Manufacturing network with Sam removed.]

This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research


Relationships are multi-dimensional and (traditionally) uncovered through network questions

  • Communication: How often do you communicate with this person?

  • Awareness: I am aware of this person’s knowledge and skills.

  • Trust: I believe there is a high personal cost in seeking advice or support from this person.

  • Innovation: How often do you turn to this person for new ideas?

  • Valued Expertise: How likely are you to turn to this person for specialized expertise?

  • Access: I believe this person will respond to my request in a reasonable and timely manner.

  • Advice: How often do you seek advice from this person before making an important decision?

  • Learning: How likely are you to rely on this person for advice on new methods and processes?

  • Energy: I generally feel energized when I interact with this person.

[On the slide, the dimensions are grouped under Actions, Awareness, and Emotional.]

Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM


Personal Network preferred source for information and collaboration

GBS Practitioner with task in project / delivery environment

  • Forces:

    • Time constrained

    • Delivery activity focus

    • What gets measured gets done

    • Expedience

    • Perceived value (return on time investment)

  • High reliance on:

    • Personal networks: 50% ~ 75% (Gartner Report, 2006)

    • Hard-drive materials

    • What has worked for them previously (personal experience)

Personal Network (preferred / primary mode)

  • Fast turnaround of request

  • Specific response

  • Small number of relevant items returned

  • Recommendation of quality

  • Ability to quickly understand the supplied resource & determine relevant parts

  • Additional context / value-add info not available in electronic materials

Existing Resources Provided

  • PSN, Methods, Education, other w3 content, KnowledgeView, Communities, Project Repositories, Collaboration, Project Tools (reached through W3 stubs / clients)

  • Standalone, disparate, poor integration, large number of sources, steep learning curve (identify, understand & synthesise into specific work context), difficult to locate, choose & use.

This leads to:

  • Under-utilisation of electronic products and services.

  • Content has lower performance impact / not realising full potential benefits.

  • Widely inconsistent working practices.

Who knows what? How to reach them? Who plays what hidden roles?


Mining Expertise, Interests and Social Network

  • People can be “known” by:

    • public resources:

      • publications

      • personal webpages

      • blogs

      • presentations

      • wiki

    • organizational resources:

      • patent applications

      • bluepages

    • personal resources:

      • emails

      • instant messaging

      • meeting

      • phone calls

      • face-to-face interactions

  • A person’s expertise can also be inferred from her friends’ recommendations or expertise.

[Figure annotations: the resources are arranged from public to private, with the note "timely & abundant resources for expertise modeling".]


SmallBlue Clients (Distributed Automatic Social Sensors)

  • My personal network (Ego net) inferred from my Notes emails in server/local/archive and SameTime chats

  • Inference of my understanding of my friends’ expertise

  • Other IBMers’ EgoNets and Expertise Inferences: I cannot see their communications, EgoNets, or Expertise Inferences

External Data

  • Bluepages, BlueGroups, CommunityMap, BlogCentral, IBM Forum, KnowledgeView, Social Bookmark

SmallBlue Inference Engines and Servers

SmallBlue Find

  • User searches for experts or a person

  • Corporate-wide ranked experts

  • Ranked experts in my extended personal network, in a business unit and/or in a country

  • Only public information is shown

SmallBlue Connect

  • Social network analysis of Top-K experts

  • Social network analysis of a list of people

  • Social network analysis (SNA): who are the key persons in this network? Who are the major hubs? Who are the major bridges?

  • SNA of a formal group, a bluegroup or a community

SmallBlue Ego

  • My friends’ social values to me

  • Evolution of my Ego net

SmallBlue Reach

  • How to reach a person; social network info

  • My social paths to her: which friends can introduce her, which friends work with her, ... (trust, awareness, collaboration)

  • Her public postings, profiles, and communities, to judge whether she is the right person

SmallBlue Expand

  • Who I may want to know

  • Which communities I may want to join

  • Which documents I may want to look at

[Figure labels: Public; Public & Personalized; Private & Personalized.]


Major Use of SmallBlue Find

  • Find out who the experts are for any search term. (Right now, zillions of possible terms.)

  • Rank them based on collaborative expert recommendation

  • Can show experts based on:

    • the whole corporation

    • business unit

    • country

    • my personal proximity


Collaborative Expert Recommendation

  • Combine everyone’s knowledge of the expertise of our colleagues.

  • The more recommendations from more colleagues, the higher the score.

  • The more recommendations from my trusted colleagues, the higher the score.

  • The higher the recommendation scores from colleagues, the higher the overall score.

  • Combining all IBMers’ knowledge, we can make an advanced expert finding search engine.

  • Utilizing the expert search engine, we can enhance all IBMers’ knowledge and social connections.
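One simple scoring rule that satisfies the three monotonicity properties listed above is a trust-weighted sum of recommendation strengths. The slides do not give SmallBlue's actual formula, so the sketch below is an assumption for illustration: each recommendation is a hypothetical `(recommender, expert, strength)` triple, and `trust` holds the searcher's personal weight for each recommender, defaulting to 1.0.

```python
def expert_scores(recommendations, trust=None, default_trust=1.0):
    """Trust-weighted sum of recommendation strengths per expert.

    recommendations: iterable of (recommender, expert, strength) triples.
    trust: optional {recommender: weight} for the person running the search.
    """
    trust = trust or {}
    scores = {}
    for recommender, expert, strength in recommendations:
        weight = trust.get(recommender, default_trust)
        scores[expert] = scores.get(expert, 0.0) + weight * strength
    return scores


def rank_experts(recommendations, trust=None):
    """Experts sorted by descending score, ties broken by name."""
    scores = expert_scores(recommendations, trust)
    return sorted(scores, key=lambda e: (-scores[e], e))


if __name__ == "__main__":
    recs = [("ann", "xu", 1.0), ("bob", "xu", 1.0), ("cat", "yi", 1.0)]
    print(rank_experts(recs))                       # unpersonalized ranking
    print(rank_experts(recs, trust={"cat": 3.0}))   # personalized by my trust in cat
```

More recommenders, more trusted recommenders, or stronger individual recommendations each raise an expert's total, matching the three bullets; the same data can therefore yield a corporate-wide ranking or a personalized one.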


SmallBlue Reach Paths help users to reach another person

  • SmallBlue Reach Paths show the shortest paths for me to reach a person up to 6 degrees away.

  • SmallBlue Reach Paths can be initiated from any one of three SmallBlue applications.

  • Can be used for:

    • Access: knowing who can help introduce me to this person.

    • Trust: knowing who in my social networks knows this person.

    • Get Familiar with: knowing what kinds of people are in contact with this person.

    • Initiate Communication: who do we know in common.
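Finding all shortest paths within a degree limit is a standard breadth-first search. The sketch below is a generic illustration, not SmallBlue's implementation; it assumes the social network is an adjacency dictionary (a hypothetical representation) and returns every shortest path from one person to another, or an empty list when the target is more than `max_degrees` hops away.

```python
from collections import deque


def reach_paths(graph, source, target, max_degrees=6):
    """All shortest paths from source to target, limited to max_degrees hops."""
    if source == target:
        return [[source]]
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_degrees:
            continue  # do not look beyond the degree limit
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                parents[nbr] = [node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                parents[nbr].append(node)  # another equally short route
    if target not in dist:
        return []

    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in parents[node] for path in build(p)]

    return build(target)


if __name__ == "__main__":
    net = {
        "me": ["ann", "bob"],
        "ann": ["me", "eve"],
        "bob": ["me", "eve"],
        "eve": ["ann", "bob"],
    }
    for path in reach_paths(net, "me", "eve"):
        print(" -> ".join(path))
```

Returning every shortest path, rather than just one, is what supports the "who do we know in common" use: each intermediate person on a path is a possible introducer.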



SmallBlue Ego

  • How healthy is my personal social capital?

  • What is the social value of Alice to me?

  • What are the changes and trends of my social capital evolution?

    • For instance, I have to talk to Alice soon. She is valuable to me in terms of social connections and she is getting out of the Ego net circle..



SmallBlue Connect

  • Enterprise Social Network Analysis Tool

  • Showing Social Networks of people based on:

    • expertise key words

    • formal hierarchy

    • Any list of emails

  • Utilizing Social Network Analysis to show:

    • who are the important hubs among experts

    • who are the important bridges linking groups
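The slides do not say which measures SmallBlue Connect uses, so the following sketch picks two simple stand-ins and labels them as such: hubs are taken to be the people with the most direct connections (degree), and bridges are taken to be the people whose removal splits the network into more pieces. The graph is a hypothetical undirected adjacency dictionary in which every person appears as a key.

```python
def degree_hubs(graph, top_k=1):
    """People with the most direct connections (ties broken by name)."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:top_k]


def count_components(graph, removed=None):
    """Number of connected components, optionally ignoring one person."""
    seen = set() if removed is None else {removed}
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return components


def bridge_people(graph):
    """People whose removal disconnects part of the network."""
    base = count_components(graph)
    return sorted(n for n in graph if count_components(graph, removed=n) > base)


if __name__ == "__main__":
    # Two small groups that only communicate through "sam" (toy data).
    net = {
        "andy": ["bob", "sam"],
        "bob": ["andy", "sam"],
        "sam": ["andy", "bob", "carl", "dan"],
        "carl": ["sam", "dan"],
        "dan": ["sam", "carl"],
    }
    print("hubs:", degree_hubs(net))
    print("bridges:", bridge_people(net))
```

The brute-force bridge check costs one graph traversal per person, which is fine for a group-sized network but would need a linear-time articulation-point algorithm or a betweenness measure at enterprise scale.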


Privacy Consideration – Bottom Line

  • Employees’ communications (e.g., time, from, to, cc, subject, content of emails, SameTime, etc.) are NOT searchable or retrievable by anyone.

  • Employees’ knowledge of other employees is INFERRED. Only the aggregated inferred knowledge is searchable. It is NOT possible to guess which part of the aggregated inferred knowledge was contributed by whom.

  • In the social network analysis graphs, relationships between people are modeled by their multimodal generic relationships. They give NO clue to the content of the communication.

  • Only employees’ outgoing emails and instant messages, and only the portion authored by the employee, are used.

  • Anyone can suggest keywords that should not be searchable, specify search terms that should not find him or her, or ask to be removed from the system.



Preliminary User Evaluation



Demo


Coincidence??

  • SmallBlue Find and Connect: Trial Release (9/20)

  • SmallBlue Ego: Trial Release (8/21)

  • SmallBlue on TAP (11/07)



Acknowledgements

  • Thanks to the SmallBlue Team Members:

    • Vicky Griffits-Fisher,

    • Kate Ehrlich,

    • Christopher Desforges,

    • Michael Ackerbaruer,

    • Reynold Khachatourian,

    • Irina Fedulova,

    • Ekaterina Zaytseva,

    • Jeffrey Borden,

    • Jennifer Xu,

    • Yi Gu,

    • Jie Lu,

    • Dima Rekesh

    • Belle Tseng

    • Xiaodan Song

  • Contact: Ching-Yung Lin (chingyung@us.ibm.com)

    ( http://www.research.ibm.com/people/c/cylin )

