Information Flow Prediction and People Mining Ching-Yung Lin IBM T. J. Watson Research Center May 27, 2007
Data Flow through an Internet Gateway • 10 Gbit/s continuous feed coming into the system • Types of data: speech, text, moving images, still images, coded application data, machine-to-machine binary communication • System mechanisms: • Telephony: 9.6 Gbit/s (including VoIP) • Internet: email 250 Mbit/s (about 500 pieces per second); dynamic web pages 50 Mbit/s; instant messaging 200 Kbit/s; static web pages 100 Kbit/s; transactional data TBD • TV: 40 Mbit/s (equivalent to about 10 stations) • Radio: 2 Mbit/s (equivalent to about 20 stations)
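As a quick sanity check on the rates listed on this slide, the sketch below adds them up and derives the average email size implied by "250 Mbit/s, about 500 pieces per second". It assumes decimal units (1 Gbit = 1000 Mbit), reads "Mb" as megabit, and leaves out the transactional data marked TBD. The dictionary keys and function names are illustrative only.

```python
# Feed rates quoted on the slide, converted to Mbit/s.
# Assumptions: decimal units, "Mb" means megabit, and the
# "Transactional data: TBD" entry is left out of the sum.
RATES_MBIT_PER_S = {
    "telephony": 9600.0,
    "email": 250.0,
    "dynamic_web": 50.0,
    "instant_messaging": 0.2,
    "static_web": 0.1,
    "tv": 40.0,
    "radio": 2.0,
}


def total_rate_mbit(rates):
    """Sum of all listed feed rates in Mbit/s."""
    return sum(rates.values())


def average_item_kbytes(rate_mbit_per_s, items_per_s):
    """Average size of one item in kilobytes (1 kB = 1000 bytes)."""
    bits_per_item = rate_mbit_per_s * 1_000_000 / items_per_s
    return bits_per_item / 8 / 1000
```

Under these assumptions the listed sources add up to about 9.94 Gbit/s, which is consistent with the 10 Gbit/s headline figure, and the email numbers imply an average of about 62.5 kB per message. The same division gives roughly 4 Mbit/s per TV station and 0.1 Mbit/s per radio station.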
Network Monitoring and Stream Analysis Dataflow Graph (by IBM Dense Information Gliding Team) [Figure: dataflow graph from inputs through packet content analysis (ip, tcp, udp, http, ftp, rtp, rtsp, ntp, sess, audio, video, keywords, id), interest filtering and interest routing of interested MM streams, to advanced content analysis; per-PE rates shown: 200-500 MB/s, ~100 MB/s, 10 MB/s.]
One of the issues – Speech Recognition, Speaker & Social Network Detection • Denoising & Social Network Analysis • Speaker Detection: Olivier Mihalis talks to Upendra; Ching-Yung talks to Deepak • After denoising: social network, fusion technique, iterative method • What can be achieved by combining content analysis and social network analysis? [Figure: Stream A, Stream B, Stream C, Stream D.]
Challenge – every node in the network is unique Photo Source: New York Times, 3/2/2005
Part I: Dynamic Probabilistic Complex Network and Information Flow
The Most Difficult Challenge: State of the Art? Our Objectives: Find important people, community structures, or information flow in a network that is dynamic, probabilistic and complex, in order to allocate resources in a large-scale mining system. • Social networks in sociological and statistical fields: focus on (1) overall network characteristics, (2) dynamic random graphs, (3) binary edges, etc. They do not consider probabilistic nodes/edges or individual nodes/edges. • Epidemic networks and computer virus networks: focus on (1) overall network characteristics, such as when an outbreak will occur, (2) regular / random graphs. They do not focus on individual nodes/edges. • (Computer) communication networks: focus on (1) packet transmission, where information is not duplicated, or (2) broadcasting, without considering individual nodes/edges or complex network topology. • WWW: focus on (1) topology description, (2) binary edges and ranked nodes (e.g., Google PageRank). It does not consider probabilistic edges.
Modeling a Dynamic Probabilistic Complex Network • [Assumption] A DPCN can be represented by a Dynamic Transition Matrix P(t), a Dynamic Vertex Status Random Vector Q(t), and two dependency functions f_M and g_M. [Equations not preserved in this transcript.] • The definitions use the status value of vertex i at time t and the status value of edge i → j at time t.
Information Flow in Dynamic Probabilistic Complex Network (Let’s call it the Behavioral Information Flow (BIF) Model) • [Assumption] Each edge can be represented by a four-state S-D-A-R (Susceptible-Dormant-Active-Removed) Markov model. Each node can be represented by a three-state S-A-I (Susceptible-Active-Informed) model. [Equations not preserved in this transcript.]
Major Difference between BIF and Prior Modeling Methods in Epidemic Research and Computer Virus Fields • Prior models: • Model human nodes as S-I-R (Susceptible, Infected, and Removed). • Did not consider how an individual node’s behavior differs with network structure/topology, and did not consider edge status. • We propose to model edge status as an (autonomous) S-D-A-R Markov model (Susceptible, Dormant, Active, Removed). • We propose to model human node behavior as S-A-I (Susceptible, Active, and Informed).
Edges are Markov State Machines, Nodes are not • State transitions of edges: S-D-A-R model (Susceptible, Dormant, Active, and Removed). This indicates how the state of an edge changes over time. • States of nodes: S-A-I model (Susceptible, Active, and Informed). • Trigger occurs when the start node of the edge changes from state S to state I. [Figure: edge view (states S, D, A, R with trigger); node view (states S, I, A with trigger); network view.]
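To make the edge-as-state-machine idea concrete, here is a minimal sketch of an S-D-A-R edge. The slide's transition diagram is not preserved, so the linear chain S → D → A → R, the per-step probabilities `p_wake` and `p_stop`, and all function names are assumptions made purely for illustration, not the BIF model's actual parameters.

```python
import random
from enum import Enum


class EdgeState(Enum):
    S = "susceptible"
    D = "dormant"
    A = "active"
    R = "removed"


def step_edge(state, triggered, p_wake, p_stop, rng):
    """Advance one edge by one time step.

    Assumed chain: S -> D on trigger, D -> A with probability p_wake,
    A -> R with probability p_stop; R is absorbing.
    """
    if state is EdgeState.S:
        return EdgeState.D if triggered else EdgeState.S
    if state is EdgeState.D:
        return EdgeState.A if rng.random() < p_wake else EdgeState.D
    if state is EdgeState.A:
        return EdgeState.R if rng.random() < p_stop else EdgeState.A
    return EdgeState.R


def run_edge(trigger_time, steps, p_wake, p_stop, seed=0):
    """Return the state history of one edge whose start node fires at trigger_time."""
    rng = random.Random(seed)
    state = EdgeState.S
    history = [state]
    for t in range(steps):
        state = step_edge(state, t == trigger_time, p_wake, p_stop, rng)
        history.append(state)
    return history
```

With `p_wake = p_stop = 1.0` the edge walks deterministically through D, A and R after the trigger; smaller values model a delay before the edge becomes active and a longer active period.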
Edge State Probability and Network Configuration Model • Nodes and Edges • Network Configuration Model (learned by training): includes the network topology information, long-term edge probability, and delay parameter. • α_{i,j} = 0: no edge between i and j. • Our KDD 2005 paper is a special case in which α_{i,j} = 1 or 0, and it did not model (β_{i,j}, γ_{i,j}).
Define Edge State Probability Update Function • The edge state probability update function f(.) is defined for three cases: (1) on trigger; (2) no trigger, node not informed yet; (3) no trigger, node has been informed. • Combining these cases with the probabilities of the node states gives f(.). [Figure: edge state diagram S, D, A, R with trigger; the case equations are not preserved in this transcript.]
Nodes: State Transitions Determined by Incoming Edges • Node State Probability Update Function g(.), where Ω_{V,i} is the set of all source nodes of the incoming edges of node i. [Figure: node state diagram S, I, A with trigger; network view. The equation for g(.) is not preserved in this transcript.]
An Application of Information Flow Prediction – find important people • Who are the most likely people to talk about this information at a specific time, given the current observation? • For a given concrete observation, the values in the given priors are either 0 or 1. • For speaker recognition results, the priors can be confidence values between 0 and 1.
Case Study I – Switchboard data from 679 people • Monte Carlo method: simulate each DPCN information flow 1000 times. • MC simulation takes 12 seconds to predict the process. (For a given model, testing all 679 nodes takes a PC 130 minutes to calculate the probabilities when the information flow starts from each of the 679 different seeds.)
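The Monte Carlo idea on this slide can be sketched as follows. This is a deliberately simplified stand-in, not the BIF simulator: each edge fires at most once with a fixed probability, there are no dormant/active delays, and the toy network, probabilities and function names are all hypothetical.

```python
import random


def simulate_once(edges, seed_node, rng):
    """One simulated flow: edge i -> j passes the information with probability edges[i][j]."""
    informed = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for i in frontier:
            for j, p in edges.get(i, {}).items():
                if j not in informed and rng.random() < p:
                    informed.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return informed


def estimate_informed_prob(edges, seed_node, runs=1000, seed=0):
    """Estimate, per node, the fraction of runs in which it ends up informed."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        for node in simulate_once(edges, seed_node, rng):
            counts[node] = counts.get(node, 0) + 1
    return {node: c / runs for node, c in counts.items()}


# Hypothetical toy network: a always informs b, never c; b informs d half the time.
TOY_EDGES = {"a": {"b": 1.0, "c": 0.0}, "b": {"d": 0.5}}
```

Calling `estimate_informed_prob(TOY_EDGES, "a")` gives probability 1.0 for `a` and `b`, omits `c`, and an estimate near 0.5 for `d`. Repeating the estimate for every possible seed node is what makes the all-seeds run expensive: 679 seeds at 12 seconds each is about 136 minutes, roughly in line with the 130 minutes quoted.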
The distribution histogram of the alpha values of the edges in the Enron dataset.
Noise Factor I – Impact of Classification Error from Speaker Recognition • Assume the classification precision rate on speaker (node) i is f_i, and the false alarm rate on speaker i is φ_i. • From these, the slide derives the expected number of times that a node is counted and that a link is counted. • The result is then simplified under successive assumptions: a universal precision and false alarm rate for all speakers; an average waiting time and average transmission duration of links that are the same regardless of the links observed; and a false alarm rate that is small and can be neglected when the number of nodes is large. [Figure: truth vs. detected, with labels Z, K, φi2Z, fiK; the equations are not preserved in this transcript.]
Speaker Recognition Accuracy can be Improved by Fusion of Original Speaker Recognition and Predicted Node Probability • We can use this fusion method to combine both the speaker recognition result and the estimated node probability; the fused result is guaranteed to be increasing under a condition given on the slide. [Figure: Speaker i recognizer before fusion; BIF prediction; Speaker i recognizer after fusion with BIF prediction. The fusion equation and its condition are not preserved in this transcript.]
Recognition Result from Switchboard-2 Telephone Conversation Set • Improvement in recognition accuracy on Node 171. The x-axis is the time at which the model is updated based on the recognition result after fusion. The y-axis represents the recognition accuracy. In the six test cases, Node 171 is usually confused with Node 218 or Node 164. In the first two cases, there is no false alarm from the classification of Node 218 or 164. In the next two cases, they are usually confused with each other. In the last two cases, the false alarm rate from Node 218 or 164 is 0.3.
Modeling and Predicting Topic-Related Personal Information Flow • Content-Time-Relation Model: combine content, time and social relation information with Dirichlet allocations and a causal Bayesian network. [Song et al., KDD, August 2005] (1st paper combining content analysis and social network analysis) • Given the sender a and the time t of an email d: 1. Get the probability of a topic given the sender 2. Get the probability of the receiver given the sender and the topic 3. Get the probability of a word given the topic • a: sender/author, z: topic, S: social network (Exponential Random Graph Model / p* model), D: document/email, r: receivers, w: content words, N: word set, T: topic. Boxes represent iteration. [Figure: plate diagram with labels S, A, z, w, T, r, N, D, Tm; a legend marks observations.]
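One way to see how the three probabilities fit together is to marginalise over topics: score each candidate receiver r for a sender a by summing P(z | a) · P(r | a, z) over topics z. The sketch below does only that. It is an illustration of chaining the quantities named on the slide, not the inference procedure of the cited paper; it ignores time, and the probability tables and names are made-up toy values.

```python
def receiver_scores(p_topic_given_sender, p_receiver_given_sender_topic):
    """Return P(r | a) = sum over z of P(z | a) * P(r | a, z) for one fixed sender a."""
    scores = {}
    for topic, p_z in p_topic_given_sender.items():
        for receiver, p_r in p_receiver_given_sender_topic[topic].items():
            scores[receiver] = scores.get(receiver, 0.0) + p_z * p_r
    return scores


# Hypothetical tables for a single sender.
P_TOPIC = {"schedule": 0.75, "meeting": 0.25}
P_RECEIVER = {
    "schedule": {"alice": 0.5, "bob": 0.5},
    "meeting": {"alice": 0.0, "bob": 1.0},
}
```

With these toy tables, `receiver_scores(P_TOPIC, P_RECEIVER)` gives 0.375 for `alice` and 0.625 for `bob`, so `bob` is the more likely receiver. The same pattern with P(w | z) in place of P(r | a, z) would give the sender's expected word distribution.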
Corporate Topic Trend Analysis Example: Yearly repeating events. Topic 45, which concerns a scheduling issue, peaks from June to September. Topic 19 concerns a meeting issue. The trend repeats from year to year.
Topic Detection and Key People Detection of “California Power” Match Their Real-Life Roles (a) Event “California Energy Crisis” occurred at exactly this time period. Key people are active in this event except Vince_Kaminski …
Social Network of Enron Managers • It is difficult to find social networks when the analysis is based on all communications.
Information Flow in Enron – California Market • Actor 151 (Rosalee Fleming — the Enron CEO Ken L.’s assistant) is the key information spreader of this issue.
Information Flow in Enron – Market Opportunities • Rosalee Fleming also played an important role at “Market Opportunities.” She received info from Actor 119 (Mike Carson) and Actor 23 (James Steffes – VP of Gov. Affairs of Enron.) • Actor 68 (Rod Hayslett -- CFO) is also a major information spreader.
Information Flow in Enron – North American Products • Two disjoint communities can be observed. Actor 21 (Keith Holst) and Actor 142 (Dan Hyvl) are the main bridges of the two communities.
This kind of analysis is wonderful, but... • We cannot wait until our company has a scandal and goes bankrupt. • What kinds of valuable applications can come out of network analysis?
Social Network -- A key differentiator for corporate performance • The informal social network within a formal organization is a major factor affecting a company’s performance: • Krackhardt (CMU, 2005) showed that companies with strong informal networks perform five or six times better than those with weak networks. • Brydon (VisiblePath, 2006) reported performance gains for companies utilizing social networks: • 16x in sales • 4x in marketing • 10x in hiring
We hope social network and expertise mining can dramatically increase our colleagues’ knowledge and collaboration
Social Networks -- Beyond the organizational chart • Organization charts are not the best indicator of how work gets done • Senior people are not always central; peripheral people can represent untapped knowledge • Making the network visible makes it actionable and becomes the basis for a collaboration action plan Source: Cross, R., Parker, A., Prusak, L. & Borgatti, S.P. 2001. Knowing What We Know: Supporting Knowledge Creation and Sharing in Social Networks. Organizational Dynamics 30(2): 100-120. Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM
Group and Roles • Central people: Sam. Could be a bottleneck or could be holding the group together. • Peripheral people: Earl. Goes to others, but no one goes to him for information. At risk of leaving. Potentially unrealized expertise. • Sub-groups: group split by function (Marketing, Finance, Manufacturing). Very little information shared across groups. [Figure: network of Andy, Frank, Indojit, Carl, Karen, Darren, Bob, Sam, Ming, Neo, Leo, Earl, Gerry, Harry, Jeff.] This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research
Some Roles are especially critical • What happens if Sam leaves the group through layoffs, job reassignment, attrition, merger, or retirement? [Figure: the same network across Marketing, Finance and Manufacturing without Sam: Andy, Frank, Indojit, Carl, Karen, Darren, Bob, Ming, Neo, Leo, Earl, Gerry, Harry, Jeff.] This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research
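The question on this slide can be explored by removing a person from the graph and counting how many disconnected groups remain. The sketch below uses a few names from the slide, but the edges are hypothetical (the real diagram is not preserved), and the helper names are illustrative.

```python
from collections import deque


def build_adjacency(edge_list):
    """Undirected adjacency sets from a list of (person, person) pairs."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def components(adj, removed=frozenset()):
    """Connected components of the graph after dropping the people in `removed`."""
    seen, comps = set(), []
    for start in adj:
        if start in seen or start in removed:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen and v not in removed:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps


# Hypothetical ties: three functional sub-groups that only connect through Sam.
TOY_TIES = [
    ("Andy", "Frank"), ("Frank", "Carl"),
    ("Karen", "Darren"), ("Darren", "Bob"),
    ("Ming", "Neo"), ("Neo", "Leo"),
    ("Sam", "Andy"), ("Sam", "Karen"), ("Sam", "Ming"),
]
```

In this toy network, `components(build_adjacency(TOY_TIES))` returns a single group, while passing `removed={"Sam"}` splits it into three isolated sub-groups, which is the bottleneck risk the slide describes.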
Relationships are multi-dimensional and (traditionally) uncovered through network questions • Communication: How often do you communicate with this person? • Awareness: I am aware of this person’s knowledge and skills. • Trust: I believe there is a high personal cost in seeking advice or support from this person. • Innovation: How often do you turn to this person for new ideas? • Valued Expertise: How likely are you to turn to this person for specialized expertise? • Access: I believe this person will respond to my request in a reasonable and timely manner. • Advice: How often do you seek advice from this person before making an important decision? • Learning: How likely are you to rely on this person for advice on new methods and processes? • Energy: I generally feel energized when I interact with this person. [Figure labels: Awareness, Emotional, Actions.] Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM
GBS Practitioner with a task in a project / delivery environment • Forces: • Time constrained • Delivery activity focus • What gets measured gets done • Expedience • Perceived value (return on time investment) • High reliance on: • Personal networks: 50% ~ 75% (Gartner Report, 2006) • Hard-drive materials • What has worked for them previously (personal experience) • Personal Network (preferred / primary mode?): • fast turnaround of request • specific response • small number of relevant items returned • recommendation of quality • ability to quickly understand the supplied resource and determine relevant parts • additional context / value-add info not available in electronic materials • Existing Resources Provided (PSN, Methods, Education, Other w3 content, Knowledge View, Communities, Project Repositories, Collaboration, Project Tools): standalone, disparate, poor integration, large number of sources, steep learning curve (identify, understand & synthesise into specific work context), difficult to locate, choose & use. • This leads to the Personal Network being the preferred source for information and collaboration: • Under-utilisation of electronic products and services. • Content has lower performance impact / not realising full potential benefits. • Widely inconsistent working practices. • Who knows what? How to reach them? Who plays what hidden roles? [Figure: practitioner connected to the existing resources through W3 stubs / clients.]
Mining Expertise, Interests and Social Network • People can be “known” by: • public resources: publications, personal webpages, blogs, presentations, wiki • organizational resources: patent applications, bluepages • personal resources: emails, instant messaging, meetings, phone calls, face-to-face interactions • A person’s expertise can also be inferred from her friends’ recommendations or expertise. [Figure: resources arranged on an axis from public to private, annotated “timely & abundant resources for expertise modeling”.]
SmallBlue Clients (Distributed Automatic Social Sensors): my personal network (Ego net) inferred from my Notes emails in server/local/archive and SameTime chats; inference of my understanding of my friends’ expertise. • SmallBlue Inference Engines and Servers: other IBMers’ EgoNets; other IBMers’ expertise inferences. I cannot see their communications, EgoNets or expertise inferences. • External Data: Bluepages, BlueGroups, CommunityMap, BlogCentral, IBM Forum, KnowledgeView, Social Bookmark. • SmallBlue Find (user searches for experts or a person): corporate-wide ranked experts; ranked experts in my extended personal network, in a business unit and/or in a country. Only public information is shown. • SmallBlue Reach (how to reach a person): my social paths to her (which friends can introduce her, which friends work with her, ...; trust, awareness, collaboration); her public postings, profiles, and communities, to judge whether she is the right person. • SmallBlue Connect (social network info): social network analysis (SNA) of Top-K experts or of a list of people: who are the key persons in this network? who are the major hubs? who are the major bridges? Also SNA of a formal group, a bluegroup or a community. • SmallBlue Ego: my friends’ social values to me; evolution of my Ego net. • SmallBlue Expand: who I may want to know; which communities I may want to join; which documents I may want to look at. • Visibility labels in the diagram: Public; Public & Personalized; Private & Personalized.
Major Use of SmallBlue Find • Find the experts for any search term. (Right now, there are zillions of possible terms.) • Rank them based on collaborative expert recommendation. • Can show experts based on: • the whole corporation • business unit • country • my personal proximity
Collaborative Expert Recommendation • Combine everyone’s knowledge of the expertise of our colleagues. • The more colleagues who recommend a person, the higher the score. • The more recommendations from my trusted colleagues, the higher the score. • The higher the recommendation scores from colleagues, the higher the overall score. • By combining all IBMers’ knowledge, we can build an advanced expert-finding search engine. • Using the expert search engine, we can enhance all IBMers’ knowledge and social connections.
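The three scoring rules on this slide are all satisfied by a trust-weighted sum of recommendation strengths, so a minimal sketch could look like the following. This is only one simple scoring rule that fits the stated properties, not SmallBlue's actual ranking algorithm; the trust weights, strengths, names and default weight are hypothetical.

```python
def expert_scores(recommendations, trust, default_trust=1.0):
    """Trust-weighted sum of recommendation strengths per expert.

    recommendations: iterable of (recommender, expert, strength) tuples.
    trust: how much the searching user trusts each recommender
           (hypothetical weights; unknown recommenders get default_trust).
    """
    scores = {}
    for recommender, expert, strength in recommendations:
        weight = trust.get(recommender, default_trust)
        scores[expert] = scores.get(expert, 0.0) + weight * strength
    return scores


def rank_experts(scores):
    """Experts sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical inferred recommendations and one trusted colleague.
TOY_RECS = [("ann", "xu", 0.5), ("bo", "xu", 0.5), ("cy", "yi", 0.5)]
TOY_TRUST = {"ann": 2.0}
```

Here `xu` scores 2.0 × 0.5 + 1.0 × 0.5 = 1.5 and `yi` scores 0.5, so `xu` ranks first: more recommenders, a trusted recommender, and stronger recommendations each push the score up.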
SmallBlue Reach Paths help users reach another person • SmallBlue Reach Paths show the shortest paths for me to reach a person up to 6 degrees away. • SmallBlue Reach Paths can be initiated from any one of three SmallBlue applications. • Can be used for: • Access: knowing who can help introduce me to this person. • Trust: knowing who in my social networks knows this person. • Getting familiar: knowing what kinds of people are in contact with this person. • Initiating communication: knowing whom we know in common.
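A shortest path with a cap on the number of degrees can be found with a breadth-first search, sketched below. This is the textbook technique, offered as an assumption about how such a feature could work rather than a description of SmallBlue's implementation; the network and names are hypothetical.

```python
from collections import deque


def reach_path(adj, source, target, max_degrees=6):
    """Return one shortest path from source to target, or None if it needs more than max_degrees hops."""
    if source == target:
        return [source]
    parent = {source: None}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_degrees:
            continue  # do not expand beyond the degree limit
        for neighbor in adj.get(node, ()):
            if neighbor not in parent:
                parent[neighbor] = node
                if neighbor == target:
                    path = [neighbor]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append((neighbor, dist + 1))
    return None


def chain(names):
    """Hypothetical helper: undirected adjacency for people linked in a line."""
    adj = {n: set() for n in names}
    for u, v in zip(names, names[1:]):
        adj[u].add(v)
        adj[v].add(u)
    return adj


TOY_NET = chain(["me", "a", "b", "c", "d", "e", "f", "g"])
```

In this toy chain, `reach_path(TOY_NET, "me", "f")` returns the six-hop path through `a` to `e`, while `g` is seven hops away and yields `None`. The intermediate people on the returned path are the candidates who could make an introduction.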
SmallBlue Ego • How healthy is my personal social capital? • What is the social value of Alice to me? • What are the changes and trends in the evolution of my social capital? • For instance, I have to talk to Alice soon: she is valuable to me in terms of social connections, and she is drifting out of my Ego net circle.
SmallBlue Connect • Enterprise Social Network Analysis Tool • Showing Social Networks of people based on: • expertise key words • formal hierarchy • Any list of emails • Utilizing Social Network Analysis to show: • who are the important hubs among experts • who are the important bridges linking groups
Privacy Consideration – Bottom Line • Employees’ communications (e.g., time, from, to, cc, subject, content of emails, SameTime, etc.) are NOT searchable or retrievable by anyone. • Employees’ knowledge of other employees is INFERRED. Only the aggregated inferred knowledge is searchable. It is NOT possible to guess which part of the aggregated inferred knowledge was contributed by whom. • In the social network analysis graphs, relationships between people are modeled by their multimodal generic relationships. There is NO clue to the content of their communication. • Only the employee’s outgoing emails and instant messages, and only the portion authored by the employee, are utilized. • Anyone can suggest keywords that should not be searchable, specify search terms that should not find them, or ask to be removed from the system.
Coincidence?? • SmallBlue Find and Connect Trial Release (9/20) • SmallBlue Ego Trial Release (8/21) • SmallBlue on TAP (11/07)