
Efficient Identification of Starters and Followers in Social Media

Michael Mathioudakis, Nick Koudas. Goals: formalize a definition of “starters” and “followers” in blogs, and use random sampling approaches to achieve significant efficiency while identifying “starters” and “followers”.


Presentation Transcript


  1. Efficient Identification of Starters and Followers in Social Media Michael Mathioudakis, Nick Koudas

  2. Goals • Formalize a definition of “starters” and “followers” in blogs • Random sampling approaches to achieve significant efficiency while identifying “starters” and “followers”

  3. Starters vs Followers • Starter: a blogger who generates posts that others link to over a period of time • Follower: a blogger who links to other blog posts over a period of time

  4. Notation

  5. Calculating Starters and Followers • In-degree of node • Out-degree of node • Degree of node
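
The three quantities on this slide can be computed in a single pass over the link list. The following is a minimal Python sketch, assuming each link is a (source blogger, target blogger) pair and that "degree" means in-degree plus out-degree. The function names and the toy data are illustrative and do not come from the slides.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for a list of (source, target) links."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1  # src links to someone else's post
        in_deg[dst] += 1   # dst receives a link
    return in_deg, out_deg


def total_degree(node, in_deg, out_deg):
    """Assumed definition: total degree = in-degree + out-degree."""
    return in_deg[node] + out_deg[node]


# Toy data (hypothetical): b and c link to a, c links to b, a links to d.
edges = [("b", "a"), ("c", "a"), ("c", "b"), ("a", "d")]
in_deg, out_deg = degrees(edges)
```

`Counter` returns 0 for bloggers that never appear on one side of a link, so nodes with no in-links or no out-links need no special handling.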

  6. Brute Force • Query the database for all posts • Calculate the degree of every node and sum • Why not? • Retrieving all posts can be costly • Lots of overhead
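
The brute-force baseline can be sketched in a few lines of Python. This assumes the full link list has already been retrieved and that a blogger's starter score is in-degree minus out-degree. That score is an assumption made for illustration, since the slides do not state how the degrees are combined. The name `brute_force_starters` is hypothetical.

```python
import heapq
from collections import defaultdict


def brute_force_starters(edges, k):
    """Rank every blogger after touching every edge; return the k highest scores."""
    score = defaultdict(int)
    for src, dst in edges:
        score[dst] += 1  # an in-link raises the assumed starter score
        score[src] -= 1  # an out-link lowers it
    return heapq.nlargest(k, score.items(), key=lambda item: item[1])


# Toy data (hypothetical): a is linked to by everyone, d only links out.
toy_edges = [("b", "a"), ("c", "a"), ("d", "a"), ("c", "b"), ("d", "b"), ("d", "c")]
top_two = brute_force_starters(toy_edges, 2)
```

The loop touches every edge exactly once, which is the cost the sampling approaches on the later slides try to avoid.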

  7. Deterministic Early-Stopping Conditions • = enumerated subset of • is the set of k starters • If , then exists a pair ,with and such that • Use linear inequalities to determine feasibility

  8. Linear Inequalities

  9. Linear Inequality Issues • Result? • Large domains • Easily feasible • Traverse almost all edges before stopping • Solution? • Relax requirements, use probabilistic guarantees

  10. Probabilistic Early-Stopping Conditions • Trade accuracy for efficiency • Still aim to return starters • Assume edges chosen uniformly at random

  11. Probabilities • for all pairs of nodes • Pr < 10% return the result set • How do you determine the bound for the probability?

  12. Hoeffding’s Inequality • Provides a lower bound • Lower bound = • Uniform sample should capture any skew • Starters appear after few sampled edges
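
For reference, the textbook two-sided form of Hoeffding's inequality can be evaluated directly. The Python sketch below is not the slide's own bound, whose expression is not recoverable here. It shows only the standard statement: for n independent samples confined to an interval of width R, the sample mean deviates from its expectation by at least t with probability at most 2·exp(−2nt²/R²). The function names are illustrative.

```python
import math


def hoeffding_tail(n, t, value_range=1.0):
    """Upper bound on P(|sample mean - true mean| >= t) for n bounded i.i.d. samples."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t / value_range ** 2))


def samples_needed(t, delta, value_range=1.0):
    """Smallest n for which the two-sided Hoeffding bound drops to delta or below."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * t * t))


# Example: samples in [0, 1], tolerance 0.1, failure probability at most 10%.
n_required = samples_needed(0.1, 0.1)
```

Taking one minus the tail bound gives a lower bound on the probability that an estimate is within t of its true value, which is the kind of guarantee a probabilistic stopping condition needs.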

  13. Random Sampling Techniques • Out-degrees of all nodes are known • Maximum out-degree of a node is known • Sampling nodes uniformly at random • Random walk approach

  14. Out-Degrees Known

  15. Out-Degrees Known Issues • Knowing out-degree = strong assumption • Requirements • Retrieve all posts in query • Extract all links • Solution? • Weaker assumption on distribution of edges
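
When every out-degree is known, a uniformly random edge can be drawn in two steps: pick a source node with probability proportional to its out-degree, then pick one of its out-links uniformly. The Python sketch below illustrates that idea. It is not necessarily the exact procedure from the presentation, and `sample_edge` and the toy graph are hypothetical.

```python
import random


def sample_edge(out_links, rng=random):
    """Draw one edge uniformly at random from a dict mapping node -> list of targets."""
    sources = [v for v, targets in out_links.items() if targets]
    weights = [len(out_links[v]) for v in sources]  # the known out-degrees
    src = rng.choices(sources, weights=weights, k=1)[0]
    return src, rng.choice(out_links[src])


# Toy graph (hypothetical): a has three out-links, b has one, c has none.
toy_out_links = {"a": ["b", "c", "d"], "b": ["a"], "c": []}
rng = random.Random(0)
edge_samples = [sample_edge(toy_out_links, rng) for _ in range(2000)]
```

Each edge (v, w) is drawn with probability out(v)/|E| · 1/out(v) = 1/|E|, so the draw is uniform over edges. Building the `weights` list is exactly the step that requires retrieving all posts and extracting all links.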

  16. Maximum Out-Degree Known

  17. Maximum Out-Degree Issues • Blog graphs typically heavy-tailed • Probability at one iteration = • Expected iterations =
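
If only the maximum out-degree is known, uniform edge sampling is still possible by rejection: pick a node uniformly, accept it with probability out-degree / maximum out-degree, then pick one of its out-links. The Python sketch below shows this standard technique under the assumption that the graph has at least one edge. The function name and toy graph are hypothetical, and the slide's own expressions for the per-iteration probability and expected iterations are not reproduced here.

```python
import random


def sample_edge_by_rejection(out_links, max_out_degree, rng=random):
    """Return (edge, iterations used); assumes the graph has at least one edge."""
    nodes = list(out_links)
    iterations = 0
    while True:
        iterations += 1
        v = rng.choice(nodes)  # uniform over nodes
        targets = out_links[v]
        # Accept v with probability out(v) / max_out_degree; reject otherwise.
        if rng.random() < len(targets) / max_out_degree:
            return (v, rng.choice(targets)), iterations


# Toy graph (hypothetical): one node holds most of the out-links.
skewed = {"a": ["b", "c", "d"], "b": ["a"], "c": [], "d": []}
rng = random.Random(1)
draws = [sample_edge_by_rejection(skewed, 3, rng) for _ in range(500)]
```

In this sketch an iteration accepts with probability equal to the mean out-degree divided by the maximum out-degree. On a heavy-tailed graph most nodes sit far below the maximum, so most iterations are rejected, which is the issue the slide raises.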

  18. Sampling Nodes Uniformly at Random

  19. Sampling Nodes Uniformly at Random Issues • Edges are not sampled uniformly at random • Only unbiased estimates of edges from one node to another • Can’t handle heavy-tailed distributions • Leads to poor accuracy

  20. Random Walk Approach • 2 step approach • Obtain a new graph from the input graph • Obtain a Markov chain

  21. Step 1 – Obtain New Graph • Create a new graph H(V, E) from input graph • Remove direction of edges • Add self-loops • Add edges between nodes returned in order

  22. Step 2 – Create Markov Chain • Markov Chain = MC(K, T) • K = the possible states (nodes) • T = possible transitions (edges)

  23. The Random Walk • At each step, the walk follows a transition to one of its states • (b): Edge of current node = no lookup cost • (c): Edge of new node = random access cost
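
The two steps can be sketched together in Python. This is a simplified illustration under stated assumptions, not the presentation's exact construction: edges lose their direction, every node gets one self-loop, consecutive nodes in the returned order are linked so the graph is connected, and the walk moves to a uniformly chosen neighbour at each step. All names are hypothetical, and every edge endpoint is assumed to appear in `nodes`.

```python
import random


def build_undirected(nodes, edges):
    """Build adjacency lists for H from nodes (in returned order) and directed edges."""
    adj = {v: {v} for v in nodes}       # self-loop on every node
    for src, dst in edges:              # remove direction of edges
        adj[src].add(dst)
        adj[dst].add(src)
    for u, v in zip(nodes, nodes[1:]):  # link nodes returned in order
        adj[u].add(v)
        adj[v].add(u)
    return {v: sorted(neighbours) for v, neighbours in adj.items()}


def random_walk(adj, start, steps, rng=random):
    """Simple random walk: each step moves to a uniformly chosen neighbour."""
    path = [start]
    for _ in range(steps):
        path.append(rng.choice(adj[path[-1]]))
    return path


# Toy input (hypothetical): four nodes and one directed link.
h = build_undirected(["a", "b", "c", "d"], [("a", "c")])
walk = random_walk(h, "a", 50, random.Random(0))
```

The self-loops keep the chain aperiodic and the ordering edges keep it connected, so the walk can reach every node. A simple walk like this one visits nodes in proportion to their degree in H, so a real implementation would need to correct for that bias.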

  24. Stopping the Random Walk • At each step, for each pair of nodes • Average the score over all pairs of nodes • Stop when confScore > threshold
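
The stopping rule can be written as a small Python check. The per-pair score is not given on the slide, so the sketch takes it as a caller-supplied function `pair_confidence` (a hypothetical name) that returns a value in [0, 1] for each pair of nodes. Only the averaging and the threshold test come from the slide.

```python
from itertools import combinations


def conf_score(nodes, pair_confidence):
    """Average the per-pair confidence over all unordered pairs of nodes."""
    pairs = list(combinations(nodes, 2))
    if not pairs:
        return 1.0  # fewer than two nodes: nothing left to order
    return sum(pair_confidence(u, v) for u, v in pairs) / len(pairs)


def should_stop(nodes, pair_confidence, threshold):
    """Stop the walk once the averaged confidence exceeds the threshold."""
    return conf_score(nodes, pair_confidence) > threshold


# Toy per-pair confidences (hypothetical values).
toy_conf = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 1.0}


def lookup(u, v):
    return toy_conf[(u, v)]
```

With these toy values the average is 0.9, so `should_stop(["a", "b", "c"], lookup, 0.85)` holds while a threshold of 0.95 keeps the walk going.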

  25. Results • Having the most in-links doesn’t necessarily mean being the best starter

  26. Results (continued)

  27. Real World Application • BlogScope • Project of the University of Toronto • Provides graph and search output of blog data • How does it work? • Crawler to gather blog data and filter spam • Stored in MySQL (1174.14 million posts) • Build statistics regularly • Provide correlation discovery, popularity curves, and hot keywords

  28. Related Work • Discovering Leaders from Community Actions (Amit Goyal, Francesco Bonchi, Laks V. S. Lakshmanan) • Users perform actions (bookmark a URL, rate a song, buy a gadget, etc.) • Friends see actions and may perform the same actions (influence) • Compute influence matrix with a sliding window working backwards • Pass over the action log only once • Uses frequent pattern discovery to determine leaders • Finds tribes where one user influences a group of people over a series of actions • Problem when there is a popular action where influence might not be a factor
