Link Mining in the Blogosphere Workshop on Community-based Web Service Computing and Mining

Link Mining in the Blogosphere Workshop on Community-based Web Service Computing and Mining. NCKU CSIE IKM Lab. Hung-Yu Kao 2008. 12. Outlines. Motivation Link mining Basis Random walking, Mutual Reinforcement Social metrics Some Related Work on blog links Our Work Link Extraction

Presentation Transcript


  1. Link Mining in the Blogosphere. Workshop on Community-based Web Service Computing and Mining. NCKU CSIE IKM Lab. Hung-Yu Kao, 2008.12

  2. Outline • Motivation • Link mining • Basics • Random walking, mutual reinforcement • Social metrics • Some related work on blog links • Our work • Link extraction • Blog ranking • Blog match finding

  3. Social relationship mining is hyped: Google Trends for “social network”, “information retrieval”, “data mining”, “semantic web” and “PageRank” (http://www.google.com/trends?q=pagerank%2C+social+network%2C+information+retrieval%2C+data+mining%2C+semantic+web&ctab=0&geo=all&date=all&sort=0)

  4. Web versions

  5. [Diagram: new interactions between users and information]

  6. Differences for researchers • Throng of pages • With complicated but rule-governed styles • Informational vs. emotional • Orz, ^^, 冏, 凸, … • Throng of links • Physical vs. virtual links • Simple vs. diverse / clustered • Throng of machine-understandable human knowledge • Collaborative tagging / filtering / bookmarking

  7. Ranking in Web 2.0 • Rank pages, rank people • Blog ranking • More interaction, more impact from capitalism • PageRank prediction • More knowledge repositories, more latent ontologies • Wikipedia, Del.icio.us • Information extraction / understanding becomes essential and realizable • Visual / semantic block extraction

  8. Link (relationship) Analysis

  9. Link analysis -- Motivation • For one query, which pages are the answer set? • Results of search engines • Rank manually • Rank by similarity • Rank by hit rate (needs a usage log) • Rank by link analysis (Google) • Relevant vs. authoritative • Intra-page vs. inter-page • Users need authoritative pages among relevant pages.

  10. Link analysis -- Motivation • Human knowledge is real, convincing and trustworthy information • E.g., classification by humans in Yahoo • Hyperlinks contain information about human judgment • Social sciences • Nodes: persons, organizations • Edges: social interaction • Easy job? Counting in-links for popularity

  11. HITS - Kleinberg’s Algorithm • HITS – Hypertext Induced Topic Selection • For each vertex v ∈ V in a subgraph of interest: a(v) - the authority of v; h(v) - the hubness of v • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites • Hubness shows the importance of a site as a pointer to others. A good hub is a site that links to many authoritative sites

  12. Authority and Hubness [Figure: nodes 2, 3 and 4 link to node 1; node 1 links to nodes 5, 6 and 7] h(1) = a(5) + a(6) + a(7); a(1) = h(2) + h(3) + h(4)
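The two update rules on slide 12 can be iterated until the scores settle. The following is a minimal sketch, not the presenter's code: the edge list is read off the two formulas on the slide, while the L2 normalization and the fixed iteration count are assumptions added for illustration.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit L2 norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Iterate a(v) = sum of h(u) over u -> v and h(v) = sum of a(w) over v -> w."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: collect hub scores from in-linking nodes.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: collect the fresh authority scores of linked-to nodes.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Graph implied by slide 12: 2, 3, 4 -> 1 and 1 -> 5, 6, 7.
edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]
authority, hubness = hits(edges)
```

In this small graph node 1 collects hub weight from three in-links, so it ends up with the highest authority, while nodes 5 to 7 have no out-links and therefore no hubness.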

  13. HITS Example Results Authority Hubness 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Authority and hubness weights

  14. PageRank • Introduced by Page et al. (1998, WWW) • The weight is assigned by the rank of parents • Difference from HITS • HITS takes hubness and authority weights • The rank of a page is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree • Query independent
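The statement that a page's rank is proportional to its parents' rank and inversely proportional to their outdegree can be sketched as a power iteration. This is an illustrative sketch only: the damping factor of 0.85, the even redistribution of rank from pages without out-links, and the four-page graph are assumptions, not taken from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration; every link target must also appear as a key."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for parent, children in out_links.items():
            if children:
                # A parent splits its rank evenly over its out-links.
                share = damping * rank[parent] / len(children)
                for child in children:
                    new_rank[child] += share
            else:
                # Assumed handling of dangling pages: spread rank evenly.
                share = damping * rank[parent] / n
                for p in nodes:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: D has no in-links, C is linked from A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

Unlike HITS, nothing here depends on a query: the ranks are computed once for the whole graph.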

  15. PageRank example • Confirm the result • Number of in-links from highly ranked pages • Hard to explain for 5 & 2, 6 & 7 • How do you make your homepage highly ranked? • How can this be detected?

  16. Limits of Link Analysis • “When the multitude like something (spam), it must be examined; when the multitude hate something (new page), it must be examined.” (Analects, Wei Ling Gong) • Stability • Adding even a small number of nodes/edges to the graph has a significant impact • Topic drift – similar to the TKC (tightly knit community) effect • A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page • Content evolution • Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks • Incremental link analysis

  17. Link analysis in a social network • Node: entity • Edge: relationship • We want to know in this social network • Which (group of) nodes / edges are influential • Which (group of) nodes / edges are important • Which node is an outlier • Information flow / tracking

  18. Centrality • Degree centrality • In-degree, out-degree • Localization, isolation • Closeness centrality • Geodesic distance between the entity and all other entities • Betweenness centrality • Geodesic path • Eigenvector centrality • Central entity receiving many communications from other well-connected entities (central entities) • Power centrality
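Two of the measures listed on slide 18 are simple enough to sketch directly. The code below assumes an undirected, unweighted network stored as an adjacency dictionary; the normalizations (dividing degree by n - 1, and using reachable nodes over summed geodesic distance for closeness) are common conventions chosen for illustration, and the star network is a hypothetical example.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


def closeness_centrality(adj):
    """Reachable nodes divided by the sum of geodesic (shortest-path) distances."""
    result = {}
    for source in adj:
        # Breadth-first search gives geodesic distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical star network: one central entity and three peripheral ones.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
degree = degree_centrality(star)
closeness = closeness_centrality(star)
```

The central node scores highest on both measures; each peripheral node reaches the others only through it, which is what betweenness centrality would capture.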

  19. Network centralization • Summary of the centralization of a network • E.g., [Figure: one network with centralization = 1 and one with centralization = 0]
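The slide does not say which centralization index it uses. One common choice, assumed here, is Freeman's degree centralization: the sum of differences between the largest degree and every node's degree, divided by the largest value that sum can take, (n - 1)(n - 2). The star and ring networks below are hypothetical stand-ins for the two pictured networks.

```python
def degree_centralization(adj):
    """Freeman degree centralization of an undirected network (assumed index)."""
    n = len(adj)
    if n < 3:
        raise ValueError("need at least three nodes")
    degrees = [len(neighbors) for neighbors in adj.values()]
    max_degree = max(degrees)
    # The denominator is the value the numerator takes in a star network.
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))


star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}

star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```

Under this index the star evaluates to 1 and the ring to 0, the two extremes shown on the slide.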

  20. 9/11 Hijackers Graph. Reference: “The Text Mining Handbook”, Ronen Feldman and James Sanger, p. 257.

  21. Some related work in blog ranking (with link information) • Technorati (technorati.com/) • Real-time blog search engine which watches over 100 million blogs • Multiple lists • Number of fans • Blog authority: counts the number of blogs linking to a blog • BlogLook (look.urs.tw/) • 60,000+ bloggers • Ranking from many features • #Inlink / #post in Google (general SE, blogger SE) and Yahoo • Scores in delicious, furl, Hemidemi, Myshare • Index factor, impact score, Page score, Technorati score, Bloginference score

  22. Some related work in blog ranking • EigenRumor (Fujimura, 2005) is based on eigenvector calculation of the adjacency matrix of links • BlogRank (Apostolos, 2006) is a generalized form of PageRank which uses similarity features to make the link graph denser • Identifying influential bloggers (Nitin, WSDM 2008)

  23. Influential Properties (Nitin, WSDM 2008) • Recognition: Citations (incoming links) • The more influential the referring posts are, the more influential the referred post becomes. • Activity Generation: Volume of discussion (comments) • Large number of comments indicates that the blog post affects many such that they care to write comments, hence influential. • Novelty: Referring to (outgoing links) • Novel ideas exert more influence. Large number of outlinks suggests that the blog post refers to several other blog posts, hence less novel. • Eloquence: “goodness” of a blog post (length) • Short spam message • Copy message

  24. EigenRumor • Scores each blog entry by weighting the hub and authority scores of the bloggers, based on eigenvector calculations • Similar to HITS • Focuses on the behaviors of bloggers on blog posts • The adjacency matrix is constructed from agent-to-object links, not page-to-page (or object-to-object) links • Agent: • Represents an aspect of a human being, such as a blogger • Object: • Represents any object, such as a blog entry

  25. EigenRumor • Two matrices • Provisioning matrix • P = [p_ij] (i = 1…m, j = 1…n) • p_ij denotes a provisioning link • In this notation, p_ij = 1 if agent i provides object j, and zero otherwise • Evaluation matrix • E = [e_ij] (i = 1…m, j = 1…n) • e_ij denotes an evaluation link • The evaluation link is assigned a weight e_ij based on the strength of the support that agent i gives to object j • e_ij is assumed to lie in the range [0, 1]; higher values indicate stronger support

  26. Algorithm: Scores (1) • The EigenRumor algorithm scores agents in two aspects: • Information evaluation (hub score) • Information provisioning (authority score) • To implement this idea, two scores for each agent and one score for each object are introduced in the algorithm • Agent property • Authority score • Hub score • Object property • Reputation score
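The three scores can be tied together with a HITS-like mutual reinforcement over the provisioning matrix P and evaluation matrix E. The sketch below uses the update form usually given for EigenRumor, r = αPᵀa + (1 − α)Eᵀh, a = Pr, h = Er; the transcript does not preserve the main procedure, so the mixing weight alpha, the L2 normalization, the iteration count and the toy matrices are all assumptions made for illustration.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def eigenrumor(P, E, alpha=0.5, iterations=50):
    """Return (authority, hub, reputation); alpha is an assumed mixing weight."""
    n_objects = len(P[0])
    Pt, Et = transpose(P), transpose(E)
    r = l2_normalize([1.0] * n_objects)
    for _ in range(iterations):
        a = l2_normalize(mat_vec(P, r))  # authority: reputation of provided objects
        h = l2_normalize(mat_vec(E, r))  # hub: reputation of evaluated objects
        mixed = [alpha * x + (1.0 - alpha) * y
                 for x, y in zip(mat_vec(Pt, a), mat_vec(Et, h))]
        r = l2_normalize(mixed)          # reputation of each object
    return l2_normalize(mat_vec(P, r)), l2_normalize(mat_vec(E, r)), r


# Hypothetical community: agent i provides object i; agents 1 and 2 support
# object 0, and agent 0 gives weaker support to object 1.
P = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
E = [[0.0, 0.5, 0.0], [1.0, 0.0, 0.0], [0.8, 0.0, 0.0]]
authority, hub, reputation = eigenrumor(P, E)
```

Object 0 receives the strongest support, so it gets the highest reputation and its provider the highest authority; agent 1, who supports that object most strongly, gets the highest hub score.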

  27. Algorithm: Main procedure

  28. Algorithm: Mapping to blog community [Figure: agents and objects]

  29. Algorithm: Comparison

  30. Work I: Link / Block extraction in CSS-rich pages

  31. Motivation • An informative block (IB), presented as a block on a Web page, is meaningful data for extractors in page analysis • Blogs are hot! There are many investigations of them • Ex: social network and trend analysis • Blog pages differ from general pages in IB scoring and ranking • The DOM tree is no longer a flat tree

  32. Motivation • Related work on block extraction • MDR [1], IKM [3], IEPAD [2] • They have some limitations on CSS Web pages • More <DIV> tags for page layout, but fewer <TABLE> tags • Tree ambiguity • CSS is used to design the Web page style • Data presentation does not correspond to the DOM tree structure • Cannot extract a single presentation block • Our objective • Extract all blocks on a CSS Web page by CSS properties • Visual attributes and attribute entropy facilitate block extraction

  33. The properties of CSS Web pages • CSS selector • HTML tag name, CLASS attribute and ID attribute • A block has high information content if it contains many varied selectors • CSS definition • A CSS definition comprises a selector, a property and a value • CSS definitions indicate some visual information for tree modification

  34. Block tag analysis • Content page

  35. CSS tag analysis • Content page

  36. The properties of CSS Web pages • Layer containment • Structural containment • DOM tree structure • Visual containment • Block presentation structure • Structural containment is not equal to visual containment on a CSS Web page [Figure: blocks A, B and C]

  37. System architecture • Three processes for block extraction • Tree Generation (TG), Entropy Evaluation Model (EEM) and Block Identification (BI)

  38. System architecture • Tree Generation • The DOM Parser transforms a Web page into a DOM tree • The Tree Constructor uses the Tag Filter and the Visual Information Module to modify the DOM tree according to node attributes • Entropy Evaluation Model • Uses Partial Path Entropy Evaluation (PPEE) to calculate attribute entropy • The aggregation function automatically provides thresholds to BI for block type notation
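The slides do not define PPEE, so the following is not that method. It only illustrates the underlying idea of attribute entropy with plain Shannon entropy over the CSS selectors found inside a block: many varied selectors give a high value, a single repeated selector gives zero. The function name and the selector lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical selectors collected from two blocks of a blog page.
sidebar_selectors = ["li.link", "li.link", "li.link", "li.link"]
post_selectors = ["h2.title", "div.entry", "span.date", "a.permalink"]

sidebar_entropy = shannon_entropy(sidebar_selectors)
post_entropy = shannon_entropy(post_selectors)
```

A threshold on such a score, like the ones the aggregation function is said to supply, could then separate varied content blocks from repetitive navigation blocks.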

  39. [Figure: CSS Tag Entropy, Tree Structure, Informative Block]

  40. The performance of CB Extraction • CB Evaluation

  41. Visual Tree • Visual Tree

  42. Work II: Blog Ranking

  43. Motivation • Among this large number of blogs, people need to know which blogs are more informative • Google uses PageRank to rank web pages and provides a successful service for searching them • Blogs are not only a set of web pages; they have many particular characteristics and interactive behaviors • A ranking method based on the characteristics of blogs is needed

  44. Informative Blogs • An informative blog post is normally commented on by many bloggers • Users may cite informative posts or send a trackback while writing posts on related topics • A blog with informative posts is an informative blog • We will use these blog features and relationships to design a modified PageRank algorithm for blog ranking

  45. Idea • The interactive behaviors or links between blogs are good indicators for quantifying the quality of blogs • Comment • Trackback • Blogrolls • Hyperlinks in the content

  46. Linking relationship [Figure: proposed vs. original]

  47. Blog Network • Network structure • Each node represents a blog • Each edge between two nodes represents a relationship between the two blogs • There are three general types of edges in the blog network • Support Edge (support relationships): comments and trackbacks between blogs • Similarity Edge (similarity relationships): common links in contents or common users between blogs (a virtual edge with lower weight) • Hyperlink Edge: links in contents between a blog and a web page

  48. Local Blog Rank Algorithm • Based on the original PageRank, we adjust the probability that a blog surfer follows a link from blog A to another blog B • PageRank: [formula not preserved in the transcript] • We combine several blog relationships ( ) with different weights; the probability ( ) is given by a new formula • Besides the support relationships that construct the Blog Network, similarity relationships are used in this formula, because similarity between blogs may give the surfer more reason to stay on the blog

  49. Local Blog Rank Algorithm • The probability of moving from Blog A to Blog B is decided by the following three factors • Blog Relationship Type (ex: similarity, comment, trackback, …) • Different blog relationships are given different weights • Shows the relationships with other blogs • Blog Relationship Number (ex: number of comments) • The number of the corresponding relationship • Blog Quality Score (BQ) • Normalized blog features • Shows the general activity of a blog • It is assumed that if users know the quality of the blog features for each blog, the probability of moving to a blog with higher activity and attention is higher than for other blogs

  50. Local Blog Rank Algorithm • The probability formula • X are the blogs to which Blog A links • The Relationship Score combines all kinds of relationships between blog A and blog K; it is calculated from the weight and number of each corresponding relationship type, multiplied by the blog quality score of K
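The formula itself is not preserved in the transcript, so the sketch below is one reading of the description: the relationship score of blog K is the weighted count of its relationships with blog A multiplied by K's quality score, and the transition probability is that score normalized over all blogs X that A links to. The normalization step, the function names, the weights and the sample data are all assumptions.

```python
def relationship_score(counts, weights, quality):
    """Weighted relationship counts between A and K, times K's quality score."""
    return sum(weights[kind] * number for kind, number in counts.items()) * quality


def transition_probabilities(relations, weights, blog_quality):
    """Probability of moving from blog A to each blog it links to."""
    scores = {
        blog: relationship_score(counts, weights, blog_quality[blog])
        for blog, counts in relations.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {blog: 0.0 for blog in scores}
    return {blog: score / total for blog, score in scores.items()}


# Hypothetical weights: support edges count more than similarity edges.
weights = {"comment": 1.0, "trackback": 2.0, "similarity": 0.5}
# Relationships from blog A to blogs B and C, with assumed quality scores.
relations_from_a = {
    "B": {"comment": 3, "trackback": 1},
    "C": {"similarity": 2},
}
blog_quality = {"B": 0.8, "C": 0.5}

probabilities = transition_probabilities(relations_from_a, weights, blog_quality)
```

These probabilities would replace the uniform 1/outdegree step of the original PageRank, so a surfer on blog A is more likely to move to the blog with stronger support relationships and higher quality.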
