
Sampling online communities: using triplets as basis for a (semi-) automated hyperlink web crawler. Yoann VENY Université Libre de Bruxelles (ULB) - GERME yoann.veny@ulb.ac.be This research is funded by the FRS-FNRS


Presentation Transcript


  1. Sampling online communities: using triplets as basis for a (semi-)automated hyperlink web crawler.
  Yoann VENY Université Libre de Bruxelles (ULB) - GERME yoann.veny@ulb.ac.be
  This research is funded by the FRS-FNRS.
  Paper presented at the 15th General Online Research Conference, 4-6 March, Mannheim.

  2. Online communities – a theoretical definition
  • What is an online community?
  • “social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationships in cyberspace” (Rheingold 2000)
  • long-term involvement (Jones 2006)
  • sense of community (Blanchard 2008)
  • temporal perspective (Lin et al. 2006)
  • Probably important… but the first operation should be to take the ‘hyperlink environment’ into account → a graph analysis / SNA issue

  3. Online communities – a graphical definition (1)
  • Community = more ties among members than with non-members
  • Three general classes of ‘community’ in graph partitioning algorithms (Fortunato 2010):
  • a local definition: focus on sub-graphs (i.e. cliques, n-cliques (Luce, 1950), k-plexes (Seidman & Foster, 1978), lambda sets (Borgatti et al., 1990), …)
  • a global definition: focus on the graph as a whole (is the observed graph significantly different from a random graph, e.g. an Erdős-Rényi graph?)
  • vertex similarity: focus on actors (i.e. Euclidean distance & hierarchical clustering, max-flow/min-cut (Elias et al., 1956; Flake et al., 2000))
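The ‘local’ definition above can be illustrated with a toy example. The sketch below is a hypothetical brute-force clique finder, not part of the presented work; Python and all names are illustrative assumptions, and the exhaustive search is only practical for tiny graphs:

```python
from itertools import combinations

def maximal_cliques(adj):
    """Illustrative brute-force search for maximal cliques (sketch).

    adj -- dict mapping each vertex to the set of its neighbours
           (assumed symmetric, i.e. an undirected graph)
    """
    nodes = list(adj)
    cliques = []
    # Try candidate vertex sets from largest to smallest, so maximal
    # cliques are found before any of their sub-cliques.
    for r in range(len(nodes), 1, -1):
        for combo in combinations(nodes, r):
            if all(b in adj[a] for a, b in combinations(combo, 2)):
                s = set(combo)
                if not any(s < c for c in cliques):  # keep only maximal ones
                    cliques.append(s)
    return cliques
```

For example, in a graph where a, b, c form a triangle and d hangs off a, the function returns the triangle {a, b, c} and the dyad {a, d} as maximal cliques.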

  4. Online communities – a graphical definition (2)
  • Two main problems of graph partitioning in a hyperlink environment:
  • 1) network size and form (i.e. tree structure)
  • 2) edge direction
  • → better to discover communities with an efficient web crawler

  5. Web crawling – generalities
  • The general idea of a web crawling process:
  • we have a number of starting blogs (seeds)
  • all hyperlinks are retrieved from these seed blogs
  • for each new website discovered, decide whether this new site is accepted or refused
  • if the site is accepted, it becomes a seed and the process is reiterated on this site.
  Source: Jacomi & Ghitalla (2007)
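The crawling loop described on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation; `get_outlinks` and `accept` are hypothetical stand-ins for the page fetcher and the (manual or automatic) acceptance decision:

```python
from collections import deque

def crawl(seeds, get_outlinks, accept):
    """Generic seed-based hyperlink crawl (sketch).

    seeds        -- iterable of starting blog URLs
    get_outlinks -- callable returning the hyperlinks found on a page
    accept       -- decision rule: keep or refuse a newly discovered site
    """
    frontier = deque(seeds)
    network = {}                       # site -> set of outgoing links found
    seen = set(seeds)
    while frontier:
        site = frontier.popleft()
        links = get_outlinks(site)     # retrieve all hyperlinks from this seed
        network[site] = set(links)
        for target in links:
            if target not in seen and accept(target, network):
                seen.add(target)       # accepted sites become seeds themselves
                frontier.append(target)
    return network
```

With an `accept` rule that admits everything, this is the unmanageable unsupervised crawler of slide 12; the constrained crawlers of the following slides differ only in the `accept` rule.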

  6. Web crawling – constraint-based web crawler (1)
  • Two problems of a manual crawler:
  • number and quality of decisions
  • closure?
  • A solution: take advantage of local structural properties of a network → assume that a network is an outcome of the aggregation of local social processes
  • Examples in SNA:
  • general philosophy of ERG models (see e.g. Robins et al. 2007)
  • local clustering coefficient (see e.g. Watts & Strogatz, 1998)
  • → Constrain the crawler to identify local social structures (i.e. triangles, mutual dyads, transitive triads, …)

  7. Web crawling – constraint-based web crawler (2)
  • An example of a constrained web crawler based on the identification of triangles
  • Generalisation
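A triangle-based acceptance rule of the kind illustrated on this slide might look like the sketch below, assuming the crawler admits a candidate site only when at least two already-accepted sites that link to it are themselves connected (all names are illustrative, not from the original code):

```python
def closes_triangle(in_links, adj):
    """Constrained acceptance rule (sketch): admit a new candidate site only
    if it would close at least one triangle, i.e. two of the already-accepted
    sites linking to it are themselves connected to each other.

    in_links -- accepted sites that link to the candidate
    adj      -- dict mapping each accepted site to the set of sites it links
                to (edge direction between the two linkers is ignored here)
    """
    linkers = list(in_links)
    for i, a in enumerate(linkers):
        for b in linkers[i + 1:]:
            if b in adj.get(a, set()) or a in adj.get(b, set()):
                return True   # the a-b tie closes the a-b-candidate triangle
    return False
```

Plugged into the generic crawl loop as the `accept` decision, this constraint is what keeps the crawler inside densely knit regions instead of following every outgoing link.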

  8. Experimental results – method
  • Y is the n × n adjacency matrix of a binary network with elements y_ij ∈ {0, 1}
  • Undirected dyadic → # edges = Σ_{i<j} y_ij (unsupervised crawler)
  • Directed dyadic → # mutual dyads = Σ_{i<j} y_ij y_ji (mutuality crawler)
  • Undirected triadic → # triangles = Σ_{i<j<k} y_ij y_jk y_ik (triangle crawler)
  • Directed triadic → # transitive triplets = Σ_{i,j} y_ij L2_ij (triplet crawler)
  • where L2_ij is the number of “two-paths” connecting i and j or j and i.
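As an illustration of these four statistics, the sketch below computes them from a binary adjacency matrix with NumPy. It assumes the standard ERGM-style count definitions (zero diagonal, edges symmetrised for the undirected counts) and is not the author's R samplers:

```python
import numpy as np

def crawler_statistics(Y):
    """Compute the four local statistics behind the crawler variants (sketch).

    Y -- binary n x n adjacency matrix (zero diagonal assumed).
    Returns (edges, mutual dyads, triangles, transitive triplets).
    """
    Y = np.asarray(Y)
    U = np.maximum(Y, Y.T)                     # symmetrised (undirected) graph
    edges = int(np.triu(U, 1).sum())           # unsupervised crawler
    mutuals = int(np.triu(Y * Y.T, 1).sum())   # mutuality crawler
    triangles = int(np.trace(U @ U @ U) // 6)  # triangle crawler
    L2 = Y @ Y                                 # L2[i, j] = # two-paths i -> j
    np.fill_diagonal(L2, 0)
    transitive = int((Y * L2).sum())           # triplet crawler
    return edges, mutuals, triangles, transitive
```

Each count corresponds to the configuration the matching crawler is constrained to close: a tie, a reciprocated tie, a triangle, or a directed two-path plus shortcut.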


  10. Experimental results – results (1)
  • Starting set: 6 “political ecology” blogs
  • Remarks:
  • the dyad and triplet samplers reach closure
  • the unsupervised and triangle samplers were manually stopped

  11. Experimental results – results (2)
  [network maps of the sampled networks: triangles, dyads, triplets]

  12. • The unsupervised crawler is not manageable (+20,000 actors after 4 iterations!)
  • Dyads: did not select ‘authoritative’ sources + sensitive to the number of seeds?
  • Triplets seem to be the best solution: they take tie direction into account + take advantage of authoritative sources + conservative
  • Triangles: problem of network size… but the sampled network can have interesting properties.

  13. Conclusion and further research
  • Pitfalls to avoid:
  • not all relevant information is necessarily in the core: there is a lot of information in the periphery of this core
  • based on human behaviour patterns: not adapted at all to other kinds of networks (word co-occurrences, protein chains, …)
  • do not throw away more classical graph partitioning methods
  • always question your results
  • How to assess the efficiency of a crawler? Should communities in a web graph always be topic-centred?
  • Further research:
  • analysis and detection of ‘multi-core’ networks
  • ‘random walks’ in complete networks to find recursive patterns using T.C. assumptions
  • code of the samplers in ‘R’
