
Ch. 13 Structure of the Web



Presentation Transcript


  1. Ch. 13 Structure of the Web • Padmini Srinivasan • Computer Science Department; Department of Management Sciences • http://cs.uiowa.edu/~psriniva • padmini-srinivasan@uiowa.edu

  2. Origins • Origins of WWW (1989/1990: HTTP) • Sir Tim Berners-Lee & Robert Cailliau • First browser prototype: WorldWideWeb • First popular graphical browser: Mosaic (NCSA), Marc Andreessen and others • Mozilla -> Netscape -> Firefox • Lynx • 2000: Internet Explorer • WAIS, Gopher, Veronica • 1994: W3C • 1994: first World Wide Web conference • 1995: Yahoo! • 1998: Google • 2006: Live Search -> Bing

  3. Network Metaphor • Information network: • Different from a social network • Notion of a logical document: different • Decentralized, over many computers • Annotation • Network metaphor: “inspired and non-obvious” • Origins in hypertext, which in turn has origins in citation networks • Citation networks are distinctly temporal; is the Web? • Citation maps (popular): co-citation; bibliographic coupling • h-index (Hirsch); g-index; f-index • Patents; legal cases (precedents); medical literature • Indexes: cross-linkages; “see also”; Wikipedia
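
The citation measures named on this slide can be made concrete with a small sketch. It assumes a toy data layout that is not in the slides: a list of per-paper citation counts for the h-index, and a dictionary `cites` mapping each paper to the set of papers it cites. The function names and paper labels are illustrative only.

```python
from collections import Counter
from itertools import combinations


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def co_citation(cites):
    """For each pair of cited papers, count how many papers cite both."""
    counts = Counter()
    for refs in cites.values():
        for a, b in combinations(sorted(refs), 2):
            counts[(a, b)] += 1
    return counts


def bibliographic_coupling(cites, p, q):
    """Number of references that papers p and q have in common."""
    return len(cites[p] & cites[q])


# Hypothetical toy data: p1..p3 are citing papers, a..c are cited papers.
cites = {"p1": {"a", "b"}, "p2": {"a", "b", "c"}, "p3": {"b", "c"}}
print(h_index([10, 8, 5, 4, 3]))
print(co_citation(cites)[("a", "b")])
print(bibliographic_coupling(cites, "p1", "p3"))
```

Co-citation looks at who is cited together, while bibliographic coupling looks at who shares references; both reduce to set intersections over the same citation graph.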

  4. Links/Associations • Directed edges • Friendship nets, name-recognition, business colleagues, collaboration [Erdős number, Bacon number], IM nets, email graphs, etc. • Paths, shortest paths… • Associative memory • Semantic nets, aka conceptual networks (free-association studies) • Vannevar Bush, “As We May Think” (1945), Atlantic Monthly. WW2. MEMEX (on web) • Associative connections between all of knowledge • Acknowledged by most • A way to rechannel human resources
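
An Erdős or Bacon number is a shortest-path distance in a collaboration graph, which breadth-first search computes directly. The sketch below assumes the graph is stored as an adjacency dictionary; the node names are made up for illustration and do not come from the slides.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                # First time we see nxt, so this is its shortest distance.
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical collaboration graph; undirected, so each edge is listed both ways.
collab = {
    "erdos": ["a", "b"],
    "a": ["erdos", "c"],
    "b": ["erdos"],
    "c": ["a", "d"],
    "d": ["c"],
    "e": [],  # never collaborated with anyone in this graph
}
numbers = bfs_distances(collab, "erdos")
print(numbers.get("d"))   # distance of d from erdos
print(numbers.get("e"))   # None: no path, so no finite number
```

The same function works on a directed graph; it then follows edges only in their stated direction.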

  5. Paths and Connectivity • Connected graphs • Path: a sequence of nodes beginning at node X and ending at node Y, with each consecutive pair joined by an edge. • A directed graph is strongly connected if there is a directed path from every node to every other node. • If it is not strongly connected, we need to examine its ‘reachability’ properties. • Easier in an undirected graph: it splits into disconnected components • Directed? Find the strongly connected components
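
One way to test the strong-connectivity definition on a small graph is to pick any node and check that it reaches every node and is reached by every node. This is a minimal sketch under the assumption that the graph is an adjacency dictionary of directed edges; the helper names are hypothetical.

```python
def reachable(graph, start):
    """Set of nodes reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reversed_graph(graph):
    """Same nodes, every edge flipped."""
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev


def is_strongly_connected(graph):
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return True
    start = next(iter(nodes))
    # start reaches everyone, and everyone reaches start (checked on the reversed graph).
    return (reachable(graph, start) == nodes
            and reachable(reversed_graph(graph), start) == nodes)


cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
chain = {"a": ["b"], "b": ["c"], "c": []}
print(is_strongly_connected(cycle), is_strongly_connected(chain))
```

Two traversals suffice because any pair X, Y can be joined through the start node: X reaches start, and start reaches Y.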

  6. Strongly Connected Component • An SCC in a directed graph is a subset of nodes such that • (1) every node in it has a path to every other node in it • (2) the subset is not part of a larger set of nodes that has the same property. [So it is a maximal such subset] • Why is it interesting to know about such components in the Web?
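
The slides do not say how SCCs are computed. As one possible illustration, the sketch below uses Kosaraju's two-pass algorithm (a standard technique, chosen here as an assumption), written iteratively so it does not hit Python's recursion limit on long paths. The graph is again an adjacency dictionary.

```python
def strongly_connected_components(graph):
    """Return the SCCs of a directed graph as a list of node sets."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    reverse = {n: [] for n in nodes}
    for n, targets in graph.items():
        for t in targets:
            reverse[t].append(n)

    # Pass 1: depth-first search on the original graph, recording finish order.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All neighbours explored: node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in decreasing finish order.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


web = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": []}
print(sorted(sorted(c) for c in strongly_connected_components(web)))
```

Each returned set satisfies both conditions on the slide: its nodes are mutually reachable, and no further node can be added without breaking that property.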

  7. Bow-Tie Structure of the Web • 1999: Andrei Broder (now Yahoo!), then AltaVista • SCC; IN; OUT; Tendrils; Tubes; Disconnected • Macro-model • Properties of a reasonable model: • Should have a succinct and fairly natural description • Should be rooted in a plausible macro-level process for the creation of Web content • Should not require some prior static set of topics • Should reflect many of the structural phenomena observed in the Web
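
The bow-tie regions can be derived from two reachability sets around any node of the central SCC. The sketch below is an illustration under stated assumptions, not the procedure of the original study: it takes a `seed` node assumed to lie in the giant SCC, and it lumps tendrils, tubes, and disconnected pages into a single "OTHER" bucket rather than separating them.

```python
def reach(graph, start):
    """Nodes reachable from start along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(graph, seed):
    """Split nodes into SCC, IN, OUT, and OTHER relative to seed's SCC."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    reverse = {n: [] for n in nodes}
    for n, targets in graph.items():
        for t in targets:
            reverse[t].append(n)

    forward = reach(graph, seed)     # pages the seed can reach
    backward = reach(reverse, seed)  # pages that can reach the seed
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,    # can reach the core, not reachable from it
        "OUT": forward - scc,    # reachable from the core, cannot return
        "OTHER": nodes - forward - backward,  # tendrils, tubes, disconnected
    }


# Hypothetical toy web: i -> core {a, b} -> o, a tendril i -> t, and an island x -> y.
toy = {"i": ["a", "t"], "a": ["b"], "b": ["a", "o"], "o": [], "t": [], "x": ["y"], "y": []}
for region, members in bow_tie(toy, "a").items():
    print(region, sorted(members))
```

If the seed is not in the giant SCC, the same code still runs but describes the bow-tie around the seed's own, possibly tiny, component.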

  8. Similar Studies • Donato et al., ACM TOIT, 2007. The Web as a Graph: How Far We Are • WebBase (Stanford crawl), 200 million • 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million) • Next-largest SCC: 10 thousand!

  9. Similar Studies • Buriol et al. (includes Donato): Temporal Analysis of the Wikigraph.

  10. Bow-Tie • Why a single SCC? Why not two large ones? • Any other explanations? • Interlinked world? • Hard to be disconnected? • What about a new page? • Is the SCC static/fixed? How does it change? • Are links permanent? (25% remain after 1 year and 50% of pages stay the same; Ntoulas et al., 2004) • Many naturally occurring graphs have a giant SCC • IM (nodes are people, links are messages): almost all nodes are in the SCC; median path length is 7, mean 6.6.

  11. Bow-Tie: points to note • Incomplete picture • Doesn’t tell you how this structure is generated, just that it exists. • Macro model: • Thematic collections; differences? • Organization-specific collections • Regional: economic incentives/disincentives? • Community-based: education levels? • Bipartite cliques (small in size, many in number) • Fans pointing to centers • Will it always be observed? How about now?
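
A "fans pointing to centers" structure is a small complete bipartite subgraph: every fan page links to every center page. The brute-force sketch below is only meant to make that definition concrete on toy data; the function name, parameters, and page labels are assumptions, and enumerating all center subsets per page would not scale to real crawls.

```python
from itertools import combinations


def find_cores(graph, min_fans, num_centers):
    """Map each set of num_centers pages to the fans that link to all of them,
    keeping only sets with at least min_fans fans."""
    fans_of = {}
    for page, targets in graph.items():
        # Every num_centers-subset of a page's out-links is a candidate center set.
        for centers in combinations(sorted(set(targets)), num_centers):
            fans_of.setdefault(centers, set()).add(page)
    return {c: f for c, f in fans_of.items() if len(f) >= min_fans}


# Hypothetical link data: f1..f4 are fan pages, c1..c3 are candidate centers.
links = {
    "f1": ["c1", "c2"],
    "f2": ["c1", "c2", "c3"],
    "f3": ["c2", "c1"],
    "f4": ["c3"],
}
print(find_cores(links, min_fans=3, num_centers=2))
```

Here three fans share the same two centers, so they form one small core even though the fans never link to each other.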

  12. Web 2.0 • “An attitude, not a technology” • Collaboration/collective maintenance • Annotation, tags, links, editing, revisions • Data generated by individuals for individual and group sharing: Flickr, Gmail. • Connections between entities beyond “documents”. • Social feedback is key; ‘wisdom of crowds’; long tail

  13. Web Links • Navigational – static pages – passive services • Transactional – dynamic / computational services. Deep web • Search engines – heuristics • What kinds of rules would you use? • Implications for crawlers

  14. Summary • Web: origins, network metaphor • Citations, MEMEX • Paths • Structures (macro) • SCC • Bow-Tie model • Next • Ch 14: Hubs and Authorities; PageRank
