(A taste of) Data Management Over the Web


  1. (A taste of) Data Management Over the Web

  2. Web R&D • The web has revolutionized our world • Relevant research areas include databases, networks, security… • Data structures and architecture, complexity, image processing, security, natural language processing, user interface design… • Lots of research in each of these directions • Specialized conferences for web research • Lots of companies • This course will focus on Web Data

  3. Web Data • The web has revolutionized our world • Data is everywhere • Web pages, images, movies, social data, likes and dislikes… • Constitutes a great potential • But also a lot of challenges • Web data is huge, unstructured, dirty… • Just the ingredients of a fun research topic!

  4. Ingredients • Representation & Storage • Standards (HTML, HTTP), compact representations, security… • Search and Retrieval • Crawling, inferring information from text… • Ranking • What's important and what's not • Google PageRank, Top-K algorithms, recommendations…

  5. Challenges • Huge • Over 14 billion pages indexed by Google • Unstructured • But we do have some structure, such as HTML links, friendships in social networks… • Dirty • A lot of the data is incorrect, inconsistent, contradictory, or just irrelevant…

  6. Course Goal • Introducing a selection of fun topics in web data management • Allowing you to understand some state-of-the-art notions, algorithms, and techniques • As well as the main challenges and how we approach them

  7. Course outline • Ranking: HITS and PageRank • Data representation: XML; HTML • Crawling • Information Retrieval and Extraction, Wikipedia example • Aggregating ranks and Top-K algorithms • Recommendations, Collaborative Filtering for recommending movies in Netflix • Other topics (time permitting): Deep Web, Advertisements… • The course is partly based on: Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, and on a course by Pierre Senellart (and others) at Télécom ParisTech

  8. Course requirement • A small final project • Will involve understanding of 2 or 3 of the subjects studied and some implementation • Will be given next Monday

  9. Ranking

  10. Why Ranking? • Huge number of pages • Huge even if we filter according to relevance • Keep only pages that include the keywords • A lot of the pages are not informative • And anyway it is impossible for users to go through 10K results

  11. How to rank? • Observation: links are very informative! • Instead of a collection of Web Pages, we have a Web Graph!! • This is important for discovering new sites (see crawling), but also for estimating the importance of a site • CNN.com has more links to it than my homepage…

  12. Authority and Hubness • Authority: a site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites. A(v) = the authority of v • Hubness measures how good a site is as a hub: a good hub is a site that links to many authoritative sites. H(v) = the hubness of v

  13. HITS (Kleinberg ’99) • Recursive dependency: a(v) = Σ_{u→v} h(u),  h(v) = Σ_{v→u} a(u) • Normalize according to the sum of the authority / hubness values • We can show that a(v) and h(v) converge
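
To make the recursion concrete, here is a minimal sketch of the HITS iteration in Python on a tiny made-up graph (the graph, the iteration count, and all variable names are illustrative, not from the course):

# Hypothetical toy graph: edges[u] = pages that u links to.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
nodes = list(edges)

auth = {v: 1.0 for v in nodes}   # a(v)
hub = {v: 1.0 for v in nodes}    # h(v)

for _ in range(50):              # iterate until (approximate) convergence
    # a(v) = sum of h(u) over the pages u that link to v
    new_auth = {v: sum(hub[u] for u in nodes if v in edges[u]) for v in nodes}
    # h(v) = sum of a(u) over the pages u that v links to
    new_hub = {v: sum(new_auth[u] for u in edges[v]) for v in nodes}
    # normalize so that the authority / hubness values sum to 1
    auth = {v: x / sum(new_auth.values()) for v, x in new_auth.items()}
    hub = {v: x / sum(new_hub.values()) for v, x in new_hub.items()}

print(auth, hub)                 # the two score vectors after convergence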

  14. Random Surfer Model • Consider a "random surfer" • At each step, he picks one of the current page's links uniformly at random and clicks on it • P(W) = P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn)), where W1,…,Wn are the pages linking to W and O(Wi) is the number of out-edges of Wi

  15. Recursive definition • PageRank reflects the probability of being at a web page (PR(W) = P(W)) • Then: PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn)) • How to solve this?

  16. Eigenvector! • PR (as a row vector) is a left eigenvector of the stochastic transition matrix • I.e., the adjacency matrix normalized so that every row sums to 1 • The Perron-Frobenius theorem ensures that such a vector exists • Unique if the matrix is irreducible • Irreducibility can be guaranteed by small perturbations
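
As a small illustration of this view (the 3-page transition matrix below is a made-up example, not course material), numpy can extract the left eigenvector directly:

import numpy as np

# Hypothetical 3-page web: A[i, j] = 1/outdegree(i) if page i links to page j.
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])      # row-stochastic: each row sums to 1

# Left eigenvectors of A are right eigenvectors of A transposed.
vals, vecs = np.linalg.eig(A.T)
i = np.argmin(np.abs(vals - 1.0))    # pick the eigenvalue closest to 1
pr = np.real(vecs[:, i])
pr = pr / pr.sum()                   # normalize into a probability vector
print(pr)                            # PageRank of the three pages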

  17. Problems • A random surfer may get stuck in one component of the graph • May get stuck in loops • “Rank Sink” Problem • Many Web pages have no outlinks

  18. Damping Factor • Add some probability d of "jumping" to a random page • Now P(W) = (1-d)·[P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))] + d·(1/N), where N is the number of pages in the index

  19. How to compute PR? • Analytical methods • Can we solve the equations? • In principle yes, but the matrix is huge! • Not a realistic solution for web scale • Approximations

  20. A random surfer algorithm • Start from an arbitrary page • Toss a coin to decide whether to follow a link or to jump to a random page • Then toss another coin to decide which link to follow / which page to jump to • Keep a record of the frequency of the web pages visited • The frequency of each page converges to its PageRank
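
A sketch of this Monte Carlo estimation, reusing the same toy graph and a jump probability d (all constants and names are illustrative):

import random
from collections import Counter

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # hypothetical graph
pages = list(out_links)
d = 0.15                                 # probability of a random jump
steps = 100_000
visits = Counter()

current = random.choice(pages)           # start from an arbitrary page
for _ in range(steps):
    visits[current] += 1
    if random.random() < d or not out_links[current]:
        current = random.choice(pages)   # jump to a random page
    else:
        current = random.choice(out_links[current])   # follow a random out-link

print({p: visits[p] / steps for p in pages})   # visit frequencies ≈ PageRank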

  21. Power method • Start with some arbitrary rank row vector R0 • Compute Ri = Ri-1 · A • If we happen to reach the eigenvector, we stay there • Theorem: the process converges to the eigenvector! • Convergence is in practice pretty fast (~100 iterations)
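
A sketch of the power iteration with numpy, folding in the damping factor from slide 18 (the matrix and constants are again illustrative):

import numpy as np

A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])            # row-stochastic transition matrix
N = A.shape[0]
d = 0.15                                   # damping factor (slide 18)
G = (1 - d) * A + d * np.ones((N, N)) / N  # perturbed matrix, guaranteed irreducible

R = np.ones(N) / N                         # arbitrary starting rank vector R0
for _ in range(100):                       # ~100 iterations suffice in practice
    R = R @ G                              # Ri = Ri-1 · A
print(R)                                   # converges to the left eigenvector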

  22. Other issues • Accelerating Computation • Distributed PageRank • Mixed Model (Incorporating "static" importance) • Personalized PageRank

  23. XML

  24. HTML (HyperText Markup Language) • Used for presentation • Standardized by the W3C (1999) • Describes the structure and content of a (web) document • HTML is an open format • Can be processed by a variety of tools

  25. HTTP Application protocol Client request: GET /MarkUp/ HTTP/1.1 Host: www.google.com Server response: HTTP/1.1 200 OK Two main HTTP methods: GET and POST

  26. GET URL: http://www.google.com/search?q=BGU Corresponding HTTP GET request: GET /search?q=BGU HTTP/1.1 Host: www.google.com
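
For illustration, the same request can be issued with Python's standard http.client module (the host and query string reproduce the slide's example; the server may of course answer with a redirect or refuse automated clients):

import http.client

conn = http.client.HTTPConnection("www.google.com")   # Host header is added automatically
conn.request("GET", "/search?q=BGU")
resp = conn.getresponse()
print(resp.status, resp.reason)   # status line, e.g. 200 OK or a 3xx redirect
print(resp.read()[:200])          # first bytes of the response body
conn.close()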

  27. POST Used for submitting forms POST /php/test.php HTTP/1.1 Host: www.bgu.ac.il Content-Type: application/x-www-form-urlencoded Content-Length: 100 …
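
A corresponding sketch for a form POST with http.client (the path matches the slide; the form field is a made-up placeholder, and Content-Length is filled in automatically):

import http.client
from urllib.parse import urlencode

body = urlencode({"query": "BGU"})        # hypothetical form field
conn = http.client.HTTPConnection("www.bgu.ac.il")
conn.request("POST", "/php/test.php", body=body,
             headers={"Content-Type": "application/x-www-form-urlencoded"})
resp = conn.getresponse()
print(resp.status, resp.reason)
conn.close()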

  28. Status codes An HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK) The first digit indicates the class of the response: 1 Informational 2 Success 3 Redirection 4 Client-side error 5 Server-side error

  29. Authentication HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc. It can be used instead of plain HTTP to transmit sensitive data GET ... HTTP/1.1 Authorization: Basic dG90bzp0aXRp (Basic authentication: the base64 encoding of user:password, here toto:titi)

  30. Cookies • Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name) • Can be used to keep information about users between visits • Often what is stored is a session ID • Connected, on the server side, to all the session information
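
A small sketch of this exchange at the header level, using the standard http.cookies module to parse a hypothetical Set-Cookie header sent by the server:

from http.cookies import SimpleCookie

# Hypothetical header received in the server's first response:
cookie = SimpleCookie()
cookie.load("SESSIONID=abc123; Path=/; HttpOnly")
session_id = cookie["SESSIONID"].value          # -> "abc123"

# On every later request to the same domain, the client sends it back:
print("Cookie: " + "; ".join(f"{k}={m.value}" for k, m in cookie.items()))
# Cookie: SESSIONID=abc123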

  31. Crawling

  32. Basics of Crawling Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web Basic crawling algorithm: 1. Start from a given URL or set of URLs 2. Retrieve and process the corresponding page 3. Discover new URLs (next slide) 4. Repeat on each found URL Problem: the Web is huge!
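
A minimal breadth-first version of this loop (the seed URL, the page limit, and the bare-bones link extraction are simplifications for illustration only):

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag (step 3: discover new URLs)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href" and value]

def crawl(seed, max_pages=100):
    frontier = deque([seed])                      # 1. start from a given URL
    seen = {seed}
    while frontier and len(seen) < max_pages:     # the real Web is far too big to exhaust
        url = frontier.popleft()
        try:                                      # 2. retrieve and process the page
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:                 # 3. discover new URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)         # 4. repeat on each found URL
    return seen

# crawl("http://example.com/")                    # hypothetical seed URL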

  33. Discovering new URLs Browse the "Internet graph" (following, e.g., hyperlinks) Referrer URLs Site maps (sitemaps.org)

  34. The internet graph At least 14.06 billion nodes = pages At least 140 billion edges = links

  35. Graph-browsing algorithms Depth-first Breadth-first Combinations…

  36. Duplicates Identifying duplicates or near-duplicates on the Web to prevent multiple indexing Trivial duplicates: same resource at the same canonicalized URL: http://example.com:80/toto http://example.com/titi/../toto Exact duplicates: identification by hashing Near-duplicates (timestamps, tip of the day, etc.): more complex!
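
A sketch of exact-duplicate detection by hashing the page body (the choice of hash and the in-memory set are illustrative simplifications):

import hashlib

seen_fingerprints = set()

def is_exact_duplicate(content: bytes) -> bool:
    """Return True if an identical page body has already been indexed."""
    fingerprint = hashlib.sha1(content).hexdigest()   # hash of the page content
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False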

  37. Near-duplicate detection • Edit distance • Good measure of similarity • Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!) • Shingles: two documents are similar if they mostly share the same succession of k-grams
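
A sketch of the shingle idea: compare the sets of word k-grams of two documents with the Jaccard coefficient (k and the threshold are illustrative; real systems also hash and sample the shingles to scale):

def shingles(text, k=4):
    """Set of word k-grams ('shingles') of a document."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a or b else 1.0

def near_duplicates(doc1, doc2, threshold=0.9):
    """Two documents are near-duplicates if they mostly share the same shingles."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold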

  38. Crawling ethics robots.txt at the root of a Web server: User-agent: * Allow: /searchhistory/ Disallow: /search Per-page exclusion (de facto standard): <meta name="ROBOTS" content="NOINDEX,NOFOLLOW"> Per-link exclusion (de facto standard): <a href="toto.html" rel="nofollow">Toto</a> Avoid denial of service (DoS): wait 100 ms to 1 s between two successive requests to the same Web server
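
Python's standard urllib.robotparser can be used to honor these rules, together with a politeness delay between requests to the same server (the host and the delay below are illustrative):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")          # hypothetical server
rp.read()                                            # fetch and parse robots.txt

if rp.can_fetch("*", "http://example.com/search"):   # check Allow / Disallow rules
    pass                                             # ... fetch the page here ...
time.sleep(1)                                        # wait ~100 ms - 1 s between requests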
