Google Scale Data Management

Presentation Transcript


  1. Google Scale Data Management

  2. Google (..a lecture on that??) A good example of where a good understanding of computer science and some entrepreneurial spirit can take you.. (partially based on slides by Qing Li)

  3. Google (what?..couldn’t they come up with a better name?) • ..doesn’t mean anything, right? • ...or does it mean “search” in French?... • Surprising, but “Google” actually means something • “Googol” is the name of a number… • A big one: 1 followed by 100 zeros • ..this reflects the company's mission....and the reality of the world: • …there is a whole lot of “data” out there… • …and whoever helps people (e.g. me) manage it effectively deserves to be rich!

  4. ..in fact, there is more to Google than search... These are data-intensive applications

  5. Question #1: How can it know so much? …because it crawls… [Diagram: Web data → crawling → database]
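
To make the crawling step concrete, here is a minimal breadth-first crawler sketch (an illustration only, not Google's crawler): it downloads a page, stores a copy, extracts the links, and follows them. Only the Python standard library is used, and the seed URL in the comment is a placeholder.

```python
# Minimal breadth-first crawler sketch (illustrative only, not Google's crawler).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl starting from `seed`; returns {url: html}."""
    seen, pages = {seed}, {}
    queue = deque([seed])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html  # keep a copy of the page (the "database of copies")
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# pages = crawl("https://example.com")   # placeholder seed URL
```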

  6. Question #2: How can it be so fast? [Diagram: a whole lot of users → one search engine → a whole lot of data]

  7. Question #2: How can it be so fast? …because it indexes… Indexes: quick lookup tables (like the index pages in a book) [Diagram: crawling turns Web data into an index mapping keywords to documents, e.g. K1 → d1, d2 and K2 → d1, d3]

  8. Question #2: How can it be so fast? [Diagram: (1) the query reaches a front-end web server, backed by a database of copies of all web pages]

  9. Question #2: How can it be so fast? • use a lot of computers…but, intelligently! (parallelism) • …organize data for fast access (databases, indexing) [Diagram: (1) front-end web server → (2) index servers, over a database of copies of all web pages]

  10. Question #2: How can it be so fast? • use a lot of computers…but, intelligently! (parallelism) • …organize data for fast access (databases, indexing) [Diagram: (1) front-end web server → (2) index servers → (3) content servers, over a database of copies of all web pages]

  11. Question #2: How can it be so fast? • use a lot of computers…but, intelligently! (parallelism) • …organize data for fast access (databases, indexing) [Diagram: query flow steps (1)–(6) across the front-end web server, index servers, and content servers, over a database of copies of all web pages]

  12. Question #2: How can it be so fast? • use a lot of computers…but, intelligently! (parallelism) • …organize data for fast access (databases, indexing) • …most people want to know the same thing anyhow (caching) [Diagram: a CACHE of copies of recent results sits alongside the front-end web server, index servers, and content servers]
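
To illustrate the caching point, here is a toy result cache one could put in front of the index and content servers (a sketch, not Google's design): recent query results are kept in a small least-recently-used store, so repeated queries skip the expensive path.

```python
# Toy result cache sketch: most people ask the same things, so keep copies of
# recent results keyed by the query string (illustrative only).
from collections import OrderedDict

class ResultCache:
    """A tiny least-recently-used cache of query -> result list."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as recently used
            return self.entries[query]
        return None

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

def answer(query, cache, search_backend):
    """Serve from the cache when possible; otherwise run the full search path."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    results = search_backend(query)           # the expensive path (steps 1-6 above)
    cache.put(query, results)
    return results
```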

  13. Question 3: How can it serve the most relevant page? • ANALYZE!!!!

  14. Question 3: How can it serve the most relevant page? • Text analysis • Analyze the content of a page to determine if it is relevant to the query or not

  15. Question 3: How can it serve the most relevant page? • Text analysis • Analyze the content of a page to determine if it is relevant to the query or not • Link analysis • Analyze the (incoming and outgoing) links of pages to determine if the page is “worthy” or not

  16. Question 3: How can it serve the most relevant page? • Text analysis • Analyze the content of a page to determine if it is relevant to the query or not • Link analysis • Analyze the (incoming and outgoing) links of pages to determine if the page is “worthy” or not • User analysis • Analyze the user’s search context to guess what the user wants, in order to serve the most relevant content (and the advertisements!!)

  17. Text Analysis [Diagram: Web → crawling → raw text → text analysis → index]

  18. Term Extraction • Extraction of index terms • Stemming • “informed”, “informs” → inform

  19. Term Extraction • Extraction of index terms • Stemming • “informed”, “informs” → inform • Compound word identification • {“white”, “house”} or {“white house”} ????

  20. Term Extraction • Extraction of index terms • Stemming • “informed”, “informs” → inform • Compound word identification • {“white”, “house”} or {“white house”} ???? • Removal of stop words • “a”, “an”, “the”, “is”, “are”, “am”, … • they do not help discriminate between relevant and irrelevant documents
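
A small sketch of what term extraction might look like (illustrative only: the stop-word list is tiny and the “stemmer” is a crude suffix-stripper, where a real system would use something like Porter stemming; compound-word detection is not shown):

```python
# Term extraction sketch: tokenize, drop stop words, and apply a crude
# suffix-stripping "stemmer" (a real system would use e.g. Porter stemming).
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "am", "and", "or", "of", "to"}

def crude_stem(word):
    """Very rough stemming: informed, informs, informing -> inform."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    """Lowercase, split on non-letters, remove stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(extract_terms("The White House informed the reporters"))
# -> ['white', 'house', 'inform', 'reporter']   (compound detection not shown)
```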

  21. Stop words • Terms that occur too frequently in a language are not good discriminators for search [Plot: frequency of usage in the English language vs. rank of the term; the most frequent terms, such as “the” and “an”, are stop words, while rarer terms such as “ASU” are useful for search]

  22. “Inverted” index requires “fast” string search!! (hashes, search trees) [Diagram: a query such as “ASU” is looked up in a directory of terms (search, Google, …, ASU, …, tiger) with per-document counts, which points into the document collection (Doc #1, Doc #2, …, Doc #5)]
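
A minimal inverted-index sketch along the lines of the figure (a toy, with made-up documents): each term maps to the documents that contain it, together with a count. A Python dict is a hash table, which provides the fast exact string lookup mentioned above; search trees or tries would additionally support prefix lookups.

```python
# Inverted index sketch: map each term to the documents containing it (with counts).
from collections import defaultdict, Counter

def build_inverted_index(docs):
    """docs: {doc_id: list of terms} -> {term: {doc_id: count}}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, count in Counter(terms).items():
            index[term][doc_id] = count
    return index

docs = {                      # toy document collection (made-up contents)
    "doc1": ["asu", "search", "google", "asu"],
    "doc2": ["google", "search"],
    "doc5": ["asu", "tiger"],
}
index = build_inverted_index(docs)
print(index["asu"])           # -> {'doc1': 2, 'doc5': 1}: which docs, and how often
```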

  23. What are the weights???? • They need to capture how good the term is in describing the content of the page [Diagram: a document where t1 is better than t2 at describing the content → term frequency]

  24. What are the weights???? • They need to capture how discriminating the term is (remember the stop words??) [Diagram: t1 appears in many documents while t2 appears in few, so t2 is better (more discriminating) than t1]

  25. What are the weights???? • They need to capture how discriminating the term is (remember the stop words??) [Diagram: t1 appears in many documents while t2 appears in few, so t2 is better (more discriminating) than t1 → inverse document frequency]

  26. What are the weights???? • They need to capture how • good the term is in describing the content of the page (term frequency) • discriminating the term is (inverse document frequency)
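
One common way to combine the two criteria is tf-idf weighting: term frequency times inverse document frequency. The sketch below uses one standard formulation (a log-scaled idf); the exact formula is not given on the slides, so treat this as an illustration with toy data.

```python
# tf-idf sketch: weight = (how often the term occurs in the page) x
# (how rare the term is across pages). One standard formulation; details vary.
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: {doc_id: list of terms} -> {doc_id: {term: weight}}."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {
            term: tf[term] * math.log(n_docs / df[term])   # tf x idf
            for term in tf
        }
    return weights

docs = {
    "d1": ["asu", "asu", "cse", "the"],
    "d2": ["cse", "database", "the"],
    "d3": ["tiger", "the"],
}
w = tf_idf_weights(docs)
print(w["d1"])   # "the" appears in every doc, so its idf (and weight) is 0
```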

  27. How are the results ranked based on relevance?

  28. Vectors…what are they??? • A page with “ASU” weight 0.5, “CSE” weight 0.7, and “Selcuk” weight 0.9 becomes the vector <0.5, 0.7, 0.9> [Diagram: the vector plotted in a 3-dimensional space with axes ASU, CSE, and Selcuk]

  29. A web database (a vector space!) [Diagram: many page vectors plotted in the space with axes ASU, CSE, and Selcuk]

  30. Query “ASU CSE Selcuk” A web database (a vector space!) [Diagram: the query itself becomes a vector in the same ASU/CSE/Selcuk space as the page vectors]

  31. Query “ASU CSE Selcuk” Measuring relevance: “Find 2 closest vectors” Requires “multidimensional” index structures [Diagram: the 2 page vectors closest to the query vector in the ASU/CSE/Selcuk space]
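
A sketch of vector-space ranking under these assumptions (toy page vectors, and a linear scan where a real engine would use a multidimensional index): pages and the query are vectors of term weights, and pages are ranked by cosine similarity, i.e. by the angle between their vector and the query vector.

```python
# Vector-space ranking sketch: pages and the query are vectors of term weights;
# rank pages by cosine similarity to the query. A real engine would use a
# multidimensional index instead of scanning every vector.
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

pages = {   # toy page vectors in the ASU / CSE / Selcuk space from the slide
    "page1": {"asu": 0.5, "cse": 0.7, "selcuk": 0.9},
    "page2": {"asu": 0.9, "cse": 0.1},
    "page3": {"database": 0.8},
}
query = {"asu": 1.0, "cse": 1.0, "selcuk": 1.0}

ranked = sorted(pages, key=lambda p: cosine(pages[p], query), reverse=True)
print(ranked[:2])   # the 2 closest page vectors to the query
```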

  32. How about “importance” of a page…can we measure it?

  33. How about “importance” of a page…can we measure it? • We might be able to learn how important a page is by studying its connectivity (“link analysis”) more links to a page implies a better page

  34. How about “importance” of a page…can we measure it? • We might be able to learn how important a page is by studying its connectivity (“link analysis”) more links to a page implies a better page Links from a more important page should count more than links from a weaker page

  35. “Graph theory” helps analyze large networks of links Hubs and authorities • Good hubs should point to good authorities • Good authorities must be pointed to by good hubs.
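
This mutual-reinforcement idea is what the HITS algorithm computes iteratively. The sketch below (a toy, with a made-up three-page link graph) alternates the two update rules and normalizes after each round:

```python
# Hubs-and-authorities sketch (the HITS-style mutual reinforcement the slide
# describes): a page's authority score sums the hub scores of pages linking to it,
# and its hub score sums the authority scores of pages it links to. Toy graph only.
import math

def hubs_and_authorities(links, iterations=50):
    """links: {page: set of pages it points to} -> (hub scores, authority scores)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # good authorities are pointed to by good hubs
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # good hubs point to good authorities
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalize so the scores do not blow up
        auth_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {p: x / auth_norm for p, x in auth.items()}
        hub = {p: x / hub_norm for p, x in hub.items()}
    return hub, auth

links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}   # made-up link graph
hub, auth = hubs_and_authorities(links)
print(auth)   # C, pointed to by both A and B, gets the highest authority score
```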

  36. PageRank PR(A) = PR(C)/1 more links to a page implies a better page; links from a more important page should count more than links from a weaker page

  37. PageRank PR(A) = PR(C)/1 PR(B) = PR(A)/2 more links to a page implies a better page; links from a more important page should count more than links from a weaker page

  38. PageRank These types of problems are generally solved using “linear algebra” PR(A) = PR(C)/1 PR(B) = PR(A)/2 PR(C) = PR(A)/2 + PR(B)/1 more links to a page implies a better page; links from a more important page should count more than links from a weaker page
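
As a worked example (assuming the ranks are normalized to sum to 1 and ignoring the damping factor used in practice): the first equation says PR(A) = PR(C), and substituting PR(B) = PR(A)/2 into the third gives PR(C) = PR(A)/2 + PR(A)/2 = PR(A), which is consistent with the first; PR(A) + PR(A)/2 + PR(A) = 1 then yields PR(A) = PR(C) = 0.4 and PR(B) = 0.2.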

  39. How to compute ranks?? • PageRank (PR) definition is recursive • PageRank of a page depends on and influences PageRanks of other pages

  40. How to compute ranks?? • PageRank (PR) definition is recursive • PageRank of a page depends on and influences PageRanks of other pages • Solved iteratively • To compute PageRank: • Choose arbitrary initial PR_old and use it to compute PR_new • Repeat, setting PR_old to PR_new until PR converges (the difference between old and new PR is sufficiently small) • Rank values typically converge in 50-100 iterations
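
A minimal sketch of this iterative computation for the three-page graph behind the equations above (A → B, A → C, B → C, C → A), without the damping factor real PageRank uses and without handling dangling pages:

```python
# Iterative PageRank sketch for the three-page graph from the earlier slides.
# Simplified: no damping factor, and dangling pages (no out-links) are not handled.
def pagerank(out_links, tol=1e-9, max_iter=100):
    """out_links: {page: list of pages it links to} -> {page: rank}."""
    pages = list(out_links)
    pr_old = {p: 1.0 / len(pages) for p in pages}    # arbitrary initial PR_old
    for _ in range(max_iter):
        pr_new = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            share = pr_old[p] / len(targets)         # a page splits its rank
            for q in targets:                        # evenly among its out-links
                pr_new[q] += share
        if max(abs(pr_new[p] - pr_old[p]) for p in pages) < tol:
            return pr_new                            # converged
        pr_old = pr_new
    return pr_old

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
# converges to roughly {'A': 0.4, 'B': 0.2, 'C': 0.4}, matching the hand solution
```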

  41. Related CSE courses • 185 Introduction to Web Development • 205 Object-Oriented Programming and Data Structures • 310 Data Structures and Algorithms • 355 Introduction to Theoretical Computer Science • 412 Database Management • 414 Advanced Database Concepts • 450 Design and Analysis of Algorithms • 457 Theory of Formal Languages • 471 Introduction to Artificial Intelligence • 494 Cognitive Systems and Intelligent Agents • 494 Information Retrieval, Mining • 494 Numerical Linear Algebra Data Exploration • 494 Introduction to High Performance Computing

  43. References • L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project, Working Paper SIDL-WP-1999-0120, 1999. • S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International World Wide Web Conference (WWW 1998), 1998.
