Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 1 March 19, 2006 http://www.ee.technion.ac.il/courses/049011
Course overview, Part 1: Web • Architecture of search engines • Information retrieval • Index construction, Vector Space Model, evaluation criteria • Ranking methods • Google’s PageRank, Kleinberg’s Hubs & Authorities • Spectral methods • Latent Semantic Indexing • Web structure • Power laws, small-world phenomenon • Random graph models for the web • Preferential attachment, the copying model • Web sampling • Sampling from the web, sampling from search engines • Rank aggregation
Course overview, Part 2: Algorithms • Random sampling • Mean, median, MST • Data streams • Distinct elements, frequency moments, Lp distances • Sketching • Shingling, Hamming distance, Bloom filters • Dimension reduction • Low distortion embeddings, locality sensitive hash functions • Lower bounds • Communication complexity, reductions
Prerequisites • Algorithms and data structures • Complexity analysis • Hashing • Probability theory • Conditional probabilities, expectation, variance • Linear algebra • Matrices, eigenvalues, vector spaces • Combinatorics • Graph theory
Course requirements & grading • Papers review: 90% of final grade • Readings & participation: 10% of final grade
Papers review • Write a critical review of 2-3 papers on a subject not studied in class. • Deliverables: • Choice of papers to review (due: 4/6/06) • 5 page review report (due: 9/7/06) • 20 minute class presentation (9/7/06)
Textbooks Randomized Algorithms by Rajeev Motwani and Prabhakar Raghavan Mining the Web by Soumen Chakrabarti Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Instructor • Ziv Bar-Yossef • Tel: 5737 Email: zivby@ee • Office hours: Mondays, 15:30-17:30 at 917 Meyer • Mailing list: ee049011s-l
Examples of large data sets: Astronomy • Astronomical sky surveys • 120 Gigabytes/week • 6.5 Terabytes/year (The Hubble Telescope)
Examples of large data sets: Genomics • 25,000 genes in human genome • 3 billion bases • 3 Gigabytes of genetic data
Examples of large data sets: Phone call billing records • 250M calls/day • 60G calls/year • 40 bytes/call • 2.5 Terabytes/year
Examples of large data sets: Credit card transactions • 47.5 billion transactions in 2005 worldwide • 115 Terabytes of data transmitted to VisaNet data processing center in 2004
Examples of large data sets: Internet traffic • Traffic in a typical router: • 42 kB/second • 3.5 Gigabytes/day • 1.3 Terabytes/year
Examples of large data sets: The World-Wide Web • 25 billion pages indexed • 10 kB/page • 250 Terabytes of indexed text data • “Deep web” is supposedly 100 times as large
Reasons for the emergence of large data sets: Better technology • Storage & disks • Cheaper • More volume • Physically smaller • More efficient  Large data sets are affordable
Reasons for the emergence of large data sets: Better networking • High speed Internet • Cellular phones • Wireless LAN  More data consumers  More data producers
Reasons for the emergence of large data sets: Better IT • More processes are automatic • E-commerce and V-commerce • Online and telephone banking • Online and telephone customer service • E-learning • Chats, news, blogs • Online journals • Digital libraries • More enterprises are computerized • Companies • Banks • Governmental institutions • Universities  More data is available in digital form  World’s yearly production of data: 5 billion Gigabytes
Reasons for the emergence of large data sets: Growing needs • Science • Astronomy • Earth and environmental studies • Meteorology • Genetics • Business • Billing • Mining customer data • Intelligence • Emails • Web sites • Phone calls • Search • Web pages • Images • Audio & Video  More incentive to construct large data sets
Characteristics of large data sets • Huge • Distributed • Dispersed over many servers • Dynamic • Items added/deleted/modified continuously • Heterogeneous • Many agents access/update data • Noisy • Inherent • Unintentional • Malicious • Unstructured / semi-structured • No database schema
New challenges: Restricted access • Large data sets are kept on magnetic and optical storage devices • Access to data is sequential • Random access is costly
New challenges: Stringent efficiency requirements • Traditionally, “efficient” algorithms • Run in (small) polynomial time. • Use linear space. • For large data sets, efficient algorithms • Must run in linear or even sub-linear time. • Must use up to poly-logarithmic space.
New challenges: Search the data • Traditionally, input data is: • Either small, and thus easily searchable • Or moderately large, but organized in database tables. • In large data sets, input data is: • Immense • Disorganized, unstructured, non-standardized  Hard to find what you want
New challenges: Mine the data • Association rules • “Beers and diapers” • Patterns • Clusters • Statistical data • Graph structure
New challenges: Clean the data • Noise in data distorts • Computation results • Search results • Mining results • Need automatic methods for “cleaning” the data • Spam filters • Duplicate elimination • Quality evaluation
Abstract model of computing [Diagram: a computer program reads the data x = (x1, …, xn), where n is very large, and outputs an approximation of f(x). Examples: f = Mean, f = Parity.] • Approximation of f(x) is sufficient • Program can be randomized
Models for computing over large data sets • Random sampling • Data Streams • Sketching
Random sampling [Diagram: the computer program queries only a few data items at random and outputs an approximation of f(x); n is very large. Examples: Mean needs O(1) queries; Parity needs n queries.]
Random sampling • Advantages • Ultra-efficient • Sub-linear running time & space (could even be independent of data set size) • Disadvantages • May require random access • Doesn’t fit many problems • Hard to sample from disorganized data sets
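The Mean example from the diagram can be illustrated with a short sketch (illustrative Python, not from the course materials): the number of samples needed for a given accuracy depends only on the desired error, not on the data set size, so the query count is O(1) in n.

```python
import random

def sample_mean(data, k=1000, seed=0):
    """Estimate the mean of a huge sequence by averaging k random samples.

    By standard concentration bounds (Chernoff/Hoeffding), k depends only
    on the desired accuracy and confidence, not on len(data).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    samples = [data[rng.randrange(len(data))] for _ in range(k)]
    return sum(samples) / k

# Parity, by contrast, cannot be approximated by sampling: flipping any
# single unqueried bit flips the answer, so all n items must be queried.
```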
Data streams [Diagram: the computer program streams through the data in one sequential pass, using limited memory, and outputs an approximation of f(x); n is very large. Examples: Mean needs O(1) memory; Parity needs 1 bit of memory.]
Data streams • Advantages • Sequential access • Limited memory • Disadvantages • Running time is at least linear • Too restricted for some problems
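The two examples from the diagram, sketched as one-pass streaming algorithms (illustrative Python): Parity keeps a single bit of state, and Mean keeps two counters, regardless of how long the stream is.

```python
def stream_parity(stream):
    """Compute the parity of a bit stream with a single bit of state."""
    state = 0
    for bit in stream:
        state ^= bit  # XOR-accumulate; memory is 1 bit
    return state

def stream_mean(stream):
    """Compute the mean in one sequential pass with O(1) words of memory."""
    total, count = 0, 0
    for x in stream:
        total += x
        count += 1
    return total / count
```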
Sketching [Diagram: each data segment (Data1, Data2) is compressed into a small sketch (Sketch1, Sketch2); the program computes an approximation of f over the sketches alone; n is very large. Examples: Equality needs an O(1)-size sketch; Hamming distance needs an O(1)-size sketch; Lp distance (p > 2) needs an Ω(n^(1-2/p))-size sketch.] • Compress each data segment into a small “sketch” • Compute over the sketches
Sketching • Advantages • Appropriate for distributed data sets • Useful for “dimension reduction” • Disadvantages • Too restricted for some problems • Usually, at least linear running time
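The Equality example can be sketched with a hash-based fingerprint (illustrative Python; using a cryptographic hash here as a stand-in for the randomized fingerprints studied in class): two distributed servers can test whether their huge data segments are equal by exchanging only constant-size sketches.

```python
import hashlib

def sketch(data: bytes) -> bytes:
    """Compress a data segment into a constant-size fingerprint.

    Equal segments always yield equal sketches; unequal segments
    collide only with negligible probability.
    """
    return hashlib.sha256(data).digest()  # 32 bytes, regardless of len(data)

# Equality test over sketches: each party sends 32 bytes, not megabytes.
a = sketch(b"x" * 10**6)
b = sketch(b"x" * 10**6)
c = sketch(b"x" * 10**6 + b"y")
```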
Algorithms for large data sets  Sampling: • Mean and other moments • Median and other quantiles • Volume estimations • Histograms • Graph problems • Property testing  Data Streams: • Distinct elements • Frequency moments • Lp distances • Geometric problems • Graph problems • Database problems  Sketching: • Equality • Hamming distance • Set membership • Edit distance • Resemblance
A brief history of the Internet • 1961: First paper on packet switching (Kleinrock, MIT) • 1966: ARPANET (first design of a wide area computer network) • 1969: First packet sent from UCLA to SRI. • 1971: First E-mail (Ray Tomlinson) • 1974: Transmission Control Protocol (TCP) (Vint Cerf & Bob Kahn) • 1978: TCP splits into TCP and IP (Internet Protocol) • 1979: USENET (newsgroups) • 1984: Domain Name System (DNS) • 1988: First Internet worm • 1990: The World-Wide Web (Tim Berners-Lee, CERN)
A brief history of the Web • 1945: Hypertext (Vannevar Bush) • 1980: Enquire (First hypertext browser) • 1990: WorldWideWeb (First web browser) • 1991: HTML and HTTP • 1993: Mosaic (Marc Andreessen) • 1994: First WWW conference • 1994: W3C • 1994: Lycos (First commercial search engine) • 1994: Yahoo! (First web directory, Jerry Yang and David Filo) • 1995: AltaVista (DEC) • 1997: Google (First link-based search engine, Sergey Brin and Larry Page)
Basic terminology • Hypertext: document connected to other documents by links. • World-Wide Web: corpus of billions of hypertext documents (“pages”) that are stored on computers connected to the Internet. • Documents are written in HTML • Documents can be viewed using Web browsers
Information Retrieval (IR) • Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus. • Example: Get documents about Java, except for ones that are about Java coffee. • Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus. • Example: Get all documents containing the term “Java” but not containing the term “coffee”.
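The data-retrieval example above (“Java” but not “coffee”) can be sketched with a minimal inverted index (illustrative Python; document texts and helper names are hypothetical, not from the course):

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted postings list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():  # trivial tokenizer
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query(index, must, must_not=()):
    """Boolean data-retrieval query: docs containing every `must` term
    and none of the `must_not` terms."""
    result = set(index.get(must[0], []))
    for term in must[1:]:
        result &= set(index.get(term, []))
    for term in must_not:
        result -= set(index.get(term, []))
    return sorted(result)

docs = ["java programming language", "java coffee beans", "python language"]
idx = build_index(docs)
# query(idx, ["java"], must_not=["coffee"]) -> [0]
```

An IR system would additionally rank the matching documents by estimated relevance rather than return the full Boolean match set.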
Information retrieval systems [Diagram: the user’s query goes through a query processor, which issues a system query against the index; on the corpus side, a text processor produces tokenized docs, and an indexer builds the index and postings from them; retrieved docs are ordered by a ranking procedure and returned to the user as ranked retrieved docs.]
Search engines [Diagram: the same architecture as an IR system, with additions for the Web: a crawler fetches pages from the Web into a repository, the text processor and indexer run over the repository, and a global analyzer feeds the ranking procedure alongside the retrieved docs.]