Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 1 March 19, 2006 http://www.ee.technion.ac.il/courses/049011
Course overview, Part 1: Web • Architecture of search engines • Information retrieval • Index construction, Vector Space Model, evaluation criteria • Ranking methods • Google’s PageRank, Kleinberg’s Hubs & Authorities • Spectral methods • Latent Semantic Indexing • Web structure • Power laws, small-world phenomenon • Random graph models for the web • Preferential attachment, the copying model • Web sampling • Sampling from the web, sampling from search engines • Rank aggregation
Course overview, Part 2: Algorithms • Random sampling • Mean, median, MST • Data streams • Distinct elements, frequency moments, Lp distances • Sketching • Shingling, Hamming distance, Bloom filters • Dimension reduction • Low distortion embeddings, locality sensitive hash functions • Lower bounds • Communication complexity, reductions
Prerequisites • Algorithms and data structures • Complexity analysis • Hashing • Probability theory • Conditional probabilities, expectation, variance • Linear algebra • Matrices, eigenvalues, vector spaces • Combinatorics • Graph theory
Course requirements & grading • Papers review: 90% of final grade • Readings & participation: 10% of final grade
Papers review • Write a critical review of 2-3 papers on a subject not studied in class. • Deliverables: • Choice of papers to review (due: 4/6/06) • 5 page review report (due: 9/7/06) • 20 minute class presentation (9/7/06)
Textbooks Randomized Algorithms by Rajeev Motwani and Prabhakar Raghavan Mining the Web by Soumen Chakrabarti Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Instructor • Ziv Bar-Yossef • Tel: 5737 Email: zivby@ee • Office hours: Mondays, 15:30-17:30 at 917 Meyer • Mailing list: ee049011s-l
Examples of large data sets: Astronomy • Astronomical sky surveys • 120 Gigabytes/week • 6.5 Terabytes/year (The Hubble Telescope)
Examples of large data sets: Genomics • 25,000 genes in human genome • 3 billion bases • 3 Gigabytes of genetic data
Examples of large data sets: Phone call billing records • 250M calls/day • 60G calls/year • 40 bytes/call • 2.5 Terabytes/year
Examples of large data sets: Credit card transactions • 47.5 billion transactions in 2005 worldwide • 115 Terabytes of data transmitted to VisaNet data processing center in 2004
Examples of large data sets: Internet traffic • Traffic in a typical router: • 42 kB/second • 3.5 Gigabytes/day • 1.3 Terabytes/year
Examples of large data sets: The World-Wide Web • 25 billion pages indexed • 10 kB/page • 250 Terabytes of indexed text data • “Deep web” is supposedly 100 times as large
Reasons for the emergence of large data sets: Better technology • Storage & disks • Cheaper • More volume • Physically smaller • More efficient  Large data sets are affordable
Reasons for the emergence of large data sets: Better networking • High speed Internet • Cellular phones • Wireless LAN  More data consumers  More data producers
Reasons for the emergence of large data sets: Better IT • More processes are automatic • E-commerce and V-commerce • Online and telephone banking • Online and telephone customer service • E-learning • Chats, news, blogs • Online journals • Digital libraries • More enterprises are computerized • Companies • Banks • Governmental institutions • Universities  More data is available in digital form  World’s yearly production of data: 5 billion Gigabytes
Reasons for the emergence of large data sets: Growing needs • Science • Astronomy • Earth and environmental studies • Meteorology • Genetics • Business • Billing • Mining customer data • Intelligence • Emails • Web sites • Phone calls • Search • Web pages • Images • Audio & Video  More incentive to construct large data sets
Characteristics of large data sets • Huge • Distributed • Dispersed over many servers • Dynamic • Items added/deleted/modified continuously • Heterogeneous • Many agents access/update data • Noisy • Inherent • Unintentional • Malicious • Unstructured / semi-structured • No database schema
New challenges: Restricted access • Large data sets are kept on magnetic and optical storage devices • Access to data is sequential • Random access is costly
New challenges: Stringent efficiency requirements • Traditionally, “efficient” algorithms • Run in (small) polynomial time. • Use linear space. • For large data sets, efficient algorithms • Must run in linear or even sub-linear time. • Must use up to poly-logarithmic space.
New challenges: Search the data • Traditionally, input data is: • Either small, and thus easily searchable • Or moderately large, but organized in database tables. • In large data sets, input data is: • Immense • Disorganized, unstructured, non-standardized  Hard to find what you want
New challenges: Mine the data • Association rules • “Beers and diapers” • Patterns • Clusters • Statistical data • Graph structure
New challenges: Clean the data • Noise in data distorts • Computation results • Search results • Mining results • Need automatic methods for “cleaning” the data • Spam filters • Duplicate elimination • Quality evaluation
Abstract model of computing [Diagram: a computer program reads the data x = (x1, …, xn), where n is very large, and outputs an approximation of f(x). Examples: f = Mean, f = Parity.] • Approximation of f(x) is sufficient • Program can be randomized
Models for computing over large data sets • Random sampling • Data Streams • Sketching
Random sampling [Diagram: the computer program queries only a few data items at random and outputs an approximation of f(x); n is very large. Examples: Mean needs O(1) queries; Parity needs n queries.]
Random sampling • Advantages • Ultra-efficient • Sub-linear running time & space (could even be independent of data set size) • Disadvantages • May require random access • Doesn’t fit many problems • Hard to sample from disorganized data sets
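The Mean example from the diagram can be illustrated with a short sketch (illustrative Python, not from the course materials): the number of samples needed for a given accuracy depends only on the desired error, not on the data set size, so the query count is O(1) in n.

```python
import random

def sample_mean(data, k=1000, seed=0):
    """Estimate the mean of a huge sequence by averaging k random samples.

    By standard concentration bounds (Chernoff/Hoeffding), k depends only
    on the desired accuracy and confidence, not on len(data).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    samples = [data[rng.randrange(len(data))] for _ in range(k)]
    return sum(samples) / k

# Parity, by contrast, cannot be approximated by sampling: flipping any
# single unqueried bit flips the answer, so all n items must be queried.
```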
Data streams [Diagram: the computer program streams through the data in one sequential pass, using limited memory, and outputs an approximation of f(x); n is very large. Examples: Mean needs O(1) memory; Parity needs 1 bit of memory.]
Data streams • Advantages • Sequential access • Limited memory • Disadvantages • Running time is at least linear • Too restricted for some problems
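The two examples from the diagram, sketched as one-pass streaming algorithms (illustrative Python): Parity keeps a single bit of state, and Mean keeps two counters, regardless of how long the stream is.

```python
def stream_parity(stream):
    """Compute the parity of a bit stream with a single bit of state."""
    state = 0
    for bit in stream:
        state ^= bit  # XOR-accumulate; memory is 1 bit
    return state

def stream_mean(stream):
    """Compute the mean in one sequential pass with O(1) words of memory."""
    total, count = 0, 0
    for x in stream:
        total += x
        count += 1
    return total / count
```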
Sketching [Diagram: each data segment (Data1, Data2) is compressed into a small sketch (Sketch1, Sketch2); the program computes an approximation of f over the sketches alone; n is very large. Examples: Equality needs an O(1)-size sketch; Hamming distance needs an O(1)-size sketch; Lp distance (p > 2) needs an Ω(n^(1-2/p))-size sketch.] • Compress each data segment into a small “sketch” • Compute over the sketches
Sketching • Advantages • Appropriate for distributed data sets • Useful for “dimension reduction” • Disadvantages • Too restricted for some problems • Usually, at least linear running time
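The Equality example can be sketched with a hash-based fingerprint (illustrative Python; using a cryptographic hash here as a stand-in for the randomized fingerprints studied in class): two distributed servers can test whether their huge data segments are equal by exchanging only constant-size sketches.

```python
import hashlib

def sketch(data: bytes) -> bytes:
    """Compress a data segment into a constant-size fingerprint.

    Equal segments always yield equal sketches; unequal segments
    collide only with negligible probability.
    """
    return hashlib.sha256(data).digest()  # 32 bytes, regardless of len(data)

# Equality test over sketches: each party sends 32 bytes, not megabytes.
a = sketch(b"x" * 10**6)
b = sketch(b"x" * 10**6)
c = sketch(b"x" * 10**6 + b"y")
```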
Algorithms for large data sets  Sampling: • Mean and other moments • Median and other quantiles • Volume estimations • Histograms • Graph problems • Property testing  Data Streams: • Distinct elements • Frequency moments • Lp distances • Geometric problems • Graph problems • Database problems  Sketching: • Equality • Hamming distance • Set membership • Edit distance • Resemblance
A brief history of the Internet • 1961: First paper on packet switching (Kleinrock, MIT) • 1966: ARPANET (first design of a wide area computer network) • 1969: First packet sent from UCLA to SRI. • 1971: First E-mail (Ray Tomlinson) • 1974: Transmission Control Protocol (TCP) (Vint Cerf & Bob Kahn) • 1978: TCP splits into TCP and IP (Internet Protocol) • 1979: USENET (newsgroups) • 1984: Domain Name System (DNS) • 1988: First Internet worm • 1990: The World-Wide Web (Tim Berners-Lee, CERN)
A brief history of the Web • 1945: Hypertext (Vannevar Bush) • 1980: Enquire (First hypertext browser) • 1990: WorldWideWeb (First web browser) • 1991: HTML and HTTP • 1993: Mosaic (Marc Andreessen) • 1994: First WWW conference • 1994: W3C • 1994: Lycos (First commercial search engine) • 1994: Yahoo! (First web directory, Jerry Yang and David Filo) • 1995: AltaVista (DEC) • 1997: Google (First link-based search engine, Sergey Brin and Larry Page)
Basic terminology • Hypertext: document connected to other documents by links. • World-Wide Web: corpus of billions of hypertext documents (“pages”) that are stored on computers connected to the Internet. • Documents are written in HTML • Documents can be viewed using Web browsers
Information Retrieval (IR) • Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus. • Example: Get documents about Java, except for ones that are about Java coffee. • Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus. • Example: Get all documents containing the term “Java” but not containing the term “coffee”.
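The data-retrieval example above (“Java” but not “coffee”) can be sketched with a minimal inverted index (illustrative Python; document texts and helper names are hypothetical, not from the course):

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted postings list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():  # trivial tokenizer
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query(index, must, must_not=()):
    """Boolean data-retrieval query: docs containing every `must` term
    and none of the `must_not` terms."""
    result = set(index.get(must[0], []))
    for term in must[1:]:
        result &= set(index.get(term, []))
    for term in must_not:
        result -= set(index.get(term, []))
    return sorted(result)

docs = ["java programming language", "java coffee beans", "python language"]
idx = build_index(docs)
# query(idx, ["java"], must_not=["coffee"]) -> [0]
```

An IR system would additionally rank the matching documents by estimated relevance rather than return the full Boolean match set.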
Information retrieval systems [Diagram: the user’s query goes through a query processor, which issues a system query against the index; on the corpus side, a text processor produces tokenized docs, and an indexer builds the index and postings from them; retrieved docs are ordered by a ranking procedure and returned to the user as ranked retrieved docs.]
Search engines [Diagram: the same architecture as an IR system, with additions for the Web: a crawler fetches pages from the Web into a repository, the text processor and indexer run over the repository, and a global analyzer feeds the ranking procedure alongside the retrieved docs.]