
Algorithms for Large Data Sets



  1. Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 1 March 19, 2006 http://www.ee.technion.ac.il/courses/049011

  2. Course overview, Part 1: Web • Architecture of search engines • Information retrieval • Index construction, Vector Space Model, evaluation criteria • Ranking methods • Google’s PageRank, Kleinberg’s Hubs & Authorities • Spectral methods • Latent Semantic Indexing • Web structure • Power laws, small-world phenomenon • Random graph models for the web • Preferential attachment, the copying model • Web sampling • Sampling from the web, sampling from search engines • Rank aggregation

  3. Course overview, Part 2: Algorithms • Random sampling • Mean, median, MST • Data streams • Distinct elements, frequency moments, Lp distances • Sketching • Shingling, Hamming distance, Bloom filters • Dimension reduction • Low distortion embeddings, locality sensitive hash functions • Lower bounds • Communication complexity, reductions

  4. Prerequisites • Algorithms and data structures • Complexity analysis • Hashing • Probability theory • Conditional probabilities, expectation, variance • Linear algebra • Matrices, eigenvalues, vector spaces • Combinatorics • Graph theory

  5. Course requirements & grading • Papers review 90% of final grade • Readings & participation 10% of final grade

  6. Papers review • Write a critical review of 2-3 papers on a subject not studied in class. • Deliverables: • Choice of papers to review (due: 4/6/06) • 5 page review report (due: 9/7/06) • 20 minute class presentation (9/7/06)

  7. Textbooks Randomized Algorithms by Rajeev Motwani and Prabhakar Raghavan Mining the Web by Soumen Chakrabarti Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

  8. Instructor • Ziv Bar-Yossef • Tel: 5737 Email: zivby@ee • Office hours: Mondays, 15:30-17:30 at 917 Meyer • Mailing list: ee049011s-l

  9. Large Data Sets: Examples, Challenges, and Models

  10. Examples of large data sets: Astronomy • Astronomical sky surveys • 120 Gigabytes/week • 6.5 Terabytes/year • (Image: the Hubble Telescope)

  11. Examples of large data sets: Genomics • 25,000 genes in human genome • 3 billion bases • 3 Gigabytes of genetic data

  12. Examples of large data sets: Phone call billing records • 250M calls/day • 60G calls/year • 40 bytes/call • 2.5 Terabytes/year

  13. Examples of large data sets: Credit card transactions • 47.5 billion transactions in 2005 worldwide • 115 Terabytes of data transmitted to VisaNet data processing center in 2004

  14. Examples of large data sets: Internet traffic • Traffic in a typical router: • 42 kB/second • 3.5 Gigabytes/day • 1.3 Terabytes/year

  15. Examples of large data sets: The World-Wide Web • 25 billion pages indexed • 10kB/page • 250 Terabytes of indexed text data • “Deep web” is supposedly 100 times as large

  16. Reasons for the emergence of large data sets: Better technology • Storage & disks • Cheaper • More volume • Physically smaller • More efficient ⇒ Large data sets are affordable

  17. Reasons for the emergence of large data sets: Better networking • High speed Internet • Cellular phones • Wireless LAN ⇒ More data consumers, more data producers

  18. Reasons for the emergence of large data sets: Better IT • More processes are automatic • E-commerce and V-commerce • Online and telephone banking • Online and telephone customer service • E-learning • Chats, news, blogs • Online journals • Digital libraries • More enterprises are computerized • Companies • Banks • Governmental institutions • Universities ⇒ More data is available in digital form • World’s yearly production of data: 5 billion Gigabytes

  19. Reasons for the emergence of large data sets: Growing needs • Science • Astronomy • Earth and environmental studies • Meteorology • Genetics • Business • Billing • Mining customer data • Intelligence • Emails • Web sites • Phone calls • Search • Web pages • Images • Audio & Video ⇒ More incentive to construct large data sets

  20. Characteristics of large data sets • Huge • Distributed • Dispersed over many servers • Dynamic • Items are added/deleted/modified continuously • Heterogeneous • Many agents access/update the data • Noisy • Inherent • Unintentional • Malicious • Unstructured / semi-structured • No database schema

  21. New challenges: Restricted access • Large data sets are kept on magnetic and optical storage devices • Access to data is sequential • Random access is costly

  22. New challenges: Stringent efficiency requirements • Traditionally, “efficient” algorithms • Run in (small) polynomial time. • Use linear space. • For large data sets, efficient algorithms • Must run in linear or even sub-linear time. • Must use at most poly-logarithmic space.

  23. New challenges: Search the data • Traditionally, input data is: • Either small and thus easily searchable • Or moderately large, but organized in database tables. • In large data sets, input data is: • Immense • Disorganized, unstructured, non-standardized ⇒ Hard to find what you want

  24. New challenges: Mine the data • Association rules • “Beers and diapers” • Patterns • Clusters • Statistical data • Graph structure

  25. New challenges: Clean the data • Noise in data distorts • Computation results • Search results • Mining results • Need automatic methods for “cleaning” the data • Spam filters • Duplicate elimination • Quality evaluation

  26. Abstract model of computing • A computer program reads a data set x of size n (n is very large) and outputs f(x) • Examples: the mean of the data, the parity of the data • An approximation of f(x) is sufficient • The program can be randomized

  27. Models for computing over large data sets • Random sampling • Data Streams • Sketching

  28. Random sampling • The program queries a few data items and outputs an approximation of f(x) • Examples: the mean needs only O(1) queries; parity needs n queries

  29. Random sampling • Advantages • Ultra-efficient • Sub-linear running time & space (could even be independent of data set size) • Disadvantages • May require random access • Doesn’t fit many problems • Hard to sample from disorganized data sets
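A minimal Python sketch of the sampling model on the previous slides (the function name and sample count are illustrative, not from the lecture): the mean is estimated by querying a few items at random positions, in time independent of the data set size.

```python
import random

def estimate_mean(data, num_samples=2000):
    """Estimate the mean of `data` by querying a few items at
    uniformly random positions (requires random access)."""
    n = len(data)
    samples = [data[random.randrange(n)] for _ in range(num_samples)]
    return sum(samples) / num_samples

# The number of queries is fixed, independent of len(data); by standard
# concentration bounds (Chernoff/Hoeffding) the estimate is close to the
# true mean with high probability.
```

This is exactly the trade-off listed above: sub-linear (here, constant) running time, but the algorithm must be able to jump to arbitrary positions in the data.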

  30. Data streams • The program streams through the data, uses limited memory, and outputs an approximation of f(x) • Examples: the mean needs O(1) memory; parity needs 1 bit of memory

  31. Data streams • Advantages • Sequential access • Limited memory • Disadvantages • Running time is at least linear • Too restricted for some problems
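The two running examples (mean and parity) can be written as a single one-pass streaming computation; a sketch, assuming the stream consists of integers and the parity is taken over their low-order bits:

```python
def stream_mean_and_parity(stream):
    """One pass over the stream with limited memory:
    running sum/count for the mean, one bit for the parity."""
    total, count, parity = 0, 0, 0
    for x in stream:
        total += x        # mean: O(1) words of memory
        count += 1
        parity ^= x & 1   # parity: a single bit of state
    return total / count, parity
```

The access pattern is strictly sequential, which matches data kept on magnetic or optical storage, but the running time is at least linear because every item is read once.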

  32. Sketching • Compress each data segment (Data1, Data2, …) into a small “sketch”; compute an approximation of f over the sketches • Examples: equality needs an O(1)-size sketch; Hamming distance needs an O(1)-size sketch; Lp distance (p > 2) needs a sketch of size n^(1-2/p)

  33. Sketching • Advantages • Appropriate for distributed data sets • Useful for “dimension reduction” • Disadvantages • Too restricted for some problems • Usually, at least linear running time
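For the equality example on the sketching slide, a hash fingerprint gives an O(1)-size sketch. The lecture's randomized fingerprints would typically be algebraic (e.g. Karp-Rabin); a cryptographic hash is used here only to keep the sketch short and self-contained:

```python
import hashlib

def make_sketch(data: bytes) -> bytes:
    """Compress a data segment into a 32-byte fingerprint,
    regardless of how large the segment is."""
    return hashlib.sha256(data).digest()

def probably_equal(sketch1: bytes, sketch2: bytes) -> bool:
    """Decide equality of the original segments from the sketches
    alone; a false positive requires a hash collision."""
    return sketch1 == sketch2
```

This is why sketching suits distributed data sets: two servers each ship a 32-byte sketch instead of their full data segment, and equality is decided at the receiver.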

  34. Algorithms for large data sets • Sampling: mean and other moments, median and other quantiles, volume estimations, histograms, graph problems, property testing • Data streams: distinct elements, frequency moments, Lp distances, geometric problems, graph problems, database problems • Sketching: equality, Hamming distance, set membership, edit distance, resemblance

  35. Web: History and Architecture of Search Engines

  36. A brief history of the Internet • 1961: First paper on packet switching (Kleinrock, MIT) • 1966: ARPANET (first design of a wide area computer network) • 1969: First packet sent from UCLA to SRI. • 1971: First E-mail (Ray Tomlinson) • 1974: Transmission Control Protocol (TCP) (Vint Cerf & Bob Kahn) • 1978: TCP splits into TCP and IP (Internet Protocol) • 1979: USENET (newsgroups) • 1984: Domain Name System (DNS) • 1988: First Internet worm • 1990: The World-Wide Web (Tim Berners-Lee, CERN)

  37. A brief history of the Web • 1945: Hypertext (Vannevar Bush) • 1980: Enquire (First hypertext browser) • 1990: WorldWideWeb (First web browser) • 1991: HTML and HTTP • 1993: Mosaic (Marc Andreessen) • 1994: First WWW conference • 1994: W3C • 1994: Lycos (First commercial search engine) • 1994: Yahoo! (First web directory, Jerry Yang and David Filo) • 1995: AltaVista (DEC) • 1997: Google (First link-based search engine, Sergey Brin and Larry Page)

  38. End of Lecture 1

  39. Basic terminology • Hypertext: document connected to other documents by links. • World-Wide Web: corpus of billions of hypertext documents (“pages”) that are stored on computers connected to the Internet. • Documents are written in HTML • Documents can be viewed using Web browsers

  40. Information Retrieval (IR) • Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus. • Example: Get documents about Java, except for ones that are about Java coffee. • Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus. • Example: Get all documents containing the term “Java” but not containing the term “coffee”.
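The data-retrieval example above ("Java" but not "coffee") is a Boolean query over an inverted index; a toy sketch with made-up documents:

```python
def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = [
    "Java is a programming language",     # doc 0
    "I drink Java coffee every morning",  # doc 1
    "the coffee was cold",                # doc 2
]
index = build_inverted_index(docs)

# Data retrieval: ALL docs containing "java" and not "coffee".
hits = index.get("java", set()) - index.get("coffee", set())
# hits == {0}: only doc 0 matches.
```

Note that this answers the Boolean query exactly but, unlike an IR system, does nothing about the user's underlying information need: doc 0 is returned because of its terms, not because it is about the programming language.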

  41. Information Retrieval vs. Data Retrieval

  42. Information retrieval systems (architecture) • Query side: user query → query processor → system query → index → postings → retrieved docs → ranking procedure → ranked retrieved docs • Indexing side: corpus → text processor → tokenized docs → indexer → index

  43. Search engines (architecture) • Same pipeline as an IR system, with the Web in place of a fixed corpus • Additional components: a crawler that fills a repository (the input to the text processor) and a global analyzer • Query side: user query → query processor → system query → index → postings → retrieved docs → ranking procedure → ranked retrieved docs

  44. Classical IR vs. Web IR
