This presentation from the University of Illinois and Magnify, Inc. examines the sharp growth in data production since 1995 and the problems caused by the data's format and distribution. It highlights trends such as commodity bandwidth that can move gigabytes in seconds and the shortage of people available to analyze large volumes of distributed data, and it proposes DataSpace as one approach to making that data useful. The stated aim is to lower the cost of making data useful by combining data and compute clusters with data mining.
Distributed Tera-Mining R. L. Grossman Laboratory for Advanced Computing University of Illinois & Magnify, Inc.
… All in the Wrong Format With no one to analyze it.
The Data Gap • Most data comes a GB and a TB at a time. [Figure: The Data Gap, comparing total new disk (TB) since 1995 with new Ph.D.s]
Trend 2. SONET is dead. Lambda rules. Gigabytes can be moved in seconds.
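The "gigabytes in seconds" claim can be checked with simple arithmetic. The sketch below is an illustration and not part of the original slides: it assumes the standard optical carrier line rates (OC-3 at 155.52 Mbit/s, OC-12 at 622.08 Mbit/s, OC-48 at 2488.32 Mbit/s), treats 1 GB as 10^9 bytes, and ignores protocol overhead, so real transfers would be somewhat slower. The function name `transfer_seconds` is hypothetical.

```python
# Standard SONET optical carrier line rates in bits per second.
# Assumption: raw line rate, no framing or protocol overhead.
LINE_RATE_BPS = {
    "OC-3": 155.52e6,
    "OC-12": 622.08e6,
    "OC-48": 2488.32e6,
}


def transfer_seconds(num_bytes: float, carrier: str) -> float:
    """Return the idealized time to move num_bytes over the given carrier."""
    if carrier not in LINE_RATE_BPS:
        raise ValueError(f"unknown carrier: {carrier}")
    return (num_bytes * 8) / LINE_RATE_BPS[carrier]


if __name__ == "__main__":
    one_gigabyte = 1e9  # assumption: decimal gigabyte
    for name in LINE_RATE_BPS:
        print(f"{name}: {transfer_seconds(one_gigabyte, name):.1f} s per GB")
```

Under these assumptions a gigabyte takes under a minute on OC-3 and only a few seconds on OC-48, which is consistent with the slide's point.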
Trend 3: Most Data is Distributed • Bush’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
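The law above is stated only qualitatively, so the following is a minimal sketch of one literal reading: usefulness is taken to be proportional to the square of the number of columns a column is compared to. The function names and the unit constant of proportionality are assumptions made for illustration.

```python
def relative_usefulness(columns_compared: int) -> int:
    """Usefulness of one column, up to a constant factor, under the
    square law quoted on the slide (literal reading, assumed here)."""
    if columns_compared < 0:
        raise ValueError("columns_compared must be non-negative")
    return columns_compared ** 2


def gain_from_growth(before: int, after: int) -> float:
    """Factor by which usefulness grows when the number of comparable
    columns goes from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return relative_usefulness(after) / relative_usefulness(before)


if __name__ == "__main__":
    # Doubling the number of comparable columns quadruples usefulness.
    print(gain_from_growth(10, 20))
```

On this reading, making distributed columns comparable matters more than adding storage: each newly reachable column raises the value of every existing one.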
Example 1: ENSO & Cholera • El Niño data at NCAR • Cholera data at WHO
Example 2: Voting [Figure: Table 1 and Table 2]
DataSpace – One Approach to Making Data Useful
Complementary to the grid, which we view as a distributed computer.
Today's Multi-media Web: • html • http • search by keyword • workstations & servers • 16 terabytes of documents • 4 billion documents
Tomorrow's Data Web: • pmml & dtml • dstp • correlate & mine • data & compute clusters • petabytes of data • tens of billions to trillions of records
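[Figure: DSTP Server 1 serves pairs k[i], x[i]; DSTP Server 2 serves pairs k[i], y[j]. The columns are related through a shared key, UCK [uckid], with attributes [aid], and the result is shown as a graph.]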
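The idea of correlating columns held on two servers through a shared key can be sketched in a few lines. This is an illustration only: it does not use the DSTP protocol or any DataSpace API, the two dictionaries stand in for columns fetched from two servers, and the values are synthetic placeholders rather than real measurements. The helper names `join_on_key` and `pearson` are hypothetical.

```python
from math import sqrt


def join_on_key(left: dict, right: dict):
    """Inner-join two {key: value} columns on the keys they share."""
    keys = sorted(left.keys() & right.keys())
    return keys, [left[k] for k in keys], [right[k] for k in keys]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / sqrt(sxx * syy)


# Synthetic stand-ins for columns from two servers, keyed by a shared key.
server1_x = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
server2_y = {2: 4.0, 3: 6.0, 4: 8.0, 5: 10.0}

keys, xs, ys = join_on_key(server1_x, server2_y)
r = pearson(xs, ys)

if __name__ == "__main__":
    print(keys, r)
```

Only keys present on both sides survive the join, which is why agreeing on a common key matters before any correlation between distributed columns can be computed.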
Terra Mining Testbed • Optical testbed for distributed tera mining of scientific data. • Goal: also to be a testbed for broadband-based business services.
Lessons Learned • It's the data, stupid. Cycles, cylinders & lambdas are all commodities. • The fundamental challenge: lower the cost to make data useful. • The emergence of internet infrastructure for data is inevitable. It opens up possibilities for new types of scientific discoveries.
For More Information • DataSpace http://www.dataspaceweb.net http://www.ncdm.uic.edu • DataSpace Standards http://www.dmg.org • Selected articles http://www.twocultures.net • Magnify http://www.magnify.com
Trend 2. Bandwidth is a Commodity [Figure: OC-3, OC-12, OC-48]
Distributed Exabytes (New Disks) [Figure: new disk capacity in petabytes, with 1 exabyte marked] Source: IDC (1999), "1999 Winchester Disk Drive Market Forecast and Review"
Trend 3: Most Data is Distributed • W’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.