
Data-Intensive Computing Symposium: Report Out

Data-Intensive Computing Symposium: Report Out. Phillip B. Gibbons Intel Research Pittsburgh. Data-Intensive Computing Symposium. Held 3/26/08 @Yahoo! in Sunnyvale, CA Sponsored by: Yahoo! Research


Presentation Transcript


  1. Data-Intensive Computing Symposium: Report Out • Phillip B. Gibbons, Intel Research Pittsburgh

  2. Data-Intensive Computing Symposium • Held 3/26/08 @Yahoo! in Sunnyvale, CA • Sponsored by: • Yahoo! Research • Computing Community Consortium, which supports the computing research community in creating compelling research visions and the mechanisms to realize these visions (http://www.cra.org/ccc/) • ~100 invited attendees, ~12 invited talks • Slides and video to be posted on the CCC web site • Blog: http://dita.ncsa.uiuc.edu/xllora (thanks!)

  3. Randy Bryant (CMU) Data-Intensive Scalable Computing • Local speaker; I'll skip in the interest of time • DISC has been renamed

  4. ChengXiang Zhai (UIUC) Text Information Management

  5. ChengXiang Zhai (UIUC) Proposal 1: Maximum Personalization

  6. ChengXiang Zhai (UIUC)

  7. ChengXiang Zhai (UIUC)

  8. Dan Reed (Microsoft) Clouds and ManyCore: The Revolution • Big Data: Should focus more on the user experience • How to manage resources • Cloud computing can help organically orchestrate resources on demand • Initiative to bring academics, business, and users together around the big data problem (PCAST NITRD review)

  9. Jill Mesirov (Broad Institute) Computational Paradigms for Genomic Medicine • Broad has 4.8K processors, 1.4 PB of storage on site • Big Data Problem: Mining gene expression arrays • Rows: patients; columns: genes; values: expression levels • Example: classify leukemias based on expression arrays • Solved by a grad student over the weekend using web sources • Challenge: Computation/Analysis/Provenance infrastructure needed • Developed GenePattern 3.1: Software infrastructure for interoperable informatics • Usable by biologists
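The expression-array setup above (rows = patients, columns = genes, values = expression levels) can be sketched with a minimal nearest-centroid classifier. This is an illustration of the general idea only, not the method the Broad team used; the data and labels below are invented.

```python
# Sketch: classify a patient's gene-expression vector by nearest
# class centroid. Rows are patients, columns are genes.
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def classify(sample, centroids):
    """centroids: dict label -> centroid vector. Returns nearest label."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(sample, centroids[lbl]))
```

A usage sketch: build one centroid per known leukemia subtype from labeled training patients, then assign each new patient to the subtype whose centroid is closest in expression space.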

  10. Garth Gibson (CMU) Simplicity and Complexity in Data Systems at Scale • Petascale Data Storage Institute • Understanding disk failures, cfdr.usenix.org • Another local speaker, so I'll skip in the interest of time

  11. Jeff Dean (Google) Handling Large Datasets at Google

  12. Jeff Dean (Google)

  13. Jeff Dean (Google)

  14. Jeff Dean (Google) GFS Usage

  15. Jeff Dean (Google)

  16. Jeff Dean (Google)

  17. Jeff Dean (Google)

  18. Jeff Dean (Google)

  19. Jon Kleinberg (Cornell) Large-Scale Social Network Data • Diffusion in social networks • Why is chain-letter diffusion so deep & narrow? • Example: Iraq war authorization protest chain letter, a diffusion tree of 18K nodes
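The "deep & narrow" observation above is a statement about the shape of the diffusion tree. A minimal sketch of how one could quantify it, assuming the chain-letter data is given as a child-to-parent map (an invented representation for illustration):

```python
from collections import defaultdict, deque

def depth_and_width(parent):
    """parent: dict child -> parent; the root maps to None.
    Returns (max depth, max number of nodes at any single depth)."""
    children = defaultdict(list)
    root = None
    for child, p in parent.items():
        if p is None:
            root = child
        else:
            children[p].append(child)
    # BFS from the root, recording each node's depth.
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for c in children[u]:
            level[c] = level[u] + 1
            queue.append(c)
    width = defaultdict(int)
    for d in level.values():
        width[d] += 1
    return max(level.values()), max(width.values())
```

A "deep & narrow" tree is one where the first number is large relative to the second, which is the surprising shape the talk asks about.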

  20. Jon Kleinberg (Cornell)

  21. Jon Kleinberg (Cornell)

  22. Marc Najork (Microsoft Research) Mining the Web Graph • Query-dependent link-based ranking algorithms (HITS, SALSA) • Scalable Hyperlink Store: used internally within MSR for web graphs
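For reference, the HITS algorithm named above alternates between hub and authority scores over the link graph. A toy sketch (not Najork's implementation; the dictionary-of-lists graph format is an assumption for illustration):

```python
def hits(graph, iters=20):
    """graph: dict node -> list of out-neighbours.
    Returns (hub, auth) score dicts after `iters` mutual updates."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority score: sum of hub scores of pages linking to n.
        auth = {n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of pages n links to.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```

The query-dependent part is that the graph passed in is a small neighbourhood of pages relevant to the query, not the whole web; SALSA replaces the summations with random-walk normalization.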

  23. Joe Hellerstein (UC Berkeley) "What" Goes Around • Industrial revolution of data: sensors, logs, cameras • Hardware revolution: datacenters/virtualization, many-core • Industrial revolution in software? Declarative languages in some domains • Why "What": • Rapid prototyping • Pocket-size code bases • Independence from the runtime • Ease of analysis and security • Room for optimization and adaptability

  24. Joe Hellerstein (UC Berkeley)

  25. Joe Hellerstein (UC Berkeley) • Sensor networks, mobile networks, modular robotics, computer games, program analysis • Distributed inference (junction trees and loopy belief propagation), graphs upon graphs • Evita Raced: Overlog metacompiler (the compiler is itself written declaratively) • Matches Datalog optimizations (dynamic programming), cycle tests • Datalog with known extensions and tweaks • Centrality of rendezvous & graphs • Challenges: • performance beyond number of messages (e.g., memory hierarchy), availability, real programs, not Turing complete

  26. Raghu Ramakrishnan (Yahoo! Res.) Sherpa: Cloud Computing of the Third Kind

  27. Raghu Ramakrishnan (Yahoo! Res.)

  28. Raghu Ramakrishnan (Yahoo! Res.)

  29. Alex Szalay (Johns Hopkins) Scientific Applications of Large Databases

  30. Alex Szalay (Johns Hopkins)

  31. Alex Szalay (Johns Hopkins)

  32. Phillip Gibbons (Intel Research) Data-Rich Computing: Where It's At • "I know where it's at, man!" • Important, interesting, exciting research area • Focus of this talk: • Cluster approach: computing is co-located where the storage is at • Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation • Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at

  33. Hierarchy-Savvy Parallel Algorithm Design (HI-SPADE) project • Goal: Support a hierarchy-savvy model of computation for parallel algorithm design • Hierarchy-savvy: • Hide what can be hid • Expose what must be exposed • Sweet spot between ignorant and fully aware • Support: • Develop the compilers, runtime systems, architectural features, etc. to realize the model • Important component: fine-grain threading
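One common way to "hide what can be hid" is cache-oblivious divide-and-conquer: recursion adapts to every level of the memory hierarchy without hard-coding cache sizes, and each half is an independent fine-grain task. A toy sketch of the pattern (my illustration, not HI-SPADE code; the base-case cutoff of 1024 is an arbitrary assumption):

```python
def dc_sum(a, lo=0, hi=None):
    """Sum a[lo:hi] by recursive halving.
    Small base cases stay resident at some level of the cache
    hierarchy; the two halves could run as parallel fine-grain tasks."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1024:              # arbitrary base-case size
        return sum(a[lo:hi])
    mid = (lo + hi) // 2             # independent subproblems
    return dc_sum(a, lo, mid) + dc_sum(a, mid, hi)
```

The point of the pattern is the sweet spot the slide names: the algorithm never sees cache parameters (hidden), but its recursive structure exposes the locality and parallelism the runtime needs.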

  34. IrisNet’s Two-Tier Architecture • [Architecture diagram: user queries arrive via a web server for the URL; a tier of OAs, each holding an XML database, sits above a tier of SAs running senselets over sensor feeds] • Two components: • SAs: sensor feed processing • OAs: distributed database

  35. Jeannette Wing (CMU/NSF) NSF Plans for Supporting Data-Intensive Computing • Google/IBM Data Center • ~2000 processors, large Hadoop cluster • Allocated in units of rack-weeks • NSF will review proposals for use: Cluster Exploratory (CluE) • Running Xen; won't open up performance monitoring • Goal: Show applicability outside of computer science • Academic-Industry-Government partnership

  36. Randy Bryant (CMU) Big Data Computing Study Group • Collection of ~20 people (looking for volunteers) • Goals: • Fostering educational activities • Advocacy • Building community • CCC's Big Data Computing Study Group seeks to foster collaborations between industry, academia, and the U.S. government to advance the state of the art in the development and application of large-scale computing systems for making intelligent use of the massive amounts of data being generated in science, commerce, and society.
