
Big Data: Analytics Platforms


Presentation Transcript


  1. Big Data: Analytics Platforms Donald Kossmann Systems Group, ETH Zurich http://systems.ethz.ch

  2. Why Big Data? • because bigger is smarter • answer tough questions • because we can • push the limits and good things will happen

  3. bigger = smarter? • Yes! • tolerate errors • discover the long tail and corner cases • machine learning works much better

  4. bigger = smarter? • Yes! • tolerate errors • discover the long tail and corner cases • machine learning works much better • But! • more data, more error (e.g., semantic heterogeneity) • with enough data you can prove anything • still need humans to ask right questions

  5. Fundamental Problem of Big Data • There is no ground truth • gets more complicated with self-fulfilling prophecies • e.g., stock market predictions change behavior of people • e.g., Web search engines determine behavior of people

  6. Fundamental Problem of Big Data • There is no ground truth • gets more complicated with self-fulfilling prophecies • Hard to debug: takes the human out of the loop • Example: How to play the lottery in Napoli • Step 1: You visit “oracles” who predict numbers to play • Step 2: You visit “interpreters” who explain the predictions • Step 3: After you lose, “analysts” tell you that the “oracles” and “interpreters” were right and that it was your fault. • [Luciano De Crescenzo: Thus Spake Bellavista]

  7. Why Big Data? • because bigger is smarter • answer tough questions • because we can • push the limits and good things will happen

  8. Because we can… Really? • Yes! • all data is digitally born • storage capacity is increasing • counting is embarrassingly parallel

  9. Because we can… Really? • Yes! • all data is digitally born • storage capacity is increasing • counting is embarrassingly parallel • But, • data grows faster than energy on chip • value / cost tradeoff unknown • ownership of data unclear (aggregate vs. individual) • I believe that all these “but’s” can be addressed

  10. Utility & Cost Functions of Data [charts: utility and cost plotted as functions of noise/error]

  11. Utility & Cost Functions of Data [charts: utility and cost vs. noise/error, with separate curves for curated, random, and malicious data]

  12. Best Utility/Cost Tradeoff [charts: utility and cost vs. noise/error, highlighting the malicious-data curves]
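One way to formalize the tradeoff these charts sketch (this formulation is an assumption of this transcript, not a formula from the talk): if utility U(e) and cost C(e) are both functions of the noise/error level e, the best tradeoff sits where the gap between the two curves is largest,

    e^{*} = \arg\max_{e}\,\bigl[\,U(e) - C(e)\,\bigr] \qquad\Rightarrow\qquad U'(e^{*}) = C'(e^{*}) \ \text{at an interior optimum,}

i.e., taking in noisier data pays off only as long as its marginal utility still exceeds its marginal cost.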

  13. What is good enough? [charts: utility and cost vs. noise/error, highlighting the curated-data curves]

  14. What about platforms? • Relational Databases • great for 20% of the data • not great for 80% of the data • Hadoop • great for nothing • good enough for (almost) everything (if tweaked)

  15. Why is Hadoop so popular? • availability: open source and free • proven technology: nothing new & simple • works for all data and queries • branding: the big guys use it • it has the right abstractions • MR abstracts “counting” (= machine learning) • it is an eco-system - it is NOT a platform • HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, … • relational database systems • turned into a platform depending on app / problem
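To make the “MR abstracts counting” bullet concrete, here is the classic Hadoop word-count job, a minimal sketch using the standard org.apache.hadoop.mapreduce API: it counts token occurrences, which is exactly the kind of embarrassingly parallel counting many machine-learning pipelines reduce to. Input/output paths and the job name are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (token, 1) for every token of an input line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum all the 1s emitted for the same token.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same map/shuffle/reduce skeleton carries over to most aggregation-style analytics; only the map and reduce functions change.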

  16. Example: Amadeus Log Service • HDFS for compressed logs • HBase to index by timestamp and session id • SOLR for full text search • Hadoop (MR) for usage stats & disasters • Oracle to store meta-data (e.g., user information) • Disclaimer: under construction & evaluation!!! • current production system is proprietary
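A minimal sketch of what the HBase indexing step of such a log service might look like. The slide itself notes the system is under construction and the production system is proprietary, so the table name, column family, and values below are hypothetical, chosen only to illustrate the idea: a composite row key of session id plus timestamp keeps a session's log entries contiguous and time-ordered, pointing back into the compressed logs in HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LogIndexWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("log_index"))) {   // hypothetical table
          String sessionId = "sess-4711";                  // hypothetical session id
          long timestamp = System.currentTimeMillis();     // event time of the log record
          // Composite row key "<session id>#<timestamp>" keeps a session's entries
          // adjacent and time-ordered, and still allows timestamp range scans
          // within a single session.
          byte[] rowKey = Bytes.add(Bytes.toBytes(sessionId + "#"), Bytes.toBytes(timestamp));
          Put put = new Put(rowKey);
          // Column family "d" stores a pointer into the compressed log file in HDFS
          // rather than the (large) log record itself.
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("hdfs_path"),
                        Bytes.toBytes("/logs/2013/10/part-00042.gz"));        // hypothetical path
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("offset"), Bytes.toBytes(123456L));
          index.put(put);
        }
      }
    }

Full-text search (SOLR) and usage statistics (MapReduce over the HDFS logs) would sit next to this index, with user metadata kept in Oracle, as listed on the slide.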

  17. Some things Hadoop got wrong? • performance: huge start-up time & overheads • productivity: e.g., joins, configuration knobs • SLAs: no response time guarantees, no real time • Essentially ignored 40 years of DB research 

  18. Some things Hadoop got right • scales without (much) thinking • moves the computation to the data • fault tolerance, load balance, …

  19. How to improve on Hadoop • Option 1: Push our knowledge into Hadoop? • implement joins, recursion, … • Option 2: Push Hadoop into RDBMS? • build a Hadoop-enabled database system • Option 3: Build new Hadoop components • real-time, etc. • Option 4: Patterns to compose components • log service, machine learning, … • but, do not build a “super-Hadoop”

  20. Conclusion • Focus on “because we can…” part • help data scientists to make everything work • Stick to our guns • develop clever algorithms & data structures • develop modeling tools and languages • develop abstractions for data, errors, failures, … • develop “glue”; get the plumbing right • Package our results correctly • find the right abstractions (=> APIs of building blocks)
