
Big data & Next generation analytics

Big data & Next generation analytics. Krishna Kulkarni, Keith W. Hare. ISO/IEC JTC1 SC32 Opening Plenary, May 27, 2013, Gyeongju, Korea. Introduction. Goal of this talk is to provide additional input to the discussion.


  1. ISO/IEC JTC1 SC32N2383 Big data & Next generation analytics Krishna Kulkarni Keith W. Hare ISO/IEC JTC1 SC32 Opening Plenary May 27, 2013, Gyeongju Korea

  2. ISO/IEC JTC1 SC32N2383 Introduction • Goal of this talk is to provide additional input to the discussion. • “Next Generation Analytics is essentially dealing with Big Data – with the same concepts for predictive analysis in discovering hidden patterns, discovering unknown correlations by analyzing huge volumes of transactional data and other untapped data (data mining, data warehouses, unstructured etc.), and, essentially using the same toolsets (NoSQL, Hadoop etc.).” Baba Piprani

  3. ISO/IEC JTC1 SC32N2383 Next Generation Analytics Goals • Cost of acquiring and storing data is rapidly decreasing • Enterprises are collecting huge amounts of extremely fine-grained data. • Enable enterprises to get newer actionable business insights from vast amounts of raw fine-grained data dramatically faster than is possible today

  4. ISO/IEC JTC1 SC32N2383 Sample use case – Retail • Utilize transactional and query logs collected by retail companies • Finer segmentation of customers for direct marketing campaigns • Generate differentiated pricing structures • Predict future customer demand

  5. ISO/IEC JTC1 SC32N2383 For retail use cases… • Critical to narrow time gap between: • Data acquisition • And acting on a business decision based on the data. • Referred to as: • Near Real-time Business Analytics • Or Operational Business Intelligence • For example, a retailer would • Decide on promotions for the next week based on the data collected during this week • For on-line stores, take action based on data even more quickly • Real-time marketing, e.g., as customers are walking down the street

  6. ISO/IEC JTC1 SC32N2383 Sample use case – Medical • Cancer treatment regimen • 100% effective in 80% of the patients • Completely ineffective in 20% of patients • Need to identify the 20% • Sufficient to identify correlations • Causations can come later

  7. ISO/IEC JTC1 SC32N2383 Requirements for Achieving Goals • Handling diverse data formats/structures • Handling high speed of data collection • Analytics capability beyond what is offered by traditional business intelligence • Low cost, highly scalable analytics platforms • Heterogeneous infrastructure

  8. ISO/IEC JTC1 SC32N2383 Diversity of data • Small fraction is in structured formats: relational, XML, etc. • Fair amount is semi-structured, such as web logs, etc. • Rest of the data is unstructured: text, photographs, etc. Very difficult to implement a single data model that can handle the diversity

  9. ISO/IEC JTC1 SC32N2383 Velocity of data • Continuously streaming data • Need to analyze data in-flight • Combine with data at-rest • Need a good answer quickly • A precisely correct answer • May not exist • May not be required
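
The "analyze in-flight, combine with at-rest" idea on this slide can be sketched roughly as follows. This is only an illustration, assuming each event carries a single numeric measure. The class name `RunningStats` and the sample values are hypothetical, and Welford's online algorithm is one common way to keep a running mean and variance without storing the stream, not necessarily what the presenters had in mind.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean/variance summary that is updated one event at a time."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's online update: constant work and memory per event.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two summaries (e.g. at-rest and in-flight) without raw data.
        if other.count == 0:
            return RunningStats(self.count, self.mean, self.m2)
        if self.count == 0:
            return RunningStats(other.count, other.mean, other.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.count if self.count else 0.0


at_rest = RunningStats()
for value in [10.0, 12.0, 14.0]:         # hypothetical historical data
    at_rest.update(value)

in_flight = RunningStats()
for value in [16.0, 18.0]:               # hypothetical streaming events
    in_flight.update(value)
    combined = at_rest.merge(in_flight)  # a usable answer after every event

print(combined.count, combined.mean, combined.variance)
```

The design choice here is that only small summaries are merged, so a "good answer quickly" is available after every event even though the raw stream is never stored.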

  10. ISO/IEC JTC1 SC32N2383 Analytics capability • Current technologies are not sufficient or are too static: • Business Intelligence (BI) techniques • Data Warehousing (static, batch-oriented style) • Built-in analytic functions in SQL • Data Mining • “Machine learning” viewed as • key technology • will unlock novel insights in data. • Statistical packages • Project R – open source • SAS – proprietary • SPSS – proprietary Effective leveraging of the machine learning tool kits requires understanding of probability and statistics.

  11. ISO/IEC JTC1 SC32N2383 Significant challenges in identifying deep insights from data • How to identify relevant fragments of data easily from a multitude of data sources? • How to use data cleaning techniques across multiple data sources? • How to sample results of a query progressively? • How to obtain rich visualization? Best successes so far have been vertically integrated machine learning software packages for use in specific use cases, e.g., detection of credit card fraud

  12. ISO/IEC JTC1 SC32N2383 Significant Challenges in Storing Data • Next Generation Analytics Operate on “Big Data” • Data Storage May Span • Multiple Servers • Multiple Storage sub systems • Multiple data centers • NoSQL Databases often used to store “Big Data” • Large variety of products • Diverse sets of features • No standard interface

  13. ISO/IEC JTC1 SC32N2383 Low Cost, Highly Scalable Analytics Platforms • Infrastructure based on MapReduce framework emerging as a popular retrieval and consolidation solution • However, this infrastructure is very low-level • Responsibility for exploiting the platform is on the user • Lacks much of the maturity of the relational world. Integration with existing relational/BI platforms is a must for long-term success

  14. ISO/IEC JTC1 SC32N2383 Significant Challenges for Retrieving Data • MapReduce • Framework for managing partitioned query & retrieval of distributed data • Retrieves data from distributed data stores and presents it to the analysis layer • Custom Map operation • Custom Reduce operation • No high level declarative language • Languages specific to underlying data stores No automated way to apply MapReduce to extremely complex questions
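
The custom Map and custom Reduce operations named on this slide can be illustrated with a small in-process sketch. It does not use Hadoop or any real MapReduce framework. The function names (`map_sales`, `reduce_sum`, `run_map_reduce`), the "store,amount" record format, and the sample partitions are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

KeyValue = Tuple[str, int]


def map_sales(record: str) -> Iterator[KeyValue]:
    # Custom Map: parse one "store,amount" line and emit (store, amount).
    store, amount = record.split(",")
    yield store, int(amount)


def reduce_sum(key: str, values: Iterable[int]) -> KeyValue:
    # Custom Reduce: collapse all values for one key into a single total.
    return key, sum(values)


def run_map_reduce(
    partitions: List[List[str]],
    mapper: Callable[[str], Iterator[KeyValue]],
    reducer: Callable[[str, Iterable[int]], KeyValue],
) -> Dict[str, int]:
    # Map every record in every partition, then group values by key
    # (the "shuffle" step), then reduce each group independently.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for partition in partitions:
        for record in partition:
            for key, value in mapper(record):
                grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Two hypothetical partitions, standing in for data held on different servers.
partitions = [["north,10", "south,5"], ["north,7", "east,3"]]
totals = run_map_reduce(partitions, map_sales, reduce_sum)
print(totals)
```

Even in this toy form, the user has to hand-write both operations for each question, which is the low-level burden the slide points to.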

  15. ISO/IEC JTC1 SC32N2383 Summary • Community experimentation and understanding are evolving rapidly • Need a complete ecosystem to make this all work • Standards are essential – Niche solutions will lead to vendor lock-in

  16. ISO/IEC JTC1 SC32N2383 How the pieces fit together • [Diagram; labels in the order they appear] • Statistical Analysis Engine • Machine Learning Engine • Data Retrieval & Summary • MapReduce • Big Data • NoSQL • Relational • XML

  17. ISO/IEC JTC1 SC32N2383 Sources • Chaudhuri, S., "What next?: A half-dozen data management research goals for big data and the cloud", In Proceedings of the 31st Symposium on Principles of Database Systems, ACM, 2012. • “Big Data Now: 2012 Edition”, http://oreilly.com/data/radarreports/big-data-now-2012.csp

  18. ISO/IEC JTC1 SC32N2383 Additional Discussion… • The following slides were incomplete and beyond the scope of this presentation, but worth preserving for future discussions.

  19. ISO/IEC JTC1 SC32N2383 Domain, Range, & Function • In traditional mathematics • Given a domain and a function, solve for range • Given a domain and a range, identify a function, if it exists • Example: • Given the set of pairs {(2,-3),(4,6),(3,-1),(6,6),(2,3)}, is the relation a function? • Domain of the relation is the set {2,3,4,6} • Range is {-3,-1,3,6} • Answer is no, the relation is not a function • One X value (2) produces 2 different Y values (-3 and 3)
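
The example on this slide can be checked mechanically. The sketch below uses the slide's own set of pairs; the helper name `analyze_relation` is hypothetical. A relation is a function exactly when no X value maps to more than one Y value.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# The set of pairs given on the slide.
pairs = [(2, -3), (4, 6), (3, -1), (6, 6), (2, 3)]


def analyze_relation(
    pairs: List[Tuple[int, int]],
) -> Tuple[List[int], List[int], Dict[int, List[int]]]:
    # Collect every Y value seen for each X value.
    images = defaultdict(set)
    for x, y in pairs:
        images[x].add(y)
    domain = sorted(images)
    value_range = sorted({y for _, y in pairs})
    # X values with more than one Y value break the function property.
    conflicts = {x: sorted(ys) for x, ys in images.items() if len(ys) > 1}
    return domain, value_range, conflicts


domain, value_range, conflicts = analyze_relation(pairs)
is_function = not conflicts

print("domain:", domain)
print("range:", value_range)
print("is a function:", is_function, "conflicts:", conflicts)
```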

  20. ISO/IEC JTC1 SC32N2383 In Analytics • Determine the range, given a set of candidate domains • Solve for function that will give range for candidate domains.
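
As a toy illustration of "solve for function", the sketch below fits a straight line to observed (domain value, range value) pairs by ordinary least squares. The linear form, the helper name `fit_line`, and the sample observations are all assumptions. Real analytics problems would typically need richer models and, as later slides note, must cope with incomplete data.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: candidate domain values and the observed outcomes.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)


def predict(x: float) -> float:
    # The recovered function: maps a new domain value to an estimated range value.
    return slope * x + intercept


print(slope, intercept, predict(5.0))
```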

  21. ISO/IEC JTC1 SC32N2383 National Security Example • Range: • Find candidate national security issues related to attacks on American assets • Candidate domains: • Banking records • Money flows • E-mail • Social Media Networks • Telephone Calls • Reports from human intelligence • Satellite photos • Find function(s) that use those domains to produce the range • Data is always incomplete

  22. ISO/IEC JTC1 SC32N2383 Cancer Research Example • Range • Identify patients who will not respond to specific treatment • Domains • Genotype • Health History • Family History • Geology of residence • Work history • Find function(s) that use those domains to produce the range • Data is always incomplete
