
Big Data – Analytics What makes it difficult

Kalapriya Kannan, IBM Research Labs. July 2013.


Presentation Transcript


  1. Big Data – Analytics: What makes it difficult. Kalapriya Kannan, IBM Research Labs, July 2013

  2. What is analytics? • Broadly refers to the methods of analysis • Depends on what we want to learn from the data • Method/model used to make sense of the data • Depends on the nature of the data • Example: will “Chennai Express” enter the 100 crore club? Inputs to consider: historical data, promotions, star cast, release date, budget, other events? • Analytics methods: ?

  3. An evergreen story • |INSERT MAJOR RETAILER NAME| found on |INSERT DAY OF THE WEEK| that beer and diaper sales were strongly correlated. Once this was noticed on |INSERT BI TOOL OF CHOICE|, it was found |PICK ONE|: • That diapers are too heavy for recently pregnant women, so they ask their husbands to pick them up on the way home from work, and since hubby is off the clock and ready to get his drink on, he also picks up beer. • That a diaper emergency occurs fairly late in the evening and the husband is sent out while the new mother cares for the baby. Being annoyed, he also picks up a 12-pack to relax. • The brilliant analyst at |SAME MAJOR RETAILER AS ABOVE| intuits that simply relocating beer next to diapers will lead to more beer purchases, and beer sales improve by |INSERT HIGHER %|.

  4. Another example

  5. Knowledge discovery from Big Data • Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. • Novelty discovery • Finding new, rare, one in a million (billion) (trillion) objects and events. • Class discovery • Finding new classes of objects and behaviors • Association discovery • Finding unusual (improbable) co-occurring associations
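Association discovery can be illustrated with a small sketch. This is not from the talk: it assumes a made-up list of shopping baskets and uses lift (how much more often two items co-occur than independence would predict) as one possible measure of an "unusual" association. The function name `pair_lift` and the basket contents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(a, b): lift} for every item pair that co-occurs at least once.

    lift = P(a and b) / (P(a) * P(b)); values well above 1 suggest the
    pair appears together more often than chance alone would explain.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Toy data, invented purely for illustration.
baskets = [
    ["beer", "diapers"],
    ["beer", "diapers", "chips"],
    ["milk", "bread"],
    ["milk", "diapers", "beer"],
    ["bread", "chips"],
    ["milk", "bread", "chips"],
]
lift = pair_lift(baskets)
for pair, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

Counting every pair like this does not scale to large catalogues, which is one reason the talk stresses careful algorithm choice.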

  6. It starts with... • Discovering gold dust in a desert vs. gold in a mine

  7. What matters when dealing with data (“Big Data”)? • Smart sampling of data • Reducing the original data without losing its statistical properties • Finding similar items • Efficient multidimensional indexing • Incremental updating of models • (vs. building models from scratch) • Crucial for streaming data • Distributed linear algebra • Dealing with large sparse matrices
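One way to make "smart sampling" concrete is reservoir sampling, which keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The slides do not name a specific technique, so treat this as an assumed example; `reservoir_sample` is a hypothetical helper, not something from the talk.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable.

    The stream is read once and only k items are held in memory.
    If the stream has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # keep the new item with probability k / (i + 1)
    return sample


sample = reservoir_sample(range(1_000_000), k=1000, seed=42)
print(len(sample), min(sample), max(sample))
```

A uniform sample like this preserves means and proportions only in expectation; skewed or rare-event data may call for stratified or weighted sampling instead.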

  8. On top of that... • We perform the usual data mining/machine learning/statistics operations: • Supervised learning (classification, regression) • Unsupervised learning (clustering, different types of decompositions) • We are just more careful about the algorithms we choose (typically linear or sub-linear versions).

  9. Meaningfulness of Analytics (1/2) • A risk with “big-data mining” is that an analyst can “discover” patterns that are not meaningful • Statisticians call it “Bonferroni's principle”: • Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find lots of nonsense.

  10. Meaningfulness of Analytics (2/2) • Example: • We want to find (unrelated) people who have stayed at the same hotel on the same day at least twice • 10^9 people being tracked • 1000 days • Each person stays in a hotel 1% of the time (1 day out of 100) • Hotels hold 100 people (so 10^5 hotels) • If everyone behaves randomly (i.e., no terrorists), will data mining detect anything suspicious? • Expected number of “suspicious” pairs of people: • 250,000 • Too many combinations to check: we need additional evidence to find a “suspicious” pair of people in some efficient way.
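The 250,000 figure follows from the numbers on the slide. The sketch below works through the arithmetic, assuming everyone picks hotels independently and uniformly at random; the variable names are my own.

```python
from math import comb

people = 10**9       # people being tracked
days = 1000          # days of observation
p_in_hotel = 0.01    # chance a given person is in some hotel on a given day
hotels = 10**5       # 1% of 10^9 people per day / 100 people per hotel

# P(two given people are in the same hotel on one given day):
# both are in a hotel (0.01 * 0.01), and the second picks the same one (1 / hotels).
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels

# P(that happens on two given, distinct days), assuming days are independent.
p_two_days = p_same_hotel_one_day ** 2

# The slide's round figure uses n^2 / 2 as an approximation for "n choose 2".
approx = (people**2 / 2) * (days**2 / 2) * p_two_days

# Same calculation with exact pair counts.
exact = comb(people, 2) * comb(days, 2) * p_two_days

print(f"approximate: {approx:,.0f}   exact pair counts: {exact:,.0f}")
```

Either way, purely random behaviour yields roughly a quarter of a million "suspicious" pairs, which is the point of Bonferroni's principle.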

  11. Very inefficient Early days • Early days: • Customized applications built on top of file systems • Drawbacks of using file systems to store data: • Data redundancy and inconsistency • Difficulty in accessing data • Atomicity of updates • Concurrency control • Security • Data isolation — multiple files and formats • Integrity problems

  12. Balancing..

  13. What are big analytics capabilities

  14. How is analytics done • Prepare and clean data • Explore data set: cubes and descriptive statistics • Model computation • Scoring and deployment

  15. Application Demands on Big Data • Applications demand sub-millisecond to nanosecond response times. • Examples: Tableau Desktop, financial analysis. • Low-latency reads • Low-latency writes • Fault-tolerant • Scalable • Queries are ad hoc • E.g., what is the next best investment in stocks? • Query = function(all data)

  16. Where is the problem? • Two sample programs: • Computing an average • JSON reader
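The transcript does not include the two sample programs, so the following is only a guess at what such programs might look like, not the presenter's code. It assumes newline-delimited JSON records with a hypothetical numeric field `amount`, reads them one at a time, and updates the average incrementally instead of loading everything into memory.

```python
import io
import json


def read_json_lines(handle):
    """Yield one parsed record per non-empty line of a JSON-lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def running_mean(values):
    """Return (mean, count) using a single pass and O(1) memory."""
    count = 0
    mean = 0.0
    for x in values:
        count += 1
        mean += (x - mean) / count  # incremental update, no running total needed
    return mean, count


# In-memory stand-in for a large file; the records are invented for illustration.
data = io.StringIO('{"amount": 10}\n{"amount": 20}\n\n{"amount": 60}\n')
mean, count = running_mean(rec["amount"] for rec in read_json_lines(data))
print(mean, count)
```

Even in this small form the difficulty is visible: the average is cheap to update, while parsing each record tends to dominate the cost once the input grows.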

  17. Backup

  18. Big Data Is not about size • Finds insights from complex, noisy, heterogeneous, longitudinal and voluminous data. • It aims to answer questions that were previously unanswered. • The challenges include • capturing, storing, • searching, • sharing & • analyzing.

  19. Questions from business vary

  20. Digital Marketing Data is a Mess • The problem is exacerbated by: • Most (or all) metrics not being aligned with business objectives • Disparate data sources: website, social, mobile, CRM, etc. • How do we overcome them? • Ask the question: do the numbers even matter? • Reasons why they might not: • Aesthetics • Brand value • Overarching business strategy

  21. Attributes of big data

  22. Ten common Big Data Problems • Modeling true risk • Customer churn analysis • Recommendation engine • Ad targeting • PoS transaction analysis • Analyzing network data to predict failure • Threat analysis • Trade surveillance • Search quality • Data “sandbox”

  23. Look at a Few • Isolate the metrics that matter; make sure: • They are actionable • They can be commonly interpreted • The calculations are transparent and simple • The data is easily accessible and credible • Aggregate value • Visualize it • A visualization can be worth a thousand metrics • Use best practices, but utilize unique visualizations • “Data visualizations will become the new interface to your computing experiences”

  24. Storage and Memory Lagging behind CPU

  25. Commodity hardware economics

  26. Hadoop – Harvesting Cheap computation from commodity machines
