Managing and mining smart meter data – at scale CSE Project Showcase 9 July 2013


Presentation Transcript


  1. Managing and mining smart meter data – at scale CSE Project Showcase 9 July 2013 Twitter: @cse_bristol #SmartMeterData

  2. Introduction • Contents • Introduction to the project, the data, and its applications • Managing SM data at scale • Getting valuable knowledge out of SM data • Demo: Smart Meter Analytics, Scaled by Hadoop (SMASH) • Where next? • Discussion

  3. Introduction • Project Background • “Generating Value from Smart Electricity Meter Data” • 18 Month TSB-supported collaboration • CSE, University of Bristol, SSE and Western Power Distribution • Three themes: • Managing the data at scale • Extracting useful knowledge • Integrating the above in a user-facing application

  4. Introduction The data A half-hourly timeseries for each smart meter / register Content: date, time, consumption in the half hour. For a single register: 17,520 records per year. This is what 18 months look like:

  5. Introduction • The data • EDRP: • 18 months • 16,250 smart metered households • 16,250 smart electricity meters • 9,364 smart gas meters • 670m half-hourly records (E: 420m, G: 250m) • 40GB of raw csv file data • Post rollout, per year, domestic only: • 25m smart metered households • 25m smart electricity meters • 20m smart gas meters • 800 billion half-hourly records (E: 450Bn, G: 350Bn) • 50TB of raw csv file data • EDRP ~ 0.1% of a year’s domestic data
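The record counts on this slide can be roughly reproduced from the meter counts. The sketch below is a back-of-envelope check, not the project's own calculation: it assumes every register reports all 48 half-hours of every day, ignores leap days, and assumes each EDRP meter contributed a full 18 months of data.

```python
# Back-of-envelope check of the half-hourly record counts quoted on the slide.
HALF_HOURS_PER_DAY = 48
DAYS_PER_YEAR = 365  # assumption: leap days ignored


def records_per_register(years: float) -> float:
    """Half-hourly readings produced by one meter register over `years`."""
    return HALF_HOURS_PER_DAY * DAYS_PER_YEAR * years


def total_records(registers: int, years: float) -> float:
    """Readings produced by `registers` meters, assuming no gaps in the data."""
    return registers * records_per_register(years)


# EDRP trial: 18 months of electricity and gas meters.
edrp = total_records(16_250, 1.5) + total_records(9_364, 1.5)
# Post rollout: one year of domestic electricity and gas meters.
rollout = total_records(25_000_000, 1) + total_records(20_000_000, 1)

print(f"Records per register per year: {records_per_register(1):,.0f}")
print(f"EDRP records: {edrp / 1e6:,.0f} million")
print(f"Post-rollout records per year: {rollout / 1e9:,.0f} billion")
print(f"EDRP share of one rollout year: {100 * edrp / rollout:.2f}%")
```

Under these assumptions the totals come out at roughly 673 million and 788 billion records, close to the rounded 670m and 800 billion on the slide, and the EDRP share is a little under 0.1%.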

  6. Introduction • What might we use it for? • Improve existing processes • Settlement • Billing, reconciliation, audit • Demand profiling • Customer profiling & segmentation • New processes not possible without HH data at scale • Localised prediction • Distribution network planning and modelling • Automated DSM – prediction and verification • System state detection • Individualised consumer energy services

  7. Introduction • What are the essential processes? • Ingestion – getting the data into the system • Storage – keeping it there securely • Analysis and reporting • Ad-hoc queries • Transaction reports • Descriptives and summaries (e.g. OLAP) • Mining and modelling • Visualisation

  8. Data management & processing More fundamentally: • Moving data between storage, memory and CPU • Transforming it in the CPU into desired forms • There are physical constraints on the speed of this, and they are relevant at the scale of smart meter datasets.

  9. Data management & processing Single machine RDBMS Using SQL Server to sum half-hourly consumption: • 4 bn records: ~ 1 hour • 40 bn records: ~ 10 hours • 1 year’s worth: ~ 200 hours
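The three timings are consistent with a simple linear extrapolation. The sketch below assumes that run time scales linearly with record count, and that "1 year's worth" means the 800 billion post-rollout records quoted on slide 5; neither assumption is stated on this slide.

```python
# Linear extrapolation from the quoted SQL Server timing (4 bn records in ~1 hour).
RECORDS_PER_HOUR = 4e9  # taken from the slide's first data point


def estimated_hours(records: float, rate: float = RECORDS_PER_HOUR) -> float:
    """Estimated hours to sum `records` readings, assuming linear scaling."""
    return records / rate


ONE_YEAR_RECORDS = 800e9  # assumption: the post-rollout annual figure from slide 5

for label, records in [("4 bn", 4e9), ("40 bn", 40e9), ("1 year", ONE_YEAR_RECORDS)]:
    print(f"{label}: ~{estimated_hours(records):.0f} hours")
```

In practice a single machine may scale worse than linearly once the data no longer fits in memory, so this is best read as a lower bound.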

  10. Data management & processing Single machine RDBMS Problem: the throughput of a single machine has not kept up with the growth in the size of datasets.

  11. Data management & processing Single machine RDBMS Problem: the throughput of a single machine has not kept up with the growth in the size of datasets. Solution: harness multiple individual machines (‘horizontal scaling’).

  12. Data management & processing Single machine RDBMS Problem: the throughput of a single machine has not kept up with the growth in the size of datasets. Solution: harness multiple individual machines. Problem: this is difficult and expensive using traditional relational database applications

  13. Data management & processing Solution Move away from traditional databases and use a purpose-designed (‘big data’) framework to get horizontal scaling: • 1 machine: ~£10k, 2.5GHz, 1 GB/s, 100MB/s, ~ a week

  14. Data management & processing Solution Move away from traditional databases and use a purpose-designed (‘big data’) framework to get horizontal scaling: • 10 node cluster: ~£50k, 25GHz, 10 GB/s, 1 GB/s, ~ a day • 1 machine: ~£10k, 2.5GHz, 1 GB/s, 100MB/s, ~ a week

  15. Data management & processing Solution Move away from traditional databases and use a purpose-designed (‘big data’) framework to get horizontal scaling: • 100 node cluster: ~£300k, 250GHz, 100 GB/s, 10 GB/s, ~ an hour • 10 node cluster: ~£50k, 25GHz, 10 GB/s, 1 GB/s, ~ a day • 1 machine: ~£10k, 2.5GHz, 1 GB/s, 100MB/s, ~ a week
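The "a week / a day / an hour" figures can be approximated by a single full scan of the data. The sketch below is an interpretation rather than the presenters' own calculation: it assumes the last throughput figure in each row (100MB/s, 1 GB/s, 10 GB/s) is the aggregate disk read rate, that the job is one pass over the 50TB annual dataset from slide 5, and that units are decimal.

```python
# Time for one full pass over a year's data at each configuration's read rate.
DATASET_BYTES = 50e12  # assumption: the 50TB annual csv figure from slide 5

# Assumption: the final throughput figure per row is aggregate disk read rate.
READ_RATE_BYTES_PER_SECOND = {
    "1 machine": 100e6,
    "10 node cluster": 1e9,
    "100 node cluster": 10e9,
}


def scan_hours(dataset_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to read the whole dataset once at the given rate."""
    return dataset_bytes / bytes_per_second / 3600


for name, rate in READ_RATE_BYTES_PER_SECOND.items():
    hours = scan_hours(DATASET_BYTES, rate)
    print(f"{name}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

On these assumptions the single machine needs just under six days, the 10 node cluster about 14 hours, and the 100 node cluster under an hour and a half, which matches the order of magnitude on the slide.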

  16. Data management & processing Hadoop Designed to solve the problem of exponentially growing data volumes (originally, Google’s searchable copy of the web). Harnesses a large number of commodity machines with low-cost networking and storage. The software takes a job (a query, a calculation, or similar) and ‘maps’ it out across the cluster. In parallel, each node locally processes a subset of the problem, before the results are ‘reduced’ back to a single dataset (hence ‘Map/Reduce’).
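The map and reduce steps described above can be illustrated in a single process. This is a minimal sketch of the pattern only, not Hadoop's API and not the project's code: the record layout `(meter_id, timestamp, kwh)`, the sample readings and the function names are all hypothetical, and the two lists stand in for data held on two cluster nodes.

```python
from collections import defaultdict


def map_record(record):
    """Map step: turn one reading into a (key, value) pair keyed by meter."""
    meter_id, _timestamp, kwh = record
    return meter_id, kwh


def shuffle(pairs):
    """Group mapped values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(meter_id, values):
    """Reduce step: collapse all values for one meter into a single total."""
    return meter_id, sum(values)


# Hypothetical readings, split across two "nodes".
node_a = [("m1", "2013-01-01 00:00", 0.5), ("m2", "2013-01-01 00:00", 0.25)]
node_b = [("m1", "2013-01-01 00:30", 0.75), ("m2", "2013-01-01 00:30", 0.5)]

# On a real cluster each node would map its own partition in parallel.
mapped = [map_record(r) for partition in (node_a, node_b) for r in partition]
totals = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

Summing consumption per meter fits this pattern well because each reading can be mapped independently and partial sums combine in any order.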

  17. Data management & processing Experiments: SQL server Single high performance machine: bottlenecked by the speed of the hard drive ~ 400GB

  18. Data management & processing Experiments: Hadoop 11 node physical cluster (~£50k hardware cost) ~2,500GB

  19. Data management & processing Experiments compared Not straightforward to get SQL Server to run over ~ 10Bn records. ~2,500GB

  20. Data management & processing Experiments: growing the cluster Fixed dataset size of 500m records

  21. Data management & processing • Hadoop • Pros • Open source software – free and customisable • Adjustable data redundancy (data is replicated over the cluster) • Incrementally scalable – on both performance and cost measures: just add machines, system adapts automatically. • Responsive and cooperative developer community • Cons • Not the last word in user-friendliness (but this is changing) • Sledgehammer to crack a nut below a certain scale • Less mature (but rapidly developing) software ecosystem • Algorithms must fit the framework • Conclusion: low cost option for smart meter data processing

  22. Data mining and visualisation • Finding value in the data • Improve existing processes • Settlement • Billing, reconciliation, audit • Demand profiling • Customer profiling & segmentation • New processes not possible without HH data at scale • Localised prediction • Distribution network planning and modelling • Automated DSM – prediction and verification • System state detection • Individualised consumer energy services

  23. Data mining and visualisation Finding value in the data Collaborative approach with industry partners to identify business needs Focus on: (1) Datamining for subgroup discovery – classifying end users (2) Cluster analysis on demand data – finding profiles (3) Innovative visualisation of consumption data and datamining results

  24. Data mining and visualisation • Subgroup discovery • “Pattern features”: 14 variables describing each household • Income, geography, access to gas, size of house, value of house etc. • “Target features”: describe the behaviour of interest • Profile error: how different is usage from the assigned profile? • Outputs: • groups of households with significantly different profile errors
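As a rough illustration of the idea, the sketch below scores single-feature subgroups by how far their mean profile error departs from the overall mean, weighted by the square root of subgroup size. This is an assumed, simplified scoring rule, not the algorithm used in the project; the households, feature names and error values are invented for the example.

```python
from math import sqrt
from statistics import mean

# Hypothetical households: two pattern features and a % annual profile error.
households = [
    {"has_gas": True, "house_size": "small", "profile_error": 5.0},
    {"has_gas": True, "house_size": "large", "profile_error": 10.0},
    {"has_gas": False, "house_size": "small", "profile_error": 40.0},
    {"has_gas": False, "house_size": "large", "profile_error": 45.0},
    {"has_gas": True, "house_size": "small", "profile_error": 0.0},
    {"has_gas": False, "house_size": "large", "profile_error": 50.0},
]


def score_subgroups(rows, features, target):
    """Rank 'feature == value' subgroups by deviation of their mean target."""
    overall = mean(r[target] for r in rows)
    scored = []
    for feature in features:
        for value in sorted({r[feature] for r in rows}, key=str):
            members = [r[target] for r in rows if r[feature] == value]
            # Assumed score: larger subgroups with bigger deviations rank higher.
            score = sqrt(len(members)) * abs(mean(members) - overall)
            scored.append((score, feature, value, mean(members)))
    return sorted(scored, key=lambda s: s[0], reverse=True)


ranked = score_subgroups(households, ["has_gas", "house_size"], "profile_error")
for score, feature, value, subgroup_mean in ranked:
    print(f"{feature}={value}: mean error {subgroup_mean:.1f}%, score {score:.1f}")
```

In this toy data, access to gas separates households with low and high profile error more sharply than house size does, so both `has_gas` subgroups rank first.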

  25. Data mining and visualisation Subgroup discovery Looking at % annual profile error against sociodemographics

  26. Data mining and visualisation Subgroup discovery Looking at % annual profile error against sociodemographics

  27. Data mining and visualisation Subgroup discovery Looking at % annual profile error against sociodemographics

  28. Data mining and visualisation Subgroup discovery Looking at % annual profile error against sociodemographics

  29. Data mining and visualisation Subgroup discovery Looking at % annual profile error against sociodemographics

  30. Data mining and visualisation Clustering Can we use demand data to create better profiles? Define target features: waveform’s properties of interest Two examples: using imposed and emergent properties. Each using 3 clusters.

  31. Data mining and visualisation Clustering E.g. 1: the average weekday as 5 pairs of numbers. (Chart axes: consumption, not to scale, against time of day in half hours from midnight.)
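A clustering of this kind might look like the sketch below, which groups households into 3 clusters with a plain k-means. The slides do not say which clustering algorithm was used or how the 5 pairs were derived, so both are assumptions here: each household is reduced to 5 consumption values at fixed times of day, the profiles are invented, and the initial centroids are simply the first 3 profiles so that the run is repeatable.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic initialisation from the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
    return labels, centroids


# Hypothetical average-weekday profiles: consumption at 5 fixed times of day.
profiles = [
    [0.20, 0.80, 0.30, 0.30, 0.90],  # morning and evening peaks
    [0.20, 0.20, 0.20, 0.20, 1.50],  # evening peak only
    [0.60, 0.60, 0.60, 0.60, 0.60],  # flat all day
    [0.25, 0.85, 0.30, 0.35, 0.95],
    [0.20, 0.25, 0.20, 0.25, 1.40],
    [0.55, 0.60, 0.65, 0.60, 0.60],
]

labels, centroids = kmeans(profiles, k=3)
print(labels)
```

Each resulting centroid is itself a 5-value profile, so it could be read as a candidate replacement for a standard profile class.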

  32. Data mining and visualisation Clustering E.g. 2: Frequency spectrum of the demand timeseries
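To show what a frequency spectrum of a demand timeseries contains, the sketch below computes a direct discrete Fourier transform of a synthetic week of half-hourly readings. The signal is invented (a constant base load plus one daily cycle), and the slides do not say how the project computed or normalised its spectra, so this illustrates the feature rather than the method.

```python
import cmath
import math


def amplitude_spectrum(series):
    """Magnitude of each DFT coefficient up to the Nyquist frequency, scaled by 1/n."""
    n = len(series)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


SAMPLES_PER_DAY = 48  # half-hourly readings
DAYS = 7

# Hypothetical demand: 1.0 base load plus a daily sinusoidal cycle of amplitude 0.5.
series = [
    1.0 + 0.5 * math.sin(2 * math.pi * t / SAMPLES_PER_DAY)
    for t in range(DAYS * SAMPLES_PER_DAY)
]

spectrum = amplitude_spectrum(series)
# Skip k = 0, which holds the mean level rather than a cycle.
dominant = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
cycles_per_day = dominant / DAYS
print(f"Dominant component: {cycles_per_day:.1f} cycle(s) per day")
```

Real households would show additional peaks, for example at two cycles per day for morning and evening peaks, and the relative sizes of those peaks are the kind of emergent property a clustering could use.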

  33. Data mining and visualisation Cluster analysis Project competition results (the University won)

  34. Data mining and visualisation Conclusions from datamining • Subgroup discovery results suggest the approach is useful as long as you have metadata on the households • Cluster analysis work suggests it is possible to improve on the standard profile classes using SM data • Further work needs to be carried out on more representative datasets • There are many other potential applications!

  35. The SMASH application Web application • Installation of Hadoop on UoB and CSE clusters • 11 node physical cluster at the university (£50k) • 8 node virtual cluster at CSE (£15k) • Integration of a range of Hadoop-friendly data management components • Development of a proof-of-concept web application for user interaction, job management, visualisation etc. • Deployment on both clusters

  36. The SMASH application Web application Currently running on the CSE virtual Hadoop cluster

  37. Generating Value from SM Data • Where next? • We have a proof-of-concept system developed with TSB R&D funding support. • We have mastered the underlying technologies and established that this approach has the potential to be a low-cost solution to a number of industry data challenges. • On a technical level the next steps are to • Further develop the web application • Refine the datamining algorithms (with more data) • Implement selected DM algorithms directly on the cluster • On a policy/programme level we want to ensure this knowledge is incorporated into SM rollout infrastructure decision making.

  38. Questions and discussion @cse_bristol #SmartMeterData

  39. Contacts: Simon Roberts simon.roberts@cse.org.uk Joshua Thumim joshua.thumim@cse.org.uk Web: www.cse.org.uk Sign up to our monthly e-news through our website Follow us on Twitter @cse_bristol