1 / 19

Enhancing Hive with a Cost-Effective Multi-dimensional Range Index for Smart Grid Data

This paper presents DGFIndex, a cost-effective multi-dimensional range index for Smart Grid data in Hive. It overcomes limitations of traditional indexing methods and improves query performance for large-scale smart meter data. Experimental results demonstrate its effectiveness and cost efficiency.

bernadinel
Download Presentation

Enhancing Hive with a Cost-Effective Multi-dimensional Range Index for Smart Grid Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DGFIndex for Smart Grid: Enhancing Hive with a Cost Effective Multi-dimensional Range Index Yue Liu, Songlin Hu*, Tilmann Rabl, Wantao Liu, Hans-Arno Jacobsen, Kaifeng Wu, Jian Chen, Jintao Li MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG 国网电力科学研究院 STATE GRID ELECTRIC POWER RESEARCH INSTITUTE 国网浙江省电力公司 STATE GRID ZHEJIANG ELECTRIC POWER COMPANY

  2. Outline • Big Data challenges in Smart Grid • DGFIndex design • Experiments on smart grid data • Conclusions

  3. Big Data Challenges in Smart Grid Smart Meter • Example • 22 million smart meters in Zhejiang province, as required by China State Grid, it should be 96 measurements/day • Will be 2.1 billion records in a single table The Electricity ConsumptionInformation Collection System of the Grid Collector RDBMS Figure 1 Data flow in State Grid

  4. Features of Smart Meter Data • Smart meter data features: • Time stamp field • Append only • Schema is static Table 1 An example format of smart meter data • Queries features: • Multi-dimensional range query • Lots of aggregation query • Query examples: • What is the average power consumption of user ids in the range 100 to 1000 and dates in the ranges “2013-01-01” to “2013-02-01”?

  5. Why Migrate to Hadoop/Hive • Limitations of RDBMS • Low write throughput • Weak scalability • High license cost 16 times faster with only 1/10 cost Figure 2 Write throughput comparison of RDBMS and HDFS Hadoop/Hive is a good choice for solving smart meter big data problem • Hadoop and Hive • High write throughput • Flexible scalability • Low budget and cost effective

  6. Indexes in Hive • Index • Compact Index, Aggregate Index and Bitmap Index • Store all combinations of index dimensions and location • Data partition • Each partition is a directory, reorganize data into different directory Table 2 3-dimensional compact index • Disadvantages on multi-dimensional range queries when large number of distinct value in index dimensions • Index: large index table size • Data partition: lots of directories and small files

  7. Limitations of Indexing in Hive • Limitations: • Storing combination of index dimensions leads to extremely large index table • High selectivity leads to large temporary file, which may overflow the JobTracker’s memory • Poor filter effect when the value of index dimension scattered evenly in data file Temporary File File:Offset 2 4 1 Predicate 3 JobTracker InputFormat.getSplits 5 Chosen Splits 6 MR Job Scan Index Table Figure 3 Query with current index

  8. Recall Data Features • Smart meter data features: • Time stamp field • Append only • Schema is static • Queries features: • Multi-dimensional range query • Lots of aggregation query

  9. DGFIndex Design data file on HDFS • Using grid file to split logical data space into units(GFU) • Data in same GFU is stored together in the file of HDFS, named Slice • GFU is stored as a GFUKey/GFUValue pair in key/value store • GFUKey is the left lower coordinate of GFU in the data space • GFUValue consists of header and location • Header contains some pre-computed aggregation values • Location is the start and end offset of corresponding data segment in file of HDFS 2-dimensional DGFIndex aggregation values Slice dimension Y GFUKey GFUValue header location GFU dimension X Figure 4 DGFIndex architecture

  10. DGFIndex Construction Input Map Phase Reduce Phase Output Splitting Policy DGFIndex Table test test after reorganization A:9|B:14 7_13 Reducer Mapper Input:{36,9|14|0.8} Input:{7_13,<9|14|0.8,8|13|0.2>} Output:{null,<9|14|0.8,8|13|0.2>} Output:{7_13,9|14|0.8} Figure 6 DGFIndex construction

  11. DGFIndex Query SELECT SUM(C) FROM test WHERE A>=5 AND A<12 AND B>=12 AND B<16 1.0 Step 1 (Hive) overlapped? test:0 test:0 2.2 combine Step 2 (InputFormat.getSplits) chosen splits all splits Step 3 (Reader.next) 1.2 test Figure 7 DGFIndex query

  12. Advantages of DGFIndex • Smaller index size • High index read speed, selective • Pre-computation Figure 8 DGFIndex architecture

  13. Experiments • Comparison System • Hive with Compact Index, HadoopDB • Environment • Hardware • 29 virtual nodes, each has 8 cores, 8GB RAM, 300GB disk • Software • CentOS 6.5 b4bit, Jdk 1.6.0_45 64bit, Hadoop-1.2.1, HBase-0.94.13 • DGFIndex is implemented in Hive-0.10.0 • Replication factor is 2 in HDFS, mapred.task.io.sort.mb=512MB • PostgreSQL 8.4.20 for HadoopDB • Data Set and Query • Real meter data (1TB in TextFile and 890GB in RCFile, no compression) and ad-hoc queries from Zhejiang Grid • Lineitem table (518GB in TextFile and 468GB in RCFile, no compression) and Q6 from TPC-H

  14. Index Size and Construction Time DGFIndex construction costs more time, but has smaller size Table 3 The number of distinct value in index dimensions Table 4 The number of intervals in UserId dimension Table 5 Index size and construction time

  15. Aggregation Query SELECTSUM(powerConsumed) FROM meterdata WHERE regionId>r1 AND regionId<r2 AND userId>u1 AND userId<u2 AND time>t1 AND time<t2 For aggregation query, DGFIndex is 2-50 faster than Compact Index and HadoopDB Figure 9 Point query Figure 10 5% selectivity Figure 11 12% selectivity Listing 5 Aggregation query Table 6 number of records needed to read after being filtered by index

  16. GroupyBy Query SELECT time,SUM(powerConsumed) FROM meterdata WHERE regionId>r1 AND regionId<r2 AND userId>u1 AND userId<u2 AND time>t1 AND time<t2 GROUP BY time For non-aggregation query, DGFIndex is 2-5 faster than Compact Index and HadoopDB, only need to read 1/40-1/5 data of Compact Index. Figure 12 Point query Figure 13 5% selectivity Figure 14 12% selectivity Listing 6 GroupBy query Table 7 number of records needed to read after being filtered by index

  17. TPC-H Data Set and Q6 SELECTSUM(l_extendedprice*l_discount) as revenue FROM lineitem WHERE l_shipdate>=date’[DATE]’ AND l_shipdate<date’[DATE]’+interval ‘1’ year AND l_discount between [DISCOUNT]-0.01 and [DISCOUNT]+0.01 AND l_quantity<[QUANTITY] DGFIndex is also efficient for general case data Listing 8 Q6 from TPC-H Figure 16 Q6 from TPC-H cost time Table 10 number of records needed to read after being filtered by index Table 8 Index size and construction time

  18. Conclusions • Multi-dimensional range index is essential for Hive-based smart meter data processing • We propose a cost effective multi-dimensional range index for Hadoop/Hive • Experimental results show the efficiency of our DGFIndex

  19. Thanks Questions?

More Related