SEWI ZG514 Data Warehousing

SEWI ZG514 Data Warehousing • Performance Enhancing Techniques: • Partitioning Strategy • Aggregation • Purushotham BV • utham74@gmail.com

Performance Enhancing Techniques • Partitioning Strategy • Introduction • Horizontal Partitioning • Vertical Partitioning • Hardware Partitioning • Which Key to Partition by? • Sizing the Partition • Aggregations • Introduction • Why Aggregate? • What is an Aggregation? • Designing Summary Tables • Which Summaries to create? • Summary

Partitioning • Partitioning is performed for a number of performance-related and manageability reasons, and the strategy as a whole must balance all the various requirements. • Partitioning is needed in any large data warehouse to improve performance and manageability. • It lets query redirection send each query to the appropriate partition, thereby reducing the overall time taken for query processing. • Three types: • Horizontal Partitioning. • Vertical Partitioning. • Hardware Partitioning.

Horizontal Partitioning • The table is split by rows: the first few thousand entries form one partition, the next few thousand entries form the next, and so on. • This is because, in most cases, not all the information in the fact table is needed all the time. • Thus horizontal partitioning reduces query access time by directly cutting down the amount of data the queries must scan. • Horizontally partitioning the fact table is a good way to speed up queries, because it minimizes the set of data to be scanned (without using an index). • Partition the fact table into segments.

Horizontal Partitioning (Contd.,) • The segments may be of different sizes, because the number of transactions within the business is not the same at every point in the year. • Example • If the sales fact table is partitioned monthly, the segments for peak periods, such as Christmas, hold a higher transaction volume.

Horizontal Partitioning • Fact data can be partitioned in various ways; before deciding on the optimum solution, we have to consider the manageability requirements of the data warehouse. • Partitioning by Time into Equal Segments. • Partitioning by Time into Different-Sized Segments. • Partitioning on a Different Dimension. • Partitioning by Size of Table. • Using Round Robin Partitions.

Partitioning by Time into Equal Segments • Partition the fact table on a time-period basis. • Example • Partitioning into monthly segments. • The number of tables does not exceed the order of 500. • A number of the partitions will store transactions over a busy period in the business, and the rest may be substantially smaller. • Partitioning by months, years, etc. is the most straightforward method.

Partitioning by Time into Equal Segments (Contd.,) • This helps when queries frequently ask about fortnightly or monthly performance, sales, etc.
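
The monthly scheme above can be illustrated with a small in-memory sketch. It is not how any particular database implements partitioning; the names `partition_key`, `build_partitions` and `monthly_total`, and the sample sales rows, are assumptions made up for this example. The point it shows is that a monthly query reads only one segment instead of scanning the whole fact table.

```python
from collections import defaultdict
from datetime import date


def partition_key(d):
    """Return the monthly partition label for a date, e.g. '2023-12'."""
    return f"{d.year:04d}-{d.month:02d}"


def build_partitions(rows):
    """Group fact rows (sale_date, amount) into one list per month."""
    partitions = defaultdict(list)
    for sale_date, amount in rows:
        partitions[partition_key(sale_date)].append((sale_date, amount))
    return partitions


def monthly_total(partitions, year, month):
    """Answer a monthly query by scanning only that month's partition."""
    label = f"{year:04d}-{month:02d}"
    return sum(amount for _, amount in partitions.get(label, []))


# Hypothetical fact rows: December is the busy period.
sales = [
    (date(2023, 11, 5), 100.0),
    (date(2023, 12, 20), 250.0),
    (date(2023, 12, 24), 400.0),
    (date(2024, 1, 3), 80.0),
]
partitions = build_partitions(sales)
december_total = monthly_total(partitions, 2023, 12)
```

Note that the December partition is larger than its neighbours, which is the size imbalance the slides go on to discuss.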

Advantages and Disadvantages • The advantage is that the slots are reusable. • Suppose we are sure that we will no longer need the data from 10 years back; we can then simply delete the data in that slot and use it again. • There is a serious drawback in this scheme if the partitions tend to differ too much in size. • The number of visitors to a hill station, say, will be much larger in the summer months than in the winter months, and hence each segment must be big enough to take care of the summer rush. • Partitioning tables into same-sized segments would therefore mean wasted space in the segments that hold winter-month data.

Partitioning by Time into Different-Sized Segments • Three monthly partitions for the last three months (including current month). • One quarterly partition for the previous quarter. • One half-year partition for the remainder of the year.
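
A minimal sketch of this layout, assuming that "previous quarter" means the three months just before the monthly partitions and that anything older than a year goes to a hypothetical `archive` segment (neither assumption comes from the slides). The function `segment_for` and the segment labels are illustrative names only.

```python
from datetime import date


def months_between(today, d):
    """Whole calendar months from d's month to today's month."""
    return (today.year - d.year) * 12 + (today.month - d.month)


def segment_for(today, d):
    """Map a transaction date to a segment, relative to the current month."""
    age = months_between(today, d)
    if age < 0:
        raise ValueError("transaction date is in the future")
    if age <= 2:
        # Last three months, including the current one: one segment each.
        return f"month_{d.year:04d}_{d.month:02d}"
    if age <= 5:
        return "previous_quarter"
    if age <= 11:
        return "half_year"
    return "archive"  # hypothetical: outside the scheme described above


today = date(2024, 12, 15)
labels = [segment_for(today, date(2024, m, 1)) for m in range(1, 13)]

# One month later the same October row no longer gets its own segment,
# which is why this scheme needs regular repartitioning.
october_next_month = segment_for(date(2025, 1, 15), date(2024, 10, 1))
```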

Advantages and Disadvantages • Advantages: • Detailed information remains available online, without having to resort to using aggregations. • The number of physical tables is kept relatively small, reducing operating costs. • This technique may be particularly appropriate in environments that require a mix of data dipping into recent history and data mining through the entire history. • Disadvantages: • The partitioning profile will change on a regular basis. • Repartitioning will increase the operational cost of the data warehouse.

Partitioning on a Different Dimension • Stored data need not always be partitioned by time, though that is a very safe and relatively straightforward method. • It can be partitioned by region of operation, by item under consideration, or by any other such dimension. • This works well when most of the queries are likely to be region-wise, such as region-wise performance or region-wise sales.

Partitioning on a Different Dimension (Contd.,) • If we are interested in the total performance of all regions, the total sales of a month, the total sales of a product, etc., then region-wise partitioning could be a disadvantage, since each such query will have to move across several partitions.

Partitioning by Size of Table • In some cases we will not be sure of any dimension on which partitions can be made: neither time, nor products, nor regions, etc. • We are also not sure of the type of queries that we are likely to encounter frequently. • In such cases, it is ideal to partition by size. • Data is loaded until a pre-specified amount of storage is consumed, and then a new partition is created. • However, this creates a very complex situation, similar to simply dumping objects in a room. • Normally, metadata (data about data) is needed to keep track of which data is stored in each of the partitions.
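
The size-based scheme and the metadata it requires might be sketched as follows. A row count stands in for the "pre-specified amount of storage", and `load_by_size`, `find_partition` and the metadata layout are hypothetical names chosen for this illustration.

```python
def load_by_size(rows, max_rows_per_partition):
    """Fill a partition until it reaches the size limit, then start a new one.

    Returns the partitions and a metadata list recording which row ids
    ended up in which partition.
    """
    partitions = []
    metadata = []
    for row_id, row in enumerate(rows):
        if not partitions or len(partitions[-1]) >= max_rows_per_partition:
            partitions.append([])
            metadata.append(
                {"partition": len(partitions) - 1,
                 "first_row": row_id,
                 "last_row": row_id}
            )
        partitions[-1].append(row)
        metadata[-1]["last_row"] = row_id
    return partitions, metadata


def find_partition(metadata, row_id):
    """Use the metadata to locate the partition holding a given row id."""
    for entry in metadata:
        if entry["first_row"] <= row_id <= entry["last_row"]:
            return entry["partition"]
    return None


partitions, metadata = load_by_size(list(range(100, 110)),
                                    max_rows_per_partition=4)
```

Without the metadata there is no way to tell which partition holds a row, because the partition boundaries carry no business meaning.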

Using Round Robin Partitions • Once the warehouse is holding its full amount of data, a new partition can be created only by reusing the oldest partition. • Metadata is then needed to note the beginning and end of the historical data. • This method, though simple, may run into trouble if the partitions are not all the same size. • Special techniques to hold the overflowing data may become necessary.
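
As a rough sketch of the round-robin idea, the class below keeps a fixed number of slots and overwrites the oldest one when a new period arrives. The class name `RoundRobinStore` and its methods are assumptions for illustration; overflow handling for unequal partition sizes is deliberately left out.

```python
class RoundRobinStore:
    """Fixed number of partition slots; the oldest slot is reused when full."""

    def __init__(self, slot_count):
        self.slots = [None] * slot_count  # each slot: (period_label, rows)
        self.next_slot = 0                # slot that will be written next

    def add_period(self, label, rows):
        """Store a new period and return the label it evicted, if any."""
        evicted = self.slots[self.next_slot]
        self.slots[self.next_slot] = (label, list(rows))
        self.next_slot = (self.next_slot + 1) % len(self.slots)
        return evicted[0] if evicted else None

    def history_range(self):
        """Metadata: (oldest_label, newest_label) currently held."""
        if all(slot is None for slot in self.slots):
            return None
        newest = self.slots[(self.next_slot - 1) % len(self.slots)]
        # When the store is full, next_slot points at the oldest period;
        # before that, the oldest period is still in slot 0.
        oldest = self.slots[self.next_slot] or self.slots[0]
        return oldest[0], newest[0]


store = RoundRobinStore(3)
evicted = [store.add_period(label, []) for label in
           ["2021", "2022", "2023", "2024"]]
held = store.history_range()
```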

Vertical Partitioning • As the name suggests, a vertical partitioning scheme divides the table vertically, i.e. each row is divided into two or more partitions.

Vertical Partitioning (Contd.,) • Consider the following table:

Normalization • The usual approach to normalization in database applications is to divide the data into two or more tables, such that updating the data in one of them does not lead to data anomalies.

Row Splitting • The method involves identifying the less frequently used fields and putting them into another table. • This ensures that the frequently used fields can be accessed with much less computation time.
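
Row splitting can be sketched with plain dictionaries. The field names, the sample customer rows and the helpers `split_rows` and `full_row` are all hypothetical; the sketch only shows that the frequently used fields end up in a narrower table, and that the original row can still be rebuilt through the shared key.

```python
def split_rows(rows, key, hot_fields):
    """Split each row into a frequently used part and a rarely used part.

    Both parts are indexed by the same key so they can be rejoined.
    """
    hot, cold = {}, {}
    for row in rows:
        k = row[key]
        hot[k] = {f: row[f] for f in hot_fields}
        cold[k] = {f: v for f, v in row.items()
                   if f != key and f not in hot_fields}
    return hot, cold


def full_row(hot, cold, key, k):
    """Rebuild the original row by joining the two parts on the key."""
    return {key: k, **hot[k], **cold[k]}


customers = [
    {"id": 1, "name": "Asha", "balance": 120.0,
     "address": "12 Hill Road", "notes": "prefers email"},
    {"id": 2, "name": "Ravi", "balance": 75.5,
     "address": "4 Lake View", "notes": ""},
]
hot, cold = split_rows(customers, key="id", hot_fields=("name", "balance"))
rejoined = full_row(hot, cold, "id", 1)
```

Queries that need only `name` and `balance` read the smaller `hot` table; the join is paid only by the rarer queries that need every field.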

Hardware Partitioning • The data warehouse design process should try to maximize the performance of the system. • One of the ways to ensure this is to optimize the database design with respect to a specific hardware architecture. • The exact details of the optimization depend on the hardware platform. • Normally the following guidelines are useful: • maximize the processing power available, • maximize disk and I/O operations, • reduce bottlenecks at the CPU and in I/O throughput.

Maximizing the Processing and Avoiding Bottlenecks • One of the ways of ensuring faster processing is to split the data query into several sub-queries, convert them into threads and run them in parallel. • This method works only when there are enough processors, or enough processing power, to ensure that the threads can actually run in parallel. • Example: • To run five threads, it is not always necessary to have five processors. • A smaller number of processors can do the job, provided they are fast enough to avoid bottlenecks at the processor. • Shared architectures are ideal for such situations, because one can be almost sure that sufficient processing power is available most of the time.

Maximizing the Processing and Avoiding Bottlenecks • In such a networked environment, where each of the processors is able to access data on several active disks, several problems of data contention and data integrity need to be resolved.

Striping Data Across MPP Nodes • This mechanism distributes the data by dividing a large table into several smaller units and storing each of them on a different disk. • These sub-tables need not be of equal size, but are distributed so as to ensure optimum query performance. • The trick is to ensure that the queries are directed to the respective processors, which access the corresponding data disks to service the queries.

Striping Data Across MPP Nodes (Contd.,) • The method is unsuitable for smaller data volumes.

Horizontal Hardware Partitioning • This technique spreads the processing load by horizontally partitioning the fact table into smaller segments and physically storing each segment on a different node. • When a query needs to access several partitions, the access is done in a way similar to the methods above. • If the query is parallelized, each sub-query can run on a different node.

Horizontal H/w Partitioning (Contd.,) • This technique will minimize the traffic on the network.
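
A toy simulation of a parallelized query over horizontally partitioned segments is sketched below. Threads from the standard `concurrent.futures` module stand in for separate nodes, and the node names, sample rows and functions `node_total` and `parallel_total` are assumptions for illustration. On a real MPP system each sub-query would run on the node that physically holds the segment.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: each "node" holds one segment of the fact table
# as (region, amount) rows.
nodes = {
    "node_a": [("north", 10.0), ("south", 5.0)],
    "node_b": [("north", 7.5), ("east", 2.0)],
    "node_c": [("south", 1.0), ("north", 2.5)],
}


def node_total(rows, region):
    """Sub-query: total for one region within a single node's segment."""
    return sum(amount for r, amount in rows if r == region)


def parallel_total(nodes, region):
    """Run one sub-query per node in parallel and combine the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(
            pool.map(lambda rows: node_total(rows, region), nodes.values())
        )
    return sum(partials)


north_total = parallel_total(nodes, "north")
```

Only the small partial totals travel back to the coordinator, not the rows themselves.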

Which Key to Partition By? • Choosing the right partitioning key is crucial. • If the wrong key is chosen, you will eventually end up having to totally reorganize your fact data.

Which Key to Partition By? (Contd.,) • We could choose to partition on any key, for example: • region • transaction_date • Suppose the business is organized into 20 geographical regions, each with a varying number of branches of different sizes. • Partitioning by region leads to 20 partitions, which is reasonable. • This is a good partitioning scheme, because the vast majority of queries are restricted to the user's own business region.

Which Key to Partition By? (Contd.,) • If we partition by transaction_date rather than region: • All the latest transactions from every region will be in one partition. • This is a poor fit, because a user who wants data for one region has to look across multiple partitions. • So partitioning by region is better for this workload.
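
The comparison above can be made concrete by counting how many partitions a single-region query touches under each key. The sample rows and the helper `partitions_touched` are hypothetical, and a month label stands in for transaction_date.

```python
def partitions_touched(rows, partition_field, region):
    """Count the distinct partitions that hold rows for the given region."""
    return len({row[partition_field] for row in rows
                if row["region"] == region})


# Hypothetical fact rows: two regions trading in each of three months.
rows = [
    {"region": region, "month": month}
    for region in ("north", "south")
    for month in ("2024-01", "2024-02", "2024-03")
]

by_region = partitions_touched(rows, "region", "north")
by_month = partitions_touched(rows, "month", "north")
```

With these rows a query for one region reads a single partition when the key is region, but one partition per month when the key is the date. The result flips for a query that asks for one month across all regions, so the choice depends on the dominant query pattern.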

Sizing the Partition • The size of partition chosen is a key decision that affects the following considerations. • The SLA acts as a limit on the size of any partitioning scheme. • A partition will most likely become the unit of backup and recovery. • The availability stipulations in the SLA therefore limit the size of a partition. • The disk setup used constrains the number of partitions you can use. • Query performance is a major consideration.

Summary • Partitioning • Horizontal Partitioning • Vertical Partitioning • Hardware Partitioning • Which Key to Partition by? • Sizing the Partition

Thank You
