Scott Pickett – WW Informix Technical Sales


For questions about this presentation contact: spickett@us.ibm.com

TimeSeries

Scott Pickett – WW Informix Technical Sales

Agenda
• Time Series
• Real Time Loader

Key Points
• Describe two complex extenders:
• TimeSeries
• RealTime Loader
• Why time series data can be a problem.
• How TimeSeries solves these problems.
• Applications.
• Why streaming data can be a problem.
• How the RealTime Loader solves these problems.
TimeSeries
• TimeSeries:
• What is a time series?
• Regular
• Irregular
• Where are they used?
• What are they used for?
• How are they used?
Regular TimeSeries Data
• Timestamps have a regularly repeating pattern of intervals: daily, hourly, etc.
• There may be breaks in the pattern:
• A work week is data captured for 5 days in a row, then nothing for 2 days.
• Only one piece of data per interval:
• If an interval has not been inserted into, it has the value NULL.
• Intervals not inserted into still occupy some space.
• Heuristic: intervals not inserted into at the end of the series do not take up space.
• Can be thought of as an array.
• Optimized to return data at an offset, not at a timestamp.
Irregular Time Series Data
• Data in an irregular time series does not have a regularly repeating pattern of intervals:
• Any interval may have zero or more pieces of data.
• Missing data takes no space on disk:
• There really is no concept of missing data.
• Only efficient way to access data is by timestamp:
• You can retrieve the Nth piece of data, but the code does a linear search.
• Data can be stair-stepped:
• The value persists until the next value arrives – for example, stock prices.
• Data can be discrete points:
• The value is valid only at the given time – for example, heart beats.
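The contrast between the two models can be sketched in a few lines of Python. This is an illustrative sketch only (the class and method names are invented, not the extender's implementation): a regular series is array-like and finds an element by offset arithmetic, while an irregular series keeps sorted (timestamp, value) pairs and must search by timestamp.

```python
import bisect
from datetime import datetime, timedelta

class RegularSeries:
    """Regular series: one slot per interval, addressable by offset (array-like)."""
    def __init__(self, origin, interval):
        self.origin, self.interval = origin, interval
        self.slots = []  # None marks intervals never inserted into (NULL)

    def _offset(self, tstamp):
        return int((tstamp - self.origin) / self.interval)

    def put(self, tstamp, value):
        off = self._offset(tstamp)
        self.slots.extend([None] * (off + 1 - len(self.slots)))
        self.slots[off] = value  # at most one piece of data per interval

    def get(self, tstamp):  # O(1): plain offset arithmetic, no index needed
        off = self._offset(tstamp)
        return self.slots[off] if 0 <= off < len(self.slots) else None

class IrregularSeries:
    """Irregular series: sorted (timestamp, value) pairs; no notion of missing data."""
    def __init__(self):
        self.elems = []

    def put(self, tstamp, value):
        bisect.insort(self.elems, (tstamp, value))  # any interval may hold 0..n values

    def get_at_or_before(self, tstamp):
        # "Stair-stepped" read: the value persists until the next one arrives.
        i = bisect.bisect_right(self.elems, (tstamp, float('inf')))
        return self.elems[i - 1][1] if i else None

daily = RegularSeries(datetime(2006, 1, 1), timedelta(days=1))
daily.put(datetime(2006, 1, 3), 12.34)
print(daily.get(datetime(2006, 1, 3)), daily.get(datetime(2006, 1, 2)))  # 12.34 None

ticks = IrregularSeries()
ticks.put(datetime(2006, 1, 1, 9, 30), 12.34)
ticks.put(datetime(2006, 1, 1, 9, 31), 12.44)
print(ticks.get_at_or_before(datetime(2006, 1, 1, 9, 30, 30)))  # 12.34
```

Note how retrieving the Nth element of the irregular series would require walking the list, which is why timestamp access is the only efficient path.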
The TimeSeries "Problem"
• Timeseries are usually stored as "tall, thin" tables with a very large number of rows.
• May need one index to enforce uniqueness and another for index-only reads; more space can be used for the indexes than for the data.
• Can be difficult to write SQL to work with the data.

| Stock | Timestamp | Price | Volume |
| --- | --- | --- | --- |
| ABC | 2006-01-01 09:30:00.00000 | 12.34 | 100 |
| XYZ | 2006-01-01 09:30:00.01000 | 9.34 | 1000 |
| ABC | 2006-01-01 09:30:00.19000 | 12.44 | 10 |
| KKK | 2006-01-01 09:30:01.05000 | 194.00 | 10000 |
| ABC | 2006-01-01 09:30:01.70000 | 12.54 | 250 |

| | Size |
| --- | --- |
| Data | 35 MB |
| Index on stock + timestamp | 28 MB |
| Index on all columns | 43 MB |

• Irregular time series of stock "ticks".
• Typical volumes are 50M per day and increasing.
• Could access by "Stock" and "Timestamp", but often need an index on all columns for index-only reads.
• Can be difficult to write SQL to process tick data:
• Applying stock splits.
• Calculating Volume Weighted Average Price (VWAP).

    select t.tstamp, t.price * s.fac, t.vol / s.fac
    from tick t, split s
    where t.stock = s.stock
    and t.tstamp::datetime year to day = s.tstamp
    and t.stock = "ABC"
    and t.tstamp::date between '2006-01-01' and '2006-01-05'

    select stock, tstamp::datetime year to hour,
           sum(price * vol) / sum(vol)
    from tick
    where stock = "ABC"
    and tstamp::date between '2006-01-01' and '2006-01-05'
    group by 1, 2

TimeSeries Solution (1 of 7)
• Store timeseries elements as an ordered set of elements:
• Uses less space because the "key" is factored out and the time can be represented in 4 bytes (regular series).
• Access is as fast as (or faster than) an index-only read, but with no additional index.
• SQL can be made much simpler.

| Stock | Ticks |
| --- | --- |
| ABC | 2006-01-01, daily, {(12.34,1234567), (12.56,950000), (12.34,5555567), ...} |
| KKK | 2006-05-05, daily, {(199.08,678900), (198.55,345000), (198.12,850250), ...} |
| XYZ | 2003-09-01, daily, {(9.34,890567), (9.56,989000), (9.40,1000780), ...} |

TimeSeries Solution (2 of 7)
• Store timeseries data as:
• Start time.
• Calendar information.
• Ordered set of elements.

| | Size |
| --- | --- |
| Timeseries | 30 MB |
| Data | 35 MB |
| Index on stock + timestamp | 28 MB |
| Index on all columns | 43 MB |

TimeSeries Solution (3 of 7)
• How a timeseries is stored:
• Start time and calendar name.
• Index (btree) of the first element in each page.
• Each page contains all the elements between 2 timestamps.
• Traditional methods use 2 to 3 times more space or even more depending on the number of NULL values.

[Slide table: relative cost of operations – singleton select, reading 1,000 consecutive elements, aggregating 80,000 elements, and combining two timeseries with f(t1, t2). The Timeseries solution scores roughly 1 on each operation (0.5 for combining), while the traditional layouts range from about 2–5x slower on set operations to 400–1,000x slower when combining timeseries.]

TimeSeries Solution (4 of 7)
• Performance:
• Slower for singleton selects.
• Faster accessing sets of data.
• Much faster combining timeseries.
• Traditional(2) = index on key columns.
• Traditional(1) = index on key and all columns.
TimeSeries Solution (5 of 7)
• Much simpler SQL, consider the "stock split"/calibration problem:

With a traditional table:

    select t.tstamp, t.price * s.fac, t.vol / s.fac
    from tick t, split s
    where t.stock = s.stock
    and t.tstamp::datetime year to day = s.tstamp
    and t.stock = "ABC"
    and t.tstamp::date between '2006-01-01' and '2006-01-05'

With TimeSeries:

    select func(ticks, splits, '2006-01-01', '2006-01-05')
    from tick
    where stock = "ABC"

TimeSeries Solution (6 of 7)
• Much simpler SQL, the VWAP problem:

With a traditional table:

    select stock, tstamp::datetime year to hour, sum(price * vol) / sum(vol)
    from tick
    where stock = "ABC"
    and tstamp::date between '2006-01-01' and '2006-01-05'
    group by 1, 2

With TimeSeries:

    select stock, vwap(ticks, '2006-01-01', '2006-01-05', '01:00:00')
    from tick
    where stock = "ABC"
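For reference, the VWAP the SQL above computes is just sum(price × volume) / sum(volume) per time bucket. A minimal Python sketch of the hourly grouping (tick data invented for illustration):

```python
from collections import defaultdict
from datetime import datetime

# (timestamp, price, volume) ticks for one stock; values invented for illustration
ticks = [
    (datetime(2006, 1, 3, 9, 31), 12.34, 100),
    (datetime(2006, 1, 3, 9, 45), 12.44, 10),
    (datetime(2006, 1, 3, 10, 5), 12.54, 250),
]

# Bucket ticks by hour, then compute sum(price * vol) / sum(vol) per bucket.
buckets = defaultdict(lambda: [0.0, 0])
for t, price, vol in ticks:
    hour = t.replace(minute=0, second=0, microsecond=0)
    buckets[hour][0] += price * vol
    buckets[hour][1] += vol

vwap = {hour: pv / v for hour, (pv, v) in buckets.items()}
for hour in sorted(vwap):
    print(hour, round(vwap[hour], 4))  # 09:00 bucket ~= 12.3491, 10:00 bucket = 12.54
```

The TimeSeries `vwap()` routine shown above performs this same bucket-and-divide over the elements of the series, with the interval ('01:00:00') as the bucket width.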

TimeSeries Solution (7 of 7)
• Other Timeseries features:
• Calendar support.
• VTI to make time series look like traditional tables.
• C API.
• Java Class Library.
• Easy to combine with your own or 3rd-party code.
TimeSeries Practicalities (1 of 6)
• Calendar Patterns – stored in the CalendarPatterns table
• All time series are associated with these.
• Used to bound the times at which timeseries data is recorded:
• Data is recorded during 'on' times.
• Data is not recorded during 'off' times.
• Data is recorded at the interval specified within the calendar definition.
• A Monday-to-Friday 5-day work week can be represented as 1 day off (Sunday), 5 days on, 1 day off (Saturday):

    insert into CalendarPatterns values ('Five_Day_Work_Week', '{1 off, 5 on, 1 off}, day');

• Note the brackets '{ }' above.
• The trailing 'day' is the interval for a single unit of time.
TimeSeries Practicalities (2 of 6)
• How do I create a data type that defines each element in a time series ?
• Must first create a row type whose first column must be a datetime year to fraction(5) timestamp:

    create row type stockVals (
        tstamp datetime year to fraction(5),
        price  decimal(16,2),
        high   decimal(16,2),
        low    decimal(16,2),
        median decimal(16,2),
        volume integer,
        ...
    );
• Each timestamp must be unique.
• Row types cannot contain types:
• blob, clob, text, serial
TimeSeries Practicalities (3 of 6)
• Once a calendar pattern is established, create the calendar to use it in the table CalendarTable:
    insert into CalendarTable (c_name, c_calendar)
    values ('Trading Week',
            'startdate(2010-01-01 00:00:00.00000),
             pattstart(2010-01-07 00:00:00.00000),
             pattname(Five_Day_Work_Week)');

• Regular timeseries calendars are used to guide the mapping between the timestamp and the offset.
• Represented by the Calendar data type, consisting of:
• Start date of the calendar.
• A repeating pattern defining the valid ('on') and invalid ('off') periods for recording data.
• Interval – the calibration of the calendar pattern.
• Starting date of the calendar pattern:
• Must be >= the calendar start date, and
• Less than one calendar pattern length after the calendar start.
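The timestamp-to-offset mapping that a regular calendar drives can be sketched as follows: only 'on' days consume element offsets, so with a {1 off, 5 on, 1 off} day pattern each calendar week contributes five offsets. This Python sketch is illustrative only (the pattern start is an assumed Sunday, not the slide's pattstart, and this is not the extender's algorithm):

```python
from datetime import date, timedelta

# '{1 off, 5 on, 1 off}, day' -- position 0 is assumed here to be a Sunday
PATTERN = [False, True, True, True, True, True, False]
PATTERN_START = date(2010, 1, 3)  # a Sunday (illustrative choice)

def timestamp_to_offset(day):
    """Map a date to its element offset: only 'on' days consume offsets."""
    offset, cur = 0, PATTERN_START
    while cur < day:
        if PATTERN[(cur - PATTERN_START).days % len(PATTERN)]:
            offset += 1
        cur += timedelta(days=1)
    return offset

print(timestamp_to_offset(date(2010, 1, 4)))   # 0: Monday is the first 'on' day
print(timestamp_to_offset(date(2010, 1, 11)))  # 5: one full work week later
```

A production implementation would compute this with whole-pattern arithmetic rather than a day-by-day loop, but the mapping is the same.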
TimeSeries Practicalities (4 of 6)
• Containers are required data structures that hold data for one or more time series.
• You cannot mix data for regular and irregular time series in the same container.
• How do I create a container to store a time series ?
• execute procedure TSContainerCreate('ts_ctrname', 'dbspace1', 'stockVals', 40000, 40000);
• A container is made up of index pages and data pages.
• Containers allow time series data to be spread onto many disk partitions.
TimeSeries Practicalities (5 of 6)
• How do I create a table that includes a time series ?
    create table dailyStocks (
        stockId integer,
        abbrev char(4),
        stockData timeseries(stockVals));

• How do I create a time series record?

    insert into dailyStocks values (901, 'IBM',
        TSCreate('Trading Week', '2010-01-03 00:00:00.00000', 20, 0, 0, 'stockCon'));
TimeSeries Practicalities (6 of 6)
• How do I add data to a time series ?
    update dailyStocks
    set stockData = PutElem(stockData,
        row(NULL::datetime year to fraction(5),
            3.3, 4.4, 2.2, 3.0, 123456, 789, NULL)::stockVals)
    where abbrev = 'XYZ';

• How do I query a row with a time series?

    select Clip(stockData, current - interval(1) month to month, NULL)
    from dailyStocks;

• Get all of IBM since the beginning of 2001:

    select Clip(series, '2001-01-01 00:00:00.00000', current)
    from S_P_500 where name = 'IBM';
Building Applications with TimeSeries
• Several interfaces are available:
• SQL
• VTI
• SPL
• Java
• C-API
• It’s a toolkit approach!
• Allow people to build their analytics.
TimeSeries SQL Interface
• TimeSeries data is usually accessed through user-defined routines (UDRs) from SQL; some of these are:
• Clip() - clip a range of a time series and return it.
• LastElem(), FirstElem() - return the last (first) element in the time series.
• Apply() - apply a predicate and expression to a range of elements in a time series.
• AggregateBy() - change the interval of a time series using an aggregate function.
• SetContainerName() - move a time series from one container to another.
• BulkLoad() - load data into a time series from a file.
• See Appendix A for a list of all of the SQL Interface UDRs.
TimeSeries VTI Interface
• Makes time series data look like standard relational data:
• Useful for programs that can’t handle objects.
• Useful when application connects with ODBC.
• There is a small (10%) penalty for using VTI.
• Restrictions:
• A VTI table can only reference one time series column from the base table.
• No secondary indices are allowed.
• SQL to create a VTI table:
• execute procedure tscreatevirtualtab('t_vti', 't');
TimeSeries C-API Interface
• Client and server versions of the API.
• Treats a time series like a table (sort of).
• Functions to:
• Open and close a time series.
• Scan a time series between 2 timestamps.
• Create a time series.
• Retrieve, insert, delete, update.
• Plus another 70 functions defined.
• See Appendix C for a list of these.
Stored Procedure (SPL) Example
    -- count non-null elements in a time series
    create function spl_test(arg lvarchar) returning integer;
        define var daybar;
        define cnt integer;
        let cnt = 0;
        foreach execute function
            transpose((select series from S_P_500 where name = arg))
            into var
            let cnt = cnt + 1;
        end foreach;
        return cnt;
    end function;
TimeSeries Application
• Existing applications:
• Finance (Capital Markets)‏.
• Some in Manufacturing and Environmental.
• Smart Metering / Electrical Grid Usage
• Possible applications:
• Telco.
• Real Time Cell Phone Tower Data
• Sales/sales analysis.
• Energy
• Oil Field Geology data.
• Water/Sewer System Usage
• Satellite Telemetry
• Assembly Line Q/A & Analysis
Real Time Loader
• The next problem, after the time series "problem" is solved:
• Large number of ticks generated at market open/close and in reaction to events.
• Money is made at these times.
• Need to make the data available as soon as possible.

Edison Stock Ticker - 1892

• It takes time for the data to become available.
• The table can be locked while the data is loaded.
• The RealTime Loader solves this problem by holding the data in memory and making it available to the SQL layer as if it were already in the database:
• Allows real-time analysis of data as it comes through the feeds.
• Uncommitted and committed data reads are possible.

[Diagram: data feeds flow into RTL shared memory; the database server reads from both RTL memory and database memory; clients query through the database server.]

RealTime Loader Performance (1 of 3)
• Load performance of traditional systems varies, but rates of 1,000s of ticks per second are reasonable; however, there is high latency as the data is batched and loaded.
• With the RTL the data is made accessible within a few ms. There are 2 load rates to consider:
• How fast the data can be added to shared memory.
• How fast the data can be loaded into the database.
RealTime Loader Performance (2 of 3)
• How fast can the data be added to shared memory ?
• Rates of 10,000s per cpu sec are common.
• How fast can the data be loaded into the database ?
• This is roughly the same as the traditional RDBMS solution.
• The system can run at the "top" rate until the shared memory allocated to RTL is full.
RealTime Loader Performance (3 of 3)
• The peaks don’t last very long, so the system can be sized for something just above the average rate.
• Traditional systems have to be sized to handle the peak rate or the user has to accept a high latency during the crucial market open/close periods.
RTL Data Access
• Access can be through the usual SQL layer.
• Even faster access through an API that gets the data directly from the shared memory segment.
• The SQL layer can see all the ticks, or just those in RTL, or just those on the database.

[Diagram: the SQL layer sees both the ticks in RTL shared memory and the ticks in the database.]

• The RealTime Loader can do some simple aggregation and statistics on the incoming data making this available to SQL or through the API:
• VWAP - Volume Weighted Average Price
• = total amount spent on a security / total number of shares traded.
• This needs the entire share purchase history, unbounded by time.
• EWAP – Exponentially Weighted Moving Average Price
• A share price moving average in which the previous moving-average data is considered but dampened by the new, more recent data.
• Only needs 3 pieces of data:
• the old and new MAP data.
• plus the damping factor.
• max, min etc.
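The contrast above (the volume-weighted average needs the full history; the exponentially weighted average needs only the previous average plus a damping factor) can be sketched in one function. This is an illustrative sketch with invented data, not RTL's code:

```python
def ewma_update(prev_avg, new_price, alpha):
    """Exponentially weighted moving average: the previous average is kept
    but dampened; only prev_avg, the new observation and alpha are needed."""
    return alpha * new_price + (1 - alpha) * prev_avg

avg = 100.0  # starting moving average (invented data)
for price in (101.0, 99.0, 102.0):
    avg = ewma_update(avg, price, alpha=0.2)
print(round(avg, 3))  # 100.368
```

Because each update consumes only constant state, RTL can maintain such statistics incrementally on the incoming feed, whereas a true VWAP over all history has to accumulate price x volume and volume totals from the start.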
• Stock ticker trades data analysis.
• Gaming
• Telco:
• Cell Phone Tower Call Analysis Apps.
• Cell Phone Account Apps.
• 24 x 7 Satellite Telemetry Data Applications:
• Millions of records per minute.
• Geological Analysis Applications:
• Oil Exploration Data.
• Land Resource Analysis Applications.
• Water/Sewer Flow Data
Conclusion
• TimeSeries allows time series data to be:
• Efficiently stored.
• Quickly accessed.
• Conveniently queried.
• The RealTime Loader allows time series data to be:
• Quickly stored.
• Accessed by applications with the minimum of delay.
• Appendix D has more information on a recent successful POC and customer win; it is worth a read.
Time Series – Delete Performance Improvements
• Delete Performance Improvements
• When a user deletes rows from a table that contains a TimeSeries, the TimeSeries needs to be deleted too. Since timeseries can be very large, improving delete performance is important to provide good response times. Performance improves as more records are deleted:
• < 30,000 --> 2X faster
• 100,000 --> 3X
• 1,000,000 --> 18X
• 3,000,000 --> 101X
Time Series – Container Improvements
• Container Improvements:
• A TimeSeries is stored in a container. We removed the requirement of having to identify or even create containers when inserting a new TimeSeries into a table. This makes using TimeSeries more user-friendly.
• The container name limit was raised from 18 characters to 128 characters to allow customers to be more descriptive in the naming of their containers.
• Container statistics functions were added so that system administrators can get a better view of their space utilization and therefore can better plan for future upgrades.
Container Pools
• New column (pool) in the function TSContainerTable.
• autopool: default pool of containers created by auto-create container.
• Create containers (as before): TSContainerCreate(<container_name>, <dbspace>, <row_type>, <initial_size_kb>, <next_size_kb>);
• Add a container to a pool:TSContainerSetPool(<container_name>, <pool_name>);
• Pool created automatically.
• A container can be in only one pool at a time.
• Remove a container from its current pool: TSContainerSetPool(<container_name>);
Container Selection
• Search autopool for a set of candidate containers.
• Select the container with the greatest amount of free space.
Pool Policy: TSContainerPoolRoundRobin
• Select a container from a pool in a round robin manner and return the name.
• Example*:

    INSERT INTO smartmeters(meter_id, rawreadings)
    VALUES ('met00001',
        'origin(2006-01-01 00:00:00.00000),
         calendar(smartmeter), regular, threshold(0),
         [(33070,-13.00,100.00,9.98e+34),
          (19347,-4.00,100.00,1.007e+35),
          (17782,-18.00,100.00,9.83e+34)]');

• Position kept in memory.

* Note: the call to TSContainerPoolRoundRobin is a string.
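The round-robin policy's behavior (cycle through the pool's containers, keeping the cursor position in memory) can be sketched like this. Illustrative Python only, with invented container names, not the server's implementation:

```python
import itertools

class ContainerPool:
    """Round-robin over a pool's containers; the cursor is kept in memory only."""
    def __init__(self, name, containers):
        self.name = name
        self._cycle = itertools.cycle(containers)  # position survives across calls

    def next_container(self):
        return next(self._cycle)

pool = ContainerPool('autopool', ['cont1', 'cont2', 'cont3'])
print([pool.next_container() for _ in range(5)])
# ['cont1', 'cont2', 'cont3', 'cont1', 'cont2']
```

Because the position lives only in memory, a restart simply begins the cycle again; round-robin needs no persistent bookkeeping, unlike the free-space policy, which must inspect each candidate container.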

Pool Policy: TSContainerPoolRoundRobin

    function TSContainerPoolRoundRobin(
        tabname   lvarchar,
        colname   lvarchar,
        subtype   lvarchar,
        irregular integer,
        poolname  lvarchar)
    returns lvarchar;

• tabname: Table name where the TS value is being inserted.
• colname: TS column name where a TS value is being inserted.
• subtype: Subtype TS value (row type) being inserted.
• irregular: Kind of TS (0 = regular, 1 = irregular).
• poolname: Name of the container pool to use.

    function <pool_policy_name>(
        tabname   lvarchar,
        colname   lvarchar,
        subtype   lvarchar,
        irregular integer
        [, user_data lvarchar])
    returns lvarchar;

• Tailor the container selection policy to fit your environment.
Time Series – New Container Functions (1)
• All functions take a container name or NULL
• If NULL, cover all containers
• TSContainerTotalPages
• Total number of pages allocated to a container.
• TSContainerTotalUsed
• Number of pages containing TimeSeries data.
• TSContainerNElems
• Total number of elements stored in a container.
• TSContainerPctUsed
• (used * 100) / total
• TSContainerUsage
• Returns used, nelems, and total
Time Series – New Container Functions (2)

    EXECUTE FUNCTION TSContainerTotalUsed(NULL);   -- pages    2029
    EXECUTE FUNCTION TSContainerTotalPages(NULL);  -- total    2169
    EXECUTE FUNCTION TSContainerNElems(NULL);      -- elements 241907
    EXECUTE FUNCTION TSContainerPctUsed(NULL);     -- percent  93.545
    EXECUTE FUNCTION TSContainerUsage(NULL);       -- pages 2029, slots 241907, total 2169

Time Series – VTI Improvements
• Virtual Table Interface (VTI) Improvements
• The VTI view of a table containing a TimeSeries now gives the user more complete information on the column’s characteristics:
• Allow nulls in non-primary keys.
• More than one “view” can be created on a TimeSeries.
• Expression pushdown (keyword substitution).
• A VTI view can include pre-processing with TimeSeries functions, giving the user the exact view he wants in a relational format while getting the performance of the TimeSeries API functions.
• Example:
• Creating an hourly moving average.
• Getting the max value per day.
TS VTI Improvements
• Make the NOT NULL constraint on base table visible through VTI.
• Use only the primary key columns to locate the appropriate row to update:
• Benefit: Allow NULLs in non-primary key columns.
• If the primary key of the base table is altered, any TS VTI tables based on that base table will need to be recreated.
• If no primary key, use all non-TS columns.
• New TSVTMode flags
• TS_VTI_UPDATE_NONKEY_NOT_NULLS (32)
• Only not null updatable columns of the insert row will be added to the set clause.
• TS_VTI_UPDATE_NONKEY_INCLUDE_NULLS (64)
• All the updateable columns of the insert row will be added to the set clause.
VTI Expression Pushdown
• Provide pre-processed views of TimeSeries
• Example: “Create an hourly moving average and get the max value per day” (table: smartmeters, column: readings).

    EXECUTE PROCEDURE TSCreateExpressionVirtualTab(
        'smartmeters_expr1_vti', 'smartmeters',
        "AggregateBy('max($value)', 'smartmeter_daily',
            Apply('TSRunningAvg($energy, 4)',
                  readings)::TimeSeries(onereal_row),
            0)", 'onereal_row');

VTI Expression Pushdown: Result

    SELECT *
    FROM smartmeters_expr1_vti
    WHERE t >= '2011-01-01 00:00:00.00000'::datetime year to fraction(5)
      AND t <= '2011-01-02 23:59:59.99999'::datetime year to fraction(5);

    meter_id  t                          value
    met00000  2011-01-01 00:00:00.00000  30.25000000000
    met00000  2011-01-02 00:00:00.00000  29.50000000000
    met00001  2011-01-01 00:00:00.00000  29.75000000000
    met00001  2011-01-02 00:00:00.00000  31.00000000000
    . . .

VTI: TSCreateExpressionVirtualTab
• More than one “view” can be created on a TimeSeries.
• Very similar to TSCreateVirtualTab but a read-only interface.

    TSCreateExpressionVirtualTab(
        VirtualTable  VARCHAR(128,1),
        BaseTablename VARCHAR(128,1),
        expression    LVARCHAR,
        subtype       VARCHAR(128,1),
        TSVTMode      INTEGER DEFAULT 0,
        TSColName     VARCHAR(128,1) DEFAULT NULL);

VTI Expression Pushdown
• Keyword substitution:
• \$ts_begin_time, \$ts_end_time
• Used to substitute the time from the runtime query in the expression view:
• Example expression:

    Apply('TSRunningAvg($energy, 4)',
          $ts_begin_time,
          $ts_end_time,
          readings)::TimeSeries(onereal_row)

• A query including:

    WHERE t >= '2011-01-01 00:00:00.00000'::datetime year to fraction(5)
      AND t <= '2011-01-02 23:59:59.99999'::datetime year to fraction(5);

Time Series – All IDS Platform Support
• Increased Platform Support
• Customers now have more choices when it comes to platform support to run Informix with TimeSeries:
• Red Hat Linux 32- and 64-bit
• Mac OS X 10
• Linux on pSeries
• Solaris x86_64
• TimeSeries is now built in for all platforms for IDS 11.70xC3:
• https://www-304.ibm.com/support/docview.wss?rs=630&uid=swg27013343&S_CMP=rnav
Time Series – New Standard Calendars
• General Usage Improvements
• Customers don't need to define commonly used calendars; they can now use new, built-in, predefined calendars based on common intervals of time.
Time Series - XML Generation (1)
• Customers can now use a standard function format to exchange messages using XML. With this capability, they can easily transform TimeSeries data into XML messages.
• Syntax:

    CREATE FUNCTION TSToXML(
        doctype    lvarchar,
        id         lvarchar,
        ts         TimeSeries,
        output_max integer DEFAULT 0)
    RETURNS lvarchar;
Time Series - XML Generation (2)
• doctype: Name of the topmost XML element
• id: Identification for the TimeSeries
• ts: TimeSeries to process
• output_max: Max output size; the default 0 represents 32KB (minimum 4096, maximum 2GB).
• <AllData>1</AllData>: All data was returned.
• <AllData>0</AllData>: The TS data was truncated to fit the output max.
• Still a valid XML document.
XML Generation Example

    SELECT TSToXML('meterdata', esi_id,
        Clip(actual,
             '2010-09-08 12:00:00'::datetime year to second,
             '2010-09-08 13:00:00'::datetime year to second))
    FROM meters
    WHERE esi_id = '22505613346851835';

    <meterdata>
      <id>2250561334</id>
      <AllData>1</AllData>
      <meter_data>
        <tstamp>2010-09-08 12:15:00.00000</tstamp>
        <value>0.9170000000</value>
      </meter_data>
      <meter_data>
      . . .
    </meterdata>

Time Series (7)
• The ‘stores demo’ database has been updated with smart meter information, giving customers a ready-to-use environment to experiment with TimeSeries and increase their understanding of the technology.
• Multi-Cultural Support:
• Time Series now supports multiple locales.
• AggregateBy accumulator too small:
• Accumulators for smaller integer types have been increased for the AVG function to prevent overflow.
• This is a bug fix.
Appendix - Time Series Built in Routines (9 of 9)
• Following routines are used only with regular time series:
• Lag
• PutNthElem
• TSCreate
• The TSCreateIrr function is only used with irregular time series.
Appendix - Time Series API's

Application Programmer Interfaces for a Time Series:

• Two types -
• Client version – tsfeapi
• Sample location: $INFORMIXDIR/extend/TimeSeries.4.07.UC7/lib/tsfeapi.a (Unix/Linux)
• Similar directory for Windows.
• Operates on a private copy of the timeseries datum.
• Changed data must be written back into the database.
• Function ts_copy is used prior to the execution of a prepared statement.
• Server version – tsbeapi
• Sample location: $INFORMIXDIR/extend/TimeSeries.4.07.UC7/lib/tsbeapi.a (Unix/Linux)
• Similar directory for Windows.
• Operates directly on the timeseries datum.
• Direct execution of prepared/unprepared statements.
• Header file – tseries.h
• Must be included whenever using tsbeapi or tsfeapi.
Appendix - Time Series API – Data Structures
• ts_timeseries – the header for a time series data.
• ts_tscan – allows you to look at no more than two time series elements at a time.
• ts_tsdesc – A structure containing a time series and the data structures it works with.
• ts_tselem – A pointer to one element of a time series.
• Who is Oncor?

Oncor is a regulated electric distribution and transmission business that uses superior asset management skills to provide reliable electricity delivery to consumers. Oncor operates the largest distribution and transmission system in Texas, providing power to 3 million electric delivery points over more than 120,000 miles of distribution and 14,000 miles of transmission lines.

• Current Solution
• Architected by IBM GBS:
• Primary architect Keith Hall.
• A number of different components used including Oracle.
• They currently manage about 1 million smart meters.
• Every meter generates one record every 15 minutes.
• ERCOT (Energy Regulatory Council of Texas) requires data be kept for 25 months.

(from the Oncor website)

25 months × 1 million meters × one reading every 15 minutes ≈ 73 billion records
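The slide's figure is easy to verify: at 96 fifteen-minute readings per day, 25 months of data for one million meters is on the order of 73 billion records (the ~30.4 days-per-month average is an assumption of this sketch):

```python
READINGS_PER_DAY = 24 * 4        # one reading every 15 minutes
DAYS_PER_MONTH = 365.25 / 12     # ~30.44, an averaging assumption
METERS = 1_000_000

records = 25 * DAYS_PER_MONTH * READINGS_PER_DAY * METERS
print(f"{records / 1e9:.2f} billion records")  # ~73 billion
```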

• Oracle is barely keeping up with the data
• It is taking Oracle about 7 hours to read in the data for 1 million meters.
• The ERCOT reports are taking between 2 and 7 hours to run.
• Some problems with reading and writing data simultaneously.
• Oncor is planning to expand
• In the next 1 to 2 years they expect to manage 3.5 million meters:
• Their 73 billion records stored becomes 220 billion records stored.
• Even if Oracle scales linearly there will not be enough time in the day to read and process all this data.
• Oncor is looking for ways to cut their costs
• Oracle is currently using about 1.3 TB to store 90 days worth of data.
• They would like to do more with the hardware they have:
• Process the data more quickly to allow other applications to run against the data.
Design Decisions
• Why they picked Informix:
• Low cost and low administration, well established.
• High performance.
• Security and High Availability also of interest.
• Why they used the native time series support in Informix:
• Provides high level of compression.
• Provides continuous loading of data with minimal impact on concurrent queries.
• Provides very high degree of parallelism on reads and writes.
• Time series tool kit allows custom analytics to be written.
• No other RDBMS has native time series support.
Results of the POC with Informix
• Informix takes about 18 minutes to read in the data for 1 million meters:
• Oracle is taking about 7 hours.
• Informix takes about 6 minutes to run each ERCOT report and about 25 seconds if the data is already cached:
• Oracle is taking from 2 to 7 hours depending on the report.
• Data space used by Informix is about 350GB:
• Oracle is using about 1.3TB.
• Results were very linear:
• Better results if you increase cpus and storage.
• If less performance is acceptable then cpus and storage can be reduced.
Next Steps
• Oncor feels they can save a tremendous amount of money with Informix due to:
• Reduced storage requirements (350GB vs 1.3TB per million meters for 90 days).
• Ability to do more work on less hardware
• They currently run on Sun hardware.
• As of Aug 14, 2010 Oncor has been given a price for an Informix system to handle 3.5 million meters.
• Oncor is deciding whether they want to expand Informix to other systems:
• Customer Portal.
• Some additional billing applications currently running on Oracle.
• They are also now considering high availability and whether they should have one or more secondary servers.
POC Details
• Simulation:
• 90 days worth of meter data for 1 million meters.
• 200 locations.
• 500 feeders.
• 34 substations.
• Hardware used in POC
• Power7 with 2 sockets each with 8 cores.
• 64 bit SUSE Linux 11.
• 128 GB of memory:
• Memory actually needed = 44GB, although could probably be less.
• 6 disks dedicated to the database, 2 additional for OS and LSE staging:
• Disk space actually used by the database = about 350GB.
• Additional disks for the operating system and staging area for files.
• Software:
• Informix Ultimate edition.
• Informix Time Series.
Data Generation
• We started with one data file (LSE file) with data for 9960 meters.
• We duplicated this file by adding a suffix to the meter id (esi_id) and changing the date.
• End result is 1 million meters for 90 days.
• Because we are duplicating data, aggregates for each meter look the same.
Informix Time Series Schema

The Meter table looks like this:

    create table meters (
        esi_id      char(64) not null primary key,
        suffix      char(32),
        location    char(16),
        feeder      char(16),
        sub_station char(16),
        dbspace     varchar(128),
        container   varchar(128),
        actual      timeseries(meter_data),
        estimated   timeseries(meter_data),
        valid       timeseries(update_day)
    );

A Meter reading looks like this:

    create row type meter_data (
        tstamp datetime year to fraction(5),
        value  decimal(14,3)
    );

An update (correction) record looks like:

    create row type update_day (
        tstamp      datetime year to fraction(5),
        last_update datetime year to fraction(5)
    );

Hierarchy is sub_station->feeder->meter. There are also tables for location, sub_station and feeder not shown above.

Primary Use Cases
• Load 90 days worth of data for 1 million meters from LSE files
• Original set of LSE files massaged to generate 1 million distinct meters.
• 6-day ERCOT Settlement Extract
• Show support for the ERCOT settlement processes by creating an LSE file consisting of every record (every meter) for operating day minus 6 (the calendar day 6 days prior to the current day). Must be able to extract and create the LSE files for 1M meters for a specific day.
• 22-Day Update ERCOT Settlement Extract
• Show support for the ERCOT settlement processes by creating LSE files consisting of every record that has had a consumption-interval record update since the prior (6-Day) extract. Only the last or most current update for each meter is extracted, so if a meter has been updated four times, only the last / current record is sent. All 96 fifteen-minute intervals are sent each time as well.
• Missing Record ERCOT Settlement Extract
• Show support for the ERCOT settlement processes by creating an LSE file consisting of only the meter IDs and date that is provided in a missing meter ID file from ERCOT. The dates will be as far back as 90 days and no sooner than 28 days back in time.
Other Use Cases
• Determine the count and the list of meter ID's for all meters with missing intervals and / or register reads on a given day.
• Determine the 90 day history for a given meter (90 record pull for same meter).
• Determine the count and list of meter IDs that exceeded a given high interval value for a given day or given time period (multiple days). For example, count and list of meters that had interval value of 12 or higher for a given period of time.
• Determine list of meters that have 5 consecutive or more days with estimated values only (no actual interval reads during a 5 day or more period).
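The last use case — 5 or more consecutive days with estimated values only — is essentially a run-length scan over each meter's daily read flags. A small Python sketch of that logic (a hypothetical helper, not the SPL routine the benchmark actually uses):

```python
def estimated_only_runs(days, min_run=5):
    """Scan (day, has_actual_read) pairs in date order and return
    (first_day, length) for every run of >= min_run estimated-only days."""
    runs = []
    start, length = None, 0
    for day, has_actual in days:
        if has_actual:
            if length >= min_run:
                runs.append((start, length))
            length = 0
        else:
            if length == 0:
                start = day
            length += 1
    if length >= min_run:          # a qualifying run may end the scan
        runs.append((start, length))
    return runs

# Days 2-6 have no actual reads: one qualifying 5-day run.
days = [(1, True), (2, False), (3, False), (4, False),
        (5, False), (6, False), (7, True)]
assert estimated_only_runs(days) == [(2, 5)]
```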
Results: Load 90 days' worth of data for 1 million meters from LSE files
• Query syntax:
• Performance:
• 6 Parallel streams - 18 minutes to load 1 million meters for 1 day.
• Scalability:
• Depends on parallelism of disk write performance.
• Increase disks and CPUs for higher speed/more data.
Results: 6-day ERCOT Settlement Extract
• Query syntax:
• The LSE file is generated from the base query:

SELECT esi_id, daily_output(actual, estimated, $db_start_date)
FROM meters;

• daily_output() is a custom UDR that merges the actual and estimated interval data into a single day.
• Performance:
• 6 Streams - 6 minutes 30 seconds to extract 1 day (1 million meters)
• Scalability:
• Depends on parallelism of disk read performance
• Increase disks and CPUs for higher speed/more data
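daily_output() is a custom UDR whose source isn't shown; one plausible version of its merge step — take the actual interval reading where one exists, otherwise fall back to the estimate — can be sketched in Python:

```python
def merge_readings(actual, estimated):
    """Per-interval merge: take the actual value when present (not None),
    otherwise fall back to the estimated value for that interval."""
    return [a if a is not None else e for a, e in zip(actual, estimated)]

# One day holds 96 intervals; a short 4-interval example for clarity.
assert merge_readings([1.0, None, 3.0, None],
                      [9.0, 2.0, 9.0, 4.0]) == [1.0, 2.0, 3.0, 4.0]
```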
Results: 22-Day Update ERCOT Settlement Extract
• Query syntax:
• Uses standard TimeSeries functions and a stored procedure (SPL) routine:

SELECT apply('"'||trim(esi_id)||'", ($last_update)',
             'was_updated($last_update,$tstamp)',
             '${day} 00:00:00.00000'::datetime year to fraction(5),
             '${day} 00:00:00.00000'::datetime year to fraction(5) + ${period} units day,
             m.valid)::timeseries(esi_one_day)
FROM meters m
WHERE dbspace = 'cont${c}'
  AND getnelems(apply('$last_update',
             'was_updated($last_update,$tstamp)',
             '${day} 00:00:00.00000'::datetime year to fraction(5),
             '${day} 00:00:00.00000'::datetime year to fraction(5) + ${period} units day,
             m.valid)::timeseries(one_day)) > 0

• Performance:
• 22 day - 6 Streams 4 minutes
• 90 day - 6 Streams 11 minutes
• Scalability:
• Depends on period and number of meters.
• Increase in speed possible with custom UDR.
Results: Missing Record ERCOT Settlement Extract
• Query syntax:
• Joins meter+date info to LSE extract query.
• Performance:
• 4000 readings for a particular day in 6 seconds – when data is cached.
• 4000 readings in 24 seconds for 90 days – when data is not cached.
• Depends on number of meters+days and whether data is in memory already.
• Runtime is directly proportional to that of the settlement extract.
• Scalability:
• Depends on parallelism of disk read performance.
• Increase disks and CPUs for higher speed/more data.
Results: Determine Meters with Missing Intervals on a Given Day
• Query syntax:
• Uses a custom UDR to join actual and estimated readings.

SELECT esi_id
FROM meters
WHERE ClipGetCount(tagged_reading(actual, estimated,
          '$day 00:00'::datetime year to minute,
          '$day 23:45'::datetime year to minute)::timeseries(tagged_reading)) < 96
  AND dbspace = 'cont${c}'

• Performance:
• 6 Streams - 6 minutes 10 seconds all meters, 1 day
• Scalability:
• Depends on parallelism of disk read performance.
• Increase disks and CPUs for higher speed/more data.
Results: Determine the 90 Day History for a Given Meter
• Query
• Uses standard TimeSeries functions and a user-defined aggregate row type:

SELECT AggregateBy('sum($value), min($value), max($value), avg($value)',
          'cal1day', actual, 0,
          '2010-01-01 00:00:00.00000'::datetime year to fraction(5),
          '2010-03-31 23:45:00.00000'::datetime year to fraction(5))::timeseries(agg_stats1)
FROM meters
WHERE esi_id = '80937225860964649#1'

• Performance:
• Single query - 0.04 seconds for 1 meter with 90 day aggregation.
• Scalability:
• Not investigated, but will depend on whether time series is already in memory.
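AggregateBy computes sum/min/max/avg per calendar bucket ('cal1day' above). The per-bucket arithmetic can be modeled in Python (a conceptual sketch, not the extender's implementation):

```python
def agg_stats(values):
    """sum, min, max, avg over one bucket, ignoring missing (None) intervals."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # an empty bucket yields no aggregate
    total = sum(present)
    return (total, min(present), max(present), total / len(present))

# One day's bucket with a missing interval.
assert agg_stats([1, 2, 3, None]) == (6, 1, 3, 2.0)
```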
Results: Determine the count and list of meter IDs that exceeded a given high interval value for a given day
• Query:
• Uses standard TimeSeries functions:

SELECT Apply('("'||trim(esi_id)||'"), $value',
             '(($value) > ${limit})',
             '${day} 00:00:00.00000'::datetime year to fraction(5),
             '${day} 23:45:00.00000'::datetime year to fraction(5) + ${period} units day,
             actual)::timeseries(esi_reading)
FROM meters
WHERE getnelems(Apply('$value',
             '$value > ${limit}',
             '${day} 00:00:00.00000'::datetime year to fraction(5),
             '${day} 23:45:00.00000'::datetime year to fraction(5) + ${period} units day,
             actual)::timeseries(meter_data)) > 1
  AND dbspace = 'cont${c}'

• Performance:
• 6 Streams -  5 minutes 40 seconds all meters for 1 day.
• Scalability:
• Depends on number of days to be assessed.
• Depends on parallelism of disk read performance.
• Increase disks and CPUs for higher speed/more data.
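The Apply/getnelems pair above first counts qualifying intervals to decide whether a meter qualifies, then re-applies the same filter to emit them. The filtering semantics can be sketched in Python (illustrative names, not Informix API):

```python
def over_limit(series, limit):
    """Return the (timestamp, value) pairs whose value exceeds the limit --
    the role the filter expression '($value) > ${limit}' plays above."""
    return [(ts, v) for ts, v in series if v is not None and v > limit]

def meter_qualifies(series, limit):
    """Mirrors the getnelems(...) > 1 guard in the WHERE clause."""
    return len(over_limit(series, limit)) > 1

series = [(0, 5.0), (1, 13.0), (2, 12.5), (3, None)]
assert over_limit(series, 12) == [(1, 13.0), (2, 12.5)]
```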
Results: Determine list of meters that have 5 consecutive or more days with estimated values only
• Query:
• Joins to a list of required dates.
• Filter function is faster but less selective.
• Uses standard TimeSeries functions and a stored procedure (SPL) routine.

SELECT esi_id, start_date,
       ClipGetCount(Apply('non_null_value($0.value, $1.value)',
           start_date,
           extend((start_date + $days units day), year to minute) - 15 units minute,
           estimated, actual)::timeseries(single_reading))
FROM meters, date_list
WHERE dbspace = 'cont${c}'
  AND ClipGetCount(estimated, start_date,
      extend((start_date + $days units day), year to minute) - 15 units minute) = 480
  AND start_date BETWEEN '$start' AND '$end'

• Performance:
• 6 Streams - 17 minutes for all meters in 5 out of 30 days.
• Scalability:
• Depends on the number of days between the start and end dates to be scanned.
• Depends on parallelism of disk read performance.
• Increase disks and CPUs for higher speed/more data.
• Custom UDR would be faster.
Some Notes
• Oncor is very enthusiastic about Informix:
• They feel having time series support built into an RDBMS is a perfect fit for smart meter applications.
• They have mentioned they would like to write a white paper with us.
• The GBS team feels Oncor would be very interested in promoting this solution at IOD, for instance in one of the keynotes.
What is TimeSeries Data?
• Time series data is:
• A set of data where each item is time-stamped
• Think of an array where each element is indexed by time.

"Give me the Jan 1st element from time series X"

• Most useful when a range of data is normally read:

"Give me the Jan 1st thru Jan 10th elements from time series X"

• Access to one time series is usually completed before moving to the next time series.
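The array analogy can be made concrete with a toy Python model of a regular series (illustrative only — real access goes through the TimeSeries SQL functions):

```python
from datetime import date

class RegularSeries:
    """Toy regular time series: an array whose index is derived from time."""
    def __init__(self, origin, values):
        self.origin = origin   # timestamp of element 0
        self.values = values   # one element per daily interval; None = not inserted

    def at(self, day):
        """'Give me the Jan 1st element' -- a single indexed read."""
        return self.values[(day - self.origin).days]

    def range(self, first, last):
        """'Give me Jan 1st thru Jan 10th' -- one contiguous slice."""
        lo = (first - self.origin).days
        return self.values[lo:(last - self.origin).days + 1]

x = RegularSeries(date(2024, 1, 1), list(range(31)))
assert x.at(date(2024, 1, 1)) == 0
assert x.range(date(2024, 1, 1), date(2024, 1, 10)) == list(range(10))
```

The range read is a single contiguous slice, which is why reading a span of one series is so much cheaper than assembling the same rows from a conventional table.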
Key Strengths of Informix TimeSeries
• Performance
• Extremely fast data access:
• Data clustered on disk to reduce I/O.
• Handles operations hard or impossible to do in standard SQL.
• Space Savings:
• Can be over 50% space savings over standard relational layout.
• Toolkit approach allows users to develop their own algorithms:
• Algorithms run in the database to leverage buffer pool for speed.
• Conceptually closer to how users think of time series.
Relational Schema for Smart Meter Data

Smart_Meters (primary key: mtr_id + date)

| mtr_id | date | Col1    | Col2    | ... | ColN    |
|--------|------|---------|---------|-----|---------|
| 1      | Mon  | Value 1 | Value 2 | ... | Value N |
| 1      | Tue  | Value 1 | Value 2 | ... | Value N |
| 1      | Wed  | Value 1 | Value 2 | ... | Value N |
| ...    | ...  | ...     | ...     | ... | ...     |
| 13     | Mon  | Value 1 | Value 2 | ... | Value N |
| 13     | Tue  | Value 1 | Value 2 | ... | Value N |
| 13     | Wed  | Value 1 | Value 2 | ... | Value N |
| ...    | ...  | ...     | ...     | ... | ...     |

Same Table using an Informix TimeSeries Schema

Smart_Meters

| mtr_id (int) | series (timeseries(mtr_data))  |
|--------------|--------------------------------|
| 1            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 2            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 3            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 4            | [(Mon, v1, ...)(Tue, v1, ...)] |

TimeSeries Data is stored outside the Table in "Containers"

The Smart_Meters table (mtr_id int, series timeseries(mtr_data)) holds the series column, but the series elements themselves live in separate containers:

• Meters 1-4 -> Container "A"
• Meters 5-7 -> Container "B"
• Meter 8 -> Container "C"
What a Container Looks Like

• Each time series has a unique ID. This ID plus the time stamp is used to search the btree.
• The btree entries point at data pages (MTR1, MTR1, MTR2, MTR3, MTR4, MTR5, ...).
• Each data page holds sorted data for exactly one time series.
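Because only the first timestamp of each page is indexed, locating a page is a sorted search on a (series ID, timestamp) key. A Python sketch of that lookup (a conceptual model of the container btree, not its actual on-disk format):

```python
import bisect

# Conceptual model: the btree indexes only the first timestamp of each
# data page, keyed as (series_id, first_timestamp). Page i's data starts
# at index[i]; each page holds data for exactly one series.
index = [("MTR1", 0), ("MTR1", 100), ("MTR2", 0), ("MTR3", 0)]

def find_page(series_id, tstamp):
    """Return the page number that would contain (series_id, tstamp)."""
    pos = bisect.bisect_right(index, (series_id, tstamp)) - 1
    if pos >= 0 and index[pos][0] == series_id:
        return pos
    return None

assert find_page("MTR1", 150) == 1   # second MTR1 page (starts at 100)
```

One index entry per page instead of one per element is a large part of the space savings described below.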

Storage for the same data:

| Layout                     | Size  |
|----------------------------|-------|
| Timeseries                 | 30 MB |
| Data                       | 35 MB |
| Index on stock + timestamp | 28 MB |
| Index on all columns       | 43 MB |

TimeSeries Space Savings Example

• Time series takes much less space than traditional relational storage
• For instance, for 30 MB of time series storage relational will take either:
• 30MB of data + 28MB of index on stock and timestamp = 58MB on disk.
• 30MB of data + 43MB of index on all columns = 73MB on disk.
• In either case time series is a huge savings.
• The reason for this is that:
• The time series does not require a separate index.
• Only earliest item on each time series page is indexed.
• Missing data takes no space.
• Savings can be even greater when there are many NULL values.
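The arithmetic behind the savings claim, using the figures quoted above:

```python
# Sizes in MB, taken from the example above.
timeseries_mb = 30
data_mb, idx_key_mb, idx_all_mb = 30, 28, 43

relational_option1 = data_mb + idx_key_mb   # data + index on stock/timestamp
relational_option2 = data_mb + idx_all_mb   # data + index on all columns

savings1 = 1 - timeseries_mb / relational_option1
savings2 = 1 - timeseries_mb / relational_option2

assert relational_option1 == 58 and relational_option2 == 73
# roughly 48% and 59% savings, respectively
assert round(savings1 * 100) == 48 and round(savings2 * 100) == 59
```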

TimeSeries Performance

• Performance:
• Slower for singleton selects.
• Faster accessing sets of data.
• Much faster combining timeseries.

|                | Singleton | 1,000 consec. | Aggregate 80,000 | 1,000 consec. | f(t1, t2) |
|----------------|-----------|---------------|------------------|---------------|-----------|
| Timeseries     | 1         | 1             | 1                | 1             | .5        |
| Traditional(2) | 2         | 3             | 5                | 2             | 400       |
| Traditional(1) | 1000      | 650           | …                | …             | …         |

(values in table are relative to each other, not in seconds)

• Traditional(2) = index on key columns
• Traditional(1) = index on key and all columns

TimeSeries Table vs. TimeSeries Virtual Table

Smart_meter (the TimeSeries table):

| mtr_id (int) | series (timeseries(mtr_data))  |
|--------------|--------------------------------|
| 1            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 2            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 3            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 4            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 5            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 6            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 7            | [(Mon, v1, ...)(Tue, v1, ...)] |
| 8            | [(Mon, v1, ...)(Tue, v1, ...)] |

SM_vt (the relational virtual table over the same data):

| mtr_id | date | col_1   | col_2   | ... |
|--------|------|---------|---------|-----|
| 1      | Mon  | Value 1 | Value 2 | ... |
| 1      | Tue  | Value 1 | Value 2 | ... |
| 1      | Wed  | Value 1 | Value 2 | ... |
| ...    | ...  | ...     | ...     | ... |
| 3      | Mon  | Value 1 | Value 2 | ... |
| 3      | Tue  | Value 1 | Value 2 | ... |
| 3      | Wed  | Value 1 | Value 2 | ... |
| ...    | ...  | ...     | ...     | ... |

The virtual table is created with:

execute procedure tscreatevirtualtable('SM_vt', 'Smart_meter');
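Conceptually, the virtual table exposes each element of every series row as one relational row, so ordinary SQL tools can query TimeSeries data. A Python sketch of that flattening (a model of the concept, not the extender's mechanism):

```python
def flatten(timeseries_rows):
    """Yield one relational row per series element: (mtr_id, date, col_1, ...)."""
    for mtr_id, series in timeseries_rows:
        for tstamp, *values in series:
            yield (mtr_id, tstamp, *values)

# Two meters with [(date, v1, v2), ...] series, as in the tables above.
smart_meter = [(1, [("Mon", 10, 11), ("Tue", 12, 13)]),
               (2, [("Mon", 20, 21)])]
sm_vt = list(flatten(smart_meter))
assert sm_vt[0] == (1, "Mon", 10, 11)
assert len(sm_vt) == 3
```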