Integration Services in SQL Server 2008

Integration Services in SQL Server 2008 Allan Mitchell SQL Server MVP

Who am I • SQL Server MVP • SQL Server Consultant • Joint author on Wrox Professional SSIS book • Worked with SQL Server since version 6.5 • www.SQLDTS.com and www.SQLIS.com

Today’s Schedule • SSIS in SQL Server 2008 • Behind the scenes • Threading • Pipeline Limiter • Front of shop • Lookup Component. Caching and other good stuff. • Data Profiling • Change Data Capture

Behind the Scenes Threading

Threading • Loading speed will benefit from getting as many threads spinning as possible. • In SSIS 2005 this was not always optimised • Blocking Transforms really wrecked your day. • Quaint workarounds

Threading - Multicast Transform • SSIS 2005 • Incredibly quick to generate n copies of the input • All outputs are Synchronous • All outputs on same execution tree • All outputs on same thread. • Round robin

Threading - Multicast Transform • SSIS 2008 • Incredibly quick to generate n copies of the input • All outputs are still Synchronous • All outputs on same execution Path • All outputs can get their own thread

Execution “Tree” 2005 begin execution tree 0 output "Generated" (8) input "Multicast Input 1" (20) output "Multicast Output 1" (21) input "TrashInput" (25) output "Multicast Output 2" (32) input "TrashInput" (29) output "Multicast Output 3" (44) end execution tree 0

Execution “Tree” 2008 Begin Path 0 output "Generated" (7); component "Konesans Data Generator Source" (1) input "Multicast Input 1" (36); component "Multicast" (35) Begin Subpath 0 output "Multicast Output 1" (37); component "Multicast" (35) input "TrashInput" (41); component "Trash Destination" (39) End Subpath 0 Begin Subpath 1 output "Multicast Output 2" (48); component "Multicast" (35) input "TrashInput" (45); component "Trash Destination 1" (43) End Subpath 1 End Path 0

Demo Multicast Transform in 2005 and 2008

Behind the Scenes Pipeline Limiter

Pipeline What? • That’s right – Limiter • Doesn’t that mean things will go slower in 2008? • No. It happens now in 2005 you just don’t know it • Why would I want to restrict the pipeline? • Data = buffers = memory • Memory is reused when the buffer terminates • Push back from a component = no reuse of memory = run out of memory !!!!!

Pipeline Limiter (Nov CTP) Information: 0x400492E0 at Data Flow Task, SSIS.Pipeline: During last execution the pipeline suspended output "Union All Output 1" (487) of component "Union All" (485) for 10445 milliseconds to limit the number of in-memory buffers.

Pipeline Limiter (Feb CTP/RTM) • No “Information” event is now raised externally • Everything sent to the log messages • User:PipelineComponentTime • Filter out “Message” attribute starting with “The Component%”

Demo Pipeline Limiter

Front of Shop Lookup Component Caching Options

The only way it works in 2005 • Caching options • Full • Before the Data Flow task really gets going (Initialize) the lookup components cache the entirety of their reference datasets • Partial • The lookup component will try for a match in cache first then go to disk. No pre-caching • None • Misnomer really. The last query result is cached but that’s it. Every other query is to disk.

The only way it works in 2005 • Full Cache is fastest but • Might take longest to cache • Could take large memory amount to cache • The reference datasets are not passed around • In a loop on the same task you read the reference dataset n times • Not transferrable across Data Flow tasks

The way it can work in 2008 • The Cache Transform • No transformation happens! • Can be used as a destination • Cache Connection Manager • File. Can be read by the Raw File adapter. • In Memory • Allow the reference dataset to be passed around like luggage

The way it can work in 2008 • Just because you can cache the dataset like this does not mean you have to • Default option is still to use the 2005 way.

Row Redirection • One of my main dislikes about this component in 2005 • If a lookup component gets no match then by default it fails • (Row yielded no match during lookup) • You have to configure the error output to handle this. • Redirect down the error output • Ignore (pass the looked up value as a NULL)

Row redirection • In 2008 we say hello to the “No Match Output”

Demo Caching options and redirection

Points to note • Lookup transform is still case sensitive when done in cache • When done on disk via OLE DB Connection Manager it is governed by their rules. • Watch out for different behaviour – Be careful out there.

Data Profiling Task • Control Flow Task • Profiles data in your database • Quick Setup or verbose • Loads of different metrics you can extract • Uses an ADO.Net Connection Manager • Only SQL2K and above can be profiled

Data Profiling Task • Fantastic for seeing how rubbish the data is • Useful for identifying distribution of values • Candidate keys in tables • Lengths of columns • Can be read externally or through a variable • XML whichever way you look at it

Demo Data Profiling Task

Change Data Capture (CDC) • Billed as an ETL feature • We’ll use it if it’s there • What is it? • How do I use it?

CDC-based ETL Traditional BULK ETL BULK BULK Real-time Updates Change Stream • Moves Entire Data Set • Requires ‘Window of Operation’ (hours to days) • Frequency/Latency – monthly, weekly, sometimes daily • Moves Only Changes to the Data • No ‘Window of Operation’ • Frequency/Latency – multiple times a day, real-time • Periodic or Continuous Change Flow Traditional ETL vs. CDC-ETL extract transform load Operational Data Sources ETL Engine DW

Why CDC is Cool • Increased efficiency of ETL Process • Incremental Extractions • Identify changed rows and columns • Identify operation on data, all changes or net changes • Log Based (Transactional Replication LR) • Lightweight

Why CDC is Cool • Viewable as net changes or complete journal • Key point is about the “Change” what you view about those changes is up to you • Differs from Change Capture in that you see the data • No excuses for not considering incremental extracts now.

Terminology around CDC • Capture Instance – Base Object (Max 2 per object) • Capture Process – Reading of the log and placing the rows into the change tables • Retention Periods

Demo Using Change Data Capture

Some of the things I did not cover • C# is now a scripting option

Questions?

Contact Details/Resources allan.mitchell@konesans.com www.SQLIS.com / www.SQLDTS.com www.SQLServerFAQ.com

Integration Services in SQL Server 2008