DataStage Enterprise Edition

Presentation Transcript
Proposed Course Agenda
Day 1

Review of EE Concepts

Sequential Access

Best Practices

DBMS as Source

Day 2

EE Architecture

Transforming Data

DBMS as Target

Sorting Data

Day 3

Combining Data

Configuration Files

Extending EE

Meta Data in EE

Day 4

Job Sequencing

Testing and Debugging

the course material
The Course Material

Course Manual

Exercise Files and Exercise Guide

Online Help

using the course material
Using the Course Material
  • Suggestions for learning
    • Take notes
    • Review previous material
    • Practice
    • Learn from errors
intro part 1

Intro Part 1

Introduction to DataStage EE

what is datastage
What is DataStage?
  • Design jobs for Extraction, Transformation, and Loading (ETL)
  • Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations
  • Import, export, create, and manage metadata for use within jobs
  • Schedule, run, and monitor jobs all within DataStage
  • Administer your DataStage development and execution environments
developing in datastage
Developing in DataStage
  • Define global and project properties in Administrator
  • Import meta data into Manager
  • Build job in Designer
  • Compile the job in Designer
  • Validate, run, and monitor in Director
quiz true or false
Quiz – True or False
  • DataStage Designer is used to build and compile your ETL jobs
  • Manager is used to execute your jobs after you build them
  • Director is used to execute your jobs after you build them
  • Administrator is used to set global and project properties
intro part 2

Intro Part 2

Configuring Projects

module objectives
Module Objectives
  • After this module you will be able to:
    • Explain how to create and delete projects
    • Set project properties in Administrator
    • Set EE global properties in Administrator
project properties
Project Properties
  • Projects can be created and deleted in Administrator
  • Project properties and defaults are set in Administrator
setting project properties
Setting Project Properties
  • To set project properties, log onto Administrator, select your project, and then click “Properties”
intro part 3

Intro Part 3

Managing Meta Data

Module Objectives
  • After this module you will be able to:
    • Describe the DataStage Manager components and functionality
    • Import and export DataStage objects
    • Import metadata for a sequential file
what is metadata
What Is Metadata?

Data

Source

Target

Transform

Meta Data

Meta Data

Meta Data

Repository

manager contents
Manager Contents
  • Metadata describing sources and targets: Table definitions
  • DataStage objects: jobs, routines, table definitions, etc.
import and export
Import and Export
  • Any object in Manager can be exported to a file
  • Can export whole projects
  • Use for backup
  • Sometimes used for version control
  • Can be used to move DataStage objects from one project to another
  • Use to share DataStage jobs and projects with other developers
export procedure
Export Procedure
  • In Manager, click “Export>DataStage Components”
  • Select DataStage objects for export
  • Specify the type of export: DSX or XML
  • Specify file path on client machine
Quiz: True or False?
  • You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.
Quiz: True or False?
  • The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
import procedure
Import Procedure
  • In Manager, click “Import>DataStage Components”
  • Select DataStage objects for import
exercise
Exercise
  • Import DataStage Component (table definition)
metadata import
Metadata Import
  • Import format and column definitions from sequential files
  • Import relational table column definitions
  • Imported as “Table Definitions”
  • Table definitions can be loaded into job stages
sequential file import procedure
Sequential File Import Procedure
  • In Manager, click Import>Table Definitions>Sequential File Definitions
  • Select directory containing sequential file and then the file
  • Select Manager category
  • Examine format and column definitions and edit if necessary
intro part 4

Intro Part 4

Designing and Documenting Jobs

Module Objectives
  • After this module you will be able to:
    • Describe what a DataStage job is
    • List the steps involved in creating a job
    • Describe links and stages
    • Identify the different types of stages
    • Design a simple extraction and load job
    • Compile your job
    • Create parameters to make your job flexible
    • Document your job
what is a job
What Is a Job?
  • Executable DataStage program
  • Created in DataStage Designer, but can use components from Manager
  • Built using a graphical user interface
  • Compiles into Orchestrate shell language (OSH)
job development overview
Job Development Overview
  • In Manager, import metadata defining sources and targets
  • In Designer, add stages defining data extractions and loads
  • Add Transformers and other stages to define data transformations
  • Add links defining the flow of data from sources to targets
  • Compile the job
  • In Director, validate, run, and monitor your job
designer toolbar
Designer Toolbar

Provides quick access to the main functions of Designer

Show/hide metadata markers

Job properties

Compile

adding stages and links
Adding Stages and Links
  • Stages can be dragged from the tools palette or from the stage type branch of the repository view
  • Links can be drawn from the tools palette or by right clicking and dragging from one stage to another
sequential file stage
Sequential File Stage
  • Used to extract data from, or load data to, a sequential file
  • Specify full path to the file
  • Specify a file format: fixed width or delimited
  • Specify column definitions
  • Specify write action
job creation example sequence
Job Creation Example Sequence
  • Brief walkthrough of procedure
  • Presumes meta data already loaded in repository
transformer stage
Transformer Stage
  • Used to define constraints, derivations, and column mappings
  • A column mapping maps an input column to an output column
  • In this module we will just define column mappings (no derivations)
adding job parameters
Adding Job Parameters
  • Makes the job more flexible
  • Parameters can be:
    • Used in constraints and derivations
    • Used in directory and file names
  • Parameter values are determined at run time
adding job documentation
Adding Job Documentation
  • Job Properties
    • Short and long descriptions
    • Shows in Manager
  • Annotation stage
    • Is a stage on the tool palette
    • Shows on the job GUI (work area)
intro part 5

Intro Part 5

Running Jobs

Module Objectives
  • After this module you will be able to:
    • Validate your job
    • Use DataStage Director to run your job
    • Set run options
    • Monitor your job’s progress
    • View job log messages
prerequisite to job execution
Prerequisite to Job Execution

Result from Designer compile

datastage director77
DataStage Director
  • Can schedule, validate, and run jobs
  • Can be invoked from DataStage Manager or Designer
    • Tools > Run Director
other director functions
Other Director Functions
  • Schedule job to run on a particular date/time
  • Clear job log
  • Set Director options
    • Row limits
    • Abort after x warnings
module 1

Module 1

DSEE – DataStage EE

Review

ascential s enterprise data integration platform

Ascential’s Enterprise Data Integration Platform

[Diagram: the platform spans DISCOVER (Data Profiling – gather relevant information for target enterprise applications), PREPARE (Data Quality – cleanse, correct and match input data), and TRANSFORM (Extract, Transform, Load – standardize and enrich data and load to targets), underpinned by Parallel Execution, Meta Data Management, and Command & Control. It moves data from ANY SOURCE to ANY TARGET: CRM, ERP, SCM, RDBMS, Legacy, Real-time, Client-server, Web services, Data Warehouse, BI/Analytics, and other applications.]
course objectives
Course Objectives
  • You will learn to:
    • Build DataStage EE jobs using complex logic
    • Utilize parallel processing techniques to increase job performance
    • Build custom stages based on application needs
  • Course emphasis is:
    • Advanced usage of DataStage EE
    • Application job development
    • Best practices techniques
Course Agenda
Day 1

Review of EE Concepts

Sequential Access

Standards

DBMS Access

Day 2

EE Architecture

Transforming Data

Sorting Data

Day 3

Combining Data

Configuration Files

Day 4

Extending EE

Meta Data Usage

Job Control

Testing

Module Objectives
  • Provide a background for completing work in the DSEE course
  • Tasks
    • Review concepts covered in DSEE Essentials course
  • Skip this module if you recently completed the DataStage EE essentials modules
review topics
Review Topics
  • DataStage architecture
  • DataStage client review
    • Administrator
    • Manager
    • Designer
    • Director
  • Parallel processing paradigm
  • DataStage Enterprise Edition (DSEE)
client server architecture

Client-Server Architecture

[Diagram: the DataStage clients – Manager, Designer, Director, Administrator – run on Microsoft® Windows NT/2000/XP and connect to the DataStage Server and Repository, which run on Microsoft® Windows NT or UNIX. The server provides Parallel Execution, Meta Data Management, and Command & Control, moving data from ANY SOURCE to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, Real-time, Client-server, Web services, Data Warehouse, and other applications) through the Discover, Prepare, Transform, and Extend phases (Extract, Cleanse, Transform, Integrate).]
process flow
Process Flow
  • Administrator – add/delete projects, set defaults
  • Manager – import meta data, backup projects
  • Designer – assemble jobs, compile, and execute
  • Director – execute jobs, examine job run logs
administrator project creation removal
Administrator – Project Creation/Removal

Functions specific to a project.

administrator project properties
Administrator – Project Properties

RCP for parallel jobs should be enabled

Variables for parallel processing

administrator environment variables
Administrator – Environment Variables

Variables are category specific

export objects to metastage
Export Objects to MetaStage

Push meta data to MetaStage

designer workspace
Designer Workspace

Can execute the job from Designer

datastage generated osh
DataStage Generated OSH

The EE Framework runs OSH

director executing jobs
Director – Executing Jobs

Messages from previous run in different color

stages
Stages

Can now customize the Designer’s palette

Select desired stages and drag to favorites

popular developer stages
Popular Developer Stages

Row generator

Peek

row generator
Row Generator
  • Can build test data

Edit row in column tab

Repeatable property

Peek
  • Displays field values
    • Will be displayed in job log or sent to a file
    • Skip records option
    • Can control number of records to be displayed
  • Can be used as stub stage for iterative development (more later)
why ee is so effective
Why EE is so Effective
  • Parallel processing paradigm
    • More hardware, faster processing
    • Level of parallelization is determined by a configuration file read at runtime
  • Emphasis on memory
    • Data is read into memory and lookups are performed as with a hash table
parallel processing systems

Parallel Processing Systems
  • DataStage EE enables parallel processing = executing your application on multiple CPUs simultaneously
    • If you add more resources (CPUs, RAM, and disks) you increase system performance
  • Example: a system containing 6 CPUs (or processing nodes) and disks
scaleable systems examples
Scalable Systems: Examples

Three main types of scalable systems

  • Symmetric Multiprocessors (SMP): shared memory and disk
  • Clusters: UNIX systems connected via networks
  • MPP: Massively Parallel Processing


smp shared everything


SMP: Shared Everything
  • Multiple CPUs with a single operating system
  • Programs communicate using shared memory
  • All CPUs share system resources (OS, memory with single linear address space, disks, I/O)

When used with Enterprise Edition:

  • Data transport uses shared memory
  • Simplified startup

Enterprise Edition treats NUMA (NonUniform Memory Access) as plain SMP

traditional batch processing

[Diagram: operational and archived source data is written to disk and read back between each step – Transform, Clean, Load – before reaching the data warehouse target.]

Traditional Batch Processing

Traditional approach to batch processing:

  • Write to disk and read from disk before each processing operation
  • Sub-optimal utilization of resources
    • a 10 GB stream leads to 70 GB of I/O
    • processing resources can sit idle during I/O
  • Very complex to manage (lots and lots of small jobs)
  • Becomes impractical with big data volumes
    • disk I/O consumes the processing
    • terabytes of disk required for temporary staging
pipeline multiprocessing
Pipeline Multiprocessing

Data Pipelining

  • Transform, clean and load processes are executing simultaneously on the same processor
  • rows are moving forward through the flow

[Diagram: Transform, Clean, and Load run concurrently as rows flow from the source (operational and archived data) to the data warehouse target.]

  • Start a downstream process while an upstream process is still running.
  • This eliminates intermediate storing to disk, which is critical for big data.
  • This also keeps the processors busy.
  • Still has limits on scalability

Think of a conveyor belt moving the rows from process to process!

partition parallelism

[Diagram: source data is split into four partitions (A-F, G-M, N-T, U-Z), one per node (Node 1 – Node 4), each running its own Transform.]

Partition Parallelism

Data Partitioning

  • Break up big data into partitions
  • Run one partition on each processor
  • 4X faster on 4 processors; with data big enough, 100X faster on 100 processors

  • This is exactly how the parallel databases work!
  • Data Partitioning requires applying the same transform to all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
combining parallelism types

Combining Parallelism Types

Putting It All Together: Parallel Dataflow

[Diagram: pipelining and partitioning combined – partitioned data flows from the source through Transform, Clean, and Load to the target.]

repartitioning

Repartitioning

Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly – Without Landing to Disk!

[Diagram: pipelined, partitioned data flowing from source through Transform, Clean, and Load to the target is repartitioned on the fly between stages – e.g., by customer last name (A-F, G-M, N-T, U-Z), customer zip code, or credit card number.]

ee program elements
EE Program Elements
  • Dataset: uniform set of rows in the Framework's internal representation

- Three flavors:

1. file sets (*.fs): stored on multiple Unix files as flat files

2. persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet Stage

3. virtual (*.v): links, in Framework format, NOT stored on disk

- The Framework processes only datasets – hence the possible need for Import

- Different datasets typically have different schemas

- Convention: "dataset" = Framework data set.

  • Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).

- All the partitions of a dataset follow the same schema: that of the dataset

datastage ee architecture

Orchestrate Framework:

Provides application scalability

DataStage:

Provides data integration platform

DataStage Enterprise Edition:

Best-of-breed scalable data integration platform

No limitations on data volumes or throughput

DataStage EE Architecture
introduction to datastage ee
Introduction to DataStage EE
  • DSEE:
    • Automatically scales to fit the machine
    • Handles data flow among multiple CPUs and disks
  • With DSEE you can:
    • Create applications for SMPs, clusters, and MPPs… Enterprise Edition is architecture-neutral
    • Access relational databases in parallel
    • Execute external applications in parallel
    • Store data across multiple disks and nodes
job design vs execution
Job Design vs. Execution

Developer assembles data flow using the Designer

…and gets: parallel access, propagation, transformation, and load.

The design is good for 1 node, 4 nodes,

or N nodes. To change # nodes, just swap configuration file.

No need to modify or recompile the design

partitioners and collectors
Partitioners and Collectors
  • Partitioners distribute rows into partitions
    • implement data-partition parallelism
  • Collectors = inverse partitioners
  • Live on input links of stages running
    • in parallel (partitioners)
    • sequentially (collectors)
  • Use a choice of methods
Exercise
  • Complete exercises 1-1, 1-2, and 1-3
module 2

Module 2

DSEE Sequential Access

Module Objectives
  • You will learn to:
    • Import sequential files into the EE Framework
    • Utilize parallel processing techniques to increase sequential file access
    • Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages
    • Manage partitioned data stored by the Framework
types of sequential data stages
Types of Sequential Data Stages
  • Sequential
    • Fixed or variable length
  • File Set
  • Lookup File Set
  • Data Set
sequential stage introduction
Sequential Stage Introduction
  • The EE Framework processes only datasets
  • For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations – this is performed by import and export OSH operators generated by Sequential or FileSet stages
  • During import or export DataStage performs format translations – into, or out of, the EE internal format
  • Data is described to the Framework in a schema
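For illustration, a minimal sketch of such a schema in the record syntax used later in this course (the column names and types here are hypothetical):

record (
  custid: int32;
  name: string;
  balance: dfloat;
)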
how the sequential stage works
How the Sequential Stage Works
  • Generates Import/Export operators, depending on whether stage is source or target
  • Performs direct C++ file I/O streams
using the sequential file stage
Using the Sequential File Stage

Both import and export of general files (text, binary) are performed by the Sequential File Stage.

  • Data import:
  • Data export

Importing/Exporting Data

EE internal format

EE internal format

working with flat files
Working With Flat Files
  • Sequential File Stage
    • Normally will execute in sequential mode
    • Can be parallel if reading multiple files (file pattern option)
    • Can use multiple readers within a node
    • DSEE needs to know
      • How file is divided into rows
      • How row is divided into columns
processes needed to import data
Processes Needed to Import Data
  • Recordization
    • Divides input stream into records
    • Set on the format tab
  • Columnization
    • Divides the record into columns
    • Default set on the format tab but can be overridden on the columns tab
    • Can be “incomplete” if using a schema or not even specified in the stage if using RCP
Sequential File Stage
  • To set the properties, use stage editor
    • Page (general, input/output)
    • Tabs (format, columns)
  • Sequential stage link rules
    • One input link
    • One output link (except for reject link definition)
    • One reject link
      • Will reject any records not matching meta data in the column definitions
general tab sequential source
General Tab – Sequential Source

Show records

Multiple output links

properties multiple files
Properties – Multiple Files

Click to add more files having the same meta data.

properties multiple readers
Properties - Multiple Readers

Multiple readers option allows you to set number of readers

format tab
Format Tab

Record into columns

File into records

reject link
Reject Link
  • Reject mode = output
  • Source
    • All records not matching the meta data (the column definitions)
  • Target
    • All records that are rejected for any reason
  • Meta data – one column, datatype = raw
file set stage
File Set Stage
  • Can read or write file sets
  • Files suffixed by .fs
  • File set consists of:
    • Descriptor file – contains location of raw data files + meta data
    • Individual raw data files
  • Can be processed in parallel
file set stage example
File Set Stage Example

Descriptor file

file set usage
File Set Usage
  • Why use a file set?
    • 2G limit on some file systems
    • Need to distribute data among nodes to prevent overruns
    • If used in parallel, runs faster than a sequential file
lookup file set stage
Lookup File Set Stage
  • Can create file sets
  • Usually used in conjunction with Lookup stages
lookup file set properties
Lookup File Set > Properties

Key column specified

Key column dropped in descriptor file

data set
Data Set
  • Operating system (Framework) file
  • Suffixed by .ds
  • Referred to by a control file
  • Managed by Data Set Management utility from GUI (Manager, Designer, Director)
  • Represents persistent data
  • Key to good performance in set of linked jobs
persistent datasets
Persistent Datasets
  • Accessed from/to disk with DataSet Stage.
  • Two parts:
    • Descriptor file:
      • contains metadata, data location, but NOT the data itself
    • Data file(s)
      • contain the data
      • multiple Unix files (one per node), accessible in parallel

input.ds

record ( partno: int32; description: string; )

node1:/local/disk1/…node2:/local/disk2/…


Quiz!

  • True or False?

Everything that has been data-partitioned must be collected in the same job

data set stage
Data Set Stage

Is the data partitioned?

engine data translation
Engine Data Translation
  • Occurs on import
    • From sequential files or file sets
    • From RDBMS
  • Occurs on export
    • From datasets to file sets or sequential files
    • From datasets to RDBMS
  • Engine is most efficient when processing internally formatted records (i.e., data contained in datasets)
managing datasets
Managing DataSets
  • GUI (Manager, Designer, Director) – tools > data set management
  • Alternative methods
    • Orchadmin
      • Unix command line utility
      • List records
      • Remove data sets (will remove all components)
    • Dsrecords
      • Lists number of records in a dataset
data set management
Data Set Management

Display data

Schema

data set management from unix
Data Set Management From Unix
  • Alternative method of managing file sets and data sets
    • Dsrecords
      • Gives record count
        • Unix command-line utility
        • $ dsrecords ds_name

e.g., $ dsrecords myDS.ds

156999 records

    • Orchadmin
      • Manages EE persistent data sets
        • Unix command-line utility

e.g., $ orchadmin rm myDataSet.ds
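A hedged shell sketch combining the two utilities (the data set name is hypothetical):

$ dsrecords customer.ds      # report the number of records in the data set
$ orchadmin rm customer.ds   # remove the descriptor file and all data files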

Exercise
  • Complete exercises 2-1, 2-2, 2-3, and 2-4.
module 3

Module 3

Standards and Techniques

objectives
Objectives
  • Establish standard techniques for DSEE development
  • Will cover:
    • Job documentation
    • Naming conventions for jobs, links, and stages
    • Iterative job design
    • Useful stages for job development
    • Using configuration files for development
    • Using environmental variables
    • Job parameters
job presentation
Job Presentation

Document using the annotation stage

Job Properties Documentation

Organize jobs into categories

Description shows in DS Manager and MetaStage

naming conventions
Naming conventions
  • Stages named after the
    • Data they access
    • Function they perform
    • DO NOT leave defaulted stage names like Sequential_File_0
  • Links named for the data they carry
    • DO NOT leave defaulted link names like DSLink3
stage and link names
Stage and Link Names

Stages and links renamed to data they handle

create reusable job components
Create Reusable Job Components
  • Use Enterprise Edition shared containers when feasible

Container

use iterative job design
Use Iterative Job Design
  • Use copy or peek stage as stub
  • Test job in phases – small first, then increasing in complexity
  • Use Peek stage to examine records
transformer stage techniques
Transformer Stage Techniques
  • Suggestions -
    • Always include reject link.
    • Always test for null value before using a column in a function.
    • Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.
    • Be aware of Column and Stage variable Data Types.
      • Often user does not pay attention to Stage Variable type.
    • Avoid type conversions.
      • Try to maintain the data type as imported.
the copy stage
The Copy Stage

With 1 link in, 1 link out:

the Copy Stage is the ultimate "no-op" (place-holder):

  • Partitioners
  • Sort / Remove Duplicates
  • Rename, Drop column

… can be inserted on:

  • input link (Partitioning page): Partitioners, Sort, Remove Duplicates
  • output link (Mapping page): Rename, Drop.

Sometimes replaces the Transformer:

  • Rename,
  • Drop,
  • Implicit type Conversions
  • Link Constraint – break up schema
developing jobs
Developing Jobs
  • Keep it simple
    • Jobs with many stages are hard to debug and maintain.
  • Start small and Build to final Solution
    • Use view data, copy, and peek.
    • Start from source and work out.
    • Develop with a 1 node configuration file.
  • Solve the business problem before the performance problem.
    • Don’t worry too much about partitioning until the sequential flow works as expected.
  • If you have to write to Disk use a Persistent Data set.
good things to have in each job
Good Things to Have in each Job
  • Use job parameters
  • Some helpful environmental variables to add to job parameters
    • $APT_DUMP_SCORE
      • Report OSH to message log
    • $APT_CONFIG_FILE
      • Establishes runtime parameters for the EE engine, e.g., the degree of parallelization
setting job parameters
Setting Job Parameters

Click to add environment variables

dump score output
DUMP SCORE Output

Setting APT_DUMP_SCORE yields:

[Screenshot callouts: double-click to expand; shows the partitioner and collector, and the mapping of nodes to partitions.]

use multiple configuration files
Use Multiple Configuration Files
  • Make a set for 1X, 2X,….
  • Use different ones for test versus production
  • Include as a parameter in each job
Exercise
  • Complete exercise 3-1
module 4

Module 4

DBMS Access

Objectives
  • Understand how DSEE reads and writes records to an RDBMS
  • Understand how to handle nulls on DBMS lookup
  • Utilize this knowledge to:
    • Read and write database tables
    • Use database tables to lookup data
    • Use null handling options to clean data
parallel database connectivity
Parallel Database Connectivity

Traditional Client-Server
  • Only the RDBMS is running in parallel
  • Each application (client) has only one connection
  • Suitable only for small data volumes

Enterprise Edition
  • Parallel server runs APPLICATIONS
  • Application has parallel connections to RDBMS
  • Suitable for large data volumes
  • Higher levels of integration possible
rdbms access supported databases
RDBMS Access: Supported Databases

Enterprise Edition provides high performance / scalable interfaces for:

  • DB2
  • Informix
  • Oracle
  • Teradata
rdbms access
RDBMS Access
  • Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions
  • RDBMS nulls converted to/from nullable field values
  • Support for standard SQL syntax for specifying:
    • field list for SELECT statement
    • filter for WHERE clause
  • Can write an explicit SQL query to access RDBMS
  • EE supplies additional information in the SQL query
rdbms stages
DB2/UDB Enterprise

Informix Enterprise

Oracle Enterprise

Teradata Enterprise

RDBMS Stages
rdbms usage
RDBMS Usage
  • As a source
    • Extract data from table (stream link)
        • Extract as table, generated SQL, or user-defined SQL
        • User-defined can perform joins, access views
    • Lookup (reference link)
        • Normal lookup is memory-based (all table data read into memory)
        • Can perform one lookup at a time in DBMS (sparse option)
        • Continue/drop/fail options
  • As a target
    • Inserts
    • Upserts (Inserts and updates)
    • Loader
dbms source user defined sql
DBMS Source - User-defined SQL

Columns in SQL statement must match the meta data in columns tab

Exercise
  • User-defined SQL
    • Exercise 4-1
lookup reject link
Lookup Reject Link

“Output” option automatically creates the reject link

null handling
Null Handling
  • Must handle null condition if lookup record is not found and “continue” option is chosen
  • Can be done in a transformer stage
lookup stage properties
Lookup Stage Properties

Reference link

Must have same column name in input and reference links. You will get the results of the lookup in the output column.

dbms as target
DBMS As Target
  • Write Methods
    • Delete
    • Load
    • Upsert
    • Write (DB2)
  • Write mode for load method
    • Truncate
    • Create
    • Replace
    • Append
target properties
Target Properties

Generated code can be copied

Upsert mode determines options

checking for nulls
Checking for Nulls
  • Use Transformer stage to test for fields with null values (Use IsNull functions)
  • In Transformer, can reject or load default value
Exercise
  • Complete exercise 4-2
module 5

Module 5

Platform Architecture

Objectives
  • Understand how Enterprise Edition Framework processes data
  • You will be able to:
    • Read and understand OSH
    • Perform troubleshooting
concepts
Concepts
  • The Enterprise Edition Platform
    • Script language - OSH (generated by DataStage Parallel Canvas, and run by DataStage Director)
    • Communication - conductor, section leaders, players.
    • Configuration files (only one active at a time, describes H/W)
    • Meta data - schemas/tables
    • Schema propagation - RCP
    • EE extensibility - Buildop, Wrapper
    • Datasets (data in Framework's internal representation)
ds ee stage elements
DS-EE Stage Elements

EE Stages Involve A Series Of Processing Steps

  • Piece of Application Logic Running Against Individual Records
  • Parallel or Sequential

Input Data Set schema: prov_num:int16; member_num:int8; custid:int32;

Output Data Set schema: prov_num:int16; member_num:int8; custid:int32;

[Diagram: an EE Stage consists of a Partitioner, an input interface, the business logic, and an output interface.]

dsee stage execution
DSEE Stage Execution

Dual Parallelism Eliminates Bottlenecks!

  • EE Delivers Parallelism in Two Ways
    • Pipeline
    • Partition
  • Block Buffering Between Components
    • Eliminates Need for Program Load Balancing
    • Maintains Orderly Data Flow

Producer

Pipeline

Consumer

Partition

stages control partition parallelism
Stages Control Partition Parallelism
  • Execution Mode (sequential/parallel) is controlled by Stage
    • default = parallel for most Ascential-supplied Stages
    • Developer can override default mode
    • Parallel Stage inserts the default partitioner (Auto) on its input links
    • Sequential Stage inserts the default collector (Auto) on its input links
    • Developer can override default
      • execution mode (parallel/sequential) of Stage > Advanced tab
      • choice of partitioner/collector on Input > Partitioning tab
how parallel is it
How Parallel Is It?
  • Degree of parallelism is determined by the configuration file
    • Total number of logical nodes in default pool, or a subset if using "constraints".
      • Constraints are assigned to specific pools as defined in configuration file and can be referenced in the stage
OSH
  • DataStage EE GUI generates OSH scripts
    • Ability to view OSH turned on in Administrator
    • OSH can be viewed in Designer using job properties
  • The Framework executes OSH
  • What is OSH?
    • Orchestrate shell
    • Has a UNIX command-line interface
osh script
OSH Script
  • An osh script is a quoted string which specifies:
    • The operators and connections of a single Orchestrate step
    • In its simplest form, it is:

osh “op < in.ds > out.ds”

  • Where:
    • op is an Orchestrate operator
    • in.ds is the input data set
    • out.ds is the output data set
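Building on that form, a hedged sketch of a step with two operators, where the data passed from op1 to op2 travels over a virtual dataset (a link) instead of landing to disk (op1 and op2 stand for any Orchestrate operators):

osh "op1 < in.ds | op2 > out.ds"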
osh operators
OSH Operators
  • OSH Operator is an instance of a C++ class inheriting from APT_Operator
  • Developers can create new operators
  • Examples of existing operators:
    • Import
    • Export
    • RemoveDups
enable visible osh in administrator
Enable Visible OSH in Administrator

Will be enabled for all projects

view osh in designer
View OSH in Designer

Operator

Schema

osh practice
OSH Practice
  • Exercise 5-1 – Instructor demo (optional)
elements of a framework program
Elements of a Framework Program
  • Operators
  • Datasets: set of rows processed by Framework
    • Orchestrate data sets:
      • persistent (terminal) *.ds, and
      • virtual (internal) *.v.
      • Also: flat “file sets” *.fs
  • Schema: data description (metadata) for datasets and links.
datasets
Datasets

  • Consist of Partitioned Data and Schema
  • Can be Persistent (*.ds) or Virtual (*.v, Link)
  • Overcome 2 GB File Limit

What you program (GUI or OSH):

$ osh "operator_A > x.ds"

What gets processed: one instance of Operator A runs on each node (Node 1 … Node 4).

What gets generated: the data files of x.ds – multiple files per partition, each file up to 2 GBytes (or larger).

computing architectures definition
Computing Architectures: Definition

Uniprocessor (dedicated disk)
  • PC
  • Workstation
  • Single processor server

SMP System (Symmetric Multiprocessor – shared memory, shared disk)
  • IBM, Sun, HP, Compaq
  • 2 to 64 processors
  • Majority of installations

Clusters and MPP Systems (shared nothing)
  • 2 to hundreds of processors
  • MPP: IBM and NCR Teradata
  • each node is a uniprocessor or SMP
job execution orchestrate

[Diagram: the Conductor (C) runs on the conductor node; a Section Leader (SL) runs on each processing node; Player (P) processes run under each Section Leader.]

Job Execution: Orchestrate
  • Conductor - initial DS/EE process
    • Step Composer
    • Creates Section Leader processes (one/node)
    • Consolidates messages, outputs them
    • Manages orderly shutdown.
  • Section Leader
    • Forks Players processes (one/Stage)
    • Manages up/down communication.
  • Players
    • The actual processes associated with Stages
    • Combined players: one process only
    • Send stderr to SL
    • Establish connections to other players for data flow
    • Clean up upon completion.


  • Communication:

- SMP: Shared Memory

- MPP: TCP

working with configuration files
Working with Configuration Files
  • You can easily switch between config files:
      • '1-node' file - for sequential execution, lighter reports—handy for testing
      • 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
      • 'BigN-nodes' file - aims at full data-partitioned parallelism
  • Only one file is active while a step is running
      • The Framework queries (first) the environment variable:

$APT_CONFIG_FILE

  • # nodes declared in the config file need not match # CPUs
      • Same configuration file can be used in development and target machines
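A hedged shell sketch of switching configuration files between runs (the paths and file names are hypothetical):

$ export APT_CONFIG_FILE=/opt/dsee/configs/1node.apt   # sequential execution for testing
$ export APT_CONFIG_FILE=/opt/dsee/configs/4node.apt   # data-partitioned run; no recompile needed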
scheduling nodes processes and cpus

Nodes = # logical nodes declared in config. file

Ops = # ops. (approx. # blue boxes in V.O.)

Processes = # Unix processes

CPUs = # available CPUs

Scheduling Nodes, Processes, and CPUs
  • DS/EE does not:
    • know how many CPUs are available
    • schedule
  • Who knows what?
  • Who does what?
    • DS/EE creates (Nodes*Ops) Unix processes
    • The O/S schedules these processes on the CPUs
configuring dsee node pools
Configuring DSEE – Node Pools

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

configuring dsee disk pools
Configuring DSEE – Disk Pools

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

re partitioning
Re-Partitioning

Parallel to parallel flow may incur reshuffling:

Records may jump between nodes

[Diagram: a partitioner redistributes records between node 1 and node 2.]

partitioning methods
Partitioning Methods
  • Auto
  • Hash
  • Entire
  • Range
  • Range Map
collectors
Collectors
  • Collectors combine partitions of a dataset into a single input stream to a sequential Stage

[Diagram: multiple data partitions feed a collector, which feeds a sequential Stage.]

  • Collectors do NOT synchronize data
reading messages in director
Reading Messages in Director
  • Set APT_DUMP_SCORE to true
  • Can be specified as job parameter
  • Messages sent to Director log
  • If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
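A hedged shell sketch of turning the report on from the environment (in practice it is usually added as a job parameter, as noted above):

$ export APT_DUMP_SCORE=True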
Exercise
  • Complete exercise 5-2
module 6

Module 6

Transforming Data

Module Objectives
  • Understand ways DataStage allows you to transform data
  • Use this understanding to:
    • Create column derivations using user-defined code or system functions
    • Filter records based on business criteria
    • Control data flow based on data conditions
transformed data
Transformed Data
  • Transformed data is:
    • An outgoing column whose derivation may, or may not, include incoming fields or parts of incoming fields
    • May be comprised of system variables
  • Frequently uses functions performed on something (i.e., incoming columns)
    • Divided into categories – e.g.:
      • Date and time
      • Mathematical
      • Logical
      • Null handling
      • More
stages review
Stages Review
  • Stages that can transform data
    • Transformer
      • Parallel
      • Basic (from Parallel palette)
    • Aggregator (discussed in later module)
  • Sample stages that do not transform data
    • Sequential
    • FileSet
    • DataSet
    • DBMS
transformer stage functions
Transformer Stage Functions
  • Control data flow
  • Create derivations
flow control
Flow Control
  • Separate records flow down links based on data condition – specified in Transformer stage constraints
  • Transformer stage can filter records
  • Other stages can filter records but do not exhibit advanced flow control
    • Sequential can send bad records down reject link
    • Lookup can reject records based on lookup failure
    • Filter can select records based on data value
rejecting data
Rejecting Data
  • Reject option on sequential stage
    • Data does not agree with meta data
    • Output consists of one column with binary data type
  • Reject links (from Lookup stage) result from the drop option of the property “If Not Found”
    • Lookup “failed”
    • All columns on reject link (no column mapping option)
  • Reject constraints are controlled from the constraint editor of the transformer
    • Can control column mapping
    • Use the “Other/Log” checkbox
rejecting data example
Rejecting Data Example

Constraint – Other/log option

“If Not Found” property

Property Reject Mode = Output

transformer stage variables
Transformer Stage Variables
  • First of transformer stage entities to execute
  • Execute in order from top to bottom
    • Can write a program by using one stage variable to point to the results of a previous stage variable
  • Multi-purpose
    • Counters
    • Hold values for previous rows to make comparison
    • Hold derivations to be used in multiple field derivations
    • Can be used to control execution of constraints
stage variables
Stage Variables

Show/Hide button

transforming data
Transforming Data
  • Derivations
    • Using expressions
    • Using functions
      • Date/time
  • Transformer Stage Issues
    • Sometimes requires sorting before the transformer stage – e.g., using a stage variable as an accumulator and needing to break on a change of column value
  • Checking for nulls
Checking for Nulls
  • Nulls can get introduced into the dataflow because of failed lookups and the way in which you chose to handle this condition
  • Can be handled in constraints, derivations, stage variables, or a combination of these
transformer handling rejects
Transformer - Handling Rejects

Constraint Rejects

  • All expressions are false and reject row is checked
transformer execution order
Transformer: Execution Order
  • Derivations in stage variables are executed first
  • Constraints are executed before derivations
  • Column derivations in earlier links are executed before later links
  • Derivations in higher columns are executed before lower columns
parallel palette two transformers
All > Processing > Transformer
  • Is the non-Universe transformer
  • Has a specific set of functions
  • No DS routines available

Parallel > Processing > Basic Transformer
  • Makes server style transforms available on the parallel palette
  • Can use DS routines

Parallel Palette - Two Transformers
  • Program in Basic for both transformers
transformer functions from derivation editor
Transformer Functions From Derivation Editor
  • Date & Time
  • Logical
  • Null Handling
  • Number
  • String
  • Type Conversion
Exercise
  • Complete exercises 6-1, 6-2, and 6-3
module 7

Module 7

Sorting Data

Objectives
  • Understand DataStage EE sorting options
  • Use this understanding to create a sorted list of data to enable functionality within a transformer stage
sorting data
Sorting Data
  • Important because
    • Some stages require sorted input
    • Some stages may run faster – e.g., Aggregator
  • Can be performed
    • Option within stages (use input > partitioning tab and set partitioning to anything other than auto)
    • As a separate stage (more complex sorts)
sorting alternatives
Sorting Alternatives
  • Alternative representation of same flow:
sort utility
Sort Utility
  • DataStage – the default
  • UNIX
sort stage outputs
Sort Stage - Outputs
  • Specifies how the output is derived
sort specification options
Sort Specification Options
  • Input Link Property
    • Limited functionality
    • Max memory/partition is 20 MB, then spills to scratch
  • Sort Stage
    • Tunable to use more memory before spilling to scratch.
  • Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
removing duplicates
Removing Duplicates
  • Can be done by Sort stage
    • Use unique option

OR

  • Remove Duplicates stage
    • Has more sophisticated ways to remove duplicates
Exercise
  • Complete exercise 7-1
module 8

Module 8

Combining Data

Objectives
  • Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages
  • Use this understanding to create jobs that will
    • Combine data from separate input streams
    • Aggregate data to form summary totals
combining data
Combining Data
  • There are two ways to combine data:
    • Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g.,
      • Joins
      • Lookup
      • Merge
    • Vertically: One input link, one output link with column combining values from all input rows. E.g.,
      • Aggregator
join lookup merge stages
Join, Lookup & Merge Stages
  • These "three Stages" combine two or more input links according to values of user-designated "key" column(s).
  • They differ mainly in:
    • Memory usage
    • Treatment of rows with unmatched key values
    • Input requirements (sorted, de-duplicated)
not all links are created equal
Not all Links are Created Equal
  • Enterprise Edition distinguishes between:

- The Primary Input (Framework port 0)

- Secondary - in some cases "Reference" (other ports)

  • Naming convention:

Tip:

Check "Input Ordering" tab to make sure intended Primary is listed first

join stage editor
Join Stage Editor

Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)

One of four variants:

  • Inner
  • Left Outer
  • Right Outer
  • Full Outer

Several key columns allowed

1 the join stage
1. The Join Stage

Four types:

  • 2 sorted input links, 1 output link
    • "left outer" on primary input, "right outer" on secondary input
    • Pre-sort makes joins "lightweight": few rows need to be in RAM
  • Inner
  • Left Outer
  • Right Outer
  • Full Outer
2 the lookup stage
2. The Lookup Stage

Combines:

  • one source link with
  • one or more duplicate-free table links

no pre-sort necessary

allows multiple-key LUTs

flexible exception handling for source input rows with no match

[Diagram: the source input (port 0) and one or more lookup tables (LUTs, ports 1, 2, …) feed the Lookup stage, which produces an output link and a reject link.]

the lookup stage
The Lookup Stage
  • Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging)
  • On an MPP you should partition the lookup tables using entire partitioning method, or partition them the same way you partition the source link
  • On an SMP, no physical duplication of a Lookup Table occurs
The Lookup Stage
  • Lookup File Set
    • Like a persistent data set only it contains metadata about the key.
    • Useful for staging lookup tables
  • RDBMS LOOKUP
    • NORMAL
      • Loads to an in memory hash table first
    • SPARSE
      • Select for each row.
      • Might become a performance bottleneck.
3 the merge stage
3. The Merge Stage
  • Combines
    • one sorted, duplicate-free master (primary) link with
    • one or more sorted update (secondary) links.
    • Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).
  • Follows the Master-Update model:
    • Master row and one or more update rows are merged if they have the same value in user-specified key column(s).
    • A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored)
    • Unmatched ("Bad") master rows can be either
      • kept
      • dropped
    • Unmatched ("Bad") update rows in input link can be captured in a "reject" link
    • Matched update rows are consumed.
the merge stage
The Merge Stage

Allows composite keys

Multiple update links

Matched update rows are consumed

Unmatched updates can be captured

Lightweight

Space/time tradeoff: presorts vs. in-RAM table

[Diagram: a Master link (port 0) and one or more update links (ports 1, 2, …) feed the Merge stage, which produces an output link and reject links for unmatched update rows.]


Synopsis:Joins, Lookup, & Merge

In this table:

  • , <comma> = separator between primary and secondary input links

(out and reject links)

the aggregator stage
The Aggregator Stage

Purpose: Perform data aggregations

Specify:

  • Zero or more key columns that define the aggregation units (or groups)
  • Columns to be aggregated
  • Aggregation functions:

count (nulls/non-nulls), sum, max/min/range

  • The grouping method (hash table or pre-sort) is a performance issue
grouping methods
Grouping Methods
  • Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
    • doesn’t require sorted data
    • good when the number of unique groups is small; the running tally for each group’s aggregate calculations needs to fit easily into memory (requires about 1 KB/group of RAM)
    • Example: average family income by state requires about 0.05 MB of RAM (50 groups × ~1 KB)
  • Sort: results for only a single aggregation group are kept in memory; when new group is seen (key value changes), current group written out.
    • requires input sorted by grouping keys
    • can handle unlimited numbers of groups
    • Example: average daily balance by credit card
aggregator functions
Aggregator Functions
  • Sum
  • Min, max
  • Mean
  • Missing value count
  • Non-missing value count
  • Percent coefficient of variation
aggregation types
Aggregation Types

Aggregation types

containers
Containers
  • Two varieties
    • Local
    • Shared
  • Local
    • Simplifies a large, complex diagram
  • Shared
    • Creates reusable object that many jobs can include
creating a container
Creating a Container
  • Create a job
  • Select (loop) portions to containerize
  • Edit > Construct container > local or shared
using a container
Using a Container
  • Select as though it were a stage
Exercise
  • Complete exercise 8-1
module 9

Module 9

Configuration Files

Objectives
  • Understand how DataStage EE uses configuration files to determine parallel behavior
  • Use this understanding to
    • Build an EE configuration file for a computer system
    • Change node configurations to support adding resources to processes that need them
    • Create a job that will change resource allocations at the stage level
configuration file concepts
Configuration File Concepts
  • Determine the processing nodes and disk space connected to each node
  • When system changes, need only change the configuration file – no need to recompile jobs
  • When DataStage job runs, platform reads configuration file
    • Platform automatically scales the application to fit the system
processing nodes are
Processing Nodes Are
  • Locations on which the framework runs applications
  • Logical rather than physical construct
  • Do not necessarily correspond to the number of CPUs in your system
    • Typically one node for two CPUs
  • Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node
optimizing parallelism
Optimizing Parallelism
  • Degree of parallelism determined by number of nodes defined
  • Parallelism should be optimized, not maximized
    • Increasing parallelism distributes work load but also increases Framework overhead
  • Hardware influences degree of parallelism possible
  • System hardware partially determines configuration
more factors to consider
More Factors to Consider
  • Communication amongst operators
    • Should be optimized by your configuration
    • Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link
  • SMP – leave some processors for operating system
  • Desirable to equalize partitioning of data
  • Use an experimental approach
    • Start with small data sets
    • Try different parallelism while scaling up data set sizes
factors affecting optimal degree of parallelism
Factors Affecting Optimal Degree of Parallelism
  • CPU intensive applications
    • Benefit from the greatest possible parallelism
  • Applications that are disk intensive
    • Number of logical nodes equals the number of disk spindles being accessed
configuration file
Configuration File
  • Text file containing string data that is passed to the Framework
    • Sits on server side
    • Can be displayed and edited
  • Name and location found in environmental variable APT_CONFIG_FILE
  • Components
    • Node
    • Fast name
    • Pools
    • Resource
node options
Node Options
  • Node name – name of a processing node used by EE
    • Typically the network name
    • Use the command uname -n to obtain the network name
  • Fastname –
    • Name of node as referred to by fastest network in the system
    • Operators use physical node name to open connections
    • NOTE: for SMP, all CPUs share single connection to network
  • Pools
    • Names of pools to which this node is assigned
    • Used to logically group nodes
    • Can also be used to group resources
  • Resource
    • Disk
    • Scratchdisk
sample configuration file
Sample Configuration File

{
  node "Node1" {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}

disk pools
Disk pools allocate storage

By default, EE uses the default pool, specified by “”

Disk Pools

pool "bigdata"

sorting requirements
Sorting Requirements

Resource pools can also be specified for sorting:

  • The Sort stage looks first for scratch disk resources in a “sort” pool, and then in the default disk pool
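For illustration, a node entry in the same style as the configuration files shown in this course, with a scratch disk reserved for the "sort" pool (the paths are hypothetical):

node "n1" {
  fastname "s1"
  pool "" "n1" "s1" "sort"
  resource disk "/data/n1/d1" {}
  resource scratchdisk "/scratch_sort" {"sort"}
  resource scratchdisk "/scratch" {}
}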
another configuration file example

Another Configuration File Example

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}

resource types
Resource Types
  • Disk
  • Scratchdisk
  • DB2
  • Oracle
  • Saswork
  • Sortwork
  • Can exist in a pool
    • Groups resources together
using different configurations
Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type

building a configuration file
Building a Configuration File
  • Scoping the hardware:
    • Is the hardware configuration SMP, Cluster, or MPP?
    • Define each node structure (an SMP would be single node):
      • Number of CPUs
      • CPU speed
      • Available memory
      • Available page/swap space
      • Connectivity (network/back-panel speed)
    • Is the machine dedicated to EE? If not, what other applications are running on it?
    • Get a breakdown of the resource usage (vmstat, mpstat, iostat)
    • Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?
Exercise
  • Complete exercise 9-1 and 9-2
module 10

Module 10

Extending DataStage EE

Objectives
  • Understand the methods by which you can add functionality to EE
  • Use this understanding to:
    • Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages
    • Build a DataStage EE job that uses the new stage
ee extensibility overview
EE Extensibility Overview

Sometimes it will be to your advantage to leverage EE’s extensibility. This extensibility includes:

  • Wrappers
  • Buildops
  • Custom Stages
when to leverage ee extensibility
When To Leverage EE Extensibility

Types of situations:

Complex business logic, not easily accomplished using standard EE stages

Reuse of existing C, C++, Java, COBOL, etc…

wrappers vs buildop vs custom
Wrappers vs. Buildop vs. Custom
  • Wrappers are good if you cannot or do not want to modify the application and performance is not critical.
  • Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces.
  • Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.
building wrapped stages
Building “Wrapped” Stages

You can “wrapper” a legacy executable:

  • Binary
  • Unix command
  • Shell script

… and turn it into an Enterprise Edition stage capable, among other things, of parallel execution…

As long as the legacy executable is:

  • amenable to data-partition parallelism
      • no dependencies between rows
  • pipe-safe
      • can read rows sequentially
      • no random access to data
wrappers cont d
Wrappers (Cont’d)

Wrappers are treated as a black box

  • EE has no knowledge of contents
  • EE has no means of managing anything that occurs inside the wrapper
  • EE only knows how to export data to and import data from the wrapper
  • User must know at design time the intended behavior of the wrapper and its schema interface
  • If the wrappered application needs to see all records prior to processing, it cannot run in parallel.
ls example
LS Example
  • Can this command be wrappered?
creating a wrapper
Creating a Wrapper

To create the "ls" stage used in this job:

wrapper starting point
Wrapper Starting Point

Creating Wrapped Stages

From Manager:

Right-Click on Stage Type

> New Parallel Stage > Wrapped

We will "Wrapper” an existing Unix executables – the ls command

wrapper general page
Wrapper - General Page

Name of stage

Unix command to be wrapped

the creator page
The "Creator" Page

Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.

wrapper properties page
Wrapper – Properties Page
  • If your stage will have properties, complete the Properties page

This will be the name of the property as it appears in your stage

wrapper wrapped page
Wrapper - Wrapped Page

Interfaces – input and output columns – should first be entered into the table definitions meta data (DS Manager); let's do that now.

interface schemas
Interface schemas
  • Layout interfaces describe what columns the stage:
    • Needs for its inputs (if any)
    • Creates for its outputs (if any)
  • These interfaces should be created as tables with columns in Manager
how does the wrapping work
How Does the Wrapping Work?
  • Define the schema for export and import
    • Schemas become interface schemas of the operator and allow for by-name column access

  • Data flow: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema

QUIZ: Why does export precede import?

update the wrapper interfaces
Update the Wrapper Interfaces
  • This wrapper will have no input interface – i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
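For example (the column name and length here are hypothetical, not necessarily what the exercise files use), the output interface could be a table definition with a single string column that receives each line the ls command writes to standard output:

record
(
  filename: string[max=255];
)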
resulting job
Resulting Job

Wrapped stage

job run
Job Run
  • Show file from Designer palette
wrapper story cobol application
Wrapper Story: Cobol Application
  • Hardware Environment:
    • IBM SP2, 2 nodes with 4 CPUs per node.
  • Software:
    • DB2/EEE, COBOL, EE
  • Original COBOL Application:
    • Extracted the source table, performed a lookup against a table in DB2, and loaded the results to the target table.
    • 4 hours 20 minutes sequential execution
  • Enterprise Edition Solution:
    • Used EE to perform Parallel DB2 Extracts and Loads
    • Used EE to execute COBOL application in Parallel
    • EE Framework handled data transfer between DB2/EEE and COBOL application
    • 30 minutes 8-way parallel execution
buildops
Buildops

Buildop provides a simple means of extending beyond the functionality provided by EE but, unlike the wrapper, does not use an existing executable

Reasons to use Buildop include:

  • Speed / Performance
  • Complex business logic that cannot be easily represented using existing stages
    • Lookups across a range of values
    • Surrogate key generation
    • Rolling aggregates
  • Build once, reusable everywhere within the project – no shared container necessary
  • Can combine functionality from different stages into one
buildops310
BuildOps
  • The DataStage programmer encapsulates the business logic
  • The Enterprise Edition interface called "buildop" automatically performs the tedious, error-prone tasks: invoking the needed header files and building the necessary "plumbing" for correct and efficient parallel execution.
  • Exploits extensibility of EE Framework
buildop process overview
BuildOp Process Overview

From Manager (or Designer):

Repository pane:

Right-Click on Stage Type

> New Parallel Stage > {Custom | Build | Wrapped}

  • "Build" stages from within Enterprise Edition
  • "Wrapping” existing “Unix” executables
general page
General Page

Identical to Wrappers, except: under the Build tab, your program!

logic tab for business logic
Logic Tab for Business Logic

Enter business C/C++ logic and arithmetic in the four pages under the Logic tab

The main code section goes in the Per-Record page – it is applied to every row

NOTE: Code must be ANSI C/C++ compliant. If the code does not compile outside of EE, it won't compile within EE either!

code sections under logic tab
Code Sections under Logic Tab

  • Definitions page: temporary variables are declared [and initialized] here
  • Pre-Loop page: logic here is executed once BEFORE processing the FIRST row
  • Per-Record page: the main code, executed once for EVERY row
  • Post-Loop page: logic here is executed once AFTER processing the LAST row
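Conceptually, the stage runs these sections around a loop over the input rows. The sketch below is plain C++ pseudocode of that control flow, not literal buildop-generated code (the row supply is faked with a placeholder so the example is self-contained and compiles on its own):

#include <iostream>

int main()
{
    // Definitions / Pre-Loop: temporary variables declared and initialized once,
    // before the first row is processed
    long rowCount = 0;

    // Per-Record: executed once for every row ("Auto Read" supplies the next row)
    bool moreRows = true;       // placeholder for the framework's row supply
    while (moreRows) {
        ++rowCount;             // your per-row business logic goes here
        moreRows = false;       // placeholder so this sketch terminates
    }

    // Post-Loop: executed once after the last row has been processed
    std::cout << "Processed " << rowCount << " rows" << std::endl;
    return 0;
}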

i o and transfer
I/O and Transfer

Under the Interface tab there are Input, Output, and Transfer pages:

  • Output page: the first line is output 0; its columns come from an in-repository Table Definition; each record processed writes a row
  • Input page: 'Auto Read' reads the next row for each record
  • A 'False' setting is used so as not to interfere with the Transfer page
  • The output port can optionally be renamed from its default name "out0"

i o and transfer316
I/O and Transfer

  • First line: a transfer of index 0 – transfer all columns from input to output
  • If the page is left blank, or Auto Transfer = "False" (and RCP = "False"), only the columns in the output Table Definition are written
buildop simple example

BuildOp Simple Example
  • Example – sumNoTransfer
    • Add input columns "a" and "b"; ignore other columns that might be present in the input
    • Produce a new "sum" column
    • Do not transfer input columns
  • Input interface schema: a:int32; b:int32
  • Output interface schema: sum:int32
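The business logic for this example boils down to a single Per-Record statement. As a sketch (exactly how the columns are referenced depends on your Interface page settings and port names):

// Per-Record page: runs once for every input row
sum = a + b;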

no transfer
No Transfer

From Peek:

NO TRANSFER

    • RCP set to "False" in stage definitionand
    • Transfer page left blank, or Auto Transfer = "False"
  • Effects:
    • input columns "a" and "b" are not transferred
    • only new column "sum" is transferred

Compare with transfer ON…

transfer
Transfer

TRANSFER

    • RCP set to "True" in stage definitionor
    • Auto Transfer set to "True"
  • Effects:
    • new column "sum" is transferred, as well as
    • input columns "a" and "b" and
    • input column "ignored" (present in input, but not mentioned in stage)
columns vs temporary c variables
Columns vs. Temporary C++ Variables

  • Columns
    • DS-EE type
    • Defined in Table Definitions
    • Value refreshed from row to row
  • Temporary C++ variables
    • C/C++ type
    • Need declaration (in the Definitions or Pre-Loop page)
    • Value persistent throughout the "loop" over rows, unless modified in code
exercise321
Exercise
  • Complete exercise 10-1 and 10-2
exercise322
Exercise
  • Complete exercises 10-3 and 10-4
custom stage
Custom Stage
  • Reasons for a custom stage:
    • Add an EE operator not already available in DataStage EE
    • Build your own operator and add it to DataStage EE
  • Use the EE API
  • Use the Custom stage to add the new operator to the EE canvas
custom stage324
Custom Stage

DataStage Manager > select Stage Types branch > right click

custom stage325
Custom Stage

Number of input and output links allowed

Name of Orchestrate operator to be used

module 11

Module 11

Meta Data in DataStage EE

objectives329
Objectives
  • Understand how EE uses meta data, particularly schemas and runtime column propagation
  • Use this understanding to:
    • Build schema definition files to be invoked in DataStage jobs
    • Use RCP to manage meta data usage in EE jobs
establishing meta data
Establishing Meta Data
  • Data definitions
    • Recordization and columnization
    • Fields have properties that can be set at individual field level
      • Data types in GUI are translated to types used by EE
    • Described as properties on the format/columns tab (outputs or inputs pages) OR
    • Using a schema file (can be full or partial)
  • Schemas
    • Can be imported into Manager
    • Can be pointed to by some job stages (e.g. Sequential)
data formatting record level
Data Formatting – Record Level
  • Format tab
  • Meta data described on a record basis
  • Record level properties
data formatting column level
Data Formatting – Column Level
  • Defaults for all columns
column overrides
Column Overrides
  • Edit row from within the columns tab
  • Set individual column properties
extended column properties
Extended Column Properties

Field and string settings

extended properties string type
Extended Properties – String Type
  • Note the ability to convert ASCII to EBCDIC
editing columns
Editing Columns

Properties depend on the data type

schema
Schema
  • Alternative way to specify column definitions for data used in EE jobs
  • Written in a plain text file
  • Can be written as a partial record definition
  • Can be imported into the DataStage repository
creating a schema
Creating a Schema
  • Using a text editor
    • Follow correct syntax for definitions
    • OR
  • Import from an existing data set or file set
    • On DataStage Manager import > Table Definitions > Orchestrate Schema Definitions
    • Select the checkbox for a file with a .fs or .ds extension
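A minimal sketch of what such a schema file might contain (the column names and types are illustrative only):

record
(
  CustID: int32;
  CustName: string[max=30];
  Balance: decimal[10,2];
  OpenDate: date;
)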
importing a schema
Importing a Schema

Schema location can be on the server or the local workstation

data types
Data Types

  • Date
  • Decimal
  • Floating point
  • Integer
  • String
  • Time
  • Timestamp
  • Vector
  • Subrecord
  • Raw
  • Tagged
runtime column propagation
Runtime Column Propagation
  • DataStage EE is flexible about meta data. It can cope with the situation where meta data isn’t fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).
  • RCP is always on at runtime.
  • Design-time and compile-time column mapping enforcement:
    • RCP is off by default.
    • Enable first at project level. (Administrator project properties)
    • Enable at job level. (job properties General tab)
    • Enable at Stage. (Link Output Column tab)
enabling rcp at stage level
Enabling RCP at Stage Level
  • Go to output link’s columns tab
  • For the Transformer, you can find the output link's Columns tab by first going to stage properties
using rcp with sequential stages
Using RCP with Sequential Stages
  • To utilize runtime column propagation in the Sequential stage, you must use the "use schema" option
  • Stages with this restriction:
    • Sequential
    • File Set
    • External Source
    • External Target
runtime column propagation346
Runtime Column Propagation
  • When RCP is Disabled
    • DataStage Designer will enforce Stage Input Column to Output Column mappings.
    • At job compile time modify operators are inserted on output links in the generated osh.
runtime column propagation347
Runtime Column Propagation
  • When RCP is Enabled
    • DataStage Designer will not enforce mapping rules.
    • No Modify operator inserted at compile time.
    • Danger of a runtime error if incoming column names do not match the column names on the outgoing link (names are case sensitive).
exercise348
Exercise
  • Complete exercises 11-1 and 11-2
module 12

Module 12

Job Control Using the Job Sequencer

objectives350
Objectives
  • Understand how the DataStage job sequencer works
  • Use this understanding to build a control job to run a sequence of DataStage jobs
job control options
Job Control Options
  • Manually write job control
    • Code generated in Basic
    • Use the job control tab on the job properties page
    • Generates BASIC code which you can modify
  • Job Sequencer
    • Build a controlling job much the same way you build other jobs
    • Comprised of stages and links
    • No BASIC coding
job sequencer
Job Sequencer
  • Build like a regular job
  • Type “Job Sequence”
  • Has stages and links
  • Job Activity stage represents a DataStage job
  • Links represent passing control

Stages

example
Example

Job Activity stage – contains conditional triggers

job activity properties
Job Activity Properties

Job to be executed – select from dropdown

Job parameters to be passed

job activity trigger
Job Activity Trigger
  • Trigger appears as a link in the diagram
  • Custom options let you define the code
options
Options
  • Use the custom option for conditionals
    • Execute if the job run was successful, or had warnings only
  • Can add a "wait for file" activity before executing
  • Add an "execute command" stage to drop the real tables and rename the new tables to the current tables
job activity with multiple links
Job Activity With Multiple Links

Different links having different triggers

notification stage
Notification Stage


sample datastage log from mail notification
Sample DataStage log from Mail Notification
exercise363
Exercise
  • Complete exercise 12-1
module 13

Module 13

Testing and Debugging

objectives365
Objectives
  • Understand the spectrum of tools for testing and debugging
  • Use this understanding to troubleshoot a DataStage job
the director
The Director

Typical Job Log Messages:

  • Environment variables
  • Configuration File information
  • Framework Info/Warning/Error messages
  • Output from the Peek Stage
  • Additional info with "Reporting" environments
  • Tracing/Debug output
    • Must compile job in trace mode
    • Adds overhead
job level environmental variables
Job Level Environmental Variables
  • Job Properties, from Menu Bar of Designer
  • Director will prompt you before each run
troubleshooting
Troubleshooting

If you get an error during compile, check the following:

  • Compilation problems
    • If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
    • If Buildop errors occur, try running buildop from the command line
    • Some stages may not support RCP – this can cause a column mismatch
    • Use the Show Error and More buttons
    • Examine the generated OSH
    • Check environment variable settings
  • Very little integrity checking is done during compile, so you should run Validate from the Director.

Highlights source of error

generating test data
Generating Test Data
  • Row Generator stage can be used
    • Column definitions
    • Data type dependent
  • Row Generator plus Lookup stages provide a good way to create robust test data from pattern files