
Pentaho Data Integration Suite

Kettle is an acronym for "Kettle E.T.T.L. Environment": Extraction, Transformation, Transportation, and Loading of data.


Presentation Transcript


  1. Pentaho Data Integration Suite

  2. Pentaho Data Integration Suite • Kettle is an acronym for "Kettle E.T.T.L. Environment": Extraction, Transformation, Transportation, and Loading of data. • Spoon is a graphical user interface that allows you to design transformations and jobs that can be run with the Kettle tools Pan and Kitchen. • Pan is a data transformation engine that performs a multitude of functions such as reading, manipulating, and writing data to and from various data sources. • Kitchen is a program that executes jobs designed in Spoon and stored as XML or in a database repository. • Jobs are usually scheduled in batch mode to run automatically at regular intervals.
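The division of labor above can be sketched as two command lines. This is an illustrative sketch, not the full option list: `-file` and `-level` are standard Pan/Kitchen flags, but the install path and the `.ktr`/`.kjb` file names are hypothetical examples.

```shell
#!/bin/sh
# Pan runs a transformation (.ktr); Kitchen runs a job (.kjb).
# Paths and file names below are hypothetical examples.

# Run a transformation saved as a .ktr file with Pan:
pan_cmd="sh pan.sh -file=/opt/kettle/trsfCountry.ktr -level=Basic"

# Run a job saved as a .kjb file with Kitchen (typically scheduled via cron):
kitchen_cmd="sh kitchen.sh -file=/opt/kettle/loadWarehouse.kjb -level=Basic"

echo "$pan_cmd"
echo "$kitchen_cmd"
```

A scheduler such as cron would invoke the Kitchen command at regular intervals, which is the batch-mode scheduling the slide refers to.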

  3. Getting Started • Follow the instructions below to install Spoon: • Install the Sun Microsystems Java Runtime Environment version 1.5 or higher. You can download a JRE for free at http://www.javasoft.com/. • Unzip the binary distribution zip file in a directory of your choice. • Under Unix-like environments (Solaris, Linux, and macOS, for example), you must make the shell scripts executable. Execute these commands to make all shell scripts in the Kettle directory executable: cd Kettle && chmod +x *.sh
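The final install step above, sketched as a runnable session. The `Kettle` directory and the three scripts are created here as stand-ins so the sketch is self-contained; in a real install they come from the unzipped distribution.

```shell
#!/bin/sh
# Stand-in for the unzipped Kettle distribution (real install: unzip the
# binary zip into a directory of your choice instead).
mkdir -p Kettle
touch Kettle/spoon.sh Kettle/pan.sh Kettle/kitchen.sh

# The step from the slide: make all shell scripts executable.
cd Kettle
chmod +x *.sh
```

After this, `ls -l Kettle` shows the executable bit set on every `.sh` file.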

  4. Launching Spoon • Spoon.bat: launches Spoon on the Windows platform. • spoon.sh: launches Spoon on a Unix-like platform such as Linux, Apple OS X, or Solaris. • To create a shortcut on Windows, an icon is provided: use "spoon.ico" to set the correct icon, and point the shortcut to the spoon.bat file.
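The Spoon.bat / spoon.sh split above can be wrapped in a small launcher-selection helper. This is a sketch, not part of the Kettle distribution; the `launcher_for` function is a hypothetical helper keyed on `uname -s` output.

```shell
#!/bin/sh
# Pick the right Spoon launcher for the current platform,
# mirroring the Spoon.bat / spoon.sh split in the slide.
launcher_for() {
  case "$1" in
    Windows*|CYGWIN*|MINGW*) echo "Spoon.bat" ;;   # Windows variants
    *)                       echo "spoon.sh"  ;;   # Linux, OS X, Solaris, ...
  esac
}

launcher_for "$(uname -s)"
```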

  5. User Interface Overview • The Main tree in the upper-left panel of Spoon allows you to browse connections associated with the jobs and transformations you have open. • When designing a transformation, the Core Objects palette in the lower-left panel contains the available steps used to build your transformation, including input, output, lookup, transform, join, scripting steps, and more. • When designing a job, the Core Objects palette contains the available job entries. • The Core Objects bar contains a variety of job entry types.

  6. Repository • Spoon allows you to store transformation and job files in the local file system or in the Kettle repository. • The Kettle repository can be housed in any common relational database.

  7. Transformation Definitions • The table below contains a list of transformation definitions:

  8. Job Definitions • The table below contains a list of job definitions:

  9. Toolbar Icons • The icons on the toolbar of the main screen are, from left to right:

  10. General Tab • Note: Spoon automatically clears the database cache when you issue DDL (Data Definition Language) statements against a database connection; however, when using third-party tools, clearing the database cache manually may be necessary.

  11. Other • Search Metadata: This option searches all available fields, connectors, and notes of every loaded job and transformation for the string specified in the Filter field. The metadata search returns a detailed result set showing the location of any hits. This feature is accessed by choosing Edit|Search metadata from the menu bar. • Set Environment Variable: This feature allows you to create and set environment variables for the current user session explicitly. It is useful when designing transformations, for testing variable substitutions that are normally set dynamically by another job or transformation. It is accessible by choosing Edit|Set Environment Variable from the menu bar. Note: this dialog also displays when you run a transformation that uses undefined variables, which allows you to define them right before execution time. • Show Environment Variables: This feature displays the current list of environment variables and their values. It is accessed by selecting Edit|Show environment variables from the menu bar. • Execution Log History: If you have configured your job or transformation to store log information in a database table, you can view the log information from previous executions by right-clicking the job or transformation in the Main tree and selecting 'Open History View'. Note: the log history for a job or transformation also opens by default each subsequent time you execute the file.
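The variable substitution that Set Environment Variable feeds can be imitated outside Spoon for testing. This is a sketch of the idea only, not Kettle code: Kettle replaces `${NAME}` tokens with the variable's value at run time, and the variable name `OUTPUT_DIR` below is a hypothetical example.

```shell
#!/bin/sh
# Imitate Kettle's ${VAR} substitution for a single, hypothetical variable.
OUTPUT_DIR=/tmp/out

substitute() {
  # Replace every literal ${OUTPUT_DIR} token with the variable's value.
  echo "$1" | sed "s|[$]{OUTPUT_DIR}|$OUTPUT_DIR|g"
}

substitute 'Writing rejects to ${OUTPUT_DIR}/rejects.txt'
```

Single quotes around the argument keep the shell itself from expanding the token, so the helper sees it literally, just as a transformation definition stores it.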

  12. Generate Mapping Against Target Step • In cases where you have a fixed target table, you map the fields from the stream to their corresponding fields in the target output table using a Select Values step in your transformation. The 'Generate mapping against target' option provides an easy-to-use dialog for defining these mappings; it automatically creates the resulting Select Values step, which can be dropped into your transformation flow prior to the Table output step. To access the 'Generate mapping against target' option, right-click the Table output step.

  13. Generate mappings example • Below is an example of a simple transformation in which we want to generate mappings to our target output table: • Begin by right-clicking the Table output step and selecting 'Generate mappings against target'. • Add all necessary mappings using the Generate Mapping dialog and click OK. A Table output mapping step is added to the canvas. • Drag the generated Table output mapping step into your transformation flow prior to the Table output step.

  14. Creating a Transformation or Job • You create a new transformation in one of three ways: by clicking the New Transformation button on the main toolbar; by clicking New, then Transformation; or by using the CTRL-N hot key. Any one of these actions opens a new Transformation tab for you to begin designing your transformation. • You create a new job in one of three ways: by clicking the New Job button on the main toolbar; by clicking New, then Job; or by using the CTRL-ALT-N hot key. Any one of these actions opens a new Job tab for you to begin designing your job.

  15. Creating a New Database Connection • This section describes how to create a new database connection and includes a detailed description of each connection property available in the Connection information dialog box. • To create a new connection, right-click Database connections in the tree and select New or New Connection Wizard. You can also double-click Database connections, or press F3. The Connection information dialog box appears. The topics that follow describe the configuration options available on each tab of the dialog box.

  16. Database Explorer • The Database Explorer provides the ability to explore configured database connections. It also supports tables, views, and synonyms, along with the catalog and/or schema to which the table belongs. • The buttons to the right provide quick access to the following features for the selected table:

  17. Kettle and Spoon • The first lesson of our Kettle ETL tutorial explains how to create a simple transformation using the Spoon application, which is part of the Pentaho Data Integration suite. The transformation in our example will read records from a table in an Oracle database, then filter them and write the output to two separate text files: the records that pass the validation rule will be spooled into one text file, and the ones that fail will be redirected to the rejects link, which will place them in a different text file. • Assuming that the Spoon application is installed correctly, the first thing to do after running it is to configure a repository. Once the 'Select a repository' window appears, it is necessary to create or choose one. A repository is a place where all Kettle objects will be stored; in this tutorial it will be an Oracle database. To create a new repository, click the 'New' button and type the connection parameters into the 'Connection information' window. There are some very useful options on this screen: 'Test' allows users to test new connections, and 'Explore' lets users browse a database schema and explore the database objects. After clicking 'Create or Upgrade', a new repository is created. By default, a user with administrator rights is created; its login name is admin and the password is also admin. It is highly recommended to change the password after the first login. • Database connection in Spoon, part of Kettle ETL:

  18. If a connection with the repository is established successfully, the Spoon main application window shows up. To design a new transformation that performs the tasks described above, take the following steps: • Click the 'New transformation' icon and enter its name (in our tutorial it will be trsfCountry). • Define a database connection. It is located in the left-hand menu, in the 'Main tree' area, under Database connections. • Drag and drop the following elements from the 'Core Objects' menu to the transformation design area in the center of the screen: Table Input (menu Input), Filter Rows (menu Transform), and two Text File Output objects (menu Output). • Edit the Table Input: choose a source database and define an SQL query that returns records to the transform flow. The 'Preview' option is usually very useful here, as it shows a preview of the records returned from the database. • Oracle table input data in Spoon:

  19. Hops • The next thing to do is to link the objects together. The links between elements are called hops, and they indicate the direction in which the transform flows go. Hop elements can be found, created, and edited in the Main tree section. • The easiest way to create a hop is to drag a link between two objects with the left SHIFT key pressed. • Once the hops are defined, it is time to define the validation criteria in the 'Filter Rows' object. This is where we define the data flow and its direction based on a validation rule.
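What the filter step does to the stream, the pass hop, and the rejects hop can be sketched with awk. This is only an analogy for the tutorial's flow, not Kettle itself: the two-column row layout and the rule (second field greater than zero) are hypothetical examples.

```shell
#!/bin/sh
# Hypothetical input rows: country,population (stand-in for the Oracle table).
cat > countries.txt <<'EOF'
Poland,38000000
Atlantis,-1
France,67000000
EOF

# Rows passing the rule go down the "true" hop (valid.txt);
# the rest go down the rejects hop (rejects.txt).
awk -F, '$2 > 0 { print > "valid.txt"; next } { print > "rejects.txt" }' countries.txt
```

After running, `valid.txt` holds the Poland and France rows, and `rejects.txt` holds the Atlantis row, mirroring the two Text File Output steps of the transformation.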

  20. Run • The last thing to do is to configure the text file outputs. Enter the names of the files and their extensions in the properties window and, if needed, adjust other text-file-specific options. • Save and run the transformation (menu Transformation -> Run, or just press the F9 key).
