Data transformation
This presentation is the property of its rightful owner.
Sponsored Links
1 / 31

Data Transformation PowerPoint PPT Presentation


  • 46 Views
  • Uploaded on
  • Presentation posted in: General

Data Transformation. Data cleaning. Importing Data. Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access data ODBC for external database data. Importing an Excel Spreadsheet. PROC IMPORT OUT= WORK.Fall2007

Download Presentation

Data Transformation

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Data transformation

Data Transformation

Data cleaning


Importing data

Importing Data

  • Reading data from external formats

  • Libname/Infile/Input for text form data

  • Proc Import for Excel/Access data

  • ODBC for external database data


Importing an excel spreadsheet

Importing an Excel Spreadsheet

PROCIMPORT OUT= WORK.Fall2007

DATAFILE= "L:\DataWarehousing07f\CourseDatabase\Fall2007.xls"

DBMS=EXCEL REPLACE;

SHEET="'Fall 07$'";

GETNAMES=YES;

MIXED=NO;

SCANTEXT=YES;

USEDATE=YES;

SCANTIME=YES;

RUN;


Import an access table

Import an Access Table

PROC IMPORT OUT= WORK.OrderLine

DATATABLE= "OrderLin"

DBMS=ACCESS REPLACE;

DATABASE="I:\DataWarehousing07f\WholesaleProducts.mdb";

SCANMEMO=YES;

USEDATE=NO;

SCANTIME=YES;

RUN;


Good practice

 Good Practice

  • Check the metadata for a dataset

    PROC CONTENTS DATA= OrderLine;

    RUN;

  • Print a few records

    PROC PRINT DATA= OrderLine (OBS= 10);

    RUN;


Saving sas datasets

Saving SAS Datasets

LIBNAME course "L:\DataWarehousing07f\CourseDatabase";

Data course.Spring2008;

set spring2008;

run;

Note: the name associated with the libname command (“course”) must be 8 characters or less.


Libname infile input for character data

LIBNAME / INFILE / INPUT for character data

  • LIBNAME identifies the location or folder where the data file is stored

  • INFILE specifies the libname to use for reading external data.

  • INPUT reads text format data

  • SET reads SAS data


Infile with input for character data files

INFILE with INPUT for character data files

DATA Fitness;

INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt";

INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS;

run;


Creating derived attributes

Creating Derived Attributes

Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code.

  • LENGTH statements

  • FORMAT statements

  • FORMATS and INFORMATS

  • PUT

  • INPUT


Put and input functions

PUT and INPUT Functions

TextOutput = PUT(variable, format)

Note: the result of a put function is always character

Note: there is also a PUT statement that writes the contents of a variable to the SAS log

Output = INPUT(CharacterInput, informat)

Note: the variable for an input function is always character


Formats

Formats

  • Formats always contain a period

  • Formats for character variables always start with a $

  • The most used format categories are Character, Date and Time, and Numeric

    Note: use the SAS “search” tab to look for “Formats.” For a list of SAS formats look under: “Formats: Formats by Category”


Good practice1

 Good Practice

The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library

Data _Null_;

InputVal= 123;

OutputVal= PUT(InputVal, Roman30.);

PUT InputVal OutputVal;

run;


Generating dates

Generating Dates

  • Generating a Date dimension

  • Usually done offline in something like Excel

  • SAS has extensive date and datetime functions and formats

  • SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.


Creating a text variable for date

Creating a text variable for Date

Data Orders2;

Length Date $10.;

Set Orders;

Date= PUT( Datepart(OrderDate), MDDYY8.);

  • The Length statement assures that the variable will have enough space. It must come before the SET.

  • OrderDate has DateTime format. The DATEPART function produces a date format output. MMDDYYx. is a date format type.


Sas functions

SAS Functions

We are especially interested in “Character” and “Date and Time” functions

Note: use the SAS “search” tab to look for “Functions.” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”


Useful data cleaning functions

Useful Data Cleaning Functions

  • Text Manipulation:

    • COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE

  • Text Extraction

    • INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD


Parsing

Parsing

  • The process of splitting a text field into multiple fields

  • Uses SAS functions to extract parts of a character string.

    • Fixed position in a string: SUBSTR

    • Known delimiter: SCAN

      Note: it is a good idea to strip blanks before you try to parse a string.


Example of parsing

Example of Parsing

Data Customer2;

LENGTH street cust_addr $20.;

FORMAT street cust_addr $20.;

SET Customer;

Cust_Addr= TRIM(Cust_Addr);

Number= Scan(Cust_Addr,1,' ');

Street= Scan(Cust_Addr,2,' ');

run;

Note: The LENGTH and FORMAT statements clear trailing blanks for further display.


Parsing results

Parsing Results

Obs cust_addr Number street

1 481 OAK 481 OAK

2 215 PETE 215 PETE

3 48 COLLEGE 48 COLLEGE

4 914 CHERRY 914 CHERRY

5 519 WATSON 519 WATSON

6 16 ELM 16 ELM

7 108 PINE 108 PINE


Good practice2

 Good Practice

Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘110b Elm’ and ‘110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.


Substr string position length

=SUBSTR(string, position<, length>)

Use this when you have a known position for characters.

  • String: character expression

  • Position: start position (starts with 1)

  • Length: number of characters to take (missing takes all to the end)

    VAR= ‘ABCDEFG’

    NEWVAR= SUBSTR(VAR,2,2)

    NEWVAR2= SUBSTR(VAR,4)

    NEWVAR= ‘BC’

    NEWVAR2= ‘DEFG’


Substr variable position length new characters

SUBSTR(variable, position<,length>) = new-characters

Replaces character value contents. Use this when you know where the replacement starts.

a='KIDNAP';

substr(a,1,3)='CAT';

a: CATNAP

substr(a,4)='TY' ;

a: KIDTY


Index source excerpt

INDEX(source, excerpt)

  • Searches a character expression for a string of characters. Returns the location (number) where the string begins.

    a='ABC.DEF (X=Y)';

    b='X=Y';

    x=index(a,b);

    x: 10

    x= index(a,’DEF’);

    x: 5


Alternative index functions

Alternative INDEX functions

  • INDEXC searches for a single character

  • INDEXW searches for a word:

    Syntax

    INDEXW(source, excerpt<,delimiter>)


Length

Length

Returns the length of a character variable

  • The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0.

  • The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.


Standardizing

Standardizing

  • Adjusting terms to standard format.

  • Based off of frequency prints.

  • Use functions or IF statements

    • TRANWRD is easy but can produce unexpected results

    • IF statements are safer, but less general


Standardization code

Standardization Code

Supplier= Tranwrd(supplier, " Incorporated", "");

If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things";

More complex logic is often needed. See the course examples.


Good practice3

 Good Practice

It is a good idea to produce a change log for standardized changes:

Data Products2 Changed;

Set Products;

SupplierOld= Supplier;

* * * *

Output Products2;

If Trim(supplier) ^= Trim(SupplierOld) then output Changed;

Proc Print Data= Changed;

Var SupplierOld Supplier;


Locating anomalies

Locating Anomalies

  • Frequency counts are a good way to identify anomalies.

  • It is also helpful to identify standard changes that you do not have to review.

  • Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)


De duplicating

De Duplicating

  • Reconcile different representations of the same entity

  • Done after standardizing. Usually requires multi-field testing.

  • May use probabilistic logic, depending on the application.

  • Should produce a change log.


Correcting

Correcting

  • Identifying and correcting values that are wrong

  • Very difficult to do. Usually based off of exception reports or range checks.


  • Login