data transformation
Download
Skip this Video
Download Presentation
Data Transformation

Loading in 2 Seconds...

play fullscreen
1 / 31

Data Transformation - PowerPoint PPT Presentation


  • 93 Views
  • Uploaded on

Data Transformation. Data cleaning. Importing Data. Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access data ODBC for external database data. Importing an Excel Spreadsheet. PROC IMPORT OUT= WORK.Fall2007

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Data Transformation' - sylvester-vinson


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
data transformation

Data Transformation

Data cleaning

importing data
Importing Data
  • Reading data from external formats
  • Libname/Infile/Input for text form data
  • Proc Import for Excel/Access data
  • ODBC for external database data
importing an excel spreadsheet
Importing an Excel Spreadsheet

PROCIMPORT OUT= WORK.Fall2007

DATAFILE= "L:\DataWarehousing07f\CourseDatabase\Fall2007.xls"

DBMS=EXCEL REPLACE;

SHEET="\'Fall 07$\'";

GETNAMES=YES;

MIXED=NO;

SCANTEXT=YES;

USEDATE=YES;

SCANTIME=YES;

RUN;

import an access table
Import an Access Table

PROC IMPORT OUT= WORK.OrderLine

DATATABLE= "OrderLin"

DBMS=ACCESS REPLACE;

DATABASE="I:\DataWarehousing07f\WholesaleProducts.mdb";

SCANMEMO=YES;

USEDATE=NO;

SCANTIME=YES;

RUN;

good practice
 Good Practice
  • Check the metadata for a dataset

PROC CONTENTS DATA= OrderLine;

RUN;

  • Print a few records

PROC PRINT DATA= OrderLine (OBS= 10);

RUN;

saving sas datasets
Saving SAS Datasets

LIBNAME course "L:\DataWarehousing07f\CourseDatabase";

Data course.Spring2008;

set spring2008;

run;

Note: the name associated with the libname command (“course”) must be 8 characters or less.

libname infile input for character data
LIBNAME / INFILE / INPUT for character data
  • LIBNAME identifies the location or folder where the data file is stored
  • INFILE specifies the libname to use for reading external data.
  • INPUT reads text format data
  • SET reads SAS data
infile with input for character data files
INFILE with INPUT for character data files

DATA Fitness;

INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt";

INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS;

run;

creating derived attributes
Creating Derived Attributes

Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code.

  • LENGTH statements
  • FORMAT statements
  • FORMATS and INFORMATS
  • PUT
  • INPUT
put and input functions
PUT and INPUT Functions

TextOutput = PUT(variable, format)

Note: the result of a put function is always character

Note: there is also a PUT statement that writes the contents of a variable to the SAS log

Output = INPUT(CharacterInput, informat)

Note: the variable for an input function is always character

formats
Formats
  • Formats always contain a period
  • Formats for character variables always start with a $
  • The most used format categories are Character, Date and Time, and Numeric

Note: use the SAS “search” tab to look for “Formats.” For a list of SAS formats look under: “Formats: Formats by Category”

good practice1
 Good Practice

The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library

Data _Null_;

InputVal= 123;

OutputVal= PUT(InputVal, Roman30.);

PUT InputVal OutputVal;

run;

generating dates
Generating Dates
  • Generating a Date dimension
  • Usually done offline in something like Excel
  • SAS has extensive date and datetime functions and formats
  • SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.
creating a text variable for date
Creating a text variable for Date

Data Orders2;

Length Date $10.;

Set Orders;

Date= PUT( Datepart(OrderDate), MDDYY8.);

  • The Length statement assures that the variable will have enough space. It must come before the SET.
  • OrderDate has DateTime format. The DATEPART function produces a date format output. MMDDYYx. is a date format type.
sas functions
SAS Functions

We are especially interested in “Character” and “Date and Time” functions

Note: use the SAS “search” tab to look for “Functions.” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”

useful data cleaning functions
Useful Data Cleaning Functions
  • Text Manipulation:
    • COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE
  • Text Extraction
    • INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD
parsing
Parsing
  • The process of splitting a text field into multiple fields
  • Uses SAS functions to extract parts of a character string.
    • Fixed position in a string: SUBSTR
    • Known delimiter: SCAN

Note: it is a good idea to strip blanks before you try to parse a string.

example of parsing
Example of Parsing

Data Customer2;

LENGTH street cust_addr $20.;

FORMAT street cust_addr $20.;

SET Customer;

Cust_Addr= TRIM(Cust_Addr);

Number= Scan(Cust_Addr,1,\' \');

Street= Scan(Cust_Addr,2,\' \');

run;

Note: The LENGTH and FORMAT statements clear trailing blanks for further display.

parsing results
Parsing Results

Obs cust_addr Number street

1 481 OAK 481 OAK

2 215 PETE 215 PETE

3 48 COLLEGE 48 COLLEGE

4 914 CHERRY 914 CHERRY

5 519 WATSON 519 WATSON

6 16 ELM 16 ELM

7 108 PINE 108 PINE

good practice2
 Good Practice

Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘110b Elm’ and ‘110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.

substr string position length
=SUBSTR(string, position<, length>)

Use this when you have a known position for characters.

  • String: character expression
  • Position: start position (starts with 1)
  • Length: number of characters to take (missing takes all to the end)

VAR= ‘ABCDEFG’

NEWVAR= SUBSTR(VAR,2,2)

NEWVAR2= SUBSTR(VAR,4)

NEWVAR= ‘BC’

NEWVAR2= ‘DEFG’

substr variable position length new characters
SUBSTR(variable, position<,length>) = new-characters

Replaces character value contents. Use this when you know where the replacement starts.

a=\'KIDNAP\';

substr(a,1,3)=\'CAT\';

a: CATNAP

substr(a,4)=\'TY\' ;

a: KIDTY

index source excerpt
INDEX(source, excerpt)
  • Searches a character expression for a string of characters. Returns the location (number) where the string begins.

a=\'ABC.DEF (X=Y)\';

b=\'X=Y\';

x=index(a,b);

x: 10

x= index(a,’DEF’);

x: 5

alternative index functions
Alternative INDEX functions
  • INDEXC searches for a single character
  • INDEXW searches for a word:

Syntax

INDEXW(source, excerpt<,delimiter>)

length
Length

Returns the length of a character variable

  • The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0.
  • The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.
standardizing
Standardizing
  • Adjusting terms to standard format.
  • Based off of frequency prints.
  • Use functions or IF statements
    • TRANWRD is easy but can produce unexpected results
    • IF statements are safer, but less general
standardization code
Standardization Code

Supplier= Tranwrd(supplier, " Incorporated", "");

If Supplier= "Trinkets & Things" then supplier= "Trinkets n\' Things";

More complex logic is often needed. See the course examples.

good practice3
 Good Practice

It is a good idea to produce a change log for standardized changes:

Data Products2 Changed;

Set Products;

SupplierOld= Supplier;

* * * *

Output Products2;

If Trim(supplier) ^= Trim(SupplierOld) then output Changed;

Proc Print Data= Changed;

Var SupplierOld Supplier;

locating anomalies
Locating Anomalies
  • Frequency counts are a good way to identify anomalies.
  • It is also helpful to identify standard changes that you do not have to review.
  • Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)
de duplicating
De Duplicating
  • Reconcile different representations of the same entity
  • Done after standardizing. Usually requires multi-field testing.
  • May use probabilistic logic, depending on the application.
  • Should produce a change log.
correcting
Correcting
  • Identifying and correcting values that are wrong
  • Very difficult to do. Usually based off of exception reports or range checks.
ad