John Porter, University of Virginia, jporter@virginia.edu

## Using the “R” Actor in Kepler for quality control

### R Basics

• R is an open source statistical language

• “Atomic” types: logical, integer, double (real), complex, character (string) and raw

• Data in R is stored in one of several types of objects

• Scalar : myVar <- 10

• Vectors: myVec <- c(10,20,30)

• Lists: myList <- list(10, "E", 12.3)

• Matrix: myMat <- cbind(myVec1, myVec2)

• Data Frames: myDf <- data.frame(myVec1, myVec2)

• Factors: myFac <- as.factor(myVec)
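A minimal sketch of these object types, using base R only. The variable names follow the bullets above; the values and the `code` column are made up for illustration. One point worth seeing in action: `c()` coerces mixed values to a single atomic type, while `list()` keeps each element's own type.

```r
myVar  <- 10                        # a "scalar" is just a vector of length 1
myVec  <- c(10, 20, 30)             # numeric vector
mixed  <- c(10, "E", 12.3)          # c() coerces everything to character
myList <- list(10, "E", 12.3)       # list() keeps each element's own type
myMat  <- cbind(myVec, myVec * 2)   # 3 x 2 numeric matrix
myDf   <- data.frame(value = myVec, code = c("A", "B", "A"))
myFac  <- as.factor(myDf$code)      # factor with levels "A" and "B"

# class() reports what kind of object each one is
class(mixed)
class(myList)
class(myDf)
class(myFac)
```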

### R Workspaces

• All the variables and functions defined during a session are part of the “Workspace”

• R Workspaces can be saved for later use

• When you come back, everything is the same as when the workspace was saved

### Most Commonly Used Object Types

• Vectors – contain a single column of one of the “atomic” types

• Often created using the concatenate function

myVec <- c(10,20,30)

• Individual elements can be accessed using indexes

myVec[2] is 20

### Data Frames

• Data Frames – table-style objects that contain named vectors inside them

myDF\$RAIN refers to the “RAIN” vector, as does myDF[ ,2]

myDF[135,3] is 121.8

### Reading Data into Data Frames

• A common way of creating data frames is to read in a comma-separated-value (csv) file

Note: regardless of operating system, R wants “/” in file paths, not “\” (inside an R string a single backslash starts an escape sequence)
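As a rough illustration of reading a csv file into a data frame, the sketch below first writes a tiny file to a temporary directory so that it runs anywhere. The file name `rain_example.csv` and the values in it are invented for the example; only the column names mirror the ones used in this talk.

```r
# Build a small CSV file; the contents are illustrative only
csvPath <- file.path(tempdir(), "rain_example.csv")
writeLines(c("YEAR,RAIN,RAIN_CM",
             "2001,40.2,102.1",
             "2002,38.5,97.8"), csvPath)

rainDF <- read.csv(csvPath)   # the first line supplies the column names
str(rainDF)                   # show the structure of the data frame
rainDF$RAIN                   # one column, as a numeric vector

# On Windows, either spelling below names the same file:
#   read.csv("C:/localData/ft_monro.csv")
#   read.csv("C:\\localData\\ft_monro.csv")   # backslashes must be doubled
```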

### Sample R Program for QA/QC

    # Select the Data File
    dataTable1 <- read.csv(infile1, skip = 1, sep = ",", quote = '"',
                           col.names = c("YEAR", "RAIN", "RAIN_CM", "NOTES"),
                           check.names = TRUE)
    attach(dataTable1)

    # Run basic summary statistics
    summary(as.factor(NOTES))
    summary(as.numeric(YEAR))
    summary(as.numeric(RAIN))
    summary(as.numeric(RAIN_CM))

### Quick Exercise – Run these in R

    # anything after a # sign on a line is just a COMMENT - it won't do anything
    varA <- 10              # sets up a vector with one element containing a 10
    varA                    # listing an object's name prints out the values
    varB <- c(10,20,30)     # sets up a vector with 3 elements. c() is the concatenation function
    varB
    varB[2]                 # now let's display ONLY the second element

    # now let's do some math!
    mySumAB <- varA + varB  # adding them together.
    # Note there is only 1 value in varA
    mySumAB
    # note the single value in varA repeated in the addition

### R Data Structures

• A lot of the “magic” in R is because of the object-oriented approach used

• R objects contain a lot more than just the data values

• A command that does one thing to a scalar (single value) does something else with a vector (a list of values) – all because R functions “understand” the difference!
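A small sketch of that behaviour, with made-up values: the same `summary()` call returns different kinds of results depending on the class of the object it is given, and arithmetic is applied to every element of a vector.

```r
rain  <- c(40.2, 38.5, 51.0)                     # numeric vector
notes <- as.factor(c("ok", "ok", "estimated"))   # factor

numSummary <- summary(rain)    # minimum, quartiles, mean and maximum
facSummary <- summary(notes)   # a count for each level instead
doubled    <- rain * 2         # the multiplication is done element by element

numSummary
facSummary
doubled
```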

### Conversions

• Conversions are possible between different modes or types of objects using conversion functions

• as.numeric(varA)

• makes varA a number – if it can!

• as.integer( )

• as.character( )

• as.factor()

• as.matrix()

• as.data.frame()
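The conversion functions listed above can be tried out as follows. This is only a sketch with invented values; note that a failed numeric conversion gives `NA` plus a warning rather than stopping the program.

```r
asNum  <- as.numeric("12.5")                    # character to number
asBad  <- suppressWarnings(as.numeric("abc"))   # cannot be converted: result is NA
asInt  <- as.integer(12.9)                      # the fraction is dropped, not rounded
asChar <- as.character(10)                      # number to character
asFac  <- as.factor(c("wet", "dry", "wet"))     # character vector to factor
asDf   <- as.data.frame(matrix(1:6, nrow = 3))  # matrix to data frame
```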

### Using Data Frames

    A <- c(10,20,30)
    B <- c(4,6,3)
    C <- c('A','B','C')   # put letters in quotes
    Df <- data.frame(C,A,B)
    Df                    # list whole data frame
    Df$A                  # list the A vector
    Df[,3]                # list the 3rd vector (B)
    Df[1,]                # list all columns for row 1
    Df[Df$A > 10,]        # list rows where A>10

### Data Frames

• Results of Data Frame manipulations
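As a rough check of what those manipulations return, the sketch below rebuilds the same small data frame and stores each result in a variable (the names `colA`, `col3`, `row1` and `bigA` are introduced here only for illustration).

```r
A <- c(10, 20, 30)
B <- c(4, 6, 3)
C <- c("A", "B", "C")
Df <- data.frame(C, A, B)

colA <- Df$A              # the A column as a plain vector
col3 <- Df[, 3]           # the third column (B) as a plain vector
row1 <- Df[1, ]           # a one-row data frame with all three columns
bigA <- Df[Df$A > 10, ]   # rows 2 and 3, where A is 20 and 30

bigA
```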

### R Help

R has a number of ways of calling up help

• ??sqrt: does a “fuzzy” search for functions like “sqrt”

• ?sqrt: does an exact search for the function sqrt() and displays documentation

• There are also manuals and extensive on-line tutorials (but Google is frequently the best way to find help)

### R & Kepler

• Kepler uses the “RExpression Actor” to run R code from inside Kepler

• Typically run with an SDF Director with a single iteration for most analyses

• You only need them done once!

• Don’t forget to set the iteration count – the default is to loop forever!

The default RExpression has no inputs and two outputs: graphicsFileName and output

Typical connections for basic RExpression Actor

• To make RExpression actors really useful, they need to intercommunicate with other Kepler actors, beyond simply listing output or showing graphs

• To allow this intercommunication we need to add additional Input and Output ports

• The names of the ports will automatically be connected to objects with the same name in the R program

Hook up some input and output actors

### R Program to Test

Remember – names of ports translate into names of objects in R

R Listing Output

“myOutValue” displayed
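A minimal sketch of such a test program, assuming the actor has been given an input port named `myInValue` (a hypothetical name) alongside the output port `myOutValue`. Inside Kepler the input object would be created from the port; the `exists()` guard only supplies a stand-in so that the lines also run in plain R.

```r
# Stand-in for the value that the input port would supply inside Kepler
if (!exists("myInValue")) myInValue <- 5

myOutValue <- myInValue * 2   # object with the same name as the output port
myOutValue                    # printing it puts the value in the R listing
```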

### R for Checking EML Data

But there are some TRICKS you should know!

### Trick 1 – select the right object type for the EML actor

• By default, the EML actor's output ports carry only the FIRST LINE OF DATA (“As Field”).

• If you want to have an output port represent the data as a VECTOR you need to select “As Column Vector”

• If you want to get a Data Frame instead of individual columns, you need to select “As ColumnBasedRecord”

### Trick 2 – Trap R errors

• Normally if there is a problem with your R program you get a cryptic message from Kepler

### try() and geterrmessage() in R

Runs the “errorplot()”* function and reports any error messages that occur when you run it

* There is no “errorplot()” function in R
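A sketch of how the two functions can be combined (the variable names `result` and `errText` are chosen here for illustration): `try()` runs the call without stopping the script, and `geterrmessage()` returns the text of the most recent error so it can be printed in the R listing.

```r
# errorplot() does not exist, so this call fails; try() keeps the script going
result <- try(errorplot(), silent = TRUE)

if (inherits(result, "try-error")) {
  errText <- geterrmessage()   # text of the most recent error
  print(errText)
}
```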

### QA/QC – Quality Assurance and Quality Control

• Error types

• Errors of Commission – data contains wrong values

• Errors of Omission – data that should be there is missing

• We will mostly be talking today about errors of commission

### Porter’s Rule of Data Quality

• There is no non-trivial dataset that does not contain some errors

• Goal of QA/QC: reduce errors to the maximum possible extent, or at least to the level that they don’t adversely affect the conclusions reached through analysis of the data

### QA/QC – Possible Tests

• Identification and removal of duplicates

• Correct Domain

• Numerical Range (e.g., -20 < Temperature < 50)

• Correct Codes (e.g., HOGI, not HOG1)

• Graphs

• Time-series plots

• Plots between variables

• Detections of “spikes” in time series

• Customized criteria (e.g., month specific range checks)
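Some of these tests can be sketched in a few lines of R. Everything below is an assumption made for illustration: the toy data, the column names `SPECIES` and `TEMP`, the list of valid codes, and the spike threshold of 10 units between consecutive readings.

```r
# Toy data, invented for illustration only
obs <- data.frame(
  SPECIES = c("HOGI", "HOG1", "HOGI", "HOGI", "HOGI"),
  TEMP    = c(12.1, 12.4, 55.0, 12.9, 13.0)
)

# Correct codes: list rows whose code is not in the allowed set
validCodes <- c("HOGI")
badCodes <- obs[!(obs$SPECIES %in% validCodes), ]

# Numerical range: list rows outside -20 < Temperature < 50
outOfRange <- obs[!(obs$TEMP > -20 & obs$TEMP < 50), ]

# Spikes: list rows that differ from the previous reading by more than 10
jump <- c(0, abs(diff(obs$TEMP)))   # the first row has no previous reading
spikes <- obs[jump > 10, ]

badCodes
outOfRange
spikes
```

A simple difference test like this flags both the jump up to the spike and the drop back down, so the reading after a spike is listed as well.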

### Exercise – A succession of workflows for QA

• Open a Web Browser and go to:

• http://tinyurl.com/7po5ffb

• Open the LocalData.zip file

• Extract All Files to directory C:\

• You should then have a C:\localData directory containing the files for this exercise

1_Ft_Monroe_simple_summary.kar

### Kepler Stuff to Note

• Annotations allow you to add titles and other useful instructions to your workflow display

### Kepler Stuff to Note

• Parameters let you easily show and change values that will be used elsewhere in the workflow

### Kepler Parameters

• Customize Name lets you set the NAME of the parameter and what should display on the screen

• Remember the name – that is how you will refer to the parameter later.

### Using a Parameter Value

• Add a \$ in front of a parameter's name in a Kepler settings box to insert the value of that parameter, so here the Data File setting becomes c:/localData/ft_monro.csv

### Brief Exercise

• Experiment with editing connections in this workflow to display different graphs

Then open the 3_ft_monro_badData.kar workflow – it has a corrupted version of this data

### R stuff to Note

• This workflow uses both a Data Frame (table) and vectors (single columns)

• In the dataFrame you can subset lines using: dataFrame[(dataFrame\$RAIN < 0), ]

• Be sure to include the trailing comma! Without it, R tries to select columns instead of rows

• dataFrame\$RAIN < 0 generates a logical vector of TRUE and FALSE values – one for each line
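A small sketch of this kind of subsetting, on invented data with the same column names. It also shows what the trailing comma changes: with it the logical vector picks rows, without it the same vector picks columns.

```r
# Toy data, invented for illustration only
dataFrame <- data.frame(YEAR    = c(2001, 2002, 2003),
                        RAIN    = c(40.2, -1.0, 38.5),
                        RAIN_CM = c(102.1, 2.5, 97.8))

isNegative <- dataFrame$RAIN < 0      # one TRUE or FALSE per line
badRows    <- dataFrame[isNegative, ] # with the comma: selects ROWS
byColumn   <- dataFrame[isNegative]   # without it: selects COLUMNS instead

isNegative
badRows
```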

### QA/QC in R

    summary(dataFrame)

    print("Here are Duplicated Data Lines")
    dataFrame[duplicated(dataFrame),]

    print("now list out of range checks")
    dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0),]
    dataFrame[(dataFrame$RAIN > 150 | dataFrame$RAIN_CM > 300),]

    print("now list unit conversion errors")
    dataFrame[(abs((dataFrame$RAIN*2.54) - dataFrame$RAIN_CM) > 0.1),]

### Examine the workflow on the bad data and change it!

• Try setting different values for the range checks

• Try different graphs (as you did for the good data)

• Try listing all the data that was NOT duplicated (note: in R the “not” operator is “!”)

• use R help and Google as needed
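One possible way to approach the “NOT duplicated” item, sketched on invented data rather than on the workshop file: `duplicated()` returns a logical vector, and putting `!` in front of it flips every TRUE and FALSE.

```r
# Toy data with one repeated line, invented for illustration only
dataFrame <- data.frame(YEAR = c(2001, 2002, 2002, 2003),
                        RAIN = c(40.2, 38.5, 38.5, 51.0))

isDup      <- duplicated(dataFrame)   # TRUE for the second copy of a repeated line
dupRows    <- dataFrame[isDup, ]      # only the duplicated lines
uniqueRows <- dataFrame[!isDup, ]     # everything that was NOT duplicated

uniqueRows
```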

### R+Kepler vs. R Alone

• Given that “R” runs just fine alone, why use Kepler?

• Allows use of OTHER Kepler actors, Data Turbine

• E.g., EMLData, editors, graphical tools

• Allows code to be segmented for easier editing in the future

• Reusability – ability to copy and paste parts of Kepler workflows

• Use spatial arrangement to help guide the user

• Downsides

• Complicates debugging

A more complex and general workflow

4_BasicEMLQA

### Workflow Steps

• Convert the EML metadata into an R program using an XSLT stylesheet

• Edit the R program to point to the data

• Ingest the data into a data frame

• Summarize the data

• “Tweak” the data: add a date-time vector for time plots, fix some conversion problems, and re-summarize the data

• Run some plots

### Passing R Workspaces

• This workflow, instead of passing data from actor-to-actor, passes the name of the R Workspace

• Subsequent actors re-open the R Workspace without needing to ingest the data again

• This is very efficient, but this method only works for connecting R actors

### R code for passing on R workspaces

Set Port Variable to the name of the workflow

Saving workspace for later use

Remember to save the workspace!

Name of Port connected to WorkingDir port (above)
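The idea can be sketched in plain R as follows. This is not the code from the workflow itself: `WorkingDir` stands in for the value that arrives on the WorkingDir port, and the file name `qa_workspace.RData`, the variable `workspaceFile` and the toy data are assumptions made for the example.

```r
# Stand-in for the directory supplied through the WorkingDir port
WorkingDir <- tempdir()
workspaceFile <- file.path(WorkingDir, "qa_workspace.RData")   # hypothetical name

# First actor: ingest the data once, then save the whole workspace
dataFrame <- data.frame(YEAR = c(2001, 2002), RAIN = c(40.2, 38.5))
save.image(file = workspaceFile)

# A later actor only needs the file name to get the objects back
rm(dataFrame)
load(workspaceFile)
summary(dataFrame)
```

Only the short file name travels between actors, which is why this approach is limited to connections between R actors.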

### A conversion problem

Temperature and Humidity values have some severe problems reading in!

What happened?

### R Factors

• Factors are the way R deals with categorical or nominal data (typically non-numeric data)

• Internally Factors are made up of two vectors:

• Values – the distinct values that occur in the factor – often referred to as “levels”

• Indexes – an integer vector with one entry per data value, giving the position of that value in the list of levels

• DANGER – sometimes when you read in data from a file, errors or odd characteristics of the data will cause R to read a column of (mostly) numbers as a Factor instead of as a numeric vector!

### Factors

This is the mean of the INDEXES not the VALUES/Levels

• After conversion data ranges are much better!

• But Max_T is still suspicious!
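The index-versus-value problem can be reproduced with a few invented readings (the vector `maxT` and the stray entry `"NA?"` are assumptions for illustration). Calling `as.numeric()` directly on a factor returns the indexes; converting to character first recovers the values.

```r
# A column of mostly numbers that was read in as a factor because of one odd entry
maxT <- factor(c("21.5", "19.0", "NA?", "30.2"))

codes     <- as.numeric(maxT)   # the INDEXES, not the temperatures
wrongMean <- mean(codes)        # mean of the indexes

# Convert to character first, then to numeric; the odd entry becomes NA
values    <- suppressWarnings(as.numeric(as.character(maxT)))
rightMean <- mean(values, na.rm = TRUE)

wrongMean
rightMean
```

Since R 4.0.0, `read.csv()` no longer turns text columns into factors by default, so in recent versions such a column usually arrives as character instead. The conversion step and the check for unexpected `NA` values are still needed.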