John Porter, University of Virginia (jporter@virginia.edu)
Using the “R” Actor in Kepler for quality control

R Basics
  • R is an open source statistical language
  • “Atomic” types:  logical, integer, real, complex, string (or character) and raw
  • Data in R is stored in one of several types of objects
    • Scalar: myVar <- 10
    • Vectors: myVec <- c(10,20,30)
    • Lists: myList <- list(10,"E",12.3) (note that c(10,"E",12.3) would give a character vector, not a list)
    • Matrix: myMat <- cbind(myVec1,myVec2)
    • Data Frames: myDf <- data.frame(myVec1,myVec2)
    • Factors: myFac <- as.factor(myVec)
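
As a rough illustration of these object types, the sketch below builds one of each and inspects it. The values and the extra names (myChr, and the "wet"/"dry" labels) are made up for the example.

```r
# Illustrative objects only; most names mirror the slide.
myVar  <- 10                          # a "scalar" is really a vector of length 1
myVec1 <- c(10, 20, 30)               # numeric vector
myVec2 <- c(4, 6, 3)
myList <- list(10, "E", 12.3)         # list: each element keeps its own type
myChr  <- c(10, "E", 12.3)            # c() coerces everything to character
myMat  <- cbind(myVec1, myVec2)       # 3 x 2 numeric matrix
myDf   <- data.frame(myVec1, myVec2)  # data frame with two named columns
myFac  <- as.factor(c("wet", "dry", "wet"))  # factor with levels "dry" and "wet"

class(myList)   # "list"
class(myChr)    # "character"
dim(myMat)      # 3 2
levels(myFac)   # "dry" "wet"
```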
R Workspaces
  • All the variables and functions defined during a session are part of the “Workspace”
  • R Workspaces can be saved for later use
    • When you come back, everything is the same as when the workspace was saved
Most Commonly Used Object Types
  • Vectors – contain a single column of one of the “atomic” types
  • Often created using the concatenate function

myVec <- c(10,20,30)

  • Individual elements can be accessed using indexes

myVec[2] is 20

Data Frames
  • Data Frames – table-style objects that contain named vectors inside them

myDF$RAIN refers to the “RAIN” vector, as does myDF[ ,2]

myDF[135,3] is 121.8

Reading Data into Data Frames
  • A common way of creating data frames is to read in a comma-separated-value (csv) file


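myDf <- read.csv("C:/ft_monro.csv",header=TRUE)

Note: regardless of operating system, R accepts “/” in file paths. A single “\” will not work, because R treats it as an escape character inside a string; on Windows it must be doubled (“\\”).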
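
To try this without the course files, the sketch below writes a tiny CSV to a temporary directory and reads it back. The file name and values are invented for the example.

```r
# Hypothetical file: write a tiny CSV to a temporary location, then read it back.
csvPath <- file.path(tempdir(), "example_rain.csv")
writeLines(c("YEAR,RAIN", "2001,42.1", "2002,38.7"), csvPath)

myDf <- read.csv(csvPath, header = TRUE)
str(myDf)        # two columns: YEAR and RAIN
myDf$RAIN[2]     # second rainfall value

# On Windows, both of these spellings of a path are accepted by R:
#   read.csv("C:/data/example_rain.csv")
#   read.csv("C:\\data\\example_rain.csv")
```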

Sample R Program for QA/QC

# Select the Data File

infile1 <- file("C:/downloads/ft_monroe.csv", open="r")

# Read the data (header=FALSE because skip=1 already skips the header line)

dataTable1 <- read.csv(infile1, header=FALSE, skip=1, sep=",", quote='"', col.names=c("YEAR", "RAIN", "RAIN_CM", "NOTES"), check.names=TRUE)

# Run basic summary statistics

summary(dataTable1)

Quick Exercise – Run these in R

# anything after a # sign on a line is just a COMMENT - it won't do anything

varA <- 10 # sets up a vector with one element containing a 10

varA # listing an object's name prints out the values

varB<- c(10,20,30) # sets up a vector with 3 elements. c() is the concatenation function


varB[2] # now let's display ONLY the second element

# now let's do some math!

mySumAB <- varA + varB # adding them together.

# Note there is only 1 value in varA

mySumAB # display the result: 20 30 40

# note the single value in varA was repeated (recycled) in the addition

R Data Structures
  • A lot of the “magic” in R is because of the object-oriented approach used
  • R objects contain a lot more than just the data values
  • A command that does one thing to a scalar (single value) does something else with a vector (a list of values) – all because R functions “understand” the difference!
  • Conversions are possible between different modes or types of objects using conversion functions
    • as.numeric(varA)
      • makes varA a number – if it can!
    • as.integer( )
    • as.character( )
    • as.factor()
    • as.matrix()
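
A short sketch of what “if it can” means in practice: values that cannot be converted become NA (with a warning, suppressed here). The sample strings are made up.

```r
varA <- "10"
as.numeric(varA) + 1          # the string "10" becomes the number 10, so this is 11

bad  <- c("12.5", "n/a", "7")
nums <- suppressWarnings(as.numeric(bad))  # "n/a" cannot be converted
nums                          # 12.5, NA, 7
sum(is.na(nums))              # how many values failed to convert

as.integer(3.9)               # 3 (the fraction is dropped)
as.character(42)              # "42"
```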
Using Data Frames

A <- c(10,20,30)

B <- c(4,6,3)

C <- c("A","B","C") # put letters in quotes

Df <-data.frame(C,A,B)

Df # list whole data frame

Df$A # list the A vector

Df[,3] # list the 3rd vector (B)

Df[1,] # list all columns for row 1

Df[Df$A > 10,] # list rows where A>10

Data Frames
  • Results of Data Frame manipulations
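
The slide image with the results is not part of this transcript. As a stand-in, the sketch below repeats the commands from the previous slide with comments describing what each one returns.

```r
A <- c(10, 20, 30)
B <- c(4, 6, 3)
C <- c("A", "B", "C")
Df <- data.frame(C, A, B)

Df$A              # 10 20 30
Df[, 3]           # 4 6 3 (the B column)
Df[1, ]           # a one-row data frame: C is "A", A is 10, B is 4
Df[Df$A > 10, ]   # rows 2 and 3, where A is 20 and 30
```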
R Help

R has a number of ways of calling up help

  • ??sqrt – does a “fuzzy” search for functions like “sqrt”
  • ?sqrt – does an exact search for the function sqrt() and displays documentation
  • There are also manuals and extensive on-line tutorials (but Google is frequently the best way to find help)
R & Kepler
  • Kepler uses the “RExpression Actor” to run R code from inside Kepler
  • Typically run with an SDF Director with a single iteration for most analyses
    • You only need them done once!
    • Don’t forget to set the iteration count – the default is to loop forever!
Adding Ports
  • To make RExpression actors really useful, they need to communicate with other Kepler actors, beyond simply listing output or showing graphs
  • To allow this intercommunication we need to add additional Input and Output ports
    • The names of the ports will automatically be connected to objects with the same name in the R program
R Program to Test

Remember – names of ports translate into names of objects in R
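
The test program itself is not in this transcript. The sketch below only illustrates the naming idea, and the port names inValue and outValue are hypothetical. Inside Kepler the actor would be expected to supply the input object, so here it is set by hand to let the code run in plain R.

```r
# Stand-in for what the RExpression actor would supply through a
# hypothetical input port named "inValue" (an assumption, not from the slides).
if (!exists("inValue")) inValue <- c(10, 20, 30)

# Ordinary R code that uses the object named after the input port
outValue <- mean(inValue)

# In Kepler, an output port named "outValue" (also hypothetical) would pick
# up this object; in plain R we simply print it.
outValue
```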

Results of Running Workflow

R Listing Output



R for Checking EML Data

But there are some TRICKS you should know!

Trick 1 – select the right object type for the EMLactor
  • By default the EML actor sends only the FIRST LINE OF DATA to its output ports (“As Field”).
  • If you want to have an output port represent the data as a VECTOR you need to select “As Column Vector”
  • If you want to get a Data Frame instead of individual columns, you need to select “As ColumnBasedRecord”
Trick 2 – Trap R errors
  • Normally if there is a problem with your R program you get a cryptic message from Kepler
try() and geterrmessage() in R

Runs the “errorplot()”* function and reports any error messages that occur when you run it

* There is no “errorplot()” function in R
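
The code from the slide image is not in this transcript. A minimal sketch of the pattern, using only base R's try() and geterrmessage(), might look like this:

```r
# errorplot() is deliberately undefined, so calling it raises an error.
result <- try(errorplot(), silent = TRUE)

if (inherits(result, "try-error")) {
  # geterrmessage() returns the text of the most recent error
  print(geterrmessage())
} else {
  print("no errors")
}
```

Because try() catches the error, the script keeps running and the message ends up in the normal R output, where it is easier to read than a failure reported by the workflow engine.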

QA/QC – Quality Assurance and Quality Control
  • Error types
    • Errors of Commission – data contains wrong values
    • Errors of Omission – data that should be there is missing
  • We will mostly be talking today about errors of commission
Porter’s Rule of Data Quality
  • There is no non-trivial dataset that does not contain some errors
  • Goal of QA/QC: reduce errors to the maximum possible extent, or at least to the level that they don’t adversely affect the conclusions reached through analysis of the data
QA/QC – Possible Tests
  • Identification and removal of duplicates
  • Correct Domain
    • Numerical Range (e.g., -20 < Temperature < 50)
    • Correct Codes (e.g., HOGI, not HOG1)
  • Graphs
    • Time-series plots
    • Plots between variables
  • Detections of “spikes” in time series
  • Customized criteria (e.g., month specific range checks)
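
Several of these tests can be sketched in a few lines of base R. The data, the list of valid codes, and the 10-degree spike threshold below are all made up for illustration.

```r
# Hypothetical observations: one bad species code and one temperature spike
obs <- data.frame(
  SPECIES = c("HOGI", "HOG1", "HOGI", "HOGI", "HOGI", "HOGI"),
  TEMP    = c(12.1, 12.4, 12.2, 75.0, 12.6, 12.3)
)
validCodes <- c("HOGI")   # assumed list of acceptable codes

# Duplicates: rows that exactly repeat an earlier row (none in this sample)
dupRows <- obs[duplicated(obs), ]

# Domain checks: numerical range and correct codes
badRange <- obs[obs$TEMP <= -20 | obs$TEMP >= 50, ]
badCodes <- obs[!(obs$SPECIES %in% validCodes), ]

# Spike check: flag readings that jump more than 10 degrees from the
# previous reading (this flags both the spike and the return to normal)
jump   <- c(0, abs(diff(obs$TEMP)))
spikes <- obs[jump > 10, ]

badRange
badCodes
spikes
```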
Exercise – A succession of workflows for QA
  • Open your Virtual Machine
  • Open a Web Browser and go to:
  • Open the file
  • Extract All Files to directory C:\
  • You should then have a C:\localData directory containing the files for this exercise


A dead-simple workflow

Kepler Stuff to Note
  • Annotations allow you to add titles and other useful instructions to your workflow display
Kepler Stuff to Note
  • Parameters let you easily show and change values that will be used elsewhere in the workflow
Kepler Parameters
  • Customize Name lets you set the NAME of the parameter and what should display on the screen
  • Remember the name – that is how you will refer to the parameter later.

Using a Parameter Value
  • Add a $ to the front of a parameter in a Kepler settings box to insert the value of the parameter – so the Data File: is c:/localData/ft_monro.csv
Brief Exercise
  • Experiment with editing connections in this workflow to display different graphs

Then open the 3_ft_monro_badData.kar workflow – it has a corrupted version of this data

R stuff to Note
  • This workflow uses both a Data Frame (table) and vectors (single columns)

  • In the dataFrame you can subset lines using: dataFrame[(dataFrame$RAIN < 0), ]
    • Be sure to put the trailing comma!
    • dataFrame$RAIN < 0 generates a logical vector of TRUE and FALSE values – one for each line
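
A small sketch of how the logical vector drives the row selection, using made-up values:

```r
dataFrame <- data.frame(YEAR = c(2001, 2002, 2003),
                        RAIN = c(42.1, -9.0, 38.7))  # made-up values

isNeg <- dataFrame$RAIN < 0     # FALSE TRUE FALSE, one entry per line
dataFrame[isNeg, ]              # the 2002 row, all columns
dataFrame[!isNeg, ]             # the other two rows

# Without the trailing comma the index would be applied to columns
# instead of rows, which is not what is wanted here.
```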
QA/QC in R


print("Here are Duplicated Data Lines")


print("now list out of range checks")

dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0),]

dataFrame[(dataFrame$RAIN > 150 | dataFrame$RAIN_CM > 300),]

print("now list unit conversion errors")

dataFrame[(abs((dataFrame$RAIN*2.54)- dataFrame$RAIN_CM)>0.1),]
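
To see these checks in action without the course files, the sketch below runs them on a few invented rows. The command for listing duplicates is not shown in this transcript; duplicated() is used here as one plausible way to do it, not necessarily what the workflow uses.

```r
# Made-up rows standing in for the rainfall file (RAIN in inches, RAIN_CM in cm)
dataFrame <- data.frame(
  YEAR    = c(1990, 1991, 1991, 1992, 1993),
  RAIN    = c(40.0, 45.0, 45.0, -5.0, 50.0),
  RAIN_CM = c(101.6, 114.3, 114.3, -12.7, 12.7)
)

print("Here are Duplicated Data Lines")
dups <- dataFrame[duplicated(dataFrame), ]   # the second copy of the 1991 row
dups

print("now list out of range checks")
tooLow <- dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0), ]
tooLow                                       # the 1992 row

print("now list unit conversion errors")
convErr <- dataFrame[(abs((dataFrame$RAIN * 2.54) - dataFrame$RAIN_CM) > 0.1), ]
convErr                                      # the 1993 row: 50 in is 127 cm, not 12.7
```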

Examine the workflow on the bad data and change it!
  • Try setting different values for the range checks
  • Try different graphs (as you did for the good data)
  • Try listing all the data that was NOT duplicated (note: in R the “not” operator is “!”)
  • use R help and Google as needed
R+Kepler vs. R Alone
  • Given that “R” runs just fine alone, why use Kepler?
    • Allows use of OTHER Kepler actors, Data Turbine
      • E.g., EMLData, editors, graphical tools
    • Allows code to be segmented for easier editing in the future
    • Reusability – ability to copy and paste parts of Kepler workflows
    • Use spatial arrangement to help guide the user
  • Downsides
    • Complicates debugging
Workflow Steps
  • Read an EML metadata file
  • Convert it using an XSLT stylesheet into an R program
  • Edit the R program to point to the data
  • Ingest the data into a data frame
  • Summarize the data
  • “Tweak” the data to add a date-time vector for time plots and fix some conversion problems, then re-summarize the data
  • Run some plots
Passing R Workspaces
  • This workflow, instead of passing data from actor-to-actor, passes the name of the R Workspace
  • Subsequent actors re-open the R Workspace without needing to ingest the data again
  • This is very efficient, but this method only works for connecting R actors
R code for passing on R workspaces

Set Port Variable to the name of the workflow

Saving workspace for later use

Remember to save the workspace!

Loading the Saved Workspace

Name of Port connected to WorkingDir port (above)
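
The annotated code from the slide is not in this transcript. The sketch below shows the general save-then-reload idea in plain R. The scratch directory and the file name example.RData are assumptions made for the example; in the workflow the location would travel between actors through a port.

```r
# Assumption: the workspace lives in a scratch directory.
WorkingDir <- tempdir()
wsFile <- file.path(WorkingDir, "example.RData")   # hypothetical file name

# "Upstream" step: create some data and save every object defined so far.
# (At the top level of a session, save.image(wsFile) is the usual shortcut.)
dataTable1 <- data.frame(YEAR = c(2001, 2002), RAIN = c(42.1, 38.7))
save(list = ls(), file = wsFile)

# "Downstream" step: drop the data, then reload everything from the file
rm(dataTable1)
load(wsFile)
summary(dataTable1$RAIN)
```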

A conversion problem

Temperature and Humidity values have some severe problems reading in!

What happened?

R Factors
  • Factors are the way R deals with categorical or nominal data (i.e., typically non-numeric data)
  • Internally Factors are made up of two vectors:
    • Values – the distinct values that occur in the factor – often referred to as “levels”
    • Indexes – an integer vector, one entry per data element, recording which level that element holds (the numbers follow the ORDERing of the levels)
  • DANGER – sometimes when you read in data from a file, errors or odd characteristics of the data will cause R to read a column of (mostly) numbers as a Factor instead of as a numeric vector!

This is the mean of the INDEXES not the VALUES/Levels


After conversion data ranges are much better!

  • But Max_T is still suspicious!
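
The pitfall and its usual fix can be reproduced with a few made-up values. The column name Max_T is borrowed from the slide; the data are invented.

```r
# Made-up temperature column in which one entry is not a number
raw   <- c("21.5", "19.0", "missing", "30.2")
Max_T <- as.factor(raw)          # what can happen when reading a messy file

mean(as.numeric(Max_T))          # mean of the integer INDEXES, not the temperatures

# Convert via character to recover the real values ("missing" becomes NA)
fixed <- suppressWarnings(as.numeric(as.character(Max_T)))
mean(fixed, na.rm = TRUE)        # mean of 21.5, 19.0 and 30.2
```

Going through as.character() first is what makes the conversion use the levels rather than the indexes.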
Your Final Challenge
  • As its name suggests, this data file has some corrupted data (plus the normal errors)
  • Edit the “Tweaks” actor to add additional checks or add additional plots to identify the problems with the data
  • If you don’t cause Kepler to abort the workflow due to errors at least once, you aren’t trying hard enough! So make additions in a change-test-repeat cycle