Using the “R” Actor in Kepler for quality control

John Porter, University of Virginia

R Basics
• R is an open source statistical language
• “Atomic” types: logical, integer, real (called “double” or “numeric” in R), complex, string (or character) and raw
• Data in R is stored in one of several types of objects
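• Scalar: myVar <- 10 (in R this is simply a vector of length 1)
• Vectors: myVec <- c(10,20,30)
• Lists: myList <- list(10,"E",12.3) (note that c(10,"E",12.3) would create a character vector, not a list)
• Matrix: myMat <- cbind(myVec1,myVec2)
• Data Frames: myDf <- data.frame(myVec1,myVec2)
• Factors: myFac <- as.factor(c("E","W","E"))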
R Workspaces
• All the variables and functions defined during a session are part of the “Workspace”
• R Workspaces can be saved for later use
• When you come back, everything is the same as when the workspace was saved
Most Commonly Used Object Types
• Vectors – contain a single column of one of the “atomic” types
• Often created using the concatenate function

myVec <- c(10,20,30)

• Individual elements can be accessed using indexes

myVec[2] is 20

Data Frames
• Data Frames – table-style objects that contain named vectors inside them

myDF$RAIN refers to the “RAIN” vector, as does myDF[ ,2]

myDF[135,3] is 121.8

• A common way of creating data frames is to read in a comma-separated-value (csv) file

Note: regardless of operating system, R wants “/” in file paths, not a single “\” (a backslash only works if it is doubled, as “\\”)
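A minimal sketch of reading a csv file into a data frame. To stay self-contained it first writes a small temporary file; the column names YEAR and RAIN and all values are made up for illustration.

```r
# Write a small, made-up CSV file so the example is self-contained
csvPath <- tempfile(fileext = ".csv")
writeLines(c("YEAR,RAIN", "2001,40.2", "2002,35.7", "2003,51.0"), csvPath)

# Read it into a data frame; the first line supplies the column names
myDF <- read.csv(csvPath, header = TRUE)

str(myDF)     # show the structure of the data frame
myDF$RAIN     # the RAIN column as a vector
myDF[2, 2]    # the value in row 2, column 2
```

With a real file, the path would be written with forward slashes, for example "c:/localData/ft_monro.csv".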

Sample R Program for QA/QC

# Select the Data File

dataTable1 <- read.csv(infile1, header=FALSE, skip=1, sep=",", quote='"', col.names=c("YEAR", "RAIN", "RAIN_CM", "NOTES"), check.names=TRUE)

attach(dataTable1)

# Run basic summary statistics

summary(as.factor(NOTES))

summary(as.numeric(YEAR))

summary(as.numeric(RAIN))

summary(as.numeric(RAIN_CM))

Quick Exercise – Run these in R

# anything after a # sign on a line is just a COMMENT - it won't do anything

varA <- 10 # sets up a vector with one element containing a 10

varA # listing an object's name prints out the values

varB <- c(10,20,30) # sets up a vector with 3 elements. c() is the concatenation function

varB

varB[2] # now let's display ONLY the second element

# now let's do some math!

mySumAB <- varA + varB # adding them together.

# Note there is only 1 value in varA

mySumAB

# note the single value in varA is repeated (recycled) for each element of varB, giving 20 30 40

R Data Structures
• A lot of the “magic” in R is because of the object-oriented approach used
• R objects contain a lot more than just the data values
• A command that does one thing to a scalar (single value) does something else with a vector (a list of values) – all because R functions “understand” the difference!
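A small sketch of this behavior, using summary() as the example function. The object names and values are made up for illustration.

```r
numVec <- c(10, 20, 30)                      # numeric vector
facVec <- as.factor(c("wet", "dry", "wet"))  # factor (categorical data)

numSummary <- summary(numVec)  # minimum, quartiles, mean, maximum
facSummary <- summary(facVec)  # a count for each category

print(numSummary)
print(facSummary)
```

The same call produces statistics for the numeric vector and counts for the factor, because summary() checks what kind of object it was given.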
Conversions
• Conversions are possible between different modes or types of objects using conversion functions
• as.numeric(varA)
• makes varA a number – if it can!
• as.integer( )
• as.character( )
• as.factor()
• as.matrix()
• as.data.frame()
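A short sketch of what “if it can” means in practice. The variable names other than varA are made up for illustration.

```r
varA <- "10"                 # a character string that looks like a number
numA <- as.numeric(varA)     # becomes the number 10

varBad <- "ten"              # cannot be interpreted as a number
# as.numeric() returns NA here and would normally also print a warning
numBad <- suppressWarnings(as.numeric(varBad))

chrA <- as.character(numA)   # back to the string "10"
intA <- as.integer("12.9")   # the fractional part is dropped, giving 12
```

Checking for NA values after a conversion is a quick way to find entries that could not be converted.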
Using Data Frames

A <- c(10,20,30)

B <- c(4,6,3)

C <- c('A','B','C') # put letters in quotes

Df <- data.frame(C,A,B)

Df # list whole data frame

Df$A # list the A vector

Df[,3] # list the 3rd vector (B)

Df[1,] # list all columns for row 1

Df[Df$A > 10,] # list rows where A>10

Data Frames
• Results of Data Frame manipulations
R Help

R has a number of ways of calling up help

• ??sqrt – does a “fuzzy” search for functions like “sqrt”
• ?sqrt – does an exact search for the function sqrt() and displays documentation
• There are also manuals and extensive on-line tutorials (but Google is frequently the best way to find help)
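The ? and ?? shortcuts are mainly useful in an interactive session. As a related sketch that also works in a script, the base function apropos() lists the names of available objects that match a pattern.

```r
# ?sqrt is shorthand for help("sqrt"); ??sqrt is shorthand for help.search("sqrt")

# apropos() returns the names of objects whose names contain the pattern
sqrtNames <- apropos("sqrt")
print(sqrtNames)

# args() shows the arguments a function expects
args(sqrt)
```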
R & Kepler
• Kepler uses the “RExpression Actor” to run R code from inside Kepler
• Typically run with an SDF Director with a single iteration for most analyses
• You only need them done once!
• Don’t forget to set the iteration count – the default is to loop forever!
• To make RExpression actors really useful, they need to be able to communicate with other Kepler actors, beyond simply listing output or showing graphs
• To allow this intercommunication we need to add additional Input and Output ports
• The names of the ports will automatically be connected to objects with the same name in the R program
R Program to Test

Remember – names of ports translate into names of objects in R
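A minimal sketch of what such an R program might look like. The input port name myInValue is hypothetical; myOutValue is assumed to be the name of an output port added to the actor. Outside Kepler no port supplies the input, so it is assigned by hand here.

```r
# In Kepler, an input port named myInValue would supply this object.
# It is assigned by hand here so the sketch also runs in plain R.
myInValue <- c(10, 20, 30)

# An object with the same name as an output port is sent out on that port
myOutValue <- mean(myInValue)

myOutValue   # listing the object also shows it in the R output
```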

Results of Running Workflow

R listing output, with “myOutValue” displayed

R for Checking EML Data

But there are some TRICKS you should know!

Trick 1 – select the right object type for the EML actor
• By default, the EML actor sends only the FIRST LINE OF DATA to its output ports (“As Field”).
• If you want to have an output port represent the data as a VECTOR you need to select “As Column Vector”
• If you want to get a Data Frame instead of individual columns, you need to select “As ColumnBasedRecord”
Trick 2 – Trap R errors
• Normally if there is a problem with your R program you get a cryptic message from Kepler
try() and geterrmessage() in R

Runs the “errorplot()”* function and reports any error messages that occur when you run it

* There is no “errorplot()” function in R
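A minimal sketch of trapping an error this way. The variable names result and errMsg are made up for illustration; errorplot() is deliberately a function that does not exist.

```r
# errorplot() does not exist, so calling it fails on purpose
result <- try(errorplot(), silent = TRUE)

if (inherits(result, "try-error")) {
  errMsg <- geterrmessage()            # text of the most recent error
  print(paste("R reported:", errMsg))
} else {
  errMsg <- ""
}
```

Because try() catches the error, the rest of the R program keeps running and the message can be printed or passed on.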

QA/QC – Quality Assurance and Quality Control
• Error types
• Errors of Commission – data contains wrong values
• Errors of Omission – data that should be there is missing
• We will mostly be talking today about errors of commission
Porter’s Rule of Data Quality
• There is no non-trivial dataset that does not contain some errors
• Goal of QA/QC: reduce errors to the maximum possible extent, or at least to the level that they don’t adversely affect the conclusions reached through analysis of the data
QA/QC – Possible Tests
• Identification and removal of duplicates
• Correct Domain
• Numerical Range (e.g., -20 < Temperature < 50)
• Correct Codes (e.g., HOGI, not HOG1)
• Graphs
• Time-series plots
• Plots between variables
• Detections of “spikes” in time series
• Customized criteria (e.g., month specific range checks)
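Several of the tests listed above can be sketched in a few lines of R. The data frame, its column names, the list of valid codes and the spike threshold of 30 are all made up for illustration.

```r
# Made-up observations for illustration
obs <- data.frame(
  SPECIES = c("HOGI", "HOG1", "HOGI", "HOGI", "HOGI"),
  TEMP    = c(12.1, 13.0, 75.2, 12.8, 12.8)
)
validCodes <- c("HOGI")   # hypothetical list of allowed codes

dupRows   <- obs[duplicated(obs), ]                    # repeated lines
rangeRows <- obs[!(obs$TEMP > -20 & obs$TEMP < 50), ]  # outside -20 to 50
codeRows  <- obs[!(obs$SPECIES %in% validCodes), ]     # unknown codes

# Simple "spike" check: flag any change from the previous value above a threshold.
# This flags both the jump up to a spike and the return from it.
jump      <- c(0, abs(diff(obs$TEMP)))
spikeRows <- obs[jump > 30, ]
```

Each result is a data frame holding only the suspect lines, so an empty result means the data passed that test.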
Exercise – A succession of workflows for QA
• Open a Web Browser and go to:
• http://tinyurl.com/7po5ffb
• Open the LocalData.zip file
• Extract All Files to directory C:\
• You should then have a C:\localData directory containing the files for this exercise

1_Ft_Monroe_simple_summary.kar

Kepler Stuff to Note
• Annotations allow you to add titles and other useful instructions to your workflow display
Kepler Stuff to Note
• Parameters let you easily show and change values that will be used elsewhere in the workflow
Kepler Parameters
• Customize Name lets you set the NAME of the parameter and what should display on the screen
• Remember the name – that is how you will refer to the parameter later.

Using a Parameter Value
• Add a $ to the front of a parameter name in a Kepler settings box to insert the value of the parameter – so the Data File: is c:/localData/ft_monro.csv
Brief Exercise
• Experiment with editing connections in this workflow to display different graphs

Then open the 3_ft_monro_badData.kar workflow – it has a corrupted version of this data

R stuff to Note
• This workflow uses both a Data Frame (table) and vectors (single columns)

• In the dataFrame you can subset lines using: dataFrame[(dataFrame$RAIN < 0), ]
• Be sure to include the trailing comma!
• dataFrame$RAIN < 0 generates a logical vector of TRUE and FALSE values – one for each line
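A small sketch of how the logical vector drives the subsetting. The values and the name isNegative are made up for illustration.

```r
dataFrame <- data.frame(YEAR = c(2001, 2002, 2003),
                        RAIN = c(40.2, -9.0, 51.0))   # made-up values

isNegative <- dataFrame$RAIN < 0   # one TRUE or FALSE for each line
isNegative

dataFrame[isNegative, ]   # trailing comma: keep the matching rows, all columns
sum(isNegative)           # number of lines that fail the check
```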
QA/QC in R

summary(dataFrame)

print("Here are Duplicated Data Lines")

dataFrame[duplicated(dataFrame),]

print("now list out of range checks")

dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0),]

dataFrame[(dataFrame$RAIN > 150 | dataFrame$RAIN_CM > 300),]

print("now list unit conversion errors")

dataFrame[(abs((dataFrame$RAIN*2.54) - dataFrame$RAIN_CM) > 0.1),]

Examine the workflow on the bad data and change it!
• Try setting different values for the range checks
• Try different graphs (as you did for the good data)
• Try listing all the data that was NOT duplicated (note: in R the “not” operator is “!”)
• use R help and Google as needed
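One possible approach to the “not duplicated” item, as a sketch on made-up data. The names dupFlag and cleanData are hypothetical.

```r
dataFrame <- data.frame(YEAR = c(2001, 2002, 2002, 2003),
                        RAIN = c(40.2, 35.7, 35.7, 51.0))   # made-up values

dupFlag   <- duplicated(dataFrame)   # TRUE for repeats of an earlier line
cleanData <- dataFrame[!dupFlag, ]   # "!" flips TRUE and FALSE
cleanData
```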
R+Kepler vs. R Alone
• Given that “R” runs just fine alone, why use Kepler?
• Allows use of OTHER Kepler actors, Data Turbine
• E.g., EMLData, editors, graphical tools
• Allows code to be segmented for easier editing in the future
• Reusability – ability to copy and paste parts of Kepler workflows
• Use spatial arrangement to help guide the user
• Downsides
• Complicates debugging
Workflow Steps
• Convert it using an XSLT stylesheet into an R program
• Edit the R program to point to the data
• Ingest the data into a data frame
• Summarize the data
• “Tweak” the data to add a date-time vector for time plots and fix some conversion problems, then re-summarize the data
• Run some plots
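The “tweak” step could look something like the sketch below. It assumes the date and time arrive as separate text columns; the column names DATE, TIME, MAX_T and DATETIME and all values are made up for illustration.

```r
dataFrame <- data.frame(DATE  = c("2011-07-01", "2011-07-01", "2011-07-02"),
                        TIME  = c("00:00", "12:00", "00:00"),
                        MAX_T = c(21.5, 30.2, 20.9),
                        stringsAsFactors = FALSE)   # made-up values

# Combine the two text columns into a single date-time vector
dataFrame$DATETIME <- as.POSIXct(paste(dataFrame$DATE, dataFrame$TIME),
                                 format = "%Y-%m-%d %H:%M", tz = "UTC")

summary(dataFrame)   # re-summarize after the change
```

A time-series plot can then be drawn with plot(dataFrame$DATETIME, dataFrame$MAX_T, type = "l").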
Passing R Workspaces
• This workflow, instead of passing data from actor-to-actor, passes the name of the R Workspace
• Subsequent actors re-open the R Workspace without needing to ingest the data again
• This is very efficient, but this method only works for connecting R actors
R code for passing on R workspaces

Set Port Variable to the name of the workflow

Saving workspace for later use

Remember to save the workspace!

Name of Port connected to WorkingDir port (above)
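The idea of saving a workspace in one actor and re-opening it in another can be sketched in plain R as follows. The file name is a temporary file chosen for illustration; in the workflow the name would be passed from actor to actor through a port.

```r
# Hypothetical workspace file; a temporary file keeps the sketch self-contained
workspaceFile <- tempfile(fileext = ".RData")

# "First actor": ingest the data once, then save everything defined so far.
# At the top level, save.image(file = workspaceFile) does the same thing.
dataFrame <- data.frame(YEAR = c(2001, 2002), RAIN = c(40.2, 35.7))  # made-up values
save(list = ls(), file = workspaceFile)

# "Later actor": starts without the data, then re-opens the saved workspace
rm(dataFrame)
load(workspaceFile)
summary(dataFrame)
```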

A conversion problem

Temperature and Humidity values have some severe problems reading in!

What happened?

R Factors
• Factors are the way R deals with categorical or nominal data (typically non-numeric data)
• Internally Factors are made up of two vectors:
• Values – the distinct values that occur in the factor – referred to as “levels”
• Indexes – an integer vector, one entry per data element, giving the position of that element’s value in the ORDERed list of levels
• DANGER – sometimes when you read in data from a file, errors or odd characteristics of the data will cause R to read a column of (mostly) numbers as a Factor instead of as a numeric vector!
Factors

This is the mean of the INDEXES not the VALUES/Levels
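A small sketch of the problem and one common fix. The values and the names rawTemp, tempFac and tempNum are made up for illustration.

```r
# A column of (mostly) numbers that was read as text, with one odd entry
rawTemp <- c("20.5", "21.0", "bad", "35.5")
tempFac <- as.factor(rawTemp)

as.numeric(tempFac)   # WRONG: returns the integer indexes, not the temperatures

# Convert to character first, then to numbers; "bad" becomes NA
tempNum <- suppressWarnings(as.numeric(as.character(tempFac)))
mean(tempNum, na.rm = TRUE)   # mean of the actual values
```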

After conversion data ranges are much better!

• But Max_T is still suspicious!