From n00b to pro
Sponsored Links
This presentation is the property of its rightful owner.
1 / 29

From n00b to Pro PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

From n00b to Pro. Jack Davis Andrew Henrey. PURPOSE. Create a simulator from scratch that: Generates data from a variety of distributions Makes a response variable from a known function of the data (plus an error term)

Download Presentation

From n00b to Pro

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

From n00b to Pro

Jack Davis

Andrew Henrey


Create a simulator from scratch that:

  • Generates data from a variety of distributions

  • Makes a response variable from a known function of the data (plus an error term)

  • Constructs a linear model that estimates the coefficients of the function

  • Repeats generation and modeling many times to compare the average estimates of the linear model to the known parameters.

  • Package the whole thing nicely into a function that we can call in a single line in later work.

  • If you’re experienced, the commands themselves may seem trivial


  • 1) Learning how to learn

  • 2) Randomly Generating Data

  • 3) Data Frames and Manipulation

  • 4) Linear Models

  • BREAK – Quality of presenter improves

  • 5) Running loops

  • 6) Function Definition

  • 7) More advanced function topics

  • 8) Using functions

  • 9) A short simulation study

Learning how to learn – Jack Davis

  • Google CRAN Packages to get the package list

    • From here you can get a description of every command in a package.

  • ??<term> searches for commands related to <term>

    • ??plot will find commands related to plot

  • ?<command> calls up the help file for that command

    • ?ablinegives the help file for the abline() command.


  • Exercises:

  • Name one function in the darts game package.

  • What is the e-mail of the author of the Texas Holdem simulation package?

  • (Bonus) Tell the author about your day via e-mail; s/he likes hearing from fans.

  • Find a function to make a histogram

  • Find some example code on the heatmap() command.

Randomly generating data – jack davis

  • The r<dist> commands randomly generate data from a distribution

  • rnorm( n , mean, sd) Generates from normal distribution (default N(0,1))

  • rexp( n, rate)

  • rbinom( n, size, prob)

  • rt( n, df) From Student’s T. (Mean is zero, so setting a mean is up to you)

  • set.seed() Allows you to generate the same data every time, so you or others can verify work.


  • Set a random seed

  • Generate a vector of 50 values from the Normal (mean=10,sd=4) distribution, name the vector x1.

  • Do the same with

    • Poisson ( lambda = 5), named x2,

    • Exponential (rate = 1/7) named x3,

    • Student’s t distribution (df =5), with a mean of 5, named x4,

    • Normal (mean=0, sd=20), named err

  • Make a new variable y, let it be 3 + 20x1 + 15x2 – 12x3 – 10x4 + err

Data frames – jack davis

  • data.frame() makes a dataframe object of the vectors listed in the ()

  • The advantage of having a data frame is that it can be treated as a single object..

  • Data frames, models, and even matrix decompositions can be objects in R.

  • You can call parts of objects by name using $

  • model$coefor model$coefficient will bring up the estimated coefficients

  • If no such aspect exists, then you’ll get a null response.

  • Example:Andrew$height

Data frames – jack davis

  • Exercises:

  • Make a data.frame() of x1,x2,x3,x4, and y Name it dat

  • (if you’re stuck from the last part, run “Q3-dataframethis.txt” first)

  • Use index indicators like dat[4,3], dat [2:7,3], dat [4,], and dat [4,-1] to get

    • The 3rd row, 5th entry of dat

    • The 2nd – 7th values of the 5th column

    • The entire 3rd row

    • The 3rd row without the 1st entry

Linear models – jack davis

  • The results of the lm() function are an object.

  • Example: mod = lm(y ~ x1 + I(x2^2) + x1:x2, data=dat)

  • Useful aspects

    • mod$fitted

    • mod$residuals

  • Useful functions

    • summary(mod)

    • predict(mod, newdata)

Linear models – jack davis

  • Use the lm command to create a linear model of y as a function of x1,x2,x3, and x4 additively using dat data, name it mod. (No interactions or transformations)

  • Get the summary of mod

  • Display the estimated coefficients with no other values.


  • This slide unintentionally left 98% blank


  • 1) Learning how to learn

  • 2) Randomly Generating Data

  • 3) Data Frames and Manipulation

  • 4) Linear Models

  • BREAK – Quality of presenter improves deteriorates

  • 5) Running loops

  • 6) Function Definition

  • 7) More advanced function topics

  • 8) Using functions

  • 9) A short simulation study


  • Similar to other programming languages, loops in R allow you to repeat the same block of code several times

  • Unlike other programming languages, large loops in R are exceedingly slow

  • Any loop of less than about 100,000 total iterations is not going to give you much trouble in terms of time


  • An R loop that executes a million commands takes about a second. Conditions vary wildly

  • Generating 100,000 data sets of size 50,000 and looping through the dataset to calculate a mean for each one would take longer to run than Jack Davis heading up Burnaby Mountain (ouch)

Sup d00dz late for tutorial 



  • Loop syntax:

    for (i in 1:n)





  • No need to run from 1:K

  • Can use an arbitrary vector instead

  • Runs for length(vect) iterations

  • Takes on the ith value of the vector each iteration

  • e.g.

  • V = c(1,5,3,-6)

  • for (count in V) {print(count);}

  • ## 1, 5 , 3 , -6


  • Exercises:

  • A) Define a variable runs to be the number 10,000

  • B) Define a matrix() called mat with 5 columns and runs rows (10,000 rows)

  • C) Put a for() loop around the code found in q5-loopthis.txt . Loop from 1:runs.

    Use index indicators like a[k,] to save the estimated coefficients of the model in a new row of mat.

    OR, if you think you are a total coding BOSS, then put the loop around your code in parts 2-4 that generates data and finds the linear model estimates of the betas.

Function definition – Andrew Henrey

  • Functions are a slightly abstract concept

  • Mathematics: f(x) = x2+4x-16

  • Computing: mean(x) = sum(x)/length(x) – All 3 are functions!

  • Functions map INPUTS to OUTPUT

  • Possibly no inputs

  • In one way or another, always some form of output

  • Example functions:


Function definition – Andrew Henrey

  • Function syntax

  • Simple function:

    F = function()


    return (5)


    >> F()


Function Definition – Andrew Henrey

  • Exercises:

  • Make a function out of the code you wrote in part 5. The syntax should be similar to the previous slide. The function:

    • Should be called simulate.lm

    • Should include everything needed to generate the data several times, find a linear model, and extract the coefficients

    • Does NOT take any inputs

    • Should return the matrix of 10,000 runs of coefficients

  • Use the function and save the results to a matrix called test

  • If nothing is working (), you can use the example code in q6 – function this.txt

advanced functions – Andrew Henrey

  • A more complicated example:

    MSE = function(X=c(0,3,11),Y)


    return (mean((X-Y)^2))


  • Observe that X has default values



Advanced Functions – Andrew Henrey

  • If an input argument to a function has default values, you don’t have to specify them when calling the function

  • If an input argument has no default values, running the function without specifying them gives you an error

Advanced functions – Andrew Henrey

  • Exercises:

  • Modify simulate.lm() by adding input parameters. Include:

  • nruns, the number of runs in the simulation, with no default

  • seed, the random initial seed of the simulation, defaulting to 1337

  • verbose, a Boolean true/false to report progress, defaulting to FALSE (caps matter)

  • Set runs to nruns at the beginning of your function

  • Use set.seed(seed) in your function

  • Add code that prints out how far along you are in the loop, but only when verbose is true

  • Run this new function to overwrite the old one

Using Functions – Andrew Henrey

  • Exercises:

  • A) Run your simulate.lm() with 25000 runs. Store these results as a variable called betas

  • B) Use hist() on the first column of betas to see the sampling distribution of the intercept

  • C) Use summary(), mean() , and sd() on this column as well

  • D) Use par(mfrow=c(2,2)) and then some hist() commands to display histograms of the other four sampling distributions in a 2x2 grid

  • E) Compare your results to the known values (The means of the sampling distributions to the true values, and the standard deviations to the estimated values in Q4)

Simulation Study – Andrew Henrey

  • Idea: You have a binomial experiment with 9 successes and 3 failures.

  • You would like to construct a 95% CI for the true proportion of successes

  • You DON’T know whether the normal approximation is appropriate

  • How can we find out whether or not it’s OK?

Simulation Study – Andrew Henrey

  • Overall procedure:

  • Construct a LOT of samples from a population with 12 trials and p=0.75

  • For each sample, calculate the 95% CI using the normal approximation

  • For each sample, see whether the CI overlaps with 0.75

  • Count the number of samples for which the CI overlaps with 0.75

  • The proportion of the samples that have a CI that overlaps is called the “true coverage probability”

  • If the true coverage probability is close to 95% , the normal approximation to the sampling distribution of p is a good one.

Simulation Study – Andrew Henrey

  • Steps:

  • Generate 100,000 samples of binomial(12,0.75) data using rbinom()

  • For each sample, calculate the usual estimate of p

  • For each sample, calculate SE = sqrt(p*(1-p)/12)

  • For each sample, calculate the lower and upper bounds of the 95% CI

  • Find out how many intervals actually contain 0.75

  • Optional: Look at hist(x) to gain intuition of why the normal approximation isn’t perfect


  • Leave plztyvm

  • Login