Create Presentation
Download Presentation

Download Presentation
## Introduction to R for Absolute Beginners: Part I

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Introduction to Rfor Absolute Beginners: Part I**Melinda Fricke Department of Linguistics University of California, Berkeley melindafricke@berkeley.edu D-Lab Workshop Series, Spring 2013**Why this workshop?**"The questions that statistical analysis is designed to answer can often be stated simply. This may encourage the layperson to believe that the answers are similarly simple. Often, they are not… No-one should be embarrassed that they have difficulty with analyses that involve ideas that professional statisticians may take 7 or 8 years of professional training and experience to master.” (Maindonald and Braun, 2010. Data Analysis and Graphics Using R: An Example-Based Approach.)**What we will cover today**• Getting around in R • working directories, managing your workspace, creating and removing objects (types of variable assignment), inspecting objects, viewing functions, getting help • Types of data and basic manipulations • data types, object types, reading and writing data, modifying data and objects, basic functions**(What we will cover next time)**Downloading and installing external packages Common statistical tests correlation, simple linear regression, t-tests, ANOVA Graphing R makes really beautiful graphs, and is very flexible**What we will not cover (ever)**"The best any analysis can do is to highlight the information in the data. No amount of statistical or computing technology can be a substitute for good design of data collection, for understanding the context in which data are to be interpreted, or for skill in the use of statistical analysis methodology. Statistical software systems are one of several components of effective data analysis.” (Maindonald and Braun, 2010)**Why use R?**R is an incredibly flexible, high-level programming language that will allow you to conduct nearly any statistical analysis, and create any visualization you can think of. This video was created by Ben Schmidt using the ggplot2 package. http://sappingattention.blogspot.com/2012/10/data-narratives-and-structural.html**A simpler example…**(These are from my own work on the production of “s” sounds.)**Creating objects**Open R and type the following: x = 1 [enter] y <- 2 [enter] 3 -> z [enter] x [enter] y [enter] z [enter] There are 3 ways to assign variables in R.**Creating objects**Now try this: x + y + z x + y + z -> q q What’s the difference between the first line and the second?**Creating objects**These are vectors. A vector is just a bunch of values that have been concatenated together, in a sequence. When we type “q”, R tells us that the first element in the vector “q” is 6: [1] 6 (It’s also the only element, but that’s okay.)**Creating objects**We can create vectors that are longer than 1 element by concatenating multiple elements together: x = c(7, 8, 8, 7, 4, 1) x length(x)**A little bit about “looping”**R is a “high level” programming language. This means it takes care of a lot of things for us. x * 2 x + y Most programming languages require you to write loops, but R takes care of a lot of “looping” on its own.**Pop quiz!**What will this code produce? length(x + y)**A little bit about looping**What will this code produce? length(x + y) [1] 6**A little bit about looping**What will this code produce? length(x + y) [1] 6 The length of x is 6. (x + y) loops through the vector x, adding y to each number, yielding 6 new values (and therefore a vector of length 6). Remember: if you want to concatenate, use c(): length(c(x,y)) [1] 7**Pop quiz!**What will this code produce? length(x + y) [1] 6 length(length(x + y))**Pop quiz!**What will this code produce? length(x + y) [1] 6 length(length(x + y)) [1] 1**A little bit about looping**What will this code produce? x = c(7, 8, 8, 7, 4, 1) y = c(1, 2) x + y**A little bit about looping**What will this code produce? x = c(7, 8, 8, 7, 4, 1) y = c(1, 2) x + y [1] 8 10 9 9 5 3**A little bit about looping**What will this code produce? x = c(7, 8, 8, 7, 4, 1) y = c(1, 2) x + y [1] 8 10 9 9 5 3 “Loop through x and y simultaneously, adding 2 elements together.” For operations using 2 vectors of different lengths, the shorter one will be “recycled”. (Look at y + x.)**Data types**class(x) class(‘x’)**Data types**class(x) [1] “numeric” class(‘x’) [1] “character”**Data types**class(x) [1] “numeric” class(‘x’) [1] “character” There are different types of data in R. Putting quotes around something indicates you want R to treat it literally - as a character, not a variable. e.g. class(‘1’)**Data types**as.character(x) -> k k + 1**Data types**as.character(x) -> k k + 1 Your first error message! as.numeric(k) -> k k + 1**Data types**as.factor(k)**Data types**as.factor(k) A factor is R-speak for a categorical variable: a type of data that can have one of several fixed levels. By default, factor levels are ordered alphabetically. levels(k)**Data types**as.factor(k) A factor is R-speak for a categorical variable: a type of data that can have one of several fixed levels. By default, factor levels are ordered alphabetically. levels(k) Oops! Why doesn’t this work?**Data types**as.factor(k) -> k A factor is R-speak for a categorical variable: a type of data that can have one of several fixed levels. By default, factor levels are ordered alphabetically. levels(k) [1] “1” “4” “7” “8”**Data types**c(10, 11, 10, 8, 6, 12, 11, 8, 10, 6, 10, 11) -> p**Using what we’ve learned so far…**c(10, 11, 10, 8, 6, 12, 11, 8, 10, 6, 10, 11) -> p How many items are in the vector ‘p’? How many unique values are in ‘p’?**Using what we’ve learned so far…**c(10, 11, 10, 8, 6, 12, 11, 8, 10, 6, 10, 11) -> p How many items are in the vector ‘p’? length(p) [1] 12 How many unique values are in ‘p’?**Using what we’ve learned so far…**c(10, 11, 10, 8, 6, 12, 11, 8, 10, 6, 10, 11) -> p How many items are in the vector ‘p’? length(p) [1] 12 How many unique values are in ‘p’? length(levels(as.factor(p)))**Your first R weirdness**as.factor(p) -> p as.numeric(p) What happened??**Your first R weirdness**as.factor(p) -> p as.numeric(p) [1] 3 4 3 2 1 5 4 2 3 1 3 4 ( 10 11 10 8 6 12 11 8 10 6 10 11) When you try to change a factor directly into numeric mode, the factors are replaced by their “order”. How could we avoid this?**Your first R weirdness**as.numeric(as.character(p))**Your first R weirdness**as.numeric(as.character(p)) What if we want to change the order of the levels?**Your first R weirdness**as.numeric(as.character(p)) What if we want to change the order of the levels? factor(p, levels=c(‘6’, ‘8’, ‘10’, ‘12’, ‘11’)) -> p levels(p)**Ordered factors**Factors may make more sense if we give our categories names other than numbers. Try this: mycolors = c(‘blue’, ‘yellow’, ‘green’, ‘purple’, ‘red’) class(mycolors) factor(mycolors, levels=c(‘red’, ‘yellow’, ‘green’, ‘blue’, ‘purple’)) -> mycolors class(mycolors) levels(mycolors)**Taking stock**Objects and operations values e.g. ‘1’ vectors c(‘1’, ‘4’, ‘a’, ‘word’) functions as.factor(x) variable assignment =, ->, <- Data types (‘classes’) numeric 8 character ‘8’, ‘x’, ‘female’ factor ‘8’, ‘x’, ‘female’ the difference between strings of characters and factors is that factors have one of a set of fixed values e.g. ‘male’ vs. ‘female’**Some more useful functions**Type these commands in to see what they do: ls() table(p) unique(p) sort(p) mean(p) median(p) sd(p) edit(p) ls**Some more useful functions**Type these commands in to see what they do: ls() lists the objects currently in your workspace table(p) unique(p) sort(p) mean(p) median(p) also useful: rm() (remove) sd(p) edit(p) ls**Some more useful functions**Type these commands in to see what they do: ls() lists the objects currently in your workspace table(p) creates a table of counts unique(p) sort(p) mean(p) median(p) sd(p) edit(p) ls**Some more useful functions**Type these commands in to see what they do: ls() lists the objects currently in your workspace table(p) creates a table of counts unique(p) lists all existing unique values sort(p) mean(p) median(p) sd(p) edit(p) ls**Some more useful functions**Type these commands in to see what they do: ls() lists the objects currently in your workspace table(p) creates a table of counts unique(p) lists all existing unique values sort(p) sorts values from lowest to highest mean(p) median(p) sd(p) edit(p) ls**Some more useful functions**Type these commands in to see what they do: ls() lists the objects currently in your workspace table(p) creates a table of counts unique(p) lists all existing unique values sort(p) sorts values from lowest to highest mean(p) mean of the values median(p) median (middle) of the values sd(p) standard deviation edit(p) ls**Some more useful functions**Type these commands in to see what they do: ls() lists the objects currently in your workspace table(p) creates a table of counts unique(p) lists all existing unique values sort(p) sorts values from lowest to highest mean(p) mean of the values median(p) median (middle) of the values sd(p) standard deviation edit(p) lets you interact directly with the data! “edit(p) -> p” to save your changes ls**Some more useful functions**Type these commands in to see what they do: ls() lists the objects currently in your workspace table(p) creates a table of counts unique(p) lists all existing unique values sort(p) sorts values from lowest to highest mean(p) mean of the values median(p) median (middle) of the values sd(p) standard deviation edit(p) lets you interact directly with the data! ls displays the internal workings of the function**Getting help with functions**?sort search current packages for a function help(sort) (these two are equivalent) ??sort search all packages for a word