Initial Data Analysis

Some Distinctions • Population vs. Sample • Descriptive vs. Inferential stats • Variables • Types of data • Quantitative versus Categorical • Measurement scales

Population • The entire collection of events that you are interested in generalizing to. • For example, our population could be the students in this class, UNT students, all students in the U.S., or people in general. • Although we wish to make claims about the entire population, it is often too large to deal with, and so we take a portion of it to study. • Random sampling

Random Sampling • Choose a subset (sample) of the population, ensuring that each member of the population has an equal chance of being sampled. • Examine that sample and use your observations to draw inferences about the population. • Examples: voting polls, television ratings, rolling a die.

Random Sampling • Note, however, that the inferences drawn are only as good as the representativeness of the sample. • If the sample is not random, it may not be representative of the population. When a sample is not representative of its parent population, the external validity of any inference is called into question, i.e. how well will we be able to generalize? • Example: most psychology studies rely on samples of freshman psych students.
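A minimal sketch of simple random sampling, assuming a made-up population of 10,000 student ID numbers (the population, sample size, and seed are illustrative, not from the lecture). Each member has the same chance of ending up in the sample.

```python
import random

# Hypothetical population: ID numbers for 10,000 students (illustrative only).
population = list(range(10_000))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simple random sample without replacement: every member of the
# population has an equal chance of being selected.
sample = rng.sample(population, k=100)

print(len(sample), "students sampled out of", len(population))
```

A convenience sample (for example, the first 100 IDs) would be easier to collect but gives no such guarantee of representativeness.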

Random Assignment • When studying the effects of some treatment variable experimentally, it is also important to randomly assign subjects to treatments. • Control vs. experimental group • Oftentimes we want to look at the effects of some treatment, e.g. a drug, teaching strategy, or memory technique. • To study the effects of the treatment we’ll often give one or more groups the treatment and one group no treatment, and then compare the groups. • Random assignment reduces the likelihood that groups differ in some critical way other than the treatment, since everyone has an equal chance of being placed in any of the groups.

Random Assignment • If random assignment is not used, then the internal validity of the experimental results may be compromised, i.e. are our results due to the treatment we’ve imposed or to something else? • Example: if males and females are not randomly assigned to receive the treatment, any effects seen may be due to gender rather than the treatment.
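Random assignment can be sketched as shuffling the subjects and dealing them into groups. The helper name `randomly_assign`, the subject labels, and the group names below are hypothetical, chosen only for illustration.

```python
import random


def randomly_assign(subjects, groups, seed=None):
    """Shuffle the subjects, then deal them into the groups round-robin."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for i, subject in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(subject)
    return assignment


# Hypothetical subject labels S01..S20 (illustrative only).
subjects = [f"S{i:02d}" for i in range(1, 21)]
assignment = randomly_assign(subjects, ["control", "treatment"], seed=1)

for group, members in assignment.items():
    print(group, members)
```

Because the order is shuffled before dealing, no characteristic of the subjects (gender, age, motivation) can systematically determine who ends up in which group.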

Variables • Assume we have a random sample of subjects that we have randomly assigned to treatment groups. • Example: stop-smoking study. • Now we must select the variables we wish to study, with the term variable referring to a property of an object or event that can take on different values. • Example: # of cigarettes smoked, abstinence after one week (yes or no). • Note the distinction: # of cigarettes smoked is a quantitative variable, whereas abstinence is a categorical variable. • A variable is to be contrasted with a constant, which takes on only one value.

Types of Data • Measurement (quantitative, magnitude) data • Continuous vs. discrete • Example: GPA during college vs. GPA for class • Example: 9-point “Likert” scale: continuous or discrete? 20-point? • Categorical (frequency, nominal, qualitative) data • Named data, e.g. different brands, political party, race, gender • How you think about your data and what scale of measurement your variables are on is very important. • What you decide about the variable will have a say in the analyses available, and may even have a large effect on the theory itself. • Example: early developmental theories suggested clear-cut stages, which imply categories of development.

A Note about Categorical Variables • Consider a grouping variable in which people are classified into In-patient, Out-patient, and Control groups. • There is only one variable in a theoretical sense, and our goal is to determine the relationship between the grouping variable and the outcome; an ANOVA table in this sense would speak to the overall effect group membership has on the outcome. • Statistically speaking, there are actually two coded variables if we are applying the general linear model (see dummy coding, effects coding, etc.). • What you want to keep in mind is that a single group has no relationship with the outcome, as membership in a single group is a constant.
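To see why three groups become two coded variables, here is a minimal dummy-coding sketch. The helper `dummy_code`, the five example cases, and the choice of Control as the reference group are assumptions made for illustration.

```python
def dummy_code(labels, reference):
    """Return the non-reference levels and one 0/1 indicator column per level."""
    levels = sorted(set(labels) - {reference})
    rows = [[1 if label == level else 0 for level in levels] for label in labels]
    return levels, rows


# Hypothetical group memberships for five people (illustrative only).
groups = ["In-patient", "Out-patient", "Control", "Control", "In-patient"]

levels, coded = dummy_code(groups, reference="Control")

print(levels)  # the two coded variables
for label, row in zip(groups, coded):
    print(f"{label:12s} {row}")
```

The reference group is the row of all zeros, so k groups need only k - 1 indicator columns.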

Variables • Another distinction concerns the variables we are interested in understanding and explaining (dependent, criterion, outcome variables) versus those we expect to have an effect (perhaps causal) on that outcome, and which we may be able to manipulate or control experimentally (independent, predictor variables). • Which one is predicting and which is being predicted? • Both predictor and dependent variables can be quantitative or categorical. • Example: whether or not we give a subject the stop-smoking treatment would be the independent variable, and the # of cigarettes smoked would be a dependent variable. • Other examples (predictor:outcome): age:income, shoe size:intelligence, gender:hostility, intelligence:voting outcome

Parameters and Statistics • Parameters are simply values associated with the population and as such are often inferred rather than actually known. • They are designated with Greek notation, e.g. μ for the population mean. • “Statistics” in this sense speak specifically to the data set at hand (the sample) and make no reference to values outside the sample. • Using the statistics we have collected, we then infer the population values (parameters), e.g. use the sample mean to infer μ. • Most commonly employed methods assume a fixed population parameter.
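The parameter/statistic distinction can be sketched with a simulated population, where (unlike in practice) the population mean is actually available for comparison. The population size, distribution, sample size, and seed are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 scores (simulated, not real data).
population = [rng.gauss(100, 15) for _ in range(10_000)]
mu = statistics.fmean(population)        # parameter: usually unknown in practice

sample = rng.sample(population, k=200)   # simple random sample
x_bar = statistics.fmean(sample)         # statistic: computed from the sample only

print(f"population mean (mu)  = {mu:.2f}")
print(f"sample mean   (x_bar) = {x_bar:.2f}")
```

The sample mean will generally be close to, but not exactly equal to, the population mean; a different random sample would give a slightly different estimate.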

What Do We Do With The Data? • Descriptive statistics are used to describe the data set itself without reference to the population from which it is derived. • Examples: graphing, calculating averages, looking for extreme scores. • Exploratory/initial data analysis (Tukey, Chatfield, others) relies mostly on descriptive information.
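A small sketch of descriptive statistics on a made-up set of nine scores. The data and the "more than 2 standard deviations from the mean" rule for flagging extreme scores are illustrative assumptions; other rules (e.g. based on the median or quartiles) are equally reasonable.

```python
import statistics

# Hypothetical scores (illustrative only); one value is unusually large.
scores = [12, 15, 14, 10, 13, 15, 11, 14, 48]

mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)

# Flag scores more than 2 standard deviations from the mean.
extreme = [x for x in scores if abs(x - mean) / sd > 2]

print(f"mean = {mean:.2f}, median = {median}, sd = {sd:.2f}")
print("extreme scores:", extreme)
```

Note how the single extreme score pulls the mean well above the median, which is one reason to look at the data before running any test.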

What Do We Do With The Data? • Inferential statistics allow you to infer something about the parameters of the population based on the statistics of the sample and the various tests we perform on the sample. • Examples: chi-square, t-tests, correlations, ANOVA
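As one hedged example of an inferential statistic, the sketch below computes a Welch (unequal-variance) t statistic and its approximate degrees of freedom by hand for two made-up groups from a stop-smoking study. The data and the function name `welch_t` are assumptions; obtaining a p-value would additionally require the t distribution, which is left out here.

```python
import math
import statistics


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error, group a
    vb = statistics.variance(b) / len(b)  # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical cigarettes smoked per day (illustrative only).
treatment = [8, 10, 7, 9, 6]
control = [14, 12, 15, 13, 16]

t, df = welch_t(treatment, control)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large negative t here would suggest that the treatment group smoked fewer cigarettes than could easily be explained by sampling variation alone.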

Measurement Scales • Nominal: category labels assigned in some meaningful way (e.g. gender, political party) • Ordinal: orders or ranks objects on some continuum (e.g. military ranks) • Interval: can speak of differences between scale points, but the zero point is arbitrary (Fahrenheit scale: 30° - 20° = 20° - 10°, but 20° is not twice as hot as 10°!); probably the most common in psych • Ratio: same as interval but with a true zero point (distance, weight, Kelvin; physical measurements). Ratio scales are interval scales too, but not the other way around.
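The interval vs. ratio point can be checked with a little arithmetic. The sketch below uses the standard Fahrenheit-to-Kelvin conversion; the function name is just for illustration.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (f - 32) * 5 / 9 + 273.15


# Differences are meaningful on an interval scale:
print(30 - 20 == 20 - 10)  # equal 10-degree steps

# Ratios are not: 20/10 == 2 on the Fahrenheit scale, but on the Kelvin
# scale the same two temperatures differ by only about 2 percent.
ratio_f = 20 / 10
ratio_k = fahrenheit_to_kelvin(20) / fahrenheit_to_kelvin(10)
print(f"Fahrenheit ratio = {ratio_f:.2f}, Kelvin ratio = {ratio_k:.3f}")
```

Only the Kelvin ratio says anything about "how much hotter", because only Kelvin has a zero that means the absence of the quantity.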

Scales • There is much debate with regard to scale distinctions and how to deal with different data types. • Some data even seem to qualify as more than one type. • Although some analyses will give the same outcome whatever you call your data, which analysis you perform may be affected by what you see the underlying construct to be, and so it is important that you give it some thought.

Decision tree

Decision Trees • While decision trees might be helpful, they are, at best, a suggestion, and should never be used as a definitive statement on what analysis to do. • Example: political party (Republicans vs. Democrats) and 3 ‘continuous’ attitude measures (gun control, abortion, Iraq) • Simple enough, right? • Possible analyses: • Purely descriptive assessment • Simple correlations • 3 t-tests (classical, non-parametric, or robust? e.g. bootstrap, Mann-Whitney, differences on M-estimators rather than means, etc.) • MANOVA (differences on the linear combination of the continuous measures) • Discriminant function analysis or logistic regression (predicting party based on attitude, i.e. classification) • Factor analysis on the continuous measures, then a t-test on the factor scores
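To make one of the listed options concrete, here is a minimal sketch of a percentile bootstrap interval for the difference in group means on a single attitude measure. The scores, group sizes, number of resamples, and seed are all made up for illustration, and this is only one of several bootstrap variants.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

# Hypothetical gun-control attitude scores by party (illustrative only).
republicans = [2, 3, 4, 3, 2, 4]
democrats = [6, 5, 7, 6, 5, 7]

observed = statistics.mean(republicans) - statistics.mean(democrats)

# Resample each group with replacement and recompute the mean difference.
n_boot = 2000
diffs = []
for _ in range(n_boot):
    r = rng.choices(republicans, k=len(republicans))
    d = rng.choices(democrats, k=len(democrats))
    diffs.append(statistics.mean(r) - statistics.mean(d))

# Percentile interval: middle 95% of the bootstrap differences.
diffs.sort()
lower = diffs[int(0.025 * n_boot)]
upper = diffs[int(0.975 * n_boot) - 1]

print(f"observed difference = {observed:.2f}")
print(f"95% bootstrap interval = [{lower:.2f}, {upper:.2f}]")
```

The same data could just as well be analyzed with a classical t-test or a Mann-Whitney test, which is the point of the slide: the design does not dictate a single analysis.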

Analyses • The point is that you will always have options for how to understand, describe, and analyze the data. • What you must do is work this out BEFORE collecting the data. • While you will still have options on how to analyze the data in the end, those options should be known before anything is collected. • To collect data without an analysis in mind that will test your theoretical model is to waste your time, and it suggests that you do not truly understand what your theory/hypothesis is.
