Last lecture summary

Last lecture summary

Confidnce interval • confidence interval of the mean • Confidence level 95%, what does it mean? • How is the interval constructed? • what is ? • Z-value on the standard normal distribution tn-1-value on t-distribution

Hypothesis testing • We test a claim about what? • population parameter • Two hypothesis? • Which hypothesis we test? What does it mean statistically significant? • So, how would you proceed in testing the following claim: “Jarda sleeps on average 8 hours 45 minuts a day”?

Reject Ho Reject Ho Accept Ho Accept Ho

Set up the null and alternative hypotheses: • Ho, says that the population parameter is equal to some claimed number. • Three possible alternative hypotheses exist; choose the one that's most relevant. • Ha: The population parameter is not equal (≠) to the claimed number. • Ha: The population parameter is less than (<) the claimed number. • Ha: The population parameter is greater than (>) the claimed number. • Take a random sample, calculate the sample statistic. • Convert the sample statistic to a test statistic: • Take your sample statistic minus the number in the null hypothesis and divide by the standard error of your statistic. This changes the distance to standard units. • Find the p-value for your test statistic. • If Ha contains a less-than alternative, find the percentile in appropriate table (Z, t) that corresponds to your test statistic. • If Ha contains a greater-than alternative, find the percentile in appropriate table (Z, t) that corresponds to your test statistic, and then take 100% minus that percentile. • Double this percentage if (and only if) Ha is the not-equal-to alternative. • Change the percentage to a probability by dividing by 100, this is your p-value. • Examine your p-value and make your decision. • Smaller p-values show more evidence against Ho. Reject the claim. • Larger p-values show more evidence for Ho. Your sample supports the claim. • Cutoff point (α level) between accpet/reject Ho, usually 0.05.

Errors in testing • Type-1 • You reject Ho, while you shouldn’t. • False positive • Type-2 • You do not reject Ho, while you should. • False negative

The chance making a Type I error is α. • The chance making a Type II error depends mainly on the sample size. • If you have more data, you’re less likely to miss something that’s going on. • However, large sample increases the chance of Type I error. • Type I and Type II errors sit on opposite ends of a seesaw - as one goes up, the other goes down. • To try to meet in the middle, choose a large sample size and a small α level (0.05 or less) for your hypothesis test.

New Stuff

Correlation and linear regression

Find a cricket, count the number of its chirps in 15 seconds, add 37, you have just approximated the outside temperature in degrees Fahrenheit. • National Service Weather Forecast Office: http://www.srh.noaa.gov/epz/?n=wxcalc_cricketconvert

Each observation is composed of two variables – bivariate data. • First step in making sense of data is to draw them – scatterplot. • this is positive linear • relationship (proportional) • opposite is negative linear • relationship (inverse) • We say that data are • correlated. from Statistics for Dummies

Looking at the scatterplot leads to the conclusion that there appear to be a linear positive relationship (as the chirps increase, temperature increases). • These two variables are associated. • But you must be very careful if you want to interpret the association as the causality. • Look at the following example.

There is a strong relationship between the ice cream consumption and the crime rate. • How could this be true? • The two variables must have something in common with one another. It must be something that relates to both level of ice cream consumption and level of crime rate. Can you guess what that is? • Outside temperature. from causeweb.org

If you stop selling ice cream, does the crime rate drop? What do you think? • That’s because of the simple principle that correlations express the association that exists between two or more variables; they have nothing to do with causality. • In other words, just because level of ice cream consumption and crime rate increase/descrease together does not mean that a change in one necessarily results in a change in the other. • You can’t interpret associations as being causal.

In ice cream example, there exist a variable (outside temperature) we did not realize to control. • Such variable is called third variable, confounding variable, lurking variable. • The methodologies of scientific studies therefore need to control for these factors to avoid a type I error ('false positive‘) conclusion that the dependent variables are in a causal relationship with the independent variable. • Let’s have a look at dependence of murder rate on temperature.

from http://www-personal.umich.edu/~bbushman/BWA05a.pdf Journal of Personality and Social Psychology, 2005, Vol. 89, No. 1, 62–66

Quantitative measure of correlation is a correlation coefficient. It reflects the amount of variability that is shared between two variables and what they have in common. • It ranges between a value of –1 and a value of +1. • A correlation between two continuous variables is described by the Pearson correlation coefficient rxy

covariance divide by n-1 for sample but by n for population

Covariance • Watch explanation video. http://www.youtube.com/watch?v=35NWFr53cgA • Covariance and correlation are related parameters that indicate the extent to which two random variables co-vary. • Covariance matrix is a matrix whose element in the i, j position is the covariance between the ith and jth elements of a random vector.

Covariance matrix • Standardize a data matrix A. • What does this mean? • Subtract mean, divide by standard deviation. You get matrix B. • Covariance matrix is then • What are the properties of covariance matrix? • square, symmetric, positive definite, real eigenvalues and eigenvectors random vector with 2 elements 3 observations of a random vector

Back to the correlation coefficient • The absolute value of the coefficient reflects the strength of the correlation. So, a correlation of –0.70 is stronger than a correlation 0.50. • One of the frequently made mistakes regarding correlation coefficients occurs when people assume that a direct or positive correlation is always stronger (i.e., “better”) than an indirect or negative correlation because of the sign and nothing else.

However, there is a more precise way to interpret the correlation coefficient: calculate the coefficient of determination. • .8 to 1.0 Very strong relationship • .6 to .8 Strong relationship • .4 to .6 Moderate relationship • .2 to .4 Weak relationship • .0 to .2 Weak or no relationship • Correlation matrix

Coefficient of determination • Percentage of variance in one variable that is accounted for by the variance in the other variable. • It is calculated simply as r2. r2 = 0 r2 = 0.25 r2 = 0.81 from http://www.sagepub.com/upm-data/11894_Chapter_5.pdf

Correlation analysis investigates the relationships between variables using graphs or correlation coefficients. • Regression analysis answers the questions like: which relationship exists between variables X and Y (linear, quadratic ,….), is it possible to predict Y using X, and with what error?

Simple linear regression • also single linear regression (jednoduchá lineární regrese) • one y(dependent variable, závisle proměnná), one x (independent variable, nezávisle proměnná) • y= a + bx • a – y-intercept (constant), b – slope • y is estimated value, so to distinguish it from the actual value y corresponding to the given x statisticans use y ^ ^ ^

Data set • Students in higher grades carry more textbooks. • Weight of the textbooks depends on the weight of the student.

strong positive correlation, r = 0.926 outlier from Intermediate Statistics for Dummies

Build a model • Find a straight line y = a + bx from Intermediate Statistics for Dummies

y-intercept (3.69 in our case) • it may or may not have a practical meaning • Does it fall within actual values in the data set? If yes, it is a clue it may have a practical meaning. • Does it fall within negative territory where negative y-value are not possible? (e.g. weights can’t be negative) • Does a value x = 0 have practical meaning (student weighting 0)? • However, even if it has no meaning, it may be necessary (i.e. significantly different from zero)! • slope • change in y due to one-unit increase in x (i.e. if student’s weight increases by 1 pound, its textbook’s weight increases by 0.113 pounds) • now you can use regression line to estimate y value for new x

Last lecture summary