Introduction to Applied Statistics Xiaobo Sheng
Overview
• CH 1: Introduction
• CH 2-3: Concepts, descriptive statistics of one variable
• CH 6-8: Probability, a few common probability distributions and models
• CH 9-13: Statistical inference
• CH 15: Linear regression
Introduction
• What is statistics?
• A collection of numerical information
• Or the branch of mathematics dealing with the theory and techniques of collecting, organizing, and interpreting numerical information. (We will focus on the first definition.)
Why do we need statistics?
• Pepsi vs. Coca-Cola
• Horse racing
• Casino games
How do we deal with statistics?
• Input: data set (a collection of information)
• Process: data analysis (making sense of a data set)
• Output: statistical inference (drawing conclusions about a population based on a sample from that population)
A few basic definitions you need to know
• Population: the group or collection of interest to us. It is usually very large and messy.
• Sample: a subset of the population, reasonably small and capable of being analyzed using statistical tools. We use the observations in the sample to learn about the population.
• Examples: the income of teachers, average age, etc.
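To make the population/sample distinction concrete, here is a minimal Python sketch. The "teacher incomes" are simulated numbers invented for illustration (they are not real data), and the sizes 10,000 and 100 are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical population: annual incomes of 10,000 teachers.
# The values are simulated for illustration only, not real data.
rng = random.Random(0)
population = [rng.gauss(55_000, 8_000) for _ in range(10_000)]

# A sample is a much smaller subset drawn from the population.
sample = rng.sample(population, 100)

population_mean = statistics.mean(population)  # usually unknown in practice
sample_mean = statistics.mean(sample)          # what we can actually compute

print(f"population mean: {population_mean:.0f}")
print(f"sample mean:     {sample_mean:.0f}")
```

The sample mean will typically be close to, but not exactly equal to, the population mean. Quantifying that gap is the job of statistical inference.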
• Descriptive statistic: a number used to summarize information in a set of data values. Which statistic is appropriate varies from problem to problem.
• Variable: a particular piece of information. Two types:
  • Quantitative variable: has numerical values that are measurements.
  • Categorical variable: has values that cannot be interpreted as numbers.
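• Mean: the average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (the sum of the values divided by the number of values).
• Median (50th percentile): divides an ordered list of values in half.
• Quartiles: divide an ordered list of values into 4 groups of equal or approximately equal size.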
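A short Python sketch of the mean and the median on a small made-up data set (the values are an assumption chosen for illustration):

```python
import statistics

values = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # made-up data

# Mean: sum of the values divided by the number of values.
mean = sum(values) / len(values)

# Median: middle value of the ordered list (50th percentile).
median = statistics.median(values)

print(f"mean = {mean:.2f}, median = {median}")
```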
• 1st quartile (25th percentile): at least three-fourths of the values are greater than or equal to the first quartile.
• 3rd quartile (75th percentile): at least three-fourths of the values are less than or equal to the third quartile.
Page 49
• Range: the difference between the largest and smallest values of a data set.
• Interquartile range: the difference between the 3rd and 1st quartiles.
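The quartiles, range, and interquartile range can be sketched in Python as follows. The data are made up, and note one assumption: several conventions exist for computing quartiles, and the `method="inclusive"` option used here may give slightly different values than the rule in the textbook.

```python
import statistics

values = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # made-up data

# Quartiles split the ordered values into 4 roughly equal groups.
# The "inclusive" method is one of several conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

data_range = max(values) - min(values)  # largest minus smallest value
iqr = q3 - q1                           # 3rd quartile minus 1st quartile

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"range = {data_range}, IQR = {iqr}")
```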
Standard deviation: used to measure the variation of values about the mean.
• σ: population standard deviation
• s: sample standard deviation
Page 82
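A minimal Python sketch of the two standard deviations, assuming the usual convention that the population version divides the sum of squared deviations by n and the sample version divides by n - 1. The data are made up.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
n = len(values)
mean = sum(values) / n

# Sum of squared deviations of each value from the mean.
squared_deviations = sum((x - mean) ** 2 for x in values)

sigma = math.sqrt(squared_deviations / n)    # population SD: divide by n
s = math.sqrt(squared_deviations / (n - 1))  # sample SD: divide by n - 1

# The standard library provides the same two quantities.
print(sigma, statistics.pstdev(values))
print(s, statistics.stdev(values))
```

The sample standard deviation is always at least as large as the population formula applied to the same values, because it divides by the smaller number n - 1.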
Lists, Tables, and Plots
• Data list: a listing of the values of a variable in a data set.
• Table: the values in a table are usually ordered or sorted by some criterion. If they are not, we can use Excel to sort them.
Plots • Dot Plot
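As a rough illustration, a dot plot can be sketched in plain text with Python. The helper `dot_plot` is hypothetical, the data are made up, and the plot is rotated: a conventional dot plot stacks dots above a horizontal number line, while this sketch prints one row per value.

```python
from collections import Counter

values = [1, 2, 2, 3, 3, 3, 4, 4, 6]  # made-up integer data

def dot_plot(data):
    """Return a text dot plot: one row per integer value, one dot per observation."""
    counts = Counter(data)
    lines = []
    for value in range(min(data), max(data) + 1):
        lines.append(f"{value:>3} | " + "." * counts[value])
    return "\n".join(lines)

print(dot_plot(values))
```

Values that never occur (here, 5) still get a row, so gaps in the data stay visible.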
Distribution
• A description of how the values of the variable are positioned along an axis or number line.
• Symmetric: the left and right sides of the distribution are approximately mirror images.
• Skewed to the left (negatively skewed): there is a concentration of relatively large values, with some scatter over a range of smaller values.
• Skewed to the right (positively skewed): there is a concentration of relatively small values, with some scatter over a range of larger values.
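One informal way to see skewness in numbers is to compare the mean with the median. This is a rule of thumb that often, but not always, holds, and it is not stated in the slides. The two data sets below are made up so that one is the mirror image of the other.

```python
import statistics

# Made-up data: most values are small, with a few large ones (right-skewed).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10, 17]

# Mirror image of the data above: most values are large (left-skewed).
left_skewed = [18 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean = {mean}, median = {median}")
```

In the right-skewed set the few large values pull the mean above the median. In the left-skewed set the few small values pull it below.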
Peak: a major concentration of values.
• Unimodal: the distribution has one major peak
• Bimodal: the distribution has two major peaks
• Multimodal: the distribution has several major peaks
CH 4
• Scatterplot: a two-dimensional graphical display of two quantitative variables.
Transformation of a variable: a mathematical manipulation of each value of the variable.
• Logarithmic transformation (the most common one)
• Square root transformation
• Power transformation
Logarithmic transformation: take the logarithm of each value of the variable.
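The three transformations can be sketched in Python as below. The data are made up, the base-10 logarithm and the power 2 are arbitrary choices for illustration, and the logarithm requires every value to be positive.

```python
import math

values = [1, 4, 9, 100, 10_000]  # made-up positive data

# Each transformation is applied to every value of the variable.
log_values = [math.log10(x) for x in values]  # logarithmic (base 10 assumed)
sqrt_values = [math.sqrt(x) for x in values]  # square root
squared_values = [x ** 2 for x in values]     # power (exponent 2 assumed)

print(log_values)
print(sqrt_values)
print(squared_values)
```

The logarithm pulls large values in much more strongly than small ones, which is why it is often used on right-skewed variables.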
CH 15 Correlation, Regression
• Study the relationship between quantitative variables
• Linear correlation coefficient
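Mathematical Notation
For $n$ paired observations $(x_i, y_i)$ with sample means $\bar{x}$, $\bar{y}$ and sample standard deviations $s_x$, $s_y$:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) \qquad (1)$$
Another form
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}} \qquad (2)$$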
Formal Definition
Correlation coefficient (Pearson's correlation coefficient): a measure of linear association between two quantitative variables.
• r has no units and takes values from -1 to 1.
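A minimal Python sketch of Pearson's r, written in two standard and algebraically equivalent ways. The function names and the data are hypothetical, and the notation may differ from the textbook's.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson's r from standardized values, using the divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

def pearson_r_sums(xs, ys):
    """Equivalent form using sums of squares and cross-products."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]

print(pearson_r(x, y), pearson_r_sums(x, y))
```

Both functions return the same value up to rounding. Because the units cancel in each ratio, r is unitless, and points lying exactly on an upward or downward line give r = 1 or r = -1.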
A correlation coefficient near 0 suggests there is little or no linear association between the two variables.
What exactly does the correlation coefficient measure?
• It measures the extent of clustering of plotted points about a straight line.
• A correlation coefficient that is large in absolute value suggests a strong linear association between the two variables.
• A correlation coefficient that is near zero suggests little linear association between the two variables.
Can the correlation coefficient be misleading?
• Yes. We should always plot the two quantitative variables first to get a visual feel for their relationship. Then we can use the correlation coefficient to supplement the plot.
In this example, r is 0.66 for life expectancy versus per capita gross national product. By itself, this correlation coefficient might suggest a linear association between these two variables, but the plot suggests a curved relationship. A stronger linear relationship exists between life expectancy and the logarithm of per capita gross national product (r = 0.84).
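The effect of a logarithmic transformation on r can be sketched in Python with synthetic data. The numbers below are invented so that y is exactly linear in log10(x); they are not the life expectancy and GNP data from the example, and the helper `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r using sums of squares and cross-products."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Synthetic data: y rises quickly at first, then levels off (a curved pattern).
x = [100, 300, 1_000, 3_000, 10_000, 30_000]
y = [40 + 8 * math.log10(v) for v in x]

r_raw = pearson_r(x, y)                           # y versus x
r_log = pearson_r([math.log10(v) for v in x], y)  # y versus log10(x)

print(f"r with x:        {r_raw:.2f}")
print(f"r with log10(x): {r_log:.2f}")
```

The relationship between x and y is perfectly predictable, yet r on the raw scale is well below 1 because the pattern is curved. After the transformation the points fall on a straight line and r rises to 1.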
Outlier • An observation that is far from the other observations.
Definition of Linear Regression
• Simple linear regression refers to fitting a straight-line model by the method of least squares and then assessing the model.
Application:
• Describe the relationship between two quantitative variables
• Predict future values of one variable from the other
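A minimal Python sketch of fitting a straight line by least squares, using the standard closed-form slope and intercept. The function names and data are hypothetical, and the sketch covers only the fitting step, not the assessment of the model.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # the fitted line passes through (x_bar, y_bar)
    return intercept, slope

def predict(intercept, slope, x_new):
    """Predicted y for a new x value under the fitted line."""
    return intercept + slope * x_new

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
print(f"y = {intercept:.1f} + {slope:.1f} x")
print(f"prediction at x = 6: {predict(intercept, slope, 6):.1f}")
```

Predictions are most trustworthy for x values near the observed range. Predicting far outside it assumes the straight-line pattern continues.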