Statistics for the Physical Sciences STAT 229

Statistics for the Physical Sciences STAT 229 Chapter 1 Statistics: The Art and Science of Learning from Data

Homework 1 • Problems 1.1 to 1.36 (even numbered) • Complete the survey on pages 22-23 Fall 2008 STAT 229

1.1 Overview • Statistics is the art and science of learning from data. It is a collection of methods for • Planning experiments (Design) • Obtaining data (data are collected observations, such as measurements and survey responses) • Organizing data • Summarizing data (Description) • Analyzing data • Interpreting results, and • Making decisions and predictions (Inference) • Statistics is a branch of Mathematics

Statistics was invented for studying randomness: a lack of order, purpose, cause, or predictability (per Wikipedia), without which the world would be of no interest. • Examples of random phenomena: • Phelps won 8 gold medals • A 6-sided die is rolled and lands on 4 • It’s going to rain tomorrow • Randomness, Fuzziness and Uncertainty • Randomness creates uncertainty. On the other hand, randomness can be put to use. When estimating the proportion of adults in the USA who smoke, we can survey 1000 adults and use the survey responses as our data. How is randomness used? Why use it?

1.2 We Learn about Populations Using Samples • In the previous example, all US adults form a population while the 1000 surveyed adults form a sample. • In general, a population is the complete collection of all items to be studied. These items can be human subjects, animals, machines, even scores. • A sample is a sub-collection of items selected from a population.

More about Samples • A sample should represent the underlying population. Therefore, sample data must be collected in an appropriate way, such as through a process of random selection. • How large should a sample be? • What are appropriate ways to generate a sample? • Methods for summarizing sample data are referred to as descriptive statistics, while methods for making decisions or predictions about a population based on sample data are called inferential statistics.

Parameter and Statistic • A parameter is a numeric summary of the population. • A statistic is a numeric summary of a sample taken from the population.

Problem: Number of Good Friends One year the General Social Survey asked, “About how many good friends do you have?” Of the 819 people who responded, 6% reported having only one good friend. Identify (a) the sample (b) the population, and (c) the parameter or statistic

Try Problem 1.3 on page 8 of the textbook. Go to the General Social Survey website http://sda.berkeley.edu/GSS By entering HEAVEN as the “row variable” name, find the percentages of people who said “yes, definitely,” “yes, probably,” “no, probably not,” and “no, definitely not” when asked whether they believed in heaven.

1.3 What Role Do Computers Play in Statistics? • Save (large) data files • Create databases • Do analysis with software: SAS, Minitab, SPSS, R, S-PLUS, C, MATLAB, Excel, ... • Simulation: use of computers to mimic reality.

Simulation of Coin Tossing in Microsoft Excel NOTES: 1. Pseudo-random numbers are numbers generated by a computer algorithm to simulate real random numbers. 2. Excel has an Analysis ToolPak with which one can do statistical analysis, including simulation.

Tasks: When a balanced coin is tossed 20 times, we have a sequence of 20 Heads or Tails. Let 1 denote Heads and 0 denote Tails. Then a sample is a sequence of 1s and 0s. The empirical probability or sample proportion of tossing Heads (1) is computed as the number of 1’s divided by the total number of tosses. The coin-tossing process can be simulated using a Bernoulli distribution with proportion p = 0.5. 1. Simulate 5 random samples, each consisting of 10 pseudo-random numbers from a Bernoulli(0.5) distribution. Repeat the process using 1000 pseudo-random numbers per sample. 2. Compute the sample proportion for each of the 10 samples (five of size 10 and five of size 1000).

Simulation Follow this: Tools > Data Analysis > Random Number Generation > Bernoulli More questions: • Where does randomness play a role? • Is the amount of variability from sample to sample of size 10 bigger than the amount of variability from sample to sample of size 1000? • Comment on the effect of sample size.
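The same simulation can be sketched outside Excel. The following is a minimal sketch in Python, not the course's procedure: it assumes the standard `random` module as the pseudo-random number generator, and the names `bernoulli_sample` and `sample_proportion` and the seed value are illustrative choices.

```python
import random

def bernoulli_sample(n, p=0.5, rng=random):
    """Return n pseudo-random 0/1 values; 1 (Heads) occurs with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sample_proportion(sample):
    """Number of 1s divided by the total number of tosses."""
    return sum(sample) / len(sample)

rng = random.Random(229)  # arbitrary seed (an assumption) so the run is repeatable

for size in (10, 1000):
    # Five samples of the given size, one sample proportion per sample
    proportions = [sample_proportion(bernoulli_sample(size, 0.5, rng))
                   for _ in range(5)]
    spread = max(proportions) - min(proportions)
    print(f"n = {size:4d}: proportions = {proportions}, range = {spread:.3f}")
```

Comparing the printed range for n = 10 with the range for n = 1000 is one way to approach the question about sample-to-sample variability.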

If You Are Using Excel 2007… • Excel 2007 no longer has a Tools menu. • To use the Analysis ToolPak, click the Office button at the upper left corner, click Excel Options, then click Add-ins and highlight Analysis ToolPak. Click the Go button to open the Add-ins window. Check the Analysis ToolPak box and click OK. • Now go to the Data tab, click Data Analysis, and choose Random Number Generation.

Statistics for the Physical Sciences STAT 229 Chapter 2 Exploring Data with Graphs and Numerical Summaries

Homework #2 • 2-1 (p29): Problems 2.2, 2.4, 2.6, 2.8 • 2-2 (p44): Problems 2.10, 2.12, 2.14, 2.16, 2.22 • 2-3 (p55): Problems 2.30, 2.32, 2.34, 2.36, 2.42, 2.44 • 2-4 (p64): Problems 2.48, 2.52, 2.56, 2.58, 2.60 • 2-5 (p73): Problems 2.64, 2.66, 2.68, 2.72, 2.74, 2.78, 2.80, 2.82 • 2-6 (p80): Problems 2.84

2.1 What Are the Types of Data? • A characteristic observed for the subjects in a study is called a variable. • Examples of variables: major, GPA, religious affiliation, smoking status,... • Variables can be quantitative (numerical) or qualitative (categorical). • A variable is quantitative if its numerical values represent different magnitudes of the variable, such as weight or GPA. A variable is categorical if its value represents a category, such as major or letter grade.

Quantitative variables can be discrete or continuous. • A discrete variable is usually a count, such as the number of car accidents last year, while a continuous variable is a measurement, such as distance. • The reason we care whether a variable is quantitative, categorical, discrete, or continuous is that the method used to analyze a data set depends on the type of variable the data represent.

Key Features of a Variable • A quantitative variable usually takes different values in a study. Studying the spread (variability) of such a variable is one of the most important tasks in statistics. Another feature of a quantitative variable is the center of all its possible values. • For a categorical variable, a key feature to describe is the relative number of items (percentage) in the various categories.

Frequency Tables • For a categorical variable, counting how often each possible value is taken by the variable is a critical first step in descriptive statistics. The results are summarized in a frequency table. • The following table shows the frequency of shark attacks in various regions for 1990-2006.

Frequency of shark attacks in various regions for 1990-2006 Questions: What is the variable? Is it categorical? The mode of categorical data is the category with the highest frequency. Find the mode of the data.

Frequency Tables (cont’d) • In the table above, the proportions and percentages are also called relative frequencies. A table like this is called a frequency table. • A frequency table is a listing of possible values for a variable, together with the number of observations for each value. • For a quantitative variable, a frequency table is constructed by first dividing the range of the data into a set of adjacent intervals, then finding the frequency for each interval.

Frequency Table for Daily TV Watching

Example Construct a frequency table for quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6
Score     Frequency  Proportion  Percentage
[0,2]     1          0.05        5
(2,4]     1          0.05        5
(4,6]     7          0.35        35
(6,8]     8          0.40        40
(8,10]    3          0.15        15
Total     20         1.00        100
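The counting behind the table above can be sketched in Python. This is a minimal sketch: the helper name `frequency_table` is hypothetical, and it assumes, as in the table, that each interval includes its right endpoint and that the first interval also includes 0.

```python
scores = [5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6]
edges = [0, 2, 4, 6, 8, 10]  # interval endpoints taken from the table

def frequency_table(data, edges):
    """Return (label, frequency, proportion) for each interval."""
    rows = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        if i == 0:
            # First interval is closed on both sides: [lo, hi]
            count = sum(lo <= x <= hi for x in data)
            label = f"[{lo},{hi}]"
        else:
            # Later intervals exclude the left endpoint: (lo, hi]
            count = sum(lo < x <= hi for x in data)
            label = f"({lo},{hi}]"
        rows.append((label, count, count / len(data)))
    return rows

table = frequency_table(scores, edges)
for label, count, prop in table:
    print(f"{label:8s} {count:9d} {prop:10.2f} {100 * prop:10.0f}")
```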

2.2 How Can We Describe Data Using Graphical Summaries? Preliminary results of the election for the European Parliament in 2004

Pie Charts and Bar Graphs for Categorical Variables • Pie chart: A circle having a “slice of a pie” for each category. The size of each slice corresponds to the percentage of observations in the category. • Bar graph: Displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.

Example: Use the shark attack data from this source link to construct a pie chart of interest.

Bar Graph for European Parliament in 2004

Pareto Chart: Bar Graph with Categories Ordered by Their Frequency, from the Tallest Bar to the Shortest

Graphs for Quantitative Variables • Dot plots: Show a dot for each observation, placed just above the value on the number line for that observation. • Stem-and-leaf plots: Similar to dot plots. Each observation is represented by a stem and a leaf. • Histogram: A graph that uses bars to portray the frequencies or relative frequencies.

Graphs for Quantitative Variables Example Dot plot Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6 [Dot plot: number line marked from 1 to 10]
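A dot plot of these quiz scores can be approximated in plain text. The following is a minimal sketch in Python; the function name `dot_plot_lines` is hypothetical and `*` stands in for a dot.

```python
from collections import Counter

scores = [5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6]
counts = Counter(scores)  # how many times each score occurs

def dot_plot_lines(counts, low, high):
    """Build a text dot plot: one '*' per observation, stacked above its value."""
    tallest = max(counts.values())
    lines = []
    for level in range(tallest, 0, -1):  # draw from the top row down
        row = ["*" if counts.get(v, 0) >= level else " "
               for v in range(low, high + 1)]
        lines.append("  ".join(row).rstrip())
    # Number line underneath the dots
    lines.append("  ".join(str(v) for v in range(low, high + 1)))
    return lines

lines = dot_plot_lines(counts, 1, 10)
print("\n".join(lines))
```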

Graphs for Quantitative Variables Example Stem-and-Leaf Plot Test scores for 12 students: 80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76 Step 1: Sort the test scores: 45, 62, 74, 75, 76, 76, 80, 84, 87, 87, 96, 100 Step 2: Place the scores in the corresponding stems and leaves (usually the last digit is the leaf).
Stem  Leaves
 4    5
 5
 6    2
 7    4 5 6 6
 8    0 4 7 7
 9    6
10    0
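The two steps above can be sketched in Python. This is a minimal sketch that assumes non-negative whole-number scores and uses the last digit as the leaf; the function name `stem_and_leaf` is hypothetical.

```python
def stem_and_leaf(data):
    """Map each stem (all digits but the last) to its sorted leaves (last digits)."""
    ordered = sorted(data)                      # Step 1: sort the data
    stems = range(ordered[0] // 10, ordered[-1] // 10 + 1)
    plot = {stem: [] for stem in stems}         # keep stems that have no leaves
    for value in ordered:                       # Step 2: place each leaf on its stem
        plot[value // 10].append(value % 10)
    return plot

test_scores = [80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76]
plot = stem_and_leaf(test_scores)
for stem, leaves in plot.items():
    print(f"{stem:2d} | {' '.join(str(leaf) for leaf in leaves)}")
```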

Graphs for Quantitative Variables Histogram Step 1: Divide the range of the data into intervals of equal width. Step 2: Count the frequency in each interval and construct a frequency table (or relative frequency table). Step 3: Label the endpoints of the intervals on the x-axis. Draw a bar over each interval with height equal to its frequency (or relative frequency), the values of which are marked on the y-axis.

Graphs for Quantitative Variables Example Histogram Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6
Score     Freq
[0,2]     1
(2,4]     1
(4,6]     7
(6,8]     8
(8,10]    3
(10,12]   0
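Histogram counts depend on which endpoint each interval includes. The following minimal Python sketch counts the quiz scores under both conventions and prints a text histogram; the function name `bin_counts` and its `closed` parameter are hypothetical.

```python
scores = [5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6]
edges = [0, 2, 4, 6, 8, 10, 12]  # equal-width intervals

def bin_counts(data, edges, closed="right"):
    """Count observations per interval; 'closed' names the included endpoint."""
    counts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if closed == "right":
            n = sum(lo < x <= hi for x in data)   # interval (lo, hi]
        else:
            n = sum(lo <= x < hi for x in data)   # interval [lo, hi)
        counts.append(n)
    return counts

right = bin_counts(scores, edges, closed="right")
left = bin_counts(scores, edges, closed="left")

# Text histogram for the right-closed intervals: one '#' per observation
for lo, hi, n in zip(edges[:-1], edges[1:], right):
    print(f"({lo:2d},{hi:2d}] {'#' * n} {n}")
```

For these scores, right-closed intervals give the counts 1, 1, 7, 8, 3, 0, while left-closed intervals give 1, 1, 2, 10, 5, 1, so the interval convention should be stated alongside the counts.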

The Shape of a Distribution • When looking at a graph of quantitative data (dot plot, stem-and-leaf plot, and histogram), look for • the overall pattern: Do the data cluster together? • the outliers • modes: unimodal, bimodal,… • skew: skewed to the left or right • the underlying smooth curve

Unimodal Bimodal Multimodal

outliers

These Two Histograms Show Differences in Spread

Time plots • Time series: a data set collected over time. • Time plot: a graph displaying time-series data. • Look for patterns over time.

Time plots: Example Gasoline price

2.3 How can we describe the center of quantitative data? • Measures of center: mean and median • Mean: the sum of the observations divided by the number of observations. • Median: The midpoint of the observations.

Mean Formula
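In symbols, the verbal definition of the mean (the sum of the observations divided by the number of observations) can be written as follows. The notation is an assumption, chosen to match common textbook usage: $x_1, \dots, x_n$ for the $n$ observations and $\bar{x}$ for their mean.

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```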

Example Travel times to work • How long does it take to get from home to work? • Here are the travel times in minutes for 15 workers in North Carolina, chosen at random by the Census Bureau: • 30 20 10 40 25 20 10 60 15 40 5 30 12 10 10 • Find the mean travel time.

How to Determine the Median Step 1: Sort your data from the smallest to the largest. Step 2: If n, the number of data points, is odd, the median is the middle value; if n is even, the median is the average of the middle two values.

Example Find median for the travel times 30 20 10 40 25 20 10 60 15 40 5 30 12 10 10 Arrange the data in order: 5 10 10 10 10 12 15 20 20 25 30 30 40 40 60 Since n = 15 is odd, Median = 20, the middle value. Example Find the median for the scores 60 80 87 73 95 92 Arrange the data in order: 60 73 80 87 92 95 Since n = 6 is even, Median = (80 + 87)/2 = 83.5, the average of the two middle values.
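The two median examples above, together with the mean of the travel times, can be checked with a short Python sketch. The function names `mean` and `median` are illustrative; the median follows the sort-then-pick-the-middle steps described earlier.

```python
def mean(data):
    """Sum of the observations divided by the number of observations."""
    return sum(data) / len(data)

def median(data):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(data)          # Step 1: sort from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # Step 2, n odd: the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: average the middle two

travel_times = [30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10]
exam_scores = [60, 80, 87, 73, 95, 92]

print(round(mean(travel_times), 2))  # 337 / 15, prints 22.47
print(median(travel_times))          # prints 20
print(median(exam_scores))           # prints 83.5
```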

Properties of the mean and the median • The mean is the balance point of the data. • In a symmetric distribution, the mean and median are the same. • In a skewed distribution, the mean is usually farther out in the long tail than the median. • Skewed to the right, mean > median • Skewed to the left, mean < median • The mean is less resistant to outliers than the median.
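The sensitivity of the mean to outliers can be illustrated with the travel-time data. In this minimal Python sketch, the value 600 is a made-up outlier introduced purely for illustration (it is not part of the data), and the standard `statistics` module does the computation.

```python
import statistics

travel_times = [30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10]

# Hypothetical change: replace the largest time (60) with an extreme value (600)
with_outlier = [600 if t == 60 else t for t in travel_times]

mean_before = statistics.mean(travel_times)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(travel_times)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

The mean moves from about 22.47 to about 58.47, while the median stays at 20. The original data are also skewed to the right, which agrees with mean > median.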

Mean, Median, and Mode • The mean is the balance point. • The median is the midpoint. • The mode is the value that occurs most frequently.

Mean and Median: Applications • City data • St Cloud, MN • New Orleans, LA
