Lecture 1 Sections 1.1 – 1.2

Objectives:
• Populations, Samples and Processes
• Visual Display for Univariate Data
  • Numerical Variables
    • Stem-and-Leaf Displays
    • Dotplots
    • Histograms
  • Categorical Variables
    • Bar Chart
    • Pie Chart
• Introduction to R

Branches of Statistics
• Descriptive Statistics
  • Exploratory Data Analysis (EDA)
  • Chapters 1 – 3
  • Used to summarize and describe important features in the data, either graphically or numerically.
• Inferential Statistics
  • Involves techniques for generalizing from a sample to a population.
  • Chapters 7 – 8.

Population vs. Sample
• Population: The entire group of individuals in which we are interested but which we usually cannot assess directly.
  • E.g. all individuals who received a B.S. in engineering in 2011.
• Sample: The part of the population we actually examine and for which we do have data.
  • The sample is selected in some prescribed manner.
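As a minimal sketch of the distinction, the R code below builds a hypothetical population of 5000 graduate ages and draws a simple random sample of 50 with sample(). Simple random sampling is only one possible "prescribed manner"; the simulated ages, the population size, the sample size and the seed are all assumptions made for illustration, not data from the lecture.

```r
set.seed(1)  # assumed seed, for reproducibility only

# Hypothetical population: ages of 5000 engineering graduates (simulated values)
population <- sample(21:35, size = 5000, replace = TRUE)

# Simple random sample of 50 individuals, drawn without replacement
ages_sample <- sample(population, size = 50)

length(population)    # population size
length(ages_sample)   # sample size
mean(population)      # usually unknown in practice
mean(ages_sample)     # computed from the data we actually have
```

In practice only the sample is observed; the population mean is shown here only because the population was simulated.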

Variables
We are usually interested only in certain characteristics of the objects in a population, e.g. the age or the gender of an engineering graduate. A characteristic may be categorical (e.g. gender) or numerical (e.g. age). A variable is any characteristic whose value may change from one object to another in the population. We will use lowercase letters to denote variables.
Example: x = age of a graduating engineer; y = braking distance of an automobile under specified conditions.

Discrete or Continuous
• A variable is discrete if its set of possible values is either finite or can be listed in an infinite sequence.
  • E.g. x takes values 0, 1, 2, 3, ….
• A variable is continuous if its possible values consist of an entire interval on the number line.
  • E.g. x is the pH of a chemical substance. x can take values like 7.0, 7.03, 7.032, etc.

Univariate vs. Multivariate
• A univariate data set consists of observations on a single variable.
  • E.g. type of transmission.
• A bivariate data set consists of observations on two variables.
  • E.g. a (height, weight) pair for each basketball player.
• A multivariate data set consists of observations on more than two variables.
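The definitions above can be sketched in R with a small data frame. The three players and all of their values are made up for illustration, and the column names are hypothetical.

```r
# Hypothetical observations on three basketball players (made-up values)
players <- data.frame(
  position = c("guard", "center", "forward"),  # categorical variable
  height   = c(75, 83, 79),                    # numerical variable, inches
  weight   = c(190, 250, 220)                  # numerical variable, pounds
)

univariate <- players$height                    # observations on one variable
bivariate  <- players[, c("height", "weight")]  # (height, weight) pairs

ncol(bivariate)              # number of variables in the bivariate data set
sapply(players, is.numeric)  # which variables are numerical
```

The full data frame, with three variables per player, is a multivariate data set.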

Example 1.1 (pg. 4)
The tragedy that befell the space shuttle Challenger and its astronauts in 1986 led to a number of studies to investigate the reasons for mission failure. Attention quickly focused on the behavior of the rocket engine’s O-rings. Here is data consisting of observations on x = O-ring temperature (°F) for each test firing or actual launch of the shuttle rocket engine (Presidential Commission on the Space Shuttle Challenger Accident, Vol. 1, 1986: 129-131).
84 49 61 40 83 67 45 66 70 69 80 58
68 60 67 72 73 70 57 63 70 78 52 67
53 67 75 61 70 81 76 79 75 76 58 31
Without any organization, it is very difficult to get a sense of what a typical or representative temperature might be, whether the values are highly concentrated about a typical value or quite spread out, whether there are any gaps in the data, what percentage of the values are in the 60s, and so on.
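Even a small amount of organization answers some of these questions. The sketch below sorts the 36 temperatures and computes the percentage of values in the 60s; the variable name in_60s is a hypothetical helper introduced here.

```r
# O-ring temperatures (°F) from Example 1.1
x <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
       68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
       53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

sort(x)                        # ordered values make gaps and clusters visible
length(x)                      # sample size: 36

in_60s <- x >= 60 & x < 70     # TRUE for temperatures in the 60s
sum(in_60s)                    # 11 observations
round(100 * mean(in_60s), 1)   # 30.6 percent of the values
```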

How to examine a distribution?
The distribution of a variable tells us what values the variable takes and how often it takes these values.
1. Almost always plot the data as a preliminary analysis.
2. Look for the overall pattern:
  • Shape
  • Location
  • Spread
3. Look for striking deviations from the overall pattern:
  • Outliers
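Alongside a plot, a few numerical summaries give a first impression of location and spread. The sketch below applies base R functions to the O-ring temperatures of Example 1.1; using the median for location and the range for spread is one simple choice assumed here, not the only one.

```r
# O-ring temperatures (°F) from Example 1.1
x <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
       68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
       53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

median(x)         # location: 67.5
range(x)          # spread: smallest and largest values, 31 and 84
summary(x)        # minimum, quartiles, mean and maximum in one call
head(sort(x), 3)  # three smallest values: 31 40 45
```

The smallest value, 31, lies 9 degrees below the next smallest, which makes it a candidate for closer inspection as a possible outlier.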

Stem-and-Leaf Plot
How to make a stemplot:
1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Stem | Leaf
   3 | 1
   4 | 059
   5 | 23788
   6 | 01136777789
   7 | 000023556689
   8 | 0134
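The steps above can be sketched by hand in R for the two-digit O-ring temperatures of Example 1.1. This is an illustration of the construction, not the algorithm used by R's built-in stem(), which may choose a different scale; the names stems, leaves and display are hypothetical.

```r
# O-ring temperatures (°F) from Example 1.1
x <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
       68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
       53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

stems  <- x %/% 10   # all but the final digit
leaves <- x %% 10    # the final digit

# For each stem, sort its leaves and paste them into one string
display <- sapply(split(leaves, stems),
                  function(l) paste(sort(l), collapse = ""))

# Print one row per stem, smallest stem at the top
for (s in names(display)) cat(s, "|", display[s], "\n")
```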

Histogram
The range of values that a variable can take is divided into equal-size intervals. The histogram shows the number of individual data points that fall in each interval.
Histogram Shapes
• Unimodal, bimodal or multimodal
• Symmetric, positively skewed or negatively skewed
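As a sketch of how the interval counts arise, the code below bins the O-ring temperatures of Example 1.1 into intervals of width 10. The break points 30, 40, …, 90 and the choice right = FALSE (so that a value such as 70 is counted in the interval starting at 70) are assumptions made for this illustration.

```r
# O-ring temperatures (°F) from Example 1.1
x <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
       68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
       53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

breaks <- seq(30, 90, by = 10)   # equal-size intervals of width 10

# Draw the histogram and keep the returned object to inspect the counts
h <- hist(x, breaks = breaks, right = FALSE,
          main = "Histogram of Temperature", xlab = "Temperature (deg F)")

h$counts   # number of observations per interval: 1 3 5 11 12 4
```

The counts rise to a single peak in the 60s and 70s and trail off further toward the low temperatures, which suggests a unimodal, negatively skewed shape.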

Categorical Data
Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).
• Bar graphs: Each category is represented by a bar.
  • A Pareto diagram is a bar chart, typically from a quality control study, in which the categories are ordered by decreasing frequency.
• Pie charts: The slices must represent the parts of one whole.
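Bar and pie charts are drawn from category counts, so raw categorical observations first need to be tabulated. The sketch below does this with table() for eight made-up transmission types; the data values are hypothetical.

```r
# Hypothetical raw observations of transmission type (made-up values)
transmission <- c("automatic", "manual", "automatic", "automatic",
                  "manual", "automatic", "dual-clutch", "automatic")

counts <- table(transmission)   # frequency of each category
counts

props <- prop.table(counts)     # relative frequencies: parts of one whole
sum(props)                      # the parts add up to 1

barplot(counts, ylab = "Frequency")   # one bar per category
pie(counts)                           # one slice per category
```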

Some examples in R
Example 1: Space Shuttle Challenger Accident Data
84 49 61 40 83 67 45 66 70 69 80 58
68 60 67 72 73 70 57 63 70 78 52 67
53 67 75 61 70 81 76 79 75 76 58 31

Stem-and-Leaf Graph:
    x <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
           68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
           53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)  # temperature data
    length(x)  # the sample size
    stem(x)    # stem-and-leaf plot

Histogram:
    hist(x, main = "Histogram of Temperature", xlab = "Temperature")  # histogram

Example 2
In the manufacture of printed circuit boards, finished boards are subjected to a final inspection before they are shipped to customers. Here is data on the type of defect for each board rejected at final inspection during a particular time period.

Type of defect               Frequency
Low copper plating           112
Poor electroless coverage    35
Lamination problems          10
Plating separation           8
Etching problems             5
Miscellaneous                12

    defect <- c("Low copper plating", "Poor electroless coverage",
                "Lamination problems", "Plating separation",
                "Etching problems", "Miscellaneous")  # type of defect
    frequency <- c(112, 35, 10, 8, 5, 12)             # frequency

Bar Graph:
    barplot(frequency, names.arg = defect)  # barplot

Pie Chart:
    pie(frequency, labels = defect)  # pie chart
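The same defect data can be rearranged into a Pareto-style bar chart by sorting the categories by decreasing frequency. This is a minimal sketch; the plotting options las and cex.names are assumptions added only to keep the long labels readable.

```r
# Defect data from Example 2
defect <- c("Low copper plating", "Poor electroless coverage",
            "Lamination problems", "Plating separation",
            "Etching problems", "Miscellaneous")
frequency <- c(112, 35, 10, 8, 5, 12)
names(frequency) <- defect

total <- sum(frequency)             # 182 rejected boards in all
round(100 * frequency / total, 1)   # percentage of rejections per defect type

# Order categories from most to least frequent
ordered_freq <- sort(frequency, decreasing = TRUE)

# Bar chart in decreasing order; las = 2 turns the labels vertical
barplot(ordered_freq, las = 2, cex.names = 0.6, ylab = "Frequency")
```

Sorting makes it immediate that low copper plating accounts for well over half of the rejected boards (112 of 182).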
