Introduction to Statistics: Collecting, Organizing, and Analyzing Data

INTRODUCTION

WHAT IS STATISTICS? • Statistics is a science of collecting data, organizing and describing it and drawing conclusions from it. That is, statistics is a way to get information from data.

WHAT IS STATISTICS? • A market analyst wants to know how effective a new diet is. • A pharmaceutical company wants to know whether a new drug is superior to existing drugs, or what side effects it may have. • How fuel efficient is a certain car model? • Is there any relationship between your GPA and your employment opportunities?

WHAT IS STATISTICS? • An ergonomic chair can be assembled using two different sets of operations. The operations manager would like to know whether the assembly time differs between the two methods. • What is the effect of market strategy on market share? • How should you pick the stocks to invest in?

STEPS OF STATISTICAL PRACTICE • Model building: Set clearly defined goals for the investigation • Data collection: Make a plan of what data to collect and how to collect it • Data analysis: Apply appropriate statistical methods to extract information from the data • Data interpretation: Interpret the information and draw conclusions

DESCRIPTIVE STATISTICS AND INFERENTIAL STATISTICS • Descriptive statistics includes the collection, presentation and description of numerical data. • Inferential statistics includes making inferences and decisions from the collected data by using appropriate statistical methods.

BASIC DEFINITIONS • POPULATION: The collection of all items of interest in a particular study. • PARAMETER: A descriptive measure of the population. • SAMPLE: A set of data drawn from the population. • STATISTIC: A descriptive measure of a sample. • VARIABLE: A characteristic of interest about each element of a population or sample.

EXAMPLE
Population | Unit | Sample | Variable
All students currently enrolled in school | Student | Any department | GPA; Hours of work per week
All books in library | Book | Statistics books | Replacement cost; Frequency of check out; Repair needs
All campus fast food restaurants | Restaurant | Burger King | Number of employees; Seating capacity; Hiring/Not hiring

EXAMPLE • Thousands of customers have accounts at a large department store. An accountant claims that the average unpaid balance for these accounts is $75, a figure obtained by computing the average of the unpaid balances for 50 of the accounts. • What is the population in this experiment? • What is the sample? • What is the parameter of interest? • Is the figure of $75 a parameter or a statistic? The population is the collection of all unpaid account balances at the store. The sample is the unpaid balances for the 50 accounts. The parameter is the average of the population balances. Because the figure $75 is the average of the sample of 50 unpaid balances, it is a statistic.
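
The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is simulated, because the store's real balances are not given, so every number it prints is illustrative only. The names `population` and `sample`, the 5,000 accounts, and the uniform range of 0 to 150 dollars are all assumptions made for this sketch.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: unpaid balances (in dollars) for 5,000 accounts.
# Simulated values for illustration only; they are not data from the example.
population = [round(random.uniform(0, 150), 2) for _ in range(5000)]

# Parameter: a descriptive measure of the whole population.
population_mean = statistics.mean(population)

# Sample: 50 accounts drawn at random, without replacement.
sample = random.sample(population, 50)

# Statistic: the same descriptive measure, computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

In practice only the sample mean is available, which is why it is used as an estimate of the unknown population mean.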

DESCRIPTIVE STATISTICS • Descriptive statistics involves the arrangement, summary, and presentation of data, to enable meaningful interpretation and to support decision making. • Descriptive statistics methods make use of graphical techniques and numerical descriptive measures. • The methods presented apply both to the entire population and to a sample from the population.

QUALITATIVE VS. QUANTITATIVE DATA • A qualitative variable is one in which the “true” or naturally occurring levels or categories taken by that variable are not described as numbers but rather by verbal groupings • Example: levels or categories of hair color (black, brown, blond) • Quantitative variables on the other hand are those in which the natural levels take on certain quantities (e.g. price, travel time) • That is, quantitative variables are measurable in some numerical unit (e.g. pesos, minutes, inches, etc.)

CONTINUOUS OR NON-CONTINUOUS VARIABLE • A continuous variable is one that can theoretically assume any value between the lowest and highest points on the scale on which it is being measured (e.g. speed, price, time, height). • Non-continuous variables, also known as discrete variables, can take on only a finite or countable number of values (e.g. number of employees). • All qualitative variables are discrete.

SCALES OF MEASUREMENT Scales of measurement describe the relationships between the characteristics of the numbers or levels assigned to objects under study. The four classificatory scales of measurement are nominal, ordinal, interval, and ratio.

NOMINAL SCALED DATA A nominal scaled variable is a variable in which the levels observed for that variable are assigned unique values: values which provide classification but which do not provide any indication of order. For example, we may assign the value zero to represent males and one to represent females, but we are not saying that females are better than males or vice versa. Nominal scaled variables are also termed categorical variables. Nominal data must be discrete. All mathematical operations (i.e. +, -, ÷, and ×) are meaningless.

ORDINAL SCALED DATA Ordinal scaled data are data in which the values assigned to levels observed for an object (1) are unique and (2) provide an indication of order. An example of this is the ranking of products in order of preference. The highest-ranked product is more preferred than the second-highest-ranked product, which in turn is more preferred than the third-ranked product, etc. While we may now place the objects of measure in some order, we cannot determine distances between the objects. For example, we might know that product A is preferred to product B; however, we do not know by how much product A is preferred to product B. For ordinal scales, again, +, -, ÷, and × are meaningless.

INTERVAL SCALED DATA Interval scaled data are data in which the levels of an object under study are assigned values which (1) are unique, (2) provide an indication of order, and (3) have an equal distance between scale points. The usual example is temperature (centigrade or Fahrenheit). In either scale, 41 degrees is higher than 40 degrees. However, zero degrees is an arbitrary figure: it does not represent an absolute absence of heat. We can add or subtract interval scale variables meaningfully, but ratios are not meaningful (that is, 40 degrees is not exactly twice as hot as 20 degrees).

RATIO SCALED DATA Ratio scaled data are data in which the values assigned to levels of an object are (1) unique, (2) provide an indication of order, (3) have an equal distance between scale points, and (4) the zero point on the scale of measure used represents an absence of the object being observed. We can add, subtract, divide and multiply such variables meaningfully.

In an interval scale, you can take the difference of two values, but you may not be able to take the ratio of two values. • Example: Temperature in Celsius. If the temperature in Adana is 40° Celsius and that in Trabzon is 20° Celsius, you can say that Adana is 20° Celsius hotter than Trabzon (taking a difference). But you cannot say Adana is twice as hot as Trabzon (taking a ratio is not allowed). • In a ratio scale, you can take the ratio of two values. Example: 40 kg is twice as heavy as 20 kg (taking a ratio). • Also, “0” on a ratio scale means the absence of that physical quantity. “0” on an interval scale doesn't mean the same. 0 kg means the absence of weight. 0° Celsius doesn't mean the absence of heat.
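
One way to check this is to re-express the same two temperatures in Fahrenheit and the same two weights in pounds. The short Python sketch below does that; the function name and the kg-to-lb factor are chosen for this illustration only. Differences survive the change of temperature scale (up to the unit factor), but the "twice as hot" ratio does not, whereas the weight ratio is the same in either unit.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: the two city temperatures from the example.
adana_c, trabzon_c = 40.0, 20.0
adana_f = celsius_to_fahrenheit(adana_c)      # 104.0
trabzon_f = celsius_to_fahrenheit(trabzon_c)  # 68.0

ratio_c = adana_c / trabzon_c  # 2.0
ratio_f = adana_f / trabzon_f  # about 1.53, so "twice as hot" depends on the scale

# The difference is meaningful: 20 Celsius degrees equal 36 Fahrenheit degrees.
diff_c = adana_c - trabzon_c   # 20.0
diff_f = adana_f - trabzon_f   # 36.0

# Ratio scale: weights have a true zero, so the ratio is unit-free.
KG_TO_LB = 2.20462  # approximate conversion factor
ratio_kg = 40 / 20
ratio_lb = (40 * KG_TO_LB) / (20 * KG_TO_LB)

print(ratio_c, round(ratio_f, 2))
print(diff_c, diff_f)
print(ratio_kg, round(ratio_lb, 2))
```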

OTHER DATA TYPES Dummy Variables from Quantitative Variables A quantitative variable can be transformed into a categorical variable, called a dummy variable, by recoding its values. Consider the following example: the quantitative variable Age can be classified into five intervals. The values of the associated categorical variable, called dummy variables, are 1, 2, 3, 4, 5.
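
The recoding can be sketched in Python as follows. The cut points in `CUT_POINTS` are hypothetical, since the actual age intervals are not shown here; they give the five classes under 20, 20 to 34, 35 to 49, 50 to 64, and 65 or older.

```python
from bisect import bisect_right

# Hypothetical cut points (assumed for illustration; not from the source).
CUT_POINTS = [20, 35, 50, 65]

def age_to_code(age):
    """Return the category code (1 to 5) of the interval that contains age."""
    # bisect_right counts how many cut points are <= age.
    return bisect_right(CUT_POINTS, age) + 1

ages = [18, 20, 34, 42, 55, 70]
codes = [age_to_code(a) for a in ages]
print(codes)  # [1, 2, 2, 3, 4, 5]
```

The codes only label the intervals. They keep the order of the age classes, but the distance between two codes no longer corresponds to a number of years.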

Types of data - examples
Interval-ratio data
Age - income: 55, 75000; 42, 68000; ...
Weight gain: +10; +5; ...
Nominal data
Person - Marital status: 1, married; 2, single; 3, single; ...
Computer - Brand: 1, IBM; 2, Dell; 3, IBM; ...

Types of data - examples
With nominal data, all we can do is calculate the proportion of data that falls into each category.
Brand | Count | Proportion
IBM | 25 | 50%
Dell | 11 | 22%
Compaq | 8 | 16%
Other | 6 | 12%
Total | 50 |
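
The proportions for the computer brands can be reproduced from the counts with a short Python sketch. The counts are the ones shown on the slide; the variable names are chosen for this illustration.

```python
# Counts of each computer brand, as given in the example (50 computers in total).
counts = {"IBM": 25, "Dell": 11, "Compaq": 8, "Other": 6}

total = sum(counts.values())

# Proportion of observations that fall into each category.
proportions = {brand: n / total for brand, n in counts.items()}

for brand, p in proportions.items():
    print(f"{brand:<7}{counts[brand]:>4}{p:>6.0%}")
print(f"{'Total':<7}{total:>4}")
```

Counting and dividing by the total is the only arithmetic used here. An average of the brand labels would have no meaning.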

Types of data – analysis • Knowing the type of data is necessary to properly select the technique to be used when analyzing data. • Type of analysis allowed for each type of data: • Interval data: arithmetic calculations • Nominal data: counting the number of observations in each category • Ordinal data: computations based on an ordering process

SCALES OF MEASUREMENT
Data
• Qualitative • Numerical: nominal, ordinal • Nonnumerical: nominal, ordinal
• Quantitative • Numerical: interval, ratio

What types of variable are the following (nominal, ordinal, metric discrete or metric continuous)? • Number of teeth • Age • Age last birthday (in years) • Has patient visited their dentist in the last year • Social class • Pocket depth • Hardness of filling material • Color of filling material • Type of radiograph • Calcium: phosphorus ratio in teeth • Severity of gum disease

Cross-Sectional/Panel Data/Time-Series Data • Cross-sectional data are collected at a certain point in time: • Marketing survey (observe preferences by gender, age) • Test scores in a statistics course • Starting salaries of an MBA program's graduates • Panel data come from a panel survey, which is conducted at several points in time (“waves”) using the same sample of respondents in every wave. Panel data come mostly from prospective (panel) surveys, but also from retrospective (“biographical”) surveys. • Time series data are collected over successive points in time (only one data point is observed at each time): • Weekly closing price of gold • Amount of crude oil imported monthly

CIRCULAR (DIRECTIONAL) DATA • Directional or circular distributions are those that have no true zero and any designation of high or low values is arbitrary: • Compass direction • Hours of the day • Months of the year
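
A small Python sketch shows why ordinary arithmetic can mislead on circular data. It averages two hours of the day, 23:00 and 01:00, first arithmetically and then by placing the hours on a circle and averaging their directions. The function name `circular_mean_hour` is chosen for this illustration, and the direction-averaging approach is one common convention rather than the only possible one.

```python
import math

def circular_mean_hour(hours):
    """Mean hour of day, treating the 24-hour clock as a circle."""
    # Map each hour to an angle on the unit circle.
    angles = [2 * math.pi * h / 24 for h in hours]
    # Average the directions through their sine and cosine components.
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    mean_angle = math.atan2(s, c)
    # Map the mean angle back to the range of hours.
    return (mean_angle * 24 / (2 * math.pi)) % 24

hours = [23, 1]
arithmetic = sum(hours) / len(hours)   # 12.0, i.e. noon, which is misleading
circular = circular_mean_hour(hours)   # numerically at midnight (about 0 or 24)

# Distance from midnight, measured around the clock.
distance_to_midnight = min(circular, 24 - circular)
print(arithmetic, distance_to_midnight)
```

Because floating-point rounding can put the result a hair on either side of midnight, the sketch reports the distance to midnight instead of the raw hour.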

FUNCTIONAL DATA • Functional data are made up of repeated measurements, taken as a function of something (e.g., time). • For example, a trajectory is functional data: we have the position or velocity sampled at many time points.

FUNCTIONAL DATA • Functional data are multivariate data with an ordering on the dimensions; their analysis uses Hilbert space methods. • They provide information about curves, surfaces or anything else varying over a continuum. • The analysis uses information on rates of change or derivatives of the curves, as well as slopes, curvatures or other characteristics. • Examples: handwriting data, motor control, electrical measurements such as EEG or EKG, spectral measurements, measures of climate, and so on.

Example • One expects temperature to be primarily sinusoidal in character, and certainly periodic over the annual cycle. • There is much variation in level and some variation in phase. • A model of the form

• Unlike time series analyses, no assumptions of stationarity are made, and data are not sampled at equally spaced time points. • Unlike most longitudinal data, a large number of time points are available, and the signal-to-noise ratio is medium to high. • The data can support the accurate estimate of one or more derivatives, and these play several critical roles. • Phase variation is recognized and separated from amplitude variation. • Familiar multivariate methods have functional counterparts, and the smoothness of functional parameter estimates is explicitly controlled. • Differential equations are new modelling tools.
