
CTRC Core Curriculum Seminar Series. Descriptive Statistics: Data Types and Measures, Central Tendency, Variability Chang-Xing Ma, PhD Associate Professor Department of Biostatistics, UB January 4, 2012. Disclosure Statement. Chang-Xing Ma, PhD Nothing to disclose. Goals and Objectives.


CTRC Core Curriculum Seminar Series

Descriptive Statistics:

Data Types and Measures, Central Tendency, Variability

Chang-Xing Ma, PhD

Associate Professor

Department of Biostatistics, UB

January 4, 2012



Disclosure Statement

  • Chang-Xing Ma, PhD

    • Nothing to disclose


Goals and Objectives

  • Goals: Gain a working knowledge of basic statistics and of how to describe data

  • Objectives:

    • Describe data types

    • Summarize data

    • Understand measures of central tendency

    • Understand measures of dispersion



Outline

  • Basic concepts of biostatistics

  • Data type

  • Summarize data

  • Measure of Central Tendency

  • Measure of Dispersion



Some terminology

  • Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data

  • Biostatistics—the theory and techniques for collecting, describing, analyzing, and interpreting health data.


Some terminology

  • Population refers to all measurements or observations of interest

  • Sample is simply a part of the population. But the sample MUST represent the population.

    • A random sample is such a representative sample

      • The sample must be large enough

      • The sample should be selected randomly



Some terminology

  • Parameter is some numerical or nominal characteristic of a population

    • A parameter is constant, e.g. mean of a population

    • Usually unknown

  • Statistic is some numerical or nominal characteristic of a sample.

    • We use a statistic as an estimate of a parameter of the population

    • It tends to differ from one sample to another

    • We also use a statistic to test hypotheses


Population: all U.S. persons

Parameters: height ~ Normal(µh, σh²), weight ~ Normal(µw, σw²)

A random sample (sample size left blank on the slide), with columns Gender, Height, Weight

Statistics from a sample:

  • mean height

  • std height

  • mean weight

  • std weight

  • % of male (=1)
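The difference between a fixed parameter and a sample statistic that varies from sample to sample can be sketched with a small simulation. The population values used here (mean height 170 and SD 10) are hypothetical, chosen only for illustration; they are not from the slides.

```python
import random
import statistics


def draw_sample_stats(n, mu=170.0, sigma=10.0, seed=0):
    """Draw n heights from Normal(mu, sigma^2); return (sample mean, sample SD).

    mu and sigma play the role of the (normally unknown) population
    parameters; the values 170 and 10 are assumptions for illustration.
    """
    rng = random.Random(seed)
    heights = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.mean(heights), statistics.stdev(heights)


if __name__ == "__main__":
    # Three different random samples give three different sets of statistics,
    # all scattered around the same fixed parameters.
    for seed in range(3):
        mean, sd = draw_sample_stats(n=100, seed=seed)
        print(f"sample {seed}: mean height = {mean:.1f}, std height = {sd:.1f}")
```

Each run of the loop mimics drawing a new random sample: the parameters stay constant while the statistics change.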



Sources of data

Records

Surveys

Experiments

Comprehensive

Sample


Types of variables

  • Quantitative variables

    • Quantitative continuous

    • Quantitative discrete

  • Qualitative variables

    • Qualitative nominal

    • Qualitative ordinal



Data Types

  • Numerical (Quantitative)

    • numerical measurement

      • Height

      • Weight

  • Categorical (Qualitative)

    • with no natural sense of ordering

      • Gender

      • Hair color

      • Blood type



Numerical Variable

  • Continuous

    • Range of values

      • Height in inches

  • Discrete

    • Limited possible values

      • # of cigarettes smoked per day

      • # of children in a family

  • Age -


Determining Data Types

  • Ordinal (Categorical) vs. Discrete (Numerical)

  • Ordinal

    • Cancer Stage I, II, III, IV

    • Stage II ≠ 2 times Stage I

    • Categories could also be A, B, C, D

  • Discrete

    • # of children: 0, 1, 2, …

    • 4 children = 2 times 2 children



Descriptive Statistics – reducing a complex mass of data to a manageable set of information

  • Descriptive Statistics: the summary and presentation of data to:

    • simplify the data

    • enable meaningful interpretation

    • support decision making

  • Numerical descriptive measures (few numbers)

  • Graphical presentations


Inferential statistics

From a sample

  • to estimate population parameters

  • to test hypotheses

  • to build a model that reflects the population



The student test score (FCAT)

  • Problem 1

  • Among the 6 variables, which ones are qualitative and which ones are quantitative?

  • Is Race nominal or ordinal?

Columns: Student ID, Race, Sex, Reading, Math, Poverty

Code:

  • Race: W – White, B – Black, H – Hispanic, A – Asian

  • Sex: F – Female, M – Male

  • Poverty: 0 – not poor, 1 – poor



Descriptive Statistics

  • Categorical variables:

    • Frequency distribution

    • Bar chart, pie chart

    • Contingency tables

  • Continuous variables:

    • Grouped frequency table

    • Central Tendency

    • Variability


Simple Frequency Distribution

An ordered arrangement that shows the frequency of each level of a variable.

race Frequency Percent

-----------------------------

A 7 4.07

B 42 24.42

H 8 4.65

W 115 66.86

sex Frequency Percent

----------------------------

F 86 50.00

M 86 50.00
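As a minimal sketch of how such a table can be produced, the following rebuilds the race column from the counts shown above (the individual student records are not reproduced here, so the list is reconstructed from the frequencies) and recomputes the percentages. The function name `frequency_table` is illustrative.

```python
from collections import Counter


def frequency_table(values):
    """Return a list of (level, frequency, percent) tuples sorted by level."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(level, n, round(100 * n / total, 2))
            for level, n in sorted(counts.items())]


# Race column rebuilt from the counts in the table (172 students in total).
race = ["A"] * 7 + ["B"] * 42 + ["H"] * 8 + ["W"] * 115

print("race Frequency Percent")
for level, n, pct in frequency_table(race):
    print(f"{level:<4} {n:>9} {pct:>7.2f}")
```

The percentages are simply 100 × frequency / total, for example 100 × 7 / 172 ≈ 4.07 for level A.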


Simple Frequency Distribution

  • It is useful for categorical variables

  • For a continuous variable, it allows you to

    • pick up at a glance some valuable information, such as the highest and lowest values

    • ascertain the general shape or form of the distribution

    • make an informed guess about central tendency values


Bar Chart

  • summarizes a set of categorical data (nominal or ordinal)

  • It displays the data using a number of rectangles, each of which represents a particular category. The length of each rectangle is proportional to the number of cases in the category it represents

  • can be displayed horizontally or vertically

  • they are usually drawn with a gap between the bars

  • Bars for multiple (usually two) variables can be drawn together to see the relationship


Pie Chart

  • summarizes a set of categorical data (nominal or ordinal)

  • It is a circle which is divided into segments.

  • Each segment represents a particular category.

  • The area of each segment is proportional to the number of cases in that category.



Complex frequency distribution Table

Distribution of 20 lung cancer patients at the chest department of Alexandria hospital and 40 controls in May 2008 according to smoking


How about continuous variables?

  • How are the data distributed?

  • Measure of Central Tendency

  • Measure of Variability


Grouped Frequency Distribution – for continuous variable

(Interactive frequency table with fields DATA, Interval Size, N, µ, σ)


Grouped Frequency Distribution

  • The problem with a simple frequency distribution of a continuous variable is that so much information is presented that it is difficult to discern what the data are really like, or to "cognitively digest" the data.

  • The simple frequency distribution usually needs to be condensed even more.

  • It is acceptable to lose some information (precision) about the data in order to gain understanding of the distribution.

  • This is the function of grouping data into equal-sized intervals called class intervals.

  • The grouped frequency distribution can be further presented as Frequency Polygons, Histograms, Bar Charts, Pie Charts.
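The grouping into class intervals described above can be sketched as follows. The data are the 24 patient ages used in the range example of this lecture; the interval width of 5 years and the starting point 0 are assumptions made for illustration.

```python
from collections import Counter


def grouped_frequency(values, width, start=0):
    """Count values in equal-sized class intervals [lo, lo + width)."""
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in range(min(counts), max(counts) + 1):
        lo = start + k * width
        table.append((lo, lo + width, counts.get(k, 0)))
    return table


ages = [6, 13, 7, 14, 10, 14, 15, 9, 7, 2, 7, 13,
        16, 9, 8, 3, 3, 17, 8, 5, 4, 9, 9, 6]

# Print each class interval with its frequency and a crude text histogram.
for lo, hi, n in grouped_frequency(ages, width=5):
    print(f"{lo:>2}-{hi - 1:<2} {n:>2} {'*' * n}")
```

Individual ages are no longer visible in the output, but the shape of the distribution (most patients between 5 and 9 years) is easy to see.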


Describing Distributions

  • Bell-Shaped Distribution

    • Normal distribution N(µ=0, σ²=1)

    • t-distribution

  • Skewed Distribution

    • positively skewed distribution

    • negatively skewed distribution

  • Other Shapes

    • Rectangular

    • Bimodal

    • J-curve


Probability density function - Normal

z-transform

(Figure: the green curve is the standard normal distribution)
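The formulas on this slide are not in the transcript. For reference, the standard textbook forms of the normal density and the z-transform are given below; the symbols µ and σ follow the notation used elsewhere in the lecture.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
z = \frac{x-\mu}{\sigma} \sim N(0,\,1)
```

Subtracting µ and dividing by σ maps any normal variable onto the standard normal distribution N(µ=0, σ²=1).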


Measure of Central Tendency: Mean, Median, Mode

The Mean

  • average value

  • not robust to outlying values

  • Length of hospital stays: 6, 4, 5, 9, 10, 7, 1, 4, 3, 4

  • Mean = (6+4+5+9+10+7+1+4+3+4)/10 = 5.3


Measure of Central Tendency: Mean, Median, Mode

The Median

  • is the point that divides a distribution of data into two equal parts

  • robust to outlying values

  • Length of hospital stays, sorted: 1 3 4 4 4 5 6 7 9 10

  • median = (4+5)/2 = 4.5


Measure of Central Tendency: Mean, Median, Mode

The Mode

  • is the most frequently occurring value; for grouped data, the midpoint of the interval that has the highest frequency

  • robust to outlying values, but sometimes misleading

  • Length of hospital stays, sorted: 1 3 4 4 4 5 6 7 9 10

  • Mode = 4, which occurred 3 times.
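The three measures for the hospital-stay data can be checked with Python's standard `statistics` module. The second data set, in which the stay of 10 days is replaced by 100, is a hypothetical addition to show why the mean is described as not robust while the median is.

```python
import statistics

stays = [6, 4, 5, 9, 10, 7, 1, 4, 3, 4]  # length of hospital stays from the slides

mean = statistics.mean(stays)      # 53 / 10
median = statistics.median(stays)  # average of the two middle sorted values, 4 and 5
mode = statistics.mode(stays)      # most frequent value
print(mean, median, mode)          # 5.3 4.5 4

# Hypothetical outlier: one patient stays 100 days instead of 10.
stays_outlier = [100 if s == 10 else s for s in stays]
print(statistics.mean(stays_outlier))    # 14.3, pulled up by the outlier
print(statistics.median(stays_outlier))  # 4.5, unchanged
```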


Comparison between mean and median

(Figures: positions of the mean and the median on three distributions)


Summary

Frequency distribution

Histogram, Polygon graph

Bar Chart, Pie Chart

Describing Distributions

Mean, Median, Mode


DATASET: http://128.205.94.145/STA2008/FL_School0022.xls


Problem 2

  • In a study, we collected a medical measurement X for 4 patients

  • Data of X: 2, 3, 5, 6

  • Mean of X?

  • Median of X?

  • Mode of X?


Descriptive Statistics: Variability

  • The sample range

  • Interquartile range

  • The sample standard deviation (SD), variance

  • Standard error of mean (SEM)


Measures of Dispersion - Range

  • Range – the difference between the lowest and highest values

  • For example, Age of Patients (years): 6 13 7 14 10 14 15 9 7 2 7 13 16 9 8 3 3 17 8 5 4 9 9 6

  • lowest 2, highest 17

  • Range = 17 − 2 = 15 years (from 2 to 17 years)

  • When sample size increases, the range tends to increase as well (not robust).


Measures of Dispersion - Range

  • All of the curves have the same range

  • Mean?

  • Median?


Measures of Dispersion: Percentiles, Deciles, Quartiles

  • Percentiles are based on dividing a sample or population into 100 equal parts.

  • Deciles divide the distribution into 10 equal parts.

  • Quartiles divide the distribution into 4 equal parts.

    • 1st quartile includes the lowest 25% of the values (upper boundary Q1)

    • 2nd quartile includes the values from the 26th percentile through the 50th percentile (upper boundary Q2, the median)

    • 3rd quartile includes the values from the 51st percentile through the 75th percentile (upper boundary Q3)


Measures of Dispersion: Interquartile Range

  • Interquartile Range – from the 25th percentile (1st quartile) to the 75th percentile (3rd quartile)

  • Age of Patients (years), sorted: 2 3 3 4 5 6 6 7 7 7 8 8 9 9 9 9 10 13 13 14 14 15 16 17

  • 1st quartile 6, 2nd quartile 8.5, 3rd quartile 13

  • Interquartile Range = 13 − 6 = 7 years (from 6 to 13 years)

  • The interquartile range is a robust estimate of data variability


Measures of Dispersion: Interquartile Range

  • Robust estimate, less efficient


Deviations from the mean: Variance and Standard Deviation

  • deviation: observation − mean

  • “sum” of deviations

  • BUT the deviations from the mean always sum to zero, so their plain sum cannot measure spread


Deviations from the mean: Variance and Standard Deviation

  • Measure of how different the values in a set of numbers are from each other

  • Variance:

  • Standard Deviation:


Deviations from the mean: Variance and Standard Deviation

  • Data set: 2, 3, 5, 6

    Calculation (mean X̄ = (2+3+5+6)/4 = 4):

X     (X − X̄)   (X − X̄)²

-----------------------------

2     −2        4

3     −1        1

5     1         1

6     2         4

Sum   ∑=0       ∑=10

Variance

Standard Deviation
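The numerical results for this slide are not in the transcript. As a sketch, and assuming the usual sample definitions with divisor n − 1, that is s² = Σ(X − X̄)²/(n − 1) and s = √s², the calculation for the data set 2, 3, 5, 6 runs as follows.

```python
import math
import statistics

x = [2, 3, 5, 6]

mean = sum(x) / len(x)                    # 4.0
deviations = [v - mean for v in x]        # [-2.0, -1.0, 1.0, 2.0], sums to 0
sum_sq = sum(d ** 2 for d in deviations)  # 10.0

# Assumption: sample variance with divisor n - 1 (not the population divisor n).
variance = sum_sq / (len(x) - 1)          # 10 / 3
sd = math.sqrt(variance)

print(f"variance = {variance:.2f}, SD = {sd:.2f}")  # variance = 3.33, SD = 1.83

# The standard library uses the same n - 1 definition.
assert math.isclose(variance, statistics.variance(x))
assert math.isclose(sd, statistics.stdev(x))
```

With the population divisor n instead, the variance would be 10/4 = 2.5; which divisor the original slide used is not recoverable from the text.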


Three normal distributions: mean=0, s²=1, s²=2, s²=0.5

  • Leptokurtic: homogeneous, narrow scatter

  • Mesokurtic

  • Platykurtic: heterogeneous, wide scatter

  • Central tendency: mean=0



Example 2: FEV1 (litres) of 57 male medical students

Table: FEV1 (litres) of 57 male medical students

2.85 3.19 3.50 3.69 3.90 4.14 4.32 4.50 4.80 5.20

2.85 3.20 3.54 3.70 3.96 4.16 4.44 4.56 4.80 5.30

2.98 3.30 3.54 3.70 4.05 4.20 4.47 4.68 4.90 5.43

3.04 3.39 3.57 3.75 4.08 4.20 4.47 4.70 5.00

3.10 3.42 3.60 3.78 4.10 4.30 4.47 4.71 5.10

3.10 3.48 3.60 3.83 4.14 4.30 4.50 4.78 5.10




The Meaning of Standard Deviation

  • How the data are dispersed around the mean

  • For normally distributed data:

    • Mean ± 1 SD covers about 68.3% of the population

    • Mean ± 2 SD covers about 95.4% of the population

    • Mean ± 3 SD covers about 99.7% of the population
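These coverage figures hold for a normal distribution and can be computed directly from the error function: the probability of lying within k SDs of the mean is erf(k/√2). A minimal sketch, with the illustrative function name `normal_coverage`:

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"mean ± {k} SD: {100 * normal_coverage(k):.1f}%")
# mean ± 1 SD: 68.3%
# mean ± 2 SD: 95.4%
# mean ± 3 SD: 99.7%
```

For data that are clearly skewed these percentages no longer apply.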


The Meaning of Standard Deviation

(Figure: normal curve; about 34% of the area lies within 1 SD on each side of the mean, and about 48% within 2 SD on each side)


Standard Error of Mean (SEM)

  • How confident can we be that the sample mean represents the population mean µ?

  • SEM = SD/√n, where n is the sample size

    • SEM is smaller than the SD whenever n > 1, and much smaller for large samples

  • mean ± 1.96*SD covers about 95% of the data (for normally distributed data)

  • mean ± 1.96*SEM gives an approximate 95% confidence interval for the population mean

  • SEM and SD are different!
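The distinction between SD and SEM can be illustrated with the FEV1 measurements of the 57 male medical students from Example 2. This is a sketch that assumes the sample SD (divisor n − 1) and the normal-approximation multiplier 1.96.

```python
import math
import statistics

# FEV1 (litres) of 57 male medical students, copied row by row from Example 2.
fev1 = [
    2.85, 3.19, 3.50, 3.69, 3.90, 4.14, 4.32, 4.50, 4.80, 5.20,
    2.85, 3.20, 3.54, 3.70, 3.96, 4.16, 4.44, 4.56, 4.80, 5.30,
    2.98, 3.30, 3.54, 3.70, 4.05, 4.20, 4.47, 4.68, 4.90, 5.43,
    3.04, 3.39, 3.57, 3.75, 4.08, 4.20, 4.47, 4.70, 5.00,
    3.10, 3.42, 3.60, 3.78, 4.10, 4.30, 4.47, 4.71, 5.10,
    3.10, 3.48, 3.60, 3.83, 4.14, 4.30, 4.50, 4.78, 5.10,
]

n = len(fev1)
mean = statistics.mean(fev1)
sd = statistics.stdev(fev1)   # spread of the individual measurements
sem = sd / math.sqrt(n)       # uncertainty of the sample mean

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")
print(f"mean ± 1.96*SD : {mean - 1.96 * sd:.2f} to {mean + 1.96 * sd:.2f}")
print(f"mean ± 1.96*SEM: {mean - 1.96 * sem:.2f} to {mean + 1.96 * sem:.2f}")
```

The first interval describes where most individual students fall; the second, much narrower one describes how precisely the population mean is estimated.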



Standard Error of Mean (SEM)

  • Describing the scatter or spread of data, use SD

  • Estimate population parameters, use SEM

  • Epidemiologic study, SEM

  • Clinical or laboratory research, SD


Summarizing Data - Calculator

(Interactive calculator with fields DATA, Interval Size, N, µ, σ, Ylim)



Box-Plot

  • The box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The range of the middle two quartiles is known as the inter-quartile range.

  • The line in the box indicates the median value of the data.

  • The + indicates the mean value

  • The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case each whisker extends at most 1.5 times the inter-quartile range beyond its hinge.

  • The points outside the ends of the whiskers are outliers or suspected outliers.
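The box-plot rules above can be sketched numerically. The data are the hospital stays used earlier in the lecture plus one hypothetical stay of 30 days, added only so that an outlier appears. The hinges are computed as medians of the lower and upper halves, which is one of several conventions and an assumption here; plotting software may give slightly different hinges.

```python
import statistics


def box_plot_summary(values):
    """Return hinges, median, whisker ends, and outliers under the 1.5*IQR rule."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])              # lower hinge
    q3 = statistics.median(data[len(data) - half:])  # upper hinge
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": statistics.median(data),
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlying values
        "outliers": outliers,
    }


# Hospital stays from the slides; the final value 30 is hypothetical.
stays = [6, 4, 5, 9, 10, 7, 1, 4, 3, 4, 30]
print(box_plot_summary(stays))
```

Here the upper whisker stops at 10, the largest value within 1.5 × IQR of the upper hinge, and the stay of 30 days is flagged as an outlier.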


Box Plot – Example 2

  • FEV1 of 57 students

  • Serum triglyceride measurements in cord blood from 282 babies


What can you get from a box plot?

  • Graphically display a variable's location and spread at a glance [Q1, Q2 (median), Q3, interquartile range].

  • Provide some indication of the data's symmetry and skewness.

  • Unlike many other methods of data display, boxplots show outliers.

  • By drawing a boxplot for each level of a categorical variable side by side on the same graph, one can quickly compare data sets.

  • One drawback of boxplots is that they tend to emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution. Displaying a histogram in conjunction with the boxplot helps.


Transformations

(Figures: distribution of triglyceride and of LOG(triglyceride))


Summarizing data

Univariate – categorical variable

  • Frequency distributions

  • Bar Chart, Pie Chart


Summarizing data

Univariate – continuous variable

  • Grouped frequency distributions

  • Polygon or histogram

  • Mean, Median, Mode, Percentile, Q1, Q2, Q3, extreme values

  • Standard deviation, variance, range, interquartile range

  • Box-Plot

  • Normality test statistics


Next lecture (Lecture 2)

Bivariate – one variable is categorical and the other is continuous

  • t-test

  • ANOVA


Lecture 3 – categorical data analysis

Bivariate – both are categorical

  • Contingency tables

  • Chi-square test

Response is categorical, predictors could be of both types.

  • Logistic regression


Lecture 4 – Continuous response

  • Correlation

  • Multiple linear regression



  • Thanks.

  • Questions?

