
Introduction




Presentation Transcript


  1. Introduction TRIBE statistics course Split, spring break 2016

  2. Goal 'Make them love statistics!' 3 methods to apply • Tests for differences (χ² and mean comparison) • Ordinary least squares regression (OLS) • Error analysis Interest in and knowledge about statistics for biomedical research Practical application in EViews/Excel Preparation, analysis, interpretation, and presentation of data (basics)

  3. Setting Doctoral students in biomedicine Limited and diverse statistical background More understanding, less theory Practical application Support of individual work with data for the thesis First try as an experiment 1 week during the spring break Teaching by means of lectures with applications in Excel/EViews Classes according to schedule on short notice No test

  4. Agenda • Modules • Preparation Master standard descriptive statistics and deduce tender points thereof • Analysis Formulate your problem statement and understand 'significance' • Interpretation Discern correlation and causality and set up a meaningful OLS model • Visualization Use the power of depiction to make your point • Structure Lecture, practical work, discussion • Homework Optional

  5. Lecturer

  6. Audience About you: • Name • Research topic • Data use yes/no • Method for the analysis • 1 expectation for this course • …and more

  7. Our first data set: height/weight We need • Dimension x-axis, y-axis • Direction in which values increase • Units in order to measure differences First questions • Precision needed/possible? • Truthfulness/bias? • Properties of either dimension? • Relation? • Explanation?

  8. Statistics versus …metrics Statistics • 'Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.' • More than mathematical methods for the treatment of data • Focus on analytical methods and properties …metrics • …metrics bridge the gap between statistical methods and practical interests within a specific area • Still, …metricians develop their own methods along with statisticians • Econometrics focuses on time series and causality What do you call your field of …metrics?

  9. Our sample data set Plus alternatively your own data

  10. Data retrieval Access • Collection • Search • Validation (external) Backup of the original data set including date and source = raw data • Websites as .mht files including link and date • Rights and sensitivity • Import to Excel / EViews / …, revision => Your data set

  11. Fast run through all

  12. Module 1 Realization & descriptive statistics: moments, structure, correlation Estimated underlying distribution: probability and cumulative density Distribution tests: expected versus actual realizations

  13. Module 2 Hypothesis testing: potential and limitation of statistics plus significance Problem statement: formulating the desired result in a testable way Central limit theorem: the magic of normally distributed sample means

  14. Module 3 Correlation: causality from content, not statistics Linear regression: standard ordinary least squares (OLS) Error term: model change and transformations for ideal characteristics

  15. Module 4 From data to visualization Message specific to the audience Review

  16. Not covered this time Finite sample properties • Unbiasedness, consistency, efficiency, distribution (some rule-of-thumb minima, though) Survival • Analogous principles as in OLS regression and hypothesis testing Time series • Autocorrelation • (Conditional) heteroscedasticity • Regimes

  17. In between During class • Listen • Process data • Formulate questions • Replicate with your data • Surf the web to the links from the slides Outside class • Between lecture hours: have a break and/or discuss • Between classes: get an appointment • For the next day: 1 optional homework

  18. Afterwards Nothing mandatory Feedback • to each other • to the lecturer • to the program manager(s) Contact the lecturer with ideas or questions

  19. Questions?

  20. Conclusion Meaningful statistical tests rest on assumptions => You need to know something about the topic discussed Statistics only work with a question behind them => Formulate what you would like to demonstrate Replication (of statistics, not data) is imperative => Keep a backup of your complete raw data with date and source

  21. Descriptive statistics TRIBE statistics course Split, spring break 2016

  22. Goal Understand that descriptive statistics are about SAMPLE properties Use descriptive statistics for validation of the data Understand descriptive output (example online):

  23. Sample size More is better Subsets of N • N used in the reported statistics (complete information, revised) • N of the sample (= raw data) • N of the population Difference • Data representative? • Direction of the bias? • Generalization of the results admissible?

  24. Moments Expectation of X^k = E[X^k] = k-th moment Estimator in samples = the (unweighted arithmetic) average of X^k Moments with names: • Mean • Variance (standard deviation for its square root) • Skewness • Kurtosis • Any moment E[X^n] obtainable (in some cases) via moment generating functions
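The sample estimators of the four named moments can be written out directly. A minimal sketch in Python (the course itself works in EViews/Excel; the numbers below are hypothetical heights in cm, and the population form with divisor n is used):

```python
# Hypothetical sample of body heights in cm
x = [160, 165, 170, 175, 180, 185, 200]
n = len(x)

mean = sum(x) / n                                      # 1st moment
var = sum((v - mean) ** 2 for v in x) / n              # 2nd central moment
std = var ** 0.5                                       # standard deviation
skew = sum((v - mean) ** 3 for v in x) / n / std ** 3  # standardized 3rd moment
kurt = sum((v - mean) ** 4 for v in x) / n / std ** 4  # standardized 4th moment

print(mean, var, skew, kurt)
```

The single large value 200 pulls the skewness above zero, which already hints at the sensitivity to outliers discussed later.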

  25. Mean For a population = the expected value, sample estimator = average

  26. Variance σ² Variance = the average of the squared differences from the mean • First measure of average spread in a distribution • First measure of uncertainty • Part of the 'family' of moments Standard deviation (= square root of the variance) • Kind of an average deviation • Same unit as the data • Squaring the individual distances to the mean avoids cancelling of positive and negative ones plus marks unequal deviations as 'larger', see:
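Why the squaring matters can be seen in three lines: raw deviations from the mean always sum to zero, squared ones do not. A small sketch with made-up numbers:

```python
x = [2, 4, 9]
mean = sum(x) / len(x)  # 5.0

raw = sum(v - mean for v in x)             # positive and negative deviations cancel
squared = sum((v - mean) ** 2 for v in x)  # squaring keeps every deviation positive

variance = squared / len(x)
std_dev = variance ** 0.5                  # back in the original unit of the data

print(raw, variance, std_dev)
```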

  27. Existence of moments Distributions without (some) moments exist • No mean: Cauchy distribution • No variance: some t-distributions • All distributions with a defined variance also have a defined mean For many distributions, a formula for the moments exists: ANY sample has a sample mean and variance (in short, all sample moments)

  28. Data requirements Metric variables necessary for most descriptive statistics to make sense – often implicitly assumed or approximated by the according interpretation of adjacent categories Ordinal variables (like rankings): distances with less or no meaning Nominal (or categorical) data • Often transformed to dummies (value of 0 or 1) • More than two categories can be captured by more dummies • Dummies allow a quantitative distinction of effects
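Dummy coding of a nominal variable can be sketched as below. The category names are hypothetical, and dropping one category as a baseline is a modelling convention (it avoids perfect collinearity in a later regression), not something the slide prescribes:

```python
# Hypothetical nominal variable with three categories
color = ["red", "green", "blue", "green", "red"]

categories = sorted(set(color))  # ['blue', 'green', 'red']
baseline = categories[0]         # one category left out as the reference

# One 0/1 dummy per remaining category
dummies = {
    c: [1 if v == c else 0 for v in color]
    for c in categories if c != baseline
}
print(dummies)
```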

  29. Median Median = 50% quantile Skewness matters Symmetry => median = mean Standard use of quantiles for • Data with extreme outliers • Income/wealth statements • Votes (majority rules) Still another figure: the mode
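The robustness of the median against extreme outliers, which is why it is the standard figure for income statements, shows up immediately in a toy example (hypothetical incomes):

```python
import statistics

# Hypothetical incomes; one extreme outlier
income = [20_000, 25_000, 30_000, 35_000, 1_000_000]

mean = statistics.mean(income)      # dragged far up by the outlier
median = statistics.median(income)  # 50% quantile, barely affected
print(mean, median)
```

In a right-skewed distribution like this one, the mean exceeds the median; under symmetry the two coincide.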

  30. Data structure Boundaries • Minimum • Maximum • Range = maximum minus minimum Outliers • No universal definition • Rule of thumb: more than 2-3 standard deviations away • Limited number when defined by σ since the probability of realizations beyond k times σ decreases at least quadratically in k (Chebyshev's inequality: Probability(│X-µ│ ≥ kσ) ≤ 1/k² for k > 0)
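Chebyshev's bound can be checked empirically on a sample, using the sample mean and standard deviation in place of µ and σ (hypothetical data; the bound itself is a statement about the population):

```python
# Empirical check of Chebyshev's inequality P(|X - mu| >= k*sigma) <= 1/k^2
x = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]  # 30 is an outlier
n = len(x)
mu = sum(x) / n
sigma = (sum((v - mu) ** 2 for v in x) / n) ** 0.5

k = 2
# Share of observations at least k standard deviations from the mean
share_outside = sum(abs(v - mu) >= k * sigma for v in x) / n
print(share_outside, 1 / k**2)
```

Even with the outlier, at most 1/k² = 25% of observations can lie 2σ or more away, which is why σ-based outlier rules keep the outlier count limited.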

  31. Validation of the data import Frequent mistakes • Blanks as zeros • Format: decimal separators (.↔ ,), 12'345↔12.345, numbers as text • Percentage versus percentage points Simple quality check: compare expectations to realizations • Complete data: serious collection or the contrary • Numbers: always the same entries, extreme values, sums • Surveys: enforced answers, strategy, wrongly understood questions
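The decimal-separator pitfall (. ↔ , and 12'345 ↔ 12.345) can be handled with a small normalizer. This is only a sketch: the function name is made up, and it assumes that an apostrophe always marks thousands and that a comma, where present, is the decimal separator; a real import must first confirm the source's locale, since a bare "12.345" stays ambiguous:

```python
def parse_number(s):
    """Normalize strings like "12'345" or "12.345,67" to a float.

    Hypothetical helper; assumes ' marks thousands and a comma,
    if present, is the decimal separator.
    """
    s = s.strip().replace("'", "")
    if "," in s:
        # European convention: dots are thousands separators
        s = s.replace(".", "").replace(",", ".")
    return float(s)

print(parse_number("12'345"))
print(parse_number("12.345,67"))
```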

  32. Precision Measure height in meters instead of centimeters => all are 2 meters tall • Required precision relates to the question asked • Different levels of precision complicate the replication of results • Stated highly precisely, some statistics imply more information than available (e.g. a median height of 185.00 cm when data is in cm only)

  33. Missing data Reduces the available data set for analyses that rely on all information • Fewer data points need clearer outcomes for significant results • Handling easiest with specialized software • Potential (and often likely) bias at the omissions Treatment • Data samples large enough taking into account some missing data • Do not replace missing output by a model prediction (no gain but spuriously reduced variance due to the assumed zero error) • Possibility to replace missing data in an input matrix (but then correlation matters)

  34. Data structure Sorting • A-Z usually okay, sorting by time not (autocorrelation) • Helpful to get rid of missing data, no need in statistical software • If at all, then for all variables equally (for cross-variable relations) Expansion of the data set (additional variables, often dummies) • Beware of implicit assumptions ('A + B = Total': maybe there is a C) • Explanatory content (also non-linear) by construction • Keep track of the construction

  35. Histogram In EViews: Group/Series → Descriptive Statistics & Tests → Histogram and Stats

  36. Box plot • Information about the distribution • Whiskers show the range to the farthest point still not an outlier • No standard for the (far) outliers; EViews uses 1.5 (respectively 3) times the interquartile range In EViews: Group/Series → View → Graph → Boxplot

  37. Correlation Descriptive over more than one series, also across dimensions First measure of 'connection' between two univariate series Usually stated in terms of linear correlation Basis for regressions Autocorrelation (especially for time series) over one or many periods ρ for the population, r for the sample: • ρ higher than 80% considered as strong, below 50% as weak
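The sample correlation coefficient r is the covariance scaled by both standard deviations. A minimal sketch, applied to the course's height/weight example with hypothetical numbers:

```python
def pearson_r(x, y):
    """Sample Pearson correlation r = Cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

height = [160, 165, 170, 175, 180]  # cm, hypothetical
weight = [55, 60, 63, 70, 74]       # kg, hypothetical
r = pearson_r(height, weight)
print(round(r, 3))
```

By the slide's rule of thumb, an r above 0.8 counts as a strong (linear) connection; the construction also makes clear that r says nothing about causality.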

  38. To do list • Keep the original raw data • Design the data collection with reserves for potentially missing data • Use descriptive statistics to validate the sample data • Refrain from 'obvious' improvements (categories, rounding, sorting) • Maximum precision in the calculations (as long as it does not slow down the process by too much), a reasonable one in the presentation • Take care at further transport or transformation, for example: • Date difference EViews/Excel = 693593 • Date '0' in EViews means 01Jan0001 • Date '0' in Excel means 01Jan1900

  39. Questions?

  40. Conclusion Tender points of the descriptive statistics (sensitivity to outliers) Misdirection when drawing upon the wrong descriptive statistics Like any software, statistical packages take some time to accustom Complete data can be a sign of good or bad quality Rather use robust test methods than trying to fix the data Large samples reduce the issues of bias and missing data Expectations about the descriptive statistics form a first hypothesis that is 'tested' by eye inspection for the realized values of the data

  41. Underlying distribution TRIBE statistics course Split, spring break 2016

  42. Goal What do we think we have in general (and not just the sample)? Figure out what SHOULD be there (and with which likelihood) Realize that a known distribution does not imply certain outcomes

  43. Sample distribution (histogram) 'Where do we have how many realizations along each dimension?' Example of a discrete probability function With no assumptions, this is the most likely underlying distribution

  44. Discrete probability function From histogram to distribution • x-axis = dimension of the realization (cm, kg, €, …) • y-axis = probability of the realization on the x-axis at this point • Standardization to an area of 1 (= 100%), also for comparison Properties of standardized histograms as probability functions • Any surface of size 100% corresponds to a distribution • Height limited by the width of the category (100% maximum surface) • No 'left/right' boundaries necessary (but probability goes to 0)
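Standardizing a histogram to total probability 1 is just dividing each count by n. A short sketch with hypothetical discrete realizations:

```python
from collections import Counter

# Hypothetical discrete realizations (e.g. shoe sizes)
data = [40, 41, 41, 42, 42, 42, 43, 43]

counts = Counter(data)  # the histogram: value -> frequency
n = len(data)

# Standardize so the probabilities sum to 1 (= 100%)
pmf = {value: count / n for value, count in sorted(counts.items())}
print(pmf)
```

The resulting mapping is exactly the discrete probability function of the slide: x-axis values as keys, probabilities as values, total area 100%.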

  45. Continuous function when n increases For sufficient precision at the measurement, the steps disappear with n • Example: measure length in km, meters, cm, mm, µm, … • Not the case for truly discrete functions: Coin toss, roll dice, lotteries • Smooth development allows approximation by continuous functions Mapping • Data implies function and function implies data • Analytical representation of continuous distribution functions • Approximation of discrete by continuous facilitates calculations(fewer parameters, predictions for the full support, smoothness)

  46. Continuous versus discrete distribution No continuous distribution in ANY sample (with necessarily a finite number of observations and finite precision) Hence always the question 'close enough as an approximation?' Classical approximation: the normal distribution (= Gaussian curve)

  47. Full distribution Probability density function and cumulative probability function • D(x) = integral of P(x), often available as an explicit function (analytical solutions, easy calculation) • Indicates the likelihood of realizations within ANY interval • Normal distribution: P(x) = 1/(σ√(2π)) ∙ e^(-(x-µ)²/(2σ²)) • Sometimes, less information (like mean or variance) suffices
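For the normal distribution, P(x) has a closed form while D(x) is usually evaluated via the error function; differences of D give the likelihood of any interval. A small sketch for the standard normal case:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density P(x) of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative D(x), expressed through the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Probability of a realization within one sigma of the mean: D(1) - D(-1)
p_within_1sigma = normal_cdf(1) - normal_cdf(-1)
print(round(p_within_1sigma, 4))
```

This reproduces the familiar figure that roughly 68% of the probability mass lies within one standard deviation of the mean.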

  48. Distribution versus realization Exact distribution (binomial, coin, dice, normal, …) ≠ sure realization Real life realization = data set

  49. Calculation rule mean E[a∙X + b∙Y + c] = a∙E[X] + b∙E[Y] + c, with a, b, c constants; X, Y stochastic; E[∙] as the operator for expectations
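The linearity rule can be verified on sample averages, since the sample mean is itself a linear operator. A sketch with arbitrary made-up constants and paired observations:

```python
# Check E[a*X + b*Y + c] = a*E[X] + b*E[Y] + c on paired sample values
X = [1, 2, 3, 4]
Y = [10, 20, 30, 40]
a, b, c = 2, 0.5, 7  # arbitrary constants

def mean(v):
    return sum(v) / len(v)

lhs = mean([a * x + b * y + c for x, y in zip(X, Y)])  # transform, then average
rhs = a * mean(X) + b * mean(Y) + c                    # average, then transform
print(lhs, rhs)
```

Note that the rule holds regardless of any dependence between X and Y, unlike the variance rule on the next slide.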

  50. Calculation rules (co)variance Variance of X = Var(X) = E[(X - E[X])²] Var(a∙X + b) = a²∙Var(X) Var(a∙X ± b∙Y + c) = a²∙Var(X) + b²∙Var(Y) ± 2∙a∙b∙Cov(X,Y) Covariance of X and Y = Cov(X, Y) = E[(X - E[X])(Y - E[Y])] Cov(X, X) = Var(X), Cov(a, X) = 0, Cov(a∙X, b∙Y) = a∙b∙Cov(X, Y) Cov(X+Y, Z) = Cov(X, Z) + Cov(Y, Z) for stochastic Z • Variance adds up in an n-step combination (for example over time) => Volatility (= standard deviation σ) increases with √t over time
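The rule Var(a∙X + b) = a²∙Var(X) can likewise be checked on sample variances: an additive shift b drops out, a scale factor a enters squared. A sketch with hypothetical values, using the population form with divisor n:

```python
# Check Var(a*X + b) = a^2 * Var(X) on a sample
X = [1, 2, 3, 4, 5]
a, b = 3, 10  # arbitrary constants

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

lhs = var([a * x + b for x in X])  # transform, then compute the variance
rhs = a ** 2 * var(X)              # shift b vanishes, scale a enters squared
print(lhs, rhs)
```

The same mechanics drive the √t statement: summing t independent identically distributed steps multiplies the variance by t, hence the standard deviation by √t.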
