Quantitative Data Analysis I./II.

UK FHS Historical sociology (2015+). Quantitative Data Analysis I./II. Missing Values (I.) Identification, assigning and their analysis Jiří Šafr jiri.safr(AT)seznam.cz Last revision 22/3/2015. Missing data: definition and relevance.

Presentation Transcript


  1. UK FHS Historical sociology (2015+) Quantitative Data Analysis I./II. Missing Values (I.) Identification, assigning and their analysis Jiří Šafr jiri.safr(AT)seznam.cz Last revision 22/3/2015

  2. Missing data: definition and relevance • Missing data (also called "item nonresponse") means that for some reason data on particular items or questions are not available for analysis. • The fraction of missing data is an important indicator of data quality. • A prerequisite for analysing the data (and especially for the statistical treatment of missing data) is to understand why the data are missing. (A missing value that originates from accidentally skipping a question differs from one that originates from a respondent's reluctance to reveal sensitive information.) Source: [Lavrakas2008: 467]

  3. The first step of any analysis is to examine the data, i.e. to search for "inappropriate" values and exclude them from the range of valid values → MISSING VALUES

  4. Two types of missing values (in SPSS): 1. System = SYSMIS (in data: " . "). This is the most basic form of marking missing values (and a reliable format when data are transferred to other software), but strictly speaking it carries no information about why the value is missing. Most often it arises when no value was recorded for the variable, or when the variable does not apply to the case (the respondent), e.g. year of divorce for single or married persons. If the questionnaire provides more detailed information such as "Not applicable", "Refused to answer" or "Do not know", we code these answers with specific "inappropriate" values, which we can later assign as missing. 2. User defined = MISSING VALUES. In the data we use values outside the standard range, e.g. "9" or "99". We can label them, e.g. 8 = Refused to answer, 9 = Didn't know. These values are not included in the main part of the analyses and are reported separately (for now we "switch them off" with the MISSING VALUES command or in the menu).
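
The distinction between the two types can be sketched outside SPSS as well. The following is a minimal illustration in Python rather than SPSS syntax; it assumes that `None` stands in for a system-missing value and that 8 and 9 are the user-defined codes mentioned on the slide. The data and the names `USER_MISSING` and `classify` are made up for this example.

```python
# Hypothetical codes: None plays the role of SPSS system-missing (SYSMIS),
# 8 and 9 are user-defined missing codes that keep their labels.
USER_MISSING = {8: "Refused to answer", 9: "Didn't know"}


def classify(value):
    """Return 'system', 'user' or 'valid' for one raw value."""
    if value is None:
        return "system"   # no information about why the value is absent
    if value in USER_MISSING:
        return "user"     # the reason is known from the code label
    return "valid"


# Made-up answers to a single question.
raw = [1, 2, None, 9, 3, 8, 2, None]

counts = {"valid": 0, "user": 0, "system": 0}
for v in raw:
    counts[classify(v)] += 1

print(counts)
```

Only the user-defined codes can later be reported by reason; the system-missing cases can merely be counted.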

  5. Missing values - procedure. Any time you get a new data set: 1. Determine whether any missing values are defined in the data set, and how. Don't rely on the documentation (e.g. the codebook); always check for yourself in the data. If none are defined: 2. Assign the "inappropriate" values as missing values (possibly with recoding or other data transformations) - - - (see QDA II.) 3. Substantive analysis of missing values: a) Can we ignore them? Are they missing at random? If not: b) Analysis of their dependency on other variables - - - (very advanced strategy) (4. Imputation of missing values (estimating the values that are missing) and their handling in multivariate analysis (listwise/pairwise deletion and various imputation methods))

  6. 1. Inspecting data: the easiest approach to MV • Looking over the settings in the Data Editor (the Missing column) is not enough; we must always inspect the data. • For a larger number of variables, in the first step it often suffices to use the simple tabular command DESCRIPTIVES → we inspect the Minimum and Maximum values in the data and compare them with the admissible values in the questionnaire. This mostly reveals suspicious maximum values, but be careful: it is not reliable, especially for categorical variables! • The FREQUENCIES command is the only reliable one, because it lists the occurrences of all values and whether or not they are designated as MV. For many variables, however, we get a lot of tables. • The MVA (MISSING VALUE ANALYSIS) command also gives a clear count of MV (but not in detail which values). For detecting MV it is a slightly better strategy than DESCRIPTIVES, but it is not available in the Base version of SPSS.

  7. Missing values – identification (DESC, FREQ, MVA): DESCRIPTIVES PI.1a. → not always reliable! FREQUENCIES PI.1a. → complete information about all values/categories. MVA PI.1a.
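
Why a minimum/maximum check is weaker than a full frequency table can be shown with a small sketch. It is written in Python, not SPSS, and assumes a made-up item with valid codes 1 to 5 plus the codes 8 and 9 for non-substantive answers; the variable names are illustrative only.

```python
from collections import Counter

# Assumed valid answer codes of a hypothetical 5-point item.
valid_codes = {1, 2, 3, 4, 5}

# Made-up answers; 8 and 9 are candidate missing codes.
answers = [1, 3, 5, 9, 2, 8, 4, 9, 1, 3]

# Min/max check (the DESCRIPTIVES idea): the maximum exposes 9,
# but 8 lies inside the observed range and stays hidden.
lowest, highest = min(answers), max(answers)

# Frequency table (the FREQUENCIES idea): every occurring value is listed.
freq = Counter(answers)
outside_valid = sorted(v for v in freq if v not in valid_codes)

print(lowest, highest)
print(outside_valid)
```

The frequency table finds both codes, whereas the maximum alone points only to the larger one.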

  8. 2. Assigning missing values: MISSING VALUES Var1 Var2 Var3 … (0 8 9). (Var2, Var3 … are optional further variables.) → You can set at most 3 specific values, or a range: (LOWEST THRU 5). or (8 THRU HIGHEST). or a combination of a range and one value: (5 8 THRU HIGHEST). This can also be done in the Data Editor (by clicking with the mouse), but the syntax makes the data manipulation easy to check and document, and it can be reused later on other data.

  9. Identification and assigning missing values. Example: "Age of university students at HiSo". Values 12 and 92 are outside the plausible age range of university (HiSo) students, so we assign them as missing. Use the command syntax: MISSING VALUES age (12 92). or the Data Editor (by clicking in the menu). FREQUENCIES age. Note: Once the MV are assigned, nothing seems to happen; in fact we have only marked the MV in the data. So it is good to print the frequency table again: FREQUENCIES age. At the same time, we see that no User Missings have been defined in the data set yet. (There are only 2 cases of System Missing, SYSMIS.)

  10. Assigning missing values as a range. Example: "Age of university students at HiSo". Assigning a range of MV is practical because it will also apply to any cases added in the future (or to later data manipulation) • from the minimum to a specific value: MISSING VALUES age (LOWEST THRU 20). • from a specific value to the maximum: MISSING VALUES age (50 THRU HIGHEST). • and we can add one specific value to the range: MISSING VALUES age (50 THRU HIGHEST 12).
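
The logic of combining discrete values with a range can be sketched as a small predicate. This is an illustration in Python, not SPSS syntax; the function `is_user_missing`, the constants `LOWEST` and `HIGHEST` and the ages are assumptions made for the example. Range bounds are treated as inclusive.

```python
import math

# Stand-ins for the open ends of a range.
LOWEST, HIGHEST = -math.inf, math.inf


def is_user_missing(x, discrete=(), ranges=()):
    """True if x equals a discrete missing code or falls in a missing range."""
    return x in discrete or any(lo <= x <= hi for lo, hi in ranges)


# Made-up ages of students.
ages = [12, 19, 21, 23, 25, 34, 50, 92]

# Range plus one discrete value: 50 and above, and the single value 12.
rule = {"discrete": (12,), "ranges": ((50, HIGHEST),)}
valid_ages = [a for a in ages if not is_user_missing(a, **rule)]

# An empty rule marks nothing as missing, so every value counts as valid.
all_ages = [a for a in ages if not is_user_missing(a)]

print(valid_ages)
print(len(all_ages))
```

Because the rule is a range, a newly added case with age 97 would be treated as missing without any change to the definition.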

  11. "Switching off/on" missing values in syntax • Missing values can be "switched off" simply → leave the brackets empty. MISSING VALUES age ( ). FREQUENCIES age. Now all values are included in the analysis. (Of course this does not apply to System Missing values; they remain excluded.) And again we can simply "switch them on": MISSING VALUES age (12 92). FREQUENCIES age.

  12. Why is it so important to detect and exclude user defined missing values? "Inappropriate" values can distort our estimates, particularly the mean, variance and correlation. The risk of obtaining biased results is lower for categorical variables, which we present in frequency tables, where we usually notice such values.

  13. Example: Number of children born → mean without/with missing values assigned (population census data). [Two output tables: "No missing values defined" and "Missing values defined"]
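
The size of the distortion is easy to reproduce with a few numbers. The sketch below is in Python and uses made-up values, not the census data from the slide; it assumes that 99 is the code for "not ascertained".

```python
# Assumed missing code and made-up numbers of children born.
MISSING_CODE = 99
children = [0, 1, 2, 2, 3, 1, 0, 2, 99, 99]

# Mean with the code 99 wrongly treated as a real number of children.
mean_no_mv_defined = sum(children) / len(children)

# Mean after the code is excluded, as it would be once declared missing.
valid = [c for c in children if c != MISSING_CODE]
mean_mv_defined = sum(valid) / len(valid)

print(mean_no_mv_defined)
print(mean_mv_defined)
```

Two coded cases out of ten are enough to move the mean from about 1.4 to about 21 children.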

  14. Some notes about Missing Values (1) • It can happen that we have assigned some missing values which in fact do not occur in the frequency table. This is logical, because an analysis (such as the Missing section of a Frequencies table) shows only values (incl. missing ones) that actually occur in the data. (However, SPSS still keeps our definition of potential missing values.) The missing values actually assigned can be viewed with DISPLAY: DISPLAY DICTIONARY /VARIABLES = age. • Note also the situation where a certain value appears several times in the Frequencies table, e.g. 1 1 1: these are in fact different values, e.g. 0.6, 0.9 and 1 (0.6 and 0.9 are rounded to 1 because the variable format has no decimal places, so they all display as 1) → change the format of the variable: FORMATS friends (F8.1). [Two tables: Number of friends]

  15. Some notes about Missing Values (2) • If there are several missing codes which do not form a continuous range, we need to recode them first and then use the range definition. For example, Income per month (in thousands) with the illogical values "-9 -7" and also "8888888 9999999" can be treated in this way: RECODE Income (8888888 = -8888888) (9999999 = -9999999) (ELSE = COPY). MISSING VALUES Income (LOW THRU -1). • It is reasonable to use negative values for coding/recoding missing values (-9 -8 instead of 9 8) → they are more visible.
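
The two-step idea (first move scattered codes below zero, then declare everything up to -1 as missing) can be sketched as follows. This is a Python illustration, not SPSS; it assumes the high codes are 8888888 and 9999999, and the incomes are made up.

```python
# Assumed high-valued missing codes that do not form one range with -9 and -7.
HIGH_CODES = (8888888, 9999999)

# Made-up monthly incomes (in thousands) with scattered missing codes.
income = [25, -9, 40, 8888888, 31, 9999999, -7, 18]

# Step 1 (the RECODE idea): flip the high codes to their negative
# counterparts and copy all other values unchanged.
recoded = [-v if v in HIGH_CODES else v for v in income]

# Step 2 (the range idea): everything up to and including -1 is missing,
# so only values above -1 remain valid.
valid_income = [v for v in recoded if v > -1]

print(recoded)
print(valid_income)
```

After the recode a single range covers all four missing codes, which is what makes the one-line range definition possible.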

  16. Missing values: how to treat/analyse them - rule of thumb • If the share of missing values is less than ca 5%, we can mostly ignore them (in a "large enough" sample). But be careful about how they combine in bivariate (and higher-order) analysis: 5% missing on var1 and 5% on var2 can yield anywhere from 5% in total (the same cases are missing on both) to 10% (different cases are missing on each)! • If the share of missing values exceeds this threshold (>5%), then an analysis of the missing values, i.e. of their dependence on other variables, is necessary (→ causes of MV). We should ask: "Who doesn't answer our questions?" And perhaps: "Is our question form valid?" • An incidence of MV above 5% is not necessarily random (i.e. randomly distributed in the population, with almost no harm to our results); this needs to be verified (and, where appropriate, we consider imputation of the missing values).
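
The 5%-versus-10% arithmetic follows from whether the same cases are missing on both variables. A small worked example in Python, assuming a hypothetical sample of 100 cases identified by the numbers 0 to 99:

```python
n = 100  # assumed sample size

# Case numbers with a missing value on var1 (5% of the sample).
missing_var1 = set(range(0, 5))

# Two made-up scenarios for var2, each again with 5% missing.
missing_var2_same_cases = set(range(0, 5))    # complete overlap
missing_var2_other_cases = set(range(5, 10))  # no overlap


def share_lost(miss_a, miss_b, n_cases):
    """Share of cases missing on at least one of the two variables."""
    return len(miss_a | miss_b) / n_cases


best_case = share_lost(missing_var1, missing_var2_same_cases, n)
worst_case = share_lost(missing_var1, missing_var2_other_cases, n)

print(best_case, worst_case)
```

Any partial overlap gives a value between these two bounds, so the loss in a bivariate table has to be checked in the data rather than inferred from the univariate shares.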

  17. Missing values, Step 3: their analysis

  18. Inspecting the Structure and Patterns of Missing Data • The first step in the analysis of incomplete data is to inspect the data. • If most of the missing values concern only one specific variable (e.g., household or personal income) and that variable is not central to the analysis, we may decide to delete it. • The same applies to a single respondent with many missing values. • However, missing values are usually scattered throughout the entire data matrix. • In that case, we should know whether the missing data form a pattern and whether missingness is related to some other variables. Source: [Lavrakas2008: 469]

  19. Analysis of dependency & mutual interdependence of missing values. We are dealing with two issues: a) how the missing values are intertwined across (dependent) variables (e.g. in an item battery); b) whether they depend on grouping factors (e.g. age, education, income or a filter question). • The simplest procedure: "switching off" the missing values (so that they are included in the analysis) and analysing the relevant categories, e.g. in a contingency table. • MVA (Missing Value Analysis) (in the advanced version of SPSS only). • Constructing a new variable that carries information about the missing value (or values, for several variables) and analysing it separately (or including it in a model): a dichotomised variable indicating missing vs. valid value, or a count variable indicating how many times missing values occurred within a set of variables (e.g. in an item battery).
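
The last strategy, building an indicator and a count variable and relating them to a grouping factor, can be sketched as follows. The example is in Python rather than SPSS; the three-item battery `q1` to `q3`, the codes 8 and 9, the `edu` factor and all data are made up for illustration.

```python
# Assumed user-defined missing codes; None stands in for system-missing.
USER_MISSING = {8, 9}
ITEMS = ("q1", "q2", "q3")  # hypothetical item battery on a 1-5 scale


def is_missing(value):
    return value is None or value in USER_MISSING


# Made-up respondents with a grouping factor (education).
respondents = [
    {"edu": "high", "q1": 3, "q2": 4, "q3": 2},
    {"edu": "low", "q1": 9, "q2": 2, "q3": None},
    {"edu": "high", "q1": 1, "q2": 8, "q3": 5},
    {"edu": "low", "q1": None, "q2": None, "q3": 9},
]

for r in respondents:
    # Count variable: number of missing answers within the battery.
    r["n_missing"] = sum(is_missing(r[item]) for item in ITEMS)
    # Dichotomised indicator: 1 = at least one missing, 0 = all valid.
    r["any_missing"] = int(r["n_missing"] > 0)


def share_with_missing(group):
    """Share of respondents in an education group with any missing answer."""
    members = [r for r in respondents if r["edu"] == group]
    return sum(r["any_missing"] for r in members) / len(members)


print([r["n_missing"] for r in respondents])
print(share_with_missing("low"), share_with_missing("high"))
```

Comparing the indicator across groups is the tabular counterpart of asking who does not answer; with real data the difference would of course need a proper test.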

  20. MVA – Missing Value Analysis • MVA can reveal patterns of missing values that occur simultaneously across a set of variables. Note: MVA is not available in the Base version of SPSS (but we can help ourselves with some tricks).

  21. MVA - notes • Don't use weighting; if a weight is active, switch it off first → WEIGHT OFF. • Basic features: a description of the missing values + missing value patterns. • It distinguishes numeric and categorical variables: MVA age income gender region /CATEGORICAL gender region.

  22. MVA Output (1) Basic output (with request for categorical variables) MVA age income gender region /CATEGORICAL gender region.

  23. MVA Output (2) • Patterns of missing values across a set of variables → How many respondents did not answer how many items from the battery of questions? This table shows only the coding of MV across a set of variables, not the occurrence of patterns as such. 5 = Don't know, 6 = No answer. There are many other settings and outputs in MVA.

  24. Missing data can be: • Missing completely at random (MCAR) → the ideal situation: missingness is unrelated to any variable, observed or unobserved, so the results are not biased • Missing at random (MAR) → missingness may depend on other, observed variables, but not on the missing value itself once those variables are taken into account • Not missing at random (NMAR) → missingness depends (non-randomly) on the unobserved value itself or on some unobserved process or factor → the problem of biased results

  25. QDA I./II. Intersection of valid cases: restricting the analysis to complete cases with valid data. Analyses in a text (e.g. a report or thesis) should be made on a consistent subset with the same number of valid cases across variables. → Successive bivariate analyses should rest on the same basis of valid cases. This follows the LISTWISE principle = a case is excluded if it has a missing value on any of the variables (i.e. in a survey only those who answered all questions are reported). But this can be highly problematic, especially if there are many missing values (>5%) and/or they fall on different cases for different variables (no overlap); in effect it discards a lot of information.
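
How much information the listwise principle can discard is visible in a tiny example. The sketch is in Python, not SPSS; the three variables and five cases are made up, and `None` marks a missing value of either type.

```python
# Made-up cases: (age, income, region); None marks a missing value.
cases = [
    (23, 30, 1),
    (None, 28, 2),
    (31, None, 1),
    (27, 35, None),
    (29, 33, 2),
]

# Valid cases per variable: each variable on its own loses only one case.
valid_per_variable = [
    sum(case[i] is not None for case in cases) for i in range(3)
]

# Listwise deletion: keep only cases that are valid on every variable.
complete_cases = [c for c in cases if all(v is not None for v in c)]

print(valid_per_variable)
print(len(complete_cases), "of", len(cases))
```

Each variable has 20% missing, but because the gaps fall on different cases, the consistent listwise subset keeps only 40% of the sample.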
