BY DEVELOPING IMPUTATION STRATEGIES

Q2008 - ROME, 09-11 JULY 2008
Implementation and evaluation of imputation strategies to improve the data accuracy. The case of Italian students data from the Programme for International Student Assessment (PISA 2003)
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
University of Naples “Parthenope”
claudio.quintano@uniparthenope.it; lia.castellano@uniparthenope.it; sergio.longobardi@uniparthenope.it

OUTLINE IMPROVING THE ACCURACY OF ITALIAN DATA FROM THE OECD’s “Programme for International Student Assessment” (PISA 2003) BY DEVELOPING IMPUTATION STRATEGIES TO REDUCE THE NON-SAMPLING ERROR DUE TO PARTIAL NON-RESPONSES

PISA 2003 The OECD’s “Programme for International Student Assessment” (PISA) survey is an internationally standardised assessment administered to 15-year-old students in 41 countries (20 European Union members). The survey involves 276,165 students (11,639 in Italy) and 10,274 schools (406 in Italy)

PISA 2003 The survey assesses the students’ competencies in three areas: scientific literacy, reading literacy and mathematical literacy

AVAILABLE DATA The OECD collects data on:
• the family environment of the student (STUDENT DATASET)
• school characteristics (SCHOOL DATASET)

Multilevel (school and student) model with 4 covariates. ITALY: 8% of student units are excluded because one or more student or school variables are missing

Multilevel (school and student) model with 29 covariates. ITALY: 81% of student units are excluded because one or more student or school variables are missing
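
The jump from 8% to 81% is what listwise deletion produces when covariates are added: a unit is dropped as soon as any one of them is missing. A minimal sketch of that count, on invented toy records (the variable names `sex`, `grade`, `escs`, `ictuse` are hypothetical and do not reproduce the PISA figures):

```python
# Share of units lost to listwise deletion as covariates are added.
# Toy records; None marks a missing answer. All names are hypothetical.
records = [
    {"sex": 1, "grade": 10, "escs": 0.3, "ictuse": None},
    {"sex": 2, "grade": 10, "escs": None, "ictuse": 1.2},
    {"sex": 1, "grade": 9, "escs": -0.5, "ictuse": 0.4},
    {"sex": 2, "grade": 10, "escs": 0.1, "ictuse": None},
]


def excluded_share(rows, covariates):
    """Fraction of rows with at least one missing value among `covariates`."""
    dropped = sum(any(r[c] is None for c in covariates) for r in rows)
    return dropped / len(rows)


share_small = excluded_share(records, ["sex", "grade"])
share_large = excluded_share(records, ["sex", "grade", "escs", "ictuse"])
```

With these toy records no unit is lost with two covariates, while three of four are lost with all four.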

STEPS OF ANALYSIS
• Missing data pattern
• Imputation strategies
• Evaluation of results

OECD’S PISA DATASET: TWO SUBSETS OF VARIABLES
• COLLECTED VARIABLES Data collected by student and school questionnaires
• DERIVED VARIABLES Computed on collected variables (by linear combination or factor analysis). This increases the potential of the survey

EXAMPLE OF DERIVED VARIABLES
• The PISA 2003 index of confidence in ICT internet tasks is derived from students’ responses to the five items. All items are inverted for IRT scaling, and positive values on this index indicate high self-confidence in ICT internet tasks
• The PISA 2003 index of school size (SCHLSIZE) is derived by summing school principals’ responses on the number of girls and boys at a school
• The PISA 2003 index of availability of computers (RATCOMP) is derived from school principals’ responses to the items measuring the availability of computers. It is calculated by dividing the number of computers at school by the number of students at school
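
The two school-level indices are simple arithmetic on collected variables, which can be sketched as follows. The function and argument names are hypothetical, and the rule that a missing input yields a missing index is an assumption of this sketch, not something stated on the slide:

```python
def school_size(n_girls, n_boys):
    """SCHLSIZE: sum of reported girls and boys (missing if either is missing)."""
    if n_girls is None or n_boys is None:
        return None
    return n_girls + n_boys


def computer_ratio(n_computers, n_students):
    """RATCOMP: computers per student (missing if an input is missing or zero students)."""
    if n_computers is None or not n_students:
        return None
    return n_computers / n_students


size = school_size(310, 290)      # 600 students
ratio = computer_ratio(60, size)  # 0.1 computers per student
```

Under this assumption, one missing collected item propagates to every derived variable built on it, which is one reason partial non-response affects so many model covariates.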

“COLLECTED” AND “DERIVED” VARIABLES

STUDENTS’ DATASET

FIVE IMPUTATION PROCEDURES
• PROCEDURE A Iterative and sequential multiple regression applied to each section of the student questionnaire
• PROCEDURE B Iterative and sequential multiple regression applied to imputation classes computed by a regression tree
• PROCEDURE C Random selection of donors within imputation classes computed by a regression tree
• PROCEDURE D Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
• PROCEDURE E Iterative and sequential multiple regression applied to the whole dataset

USUAL ASSOCIATIONS AND ANTINOMIES OF ADOPTED IMPUTATION PROCEDURES (A-E)
• All procedures belong to well-known categories. Two categories are involved: regression methods (A, B, E) and donor methods (C, D)
• Dimension of the treated data matrix: the imputation procedure is (A, D) / is not (B, C, E) applied to each section of the questionnaire
• Two sides of the data matrix are involved: units (Classification And Regression Tree: B, C, D) and variables (A, D)
• The missing data mechanism is (A, E) / is not (B, C, D) considered

PROCEDURE A Iterative and sequential multiple regression (Raghunathan et al. 2001) on each section of the student questionnaire
The data matrix is partitioned into the seven sections of the student questionnaire. The features of each section, as a partition of the data matrix:
• Strong logical links between the questions
• Homogeneous structure of association and relationship
• Homogeneous presence of missing data

PROCEDURE A

PROCEDURE B Iterative and sequential multiple regression applied to imputation classes computed by a regression tree
STEP I UNITS CLASSIFICATION Computation of a regression tree (14 terminal nodes)
• DEPENDENT VARIABLE: missing data for each student
• PREDICTORS: selected from five categories of derived indicators θ: family background; scholastic context; approach to study; attitudes toward ICT instruments; performance scores
STEP II IMPUTATION Each terminal node of the tree is considered as an imputation class. Its missing values are imputed by an iterative and sequential regression model (Raghunathan et al. 2001)

PROCEDURE C Random selection of donors within imputation classes computed by a regression tree
STEP I UNITS CLASSIFICATION Computation of a regression tree (14 terminal nodes)
• DEPENDENT VARIABLE: missing data for each student
• PREDICTORS: selected from five categories of derived indicators θ: family background; scholastic context; approach to study; attitudes toward ICT instruments; performance scores
STEP II IMPUTATION A different donor is selected to impute each missing value of each student. The donor is selected randomly from the same node
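
The donor step can be sketched as a random hot-deck within classes. This is a minimal illustration, not the authors' implementation: the field names (`node`, `books`, `computer`) and the seed are hypothetical, and the class label is assumed to be already available for every student.

```python
import random


def hot_deck(rows, class_key, variables, seed=0):
    """Fill missing values (None) with values from random donors in the same class."""
    rng = random.Random(seed)
    imputed = [dict(r) for r in rows]  # leave the input untouched
    for var in variables:
        # Observed values of `var`, grouped by imputation class.
        donors = {}
        for r in rows:
            if r[var] is not None:
                donors.setdefault(r[class_key], []).append(r[var])
        for r in imputed:
            if r[var] is None and donors.get(r[class_key]):
                # A new donor is drawn for every missing value.
                r[var] = rng.choice(donors[r[class_key]])
    return imputed


students = [
    {"node": 1, "books": 3, "computer": 1},
    {"node": 1, "books": None, "computer": 1},
    {"node": 1, "books": 4, "computer": None},
    {"node": 2, "books": 1, "computer": 0},
    {"node": 2, "books": None, "computer": None},
]
completed = hot_deck(students, "node", ["books", "computer"])
```

Because a separate donor is drawn for each missing value, the values imputed for one student may come from different donors, which is what distinguishes this scheme from copying a whole donor record.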

PROCEDURE D Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
STEP I Matrix partition THE DATA MATRIX IS PARTITIONED INTO THE SEVEN SECTIONS OF THE STUDENT QUESTIONNAIRE
STEP II Units classification A REGRESSION TREE IS PRODUCED WITHIN EACH PARTITION OF THE MATRIX (see the next slide)
STEP III Imputation WITHIN ALL LEAVES, A DIFFERENT DONOR IS SELECTED TO IMPUTE EACH MISSING VALUE OF EACH STUDENT. THE DONOR IS SELECTED RANDOMLY FROM THE SAME NODE

PROCEDURE D

PROCEDURE E ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (Raghunathan et al. 2001) ON THE WHOLE DATASET (without any partition of units and variables)

METHODOLOGICAL DETAILS OF THE IMPUTATION PROCEDURES

CLASSIFICATION AND REGRESSION TREE: CREATE IMPUTATION CELLS
Classification and Regression Tree creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent (target) variable (Y) based on values of independent (predictor) variables (X). The classification is obtained through the recursive binary partition of the measurement space into subgroups (NODES) that are internally homogeneous with respect to the target variable values. The terminal nodes correspond to imputation cells
Node types: PARENT NODE, CHILD NODE, TERMINAL NODE
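
A recursive binary partition of this kind can be sketched in a few lines. This is an illustrative toy, not the software used by the authors: it uses the within-node sum of squares as the homogeneity criterion, and the stopping rules (`min_leaf`, `max_depth`) and data are hypothetical.

```python
def sse(ys):
    """Sum of squared deviations from the node mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(X, y, idx, min_leaf):
    """Return (gain, feature, threshold) of the best binary split, or None."""
    best = None
    parent = sse([y[i] for i in idx])
    for f in range(len(X[0])):
        for thr in sorted({X[i][f] for i in idx})[:-1]:
            left = [i for i in idx if X[i][f] <= thr]
            right = [i for i in idx if X[i][f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            gain = parent - sse([y[i] for i in left]) - sse([y[i] for i in right])
            if best is None or gain > best[0]:
                best = (gain, f, thr)
    return best


def grow(X, y, idx=None, min_leaf=2, depth=0, max_depth=3):
    """Recursively split; return the terminal nodes as lists of unit indices."""
    if idx is None:
        idx = list(range(len(y)))
    split = best_split(X, y, idx, min_leaf) if depth < max_depth else None
    if split is None or split[0] <= 1e-12:
        return [idx]  # terminal node = one imputation cell
    _, f, thr = split
    left = [i for i in idx if X[i][f] <= thr]
    right = [i for i in idx if X[i][f] > thr]
    return (grow(X, y, left, min_leaf, depth + 1, max_depth)
            + grow(X, y, right, min_leaf, depth + 1, max_depth))


# Toy data: one predictor and a numeric target (both hypothetical).
X = [[1.0], [1.2], [1.1], [3.0], [3.2], [3.1], [5.0], [5.1]]
y = [0, 0, 1, 4, 5, 4, 9, 10]
cells = grow(X, y)  # three cells: units 0-2, 3-5 and 6-7
```

Each returned list is one terminal node, so units that share a list would share an imputation cell.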

STRUCTURE OF A REGRESSION TREE
Example: a tree T composed of five nodes ti, i = 1, 2, 3, 4, 5 (diagram with nodes t1, t2, t3, t4, t5)
Impurity of a node t
For any split s of t into tL and tR, the best split s* is the one that maximises the decrease in impurity
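
The formulas on this slide did not survive in the transcript. For reference, the standard CART definitions for a regression tree are given here as an assumption about what the slide showed. $N_t$ is the number of units in node $t$, $\bar{y}(t)$ their mean target value, and $p_L$, $p_R$ the proportions of units of $t$ sent to $t_L$ and $t_R$:

```latex
\begin{align}
  i(t) &= \frac{1}{N_t} \sum_{n \in t} \bigl( y_n - \bar{y}(t) \bigr)^2 \\
  \Delta i(s, t) &= i(t) - p_L \, i(t_L) - p_R \, i(t_R) \\
  \Delta i(s^{*}, t) &= \max_{s \in S} \, \Delta i(s, t)
\end{align}
```

Here $S$ is the set of candidate binary splits of node $t$.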

ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (1/2)
STEP 1 PARTITION OF THE VARIABLES Variables without missing data (X) and variables with missing data (Y). The variable with the fewest missing values, Y1, is regressed on the subset of variables without missing data, U = X
STEP 2 Update U by appending Y1. Then the variable with the next fewest missing values, Y2, is regressed on U = (X, Y1), where Y1 has imputed values
STEP 3 ...
STEP N Each variable is imputed by using all available variables (complete or imputed)

ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (2/2) NEXT ROUND THE IMPUTATION PROCESS IS THEN REPEATED, MODIFYING THE PREDICTOR SET TO INCLUDE ALL X AND Y VARIABLES EXCEPT THE ONE USED AS THE DEPENDENT VARIABLE. ALL MISSING DATA ARE IMPUTED FOR EACH VARIABLE
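
The two rounds can be sketched as follows. This is a deliberately simplified illustration: every variable is treated as continuous and imputed with the deterministic least-squares prediction, whereas the method of Raghunathan et al. (2001) chooses the regression model by variable type and draws imputations from a predictive distribution. All function names, variable names and data are hypothetical.

```python
def ols_fit(rows, targets):
    """Least-squares coefficients (intercept first) from the normal equations."""
    A = [[1.0] + list(r) for r in rows]
    k = len(A[0])
    # Augmented system [A'A | A'y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * t for a, t in zip(A, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting (fails if A'A is singular).
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * z for x, z in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def fill(done, target, predictors, missing_idx):
    """Fit on units where `target` was observed, predict where it was missing."""
    missing = set(missing_idx)
    observed = [i for i in range(len(done[target])) if i not in missing]
    beta = ols_fit([[done[p][i] for p in predictors] for i in observed],
                   [done[target][i] for i in observed])
    for i in missing:
        row = [done[p][i] for p in predictors]
        done[target][i] = beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def sequential_impute(data, rounds=3):
    """data: dict name -> list of numbers, None = missing. Returns a completed copy."""
    miss = {v: [i for i, x in enumerate(col) if x is None] for v, col in data.items()}
    done = {v: list(col) for v, col in data.items()}
    U = [v for v in data if not miss[v]]  # X: variables without missing data
    order = sorted((v for v in data if miss[v]), key=lambda v: len(miss[v]))
    for v in order:  # first round: Y1 on X, then Y2 on (X, Y1), ...
        fill(done, v, U, miss[v])
        U.append(v)
    for _ in range(rounds - 1):  # next rounds: all other variables as predictors
        for v in order:
            fill(done, v, [u for u in data if u != v], miss[v])
    return done


data = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "y1": [2.1, 3.9, 6.2, 8.0, None, 12.1, 13.8, 16.2],
    "y2": [1.0, None, 2.9, 4.2, 5.1, None, 6.8, 8.1],
}
completed = sequential_impute(data)
```

Observed values are never overwritten; only the originally missing cells are re-imputed in each round.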

EVALUATION OF IMPUTATION PROCEDURES
• IMPACT ON UNIVARIATE DISTRIBUTIONS
• RELATIONSHIP BETWEEN VARIABLES

IMPUTATION EFFECTS ON UNIVARIATE DISTRIBUTIONS
• CATEGORICAL VARIABLES: absolute relative square dissimilarities index (Leti 1983). N denotes the number of categorical variables
• CONTINUOUS VARIABLES: absolute relative variation index (among means); absolute relative variation index (among standard deviations)
The education survey data have been analysed with multilevel models.
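
The index formulas are not preserved in the transcript. One plausible reading of the two indices for continuous variables, shown here purely as an assumption, is the absolute change of the statistic after imputation relative to its value on the observed data:

```python
from statistics import mean, stdev


def abs_relative_variation(stat, before, after):
    """|stat(after) - stat(observed)| / |stat(observed)| (assumed form of the index)."""
    observed = [v for v in before if v is not None]
    s0 = stat(observed)
    return abs(stat(after) - s0) / abs(s0)


# Toy variable: None marks values that were missing before imputation.
before = [2.0, 4.0, None, 6.0, None]
after = [2.0, 4.0, 4.0, 6.0, 5.0]

arv_mean = abs_relative_variation(mean, before, after)  # mean moves from 4.0 to 4.2
arv_sd = abs_relative_variation(stdev, before, after)   # spread shrinks after imputation
```

Under this reading, a value close to 0 means the imputation left the mean or the standard deviation almost unchanged.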

IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (1/2) Variation Association Index (categorical variables) Mean difference, for each imputed variable (Yj), between the association of Yj with the remaining n-1 categorical variables before and after imputation

IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (2/2) Variation Association Index (continuous variables) Mean difference, for each imputed variable (Yj), between the correlation of Yj with the remaining n-1 continuous variables before and after imputation
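
For continuous variables the index can be sketched as below. Two choices here are assumptions of the sketch rather than statements from the slides: differences are taken in absolute value, and the pre-imputation correlation is computed on the pairs of values observed for both variables.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation on the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def variation_association(target, before, after):
    """Mean absolute change in correlation of `target` with every other variable."""
    others = [k for k in before if k != target]
    diffs = [abs(pearson(after[target], after[k]) - pearson(before[target], before[k]))
             for k in others]
    return sum(diffs) / len(diffs)


before = {"y": [1.0, 2.0, None, 4.0], "a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]}
good = dict(before, y=[1.0, 2.0, 3.0, 4.0])   # imputed value preserves the pattern
bad = dict(before, y=[1.0, 2.0, 10.0, 4.0])   # imputed value distorts it

vai_good = variation_association("y", before, good)
vai_bad = variation_association("y", before, bad)
```

An imputation that preserves the correlation structure gives an index near 0; a distorting one gives a larger value.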

MATRIX V x P - “VARIABLES x PROCEDURES”

SCORES MATRICES Each of the five matrices VPG (N x 5), whose generic element is gjs, is transformed into a (0,1) score matrix S (N x 5) with generic element ajs:
ajs = 1 if gjs is the minimum value in row j
ajs = 0 otherwise

BUILDING A RANKING INDICATOR(1/3) The ranking indicators measure the relative performance of each procedure according to each evaluation index

BUILDING A RANKING INDICATOR (2/3) The vector of 0,1 scores extracted from the S matrix (for each procedure and for each evaluation indicator) is reduced to a scalar as the sum of its elements. This sum is divided by the number of vector elements to obtain a ranking index R whose range is [0,1]

BUILDING A RANKING INDICATOR (3/3) The ranking indicators measure the relative performance of each procedure according to each evaluation index
• R = 0: lowest performance of the s-th procedure compared to the other procedures for a generic evaluation index G
• R = 1: highest performance of the s-th procedure compared to the other procedures for a generic evaluation index G
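
The passage from an evaluation matrix to the ranking indicator can be sketched as follows. The matrix values are invented for illustration, with rows as variables and columns as procedures A to E; letting tied minima all score 1 is an assumption of the sketch.

```python
def score_matrix(G):
    """a_js = 1 if g_js equals the minimum of row j, else 0 (ties all score 1)."""
    return [[1 if g == min(row) else 0 for g in row] for row in G]


def ranking_indicator(G):
    """R_s = share of variables (rows) for which procedure s attains the row minimum."""
    S = score_matrix(G)
    n = len(S)
    return [sum(row[s] for row in S) / n for s in range(len(S[0]))]


# Hypothetical evaluation index values: 4 variables x 5 procedures (A, B, C, D, E).
G = [
    [0.10, 0.04, 0.07, 0.09, 0.12],
    [0.02, 0.05, 0.03, 0.08, 0.06],
    [0.11, 0.03, 0.09, 0.10, 0.05],
    [0.07, 0.02, 0.04, 0.06, 0.09],
]
R = ranking_indicator(G)  # [0.25, 0.75, 0.0, 0.0, 0.0]
```

In this toy matrix procedure B gives the smallest index value for three of the four variables, so its R is 0.75.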

FROM AN EVALUATION INDICATOR TO A RANKING INDICATOR

EVALUATING THE IMPACT ON MARGINAL DISTRIBUTIONS AND ON SOME DISTRIBUTIVE PARAMETERS

EVALUATING THE IMPUTATION IMPACT ON THE ASSOCIATION AMONG VARIABLES

CONCLUDING REMARKS
• MISSING DATA IMPUTATION IS AN EXTREMELY COMPLEX PROCESS
• EACH METHOD SHOWS CRITICAL ASPECTS
• IT IS IMPORTANT TO DEVELOP A RECONSTRUCTION STRATEGY CONSIDERING SOME BASIC ASPECTS: THE MISSING DATA PATTERN; THE IMPACT ON THE STATISTICAL DISTRIBUTIONS; THE IMPACT ON THE ASSOCIATIONS AMONG VARIABLES
