Evaluation Methodology

Evaluation Methodology Fatemeh Vahedian CSC-426, Week 6

Outline • Evaluation Methodology • Work Load • Experimental design • Rigorous analysis • Measurement • Levels of measurement • Reliability • True score theory of measurement • Measurement Error • Theory of Reliability • Reliability Types • Construct validity • Measurement validity types • Idea of Construct Validity • Convergent Validity • Discriminant Validity • Threats to Construct Validity • Approaches to Assess Validity

Evaluation description • Professional evaluation is defined as the systematic determination of quality or value of something (Scriven 1991) • Evaluation methodology underpins all innovation in experimental computer science • Evaluation is a systematic determination of a subject's merit using criteria governed by a set of standards • Evaluation is the structured interpretation and giving of meaning to predicted or actual impacts of proposals or results

Evaluation Methodology • Evaluation methodology requirements: • Relevant workloads • Appropriate experimental design • Rigorous analysis

Work Load • Relevant and diverse • No workload is definitive; the question is whether it meets the test objective • Each candidate should be evaluated quantitatively and qualitatively (e.g. specific platform, domain, application) • Widely used (e.g. open source applications) • Nontrivial (e.g. not toy systems) • Suitable for research • Tractable: easy to use and manageable • Repeatable • Standardized • Workload selection: it should reflect a range of behaviors and not just what we are looking for

Experimental design • Meaningful baseline • Comparing against the state of the art • Widely used workloads • Not always practical (e.g. no publicly available implementation) • Comparisons that control key parameters • Understanding what to control and how to control it in an experimental system is clearly important • E.g. comparing two garbage collection schemes while controlling the heap size • Bound and free parameters • Degrees of freedom
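The garbage collection example above can be sketched as an experiment plan. This is a minimal Python sketch, not a prescribed procedure: the collector names, heap sizes, and trial count are hypothetical values chosen for illustration. It treats heap size as the controlled parameter and the collector as the parameter under comparison, so collectors are only ever compared at the same heap size.

```python
from itertools import product

# Hypothetical values for illustration only; not taken from the slides.
COLLECTORS = ["collector_a", "collector_b"]   # parameter under comparison
HEAP_SIZES_MB = [256, 512, 1024]              # controlled parameter
TRIALS = 5                                    # repeated runs per configuration


def build_plan(collectors, heap_sizes, trials):
    """Return one run descriptor per (heap size, collector, trial)."""
    return [
        {"heap_mb": heap, "collector": gc, "trial": t}
        for heap, gc, t in product(heap_sizes, collectors, range(trials))
    ]


def group_by_heap(plan):
    """Group runs so collectors are only compared at the same heap size."""
    groups = {}
    for run in plan:
        by_collector = groups.setdefault(run["heap_mb"], {})
        by_collector.setdefault(run["collector"], []).append(run)
    return groups


plan = build_plan(COLLECTORS, HEAP_SIZES_MB, TRIALS)
groups = group_by_heap(plan)
```

Each entry of `groups` holds the runs for one heap size, so any difference observed inside a group can be attributed to the collector rather than to the heap size.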

Experimental Design • Control changing environment and unintended variance • E.g. host platform, OS, arrival rate • Scheduling schemes, workloads • Network latency & traffic • Differences in environment • Controlling nondeterminism • Understand key variance points • Because of nondeterminism, results usually do not reach the same steady state even on a deterministic workload • Take multiple measurements and generate sufficient data points • Statistically analyze results for differences caused by the remaining nondeterminism

Rigorous analysis • Researchers use data analysis to identify and articulate the significance of experimental results • Challenging in complex systems because of the sheer volume of results • Aggregating data across repeated experiments is a standard technique for increasing confidence in a noisy environment • Since noise cannot be eliminated altogether, multiple trials are inevitably necessary • Reducing nondeterminism • Researchers have only finite resources, so reducing sources of nondeterminism with sound experimental design improves tractability • Statistical confidence intervals and significance • Show best and worst cases
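Aggregating repeated trials into a confidence interval, and reporting best and worst cases, could look like the sketch below. It is a minimal illustration under stated assumptions: the run times are made-up values, and the interval uses a normal approximation from Python's standard library; with only a handful of trials a Student's t interval would be the more appropriate choice.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, confidence=0.95):
    """Return (mean, low, high) using a normal-approximation interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two trials")
    m = mean(samples)
    standard_error = stdev(samples) / n ** 0.5
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m, m - z * standard_error, m + z * standard_error


# Hypothetical run times in seconds from ten repeated trials.
run_times = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4, 9.7, 10.1]

avg, low, high = confidence_interval(run_times)
best, worst = min(run_times), max(run_times)
```

If the intervals of two configurations overlap heavily, the measured difference may be due to the remaining noise rather than to the change being evaluated.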

Measurement and Levels of Measurement • Measurement is the process of observing and recording the observations that are collected as part of a research effort • Level of Measurement: • The level of measurement refers to the relationship among the values that are assigned to the attributes for a variable

Measurement and Levels of Measurement • There are typically four levels of measurement: • Nominal • Ordinal • Interval • Ratio • Why is the level of measurement important? • Knowing the level of measurement helps you decide how to interpret the data from that variable • It helps you decide what statistical analysis is appropriate for the values that were assigned
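One way to see how the level of measurement constrains the analysis is to restrict which summary statistics are computed at each level. The mapping below is a simplified sketch, assuming the common convention that the mode suits nominal data, the median additionally suits ordinal data, and the mean additionally suits interval and ratio data; the function and dictionary names are illustrative.

```python
from statistics import mean, median, mode

# Summary statistics commonly treated as meaningful at each level.
# This mapping is a simplification for illustration.
ALLOWED = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean"},
}
STATS = {"mode": mode, "median": median, "mean": mean}


def summarize(values, level):
    """Compute only the statistics that make sense for the given level."""
    return {name: STATS[name](values) for name in sorted(ALLOWED[level])}


# Nominal: category labels have no order, so only the mode is reported.
platforms = summarize(["linux", "linux", "macos"], "nominal")

# Ordinal: ranks have an order, so the median is also reported.
ranks = summarize([1, 2, 2, 3, 5], "ordinal")
```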

Reliability • Reliability is the consistency or repeatability of your measures • True score theory of measurement: the foundation of reliability theory • Different types of measurement error • Theory of reliability • Different types of reliability • The relationship between reliability and validity in measurement

Reliability / True score theory of measurement • An observed score consists of two components: • true ability (or the true level) of the respondent on that measure • random error • Observed score = True ability + Random error • Why is it important? • It reminds us that most measurement has an error component • True score theory is the foundation of reliability theory • A measure that has no random error (i.e., is all true score) is perfectly reliable; a measure that has no true score (i.e., is all random error) has zero reliability • True score theory can be used in computer simulations as the basis for generating "observed" scores with certain known properties
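The simulation use mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only: the normal distributions, the mean of 50, and the standard deviations are assumptions chosen for the example. Each observed score is generated as a true score plus independent random error, so the share of observed variance due to true scores is known in advance.

```python
import random
from statistics import pvariance


def simulate_observed(n, true_sd, error_sd, seed=0):
    """Generate n true scores and observed scores = true score + random error."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, observed


true_scores, observed = simulate_observed(n=20000, true_sd=10.0, error_sd=5.0)

# Share of observed variance that is true-score variance.
simulated_ratio = pvariance(true_scores) / pvariance(observed)

# Value implied by the chosen standard deviations: 100 / (100 + 25).
expected_ratio = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)
```

With no random error (`error_sd=0.0`) the observed scores equal the true scores and the ratio is 1, matching the statement that such a measure is perfectly reliable.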

Measurement Error • Random error • Caused by any factors that randomly affect measurement of the variable • Sums to 0 across the sample, so it does not affect the average • Systematic error • Caused by any factors that systematically affect measurement • Consistently either positive or negative • Reducing measurement error • Pilot study • Training • Double-check the data thoroughly • Use statistical procedures • Use multiple measures

Theory of Reliability • In research, reliability means repeatability or consistency • A measure is considered reliable if it would give us the same result over and over again • Reliability is a ratio or fraction: • variance of the true score / variance of the entire measure: var(T)/var(X) • Estimated by the correlation between two observations of the same measure: cov(X1, X2) / (sd(X1) * sd(X2)) • We cannot calculate reliability exactly because we cannot measure the true score, but we can estimate it (as a value between 0 and 1)
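Why the correlation between two observations of the same measure estimates var(T)/var(X) can be shown in a short derivation. It is a sketch under the usual true score assumptions, which are stated here rather than on the slide: both observations share the same true score T, the errors e1 and e2 are uncorrelated with T and with each other, and both observations have the same variance var(X).

```latex
\begin{align*}
X_1 &= T + e_1, \qquad X_2 = T + e_2 \\
\operatorname{cov}(X_1, X_2)
  &= \operatorname{var}(T) + \operatorname{cov}(T, e_2)
   + \operatorname{cov}(e_1, T) + \operatorname{cov}(e_1, e_2)
   = \operatorname{var}(T) \\
\operatorname{sd}(X_1)\,\operatorname{sd}(X_2)
  &= \sqrt{\operatorname{var}(X_1)\,\operatorname{var}(X_2)}
   = \operatorname{var}(X) \\
\frac{\operatorname{cov}(X_1, X_2)}{\operatorname{sd}(X_1)\,\operatorname{sd}(X_2)}
  &= \frac{\operatorname{var}(T)}{\operatorname{var}(X)}
\end{align*}
```

The three covariance terms involving errors vanish by assumption, which is why only var(T) remains in the numerator.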

Reliability Types • Inter-rater or inter-observer reliability • Raters for categories • Calculate the correlation • Test-retest reliability • The shorter the time gap, the higher the correlation • Parallel-forms reliability • Create a large set of questions that address the same construct, divide them into two sets, and administer both to the same sample • Internal consistency reliability • Average inter-item correlation • Average item-total correlation • Split-half reliability • Cronbach's alpha (α)
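As an illustration of internal consistency, Cronbach's alpha can be computed from item scores with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below assumes the data layout (one list of scores per item, one position per respondent) and uses made-up scores.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    items: list of k lists; items[i][j] is respondent j's score on item i.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per respondent, summed across items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical scores: two items answered by four respondents.
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])
```

When every item yields identical scores for each respondent, the function returns 1; the less the items agree, the lower the value.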

Reliability & Validity • The center of the target is the concept that you are trying to measure

Construct validity / Definition • Construct validity has traditionally been defined as the experimental demonstration that a test is measuring the construct it claims to be measuring • What is a construct? A construct, or psychological construct as it is also called, is an attribute, proficiency, ability, or skill that happens in the human brain and is defined by established theories • A construct is a concept. A clearly specified research question should lead to a definition of study aim and objectives that set out the construct and how it will be measured. Increasing the number of different measures in a study will increase construct validity, provided that the measures are measuring the same construct • [Figure labels: Idea, Program, Construct, Operationalization]

Measurement validity types • Translation validity • Face validity • Content validity • Criterion-related validity • Predictive validity • Concurrent validity • Convergent validity • Discriminant validity

Idea of Construct Validity • Construct validity is an assessment of how well you translated your ideas or theories into actual programs or measures • Why is construct validity important? • Truth in labeling

Convergent & Discriminant Validity • Convergent and discriminant validity are both considered subcategories or subtypes of construct validity • Convergent Validity • To establish convergent validity, we need to show that measures that should be related are in reality related

Discriminant Validity • To establish discriminant validity, you need to show that measures that should not be related are in reality not related
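Both checks come down to comparing correlations: high between measures that should be related, near zero between measures that should not. The sketch below is a minimal illustration with made-up scores for six respondents; the variable names and data are hypothetical, and what counts as "high" or "near zero" depends on the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for six respondents (illustrative values only).
construct_a_measure_1 = [2, 4, 5, 7, 8, 9]
construct_a_measure_2 = [1, 5, 4, 8, 7, 10]   # same construct, second measure
construct_b_measure = [9, 7, 10, 8, 10, 8]    # construct expected to be unrelated

# Convergent evidence: two measures of the same construct should correlate.
convergent_r = pearson(construct_a_measure_1, construct_a_measure_2)

# Discriminant evidence: measures of different constructs should not.
discriminant_r = pearson(construct_a_measure_1, construct_b_measure)
```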

Threats to Construct Validity • Inadequate preoperational explication of constructs • Mono-operation bias • Mono-method bias • Interaction of different treatments • Interaction of testing and treatment • Restricted generalizability across constructs • Confounding constructs and levels of constructs • "Social" threats • Hypothesis guessing • Evaluation apprehension • Experimenter expectancies

Approaches to Assess Validity • Nomological network • a theoretical basis • Multitrait-multimethod matrix • to demonstrate both convergent and discriminant validity • Pattern matching

Nomological network • It includes a theoretical framework for what you are trying to measure, an empirical framework for how you are going to measure it, and specification of linkages among and between these two frameworks • It does not provide a practical and usable methodology for actually assessing construct validity

Pattern matching • It is an attempt to link two patterns: a theoretical pattern and an observed pattern • It requires that • you specify your theory of the constructs precisely • you structure the theoretical and observed patterns the same way so that you can directly correlate them • Common example • ANOVA table

Multitrait-multimethod matrix • MTMM is a matrix of correlations arranged to facilitate the assessment of construct validity • It is based on convergent and discriminant validity • It assumes that you have several concepts and several measurement methods, and you measure each concept by each method • To determine the strength of the construct validity: • Reliability coefficients should be the highest in the matrix • Coefficients in the validity diagonal should be significantly different from zero and high enough • The same pattern of trait interrelationship should be seen in all triangles
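The structure of the matrix can be sketched by correlating every pair of measures and sorting each correlation into the block it belongs to. This is a simplified illustration with two hypothetical traits, two hypothetical methods, and made-up scores; it covers only the off-diagonal correlations, since the reliability diagonal would need repeated measurements that are not modeled here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def classify_pairs(measures):
    """Sort pairwise correlations into MTMM blocks.

    measures: dict mapping (trait, method) -> list of scores.
    """
    blocks = {
        "validity": [],                  # same trait, different method
        "heterotrait_monomethod": [],    # different trait, same method
        "heterotrait_heteromethod": [],  # different trait, different method
    }
    for (key_a, a), (key_b, b) in combinations(measures.items(), 2):
        same_trait = key_a[0] == key_b[0]
        same_method = key_a[1] == key_b[1]
        if same_trait and not same_method:
            blocks["validity"].append(pearson(a, b))
        elif not same_trait and same_method:
            blocks["heterotrait_monomethod"].append(pearson(a, b))
        elif not same_trait:
            blocks["heterotrait_heteromethod"].append(pearson(a, b))
    return blocks


# Hypothetical data: two traits, each measured by two methods.
measures = {
    ("trait_a", "method_1"): [2, 4, 5, 7, 8, 9],
    ("trait_a", "method_2"): [1, 5, 4, 8, 7, 10],
    ("trait_b", "method_1"): [9, 7, 10, 8, 10, 8],
    ("trait_b", "method_2"): [8, 6, 10, 9, 9, 7],
}
blocks = classify_pairs(measures)
```

Evidence for construct validity in this sketch means the values in `blocks["validity"]` are clearly larger than those in the two heterotrait blocks.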
