## Evaluation Methodology


**Evaluation Methodology**
Fatemeh Vahedian, CSC-426, Week 6

**Outline**
• Evaluation methodology: workload, experimental design, rigorous analysis
• Measurement and levels of measurement
• Reliability: true score theory of measurement, measurement error, theory of reliability, reliability types
• Construct validity: measurement validity types, the idea of construct validity, convergent validity, discriminant validity, threats to construct validity, approaches to assessing validity

**Evaluation Description**
• Professional evaluation is defined as the systematic determination of the quality or value of something (Scriven, 1991)
• Evaluation methodology underpins all innovation in experimental computer science
• Evaluation is a systematic determination of a subject's merit using criteria governed by a set of standards
• Evaluation is the structured interpretation and giving of meaning to predicted or actual impacts of proposals or results

**Evaluation Methodology**
• Evaluation methodology requirements:
• Relevant workloads
• Appropriate experimental design
• Rigorous analysis

**Workload**
• Relevant and diverse
• No workload is definitive: the question is whether it meets the test objective
• Each candidate should be evaluated quantitatively and qualitatively (e.g., on a specific platform, domain, or application)
• Widely used (e.g., open source applications)
• Nontrivial (not a toy system)
• Suitable for research
• Tractable: easy to use and manage
• Repeatable
• Standardized
• Workload selection should reflect a range of behaviors, not just the behavior we are looking for

**Experimental Design**
• Meaningful baseline
• Comparing against the state of the art
• Widely used workloads
• Not always practical (e.g., no implementation available to the public)
• Comparisons that control key parameters
• Understanding what to control, and how to control it, in an experimental system is clearly important.
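As a concrete illustration of controlling a key parameter while taking repeated measurements, consider the following Python sketch. It is illustrative only: the collector names, timings, and noise model are invented, and `run_benchmark` stands in for a real benchmark harness. The sketch holds the heap size fixed, varies only the collector, repeats each trial to average out nondeterminism, and reports a mean with a 95% confidence interval rather than a single number.

```python
import random
import statistics

random.seed(42)  # fix the seed so the sketch itself is repeatable

def run_benchmark(collector, heap_mb):
    """Stand-in for a real benchmark run; returns a noisy 'runtime' in ms."""
    base = {"mark-sweep": 100.0, "copying": 90.0}[collector]  # invented numbers
    return base * (256 / heap_mb) + random.gauss(0, 2.0)      # invented noise

HEAP_MB = 256   # key parameter: held fixed across both schemes
TRIALS = 30     # multiple measurements to average out nondeterminism

for collector in ("mark-sweep", "copying"):
    times = [run_benchmark(collector, HEAP_MB) for _ in range(TRIALS)]
    mean = statistics.mean(times)
    # 95% confidence interval half-width (normal approximation)
    half = 1.96 * statistics.stdev(times) / (TRIALS ** 0.5)
    print(f"{collector}: {mean:.1f} ± {half:.1f} ms (n={TRIALS})")
```

Only the heap size and the collector choice are bound here; in a real system the host platform, OS, and workload would also need to be held fixed or randomized deliberately.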
• For example, comparing two garbage collection schemes while controlling the heap size
• Bound and free parameters
• Degrees of freedom

**Experimental Design**
• Control the changing environment and unintended variance
• E.g., host platform, OS, arrival rate
• Scheduling schemes, workloads
• Network latency and traffic
• Differences in environment
• Controlling nondeterminism
• Understand the key variance points
• Due to nondeterminism, the quality of the results usually does not reach the same steady state as on a deterministic workload
• Take multiple measurements and generate sufficient data points
• Statistically analyze results for differences caused by remaining nondeterminism

**Rigorous Analysis**
• Researchers use data analysis to identify and articulate the significance of experimental results
• Challenging in complex systems because of the sheer volume of results
• Aggregating data across repeated experiments is a standard technique for increasing confidence in a noisy environment
• Since noise cannot be eliminated altogether, multiple trials are inevitably necessary
• Reducing nondeterminism
• Researchers have only finite resources; reducing sources of nondeterminism with sound experimental design improves tractability
• Statistical confidence intervals and significance
• Show best and worst cases

**Measurement and Levels of Measurement**
• Measurement is the process of observing and recording the observations that are collected as part of a research effort
• Level of measurement:
• The level of measurement refers to the relationship among the values that are assigned to the attributes of a variable

**Measurement and Levels of Measurement**
• Four levels of measurement are typically defined:
• Nominal
• Ordinal
• Interval
• Ratio
• Why are the levels of measurement important?
• Knowing the level of measurement helps you decide how to interpret the data from that variable
• It helps you decide what statistical analysis is appropriate for the values that were assigned

**Reliability**
• Reliability is the consistency or repeatability of your measures
• True score theory of measurement: the foundation of reliability theory
• Different types of measurement error
• Theory of reliability
• Different types of reliability
• The relationship between reliability and validity in measurement

**Reliability / True Score Theory of Measurement**
• Every measurement consists of two components:
• the true ability (or true level) of the respondent on that measure
• random error
• Why is it important?
• It reminds us that most measurement has an error component
• True score theory is the foundation of reliability theory
• A measure that has no random error (i.e., is all true score) is perfectly reliable; a measure that has no true score (i.e., is all random error) has zero reliability
• True score theory can be used in computer simulations as the basis for generating "observed" scores with certain known properties
• Observed score = True ability + Random error

**Measurement Error**
• Random error
• Caused by any factors that randomly affect measurement of the variable
• Sums to 0 and does not affect the average
• Systematic error
• Caused by any factors that systematically affect measurement
• Consistently either positive or negative
• Reducing measurement error
• Pilot study
• Training
• Double-check the data thoroughly
• Use statistical procedures
• Use multiple measures

**Theory of Reliability**
• In research, reliability means repeatability or consistency
• A measure is considered reliable if it would give us the same result over and over again
• Reliability is a ratio or fraction:
• true variance over the variance of the entire measure: var(T)/var(X)
• estimated as covariance(X1, X2) / (sd(X1) * sd(X2))
• We cannot calculate reliability exactly because we cannot measure the true score, but we can estimate it (a value between 0 and 1)

**Reliability Types**
• Inter-rater or inter-observer reliability
• Raters assign items to categories
• Calculate the correlation between raters
• Test-retest reliability
• The shorter the time gap, the higher the correlation
• Parallel-forms reliability
• Create a large set of questions that address the same construct, divide it into two sets, and administer both to the same sample
• Internal consistency reliability
• Average inter-item correlation
• Average item-total correlation
• Split-half reliability
• Cronbach's alpha (α)

**Reliability & Validity**
• The center of the target is the concept that you are trying to measure

**Construct Validity / Definition**
• Construct validity has traditionally been defined as the experimental demonstration that a test is measuring the construct it claims to be measuring
• What is a construct? A construct, or psychological construct as it is also called, is an attribute, proficiency, ability, or skill that happens in the human brain and is defined by established theories
• A construct is a concept. A clearly specified research question should lead to a definition of study aims and objectives that set out the construct and how it will be measured. Increasing the number of different measures in a study will increase construct validity, provided that the measures are measuring the same construct
• (Diagram: the idea/construct on the theory side is linked to the program/operationalization on the observation side)

**Measurement Validity Types**
• Translation validity
• Face validity
• Content validity
• Criterion-related validity
• Predictive validity
• Concurrent validity
• Convergent validity
• Discriminant validity

**Idea of Construct Validity**
• Construct validity is an assessment of how well you translated your ideas or theories into actual programs or measures
• Why is construct validity important?
• Truth in labeling

**Convergent & Discriminant Validity**
• Convergent and discriminant validity are both considered subcategories, or subtypes, of construct validity
• Convergent validity
• To establish convergent validity, we need to show that measures that should be related are in reality related

**Discriminant Validity**
• To establish discriminant validity, you need to show that measures that should not be related are in reality not related

**Threats to Construct Validity**
• Inadequate preoperational explication of constructs
• Mono-operation bias
• Mono-method bias
• Interaction of different treatments
• Interaction of testing and treatment
• Restricted generalizability across constructs
• Confounding constructs and levels of constructs
• "Social" threats
• Hypothesis guessing
• Evaluation apprehension
• Experimenter expectancy

**Approaches to Assessing Validity**
• Nomological network
• Provides a theoretical basis
• Multitrait-multimethod matrix
• Demonstrates both convergent and discriminant validity
• Pattern matching

**Nomological Network**
• It includes a theoretical framework for what you are trying to measure, an empirical framework for how you are going to measure it, and a specification of the linkages among and between these two frameworks
• It does not provide a practical and usable methodology for actually assessing construct validity

**Pattern Matching**
• It is an attempt to link two patterns: a theoretical pattern and an observed pattern
• It requires that:
• You specify your theory of the constructs precisely
• You structure the theoretical and observed patterns the same way so that you can directly correlate them
• Common example: an ANOVA table

**Multitrait-Multimethod Matrix**
• The MTMM is a matrix of correlations arranged to facilitate the assessment of construct validity
• It is based on convergent and discriminant validity
• It assumes that you have several concepts and several measurement methods, and you measure each concept by each method
• To determine the strength of the construct validity:
• Reliability coefficients should be the highest in the matrix
• Coefficients in the validity diagonal should be significantly different from zero and high enough
• The same pattern of trait interrelationships should be seen in all triangles