
Evaluation Methodology


Presentation Transcript


  1. Evaluation Methodology Fatemeh Vahedian CSC-426, Week 6

  2. Outline • Evaluation Methodology • Work Load • Experimental design • Rigorous analysis • Measurement • Levels of measurement • Reliability • True score theory of measurement • Measurement Error • Theory of Reliability • Reliability Types • Construct validity • Measurement validity types • Idea of Construct Validity • Convergent Validity • Discriminant Validity • Threats to Construct Validity • Approaches to Assess Validity

  3. Evaluation description • Professional evaluation is defined as the systematic determination of quality or value of something (Scriven 1991) • Evaluation methodology underpins all innovation in experimental computer science • Evaluation is a systematic determination of a subject's merit using criteria governed by a set of standards • Evaluation is the structured interpretation and giving of meaning to predicted or actual impacts of proposals or results

  4. Evaluation Methodology • Evaluation methodology requirements: • relevant workloads • appropriate experimental design • rigorous analysis

  5. Workload • Relevant and diverse • No workload is definitive: does it meet the test objective? • Each candidate should be evaluated quantitatively and qualitatively (e.g. specific platform, domain, application) • Widely used (e.g. open source applications) • Nontrivial (i.e. not a toy system) • Suitable for research • Tractable: easy to use and manage • Repeatable • Standardized • Workload selection: it should reflect a range of behaviors, not just the behavior we are looking for

  6. Experimental design • Meaningful baseline • Comparing against the state of the art • Widely used workloads • Not always practical (e.g. the implementation is not publicly available) • Comparisons that control key parameters • Understanding what to control and how to control it in an experimental system is clearly important • Comparing two garbage collection schemes while controlling the heap size (see the sketch below) • Bound and free parameters • Degrees of freedom
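
As a concrete illustration of controlling a key parameter, the sketch below fixes the heap size while varying only the garbage collection scheme. It is a minimal Python sketch; the benchmark jar, workload name, and the harness's command-line flags are hypothetical placeholders, while the JVM heap and GC flags are standard.

```python
# Sketch: vary only the GC scheme while holding heap size and workload fixed.
# "benchmark.jar" and its --workload flag are hypothetical placeholders.
import subprocess

HEAP_SIZE = "512m"                       # controlled parameter, identical in every run
GC_FLAGS = {
    "serial": "-XX:+UseSerialGC",
    "g1": "-XX:+UseG1GC",
}

for name, gc_flag in GC_FLAGS.items():
    cmd = [
        "java", f"-Xms{HEAP_SIZE}", f"-Xmx{HEAP_SIZE}", gc_flag,
        "-jar", "benchmark.jar", "--workload", "lusearch",   # hypothetical harness
    ]
    print("running", name)
    subprocess.run(cmd, check=True)
```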

  7. Experimental Design • Control changing environment and unintended variance • E.g. host platform, OS, arrival rate • Scheduling schemes, workloads • Network latency & traffic • Differences in environment • Controlling nondeterminism • Understand key variance points • Due to nondeterminism, results usually do not reach the same steady state even on a deterministic workload • Take multiple measurements and generate sufficient data points • Statistically analyze results for differences due to remaining nondeterminism

  8. Rigorous analysis • Researchers use data analysis to identify and articulate the significance of experimental results • This is challenging in complex systems because of the sheer volume of results • Aggregating data across repeated experiments is a standard technique for increasing confidence in a noisy environment • Since noise cannot be eliminated altogether, multiple trials are inevitably necessary • Reducing nondeterminism • Researchers have only finite resources; reducing sources of nondeterminism with sound experimental design improves tractability • Statistical confidence intervals and significance (see the sketch below) • Show best and worst cases
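
A minimal sketch of aggregating repeated measurements into a confidence interval, together with best and worst cases. The runtimes are illustrative numbers, and the 95% interval uses a normal approximation (with only a handful of runs, a t-interval would be somewhat wider).

```python
# Sketch: report mean, a 95% confidence interval, and best/worst cases
# over repeated runs. Runtimes are illustrative; 1.96 is the normal-
# approximation z value (a t-interval would be wider for so few runs).
import statistics

runtimes = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]   # seconds, illustrative

mean = statistics.mean(runtimes)
sd = statistics.stdev(runtimes)
half_width = 1.96 * sd / len(runtimes) ** 0.5

print(f"mean = {mean:.2f}s, 95% CI ≈ [{mean - half_width:.2f}, {mean + half_width:.2f}]")
print(f"best = {min(runtimes):.2f}s, worst = {max(runtimes):.2f}s")
```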

  9. Measurement and Levels of Measurement • Measurement is the process of observing and recording the observations that are collected as part of a research effort • Level of measurement: • The level of measurement refers to the relationship among the values that are assigned to the attributes of a variable

  10. Measurement and Levels of Measurement • There are typically four levels of measurement • Nominal • Ordinal • Interval • Ratio • Why are levels of measurement important? • Knowing the level of measurement helps you decide how to interpret the data from that variable • It also helps you decide what statistical analysis is appropriate for the values that were assigned (see the sketch below)
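
One possible way to see why the level of measurement matters: different summary statistics are appropriate at different levels. The mapping below (mode for nominal, median for ordinal, mean for interval/ratio) is a common convention rather than something stated on the slide, and the sample values are illustrative.

```python
# Sketch: appropriate summary statistics per level of measurement
# (mode for nominal, median for ordinal, mean for interval/ratio).
import statistics
from collections import Counter

nominal = ["linux", "windows", "linux", "macos"]   # categories: only the mode is meaningful
ordinal = [1, 3, 2, 2, 3]                          # ranks: the median is meaningful
ratio = [12.1, 11.8, 12.4, 12.0]                   # runtimes: means and ratios are meaningful

print("mode:", Counter(nominal).most_common(1)[0][0])
print("median:", statistics.median(ordinal))
print("mean:", statistics.mean(ratio))
```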

  11. Reliability • Reliability is the consistency or repeatability of your measures • True score theory of measurement: the foundation of reliability theory • Different types of measurement error • Theory of reliability • Different types of reliability • The relationships between reliability and validity in measurement

  12. Reliability / True score theory of measurement • Consists of two components: • true ability (or the true level) of the respondent on that measure • random error • Why is it important? • It reminds us that most measurement has an error component • True score theory is the foundation of reliability theory • A measure that has no random error (i.e., is all true score) is perfectly reliable; a measure that has no true score (i.e., is all random error) has zero reliability • True score theory can be used in computer simulations as the basis for generating "observed" scores with certain known properties • Observed score = True ability + Random error
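
Following the slide's remark that true score theory can drive computer simulations, here is a minimal sketch that generates observed scores as true ability plus random error and recovers the reliability var(T)/var(X), which is known exactly because the data are simulated. The score distributions are arbitrary illustrative choices.

```python
# Sketch: observed score = true ability + random error.
# Because the true scores are simulated, reliability var(T)/var(X) is known.
import random
import statistics

random.seed(0)
true_scores = [random.gauss(70, 10) for _ in range(1000)]     # "true ability"
observed = [t + random.gauss(0, 5) for t in true_scores]      # plus random error

reliability = statistics.variance(true_scores) / statistics.variance(observed)
print("reliability ≈", round(reliability, 3))                 # ≈ 10² / (10² + 5²) = 0.8
```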

  13. Measurement Error • Random error • Caused by any factors that randomly affect measurement of the variable • Sums to 0 and does not affect the average • Systematic error • Caused by any factors that systematically affect measurement • Consistently either positive or negative • Reducing measurement error • Pilot study • Training • Double-check the data thoroughly • Use statistical procedures • Use multiple measures
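
A minimal sketch contrasting the two error types: random error averages out toward zero across many measurements, while a systematic error (here an assumed constant +3 bias) shifts the mean. All values are illustrative.

```python
# Sketch: random error sums toward zero, systematic error biases every value.
import random
import statistics

random.seed(1)
true_value = 100.0
random_only = [true_value + random.gauss(0, 2) for _ in range(10_000)]
with_bias = [true_value + random.gauss(0, 2) + 3.0 for _ in range(10_000)]   # assumed +3 bias

print("mean, random error only:", round(statistics.mean(random_only), 2))    # ≈ 100
print("mean, with systematic error:", round(statistics.mean(with_bias), 2))  # ≈ 103
```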

  14. Theory of Reliability • In research, reliability means repeatability or consistency • A measure is considered reliable if it would give us the same result over and over again • Reliability is a ratio or fraction: • the variance of the true score over the variance of the entire measure: var(T)/var(X) • estimated by the correlation between two parallel measures: cov(X1, X2) / (sd(X1) · sd(X2)) • We cannot calculate reliability exactly because we cannot measure the true score, but we can estimate it (a value between 0 and 1)
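
Since the true score is unobservable, reliability is estimated rather than calculated; a common estimate is the correlation between two parallel measures of the same construct. The sketch below simulates such parallel measures (illustrative distributions) and uses statistics.correlation, which requires Python 3.10+.

```python
# Sketch: estimate reliability as the correlation between two parallel
# measures X1 and X2 of the same true score (Python 3.10+ for correlation()).
import random
import statistics

random.seed(2)
true_scores = [random.gauss(50, 8) for _ in range(500)]
x1 = [t + random.gauss(0, 4) for t in true_scores]    # first measurement
x2 = [t + random.gauss(0, 4) for t in true_scores]    # parallel second measurement

# corr(X1, X2) = cov(X1, X2) / (sd(X1) * sd(X2)), an estimate between 0 and 1
print("estimated reliability ≈", round(statistics.correlation(x1, x2), 3))
```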

  15. Reliability Types • Inter-rater or inter-observer reliability • Raters assign categories • Calculate the correlation • Test-retest reliability • The shorter the time gap, the higher the correlation • Parallel-forms reliability • Create a large set of questions that address the same construct, divide them into two sets, and administer both to the same sample • Internal consistency reliability • Average inter-item correlation • Average item-total correlation • Split-half reliability • Cronbach's alpha (α)
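
As one worked example of internal consistency reliability, the sketch below computes Cronbach's alpha from a small illustrative matrix of respondents by items, using the standard formula α = k/(k−1) · (1 − Σ item variances / total variance).

```python
# Sketch: Cronbach's alpha from a respondents-by-items matrix (illustrative data).
import statistics

items = [
    [3, 4, 3, 5],   # respondent 1's answers to four items
    [2, 2, 3, 3],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 4],
]

k = len(items[0])                                          # number of items
item_vars = [statistics.variance(col) for col in zip(*items)]
total_var = statistics.variance([sum(row) for row in items])
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print("Cronbach's alpha:", round(alpha, 3))
```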

  16. Reliability & Validity • The center of the target is the concept that you are trying to measure

  17. Construct validity / Definition • Construct validity has traditionally been defined as the experimental demonstration that a test is measuring the construct it claims to be measuring • What is a construct? A construct, or psychological construct as it is also called, is an attribute, proficiency, ability, or skill that happens in the human brain and is defined by established theories • A construct is a concept. A clearly specified research question should lead to a definition of study aim and objectives that set out the construct and how it will be measured. Increasing the number of different measures in a study will increase construct validity provided that the measures are measuring the same construct • [Diagram: Idea → Construct; Program → Operationalization]

  18. Measurement validity types • Translation validity • Face validity • Content validity • Criterion-related validity • Predictive validity • Concurrent validity • Convergent validity • Discriminant validity

  19. Idea of Construct Validity • Construct validity is an assessment of how well you translated your ideas or theories into actual programs or measures • Why is construct validity important? • Truth in labeling

  20. Convergent & Discriminant Validity • Convergent and discriminant validity are both considered subcategories or subtypes of construct validity • Convergent Validity • To establish convergent validity, we need to show that measures that should be related are in reality related

  21. Discriminant Validity • To establish discriminant validity, you need to show that measures that should not be related are in reality not related
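
A minimal sketch showing both checks at once: measures of the same construct should correlate highly (convergent validity), while measures of different constructs should correlate weakly (discriminant validity). The constructs and noise levels are illustrative; statistics.correlation requires Python 3.10+.

```python
# Sketch: two measures of construct A should correlate highly (convergent),
# a measure of construct B should correlate weakly with them (discriminant).
import random
import statistics

random.seed(3)
construct_a = [random.gauss(0, 1) for _ in range(300)]
construct_b = [random.gauss(0, 1) for _ in range(300)]

a1 = [x + random.gauss(0, 0.3) for x in construct_a]
a2 = [x + random.gauss(0, 0.3) for x in construct_a]
b1 = [x + random.gauss(0, 0.3) for x in construct_b]

print("convergent   (A1 vs A2):", round(statistics.correlation(a1, a2), 2))   # high
print("discriminant (A1 vs B1):", round(statistics.correlation(a1, b1), 2))   # near zero
```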

  22. Threats to Construct Validity • Inadequate preoperational explication of constructs • Mono-operation bias • Mono-method bias • Interaction of different treatments • Interaction of testing and treatment • Restricted generalizability across constructs • Confounding constructs and levels of constructs • "Social" threats • Hypothesis guessing • Evaluation apprehension • Experimenter expectancy

  23. Approaches to Assess Validity • Nomological network • a theoretical basis • Multitrait-multimethod matrix • to demonstrate both convergent and discriminant validity • Pattern matching

  24. Nomological network • It includes a theoretical framework for what you are trying to measure, an empirical framework for how you are going to measure it, and a specification of the linkages among and between these two frameworks • It does not provide a practical and usable methodology for actually assessing construct validity

  25. Pattern matching • It is an attempt to link two patterns: a theoretical pattern and an observed pattern • It requires that: • you specify your theory of the constructs precisely • you structure the theoretical and observed patterns the same way so that you can directly correlate them • Common example • ANOVA table
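
A minimal sketch of the pattern-matching idea: the theoretical pattern (predicted group means) and the observed pattern (measured group means) are structured the same way and correlated directly. The values are illustrative; statistics.correlation requires Python 3.10+.

```python
# Sketch: correlate a theoretical pattern of group means with the observed pattern.
import statistics

theoretical = [1.0, 2.0, 3.0, 4.0]   # predicted ordering of group means (illustrative)
observed = [1.2, 1.9, 3.4, 3.8]      # measured group means (illustrative)

print("pattern match (correlation):", round(statistics.correlation(theoretical, observed), 2))
```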

  26. Multitrait-multimethod matrix • MTMM is a matrix of correlations arranged to facilitate the assessment of construct validity • It is based on convergent and discriminant validity • It assumes that you have several concepts and several measurement methods, and you measure each concept by each method • To determine the strength of the construct validity: • Reliability coefficients should be the highest in the matrix • Coefficients in the validity diagonal should be significantly different from zero and high enough • The same pattern of trait interrelationships should be seen in all triangles
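
A minimal sketch of building and inspecting an MTMM-style correlation matrix: every trait is measured by every method, and all pairwise correlations are printed so that convergent and discriminant patterns can be compared. Traits, methods, and noise levels are illustrative assumptions; statistics.correlation requires Python 3.10+.

```python
# Sketch: measure every trait with every method and print the full
# correlation matrix (traits, methods, and noise are illustrative).
import random
import statistics

random.seed(4)
traits = {name: [random.gauss(0, 1) for _ in range(200)] for name in ("T1", "T2")}
methods = ("self-report", "observer")

# Each (trait, method) measure is the trait score plus method-specific noise.
measures = {(t, m): [x + random.gauss(0, 0.4) for x in scores]
            for t, scores in traits.items() for m in methods}

labels = list(measures)
for a in labels:
    row = " ".join(f"{statistics.correlation(measures[a], measures[b]):5.2f}" for b in labels)
    print(f"{a!s:24s} {row}")
```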
