HCI460: Week 8 Lecture. October 28, 2009. Outline. Midterm Review How Many Participants Should I Test? Review Exercises Stats Review of material covered last week New material Project 3 Next Steps Feedback on the Test Plans. Midterm Review. Midterm Review. Overall. N = 44.
HCI460: Week 8 Lecture • October 28, 2009
Outline • Midterm Review • How Many Participants Should I Test? • Review • Exercises • Stats • Review of material covered last week • New material • Project 3 • Next Steps • Feedback on the Test Plans
Midterm Review Overall N = 44 • Mean / average: 8.55 • Median: 8.75 • Mode: 10 (most frequent score)
Midterm Review Q1: Heuristic vs. Expert Evaluation • Question: What is the main difference between a heuristic evaluation and an expert evaluation? • Answer: • Heuristic evaluation uses a specific set of guidelines or heuristics. • Expert evaluation relies on the evaluator’s expertise (including internalized guidelines) and experience. • No need to explicitly match issues to specific heuristics. • More flexibility.
Midterm Review Q2: Research-Based Guidelines (RBGs) • Question: What is unique about the research-based guidelines on usability.gov relative to heuristics and other guidelines? What are the unique advantages of using the research-based guidelines? • Answer: • This is a very comprehensive list of very specific guidelines (over 200). Other guideline sets are much smaller and the guidelines are more general. • RBGs were created by a group of experts (not an individual). • RBGs are specific to the web. • Unlike other heuristics and guidelines, RBGs have two ratings: • Relative importance to the success of a site • Helps prioritize issues. • Strength of research evidence that supports the guideline • Research citations lend credibility to the guidelines.
Midterm Review Q3: Positive Findings • Question: Why should positive findings be presented in usability reports? • Answer: • To let stakeholders know what they should not change and which current practices they should try to emulate. • To make the report sound more objective and make stakeholders more receptive to the findings in the report. • Humans are more open to criticism if it is balanced with praise.
Midterm Review Q4: Think-Aloud vs. Retrospective TA • Note: RTA ≠ post-task interview • Question: What is the main difference between the think-aloud protocol (TA) and the retrospective think-aloud protocol (RTA)? When should you use each of these methods and why? • Answer: • TA involves having the participant state what they are thinking while they are completing a task. • Great for formative studies; helps understand participant actions as they happen. • RTA is used after the task has been completed in silence. The participant walks through the task one more time (or watches a video of himself/herself performing the task) and explains their thoughts and actions. • Good when time on task and other quantitative behavioral measures need to be collected in addition to qualitative data. • Good for participants who may not be able to do TA,
Midterm Review Q5: Time on Task in Formative UTs • Question: What are the main concerns associated with using time on task in a formative study with 5 participants? • Answer: • Formative studies often involve think-aloud protocol. • Time on task will be longer because thinking aloud takes more time and changes the workflow. • Sample size is too small for the time on task to generalize to the population or show significant differences between conditions.
Midterm Review Q6: Human Error • Question: Why is the term “human error” no longer used in the medical field? • Answer: • “Human error” places the blame on the human when in fact errors usually result from problems with the design. • A more neutral term “use error” is used instead.
Midterm Review Q7: Side-by-Side Moderation • Note: Moderating from another room with audio communication ≠ remote study • Question: When would you opt for side-by-side moderation in place of moderation from another room with audio communication? • Answer: • Side-by-side moderation is better when: • Building rapport with participant is important (e.g., in formative think-aloud studies) • Moderator has to simulate interaction (e.g., paper prototype) • The tested object / interaction may be difficult to see via camera feed or through the one-way mirror
How Many Participants Should I Test? Review from Last Week
How Many Participants Should I Test? Overview
How Many Participants Should I Test? Sample Size Calculator for Formative Studies Jeff Sauro’s Sample Size Calculator for Discovering Problems in a User Interface: http://www.measuringusability.com/problem_discovery.php
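Problem discovery in formative studies is commonly modeled with the binomial formula 1 - (1 - p)^n, where p is the proportion of users affected by a problem and n is the number of participants. The sketch below assumes that model; the function names and the value p = 0.31 are illustrative assumptions, not taken from the calculator itself.

```python
import math


def discovery_rate(p: float, n: int) -> float:
    """Chance that a problem affecting a proportion p of users
    shows up at least once across n test sessions."""
    return 1 - (1 - p) ** n


def participants_needed(p: float, target: float) -> int:
    """Smallest n whose discovery rate reaches the target (0 < target < 1)."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# Illustrative assumption: each problem affects 31% of users.
for n in (3, 5, 8, 10):
    print(n, round(discovery_rate(0.31, n), 2))
print(participants_needed(0.31, 0.85))
```

Under this model, rarer problems (smaller p) need many more participants to be seen at least once, which is why the assumed value of p matters more than the formula.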
How Many Participants Should I Test? Sample Size for Precision Testing Sampling Error: • We need sufficient sample size to be able to generalize the results to the population. • Sample size for precision testing depends on: • Confidence level (usually 95% or 99%) • Desired level of precision • Acceptable sampling error (+/- 5%) • Size of population to which we want to generalize the results • Free online sample size calculator from Creative Research Systems: http://www.surveysystem.com/sscalc.htm
How Many Participants Should I Test? Sample Size for Precision Testing • Confidence level: 95% • When generalizing a score to the population, a large sample size is needed. • However, more is not always better: precision improves more and more slowly as the sample grows. • Getting 2000 participants is a waste.
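A minimal sketch of the arithmetic behind precision-testing sample sizes, assuming the standard formula for estimating a proportion (worst case p = 0.5) with an optional finite population correction. The function names are hypothetical, and an online calculator may round slightly differently.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, population=None, p=0.5):
    """Participants needed to estimate a proportion within +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: small populations need fewer people.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


def margin_of_error(n, confidence=0.95, p=0.5):
    """Sampling error (+/-) achieved with n participants, large population."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


print(sample_size_for_proportion(0.05))                   # large population
print(sample_size_for_proportion(0.05, population=1000))  # population of 1000
for n in (100, 400, 1000, 2000):
    print(n, round(margin_of_error(n), 3))
```

The loop illustrates the diminishing returns: each additional block of participants buys a smaller improvement in precision than the one before.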
How Many Participants Should I Test? Sample Size for Hypothesis Testing • Hypothesis testing: comparing means • E.g., accuracy of typing on Device A is significantly better than it is on Device B. • Inferential statistics • Necessary sample size is derived from a calculation of power. • Under assumed criteria, the study will have a good chance of detecting a significant difference if the difference indeed exists. • Sample size depends on: • Assumed confidence level (e.g., 95%, 99%) • Acceptable sampling error (e.g., +/- 5%) • Expected effect size • Power • Statistical test (e.g., t-test, correlation, ANOVA)
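To make the power calculation concrete, here is a rough sketch for comparing two independent group means. It assumes a two-sided test and uses the normal approximation n = 2((z_alpha + z_power) / d)^2 per group, where d is the expected standardized effect size; exact t-based tables give slightly larger values. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided comparison
    of two independent means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Conventional small, medium, and large standardized effect sizes.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

Halving the expected effect size roughly quadruples the required sample, so the effect-size assumption dominates the result.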
How Many Participants Should I Test? Hypothesis Testing: Sample Size Table* • *Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112 (1), http://www.math.unm.edu/~schrader/biostat/bio2/Spr06/cohen.pdf.
How Many Participants Should I Test? Reality • Usability tests do not typically require statistical significance. • Objectives dictate type of study and reasonable sample sizes necessary. • Sample size used is influenced by many factors, not all of them statistically driven. • Power analysis provides an estimate of sample size necessary to detect a difference, if it does indeed exist. • Risk of not performing power analysis? • Too few → low power → inability to detect difference • Too many → waste (and possibly find differences that are not real) • What if you find significance even with a small sample size? • It is probably really there (at a certain p level).
How Many Participants Should I Test? Exercises
How Many Participants Should I Test? Exercise 1: Background Information Old New • Package inserts for chemicals used in hospital labs were shortened and standardized to reduce cost. • E.g., many chemicals x many languages = high translation cost • New inserts: • ½ size of old inserts (booklet, not “map”) • More concise (charts, bullet points) • Less redundant • Users: Lab techs in hospitals
How Many Participants Should I Test? Exercise 1: The Question • Client question: • Will the new inserts negatively impact user performance? • How many participants do you need for the study and why? • Exercise: • Discuss in groups and prepare questions for the client. • Q & A with the client • Come up with the answer and be prepared to explain it.
How Many Participants Should I Test? Exercise 1: Possible Method • Each participant was asked to complete 30 search tasks • 2 insert versions x 3 chemicals x 5 search tasks • Sample question: “How long may the _____ be stored at the refrigeration temperature?” (for up to 7 days) • Tasks instructions were printed on a card and placed in front of the participant. • The tasks were timed. • Participants had a maximum of 1 minute to complete each task. • Those who exceeded the time limit were asked to move on to the next task. • To indicate that they were finished, participants had to: • Say the answer to the question out loud • Point to the answer in the insert
How Many Participants Should I Test? Exercise 1: Possible Method
How Many Participants Should I Test? Exercise 1: Sample Size
How Many Participants Should I Test? Exercise 1: Sample Size • 32 lab techs: • 17 in the US • 15 in Germany
How Many Participants Should I Test? Exercise 1: Results • Short inserts performed significantly better than the long inserts in terms of:
How Many Participants Should I Test? Exercise 2 • Client question: • Does our website work for our users? • How many participants do you need for the study and why? • Exercise: • Discuss in groups and prepare questions for the client. • Q & A with the client • Come up with the answer and be prepared to explain it.
Stats Planted the Seed Last Week • Stats are: • Not learnable in an hour • More than just p-values • Powerful • Dangerous • Time consuming • But, you are the Expert • Need to know: • Foundation • Rationale • Useful application
Stats Foundation • Just as with any user experience research endeavor • Think hard first (Test Plan) • Anticipate outcomes to keep you on track with objectives and not get pulled to tangents • Then begin research... • Definition of statistics: • A set of procedures for describing measurements and for making inferences on what is generally true • Statistics and experimental design go hand in hand • Objective → Method → Measures → Outcomes → Objectives
Stats Measures • Ratio scales (Interval + Absolute Zero) • Measure that has: • Comparable intervals (inches, feet, etc.) • An absolute zero (not meaningful to non-statisticians) • Differences are comparable • Device A is 72 inches in length while Device B is 36 inches • One is twice as long as the other • Time on task was 30 sec on Device A and 60 sec on Device B • Users completed the task twice as fast on Device A as on Device B • Interval scales do not have an absolute zero • Differences are still comparable: 50°F − 40°F = 100°F − 90°F • Take away: You get powerful statistics using ratio/interval measures
Stats What Does Power Mean Again? • Statistical Power • “The power to find a significant difference, if it does indeed exist” • Too little power → miss significance when it is really there • Too much power → might find significance when it is NOT there • You can get MORE power by: • Adding more participants (but the impact is non-linear) • Having a greater “effect size,” which is the anticipated difference • Picking within-subjects designs over between-subjects designs • Using ratio/interval measures • Changing alpha • Practical Power • Sample size costs money • If you find significance, then it is probably really true!
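The non-linear effect of adding participants can be sketched numerically. The example below assumes a two-group comparison, a two-sided test, and a normal approximation to power (ignoring the negligible chance of significance in the wrong direction); the function name and the effect size of 0.5 are illustrative assumptions.

```python
from statistics import NormalDist


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power to detect a standardized difference between
    two independent groups of equal size (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect_size * (n_per_group / 2) ** 0.5 - z_alpha)


# Doubling the sample repeatedly: power climbs quickly, then flattens.
for n in (10, 20, 40, 80, 160):
    print(n, round(approx_power(n, 0.5), 2))
```

Raising alpha or assuming a larger effect size in the call shows the other levers listed above at work.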
Stats Other Measures (non-ratio/non-interval) • Likert scales • Rank data • Count data • Each of these measures uses a different statistical test • Power is different (reduced) • Consider Likert data: items A, B, and C rated on a scale of 1 2 3 4 5 • Could say: • A=1, B=2, C=5 • C came in 1st, B came in 2nd and A came in 3rd • Precision influences power: the less precise your measure, the less power you have to detect differences
Stats Between-Groups Designs • Between-groups study splits sample into two or more groups • Each group only interacts with one device • What causes variability? • Measurement error • The tool or procedure can be an imprecise device • Starting and stopping the stop watch • Unreliable • We are human, so if you test the same participant on different days, you might get a different time! • Individual differences • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem
Stats What About Within-Groups Designs? • Within-groups study has participants interact with all devices • What causes variability? • Measurement error • The tool or procedure can be an imprecise device • Starting and stopping the stop watch • Unreliable • We are human, so if you test the same participant on different days, you might get a different time! • Individual differences • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem • No longer applies • Thus, fewer causes of variability result in more statistical power
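A small simulation can illustrate why removing individual differences helps. All numbers below are assumed for illustration (a 60 sec average, 10 sec spread between participants, 2 sec measurement noise, and a 5 sec true device effect); none come from a real study.

```python
import random
from statistics import stdev


def simulate(n=1000, seed=1):
    """Compare the spread a between-groups analysis sees with the spread
    of per-participant differences a within-groups analysis sees."""
    rng = random.Random(seed)
    # Individual differences: each participant has their own baseline speed.
    baseline = [rng.gauss(60, 10) for _ in range(n)]
    # Each measurement adds its own error; Device B is 5 sec slower.
    time_a = [b + rng.gauss(0, 2) for b in baseline]
    time_b = [b + 5 + rng.gauss(0, 2) for b in baseline]
    between_sd = stdev(time_a)  # includes individual differences
    within_sd = stdev([tb - ta for ta, tb in zip(time_a, time_b)])  # baseline cancels
    return between_sd, within_sd


between_sd, within_sd = simulate()
print(round(between_sd, 1), round(within_sd, 1))
```

The same 5 sec effect is judged against a much smaller spread in the within-groups case, which is what gives that design its extra power.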
Stats More Common Statistical Tests • You are actually well aware of statistics – Descriptive statistics! • Measures of central tendency • Mean • Median • Mode • Definitions? • Mean = ? • Average • Median = ? • The exact point that divides the distribution into two parts such that an equal number fall above and below that point • Mode = ? • Most frequently occurring score
Stats When In Doubt, Plot • Normal distribution, randomly sampled: Mean = Median = Mode • [Figure: histogram of Frequency (y-axis) by Score (x-axis, 1 to 5)] • Take scores: 1 2 3 4 5 2 3 4 5 1 3 3 4 5 3
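As a quick way to plot without any charting tool, the fifteen scores listed on this slide can be tallied into a text histogram, with the three measures of central tendency computed alongside. This is only a sketch using Python's standard library.

```python
from collections import Counter
from statistics import mean, median, mode

# The fifteen scores listed on the slide.
scores = [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 3, 3, 4, 5, 3]

freq = Counter(scores)
for score in sorted(freq):
    # One '#' per occurrence gives a rough text histogram.
    print(score, "#" * freq[score])

print("mean:", mean(scores))
print("median:", median(scores))
print("mode:", mode(scores))
```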
Stats Skewed Distributions • Positive skew: tail on the right • Negative skew: tail on the left • Impact to measures of central tendency? • Mode • Median • Mean • “Central tendency”
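One way to see the impact of skew on the three measures is to compute them on a positively skewed dataset. The task times below are hypothetical values invented for illustration: most participants finish quickly, and two slow participants form the right tail.

```python
from statistics import mean, median, mode

# Hypothetical task times in seconds (positively skewed by two slow participants).
times = [20, 22, 22, 22, 25, 27, 30, 35, 90, 120]

print("mode:", mode(times))
print("median:", median(times))
print("mean:", mean(times))
```

The mean is pulled furthest toward the tail, the median less so, and the mode stays at the peak, which is why the median is often reported for time on task.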
Stats Got Central Tendency, Now Variability • We must first understand variability • We tend to think of a mean as “it” or “everything” • Consider time on task as a metric • Measurement error • The tool or procedure can be an imprecise device • Starting and stopping the stop watch • Individual differences • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem • Unreliable • We are human, so if you test the same participant on different days, you might get a different time!
Stats Got Central Tendency, Now Variability • We must first understand variability • We tend to think of a mean as “it” or “everything” • Class scored 80 • School scored 76 • Many scores went into the mean score • Variability can be quantified • [draw]
Stats Variability is Standardized (~Normalized) • [Figure: normal curve with markers at 1, 2, and 3 std dev from the mean] • Your score is in the 50th percentile • Ethan and Madeline are the smartest kids in class • SAT scores: you saw your score, but how did they get a percentile? • Distribution is normal • Numerically, the distribution can be described by only: • Mean and standard deviation
Stats Empirical Rule • [Figure: normal curve with markers at 1, 2, and 3 std dev on each side of the mean] • Empirical rule = 68/95/99.7 • 68% of scores fall within Mean +/- 1 std dev • 95% of scores fall within Mean +/- 2 std dev • 99.7% of scores fall within Mean +/- 3 std dev
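The percentages in the empirical rule are rounded areas under the normal curve, and they can be checked directly. A minimal sketch using the standard normal distribution from Python's standard library (the function name is hypothetical):

```python
from statistics import NormalDist


def share_within(k):
    """Proportion of a normal distribution within k std dev of the mean."""
    z = NormalDist()  # standard normal: mean 0, std dev 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, f"{share_within(k):.1%}")
```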
Stats Clear on Normal Curves? • [Figure: normal curve, x-axis from 40% to 70%] • Represent a single dataset on a single measure for a single sample • Once data are normalized, you can describe the dataset simply by • Mean and standard deviation • E.g., 60% success rate with a std dev of 10%
Stats Randomly Sampled Think of data as coming from a population
Stats Things Can Happen By Chance Alone • Is this really two samples of 10 drawn from one population that may have these characteristics?
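A simulation makes the point concrete: draw two samples of 10 from the same population many times and count how often their means differ noticeably. The population values (mean 60, std dev 10) and the 5-point gap are assumptions chosen for illustration.

```python
import random


def chance_gap_rate(trials=10000, n=10, gap=5.0, seed=7):
    """Share of trials in which two samples of size n, drawn from the
    SAME population, have means that differ by at least `gap`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(60, 10) for _ in range(n)]
        b = [rng.gauss(60, 10) for _ in range(n)]
        if abs(sum(a) / n - sum(b) / n) >= gap:
            hits += 1
    return hits / trials


print(chance_gap_rate())
```

Even though there is no real difference, a sizable share of trials show a gap of 5 points or more, which is why an observed difference needs a significance test before it is believed.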
Stats Exercise • Pass out Yes / No cards • Procedure • I will give you a task • I will count (because it is hard to install a timer on DePaul PCs) • Make a Yes or No decision • Record count • Hold up the Yes / No card • Practice
Stats Exercise • Is the man wearing a red shirt? • Decide Yes or No • Note time • Hold up card • Ready? • Decide Yes or No • Note time • Hold up card
Stats 1 sec
Stats 2 sec