How Many Participants is Enough in a Usability Test? Dr. Bob Bailey, www.webusability.com
How many usability test participants do you think is the correct number? ____________
Number of Participants • Using the appropriate number of participants allows a usability test to accomplish its goals as efficiently as possible • If too many are used • Increased cost • Increased development time • If too few are used • Fail to detect some important problems • Reduce the usability of the product
Factors Influencing the Number of Participants • Phase in the development cycle • Design approach used (user-centered?) • The product’s life cycle • Prototype (fidelity level?) • New system • Existing system • Complexity of the product • Number of features • Number of fields/screens/windows/pages • Traditional, new or unique technologies
Factors Influencing the Number of Participants (continued) • Testers/Evaluators • Usability testing experience • Domain knowledge • Users • Diverse nature of the population (unique segments) • Required domain knowledge (much, little) • Frequency of performance in actual system • Number of users in the population (e.g., unique monthly visitors)
Unique Audience for Certain Federal Government Sites (1 Month)* • Treasury 11,700,000 • Department of Defense 8,300,000 • Health and Human Services 7,300,000 • NASA 5,200,000 • Department of Education 4,300,000 • Executive Branch 2,680,000 • Department of State 2,100,000 • Department of Labor 1,990,000 • Department of Energy 1,600,000 • FirstGov 1,380,000 • Central Intelligence Agency 914,000 • National Archives 894,000 *Nielsen//NetRatings, February 2003
Factors Influencing the Number of Participants (continued) • Overall task complexity • Easier: Find the per diem rate for Chicago • Harder: Determine how much tax you owe • Repercussions of task failure • Lose money • Lose time • Lose a life
Usability Testing Categories • Automated evaluations • Inspection evaluations • Expert reviews • Heuristic evaluations • Cognitive walkthroughs • Human performance testing • Usability lab (local) • Online (remote) • Operational evaluations • Online intercept surveys (ACSI) • Web analytics
Web Analytics • The direct monitoring and analysis of online user behavior and interactions with a website • Commercial web analytics products: • CoreMetrics Online Analytics • NetGenesis • WebTrends 7 • Omniture SiteCatalyst • WebSideStory • These toolkits anonymously capture and analyze website traffic volumes and track visitor behavior • Typical features include • Capturing and reporting of navigation paths through the site • Referral analysis, e.g., where users came from, what content they view, etc. • Geographic trends analysis (origin of user access)
Testing Levels • Level 1: Traditional inspection evaluations • Primary focus: To identify usability issues • Evaluation methods • Heuristic evaluations • Cognitive walkthroughs • Level 2: Algorithmic expert reviews with scenarios • Primary focus: To identify usability issues • Evaluators • Focus on 'algorithmic' (not heuristic) issues, e.g., black text on a white background • Use scenarios to stay focused on the most important tasks to identify usability issues • Level 3: Usability tests • Primary focus: To identify usability issues while observing participants • Use available participants • Use a set of test scenarios • Require participants to think aloud during testing (discussions) • Secondary objective: Collecting quantitative data
Testing Levels (continued) • Level 4: Usability tests • Primary focus: To identify usability issues while observing participants • Use representative participants • Use representative test scenarios • Participants provide feedback at the end of a scenario or end of the test • Objectives include • Collecting quantitative data to set a baseline, or compare with a baseline • Collecting qualitative data • Use findings to identify usability problems • Level 5: Usability tests (rigorous, tightly controlled) • Primary focus: To compare an existing product (website) with • A set of objectives • A previous iteration of the same product • A competing product • Use truly representative participants • Use truly representative test scenarios • Collect quantitative data to use for • Demonstrating the performance level (summative test) • Comparing with other test results • Secondary objective: To identify usability problems
How Many Participants are Needed?The Short Answer • Level 1: At least 4 evaluators • Level 2: At least 4 expert evaluators • Level 3: 6 participants (or as many as you can get) • Level 4: 12-15 participants • Level 5: 20 participants per group
Evaluator Effect • The 'evaluator effect' occurs when multiple evaluators • Evaluate the same interface using the same method, and • Detect markedly different sets of problems • One study found that the 'evaluator effect' for the three main evaluation methods was about the same ("they were all equally poor")
The Evaluator Effect (continued) • The ‘evaluator effect’ existed for • Both novice and experienced evaluators • Both cosmetic and severe problems • Both problem detection and severity assessment, and • Both simple and complex systems • The average agreement between any 2 evaluators ranged from 5% to 65%
Determining the Number of Evaluators (Virzi, 1990; Lewis, 1993) • The formula for calculating the number of evaluators needed to find a specific percentage of 'problems': 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • Problem discovery in inspection evaluations is consistent with this cumulative binomial probability formula
What is ‘p’? • Assume that all inspection evaluators together find 100 unique usability issues (duplicates are eliminated) • Assume that the average number of issues found by each evaluator was 30 • In this case: p = .30 (30/100) • How many evaluators will be needed to find 90% of the 100 usability issues?
Using 5 Evaluators and p = .30 • 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • 1 - (1 - .30)^5 • 1 - (.7)^5 • 1 - .17 • .83, or 83%
Using 6 Evaluators and p = .30 • 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • 1 - (1 - .30)^6 • 1 - (.7)^6 • 1 - .12 • .88, or 88%
Using 7 Evaluators and p = .30 • 1 - (1 - p)^n, where p = mean probability of detecting a problem and n = the number of evaluators • 1 - (1 - .30)^7 • 1 - (.7)^7 • 1 - .08 • .92, or 92%
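As a quick check on these worked examples, here is a minimal Python sketch of the same cumulative binomial calculation; the function names and the 90% target are illustrative choices, not something specified in the deck.

```python
import math

def discovery_rate(p, n):
    """Expected proportion of problems found by n evaluators: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

def evaluators_needed(p, target=0.90):
    """Smallest n such that 1 - (1 - p)^n reaches the target discovery rate."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# Reproduce the worked examples above (p = .30)
for n in (5, 6, 7):
    print(n, round(discovery_rate(0.30, n), 2))   # 0.83, 0.88, 0.92

print(evaluators_needed(0.30, target=0.90))       # 7 evaluators
```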
You Only Need to Test With 5 Users (Nielsen, 2000) • Nielsen suggested that "elaborate usability tests are a waste of resources" • "Collecting data from a single test subject enables a designer to learn almost a third of all there is to know about the usability of the design"
Only Five … (Nielsen continues) • "Testing a second potential user adds some new insights, but not nearly as much as the first user did" • "When the third, fourth, and fifth users are added, less and less new information is learned" • "After the fifth user, you are wasting your time by observing the same findings repeatedly but not learning much new"
Iterative Design • The following graphic "clearly shows that you need to test with at least 15 users to discover all the usability problems in the design" • [Figure: Nielsen's curve of usability problems found vs. number of test users]
Average Detection Rate: Variations of 'p' • Spool (2001 - Live websites) - Average detection rate: 0.08 • Law and Hvannberg (2004 - Prototype): 0.09 • Lewis (1994 - Prototype): 0.16 • Walker, Takayama and Landay (2000 - Prototypes): 0.21 • Nielsen and Landauer (1993 - Prototypes): 0.31 • Virzi (1990 - Prototypes): 0.36 • Woolrych and Cockton (2002 - Prototypes): 0.43 • Nielsen (1992 - Prototypes) • Novice evaluators: 0.29 • Usability specialists with no domain experience: 0.46 • Usability specialists with domain experience: 0.61 • Jacobsen, Hertzum and John (1998 - Prototypes): 0.52
The True Number of Problems • Some people erroneously assume that the total number of issues found by all evaluators equals the total number of problems in the interface • Because some problems go undetected by every evaluator, the true p-value is far less than the original .31 [reduce by half] • Only about half of the proposed usability issues are actual usability problems [reduce by half, again]
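To make the impact of this adjustment concrete, the hedged sketch below reuses the earlier formula, halves Nielsen and Landauer's p = .31 twice as the slide suggests, and recomputes the evaluators needed for 90% discovery; the halving factors are the slide's rough rule of thumb, not measured values.

```python
import math

def evaluators_needed(p, target=0.90):
    """Smallest n such that 1 - (1 - p)^n reaches the target discovery rate."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

p_reported = 0.31                 # Nielsen and Landauer (1993)
p_adjusted = p_reported / 2 / 2   # halve twice, per the slide: about 0.08

print(evaluators_needed(p_reported))   # 7 evaluators at p = .31
print(evaluators_needed(p_adjusted))   # about 29 evaluators at p ≈ .08
```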
Level 3 Usability Testing • Primary focus: To identify usability issues while observing participants • Use fairly representative participants • Use test scenarios (presented by UTE) • Have participants think aloud during testing (many probes and discussions) • Collect 'soft' quantitative data to help identify scenarios with the most usability problems
Usability Testing: Levels 4 and 5 [Now we need to get serious about the number of participants!]
Level 4 Usability Testing • Primary focus • To compare against usability objectives • To set a baseline • To further identify usability issues • Use representative participants (based on a screener) • Use test scenarios (presented by UTE) • Participants provide observations or feedback at the end • Of a scenario, or • Of the test • Enables • The effective collection of both quantitative and qualitative data • Making inferences from the sample to the population
Samples and Populations • Testers usually do not have access to an entire population of users • Too large • Not willing to be measured • Measurement process is too • Expensive or • Time-consuming • To estimate some population characteristic (e.g., the average time to click a link) • Take a sample, and • Compute a quantity (a statistic) • ‘Samples’ help testers understand the characteristics of a ‘population’
Confidence Intervals • One good way to determine how well a sample reflects the population is to use 'confidence intervals' • A confidence interval is an estimate of the range of values that most likely contains the true population value • Typically, the interval is constructed so that there is a 95% chance it contains the true population value
Using Confidence Intervals • You just finished a usability test • You had 5 participants attempt a task in a new web application • All 5 of the participants completed the task • You announce this success to the development team and your supervisor • Your supervisor asks, “OK, this is great with 5 users, but what are the chances that 50 or 1000 or 10,000 will have the same 100% completion rate?”
Using Confidence Intervals (continued) • If five out of five users complete a task, you can be 95% confident that • The completion rate in the overall user population could be as high as 100% • But it also could be as low as 48% • In other words, when this web application is used by real users • All of them could successfully complete the task, or • Over half (52%) could fail the task
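The 48% figure can be reproduced with an exact (Clopper-Pearson) binomial confidence interval; the sketch below uses scipy and assumes that interval method, since the slides do not name one (an Adjusted Wald interval would give a somewhat different lower bound).

```python
from scipy.stats import beta

def exact_binomial_ci(successes, trials, confidence=0.95):
    """Clopper-Pearson (exact) confidence interval for a completion rate."""
    alpha = 1 - confidence
    lower = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return lower, upper

low, high = exact_binomial_ci(5, 5)
print(round(low, 2), round(high, 2))   # about 0.48 and 1.00
```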
Confidence Level • The confidence level is the percent-likelihood statement that accompanies the width of the confidence interval • It is usually set at 95% • A confidence level of 95% means that if the test were repeated many times, about 5 out of 100 of the resulting confidence intervals would NOT contain the true population value
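As an illustration of that repeated-sampling interpretation, here is a minimal simulation sketch; the true completion rate of 0.80 and sample size of 20 are invented numbers chosen only to make the simulation run.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
true_rate, n_users, trials = 0.80, 20, 10_000

covered = 0
for _ in range(trials):
    successes = rng.binomial(n_users, true_rate)
    # 95% Clopper-Pearson interval, as in the previous sketch
    lower = beta.ppf(0.025, successes, n_users - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(0.975, successes + 1, n_users - successes) if successes < n_users else 1.0
    covered += lower <= true_rate <= upper

print(covered / trials)   # roughly 0.95 or slightly higher (exact intervals are conservative)
```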
Usability Testing Level 5 • Primary focus: To compare an existing product (website) • Against a set of objectives • With a previous iteration of the same product • With a competing product • Secondary objective: To identify usability issues • The test procedure is very tightly controlled • Use truly representative participants • Use test scenarios • Collect quantitative and qualitative data • Demonstrates the performance level (summative test) • Allows comparison with other test results • Suggests some changes to the product
Power of the Test • The ‘power’ of a statistical test is a measure of its ability to detect a difference when there is one • Sample size is one of the main factors used to determine the power of a test
Comparing the Average ‘Click Time’ with a Competitor • After posting the website with an average click time of 10 seconds, management found a competing website that had reduced the average click time to 8 seconds • The website was redesigned with the goal of reducing the average click time to 6 seconds
The ‘Null’ and Alternative Hypotheses • The null hypothesis • Example: The average time to click on the correct link will be the same when compared with the competitor’s homepage • The alternative hypothesis • Example: The average time to click on the correct link when using the redesigned website will be reliably faster than the competitor’s
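If it helps to see how such a comparison might be evaluated, here is a hedged sketch of a one-sided two-sample t-test on simulated click times; the group means, standard deviation, and group size of 20 are all invented for illustration and are not from the slides.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Hypothetical click times in seconds, 20 participants per group
redesigned = rng.normal(loc=6.0, scale=2.0, size=20)
competitor = rng.normal(loc=8.0, scale=2.0, size=20)

# One-sided Welch t-test: is the redesigned site reliably faster?
t_stat, p_value = ttest_ind(redesigned, competitor,
                            equal_var=False, alternative='less')
print(t_stat, p_value)   # reject the null hypothesis if p_value < .05
```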
Required Number of Participants • Enough to be reasonably sure that you can detect a reliable difference if one exists • But not so many participants that small and unimportant differences are detected
Increasing Power • Consider ways to increase statistical power so that testers do not miss something important • Improving power • Increase the sample size - Keep in mind that if power is already high, increasing the sample size will do little or nothing • Increase the alpha level (.05 rather than .01) • Increase the acceptable effect size (try to identify larger differences) • Narrow the variance • Select randomly from actual users • Use the same tester for all participants • Exert greater control over all variables while testing • Make measurements that are more realistic and precise • Use ‘same subject’ (within subjects) tests where possible
Power-Sample Size Calculator: Participants Required per Group • http://www.health.ucalgary.ca/~rollin/stats/ssize/n1.html
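Since the linked calculator may no longer be reachable, a rough equivalent can be sketched with statsmodels; the effect size (Cohen's d = 0.5), alpha of .05, and power of 80% are conventional illustrative assumptions, not figures from the deck.

```python
from statsmodels.stats.power import TTestIndPower

# Participants required per group for a two-sample t-test that can detect
# a "medium" standardized difference (Cohen's d = 0.5) with 80% power
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(round(n_per_group))   # about 64 per group
```

Under these assumptions roughly 64 participants per group are needed; the figure of about 35 per group on the next slide corresponds to assuming a larger detectable effect (around d ≈ 0.7).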
How Many Participants are Needed?The Final Answer • Level 1: 4-8 evaluators • Level 2: 4-8 expert evaluators • Level 3: 4-8 participants in each iteration • Level 4: 12-15 participants, or the number needed to ensure acceptable confidence intervals • Level 5: “35” per group, or the number needed to ensure sufficient testing power