User Testing: Choosing Participants, Designing the Test, and Analyzing Data

CS 160: Lecture 16 Professor John Canny Spring 2004

Outline • More on user testing • Choosing participants • Designing the test • Basics of quantitative methods • Collecting data • Analyzing the data

Jakob Nielsen Why do User Testing? • Can’t tell how good or bad UI is until? • people use it! • Other methods are based on evaluators who: • may know too much • may not know enough (about tasks, etc.) • Summary Hard to predict what real users will do

Choosing Participants • Representative of eventual users in terms of • job-specific vocabulary / knowledge • tasks • If you can’t get real users, get approximation • system intended for doctors • get medical students • system intended for electrical engineers • get engineering students • Use incentives to get participants

Incentives • Paying every subject a fixed amount can be too expensive. • A small chance at a large reward is a good incentive – e.g. subjects get a 1/n chance to receive an MP3 player. • Offer to tell them the results ofthe study. • Free software for software companies. • Some courses require participation asresearch subjects.

Ethical Considerations • Sometimes tests can be distressing • users have left in tears (embarrassed by mistakes) • You have a responsibility to alleviate • make voluntary with informed consent • avoid pressure to participate • will not affect their job status either way • let them know they can stop at any time • stress that you are testing the system, not them • make collected data as anonymous as possible

Ethical Considerations • Get human subjects approval if needed – typically if results are going to be published. • Universities always have humansubjects panels, so do governmentagencies.

User Test Proposal • A report that contains • objective • description of system being testing • task environment & materials • participants • methodology • tasks • test measures • Get approved & then reuse for final report

Qualitative vs. Quantitative Studies • Qualitative: What we’ve been doing so far: • Contextual Inquiry: trying to understand user’s tasks and their conceptual model. • Usability Studies: looking for critical incidents in a user interface. • In general, we use qualitative methods to: • Understand what’s going on, look for problems, or get a rough idea of the usability of an interface.

Qualitative vs. Quantitative Studies • Quantitative: • Use to reliably measure something. • Requires us to know how to measure it. • Examples: • Time to complete a task. • Average number of errors on a task. • Users’ ratings of an interface *: • Ease of use, elegance, performance, robustness, speed,… * - You could argue that users’ perception of speed, error rates etc is more important than their actual values.

Quantitative Methods • Very often, we want to compare values for two different designs (which is faster?). This requires some statistical methods. • We begin by defining some quantities to measure - variables.

Variable types • Independent Variables: the ones you control • Aspects of the interface design • Characteristics of the testers • Discrete: A, B or C • Continuous: Time between clicks for double-click • Dependent variables: the ones you measure • Time to complete tasks • Number of errors

Deciding on Data to Collect • Two types of data • process data • observations of what users are doing & thinking • bottom-line data • summary of what happened (time, errors, success…) • i.e., the dependent variables

Process Data vs. Bottom Line Data • Focus on process data first • gives good overview of where problems are • Bottom-line doesn’t tell you where to fix • just says: “too slow”, “too many errors”, etc. • Hard to get reliable bottom-line results • need many users for statistical significance

The “Thinking Aloud” Method • Need to know what users are thinking, not just what they are doing • Ask users to talk while performing tasks • tell us what they are thinking • tell us what they are trying to do • tell us questions that arise as they work • tell us things they read • Make a recording or take good notes • make sure you can tell what they were doing

Thinking Aloud (cont.) • Prompt the user to keep talking • “tell me what you are thinking” • Only help on things you have pre-decided • keep track of anything you do give help on • Recording • use a digital watch/clock • take notes, plus if possible • record audio and video (or even event logs)

Break

Some statistics • Variables X & Y • A relation (hypothesis) e.g. X > Y • We would often like to know if a relation is true • e.g. X = time taken by novice users • Y = time taken by users with some training • To find out if the relation is true we do experiments to get lots of x’s and y’s (observations) • Suppose avg(x) > avg(y), or that most of the x’s are larger than all of the y’s. What does that prove?

Significance • The significance or p-value of an outcome is the probability that it happens by chance if the relation does not hold. • E.g. p = 0.05 means that there is a 1/20 chance that the observation happens if the hypothesis is false. • So the smaller the p-value, the greater the significance.

Significance • For instance p = 0.001 means there is a 1/1000 chance that the observation would happen if the hypothesis is false. So the hypothesis is almost surely true. • Significance increases with number of trials. • CAVEAT: You have to make assumptions about the probability distributions to get good p-values. There is always an implied model of user performance.

Normal distributions • Many variables have a Normal distribution • At left is the density, right is the cumulative prob. • Normal distributions are completely characterized by their mean and variance (mean squared deviation from the mean).

Normal distributions • The difference between two independent normal variables is also a normal variable, whose variance is the sum of the variances of the distributions. • Asserting that X > Y is the same as (X-Y) > 0, whose probability we can read off from the curve.

Analyzing the Numbers • Example: prove that task 1 is faster on design A than design B. • Suppose the average time for design B is 20% higher than A. • Suppose subjects’ times in the study have a std. dev. which is 30% of their mean time (typical). • How many subjects are needed?

Analyzing the Numbers • Example: prove that task 1 is faster on design A than design B. • Suppose the average time for design B is 20% higher than A. • Suppose subjects’ times in the study have a std. dev. which is 30% of their mean time (typical). • How many subjects are needed? • Need at least 13 subjects for significance p=0.01 • Need at least 22 subjects for significance p=0.001 • (assumes subjects use both designs)

Analyzing the Numbers (cont.) • i.e. even with strong (20%) difference, need lots of subjects to prove it. • Usability test data is quite variable • 4 times as many tests will only narrow range by 2x • breadth of range depends on sqrt of # of test users • This is when online methods become useful • easy to test w/ large numbers of users (e.g., Landay’s NetRaker system)

Lies, damn lies and statistics… • A common mistake (made by famous HCI researchers *) • Increasing n (number of trials) by running each subject several times. • No! the analysis only works when trials are independent. • All the trials for one subject are dependent, because that subject may be faster/slower/less error-prone than others. * - making this error will not help you become a famous HCI researcher .

Statistics with care: • What you can do to get better significance: • Run each subject several times, compute the average for each subject. • Run the analysis as usual on subjects’ average times, with n = number of subjects. • This decreases the per-subject variance, while keeping data independent.

Measuring User Preference • How much users like or dislike the system • can ask them to rate on a scale of 1 to 10 • or have them choose among statements • “best UI I’ve ever…”, “better than average”… • hard to be sure what data will mean • novelty of UI, feelings, not realistic setting, etc. • If many give you low ratings -> trouble • Can get some useful data by asking • what they liked, disliked, where they had trouble, best part, worst part, etc. (redundant questions)

B A Using Subjects • Between subjects experiment • Two groups of test users • Each group uses only 1 of the systems • Within subjects experiment • One group of test users • Each person uses both systems

Between subjects • Two groups of testers, each use 1 system • Advantages: • Users only have to use one system (practical). • No learning effects. • Disadvantages: • Per-user performance differences confounded with system differences: • Much harder to get significant results (many more subjects needed). • Harder to even predict how many subjects will be needed (depends on subjects).

Within subjects • One group of testers who use both systems • Advantages: • Much more significance for a given number of test subjects. • Disadvantages: • Users have to use both systems (two sessions). • Order and learning effects (can be minimized by experiment design).

Example • Same experiment as before: • System B is 20% slower than A • Subjects have 30% std. dev. in their times. • Within subjects: • Need 13 subjects for significance p = 0.01 • Between subjects: • Typically require 52 subjects for significance p = 0.01. • But depending on the subjects, we may get lower or higher significance.

Experimental Details • Order of tasks • choose one simple order (simple -> complex) • unless doing within groups experiment • Training • depends on how real system will be used • What if someone doesn’t finish • assign very large time & large # of errors • Pilot study • helps you fix problems with the study • do 2, first with colleagues, then with real users

Reporting the Results • Report what you did & what happened • Images & graphs help people get it!

Summary • Use real tasks & representative participants • Be ethical & treat your participants well • Want to know what people are doing & why • i.e., collect process data • Using bottom line data requires more users to get statistically reliable results

User Testing: Choosing Participants, Designing the Test, and Analyzing Data

User Testing: Choosing Participants, Designing the Test, and Analyzing Data

Presentation Transcript

BCB 444/544

Lecture S1: Sample Lecture

16.360 Lecture 13

Gene Finding and HMMs

Materials for Lecture 20

Materials for Lecture 17

Materials for Lecture 20

Materials for Lecture 21

QCD

Lecture 3

Lecture 20

Lecture 4

SOA Part1 Lecture 4

Lecture 3 Outline

Intro to APUSH

Lecture 6

Physics at the Tevatron Lecture III

6.096 Lecture 10

Cold atoms

Lecture 18

Lecture 2.6: Matrices*

“Elementary Particles” Lecture 6