

Top 7 Testing Pitfalls

Presented live November 18, 2009

Featuring Guest Star: Ronny Kohavi

GM, Microsoft Experimentation Platform

Admin Note:

Attendees will also get a copy of these slides + an on-demand mp3 of this via email on Thursday afternoon, November 19th


WhichTestWon.com

First: Why Bother Testing?

-> ‘Best Practices’, standard Web design templates, and marketers’ “gut” often FAIL tests.

-> For previously untested sites, testing gives an average ~40% conversion lift.

-> Tests can help you generate better quality leads or sales – not just more conversions.


Agenda

  • Intro & controlled experiments in one slide

  • Examples: you’re the decision maker

  • Seven pitfalls

  • Q&A

    Pitfalls based on KDD 2009 paper: http://exp-platform.com/ExPpitfalls.aspx by Thomas Crook, Brian Frasca, Ronny Kohavi, and Roger Longbotham


Our Experience at Microsoft

  • The Experimentation Platform started at Microsoft in 2006

  • Experiments ran on 20 Microsoft properties, including MSN home pages in several countries, MSN Money, MSN Real estate, www.microsoft.com, store.microsoft.com, support.microsoft.com, Office Online, www.xbox.com, several marketing sites, and Windows Genuine Advantage

  • Large experiments run with tens of millions of users

  • Multiple experiments have projected annual improvements of over $1M each


Controlled Experiments in One Slide

  • Concept is trivial

    • Randomly split traffic between two (or more) versions

      • A (Control)

      • B (Treatment)

    • Collect metrics of interest

    • Analyze  

  • Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)

  • Must run statistical tests to confirm differences are not due to chance
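The three steps on this slide (split, collect, analyze) can be sketched in a few lines. This is only an illustration under stated assumptions: the hash-based 50/50 assignment, the experiment name `demo-exp`, and the choice of a two-proportion z-test are ours, not something the slides prescribe.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "demo-exp") -> str:
    """Deterministically map a user to 'A' (Control) or 'B' (Treatment), 50/50.

    Hashing the user id keeps a returning user in the same variant.
    The experiment name is a hypothetical salt used for illustration.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: 2% vs 3% conversion with 10,000 users per variant.
    z, p = two_proportion_z_test(200, 10_000, 300, 10_000)
    print(f"z = {z:.2f}, p = {p:.2g}")
```

A small p-value here is what "not due to chance" means in practice; it says nothing about whether the metric itself is the right one, which is the subject of Pitfall 1.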


Examples

  • Three experiments that ran at Microsoft

  • All had enough users for statistical validity

  • OEC: the Overall Evaluation Criterion

  • See how many you get right

    • Three choices are:

      • A wins (the difference is statistically significant)

      • A and B are approximately the same (no stat sig diff)

      • B wins


Office Online

Test new design for Office Online homepage

OEC: Clicks on revenue generating links (red below)

A

B

Is A better, B better, or are they about the same?


Office Online

  • B was 64% worse

  • The Office Online team wrote

    A/B testing is a fundamental and critical Web services… consistent use of A/B testing could save the company millions of dollars


MSN UK Hotmail experiment

Hotmail module on the MSN UK home page


MSN UK Hotmail experiment

A: When user clicks on email, Hotmail opens in same window

B: Open hotmail in separate window

Trigger: only users that click in the module are in experiment (no diff otherwise)

OEC: clicks on home page (after trigger)

Is A better, B better, or are they about the same?


UK Hotmail

  • For those in the experiment, clicks on MSN Home Page increased +8.9%

  • <0.001% of users in B wrote negative feedback about the new window


Data Trumps Intuition

  • We distribute experiment reports widely at Microsoft

  • Someone who saw the report wrote

    This report came along at a really good time and was VERY useful. I argued this point to my team (open Live services in new window from HP) just some days ago. They all turned me down. Funny, now they have all changed their minds.


MSN Home Page Search Box

OEC: Clickthrough rate for Search box and popular searches

A

B

  • Differences:

  • A has taller search box (overall size is the same), has magnifying glass icon, “popular searches”

  • B has big search button

Is A better, B better, or are they about the same?


Search Box

  • No statistically significant difference

  • Insight

    Stop debating, it’s easier to get the data


Hard to Assess the Value of Ideas: Data Trumps Intuition

  • At Amazon, half of the experiments failed to show improvement

  • QualPro tested 150,000 ideas over 22 years

    • 75 percent of important business decisions and business improvement ideas either have no impact on performance or actually hurt performance…

  • Based on experiments with ExP at Microsoft

    • 1/3 of ideas were positive ideas and statistically significant

    • 1/3 of ideas were flat: no statistically significant difference

    • 1/3 of ideas were negative and statistically significant

  • Our intuition is poor: 2/3 of ideas do not improve the metric(s) they were designed to improve. Humbling!


The HiPPO

  • The less data, the stronger the opinions

  • Our opinions are often wrong – get the data

  • HiPPO stands for the Highest Paid Person’s Opinion

  • Hippos kill more humans than any other (non-human) mammal (really)

  • Don’t let HiPPOs in your org kill innovative ideas. ExPeriment!

  • We give out these toy HiPPOs at Microsoft


Is Software Just Hard? NO!

  • Doctors have been taking the HiPPocratic Oath and promising “no harm,” yet many beliefs were wrong for hundreds of years

  • For centuries, an illness was thought to be a toxin

  • Opening a vein and letting the sickness run out was the best solution – bloodletting

  • One British medical text recommended bloodletting for acne, asthma, cancer, cholera, coma, convulsions, diabetes, epilepsy, gangrene, gout, herpes, indigestion, insanity, jaundice, leprosy, ophthalmia, plague, pneumonia, scurvy, smallpox, stroke, tetanus, tuberculosis, and for some one hundred other diseases

  • Physicians often reported the simultaneous use of fifty or more leeches on a given patient. Through the 1830s the French imported about forty million leeches a year for medical purposes


Bloodletting (2 of 2)

Lancet

  • President George Washington had a sore throat and doctors extracted 82 ounces of blood over 10 hours (35% of his total blood), causing anemia and hypotension. He died that night

  • Pierre Louis did an experiment in 1836 that is now recognized as one of the first clinical trials, or randomized controlled experiment. He treated people with pneumonia either with

    • early, aggressive bloodletting, or

    • less aggressive measures

  • At the end of the experiment, Dr. Louis counted the bodies. They were stacked higher over by the bloodletting sink


Agenda

  • Intro & controlled experiments in one slide

  • Examples: you’re the decision maker

  • Seven pitfalls

  • Q&A


Pitfall 1: Wrong Success Metric

Remember this example?

OEC: Clicks on revenue generating links (red below)

A

B


Pitfall 1: Wrong OEC

  • B had drop in the OEC of 64%

  • Were sales correspondingly less also?

  • No. The OEC is valid only if the conversion rate from a click to a purchase is similar in A and B

  • The price was shown only in B, sending more qualified purchasers to the pipeline

  • Lesson: measure what you really need to measure, even if it’s difficult!


Pitfall 2: Incorrect Interval Calculation

  • Confidence Intervals (CI) are a great way to summarize results that have variability

  • Example: 95% CI for conversion rate might be 2.8%-3.2% (mean of 3.0% +/- 0.2%), which improved from 1.8%-2.2%

  • Business users prefer percent effect: 2% to 3% is a 50% improvement in conversion!

  • How can we provide a confidence interval on the 50%?


Pitfall 2: Incorrect Interval Calculation (cont)

  • You can’t just convert the confidence interval to a percent effect because the denominator is a random variable (we have a ratio of means)

  • Use Fieller’s formula for an exact confidence interval on the percent effect

    • More complex formula, but that’s why we have computers (and statisticians who figured this out in 1954)

    • Note: the confidence interval is not always symmetric around the mean in this case
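As a rough sketch of what Fieller's interval looks like for a percent effect, the function below solves the quadratic that defines the interval for a ratio of two means. It assumes the two means are independent and approximately normal, and it takes standard errors as inputs; the function name and the example inputs (taken from the 2% and 3% conversion rates with ±0.2% intervals on the previous slide) are our own illustration, not the presenters' implementation.

```python
import math


def fieller_percent_effect_ci(mean_c, se_c, mean_t, se_t, z=1.96):
    """Confidence interval for (mean_t / mean_c - 1).

    Assumes independent, roughly normal sample means (no covariance term).
    Solves (mean_t - theta * mean_c)^2 = z^2 * (se_t^2 + theta^2 * se_c^2)
    for theta, the ratio of means, then subtracts 1 to get a percent effect.
    """
    a, b = mean_t, mean_c
    va, vb = se_t ** 2, se_c ** 2
    denom = b * b - z * z * vb
    if denom <= 0:
        # Control mean is not distinguishable from zero: no bounded interval.
        raise ValueError("interval is unbounded for these inputs")
    half = z * math.sqrt(b * b * va + a * a * vb - z * z * va * vb)
    lower = (a * b - half) / denom - 1
    upper = (a * b + half) / denom - 1
    return lower, upper


if __name__ == "__main__":
    # Control 2.0% +/- 0.2%, Treatment 3.0% +/- 0.2% (95% intervals).
    se = 0.002 / 1.96
    low, high = fieller_percent_effect_ci(0.02, se, 0.03, se)
    print(f"percent effect: 50% (95% CI {low:.1%} to {high:.1%})")
```

With these inputs the interval runs from about +33% to about +70% around the 50% point estimate, which illustrates the asymmetry noted above.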


Pitfall 3: Using Standard formulas for Standard Deviation

  • Many metrics for online experiments cannot use the standard statistical formulas

  • Example: Click-through rate = clicks/page-views

  • The standard statistical approach would assume this would be approximately Bernoulli

  • However, the true standard deviation is commonly larger than the Bernoulli formula suggests because of independence violations (page views from the same user are correlated)

  • Solution: Bootstrap or the delta method
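To make the gap concrete, here is a minimal sketch comparing the naive Bernoulli standard error with a delta-method standard error that treats the user, not the page view, as the independent unit. The simulated data (per-user click propensities drawn from a beta distribution, 1 to 40 page views per user) is invented purely for illustration.

```python
import math
import random


def ctr_standard_errors(users):
    """users: list of (clicks, page_views), one tuple per user.

    Returns (ctr, naive_se, delta_se).
    """
    n = len(users)
    mean_c = sum(c for c, _ in users) / n
    mean_v = sum(v for _, v in users) / n
    ctr = mean_c / mean_v
    total_views = sum(v for _, v in users)
    # Naive: every page view is treated as an independent Bernoulli trial.
    naive_se = math.sqrt(ctr * (1 - ctr) / total_views)
    # Delta method: CTR is a ratio of two per-user means.
    var_c = sum((c - mean_c) ** 2 for c, _ in users) / (n - 1)
    var_v = sum((v - mean_v) ** 2 for _, v in users) / (n - 1)
    cov = sum((c - mean_c) * (v - mean_v) for c, v in users) / (n - 1)
    delta_var = (var_c - 2 * ctr * cov + ctr ** 2 * var_v) / (n * mean_v ** 2)
    return ctr, naive_se, math.sqrt(delta_var)


def simulate_users(n_users=5000, seed=7):
    """Hypothetical traffic: users differ in how likely they are to click."""
    rng = random.Random(seed)
    users = []
    for _ in range(n_users):
        views = rng.randint(1, 40)
        propensity = rng.betavariate(0.5, 9.5)  # mean click propensity 5%
        clicks = sum(rng.random() < propensity for _ in range(views))
        users.append((clicks, views))
    return users


if __name__ == "__main__":
    ctr, naive_se, delta_se = ctr_standard_errors(simulate_users())
    print(f"CTR={ctr:.4f} naive SE={naive_se:.5f} delta SE={delta_se:.5f}")
```

Because clicks cluster within users, the delta-method standard error comes out larger than the naive one in this simulation; when every user has a single page view the two nearly coincide.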


Best Practice: Ramp-up

  • Ramp-up

    • Start an experiment at 0.1%

    • Do simple analyses to make sure no egregious problems can be detected

    • Ramp-up to a larger percentage, and repeat until desired percent (e.g., 50%)

  • Big differences are easy to detect because the minimum sample size is inversely proportional to the square of the effect we want to detect

    • Detecting 10% difference requires a small sample and serious problems can be detected during ramp-up

    • Detecting 0.1% requires a population 100^2 = 10,000 times bigger
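The quadratic relationship can be checked with a common rule of thumb, n = 16σ²/Δ² users per variant (roughly 80% power at a 5% significance level). The rule of thumb and the 5% baseline conversion rate are assumptions made for this sketch; the slide does not give a formula.

```python
def users_per_variant(baseline_rate: float, relative_change: float) -> float:
    """Approximate users needed per variant via the 16 * sigma^2 / delta^2 rule.

    baseline_rate: conversion rate of the Control (assumed, e.g. 0.05).
    relative_change: smallest relative effect to detect (0.10 means 10%).
    """
    sigma2 = baseline_rate * (1 - baseline_rate)  # variance of a Bernoulli metric
    delta = baseline_rate * relative_change       # absolute difference to detect
    return 16 * sigma2 / delta ** 2


if __name__ == "__main__":
    big = users_per_variant(0.05, 0.10)     # detect a 10% change
    small = users_per_variant(0.05, 0.001)  # detect a 0.1% change
    print(f"10% effect: {big:,.0f} users; 0.1% effect: {small:,.0f} users")
    print(f"ratio: {small / big:,.0f}x")
```

Under these assumptions a 10% change needs about 30,400 users per variant, while a 0.1% change needs about 304 million, the 10,000-fold increase stated on the slide.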


Pitfall 4: Combining Data when Percent to Treatment Varies

  • Simplified example: 1,000,000 users per day

  • For each individual day the Treatment is much better

  • However, cumulative result for Treatment is worse

  • This is called Simpson’s Paradox
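A small worked example shows how this can happen. The counts below are illustrative numbers chosen for this sketch (1,000,000 users per day, with 1% of traffic in Treatment on the first day and 50% on the second); they are not read off the slide's table.

```python
# (conversions, users) per variant per day; illustrative numbers only.
days = {
    "Friday":   {"C": (20_000, 990_000), "T": (230, 10_000)},    # 1% in Treatment
    "Saturday": {"C": (5_000, 500_000),  "T": (6_000, 500_000)}, # 50% in Treatment
}


def daily_rate(day: str, variant: str) -> float:
    conversions, users = days[day][variant]
    return conversions / users


def pooled_rate(variant: str) -> float:
    """Conversion rate after naively summing both days together."""
    conversions = sum(d[variant][0] for d in days.values())
    users = sum(d[variant][1] for d in days.values())
    return conversions / users


if __name__ == "__main__":
    for day in days:
        print(f"{day}: C={daily_rate(day, 'C'):.2%} T={daily_rate(day, 'T'):.2%}")
    print(f"Pooled: C={pooled_rate('C'):.2%} T={pooled_rate('T'):.2%}")
```

Treatment wins on each day (2.30% vs 2.02%, then 1.20% vs 1.00%), yet the pooled rates are 1.22% for Treatment and 1.68% for Control, because most Treatment users come from the lower-converting day.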


Pitfall 5: Not Filtering out Robots

  • Internet sites can get a significant amount of robot traffic (search engine crawlers, email harvesters, botnets, etc.)

  • Robots can cause misleading results

    • Most concerned about robots with high traffic (e.g. clicks or PVs) that stay in Treatment or Control

    • We’ve seen one robot with > 600,000 clicks in a month on one page (and it was executing JavaScript)


Pitfall 5: Not Filtering out Robots (cont)

  • Identifying robots can be difficult

    • Some robots identify themselves through the UserAgent

    • Many look like human users and execute Javascript

    • Use heuristics to ID and remove robots from analysis

      (e.g. more than 100 clicks in an hour)

    • Ongoing research. No silver bullet
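The click-rate heuristic mentioned above could look something like the following. This is a deliberately simple sketch: the event format (user id plus Unix timestamp), the function names, and the use of fixed clock-hour buckets rather than a sliding window are all assumptions for illustration.

```python
from collections import Counter


def flag_robots(click_events, max_clicks_per_hour=100):
    """Return the set of user ids with more than the threshold in any clock hour.

    click_events: iterable of (user_id, unix_timestamp_seconds).
    """
    per_user_hour = Counter(
        (user_id, int(ts) // 3600) for user_id, ts in click_events
    )
    return {
        user_id
        for (user_id, _hour), clicks in per_user_hour.items()
        if clicks > max_clicks_per_hour
    }


def remove_robots(click_events, robots):
    """Drop every event from a flagged user before analysis."""
    return [(u, ts) for u, ts in click_events if u not in robots]


if __name__ == "__main__":
    events = [("human", 10 * i) for i in range(5)]
    events += [("bot", 20 * i) for i in range(150)]  # 150 clicks within one hour
    robots = flag_robots(events)
    print(robots, len(remove_robots(events, robots)))
```

A threshold like this only catches high-volume robots, which are the ones the slides flag as most likely to skew Treatment or Control.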


Effect of Robots on A/A Experiment

  • Each hour represents clicks from thousands of users

  • The “spikes” can be traced to single “users” (robots)


Pitfall 6: Invalid or Inadequate Instrumentation

  • Validating initial instrumentation

    • Logging audit – compare the experiment’s observations with the system of record

    • A/A experiment – run a “mock” experiment where users are randomly assigned to two groups but users get Control in both

      • Expect about 5% of metrics to be statistically significant

      • P-values should be uniformly distributed on the interval (0,1) and no p-values should be very close to zero (e.g. <0.001)

    • Many of our “customers” initially fail one of these tests
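The "about 5%" expectation for A/A experiments can be illustrated by simulation. The sketch below runs many mock A/A experiments on synthetic data (both groups share the same assumed 10% conversion rate) and counts how often a two-proportion z-test reports p < 0.05. The test choice, sample sizes, and rate are assumptions for illustration only.

```python
import math
import random


def p_value_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_p_values(n_experiments=500, users_per_group=1000, rate=0.1, seed=42):
    """Both groups get the Control experience, so any difference is noise."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_experiments):
        conv_a = sum(rng.random() < rate for _ in range(users_per_group))
        conv_b = sum(rng.random() < rate for _ in range(users_per_group))
        p_values.append(
            p_value_two_proportions(conv_a, users_per_group, conv_b, users_per_group)
        )
    return p_values


p_values = simulate_aa_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share of A/A runs with p < 0.05: {share_significant:.1%}")
```

If a real A/A run on a live system shows far more than 5% of metrics as significant, or p-values piled up near zero, that points to an instrumentation or assignment problem rather than a real effect.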


Pitfall 7: Insufficient Experimental Control

  • Must make sure the only difference between Treatment and Control is the change being tested

  • Plot shows hourly click-through rate for Control and Treatment on the MSN Home Page

  • Headlines were supposed to be the same in both

  • One headline was different for one 7-hour period, significantly changing the result


Summary

  • It is hard to assess the value of ideas

    • Get the data by experimenting because data trumps intuition

    • Examples are humbling

    • Avinash Kaushik wrote: “…the power of: Controlled Experiments. I am convinced this is God’s gift to online humanity.”

  • Replace the HiPPO with an OEC

    • Make sure the org agrees on what you are optimizing (long-term lifetime value)

    • Experts are often wrong. Doctors did bloodletting for centuries (and they swear by the HiPPOcratic oath)

  • Watch out for the pitfalls


Resources for Deeper Dive

  • Controlled Experiments on the Web: Survey and Practical Guide, in the Data Mining and Knowledge Discovery journal, 2009: http://exp-platform.com/hippo_long.aspx

  • KDD 2009 Tutorial: http://exp-platform.com/tutorial.aspx

  • Contact: ronnyk@ microsoft dot you know what


Live Q&A with Anne, Ronny, Roger


Thanks, plus 2 free offers:

  • Online Testing Awards

    • Free entries

    • Everyone eligible

    • Deadline this Friday!

  • http://whichtestwon.com/awards

  • Free Landing Page Evaluation Offer

    Click to schedule: http://whichtestwon.com/widerfunnel/lp.html

