
A good test is one of the most effective tools to learn what works; a bad test is one of the easiest ways to learn something that isn’t true. Sorting the good from the bad mostly comes down to consistently following some simple rules, and resisting the urge to take shortcuts.

A lot of us had to learn the proper habits the hard way, with messy and misleading results. Here’s a little refresher of how to avoid the most common testing pitfalls and mistakes.

1) You’re trying to change too many things at once.

It goes like this: you have an advocacy form and want to optimize it for page completion. You want to try a different headline, but your coworker wants to test removing some of the required fields, and a third coworker thinks that the image might be having a negative impact and wants to change it to something new. These all seem like helpful ideas, so you make all three of these changes at once, and then you see a positive impact on your conversion rate. Great, it worked!

Except, you don’t know if it was the headline, the fields, or the image that helped. And what if you make all of these changes and your performance remains… exactly the same? Does that mean none of these elements matter?

When we limit ourselves to testing one variable at a time, we can see the impact of those changes much more clearly.

Granted, sometimes it isn’t possible to test only one element at a time—for example, if we’re testing a full redesign of a site. Not ideal, but hey, it’s life. Just remember that the results of your test could stem from any number of elements and can’t be attributed to just one change. If you choose to forge ahead with a full redesign or other package test, consider testing elements of the new page later on, which could allow for some additional, incremental improvements.

2) You don’t have a hypothesis for your test.

This is a big one (all of these are big ones tbh). The hypothesis should be the cornerstone of every experiment you run, no matter how big or how small.

The purpose of a hypothesis is to make sure we’re thinking critically about what we want to test and why, and to make sure we actually learn something from the test. It should also define your metric or metrics for measurement.

Here’s our formula:

Changing [control variable] to [test variation] will improve [test objective] because [reason you think the change will help].

For example, say your donate button is blue, and you think a red button will be more eye-catching. Your hypothesis would be:

Changing the button color from blue to red will increase clicks because the red color will attract users’ attention more effectively than blue.

Of course it’s pretty simple, but if you can’t clearly explain which metric your variable will impact and how it will do so, it’s best to rethink things before rolling out a test.

3) You didn’t randomize your audience.

This one is enough to crush your spirit. An organization is plugging away at their email campaign test, sending out messages, seeing great results, and then someone figures out—usually by chance—that… all of the donors made it into one test segment, and all of the prospects ended up in the other. Or the A/B testing tool was re-randomizing the full audience for every email send. Or the randomizing tool didn’t actually randomize the test at all.

Suddenly, all those amazing test results are invalidated, and I am curled up on a couch breathing into a paper bag, and maybe I should call my mom.

Whenever we test, we want a random mix of our audience, so that one segment (such as your donors—a highly responsive audience) doesn’t skew results. (To be clear, testing just to a donor audience is fine, as long as you know that your results will only apply to donors, and both your test groups are a random sample of this audience.)

There are a number of tools available that will split your audience into random groups for you—your CRM may include one, or you can install third-party software like Google Optimize directly on your site. Whichever you use, do some spot checking once it’s set up to make sure it’s actually randomizing your audience.

When in doubt, for email testing, we recommend downloading your audience (whoever you’re testing on, whether the full file or just a segment!) and manually randomizing them. Then upload those beautiful, pristine, definitely randomized files as static groups to message. For smaller lists, you can randomize in Excel. For larger files of 100k or more, you can make friends with a developer or data analyst (we’re super nice, I promise!), and they can do the heavy splitting for you.
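
If you go the developer/analyst route, the split can be a few lines of Python with pandas. Here’s a minimal sketch, assuming a hypothetical supporters.csv export with one row per recipient:

```python
import pandas as pd

# Hypothetical export of the audience you want to test on (one row per recipient)
audience = pd.read_csv("supporters.csv")

# Shuffle the entire file once; a fixed random_state makes the split reproducible
shuffled = audience.sample(frac=1, random_state=42).reset_index(drop=True)

# Cut the shuffled list down the middle into two static test groups
midpoint = len(shuffled) // 2
shuffled.iloc[:midpoint].to_csv("test_group_a.csv", index=False)
shuffled.iloc[midpoint:].to_csv("test_group_b.csv", index=False)
```

Because every row had the same chance of landing in either file, donors, prospects, and everyone else should be spread evenly across both groups. Spot check a few known segments to confirm.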

4) You check your test too many times.

Coincidences happen. Every time you take your results and run them through that shiny significance calculator, you’re performing a new statistical test on a set of outcomes. Probability tells us that when you test something over and over again, the chances that you’ll get a random outlier—a.k.a. a fake result—are much higher.

When we analyze test results, we’re measuring for confidence level. We typically look for differences in our results that are statistically significant with 95% confidence. What this really means is that if the change we made had no effect at all, we’d expect to see a difference this large less than 5% of the time, so we can be reasonably confident that the difference we observe is due to the variable(s) we were testing.

But it also means there is a 5% chance that it’s just meaningless noise. When you check your results over and over again, you’re more likely to land in that 5% and be tempted to run with the results you like. The more times you check the numbers, the more you’re inviting randomness to the table. And no one likes randomness at their table.
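
You can see this effect with a quick simulation. The sketch below (all numbers made up for illustration) runs an “A/A test,” where both groups share the same true conversion rate, so any significant result is a false positive. It compares checking the results every day against checking only once at the end:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Made-up test conditions: both groups truly convert at 1%, so any
# "statistically significant" difference is a false positive by definition.
true_rate = 0.01
daily_visitors = 1_000   # per group, per day
days = 30
simulations = 1_000

peeking_hits = 0    # tests declared "significant" at any daily check
end_only_hits = 0   # tests declared "significant" at the one planned check

for _ in range(simulations):
    a = rng.binomial(1, true_rate, size=(days, daily_visitors))
    b = rng.binomial(1, true_rate, size=(days, daily_visitors))
    significant_at_some_peek = False
    for day in range(1, days + 1):
        n = day * daily_visitors
        conv_a, conv_b = int(a[:day].sum()), int(b[:day].sum())
        table = [[conv_a, n - conv_a], [conv_b, n - conv_b]]
        _, p_value, _, _ = stats.chi2_contingency(table)
        if p_value < 0.05:
            significant_at_some_peek = True
    if significant_at_some_peek:
        peeking_hits += 1
    if p_value < 0.05:      # p_value from the final day = the one planned check
        end_only_hits += 1

print(f"False positives when peeking daily: {peeking_hits / simulations:.1%}")
print(f"False positives when checking once: {end_only_hits / simulations:.1%}")
```

The exact percentages vary from run to run, but the single planned check stays close to the 5% you signed up for, while daily peeking produces false positives several times as often.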

We know you’d never give in to fake results, but it takes discipline, and it’s hard! Your experiment is live, people are responding. Juicy data is coming in, and you are itching to take a look and see if you can validate your hypothesis.

Do yourself a favor: DON’T LOOK. At least, not until your test has achieved its estimated sample size or duration (more on that in the next section).

It’s a trap that even experienced testers can fall into. If you get in the habit of checking your tests constantly, declaring winners at the first whiff of statistical significance, and abandoning duration and sample size measurements, you’re setting yourself up for failure.

5) Your sample size is too small.

Even a massive difference in performance doesn’t mean much if our audience isn’t big enough. Going from 3 conversions to 6 is a 100% increase, but it probably isn’t giving you a statistically significant result.  
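
To put a number on it: suppose (hypothetically) those conversions came from 1,000 recipients in each group. A quick exact test shows just how far that 100% lift is from statistical significance:

```python
from scipy.stats import fisher_exact

# Hypothetical small test: 1,000 recipients per group
table = [[3, 997],   # control: 3 conversions, 997 non-conversions
         [6, 994]]   # variant: 6 conversions, 994 non-conversions

_, p_value = fisher_exact(table)
print(f"p-value: {p_value:.2f}")   # roughly 0.5, nowhere near the 0.05 threshold
```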

Before we run a test, we use something called Minimum Detectable Effect (MDE) to determine the ideal audience size and duration for our testing. This is an estimate of how much variation we would be able to statistically detect given our baseline conversion rates and audience size. It boils down to one easy concept: to detect small changes (usually on small rates), you need a big audience; big changes, or changes on bigger rates, can be detected with a smaller audience.

Estimating MDE helps us determine whether a test has a realistic chance of producing a usable result. It starts by collecting some information about what we’re testing. Often, that includes previous campaign performance, website metrics, or other useful performance data. From there, we come up with a “baseline” conversion rate, upon which we are trying to improve.

Handy sample size calculators like this one from Optimizely demonstrate this balance between audience size and baseline conversion numbers.

For example, let’s say we’re trying to move the needle on fundraising email response rate, which is 0.06% for most organizations. If we can split our audience into test groups of 650,000 people, we’ll need to detect about a 20% increase in order to trust our results. That’s a pretty big jump in response rates but not totally out of the question. But if our audience is only 60,000, then we’d need to cause a 100% improvement in response rate to see statistical significance. That’s a stretch, which means this particular test is probably not worth running for this particular audience.
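
If you’d rather script this gut check than plug numbers into a calculator, here’s a simplified, back-of-the-envelope version (a classical two-proportion test at 95% confidence and 80% power). Tools like Optimizely’s bake in their own statistical assumptions, so the exact figures will differ from the ones above, but the shape is the same: smaller audiences can only detect bigger changes.

```python
from math import sqrt
from scipy.stats import norm

def relative_mde(baseline_rate: float, n_per_group: int,
                 alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest relative lift a classical two-sided test of two proportions
    can reliably detect, given a baseline rate and a per-group sample size."""
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_power = norm.ppf(power)           # ~0.84 for 80% power
    absolute_mde = (z_alpha + z_power) * sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return absolute_mde / baseline_rate

baseline = 0.0006  # 0.06% email response rate
for group_size in (650_000, 250_000, 60_000):
    print(f"{group_size:>7,} per group -> detectable lift of about {relative_mde(baseline, group_size):.0%}")
```

Under these simplified assumptions, the 650,000-person groups land right around that 20% figure; calculators with stricter assumptions will typically demand even larger lifts from the smaller audiences.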

I follow a general rule of thumb to keep my MDE under 25%—which is about the most change I can reasonably expect to see from most adjustments we make under normal conditions. There are exceptions to this rule, but it can get complicated quickly, so let’s just stick with 25% for now.

These sample size estimates are just that—estimates. They are merely starting points to gut check the test you want to run and determine whether you’ll even have a chance at seeing statistical significance. Sometimes (a lot of times) you’ll find that you need more users in your test to have a big enough sample to measure a pattern of change. There are some options: you can run the test for longer, expand your audience, or maybe add some additional pages or sections of your site to the test to expose more users to the test conditions.

So there you have it—and honestly, there are a lot of other ways your test can go wrong! But don’t despair—if you steer clear of these pitfalls, you’ll soon learn that there are few things as satisfying as getting clean, reliable data from your test that can set you on a path toward a better page, email, or social post. So get out there and get testing! And if you need more inspiration or help analyzing your results, check out M+R’s free toolkit here.