I Wrote A Book On Conversion Optimization For Software Companies

Long story short: I wrote a book on conversion optimization, SEO, and related topics, for software companies.  You can buy it here (Kindle, iPad, Nook, PDF) or on Amazon (Kindle).

For the last couple of years, folks have been asking for me to write about A/B testing, conversion optimization, and whatnot in book form.  I’ve never done it, simply because the notion of spending months of work with a publisher to write a book that would (all things being equal) likely fail to earn-out a $5,000 advance seemed to be a silly thing to do just to put “published author” on my resume.  I love writing and I like teaching, don’t get me wrong, but writing as a profession always struck me as work, and not even particularly fun work.

The folks at Hyperink convinced me to give it a try, though.  They are basically trying to make Publishing 2.0 work as a business model: provide authors with design/editing/etc using a workflow which was invented by people who grew up on Google Docs rather than manual typewriters, and create books relevant to niche audiences partially by republishing existing essays and partially by supplementing them with new material.  (The upshot for the authors is that royalties are split more equitably than 93-7-but-with-accounting-practices-that-would-make-the-RIAA-proud.)

What It Includes

  • ~ 20 essays that originally appeared on my blog, covering selling software, software pricing, conversion optimization, A/B testing, SEO, and the like, mostly of interest to software companies
  • ~ 4 essays which are totally new, including one on reducing churn rates
  • a follow-up or two on how some experiments worked out after I had written them up… including never-before-seen tales of abysmal failure, because that sometimes teaches as much as the successes

Who Should Read This

  • Solo entrepreneurs running software businesses.  (I’d suggest actually having a working product — this book doesn’t cover product development, except when it is incidental to optimizing for marketing outcomes.)
  • Marketing / engineering / product folks at SaaS companies looking to get some ideas of things which engineers can build that will make meaningful differences for the business
  • Anybody who has ever thought “Rather than reading through 600 posts in chronological order, could you just distill your blog down into the best twenty posts and categorize them for me?  My time isn’t totally valueless.  And put them on my Kindle/iPad/etc so I can read them on a plane.”
  • My family.  (“You wrote a book?  I want to read it!  What is it about?”  “Conversion optimization for software websites.”  “I’ll pass!”)

Chapter List

  • Preface
    • Preface (new essay)
  • Selling Your Stuff
    • Introduction (new essay)
    • You Should Probably Send More Email Than You Do
    • Does Your Product Logo Actually Matter?
    • Dropbox-style Two-sided Sharing Incentives
    • Two-sided Referral Incentives Revisited! (new essay)
    • Engineering Your Way To Marketing Success
    • Selling Software To People Who Don’t Buy Software
    • Increase Your Software Sales
    • The Black Arts of SaaS Pricing
  • Increasing Conversions
    • Introduction (new essay)
    • Stripe And A/B Testing Made Me A Small Fortune
    • The Most Radical A/B Test I’ve Ever Done
    • Keeping The User Moving Towards Conversion
    • Practical Conversion Tips For Selling Software
    • Minor Usability Errors In Checkout Funnel = You Lose Lots Of Money
    • 10-Minute Tweaks to Boost Your Conversion
  • All About SEO
    • Introduction (new essay)
    • SEO for Software Companies
    • Strategic SEO for Startups
    • The Big Book of Getting People to Link to You
    • Developing Linkbait For a Non-Technical Audience
    • Why You Shouldn’t Pay Any SEO You Can Afford
  • Conclusion
    • Thanks for Reading, Lets Talk Churn Rates  (new essay)

Luckily, Hyperink Was In Charge Of Design, Not Me

If you’ve followed my blog or products for a while, you’re probably aware that I have the design sense of an addlebrained squirrel who fell into the Christmas eggnog and drowned.  Luckily, Hyperink took care of the book design and typesetting, so that it looks better on your e-reader or screen than anything I would have natively produced.  Here’s a sample:

Formats Available

In Which I Explicitly Ask For The Sale

If you generally enjoy my writing and think a curated collection of twenty essays on the topic of making more money for your software business is of interest to you, please buy the book.  (It is, as far as I know, $9.99 everywhere you can buy it, but vagaries of the publishing industry mean that I can’t guarantee that this is true for you.)  If you don’t want to buy it, don’t worry, I won’t think any less of you — enjoy the blog, come back for more next year.  If you buy the book and enjoy it, I’d encourage you to leave a review on Amazon, as folks are really keen on seeing them.

Note to other potential authors: the folks at Hyperink are Good People and were a pleasure to work with in the discussion and editing process.  If you’ve considered trying your hand at writing a book but, like me, thought the traditional publishing industry is largely toxic and exploitative by construction, I’d encourage you to give them a whirl.

P.S. I traditionally post a Year In Review for my businesses, covering what worked and what didn’t as well as statistics, shortly before Christmas.  See, for example, 2011’s edition.  I will do it again this year, but owing to some bookkeeping hold-ups, it will be shortly after Christmas rather than before.  May you and your families have peace, love, and health this Christmas and always.

Doubling SaaS Revenue By Changing The Pricing Model

Most technical founders abominably misprice their SaaS offerings to start out.  I’m as guilty of this as anyone, so I wrote up my observations about un-borking this as The Black Arts of SaaS pricing a few months ago.  (It went out to my mailing list — sign up and you’ll get it tomorrow.)  A few companies implemented advice in there to positive effect, and one actually let me write about it, so here we go:

Aligning Price With Customer Value

Server Density does server monitoring to a) give you peace of mind when all is well and b) alert you really darn quickly when all isn’t.  (Sidenote: If you run a software business, you absolutely need some form of server monitoring, because the application being down costs you money and trust.  I personally use Scout because of great Ruby integration options.  They woke me up today, as a matter of fact — apparently I had misconfigured a cronjob last night.)

Anyhow, Server Density previously used a pricing system much beloved by technical founders: highly configurable pricing.

Why do geeks love this sort of pricing?  Well, on the surface it appears to align price with customer success (bigger customers pay more money), it gives you the excuse to have really fun widgets on your pricing page, and it seems to offer low-cost entry options which then scale to the moon.

I hate, hate, hate this pricing scheme.  Let me try to explain the pricing in words so that you can understand why:

  • It costs $11 per server plus $2 per website.
  • Except if you have more than 10 servers it costs $8 per server plus $2 per website.
  • Except if you have more than 50 servers it costs $7 per server plus $2 per website.
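To see how those three rules combine, here is a minimal sketch in Python.  It assumes the discounted rate applies to every server once a threshold is crossed (the list above does not say whether the discount is all-units or marginal), and the function name is made up for illustration.

```python
def variable_monthly_price(servers: int, websites: int) -> int:
    """Monthly price in USD under the old per-server scheme.

    Assumption: once a threshold is crossed, the lower rate applies to
    every server, not only to the servers above the threshold.
    """
    if servers > 50:
        per_server = 7
    elif servers > 10:
        per_server = 8
    else:
        per_server = 11
    return servers * per_server + websites * 2


if __name__ == "__main__":
    for servers, websites in [(1, 1), (7, 1), (10, 1), (11, 1)]:
        price = variable_monthly_price(servers, websites)
        print(f"{servers} server(s), {websites} website(s): ${price}/month")
```

Under that assumption a single-server account with one website pays $13 and a 7-server account pays $79, matching the figures discussed below, while adding an eleventh server makes the bill go down (from $112 to $90).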

This is very complicated and does not align pricing with customer success.  Why not?

Pricing Scaling Linearly When Customer Value Scales Exponentially Is A Poor Decision

Dave at Server Density explained to me that their core, sweet-spot customer has approximately 7 servers, but that the per-server pricing was chosen to be cheap to brand-new single-server customers.  They were very concerned with competing with free.

Regardless of whether this wins additional $13 accounts, it clearly under-values the service for 7 server accounts, because their mission-critical server monitoring software in charge of paging the $10,000 a month on-call sysadmin to stop thousands of dollars of losses per minute only costs $79.  You don’t get 7x the value from server monitoring if you increase your server fleet by 7x, you get (at least) 50x the value.  After you get past hobby project you quickly get into the realms of a) serious revenue being directly dependent on the website, b) serious hard costs like fully-loaded developer salaries for doing suboptimal “cobble it together ourselves from monit scripts” solutions, and c) serious career/business reputational risks if things break.

Let’s talk about those $13 accounts for a moment.  Are $13 accounts for server monitoring likely to be experienced sysadmins doing meaningful work for businesses who will solve their own problems and pay without complaint every month?  No, they’re going to be the worst possible pathological customers.  They’ll be hobbyists.  Their servers are going to break all the time.  They’re going to misconfigure Server Density and then blame it for their server breaking all the time.  They’ll complain that Server Density costs infinity percent more than OSS, because they value their own time at zero, not having to e.g. pay salaries or account for a budget or anything.

My advice to Dave was that Server Density switch to a SaaS pricing model with 3~4 tiers segmented loosely by usage, and break with the linear charging.  The advantages:

  • Trivial to buy for non-technical stakeholders: name the plans correctly and they won’t even need to count servers to do things correctly.  (“We’re an enterprise!  Of course we need the Enterprise plan!”)
  • Predictable pricing.  You know that no matter what the sysadmins do this month, you’re likely to end up paying the same amount.
  • Fewer decisions.  Rather than needing to do capacity planning, gather data internally, and then use a custom-built web application to determine your pricing, you can just read the grid and make a decision in 30 seconds.
  • More alignment with business goals.  Unless you own a hosting company, “number of servers owned” is not a metric your CEO cares about.  It only tends to weakly proxy revenue.  Yes, in general, a company with 10 servers tends to have more commercial success than a company with 1 server, but there are plenty of single-server companies with 8 figures of revenue.

(Speaking of custom-built web applications to determine pricing, the best product with the worst pricing strategy is Heroku.  Enormously successful, but I’m pretty sure they could do better, and have been saying so for years.  All Heroku would have to do is come up with four tiers of service, attach reasonable dynos/workers/databases to them, and make that the core offering for 90% of new accounts.  You could even keep the actual billing model entirely intact: make the plans an abstraction over sensible defaults picked for the old billing model, and have the Spreadsheet Samurai page somewhere where power users and the sales team can find it.)

Ditching Linear Scaling In Favor Of A Plan Model

After thinking on my advice, Server Density came up with this redesign:

I love this.

  • The minimum buy-in for the service is now $99 a month, which will segment away customers who are less serious about their server uptime.
  • You now only need to make one decision, rather than needing to know two numbers (which you might not have access to at many of their customers).
  • The segmentation on users immediately triples the price for serious businesses using the service, irrespective of the size of their server fleet.  This is good because serious businesses generate a lot of money no matter how many servers they have.
  • Phone support will be an absolute requirement at many companies, and immediately pushes them into the $500 a month bucket.

My minor quibbles:

  • I still think it is underpriced at the top end.  Then again I say that about everything.
  • Did you notice the real Enterprise pricing?  (Bottom right corner, titled “More than 100?”) Like many SaaS services, Server Density will quote you a custom plan if you have higher needs.  Given that these customers are extraordinarily valuable to the business both for direct sales and for social proof, I might make this one a little more prominent.

Results From Testing: 100% Increase In Revenue

Server Density implemented an A/B test of the two pricing strategies using Visual Website Optimizer.

At this point, there’s someone in the audience saying “That’s illegal!”  That person is just plain wrong.  There is no carbon in a water molecule, and price testing is not illegal.

What if the fact of the price testing were discovered?  Not really that problematic: you can always offer to switch someone to the most advantageous pricing model for them.  Since most existing customers would pay less under variable pricing than they would under the above pricing grid, simply grandfathering them in on it removes any problem from people who have an actual stake in the business.  For new customers who get the new pricing grid but really, really feel that they should be a $13 a month account, you can always say “Oh, yep, we were testing.  I’ll give you the $13 pricing if you want it.”  (David from Server Density says that this is in fact what they did, three times, and had no lasting complaints.)

Most customers will not react like this because most customers do not care about price.  (Those that do are disproportionately terrible customers.  To quote David from Server Density, “We had the occasional complaint that pricing was too high but this was from users with either just a single server or very low cost VPSs where the cost of monitoring (even at $10/m) was more than the cost of the server.”)

Anyhow, where were we?  Oh yeah, making Server Density piles of money.  They requested that I not disclose the interval the test was conducted over, to avoid anyone reasoning back to their e.g. top-line revenues, but were OK with publishing exact stats otherwise.

Variable pricing: 150 free trial signups / 2161 visitors

Pricing plans: 113 free trial signups / 2153 visitors

At this point, variable pricing is clobbering the pricing plans (the plans get about 25% fewer signups, and the confidence that pricing plans are inferior at maximizing trials is over 99%)… but let’s wait until this cohort reaches the end of the trial period, shall we?
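For readers who want to check that kind of figure themselves, a pooled two-proportion z-test is one common way to compare two signup rates.  The following is a sketch of that textbook test using only the numbers above; it is not necessarily the calculation Visual Website Optimizer performs, and the function names are illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-proportion z-test.

    Returns (rate_a, rate_b, z, one_sided_confidence), where the confidence
    is the normal-approximation probability that A's true rate exceeds B's.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (rate_a - rate_b) / std_err
    return rate_a, rate_b, z, normal_cdf(z)


if __name__ == "__main__":
    # A = variable pricing, B = pricing plans (free trial signups / visitors)
    rate_a, rate_b, z, confidence = two_proportion_z(150, 2161, 113, 2153)
    print(f"variable pricing: {rate_a:.2%}   pricing plans: {rate_b:.2%}")
    print(f"relative drop in signups: {1 - rate_b / rate_a:.1%}")
    print(f"z = {z:.2f}, one-sided confidence = {confidence:.1%}")
```

With these inputs the z-score comes out around 2.3 and the one-sided confidence at roughly 99%, in line with the figure quoted above.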

Server Density does not make credit card capture mandatory.  (I might suggest revising that decision as another test.)

Variable pricing: 23 credit cards added / 2161 visitors

Pricing plans: 18 credit cards added / 2153 visitors

That’s a fairly similar attachment rate for credit cards.  But collecting credit cards doesn’t actually keep the lights on — the important thing is how much you successfully charge them, and that is highly sensitive to the prices.

Variable pricing: $420 monthly revenue added / 2161 visitors   (~$0.19 a visitor)

Pricing plans: $876 monthly revenue added / 2153 visitors  (~$0.41 a visitor)

+100% revenue (and revenue-per-visitor) for that cohort.  Pretty cool.
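The three funnel stages above can be recomputed in a few lines.  This sketch only restates the published numbers; the dictionary layout and function names are illustrative choices, not anything from Server Density or the testing tool.

```python
cohorts = {
    "variable pricing": {"visitors": 2161, "trials": 150, "cards": 23, "monthly_revenue": 420},
    "pricing plans": {"visitors": 2153, "trials": 113, "cards": 18, "monthly_revenue": 876},
}


def funnel_rates(cohort: dict) -> dict:
    """Express each funnel stage per visitor so the two arms are comparable."""
    visitors = cohort["visitors"]
    return {
        "trial_rate": cohort["trials"] / visitors,
        "card_rate": cohort["cards"] / visitors,
        "revenue_per_visitor": cohort["monthly_revenue"] / visitors,
    }


def relative_lift(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old` (1.0 means +100%)."""
    return new / old - 1.0


if __name__ == "__main__":
    rates = {name: funnel_rates(cohort) for name, cohort in cohorts.items()}
    for name, r in rates.items():
        print(f"{name}: trials {r['trial_rate']:.2%}, cards {r['card_rate']:.2%}, "
              f"revenue/visitor ${r['revenue_per_visitor']:.2f}")
    lift = relative_lift(rates["pricing plans"]["revenue_per_visitor"],
                         rates["variable pricing"]["revenue_per_visitor"])
    print(f"revenue-per-visitor lift: {lift:+.0%}")
```

The plans arm loses at the trial stage, is close at the credit card stage, and comes out a bit over 100% ahead on revenue per visitor.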

(P.S. Mathematically inclined readers might get puzzled at the exact revenue numbers — how do you get $876 from summing $99, $299, and $499?  Long story short: Server Density is a UK company and there are conversion issues from GBP to USD and back again.  They distort the exact revenue numbers a wee bit, but it comes out in the wash statistically.)

We Doubled Revenue?!  Can We Trust That Result?

Visual Website Optimizer displays on the dashboard that it is 93% confident that there was indeed a difference between the two.  (The reported confidence intervals are $0.19 +/- $0.08 and $0.41 +/- $0.16.  How to read that?  Well, draw your bell curves and do some shading, but for a qualitative description, “Our best guess is that we doubled performance, but there’s some room for error in these approximations.  What would those errors look like?  Well, calculus happens, here we go: it is more likely that the true performance improvement is more than ~3x than it is that there was, in fact, no increase in performance.”)

Truth be told, I don’t know if I trust that confidence in improvement or not, because I don’t understand the stats behind it.  I understand the reported confidence intervals and what they purport to measure, I just don’t know of a defensible way to get the data to that point.  The ways I’m aware of for generating confidence intervals for averages/aggregates of a particular statistic (like, say, “Average monthly revenue per visitor of all visitors who would ever sign up under the pricing plan”) all have to assume something about the population distribution.  One popular assumption is “Assume normality”, but that’s known to be clearly wrong: no plausible arrangement of numbers makes X% $99, Y% $299, Z% $499 into a normal distribution.  Even in absence of a rigorous test for statistical confidence, though, there’s additional information that can’t be put in this public writeup which causes me to put this experiment in the “highly probable win” column.  (If my Stats 102 is failing me and there’s a simple test I am neglecting, feel free to send me an email or drop a mention in the comments.)
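One family of tests that does not assume any particular distribution is the permutation test: shuffle the arm labels many times and count how often chance alone produces a gap at least as large as the observed one.  The sketch below shows the mechanics.  The per-visitor revenue lists in it are toy numbers invented purely for illustration (they are not Server Density's data, which was not published per customer), and the function names are hypothetical.

```python
import random


def revenue_per_visitor(revenues):
    """Mean revenue over every visitor, including the many who paid nothing."""
    return sum(revenues) / len(revenues)


def permutation_p_value(group_a, group_b, rounds=2000, seed=0):
    """One-sided permutation test for 'B earns more per visitor than A'.

    Returns the share of random relabelings whose B-minus-A gap is at least
    as large as the observed gap (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = revenue_per_visitor(group_b) - revenue_per_visitor(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        gap = revenue_per_visitor(pooled[n_a:]) - revenue_per_visitor(pooled[:n_a])
        if gap >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (rounds + 1)


if __name__ == "__main__":
    # Toy data, NOT real results: one entry per visitor, monthly revenue in USD.
    toy_variable = [0] * 190 + [13] * 10
    toy_plans = [0] * 195 + [99, 99, 299, 299, 499]
    p = permutation_p_value(toy_variable, toy_plans)
    print(f"one-sided permutation p-value on toy data: {p:.3f}")
```

A small p-value says a gap that large rarely arises from relabeling alone.  With only a handful of paying customers per arm the answer is dominated by where the few large accounts land, which is a reason to stay cautious about any single confidence number.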

Note that since this is a SaaS business that is monthly revenue added.  Increasing your monthly revenue from a particular cohort by $450 increases your predicted revenue over the next year by in excess of $4,000.  (The calculation is dependent on your churn rate.  I’m just making a wild guess for Server Density’s, biased to be conservative and against their interests.)
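A minimal sketch of that kind of projection, assuming a constant monthly churn rate.  The 5% used here is a placeholder picked for illustration; it is not Server Density's actual churn, and the function name is made up.

```python
def projected_revenue(monthly_revenue: float, monthly_churn: float, months: int = 12) -> float:
    """Total revenue a cohort pays over `months` months.

    Assumes a fixed fraction `monthly_churn` of the remaining revenue
    cancels at the end of each month, and ignores upgrades and downgrades.
    """
    surviving = monthly_revenue
    total = 0.0
    for _ in range(months):
        total += surviving
        surviving *= 1.0 - monthly_churn
    return total


if __name__ == "__main__":
    for churn in (0.0, 0.05, 0.10):
        total = projected_revenue(450, churn)
        print(f"monthly churn {churn:.0%}: ${total:,.0f} over 12 months")
```

With no churn at all, $450 a month is $5,400 a year; at the placeholder 5% monthly churn the total is still a little over $4,100.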

Now, in the real world, SaaS customers’ value can change over time via plan upgrades and downgrades, and one would ideally collect many months of cohort analyses to see how things shook out.  Unfortunately, in the equally real world which we actually live in, sometimes we have to reason from incomplete data.  If you saw a win this dramatic in your business and were wondering whether you could “take your winnings” now by adopting the new pricing across all new accounts, I would suggest informing that decision with what you previously know about customer behavior vis-a-vis number of servers over time.  My naive guess is that once a server goes into service it gets taken out of service quite rarely indeed and, as a consequence, most Server Density accounts probably have roughly static value and the few that change overwhelmingly go up.

And what about the support load?  Well, true to expectations, it has largely been from paid experts at larger companies, rather than from hobbyists complaining that they don’t get the moon and stars for their $13 a month.  Dave was particularly impressed how many were happy to hop on a phone call to talk about requirements (which helps learn about the customer segments and develop the future development and marketing roadmaps) — meanwhile, the variable pricing customers largely a) don’t want to talk about things and b) need a password reset right now WTF is taking so long.

Server Density expects that their plan customers will be much less aggravating to deal with in the future, but it is still early days yet and they don’t have firm numbers to back up that impression.

Testing Pricing Can Really Move The Needle For Your Business

Virtually no one gets pricing right on the first try.

(When I wrote the pricing grid for Appointment Reminder I snuck a $9 plan in there, against my own better judgment, and paid for that decision for a year.  I recently nixed it and added a $199 plan instead.  Both of those changes have been nothing but win.)

Since you probably don’t have optimum pricing, strongly consider some sort of price testing.  If I can make one concrete recommendation, consider more radical “packaging” restructurings rather than e.g. keeping the same plan structure and playing around with the plan prices +/- $10.  (This means that, in addition to tweaking numbers, you find some sort of differentiation in features or the consolidated offering that you can use to segment a particular group of customers into a higher plan than they would otherwise be at numerically.)

For more recommendations, again, you probably want to be on my mailing list.  You’ll get an email today with a link to a 45 minute video about improving your app’s first run experience, the email about SaaS pricing tomorrow, and then an email roughly weekly or biweekly about topics you’ll find interesting.  Server Density is not the only company telling me that those emails have really been worth people’s time, but if they don’t appeal to your taste, feel free to unsubscribe (or drop me an email to tell me what you’d rather read) at any time.

Disclosure: Server Density is not a client, which is very convenient for me, because I’m not ordinarily at liberty to talk about doubling a client’s revenue.

Stripe And A/B Testing Made Me A Small Fortune

I’ve run software businesses for the last six years, all premised on the simple notion that if I provide value to customers they should pay me money.  The actual implementation of translating their desire to pay into money in my bank account was less simple… until I found Stripe.  They’re now up there with Twilio and Heroku in terms of “infrastructure companies which will totally change the way savvy software companies do business”, and if they ever get international processing nailed, I think they’ll probably take over the industry to Paypalian scales.

How do I love you Stripe, let me count the ways…

Well Thought Out API Design

Ever worked directly with the Paypal API?  Keith, my podcast co-host and somebody who routinely codes systems that process millions in payments, shudders when he mentions it.  The Paypal API is powerful and (fairly) reliable, but the experience of coding against it is absolutely maddening.  It is very much a legacy API which has to support decisions made at the dawn of the Internet which were largely driven by considerations not relevant to software developers or web entrepreneurs.

Stripe’s API is one of the best I’ve ever worked with:

  • It uses all the REST-y goodness that the web development community has come up with in the last few years.
  • The documentation is suitably comprehensive, organized for easy consumption, and screams “You will have this in a secondary window when you’re coding stuff that matters” rather than “This was designed as a 450 page PDF by a standards committee.”  The table of contents for any one of Paypal’s APIs is longer than all the docs for Stripe… and less useful.
  • There are several first-party libraries available, they work, and they feel like first-class citizens of their respective ecosystem.  Stripe-ruby is fantastic and feels like ruby.

Most Painless Integration Ever

As a direct consequence of having a really, really well-designed API, integration with Stripe was a breeze.  Getting credit card processing hooked into Bingo Card Creator (authorization, charging, accounting, the works) was 29 lines of code.  I signed up for Stripe, got started with the API documentation, and successfully charged a real credit card from production three hours later.  They’ve got the fastest time-to-business-value of any API since Twilio.

One major reason Stripe integrates so easily is stripe.js.  Basically, if you’ve ever tried to charge credit cards before, you’re aware that there is a PCI-DSS standard out there and if, e.g., credit card numbers ever hit your hardware then you’re in for a world of painful compliance audits and ridiculous checking-boxes-for-the-sake-of-checking-boxes.  (“No, I don’t run my server on a server which sits in an unlocked room in a building the general public has access to.  Phew, dodged a bullet there.  Now excuse me while I go install some anti-virus software on my Ubuntu box and very diligently review my Nginx logs daily.”)

There are two major ways around PCI compliance:

  • You redirect people off-site for the transaction to e.g. Paypal or Google Wallet, and let the megacorps worry about it, then they redirect people back to you when they are done.  This is a poor user experience that often confuses customers and might decrease conversion rates.
  • You have an iframe or something capture their credit card on your site but actually submit it only to the payment processor.

Stripe.js is a very well-implemented “or something”, where JavaScript that they’ll provide for you hooks into your credit card form with trivial work.  (About 6 lines for me.)  When a user submits the form, you instruct Stripe.js to AJAX-y over to their servers and authorize the card.  Then you process the results in a callback.  This lets you verify e.g. that the card exists and is chargeable prior to submitting your form and executing your server-side purchase logic.  Stripe will then give you a token allowing you to securely charge the card for the authorized amount, and you can choose whether to do that or not on your server-side.  (For example, I perform other business logic validations first, and void the authorization if e.g. the user has already purchased the software.)

This means that their credit card details never hit your server.  Now, rationally speaking, if your server is insecure then the page the credit card form is hosted on is in the hands of the enemy, and you can no longer trust that Stripe is the only party which sees the credit cards.  However, PCI compliance has very few rational parts about it.  Stripe gets you past that hurdle with a minimum of pain.

This is really, really important for developers because you get end-to-end control of the user experience.  You don’t have to do a redirect off-site and you don’t have to have a garishly styled external iframe in the middle of your app.  You can slide a credit card form in whatever part of your workflow makes sense, have it feel organically like your app (because it is, actually, your app), avoid the Paypal/Google attempts to use your relationship with a customer to capture a new account for themselves, etc etc.  That has the potential to significantly increase revenue.  (More on that later.)

Amazing Support For Developers by Developers

So let’s say you happen to support a Ruby on Rails application coded by a novice web programmer in 2008.  Hey, it happens.  There are a lot of old gems required for the program to operate, somewhat creakily.  Let’s further suppose that this causes you to have a conflict with a dependency from an external API vendor, because the vendor doesn’t specify what version of the old gem to use with their ruby library.  If you mail support@, and tell them “When using a version of this gem four years out of date, your library dies on a particular line, because you use an API that doesn’t exist in the oldest versions of the library.  I can’t use the latest version of the library because it causes dependency conflicts.  What should I do?”, what would you expect them to say?

Here’s what I expected:  “Thanks for your email.  We can’t help you with coding your application.  You should use the latest version of the library.  Please see our FAQ at…”

Here’s what Stripe actually said:

 Hey Patrick,

Thanks for the report. I took a quick look:

$ git clone
$ git bisect start
$ git bisect good v1.0.3
$ git bisect bad v1.6.1
$ git bisect run ruby -rubygems -e '$:.unshift "lib"; require "stripe"; Stripe.api_key = "KEY"; begin; Stripe::Plan.all.count; rescue; exit 0; end; exit 1'

Suggests that 7563fd as the culprit. Looking at the log, this seems to
be around 1.3.0. Then:

$ git log v1.3.1..7563fd
$ git log v1.4.0..7563fd

So, looks like v1.4.0 is the first version that included that #body
interface change.

I just pushed stripe-ruby 1.5.22, which adds a dependency on 1.4.0 of

Thanks again for the heads up. Let us know if you run into anything else.

I am not easily emotionally moved by git command lines, but this is clearly somebody who understands me and what I need in life.  In addition to exactly diagnosing the problem (I was on rest-client 1.0.3, the most recent version was 1.6.1, and it would really have been compatible with anything after 1.4.0), he fixed it for everyone else.

(Sidenote: This is one of the very few times in my life where mailing support@ made me a better engineer.  “You can figure out which version of a library breaks your application by running your minimal failing test suite commit-by-commit, watching for exactly the commit where it fails, and then correlating that to the released version which will actually work for you.  But since that will take forever, use binary search instead.  And there exists a git command which will do this for you.”)
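The binary search idea behind that sidenote can be sketched in a few lines of Python.  This is an illustration of the technique, not of how git implements it; the version list is a made-up linear history rather than any library's real release list, and `works` stands in for whatever test you would actually run against each version.

```python
from typing import Callable, Sequence


def first_working(versions: Sequence[str], works: Callable[[str], bool]) -> str:
    """Return the earliest version for which `works` is True.

    Assumes `versions` is ordered oldest to newest, the oldest one is
    broken, the newest one works, and once a version works every later
    version works too.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] broken, versions[hi] works
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


if __name__ == "__main__":
    # Hypothetical history, for illustration only.
    history = ["1.0.3", "1.1.0", "1.2.0", "1.3.0", "1.3.1", "1.4.0", "1.5.0", "1.6.0", "1.6.1"]
    checked = []

    def works(version: str) -> bool:
        checked.append(version)
        return tuple(int(part) for part in version.split(".")) >= (1, 4, 0)

    print(first_working(history, works))  # prints 1.4.0
    print(f"versions tested: {checked}")
```

Here nine candidate versions are narrowed down with three test runs instead of up to seven, and the gap widens quickly as the history gets longer.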

That email was signed by the co-founder.  Patrick Collison, when he isn’t running a payments company, apparently found time to verse himself in arcane git commands.  I was practically vibrating with “These guys understand where I’m coming from.” after that, and they’ve not let me down since.

I’ve had exactly one serious problem with Stripe, in a year.  Their API broke for three transactions, due to a load balancer issue.  This caused their client library to return “Unspecified error, card not charged”, prompting my application to not deliver software to the user, but they were actually charged.  Clearly, that’s quite problematic.  They proactively got in touch with me about it, fixed the problem, and generally demonstrated competence and professionalism.  I gave away three free copies of the software and apologized profusely.  We haven’t had any recurrences since then, over about a thousand transactions.

Their Web Application Rocks

So in addition to programmatically charging cards, payment processors typically provide web interfaces.  They’re typically abysmal.  Paypal’s — and, again, I like Paypal — will take upwards of 15 seconds to find a transaction when you’re searching by its primary key, and it looks like it was written in 1996, principally because it was.

Stripe’s interface is pretty (don’t discount how much that matters, since you actually have to use it), snappy, responsive, and well-thought-out.  It has an awesomebox which, given 1234 as input, will quickly find every transaction with 1234 as the last four digits of the credit card, bring up the transaction for sale #1234 (an ID my code passed over with the transaction), find all of your $1,234.00 charges sorted by recency, etc.  There has to be someone in Stripe who actually runs a side-business on it, I swear.  That or they’re telepaths.

Refunds are one click.  (And also available with an API.  This has saved me tens of minutes versus Paypal, since I have to log in, find the transaction, and write a refund note manually to do a refund with them.  It also saves me a lot of frustration, as correlating Cindy Smith to a Paypal transaction is difficult, whereas in Stripe all I have to do is keep their authorization token around server-side and then refunding a transaction is as easy as!)

Each transaction has a programmer-comprehensible set of logs attached to it, so I can quickly debug application problems.

Oh, they also have an API sandbox, with credentials segregated from the production API, and which can be manipulated via both the web interface and the API trivially.  I think this is an absolute hard requirement for APIs which can actually touch the real world.  (It is one of my very few knocks against the Twilio API.)

Stripe Has Fair, Comprehensible, Comparatively Transparent Terms

Ever heard “Paypal turned off my account waily waily” or “Paypal froze my money waily waily”?  Most complaints about Paypal actually aren’t about the API, they’re complaints motivated by a) commerce is hard because of the amount of fraud on the Internet and b) Paypal doesn’t historically do a great job of giving you resolution options if its fraud detection is overly aggressive in your case.  (I actually believe that they’re worlds better on this than they are perceived to be: the one time Paypal had an overly aggressive fraud alert on my account I was able to resolve it with a single one-minute telephone call.)

Stripe asks for prior review of your business model but, in my experience, approves you automatically and then does the human review while you already have a set of working API keys.  They make transfers to your bank account 7 days after your customers pay you, to make an allowance for fraud/refunds/etc.  Seven days is impressively faster than most merchant accounts.

Stripe really shines in those rare cases where you need a human in the loop.  It’s still always a good idea to tell people in advance if you’re going to do something which will trip a fraud screen (e.g. open payments for a widely anticipated conference and then collect $X00,000 in $500 chunks from people in multiple countries — this will almost certainly get a Paypal account frozen).  A friend of mine — who has previously had issues with cleared payments getting filched by Google Checkout and then got Google’s customary /dev/null customer support — asked Stripe if an upcoming product launch would cause an account freeze.  “Thanks for contacting us.  Of course not.  You’re clearly legit.”  It’s like they found the sweet spot between “Computers can make decisions in lightning-speed at scale” and “Humans can actually be trusted with discretion.”

Developers Obsess About Price So I Guess I Have To Mention It

Stripe charges $0.30 + 2.9% per transaction, which is comparable with Paypal at low volumes.  This is frequently the #1 thing devs tell me they look for in their payment processor.  That is insane.  We sell products which have margins that come very close to 100%, and saving pennies on transactions to spend tens of thousands on integration costs (*) or to shave full percentages off our conversion rates is absolute madness.

* Think I’m exaggerating?  That’s about two weeks of dev time.  Trust me, you will not get a shopping cart integration done in two weeks with most payment providers.  Again, it took me three hours with Stripe.  I still can’t believe that.

I Extensively A/B Tested Stripe Against Off-Site Checkout And Found…

… that I should really not ship prototype shopping carts, even when I think it is really cool to get something out the door.

Back prior to the redesign of Bingo Card Creator, I tested Stripe on-site credit card payments (the interface for which I threw together with Twitter Bootstrap in ~45 minutes) against Paypal and Google Checkout.  Specifically:

Test #1: Paypal / Google Checkout vs. Paypal / Google Checkout / Stripe

Test #2: Paypal / Google Checkout / Stripe vs. Stripe

Test #3: All three vs. all three, with the difference being whether customers upgrading directly after a trial limitation had been reached were sent to the purchasing page or an in-app credit card form

All three of these tests were null results.  (i.e. No significant difference in aggregate purchases between either of the two options.  Interestingly, though, any time paying by credit card was an option, that was overwhelmingly the customer favorite.  When the choice is Paypal versus Google Checkout, I get 50/50.  When the choice is Paypal / Google Checkout / Credit Card, I get 5-5-90 or thereabouts.  That could be sensitive to the design of my pages, I don’t know — I tested e.g. re-ordering the buttons and that didn’t change things, but didn’t go on to test e.g. button copy or color.)

[Edit: Whoopsie!  A comment on HN sent me back to my A/B testing records.  Turns out Test #2, which I had misreported as PP/GC vs. Stripe but was actually PP/GC/Stripe vs. Stripe, was actually a weak 90% significant win for PP/GC/Stripe.  Test #3 was a weak 90% significant win for sending folks to the purchasing page rather than the hacky Bootstrap CC form.  Sorry for the misreporting earlier -- these were in the Big Book O' Failed Tests and I forgot to check the detailed reason for why.]

So, despite customers overwhelmingly choosing credit cards, adding that option via Stripe wasn’t capturing additional sales at the margin.  This was surprising to me, because it is received wisdom in the conversion rate optimization community that users hate off-site checkout.  I mentally tied a string around my finger to revisit the issue later.

I Followed Up On Earlier Testing And…

Earlier this year, after having decided to offer all three payment options full-time, I did an experimental website redesign, in an A/B test.  This gave me the opportunity to have my cobbled-together credit card form replaced by one done by a professional designer.  That experimental redesign was very, very wide-ranging and affected pretty much every stage of the software purchase funnel.  Results were mixed — some steps radically better, others worse — and netted out to no significant change in revenue.  Since the user experience was very improved, I adopted the redesign.

While I was looking at the stages of the purchasing funnel, I saw that the newly redesigned checkout experience didn’t really seem to motivate customers more or less than the old, ugly checkout experience, but users continue to overwhelmingly prefer credit card checkout either way.

Anyhow, some months later, I took a run at fixing the part of the funnel which had suffered most in the redesign.  For whatever reason, improvements in the usability of my application had made users much less likely to hit the free trial limitations.  This caused fewer of them to get taken into the purchasing pathway, after which point their experience was largely consistent across both versions of the site.

So I tested the stupidest thing that could possibly work to get more people to hit the trial limitations: decrease people’s free quota from 15 cards to 8 cards.  And I did that in an A/B test.  One line of code which tested, literally, two characters in the program.


Not only did the 8 card limit absolutely crush the 15 card limit (99% statistical confidence, 1.89% conversion rate instead of 1.04% conversion rate from free trials to paid accounts on a sample size in the 5,000s range), it did something which is fairly rare in my A/B tests: it caused synergistic effects.  Ordinarily, I operate on the Bayes-is-about-to-turn-over-in-his-grave assumption that two stages in the funnel are largely totally independent from each other.  So, for example, if stage #1 is “Did they hit the trial limitation?” and stage #2 is “Did they purchase the software once in the shopping cart?”, I default to expecting that a test which increases the number of people hitting the limitation will not meaningfully impact the conversion rate in the shopping cart.  This is because this assumption has previously been good enough to bet money on, at least in my business.

Well, this time I lost the bet… or I won, catastrophically.  It seems that the marginal prospects (with the between-8-and-15 cards needs)  hitting the trial limitation have very different behavior when exposed to the shopping cart than the will-hit-the-limit-regardless prospects.  I did half a dozen tests to isolate the exact cause (I’ll spare you the deep dive into bingo customer minutiae).  Suffice it to say there is a) a customer group which needs between 8 and 15 cards and b) they really, really like pretty checkouts.  (I’m guessing that I’ve probably captured significant business from a portion of the population which isn’t teachers, who make up about 60% of my customers typically, but haven’t done any qualitative surveys to figure out who these new folks are.)

So, anyhow, with 99% confidence of a huge increase, you adopt the change, right?  I did that back in May.

Since selling to elementary schoolteachers is highly seasonal, let’s look at year over year results.  All of these months have the new redesign in place for 2012, but the new trial limitation was implemented mid-May and default behavior by the end of the month.

May: +38% increase in sales

June: +108% increase in sales (in the dead of summer, my slowest season)

July: +33% (dang, only?)

If this change continues being motivational during the school year it will be worth several tens of thousands of dollars a year to me.  If not, drats, it only doubled my money on the redesign.  I like giving credit where credit is due, so:

  • The redesign that debuted as “Awesome for users, meh for the business” now retroactively looks like the best idea for this year.  Thanks Ashraful.  (Hire him.)
  • Stripe, which makes the purchasing part of the funnel possible now, is incredibly amazing.  (And now processing 90% of my transaction volume for this product.)
  • As much as I love the above two, I have to give most of the credit to making decisions on the basis of data.  I know I’m a broken record on this, but no matter how many times I say it, it doesn’t seem to change the behavior of many folks in the industry, so: A/B testing prints money.  So does having sufficient metrics in place such that one knows where the high priority places to A/B test are.

Incidentally, I do A/B testing with A/Bingo and measure test effectiveness throughout the funnel using KissMetrics, since A/Bingo won’t track multiple conversion types for a single test out of the box.  (Ben Kamens at the Khan Academy persuasively argues that fixing this would be a good idea.  It is on my list.)  Two years ago someone asked me whether I thought $150 a month for KissMetrics was worth it.  Ahem, yep!

Back To Stripe

Stripe is now my first choice for payment processing.  All of my new projects will start with Stripe and — maybe! — use Paypal if I get around to it.  (I don’t feel any impetus to migrate away from Paypal on BCC or Appointment Reminder — the code works and Paypal is, as mentioned, responsive when I have problems… but the CEO of eBay isn’t running git bisect for me if I have an API issue, so I feel no need to keep them in my plans forever.)

Two minor niggles mentioned for the purpose of completeness:

  • They occasionally expect me to be a better programmer than I am, by trusting me to do things correctly the first time.  (A customer had — I kid you not — a lightning strike hit her computer during checkout, and as a consequence the JS callback fired 36 times.  This resulted in 36 transactions, which Stripe processed without complaint.  Oops.  Server-side validation added.  Luckily, I caught the anomaly before my customer did, so I was able to refund and explain it to her prior to her bank asking for $1,078.20.)
  • “Authorize first, charge a second later” shows up on a lot of my customers’ online credit card statements as two separate charges until the first authorization gets voided, which can take days.  I’m almost certain this is not a Stripe issue and is, rather, a legacy payments infrastructure issue.  C’est la vie.  This causes about an email a month, and no customer has ever had a problem after I explain it.  (Editor’s note: Somebody from Stripe emailed me a work-around — just don’t feed Stripe.js a price and it won’t pre-auth the card.)
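
The duplicate-charge incident in the first bullet above is the classic case for a server-side guard.  What follows is only a minimal sketch in plain Ruby, assuming each checkout attempt carries a unique order token generated before the form is rendered.  The names ChargeGuard, charge_once, and order_token are hypothetical, and the block passed in stands in for whatever code actually creates the charge; none of this is Stripe's API or the validation that was actually added to BCC.

```ruby
require "set"

# Hypothetical guard: lets the charge block run at most once per order token.
# In a real application the seen-token set would live in the database behind
# a unique index, not in process memory.
class ChargeGuard
  def initialize
    @seen = Set.new
    @lock = Mutex.new
  end

  # Yields (and returns the block's value) only the first time a token is
  # seen. Returns :duplicate for every repeat submission of the same token.
  def charge_once(order_token)
    first_time = @lock.synchronize { @seen.add?(order_token) }
    return :duplicate unless first_time
    yield
  end
end

guard = ChargeGuard.new
charges = 0

# Simulate the JS callback firing 36 times for the same order.
results = 36.times.map do
  guard.charge_once("order-1234") do
    charges += 1
    :charged
  end
end

puts "charges created: #{charges}"
puts "duplicates rejected: #{results.count(:duplicate)}"
```

With the guard in place, the 36 simulated callbacks produce one charge and 35 rejected duplicates.
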
If you’re US-based, you can use Stripe, too, and they have my unreserved don’t-even-bother-looking-elsewhere recommendation.  If you’re not US-based, I feel your pain, and hope Stripe expands to your neck of the woods as quickly as possible.  (In the meanwhile, check out Paypal.)

Standard disclaimer: I occasionally write about companies which I use in my business and I feel are relevant to you guys.  Stripe isn’t a client.  I haven’t accepted anything of value from them… well, OK, technically speaking they have deposited $30,000 into my bank account, but you know what I mean.  (I think I also got mailed a hoodie at one point.)

I Redesigned My Software. Users: Thrilled. Conversion Rates: Up. Sales: Unchanged.

My oldest software product, Bingo Card Creator, is currently in maintenance mode.  For the last year and change, I’ve done very little to actively improve or market it — I just send emails, cut checks, and collect profits.  That was pretty much the plan for this year, too.

Then, I got an email out of the blue from Ashraful Sheik, a designer who had seen a years-old HN post by me about my design needs and wanted to see if I needed any work done.  I don’t usually rush to employ people who send me unsolicited emails, but I’m always happy to read emails, so I took a glance at his portfolio in case I could recommend him the next time one of my clients needed a designer.

I noticed that he had previously done a design for VLC Player.  It’s software that does… actually, I’m not really sure what it does, but I remember it from an HN thread waaaay back about them having SEO problems, and the reason I remember it was because I really liked their website design.  Simple, elegant, modern…  and very much not what Bingo Card Creator looked like.  So I mulled it over for a few minutes, figured I had a few days free in April, and asked Ashraful for a quote for a full-blown redesign of BCC.  I thought I’d try A/B testing it against the existing site.

Technical Sidenote: Why I Never Do Big-Bang A/B Tests

People have often asked me why I’ve never tested full redesigns before, and the answer is always “They’re a metric tonne of work to do correctly.”  You might naively assume that you just create two versions of your application’s template and, bingo, the money starts rolling in, but it is never that simple for non-trivial applications.

If you only have one site-wide template and you are totally religious about not including presentational code in any view/template/partial and the before and after redesigns are very compatible at the DOM level, then doing the A/B test isn’t that bad.  This was very much not the case for BCC and will likely not be the case for most live applications.

I actually considered making a complete copy of the BCC application with a shared database, then doing the split testing with some sort of software load balancer redirecting people to two entirely separate Rails stacks, but that promised to be a whole heck of a lot of maintenance pain going forward in return for avoiding coding pain for the integration.  So having nixed that idea, I did some plumbing in the Rails 2.3 internals to override how Rails magically picks layouts. This let me make duplicates of my existing layout structure for the redesign, do the appropriate HTML changes to them, and then start worrying about the main content areas of the layouts (where the real work began).

#goes in application_controller.rb
#Hack hack HACKEY HACK to re-direct all layouts to /layouts/redesign if this user is in that A/B test.
alias_method :old_active_layout, :active_layout

def active_layout(passed_layout = nil, options = {})

  redesign_choice = session[:big_bang_redesign] || ab_test("big_bang_redesign", ["default", "redesign"], :conversion => "purchase")
  # Exclude abingo controller and one blog article from scope of redesign -- customers don't see them and they'd take real work to get right.
  @use_redesign = (redesign_choice == "redesign" && (params[:controller] != "abingo") && (params[:action] != "developing_shopping_cart"))
  # Not part of the redesign: defer to the stock Rails layout lookup.
  return old_active_layout(passed_layout, options) unless @use_redesign

  layout_name = old_active_layout(passed_layout, options).to_s rescue nil
  return nil if layout_name.nil?
  chosen_layout = "layouts/redesign/#{layout_name.gsub("layouts/", "").gsub("redesign/", "")}"
  find_layout(chosen_layout, default_template_format, options[:html_fallback]) if chosen_layout
end

Ahh, DOM conflicts.  One of my requirements for the new redesign, to save my sanity, was that it be built on a grid system.  I picked one I happen to like, but Bootstrap would have worked just as well.  BCC is presently written without the benefit of a grid system, so the internals of many pages require a bit of tweaking to fit onto one.  Also, the redesign omits elements of the previous design in some places in such a way that “display:none” doesn’t really cut it, so I wanted to be able to quickly turn off and on bits of HTML based on whether someone was using the redesign or not. I made a quick helper to do so: redesign(true) { # renders only if someone is seeing the redesign}, redesign(false) { #renders only if someone is seeing the old version}.

  def redesign(for_redesign = true, &block)
    test_choice = session[:big_bang_redesign] || ab_test("big_bang_redesign", ["default", "redesign"], :conversion => "purchase")
    # Exclude abingo controller and one blog article from scope of redesign -- customers don't see them and they'd take real work to get right.
    @use_redesign = (test_choice == "redesign" && (params[:controller] != "abingo") && (params[:action] != "developing_shopping_cart"))
    #If for_redesign and use_redesign are both true or both false, yield.
    if (!(for_redesign ^ @use_redesign))
      yield
    end
  end

This let me start attacking the marketing site and application for Bingo Card Creator, adding code-spackle where required to get things actually working correctly. It turned out to be necessary in 60 places, consuming most of the three days it took to get the redesign working after receiving the HTML and CSS mockups for it.

The final technical measure was for user experience: anyone who has ever done a redesign knows that a vocal contingent of users hates them.  As long as I was doing an A/B test anyhow, I gave users the ability to flip between which version they were seeing, buried waaaaaaaaay at the bottom of the page in the footer.  This way when people complained (and two inevitably did) I could tell them how to opt-out of the new version.  (It is also indispensable when testing to see that the site was functional in both versions.)  Feel free to use this feature if you want to see both versions of the site live.

Enough Technical Mumbo-Jumbo, Let’s See Some Screenshots


The old version of the home page:

The new version:

As you can see, the new version is cleaner, more modern, and (partially as a consequence of finally adapting to a world with wider displays than 800×600) has quite a bit more room to breathe between elements.  It probably still won’t get featured in any design galleries, but that isn’t the point: this site exists to sell software.  (I rush to add that this isn’t a reflection on my designer’s skill: my brief constrained him into favoring the commercial imperative over design imperatives in a few ways.  As always I’m ultimately responsible for anything which looks bad and the designer gets the credit for anything that doesn’t, since if I were left alone to design things they’d look like big balls of blue-green mud with large orange BUY NOW buttons stuck in them.)

Redesigning The UX of The App

As long as we were giving the site a facelift, I decided to see if without majorly tweaking the underlying application we could make it more usable.  I thought of adding a prominent graphical element suggesting what steps it requires to make bingo cards and tracking user progress, something which has been reported to work frequently among UX folk.  (I also have a motivational result or three from clients about this.)

This is what a new trial user previously saw:

Here’s what they see now:

I added a few affordances to that design.  For example, clicking on the elements of the progress bar makes my best context-based guess of how to move you backward or forward along the path of making bingo cards.  It also highlights how far you’ve gone, as seen here:

You’ll note that I haven’t fixed the Next Step button yet.  Ahh well, always one more thing to do…

So How Did Things Go?

I generally do far less extensive A/B tests than this, and track them only to a single conversion.  (That is actually a limitation of my A/B testing software, A/Bingo, because I never really saw the need before to track a change’s effect on multiple conversions in my own business.)

However, since this redesign affects every part of my funnel from the AdWords landing pages to the internals of the application to the purchasing page, I thought it might reasonably be the case that the redesign was a win somewhere and a loss elsewhere, so just tracking to the final conversion (purchase) might cause me to have an incomplete view of the implications of  the redesign.

Enter KissMetrics, my current favorite funnel tracking software.  They’re wonderful, you should use them.  I’ve happily paid $150 a month for the last year or so and barely log in — that is totally justified now.

KissMetrics lets you include custom properties as people cause events in your site/app/etc (which you can then retrospectively organize into funnels on their website).  I simply included which version of the site someone was seeing as a custom property, then fiddled with their UI a bit until I had the filters set properly, and voila, I can now see how the A/B test affects every funnel I have.

In some cases, the redesign was a win.

AdWords Landing Pages: Did Registrations Go Up?

Consider the AdWords landing pages, where I measure conversion to the free trial (all stats taken from last 7 days just for convenience, but they mostly match results since start of test):

old version Visits: 1,403 Conversions: 293 (20.9%)
redesign Visits: 1,349 Conversions: 311 (23.1%)

I’ll spare you the z-test: that modest increase is statistically significant at the 10% level (90% confidence), but not at the 5% level.  So, middling evidence of a change in the right direction.  So far, so good.
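
For anyone who would rather not be spared, here is a sketch of the z-test on the numbers above.  The post does not say exactly which test was run, so the pooled two-proportion form and the one-sided p-value are assumptions, and the function name two_proportion_z_test is made up for illustration.

```ruby
# Pooled two-proportion z-test. Returns [z, one_sided_p], where one_sided_p
# is the probability of seeing a lift at least this large if both versions
# actually convert at the same rate.
def two_proportion_z_test(conv_a, n_a, conv_b, n_b)
  p_a = conv_a.to_f / n_a
  p_b = conv_b.to_f / n_b
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  std_err = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  z = (p_b - p_a) / std_err
  # Upper tail of the standard normal: 1 - Phi(z) = erfc(z / sqrt(2)) / 2.
  one_sided_p = 0.5 * Math.erfc(z / Math.sqrt(2))
  [z, one_sided_p]
end

# Old version: 293 conversions from 1,403 visits.
# Redesign:    311 conversions from 1,349 visits.
z, p_value = two_proportion_z_test(293, 1403, 311, 1349)
puts format("z = %.2f, one-sided p = %.3f", z, p_value)
```

Under those assumptions z comes out a little under 1.4 and the one-sided p-value lands between 0.05 and 0.10, which matches "significant at 10% but not at 5%."  A two-sided test would double the p-value, so the conclusion does depend on that choice.
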

It was actually a much more dramatic increase on my non-AdWords landing pages (SEO-ified content, you’ve heard me talk about it before), but that pales next to the following improvement.

The Application: Did User Success Increase?

I’ve written and presented previously about using funnel tracking to improve your application such that users are more likely to succeed with it.  Success with Bingo Card Creator involves, predictably, actually creating bingo cards.  Somewhat surprisingly, back when I started tracking it, only 48% of free trials actually got as far as successfully downloading a bingo card.  I’ve worked on that and gotten the number to 60% over the years — roughly a 25% lift in user success with a huge, honking increase in my bottom line results as a consequence.  My consistent experience has been that the more users succeed with BCC, the more money I make.

So if we look at the workflow for BCC:


I had to stitch that together from a pair of screenshots because it won’t fit on one interface in KissMetrics, but the numbers are accurate.  Let me direct your attention to the salient bits:

  • The redesign has a lot more people start at the Dashboard than the “default” (old) version does.  This is because the redesign is, as we previously discussed, crushing the old version at getting user signups to the free trial.
  • Each step of the funnel includes a percentage, which is the percent of the folks who started at Dashboard that made it all the way through that step.  (They’re not inter-step percentages, they’re cumulative.)  If you want to back out the math, you’ll see that the redesign outperforms the drop-off rate of the old site at every step in the funnel — 1% here, 2% here, 5% here.
  • This compounds multiplicatively.  By the end of the funnel, the redesign has a 9.2 percentage point increase (a 15% lift) in user success compared with the old version of the software.  To give you some appreciation of that: if you’re working with mature software and have already grabbed your low hanging fruit, the magnitude of that improvement is staggering.  That is literally better than a year’s worth of active tweaking at this point.
  • When you compound the pre-workflow increase in trial signups and the increase in user success, the actual number of teachers who successfully create bingo cards out of a given number sent to my site in the first place goes up by 45%.  This makes this the most successful A/B test I’ve ever run, at least by that metric.
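
The compounding in the last two bullets can be sketched as simple arithmetic.  This assumes the stages multiply independently and that the 45% figure is exactly the product of the two lifts; the function names compound_lift and implied_lift are made up for illustration.  The implied signup lift is backed out of the two quoted numbers, not taken from the funnel report.

```ruby
# Combine independent relative lifts: (1 + a) * (1 + b) * ... - 1.
def compound_lift(*lifts)
  lifts.reduce(1.0) { |acc, lift| acc * (1.0 + lift) } - 1.0
end

# Given the overall lift and one known component, back out the other one.
def implied_lift(total_lift, known_lift)
  (1.0 + total_lift) / (1.0 + known_lift) - 1.0
end

success_lift = 0.15 # user-success lift quoted above
total_lift   = 0.45 # lift in successful users per visitor, quoted above

signup_lift = implied_lift(total_lift, success_lift)
puts format("implied signup lift: %.1f%%", signup_lift * 100)
puts format("recombined total:    %.1f%%", compound_lift(signup_lift, success_lift) * 100)
```

On those assumptions, a 15% lift in user success only reaches 45% overall if trial signups per visitor rose by roughly 26% across all landing pages, which is consistent with the non-AdWords pages improving far more than the AdWords ones.
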

Egads, So This Printed Money For You, Right?

Cue the bad news!  Teachers are so successful with the newly redesigned BCC that, out of any 100 using the software, fewer of them decide to purchase it.  This almost exactly cancels out gains in trial registrations and user success.  It is downright painful: last week, I got 26 sales, and would you believe they were split exactly 13/13?  That’s practically a textbook null result — it was so improbable to me that I spent a few hours checking stats to see if I hadn’t made a systematic error somewhere, but no, crosschecking in a few places makes it look legitimate.  (The split since I started the test is 50/46 in favor of the old version, which is comfortably in the null result territory as well.)
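
To put a number on "null result territory," here is a sketch of an exact two-sided binomial test on the sales splits.  It assumes both versions were shown to the same number of visitors, so that a fair coin is the right null hypothesis; that is an assumption about the test setup, and binomial_two_sided_p is a made-up name.

```ruby
# Exact two-sided binomial test against a fair 50/50 split. Returns the
# probability of a split at least as lopsided as the observed one.
def binomial_two_sided_p(successes, trials)
  # Probability of exactly k successes in `trials` fair coin flips.
  pmf = lambda do |k|
    # Builds C(trials, k) step by step; each division is exact.
    coeff = (1..k).reduce(1) { |acc, i| acc * (trials - k + i) / i }
    coeff.to_f / 2**trials
  end
  center = trials / 2.0
  distance = (successes - center).abs
  (0..trials).select { |k| (k - center).abs >= distance }.sum { |k| pmf.call(k) }
end

p_week  = binomial_two_sided_p(13, 26) # last week's 13/13 split
p_total = binomial_two_sided_p(50, 96) # 50/46 since the test started

puts format("13/13 split: p = %.2f", p_week)
puts format("50/46 split: p = %.2f", p_total)
```

A 13/13 split gives a p-value of 1, and the 50/46 split comes out around three in four, so neither is remotely close to evidence of a difference in purchases.
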

Does that result sound counterintuitive to you?  It is the sort of thing that, when I have to report it to clients, always sets me walking on eggshells.  The first rule of A/B testing is that everything you know is wrong.

Since I have the luxury of well-instrumented funnels, I can tell you where the problem isn’t:

We did a complete redesign of the purchasing page and shopping cart.  I omitted showing it above to save space, but I’m pretty proud of how it turned out.  The new purchasing page/shopping cart is not the problem: precisely as many people will, once getting to the purchasing page, complete a purchase as they would previously.  (On the subject of plumbing-that-takes-money: Stripe.  I’ve promised them a case study post someday, but the capsule version is that if you can use Stripe you shouldn’t be using anything else.)

The problem appears to be, simply, that fewer users hit our trial limitations now.  (Hitting the 15 card trial limit is the overwhelming cause of going to the purchasing page.)  This suggests that either a) the new site is converting more parents than the old one used to, and since parents rarely have 15 children they’re simply having a happy bingo experience and not paying (net win for the world, not really a win for me personally though) or b) for indeterminate reasons, users simply get what they need out of the free trial and don’t convert.  It is entirely possible that any of the sixty small tweaks I had to make to the site nudged people away from hitting those limitations.

This is one of the reasons I hate big-bang A/B tests — when you have a huge batch like that, isolating the exact cause of any observed effect is difficult (other than “Well, clearly, something changed in the redesign”), whereas A/B testing an individual element structurally gives you configurable levels of confidence that a particular element was the one and only cause of an observed effect if the stats shake out that way.

So Where Does That Leave You?

If I had a great desire to do more work in BCC, this would provide a good place to target.  I could figure out why the new version of the site is having fewer folks hit trial limitations, and either tighten those limitations or tweak the UX such that the site nudges more people into hitting them.  That said, the free time on my schedule is rapidly drying up as we get closer to my wedding, and even if I had captured all of that 45% increase to the bottom line that isn’t really the path forward for my business, so BCC is going back into maintenance mode.  This one is getting written off as an amusing and partially successful experiment which helped out my users but didn’t succeed at making me money.  I will likely finalize the redesign and kill the old version in the coming weeks.

How did existing users respond to the change?

Well, half of them don’t know about it yet, obviously. With two exceptions, the feedback has been overwhelmingly positive. Many of them were appreciative that they got the Totally New Software Absolutely Free. It is actually functionally identical to the old version — not one line of model code or business logic changed — but I have received many compliments about the wonderful new features, performance increase, and improved compatibility with Epson printers. Before you laugh, consider that probably 95% of software businesses don’t A/B test yet, so my users are in good company making inferences from observed behavior changes over time even when their explanation for the changes has no relationship to reality whatsoever.

Sidenote: If new software is assumed to be worth money and reskins make software “new” in the minds of the only people that matter (users), that probably suggests a viable marketing strategy. Your engineers don’t have to like it.

How much did it cost you?

BCC is growing like a weed for reasons totally unrelated to my work on it (long story but you can see the recent stats).  This means I have quite a bit of flexibility to cut checks to make things happen.  To give you a rough idea, we settled on a price of $1,X00 for PSD, HTML, and CSS mockups of my front page, my pricing page (complete with minor JS for the shopping cart), and the main page of the application.  I then munged that into site- and application-wide templates to get things to their current state.

Ashraful was wonderful to work with: very responsive to email, timely, and receptive to feedback in a way that improved the quality of the designs vis-a-vis my target market while also decreasing the amount of integration work I had to do.  I’d work with him again in a heartbeat.  You should hire him.

The Most Radical A/B Test I've Ever Done

About four years ago, I started offering Bingo Card Creator for purchase.  Today, I stopped offering it.

That isn’t true, strictly speaking.  The original version of Bingo Card Creator was a downloadable Java application.  It has gone through a series of revisions over the years, but is still there in all its Swing-y glory.  Last year, I released an online version of Bingo Card Creator, which is made through Rails and AJAX.

My personal feeling (backed by years of answering support emails) is that my customers do not understand the difference between downloadable applications and web applications, so I sold Bingo Card Creator without regard to the distinction.  Everyone, regardless of which they are using, goes to the same purchasing page, pays the same price, and is entitled to use either (or both) at their discretion.  It is also sold as a one-time purchase, which is highly unusual for web applications.  This is largely because I was afraid of rocking the boat last summer.

The last year has taught me quite a bit about the difference between web applications and downloadable applications.  To wit: don’t write desktop apps.  The support burden is worse, the conversion rates are lower, the time through the experimental loop is higher, and they retard experimentation in a million and one ways.

Roughly 78% of my sales come from customers who have an account on the online version of the software.  I have tried slicing the numbers a dozen ways (because tracking downloads to purchases is an inexact science in the extreme), and I can’t come up with any explanation other than “The downloadable version of the software is responsible for a bare fraction of your sales.”  I’d totally believe that, too: while the original version of the web application was rough and unpolished, after a year of work it now clocks the downloadable version in almost every respect.

I get literally ten support emails about the downloadable application for every one I get about the web application, and one of the first things I suggest to customers is “Try using the web version, it will magically fix that.”

  • I’m getting some funky Java runtime error.  Try using the web application.
  • I can’t install things on this computer because of the school’s policies.  Try using the web application.
  • How do I copy the files to my niece’s computer?  By the way it is a Mac and I use a Yahoo.  Try using the web application.

However, I still get thousands of downloads a month… and they’re almost all getting a second-best experience and probably costing me money.

Thus The Experiment

I just pushed live an A/B test which was complex, but not difficult.  Testers in group A get the same experience they got yesterday, testers in group B get a parallel version of my website in which the downloadable version never existed.  Essentially, I’m A/B testing dropping a profitable product which has a modest bit of traction and thousands of paying customers.

This is rather substantially more work than typical “Tweak the button” A/B tests: it means that I had to make significant sitewide changes in copy, buttons, calls to action, ordering flow, page architecture, support/FAQ pages, etc etc.  I gradually moved towards this for several months on the day job, refactoring things so that I could eventually make this change in a less painful fashion (i.e. without touching virtually the entire site).  Even with that groundwork laid, when I “flipped the switch”  just now it required changing twenty files.

Doing This Without Annoying Customers

I’m not too concerned about the economic impact of this change: the A/B test is mostly to show me whether it is modestly positive or extraordinarily positive.  What has kept me from doing it for the last six months is the worry that it would inconvenience customers who already use the downloadable version.  As a result, I took some precautions:

The downloadable version isn’t strictly speaking EOLed.  I’ll still happily support existing customers, and will keep it around in case folks want to download it again.  (I don’t plan on releasing any more versions of it, though.  In addition to being written in Java, a language I have no desire to use in a professional capacity anymore, the program is a huge mass of technical debt.  The features I’d most keenly like to add would require close to a whole rewrite of the most complex part of the program… and wouldn’t generate anywhere near an uptick in conversion large enough to make that a worthwhile use of my time, compared to improving the website, web version, or working on other products like Appointment Reminder.)

I extended A/Bingo (my A/B testing framework) to give a way to override the A/B test choices for individual users.  I then used this capability to intentionally exclude from the A/B test (i.e. show the original site and not count) folks who hit a variety of heuristics suggesting that they probably already used the downloadable version.  One obvious one is that they’re accessing the site from the downloadable version.  There is also a prominent link in the FAQ explaining where it went, and clicking a button there will show it.  I also have a URL I can send folks to via email to accomplish the same thing, which was built with customer support in mind.
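To make the override idea concrete, here is a minimal, self-contained sketch. It is not A/Bingo's actual API: the `ToySplitTest` class, its `override!` method, and the `probably_desktop_user?` heuristics are all hypothetical names invented for illustration. The point is only that an overridden visitor is shown a forced alternative and never counted as a participant.

```ruby
require 'digest'

# Hypothetical stand-in for a split-testing framework; NOT A/Bingo's API.
class ToySplitTest
  attr_reader :participants

  def initialize(name, alternatives)
    @name = name
    @alternatives = alternatives
    @overrides = {}               # user id => forced alternative
    @participants = Hash.new(0)   # alternative => number of counted users
    @seen = {}                    # user ids already counted
  end

  # Force a specific alternative for one user and keep them out of the stats.
  def override!(user_id, alternative)
    @overrides[user_id] = alternative
  end

  def alternative_for(user_id)
    # Overridden users see the forced alternative and are never counted.
    return @overrides[user_id] if @overrides.key?(user_id)

    # Everyone else gets a stable, hash-based assignment.
    index = Digest::MD5.hexdigest("#{@name}:#{user_id}").to_i(16) % @alternatives.size
    choice = @alternatives[index]
    unless @seen[user_id]
      @seen[user_id] = true
      @participants[choice] += 1
    end
    choice
  end
end

# Hypothetical heuristics suggesting a visitor already uses the desktop version.
def probably_desktop_user?(visitor)
  !!(visitor[:came_from_desktop_app] || visitor[:clicked_faq_restore_link])
end

split = ToySplitTest.new("drop_downloadable_version", %w[original web_only])
visitor = { id: 42, came_from_desktop_app: true }
split.override!(visitor[:id], "original") if probably_desktop_user?(visitor)

puts split.alternative_for(visitor[:id])   # the forced alternative
puts split.participants.values.sum         # overridden visitors are not counted
```

Keeping the override lookup ahead of the counting step is what keeps excluded visitors out of the results, which matters more here than how the random assignment is done.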

I also scheduled this test to start during the dog days of summer.  Seasonally, my sales always massively crater during the summer, which makes it a great time to spring big changes (like, e.g., new web applications).  Most of my customers won’t be using the software again until August, and that gives me a couple of months to get any kinks out of the system prior to them being seen by the majority of my user base.

My Big, Audacious Goal For This Test

I get about three (web) signups for every two downloads currently, and signups convert about twice as well as downloads do.  (Checking my math, that would imply a 3:1 ratio of sales, which is roughly what I see.)  If I was able to convert substantially all downloads to signups, I would expect to see sales increase by about 25%.

There are a couple of follow-on effects that would have:

  • I think offering two choices probably confuses customers and decreases the total conversion rate.  Eliminating one might help.
  • Consolidating offerings means that work to improve conversion rates automatically helps all prospects, rather than just 60%.
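The arithmetic behind the 3:1 sales ratio, the 25% projection, and the 60% figure can be checked with a few lines of Ruby. This is only a sketch of the ratios quoted above: the method name and hash keys are made up for illustration, and the inputs are relative ratios rather than real traffic numbers.

```ruby
# Relative sales from web signups and downloads, plus the projected gain if
# every download became a signup. Inputs are ratios, not real traffic counts.
def sales_projection(signups:, downloads:, signup_multiple:)
  signup_sales    = signups * signup_multiple.to_f   # downloads convert at 1.0
  download_sales  = downloads.to_f
  current_sales   = signup_sales + download_sales
  projected_sales = (signups + downloads) * signup_multiple.to_f

  {
    signup_to_download_sales: signup_sales / download_sales,
    sales_increase: projected_sales / current_sales - 1,
    web_prospect_share: signups.to_f / (signups + downloads)
  }
end

result = sales_projection(signups: 3, downloads: 2, signup_multiple: 2)
puts format("Sales ratio (signups : downloads) = %.0f : 1", result[:signup_to_download_sales])
puts format("Projected sales increase = %.0f%%", result[:sales_increase] * 100)
puts format("Prospects already on the web version = %.0f%%", result[:web_prospect_share] * 100)
```

The projection assumes converted downloads would behave exactly like today's signups, which is the optimistic case.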

Magic Synergy Of Conversion Optimization And AdWords

Large systemic increases in conversion rates let me walk up AdWords bids.  For example, I use Conversion Optimizer.  Essentially, rather than bidding on a cost per click basis I tell Google how much I’m willing to pay for a signup or trial download.  I tell them 40 cents, with the intention of them actually getting the average at around 30 cents, which implies (given my conversion from trials/signups to purchase) that I pay somewhere around $12 to $15 for each $30 sale.  Working back from 30 cents through my landing page conversion rate, it turns out I pay about 6 cents per click.

Now, assuming my landing page conversion is relatively constant but my trial to sale conversion goes up by 25%, instead of paying $12 to $15 a sale I’d be paying $9.60 to $12 a sale.  I could just pocket the extra money, but rather than doing that, I’m probably going to tell Google “Alright, new deal: I’ll pay you up to 60 cents a trial”, actually end up paying about 40 cents, and end up paying about 8 cents per click.  The difference between 6 and 8 will convince Google to show my ads more often than those of some competitors, increasing the number of trials I get per month out of them.  (And, not coincidentally, my AdWords bill.  Darn, that is a bloody brilliant business model, where they extract rent every time I do hard work.  Oh well, I still get money, too.)
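For anyone who wants to follow the bid math, here is a small sketch of it. The conversion rates are not measured values: they are backed out of the figures quoted above (30 cents per trial, $12 to $15 per sale, 6 cents per click), and the method and constant names are invented for illustration.

```ruby
# Cost per sale, given what a trial costs and how often trials become sales.
def cost_per_sale(cost_per_trial, trial_to_sale_rate)
  cost_per_trial / trial_to_sale_rate
end

# Effective cost per click, given what a trial costs and the landing page rate.
def cost_per_click(cost_per_trial, landing_page_rate)
  cost_per_trial * landing_page_rate
end

# Implied by 6 cents per click at 30 cents per trial.
LANDING_PAGE_RATE = 0.06 / 0.30
# Implied by $15 and $12 per sale at 30 cents per trial.
TRIAL_TO_SALE_RATES = [0.30 / 15, 0.30 / 12].freeze

TRIAL_TO_SALE_RATES.each do |rate|
  improved = rate * 1.25   # trial-to-sale conversion up 25%
  puts format("now: $%.2f per sale | after +25%%: $%.2f | after +25%% at 40 cents a trial: $%.2f",
              cost_per_sale(0.30, rate),
              cost_per_sale(0.30, improved),
              cost_per_sale(0.40, improved))
end
puts format("Cost per click at 40 cents per trial: %.0f cents",
            cost_per_click(0.40, LANDING_PAGE_RATE) * 100)
```

The last column shows the trade being described: paying more per trial spends the conversion gain on extra volume instead of on a lower cost per sale.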

We’ll see if this works or not.  As always, I’ll be posting about it on my blog.  I’m highly interested in both the numerical results of the A/B test as well as whether this turns out being a win-win for my customers and myself or whether it will cause confusion at the margin.  I’m hoping not, but can’t allow myself to stay married to all old decisions just out of a desire to be consistent.

Stats Bug In A/Bingo v1.0.0 and earlier

Many thanks to Ivan for reporting this one: there is a significant bug in A/Bingo’s calculation of z-scores for versions 1.0.0 and earlier, which borks substantially all z-score calculations and in some cases can change whether A/B test results are reported as statistically significant or not.

The bug is all of one character long:

def zscore()
  #omitted for clarity
  cr1 = alternatives[0].conversion_rate
  cr2 = alternatives[1].conversion_rate
  n1 = alternatives[0].participants
  n2 = alternatives[1].participants

  numerator = cr1 - cr2
  frac1 = cr1 * (1 - cr1) / n1
  frac2 = cr2 * (1 - cr1) / n2   #this line is bugged: it should read (1 - cr2)
  numerator / ((frac1 + frac2) ** 0.5)
end

I have fixed the bug (via the Slicehost console, on a Japanese cafe Internet PC, because I am stuck in Nagoya today again) and pushed the fix to the git repository.
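As a rough illustration of what the one-character change does, here is a standalone sketch that puts the corrected and the bugged formulas side by side. It is not the code from the A/Bingo repository: the method signatures take plain numbers rather than alternatives, and the sample data is hypothetical.

```ruby
# Two-proportion z-score with an unpooled standard error.
# cr = conversion rate (0.0 to 1.0), n = number of participants.
def zscore(cr1, n1, cr2, n2)
  frac1 = cr1 * (1 - cr1) / n1
  frac2 = cr2 * (1 - cr2) / n2   # each variance term uses its own rate
  (cr1 - cr2) / Math.sqrt(frac1 + frac2)
end

# The same calculation with the one-character bug, kept for comparison.
def bugged_zscore(cr1, n1, cr2, n2)
  frac1 = cr1 * (1 - cr1) / n1
  frac2 = cr2 * (1 - cr1) / n2   # wrong: uses cr1 where cr2 belongs
  (cr1 - cr2) / Math.sqrt(frac1 + frac2)
end

# Hypothetical data: 10% versus 5% conversion, 1,000 participants each.
puts format("fixed:  %.3f", zscore(0.10, 1000, 0.05, 1000))
puts format("bugged: %.3f", bugged_zscore(0.10, 1000, 0.05, 1000))
```

For this made-up data the two values differ only slightly, since the bug changes only the second variance term.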

Does this make my results invalid?

You can probably still have confidence in results you got from A/Bingo previously. While the numerical calculation of the z-score was borked, it was borked in a subtle enough fashion that most statistically significant tests will retain their statistical significance under the borked calculation and most statistically insignificant tests will not gain statistical significance magically as a result of the borked calculation. (My quick eyeball suggests that it causes BCC to overstate the significance of tests which are very significant and understate the significance of tests which are insignificant, which is a very fortuitous set of properties for a random bug to have in an A/B testing framework.)

I have re-run statistical confidence tests for everything I’ve ever done for BCC that I still have data for, and no experimental results changed as a result of the error. Nonetheless, I deeply regret the bug, and will write unit tests for the statistics code as soon as I am physically capable of doing so to rule out the possibility of this sort of thing in the future.

Lesson from Madlibs Signup Fad: Do Your Own Tests

Periodically, news of an innovative, goofy, compelling, or compellingly goofy design decision will sweep across the Internets like wildfire.  Most recently, this happened with a madlibs-looking lead generation form.

I think it has much to recommend it in the context of lead generation forms (long, arduous monstrosity that you sign up for in the hopes you are contacted but not spammed to death), but I didn’t see much possible upside for using it on a new user registration form (short form which you sign up to use something).

However, I’m wary of trusting my instincts on such things when I could trust data instead.  There is a key point about A/B testing: trust your data, not somebody else’s data.  After all, you only make money when it improves your conversion rate, not their conversion rate.  You can feel free to use other folks’ successful experiments for inspiration but for heaven’s sake use them to inspire you to run tests, rather than inspire you to fire blindly.

I was particularly wary about trusting this result because, as pointed out by numerous people in the Hacker News discussion, roughly seven things changed between the two forms in the A/B test performed on the standard form versus the madlib form, and there is no particular reason to assume that the salient difference was caused by the part which strikes us as creative as opposed to more boring things like e.g. the call to action in the header.

When In Doubt, Test.  (When Not In Doubt, Test Twice.)

No fewer than six people said “Hey Patrick have you seen this madlibs thing yet?  You’ve got to try it.”  Since I have an A/B testing framework that makes this a one-line proposition, knocking something together would take less than 10 minutes, so I decided I’d humor them.  I isolated just the madlibs versus standard style for the test, knocked up an alternative in about ten minutes with my (decidedly limited) CSS and Javascript skills, and set them against each other.  My conversion goal for this test is successfully inducing someone to sign up for the free trial of Bingo Card Creator.

My Usual Registration Form

The Madlibs Registration Form

P.S. If you have good eyes you’ll spot the other A/B test ongoing on this page.  I’m using the traditional way of mitigating cross-test interaction… ignoring the possibility of it.  Don’t tell your college stats professor, but this actually works pretty well in practice.


I ran this test until A/Bingo, my A/B testing framework for Rails, told me that further testing was just a waste of my time.  It didn’t take long at all — 34 hours after the test alternative went live for the site, the first time I checked the results, they were already overwhelming.  Let me copy/paste right off my public results page:

Signup Madlibs Versus Standard
Standard (27.55%) winner
Madlibs (21.73%)

By my count that is a roughly 21% decrease in conversion rates (27.55% down to 21.73%) for using the madlibs signup style over the standard signup style, and the fact of the decrease (but not the magnitude) is significant at the 95% confidence level.

For the curious: there were 736 participants in this test, split roughly 50/50, as you would expect.  I love the Internet because where else can you get 736 people to help you improve your website while you sleep, work at the day job on Saturday, have an evening out with friends, and then sleep some more?
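If you want to sanity-check numbers like these yourself, a few lines of Ruby will do it. This is a sketch under a stated assumption: the post only says the 736 participants were split roughly 50/50, so the code assumes exactly 368 per arm, and the method names are invented for illustration.

```ruby
# Unpooled two-proportion z-score: cr = conversion rate, n = participants.
def zscore(cr1, n1, cr2, n2)
  standard_error = Math.sqrt(cr1 * (1 - cr1) / n1 + cr2 * (1 - cr2) / n2)
  (cr1 - cr2) / standard_error
end

# Relative change in conversion rate when moving from baseline to variant.
def relative_change(baseline, variant)
  (variant - baseline) / baseline
end

standard = 0.2755
madlibs  = 0.2173
per_arm  = 736 / 2   # assumption: an exactly even split

puts format("Relative change: %.1f%%", relative_change(standard, madlibs) * 100)
puts format("z-score: %.2f", zscore(standard, per_arm, madlibs, per_arm))
```

Under that assumed split the z-score comes out a little above 1.8. For reference, 95% confidence corresponds to a z of roughly 1.645 for a one-tailed test and 1.96 for a two-tailed one.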

Anyhow: test ended, not touching the madlibs idea again.  Before adopting this or any other fad (or good suggestion, for that matter): do your own A/B tests.

A/Bingo 1.0.0 Official Release

Back in August I released A/Bingo, an MIT-licensed OSS Rails A/B testing framework.  I have been using it continuously on Bingo Card Creator, and judging from the support requests I’ve been getting it has gotten some traction in the Rails world.  The 5,000 or so people seeing A/B tests on my site on Valentine’s Day are almost certainly less than 1% of the beneficiaries of the software now. Yay.

As A/Bingo has grown in popularity, I have begun to get requests for features that I did not need urgently for my own development, as well as the usual support requests, patches, and the like.  I want to make your use of the software as pleasant as possible to further evangelize the cause of A/B testing, so here you go:

New features:

A/Bingo now ships with a default dashboard.  Previously, I assumed that everyone would be writing their own dashboard code, so I just included the absolute minimum to show you what you’d need to do to get data out of A/Bingo.  Many people have remarked that they would really appreciate a “works out of the box” solution.  Your wish is my command: you can now enable a default dashboard in about 30 seconds. It would work totally out of the box, but there are security implications, so I wanted you to have to think for a moment prior to enabling it.

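#Create a new controller.  The name is up to you -- this example uses abingo_dashboard_controller.rb
class AbingoDashboardController < ApplicationController
  #TODO add some authorization
  include Abingo::Controller::Dashboard
end

#Add a route in config/routes.rb
map.abingoTest "/abingo/:action/:id", :controller => :abingo_dashboard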

You can customize the dashboard code yourself. Nota bene: it uses your application layout, and has CSS classes applied to most of the elements, so you can style it quickly with CSS if you desire to. By default, it probably looks terrible. If you want to send me a patch to make it pretty, be my guest.

Experiments can now be stopped: Using either the built-in links on the above controller or, if you prefer programmatically scripting things, experiment.end_experiment!(alternative_content), you can now stop an experiment without touching the code.  Stopping an experiment causes all users to get the specified alternative rather than what they would have gotten randomly.  It also ceases stats collection.  Stopping an experiment is irreversible (currently — that might change later).  I tried to make this feature not affect the performance of A/Bingo for larger sites — it makes each test require one extra cache access.  (*cough* Rounding error, hopefully.)

A/Bingo internals are now fairly thoroughly tested: Unit tests are not exactly my cup of tea (“Argh, it works in production, what else do you want from me?!”), but Rails developers look askance at software that does not include them.  So I knuckled down and wrote a test suite.  (Hat tip to Nathaniel Talbott for mentioning A/Bingo in a conference presentation.  The constructive criticism regarding testing drove this change.)

I have not written thorough integration tests for the syntax sugar that you get via the included helper methods, but I’ll fix that eventually.

Named conversions: Previously, all A/Bingo tests required one line to add the test and one line somewhere else to track conversions.  Typically, since businesses have very many tests and fairly few conversion events, this resulted in code like:

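#A controller method
def purchase
  #Business logic goes here.
  bingo!("some_test_name")  #One bingo! call for every test that counts purchases...
  bingo!("bar_test")        #...and another one for each new test you add.
end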

That isn’t very DRY at all.

Now, A/Bingo will take an optional parameter :conversion (or :conversion_name) when you’re defining a test, telling it to listen to a particular named conversion. This way, you can reuse the same conversion for as many tests as you want, decreasing the lines of code needed to create most new tests from two to one.

def some_method_with_a_test
  @alternative = ab_test("some_test_name", %w{altA altB}, :conversion => "purchase")
end

def some_other_method_with_a_test
  @foo = ab_test("bar_test", %w{coke water}, :conversion => "purchase")
end

def purchase
  #Business logic goes here!
  bingo!("purchase")  #Calls conversions for both of the above tests.
end

A/Bingo handles tests with spaces in them more gracefully: Although I still don’t recommend doing it, A/Bingo has been improving its handling of test names which have a space in them.  (The reason I don’t recommend it is because some cache stores — particularly memcached — do not support this well.)

Official support for Redis: Assaf Arkin picked Redis for his awesome Vanity project (which also does A/B testing for Rails, among other things), which inspired me to take a look into it.  It appears to be a much, much better alternative for a key/value store than MemcacheDB, which is what I use for persistence.  A/Bingo has always accepted any cache store that Rails does, but I want to make it explicitly clear that I run tests against Redis, Memcached, and MemcacheDB. Just add the following to your environment:

#Goes in environment.rb
config.gem  'ezmobius-redis-rb',
  :source => '',
  :lib => false

config.gem  'jodosha-redis-store',
  :source => '',
  :lib => 'redis-store'

#Goes in whatever environment you're using:
require 'redis-store'
Abingo.cache =

I intend to migrate my own deployment to Redis when it becomes reasonably convenient to do so.

Versioning: Previously I’ve just released patches to the A/Bingo git repository when I got done coding them, but I feel that is suboptimal now that there are substantial deployments which I could potentially break with changes.  So, here’s the skinny: A/Bingo is now, as of this blog post, 1.0.0.  I’ll communicate breaking changes by bumping that number up.  If it goes up by a tenth or more, expect that you need to re-run the migrations and that you will probably lose data on any tests in-progress, so plan ahead for that.  Version increases in that last number should be safe to apply directly.

I do not anticipate breaking the published A/Bingo API (i.e. methods mentioned in the docs) until at least v2.0.0, if ever, so upgrading A/Bingo should almost never cause you to need to update your own code.
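The versioning rule can be written down as a tiny helper for anyone scripting upgrades. This is only a sketch of the policy as described in this post: the method name and the returned symbols are invented for illustration and are not part of A/Bingo.

```ruby
# Classify an upgrade between two "major.minor.patch" version strings.
# The returned symbols are labels invented for this sketch.
def upgrade_impact(from, to)
  old_version = from.split(".").map(&:to_i)
  new_version = to.split(".").map(&:to_i)
  return :not_an_upgrade unless (new_version <=> old_version) == 1

  if new_version[0] > old_version[0]
    :published_api_may_change    # e.g. 1.x.x -> 2.0.0
  elsif new_version[1] > old_version[1]
    :rerun_migrations            # in-progress test data will probably be lost
  else
    :safe_to_apply_directly      # only the last number moved
  end
end

puts upgrade_impact("1.0.0", "1.0.1")
puts upgrade_impact("1.0.1", "1.1.0")
puts upgrade_impact("1.1.0", "2.0.0")
```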

How To Contribute

I would like to thank everyone who has submitted bug reports and patches. As usual, I’m always happy to get bug reports or feature requests. If you’d like to contribute code, make it available via git anywhere you please, and then send me an email telling me about it.

How Do I…

If the question isn’t answered in the (copious) documentation, feel free to ask me over email. If your business has particular needs for A/Bingo or you just want to talk A/B testing strategy with somebody who breathes it, I’m available for consulting engagements starting April 1st.

You Should Be Doing A/B Testing

I really can’t stress this enough: A/B testing is an easy, reproducible process that you can use to improve your marketing, website copy, product, user experience, etc. If you haven’t started yet, take A/Bingo, Vanity, or your other framework of choice for a spin. It will take you less than five minutes to start getting actionable data which you can use to make money.
