Availability 101
Speaking of availability, Live was down for a couple of hours on Friday when I went to post this – ace timing…
The next principle of building megaprise software is availability. That one in particular is really important to us; some recent market research we commissioned concluded that a 1% increase in customer confidence in our site resulted in a 16% increase in ARPU (average revenue per user). Simplified: people spend more when they’re surer you’ll be up, and a 1% improvement isn’t out of anyone’s reach, especially when it pays back by a factor of 16.
So how would a user story for availability look? Let’s start with:
As the CTO, I want to look out onto a resilient product suite that never stops even under catastrophic events, So that we never need to close our business and are always able to meet our users’ needs no matter what time zone they live in, As proven by appropriate and justifiable answers to the availability questions.
The time zone bit is an important part for us – we started off as a UK business, crept determinedly across Europe, and with the recent inclusion of southern hemisphere markets I feel quite justified using the term global. This has meant our traditional maintenance slots (the unsociable GMT hours) have evaporated, and we’re left with the challenge of keeping trading going 24×7, releases and all. Appropriate and justifiable? Well, put it this way, I want that 16%…
What can we ask ourselves to make sure we’ve thought about availability enough? A good start would be:
- How will I make this resilient?
- How will this recover from failure?
- How will it behave when it loses connection to its data?
- How will it behave when its dependencies are missing?
- How will the whole system behave when my feature fails?
- How will this feature’s status be monitored?
- What are the major events in the system and how do I make these visible?
- Are the error messages the events produce meaningful in diagnosis?
- How will the system survive the loss of a feature?
- How will the system survive the loss of a node?
- How will the system survive the loss of a data centre?
- Can the system recover from common failure scenarios automatically?
- What tolerance to network and system maintenance have I built in?
- Does my feature meet any SLAs imposed?
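A few of those questions (recovering from failure, behaving sensibly when a dependency is missing, making the major events visible) lend themselves to a quick sketch. The Python below is only an illustration under assumed names, not something lifted from our codebase: `call_with_fallback`, `fetch_live_prices` and `cached_prices` are hypothetical, and the retry count and delays are arbitrary. It retries a flaky dependency with exponential backoff, reports each failure as an event, and degrades to a fallback instead of taking the whole feature down.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    on_event: Optional[Callable[[str], None]] = None,
) -> T:
    """Try `primary` up to `retries` times, then degrade to `fallback`."""
    log = on_event or (lambda msg: None)
    for attempt in range(1, retries + 1):
        try:
            return primary()
        except Exception as exc:  # deliberately broad for this sketch
            # Make the failure visible, with enough detail to diagnose it.
            log(f"attempt {attempt}/{retries} failed: {exc!r}")
            if attempt < retries:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    log("primary unavailable, serving fallback")
    return fallback()


# Hypothetical dependency that is currently down.
def fetch_live_prices() -> dict:
    raise ConnectionError("price service unreachable")


# Hypothetical degraded answer, e.g. the last known good data.
def cached_prices() -> dict:
    return {"source": "cache", "prices": [1.5, 2.0]}


events: list = []
result = call_with_fallback(
    fetch_live_prices,
    cached_prices,
    sleep=lambda seconds: None,  # skip real waiting in this demo
    on_event=events.append,
)
```

In this run `result` comes from `cached_prices`, and `events` holds one entry per failed attempt plus a final note that the fallback was served. In a real system those events would feed whatever monitoring you already have, and the fallback might be stale data, a reduced feature, or a friendly error, depending on what the feature can tolerate.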
Plan for failure; after all, it’s guaranteed.