In the book club at work, we recently finished reading Release It! by Michael T. Nygard. It is a book I have been meaning to read for a long time, but somehow I never got around to it until now. It was written in 2007, and it is starting to show its age in several respects. Despite this, there is still a lot of relevant advice on how to make software work well in production.
The book is divided into four parts: Stability, Capacity, General Design Issues and Operations. Three of the sections start off with a case study of a serious problem in a production system. Each story has details of the symptoms, the troubleshooting steps, and how it was solved. I really enjoyed reading those stories. They illustrate many of the points made in the book, and show that the recommendations come from hard-won experience in the field. I also liked reading the stories because I agree with the author that troubleshooting is a bit like solving a murder mystery.
Stability
This section starts with the basics – having a stable system. The author’s philosophy (which I whole-heartedly agree with) is that it is impossible to get rid of all bugs before the software is in production. Therefore, you must be able to survive bugs without them taking down the whole system. The case studies show how failures in one part can jump from one system or layer to the next, causing cascading failures. Here are the parts I liked the most relating to stability:
Timeouts/fail fast. Surprisingly often, cascading failures are caused by operations hanging forever, or for a very long time. The callers stuck waiting never release what they hold, so resources like thread pools get exhausted quickly. Timeouts make sure you give up when appropriate for outgoing requests. For incoming requests, you should fail fast if you are not able to process the request.
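To make this concrete, here is a minimal Python sketch of both halves (my own illustration, not code from the book). The names `call_with_timeout`, `accept` and `inbox`, and the queue size of 2, are assumptions chosen for the example.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, *args):
    """Outgoing call: stop waiting after timeout_s seconds."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task has not started yet
        raise TimeoutError(f"no answer within {timeout_s}s")

# Incoming work: a small bounded backlog (size is an arbitrary assumption).
inbox = queue.Queue(maxsize=2)

def accept(request):
    """Fail fast: reject at once when the backlog is full instead of blocking."""
    try:
        inbox.put_nowait(request)
        return True
    except queue.Full:
        return False
```

Note that the caller stops waiting, but the worker thread stays busy until the slow call returns. In a real system you would also want timeouts at the I/O level, for example on the socket itself.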
Circuit breaker. A circuit breaker allows you to prevent requests to a failing system. When the target system is operating normally, the circuit breaker is transparent. However, once requests towards the target system fail too often, the circuit breaker trips. Subsequent calls to the target system are not attempted. Instead, an error response is returned immediately, without using up resources trying to reach the failing target system. After a while, requests are allowed to go through, to gauge whether normal operations should resume, or whether the breaker should keep blocking the calls.
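The behaviour described above can be sketched in a few lines of Python. This is my own simplified illustration, not the book's implementation: the class name, the thresholds and the injectable `clock` parameter are all assumptions, and the sketch is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the target."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("target failed recently; call skipped")
            # Cool-down has elapsed: let this one call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Trip the breaker, or re-trip it after a failed probe.
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each call to the remote system as `breaker.call(fetch_price, item_id)`, where `fetch_price` stands for whatever function talks to the target.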
Bulkheads. The bulkhead pattern (named after the watertight compartments in a ship) means arranging your system so a failure in one part doesn’t bring the whole system down. An example is separate thread pools for different parts of the system, instead of one common pool. If a common pool is used, then an error in one part of the system could cause all the threads to be used up, thus disrupting all parts of the system. If dedicated pools are used, exhausting one pool does not stop the parts that use the other pools.
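The thread pool example could look something like this in Python. The subsystem names `reports` and `checkout` and the pool sizes are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# One small, dedicated pool per subsystem (hypothetical names and sizes).
pools = {
    "reports": ThreadPoolExecutor(max_workers=2, thread_name_prefix="reports"),
    "checkout": ThreadPoolExecutor(max_workers=2, thread_name_prefix="checkout"),
}

def submit(subsystem, fn, *args):
    """Run fn on the pool that belongs to the given subsystem."""
    return pools[subsystem].submit(fn, *args)
```

If every `reports` thread is stuck on a hanging call, further report jobs queue up behind them, but work submitted to `checkout` still runs on its own threads.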
Steady state. The system must be set up so no manual intervention is needed when it is operating normally. For example, automatic log rotation must be used to avoid filling up disks. Any data purging jobs should be automatic, and there should be limits to caches, to avoid the need for manual (error-prone) intervention. This point really resonates with me. Many times when I have been on call, the root cause of problems has been full disks for various reasons.
Unbounded result sets. Always put limits on the size of query results, input queues, etc. Otherwise you can easily hang or run out of memory if you get unusually large results or input.
Synthetic transactions. An excellent strategy to check the health of a system is to periodically send a synthetic job through it, and monitor the result. For the fake job to be processed, there must be resources available to process it at all stages. This makes the check much more comprehensive than simply checking that the server is up.
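For query results, a simple way to do this is to always pass a limit to the database. Here is a small sketch using Python's built-in `sqlite3` module. The `orders` table, its columns and the cap of 1000 rows are assumptions for the example.

```python
import sqlite3

MAX_ROWS = 1000  # assumed hard cap; pick one that fits your memory budget

def fetch_orders(conn, customer_id, limit=MAX_ROWS):
    """Return at most `limit` rows, plus a flag saying whether more exist."""
    limit = min(limit, MAX_ROWS)
    # Ask for one extra row so we can tell that the result was cut off.
    rows = conn.execute(
        "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id LIMIT ?",
        (customer_id, limit + 1),
    ).fetchall()
    truncated = len(rows) > limit
    return rows[:limit], truncated
```

Returning the `truncated` flag lets the caller decide whether to page through the rest or to report that the result was cut short.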
SLA inversion. If your system has a Service Level Agreement (SLA) of 99.99% availability, but depends on an external system with lower availability, you have SLA inversion. This simply means that you can never guarantee higher availability than the lowest value of any system you depend on.
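A quick calculation shows how fast this adds up. The sketch below assumes that every dependency is required for each request and that failures are independent, in which case the availabilities multiply. The figures are hypothetical.

```python
import math

def best_case_availability(own, dependencies):
    """Upper bound on availability when every dependency is required.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return own * math.prod(dependencies)

# Hypothetical figures: our own stack at 99.99%, two required dependencies.
combined = best_case_availability(0.9999, [0.999, 0.995])
print(f"{combined:.2%}")
```

With these numbers the combined figure comes out at roughly 99.39%, which is below even the weakest dependency (99.5%).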
Test harness. To really test how your system behaves in the face of network problems, you need a test harness that can respond in all sorts of nasty ways. Some examples of how the test harness could act: refuse all connections, accept connections but never send any data back, respond with one byte every 30 seconds, send random bytes back, etc.
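As an illustration, here is a minimal Python sketch of one of these behaviours: a server that accepts connections but never sends any data back. It is my own example, not the harness from the book, and the function name `start_silent_server` is made up.

```python
import socket
import threading

def start_silent_server():
    """Accept connections but never send a byte back. Returns (host, port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def serve():
        while True:
            conn, _ = listener.accept()
            held.append(conn)  # accept, then say nothing

    threading.Thread(target=serve, daemon=True).start()
    return listener.getsockname()
```

Point the client under test at the returned address. A client without a read timeout will hang on its first `recv`, which is exactly the kind of bug this sort of harness is meant to expose.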
Capacity
The capacity section has less interesting recommendations than the stability section. However, the first few pages contain a good definition of what capacity is. Throughput describes the number of transactions the system can handle in a given time. A system’s capacity is the maximum throughput it can sustain with acceptable response times. The capacity is always limited by exactly one constraint at a time, for example memory, number of threads, network bandwidth, etc. Also, there are often non-linear effects, so that when you hit a constraint, you get a “knee” where the capacity levels off.
General Design Issues
This was the least interesting section for me. The first two chapters (on networking and security) were too short for anything other than an overview. The chapter on availability has a good discussion of how to define availability. For example: What are acceptable response times? What if some less important parts of the system are down? How often should the system be checked for availability, and from what locations?
In the administration chapter, there is a discussion on how to handle configuration files. There is also a good recommendation to ensure that the start-up sequence is complete before processing work. If work gets in before all parts of the system are up and properly configured, you can get some tricky bugs.
Operations
In Transparency, the author describes how monitoring the system is useful for spotting trends and for forecasting, as well as showing the present status of the system. The present status is useful to show on a dashboard, to give people a sense of what the normal operation of the system looks like.
In addition, you want to tie the monitored values to alarms. There should be checks that all expected events are happening, that no unexpected events occur, and that all parameters are in the expected range (values that are too low can indicate a problem, just as values that are too high can).
In the section on logging, there is good advice on making logs scannable by humans, writing log messages that are easy to interpret, and including identifiers to make it easy to track requests through the system.
Finally, in Adaptations, the author discusses the continuous development of the system, including the advice that “releases shouldn’t hurt” and how to do zero-downtime releases, including database migrations.
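As a small illustration of the identifier advice, here is one way to stamp every log line with a request id using Python's standard `logging` module. The logger name `app`, the log format and the helper names are my own assumptions, not something prescribed by the book.

```python
import logging
import sys
import uuid

def make_handler(stream=sys.stderr):
    """Handler whose fixed-column format is easy to scan and to grep."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)-7s request=%(request_id)s %(message)s"))
    return handler

def request_logger(request_id=None):
    """Return a logger that adds the same request id to every message."""
    return logging.LoggerAdapter(
        logging.getLogger("app"),
        {"request_id": request_id or uuid.uuid4().hex},
    )
```

After attaching the handler once with `logging.getLogger("app").addHandler(make_handler())`, each request gets its own adapter via `log = request_logger()`, and searching the logs for that id shows everything that happened to the request.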
For me, the best parts of the book were the case studies, and the recommendations in the Stability chapters. There is some repetition in parts of the book, and it feels a bit dated at times. For example, several issues in the last chapter, like monitoring and continuous delivery, seem to have become standard practice by now. Nevertheless, it is a worthwhile read, especially these days when more and more developers also operate the systems they develop.