For the system at work, I am on call one week every seven weeks. For most of the past ten years, I have been on organized on call rotations for the systems I have been developing. I think being on call is a logical way of taking responsibility for your work. You also learn a lot from it. However, it is stressful and an inconvenience, so you should get paid for it.
Why Developers Should Be On Call
Many systems need to be available around the clock, so somebody has to be on call for support. In my mind, it is natural that the people developing it also support it.
Skin in the game. As a developer, taking responsibility for your code means not just writing it, but testing it and making sure it does what it is supposed to, even when it is running in production. When you know you might get woken up in the middle of the night if there is a problem, you become more careful coding and testing. In his book Antifragile, Nassim Nicholas Taleb mentions how Roman engineers had to spend some time under the bridges they built – to ensure they did a good job.
Holistic view. Troubleshooting the system forces you to get a good understanding of how it all works. If you don’t work with support, it is easy to end up knowing some parts really well, but not seeing the whole picture of how all the parts connect and interact. Also, you get to see firsthand all the ways in which the customers are using the system.
Drives improvements. When you work on fixing problems in the system, it is immediately clear if the logging or troubleshooting tools aren’t as good as they should be. As a developer of the system, you can fix these problems as you run into them. You can improve the logging by adding more relevant information, or by removing log statements that only clog up the logs. You can write scripts and other tools that help in troubleshooting and recovery. You can remove false alarms. Over time, many such small improvements add up to a much better and more resilient system.
Phone calls and alarms. At TriOptima, we have an emergency number that can be called if there is a problem with the system. The number is forwarded to the developer on call. Over time, we have also added more and more monitoring of, for example, exceptions thrown, queue lengths and memory, CPU and disk utilization. The monitoring will generate alarms that will wake up the developer on call. These days we are almost never woken up by phone calls, because we get the alarms from the monitoring before most customers notice a problem.
Heartbeat jobs. For two of the major interacting systems, we run periodic heartbeat jobs (every minute and every five minutes). If they are not executed as expected, alarms are raised. These have proven to be extremely useful, because they tend to catch a whole range of problems. For example, recently the server sending us data ran out of threads. One effect of that was that our heartbeat job did not run, so we got an alarm. On other occasions, we have become aware of connectivity problems through the heartbeat jobs. In the book Release It!, Michael T. Nygard calls these synthetic transactions.
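As a rough illustration of the heartbeat idea (a minimal sketch, not the actual implementation at work), the monitor only needs to remember when the job last ran and compare that against the expected interval. The class name, the field names and the grace period are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Watches one periodic job and reports when it has gone quiet."""

    interval_s: float   # how often the job is expected to run
    grace_s: float      # assumed slack before alarming, to tolerate jitter
    last_beat: float    # timestamp (seconds) of the most recent run

    def beat(self, now: float) -> None:
        """Called by the heartbeat job each time it completes."""
        self.last_beat = now

    def is_overdue(self, now: float) -> bool:
        """True if the job has not run within interval + grace."""
        return now - self.last_beat > self.interval_s + self.grace_s


# A job expected every minute, with 30 seconds of slack (hypothetical values).
monitor = HeartbeatMonitor(interval_s=60, grace_s=30, last_beat=0.0)
monitor.beat(now=60.0)
on_time = monitor.is_overdue(now=120.0)  # 60 s since the last beat
missed = monitor.is_overdue(now=200.0)   # 140 s since the last beat
```

The useful property is that the alarm fires on the absence of a signal, so it does not matter whether the cause is a thread pool running dry on the sending side or a network problem in between.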
Easy roll backs. At work, we deploy new features and bug fixes as they are ready. This means many small deploys throughout the day. If we get woken up at night because of a problem, the solution is often to roll back the most recent deploy. This is an easy and quick operation thanks to Kubernetes. A few changes, such as database migrations, can be harder to roll back. But since we know this, and have skin in the game, we try to write these to be as easy as possible to roll back. We also make high risk deployments in the morning, to have the whole day to monitor the system.
Improving the alarms. We are continuously tuning the alarms to make them more precise, and to weed out false alarms. One example is the queue length alarms. We started out by raising an alarm if the average length during the last 10 minutes was over a threshold. However, after we had fixed the underlying problem, the queue length alarm did not clear until the average came back down below the threshold. Not clearing it right away could mask a new problem. So we changed it to be raised as soon as the threshold was reached, and to clear as soon as the queue length got down to zero. This made that particular alarm more precise and responsive. Only last Friday I was tweaking another alarm – some synchronization problems can wait until regular office hours, so those no longer wake anybody up.
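The difference between the two queue length alarms can be sketched as follows. This is an illustration under assumptions, not the production code: the class names, the threshold of 50 and the one-sample-per-minute window are all made up for the example.

```python
from collections import deque


class AverageAlarm:
    """Original rule: raised while the windowed average is over the threshold."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the last `window` samples
        self.raised = False

    def observe(self, length: int) -> bool:
        self.samples.append(length)
        average = sum(self.samples) / len(self.samples)
        self.raised = average > self.threshold
        return self.raised


class ThresholdAlarm:
    """Revised rule: raise when the threshold is reached, clear at zero."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.raised = False

    def observe(self, length: int) -> bool:
        if length >= self.threshold:
            self.raised = True
        elif length == 0:
            self.raised = False
        # Between zero and the threshold the previous state is kept.
        return self.raised


# One sample per minute: the queue backs up for two minutes, then is drained.
samples = [0, 200, 200, 0, 0, 0]
average_alarm = AverageAlarm(threshold=50, window=10)
threshold_alarm = ThresholdAlarm(threshold=50)
average_states = [average_alarm.observe(n) for n in samples]
threshold_states = [threshold_alarm.observe(n) for n in samples]
```

With these samples the average-based alarm is still raised three minutes after the queue has emptied, while the threshold-based alarm clears on the first zero reading and is ready to signal a new problem.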
Compensation and Scheduling
Pay. People on call should get paid extra for it. There is a significant impact on your life if you have to be ready to log on and troubleshoot issues at any time during a week, so you deserve to be compensated for that. I think the best system is when you both get a fixed amount just for being on call, whether there are incidents or not, and you also get paid every time you get called out. Getting time off in lieu is also a possibility.
Scheduling. When I have been on call, it has always been one week at a time, and I think that is a good duration. It is good to draw up the schedule far in advance (many months ahead). That way it is easier to plan other activities in your life. It is also really good to be able to swap weeks with other developers in the team. It is hard to say what the ideal cycle time for on call is. Too often, and it is hard to live a normal life. Too seldom and you don’t get enough practice and get rusty. Every seven or eight weeks is a good interval for me.
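Drawing up a schedule many months ahead is easy to automate. The following is only a sketch of one way to do it: a plain round-robin over the team, one week per person, plus a helper for swapping two weeks. The function names, the team member names and the start date are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

Schedule = List[Tuple[date, str]]  # (first day of the on-call week, developer)


def build_rotation(team: List[str], first_day: date, weeks: int) -> Schedule:
    """Assign one developer per week, cycling through the team in order."""
    return [
        (first_day + timedelta(weeks=i), team[i % len(team)])
        for i in range(weeks)
    ]


def swap_weeks(schedule: Schedule, i: int, j: int) -> Schedule:
    """Return a new schedule where the developers for weeks i and j trade places."""
    swapped = list(schedule)
    (start_i, person_i), (start_j, person_j) = schedule[i], schedule[j]
    swapped[i] = (start_i, person_j)
    swapped[j] = (start_j, person_i)
    return swapped


# Seven developers give each person one week on call every seven weeks.
team = [f"dev{n}" for n in range(1, 8)]
schedule = build_rotation(team, first_day=date(2024, 1, 1), weeks=14)
after_swap = swap_weeks(schedule, 0, 1)
```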
Escalation. There should always be an escalation path if there is a real crisis. If the whole system is down and you are not making any progress fixing the problem, you need to bring in more help. Typically a manager will be the escalation point, and they will try to get hold of whoever else can help. I also tell my colleagues that I don’t mind being called, even if I am not on call that week, if there is something I can help with. I have only been called a few times in ten years, but just knowing that you can call other team members if necessary helps lower the stress of being on call.
Onboarding. New members on the team need time to get to know the system and practice troubleshooting before they join the on call rotation. After that, we pair up with the new member for a week of on call together, before they are scheduled on their own.
Even though being on call can be quite stressful at times, I think it is really beneficial. Being on call means taking responsibility for what you develop. In the process you learn how the system works (or doesn’t work), and can use that feedback to improve the system. However, the burden of being on call needs to be shared between team members, and you should get paid properly for it.
This is the standard model at most places I’ve worked. It will continue to become even more common if devops culture succeeds in producing more teams that own the systems they write. The traditional AM model is finally dying.
Hacker News discussion: https://news.ycombinator.com/item?id=18586743
Reddit discussion: https://www.reddit.com/r/programming/comments/a2lrrh/
Thanks for this post. I learned a lot.
1. For every week of on call, a developer should get 1 day off.
2. If a developer had to work nights, he should be compensated with additional days off.
3. No payment would reduce the stress, so we should not ask for monetary compensation.
And yet, there are places that claim since US engineering is typically an “exempt” position, “it’s just a part of the job.”
We need to push to not accept that anymore.
At my last company we were considered on call every day of the year and every hour of the week. It was quite stressful.
Paying for firefighting is a bad idea. It incentivizes the wrong things.
My experience is that everyone on our team works to minimize being called, despite being paid when we *do* get called. Everybody wants a stable system. Is your experience different?
What I meant by paying for firefighting is paying extra for incident handling. If an on call shift is part of the work every N weeks, bake that into the salary; don’t incentivize overtime pay.
Software development and operations of that software include keeping it functional and supporting the customer. If you fail to deliver a working product to the customer, your pay should be impacted.
– You might join a team or an existing project. By that logic, your pay is impacted immediately for a project you have just taken on, and improving it will take you roughly half a year. Shouldn’t you get extra for that?
– The company might put you on a 2-4 person on-call rotation, which is insane.
– On call is the easiest solution: you just put developers on call. But are they supported in adding tasks to the sprint to minimize the calls? In many cases I have seen, they were not. Pressure from above to add features, at the expense of firefighting at night, is sometimes what a company prefers.
– The above is not anecdotal.
– Why are you expected to work the day after a stressful 24/7 week? The low-level stress, even when there is no call, is hurting the health of most of us. To recuperate, we must have time off immediately after the shift.