You have developed a new feature. The code has been reviewed, and all the tests pass. You have just deployed this new feature to production. So on to the next task, right? Wrong. Most of the time, you should check that the feature behaves as expected in production. For a simple feature it may be enough to just try it out. But many features are not easily testable. They may be just one part of a complex flow of actions. Or they deal with external data fed into the system.
In such cases, checking if the feature is working means looking at the logs. Yet another reason for checking the logs is that the feature may be working fine most of the time, but given unanticipated data, it fails. Usually when I deploy something new to production, I follow up by looking at the logs. Often I find surprising behavior or unexpected data.
Many developers simply assume that the new feature they are deploying will work as expected. Ideally, the new feature has been tested in a production-like test environment. In my experience, it is not enough that all automated tests pass. If the new feature has not been explored in a test system, there is a risk that it is not working properly. This is because the automated tests focus on the code, whereas exploring the feature in a test system makes you consider the whole picture. It is the difference between checking and exploring.
But even when the feature has been tested properly before, there is a risk that it won’t work as intended in production. The main reason for this is that the environment in production is more complex than in the test system. There is usually more traffic, more concurrency, and more diverse data.
The key to finding out how the new feature behaves in the more complex production environment is logging. I have already written about what I think is needed for good logging. Of course, there needs to be logging in the first place. If you don’t log anything about how your feature is behaving, you are effectively blind. The only way to know if it is working is to test it, or to wait for trouble reports from customers. If you do log how the feature behaves, you can be proactive. After I have deployed a new feature, I usually look at the logs. Typically, most log entries are the expected cases. The interesting part is when you exclude all the expected cases, or search for error cases. That is usually when I find the corner cases that I had not anticipated.
I sometimes hear people say that nothing should be logged when everything is working as expected. However, that stops you from finding the cases where your code “works” but gives the wrong result. Often this is due to unanticipated data. For example, suppose an agreement should be deactivated when a user sets the exposure value to zero. But what if the exposure value is sometimes set to zero by a system user too? Is deactivating the agreement correct in that case? If you log every time it happens, you will notice that the value is sometimes set by the system user, which is perhaps not the intended behavior. Without logging, you would not be able to see this difference. You could say the requirements were incomplete, but logging was needed to discover that.
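As a rough illustration of the exposure example above, here is a minimal Python sketch that logs the expected case together with who triggered it. The names Agreement, set_exposure, changed_by and is_system_user are made up for this sketch and are not taken from any real system.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("agreements")


@dataclass
class Agreement:
    # Hypothetical, simplified agreement record.
    agreement_id: str
    exposure: float
    active: bool = True


def set_exposure(agreement: Agreement, new_exposure: float,
                 changed_by: str, is_system_user: bool) -> None:
    """Update the exposure and deactivate the agreement when it becomes zero."""
    agreement.exposure = new_exposure
    if new_exposure == 0:
        agreement.active = False
        # Log the "everything worked" case too, including who caused it.
        # This is what makes unexpected system-user changes visible later.
        logger.info(
            "Deactivated agreement %s: exposure set to zero by %s (system_user=%s)",
            agreement.agreement_id, changed_by, is_system_user,
        )
```

Searching the logs for entries with system_user=True would then show whether the system user ever triggers a deactivation.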
Another reason for logging what is happening (even when there are no errors) is that it helps with troubleshooting. The system may be behaving as expected, but people are misunderstanding what should happen (not uncommon for a complex system). Checking the logs to see what happened will demonstrate that the system did the correct thing, even though it was not what we expected.
Continuing Testing In Production
Often, checking how a new feature behaves in production is referred to as “testing in production”. I think this label is misleading. It makes it sound like there is no testing done before deploying to production. But of course the responsible thing is to test thoroughly before deploying. I think “continuing testing in production” better describes what it is. Checking the logs after deployment is one aspect of this. But there are other ways of making sure that deploying a new feature does not cause any problems. Here are some techniques we use at work:
Gradual rollout. When you introduce a new feature, it doesn’t have to be all or nothing. If you want to be cautious, only turn it on for a small subset of users at a time. For example, we have used the starting letter of the party group name to decide if the new feature should be used. First, only party group names starting with A – D, then A – H and so on.
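A check like the letter-based rollout above could look roughly like this in Python. The function name feature_enabled and the parameter last_enabled_letter are assumptions for this sketch. Names that do not start with a letter in the enabled range (including empty names and names starting with a digit) keep the old behavior.

```python
def feature_enabled(party_group_name: str, last_enabled_letter: str) -> bool:
    """Return True if the name starts with a letter in the range A..last_enabled_letter."""
    # Take the first character, if any, and compare case-insensitively.
    first = party_group_name.strip()[:1].upper()
    # An empty string or a digit sorts before "A", so it is never enabled.
    return "A" <= first <= last_enabled_letter.upper()
```

Widening the rollout from A – D to A – H is then just a matter of changing last_enabled_letter from "D" to "H", ideally through configuration rather than a code change.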
Feature flags. Another way of doing a gradual rollout. Only users that have the flag enabled get the new feature. If there are many different feature flags at the same time, there can be problems that only show up for a particular combination of flags being on and off. Therefore, it is good to remove feature flags that are no longer needed.
Test accounts. Having users in the production system that are only used for testing is really good. Then you can test features in production without impacting real customers.
Fall back to previous. If a feature introduces a new way of doing something, it is good to keep the old way of doing it for a while. Then you can add code that will fall back to the old solution if anything goes wrong with the new solution. These fallbacks should be tracked (metrics or logging), so that you can decide when it is safe to remove the old solution.
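One possible shape for such a tracked fallback, sketched in Python. The helper with_fallback and the in-memory counter fallback_counts are assumptions for this sketch. A real system would more likely report to its metrics system instead of a Counter.

```python
import logging
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")

logger = logging.getLogger("fallback")

# Hypothetical in-memory metric: number of fallbacks per feature name.
fallback_counts = Counter()


def with_fallback(name: str, new_impl: Callable[[], T], old_impl: Callable[[], T]) -> T:
    """Try the new implementation; on any exception, record it and use the old one."""
    try:
        return new_impl()
    except Exception:
        fallback_counts[name] += 1
        # logger.exception includes the stack trace of the failure in the log entry.
        logger.exception("New implementation of %s failed, falling back to old", name)
        return old_impl()
```

When the counter for a feature has stayed at zero for long enough, that is a reasonable signal that the old solution can be removed.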
Compare results. A variation of fallback to previous. If you introduce a new way of, for example, calculating a margin call amount, then it is good to do it both ways and compare the results. If the results differ, you need to investigate why.
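Here is a minimal sketch of running both calculations side by side, assuming the old result stays authoritative while the new one is being evaluated. The function margin_call_amount and the callables old_calc and new_calc are hypothetical names, not an actual API.

```python
import logging
from decimal import Decimal
from typing import Callable

logger = logging.getLogger("margin_call")


def margin_call_amount(agreement_id: str,
                       old_calc: Callable[[str], Decimal],
                       new_calc: Callable[[str], Decimal]) -> Decimal:
    """Compute the amount both ways, log any difference, and return the old result."""
    old_amount = old_calc(agreement_id)
    try:
        new_amount = new_calc(agreement_id)
        if new_amount != old_amount:
            logger.warning(
                "Margin call mismatch for agreement %s: old=%s new=%s",
                agreement_id, old_amount, new_amount,
            )
    except Exception:
        # A failure in the new calculation must not affect the result in use.
        logger.exception("New margin call calculation failed for agreement %s", agreement_id)
    return old_amount
```

Each logged mismatch is then a concrete case to investigate before switching over to the new calculation.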
Easy enable/disable. When you introduce a new feature, it is good if it can be easily disabled, for example with a flag. That way, if there is a problem with the feature, it can just be disabled. An alternative is to roll back the change (deploying the old code again), but sometimes that is more complicated, for example if the database schema was changed.
Heartbeat jobs. Performing some basic round-trip action every few minutes or so, and raising an alarm if it fails, is really helpful. There are many ways a system can fail (connectivity issues, overload, bugs in the code, etc.), and a failing heartbeat job will reveal any of them that break the round trip.
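The core of a heartbeat job can be quite small. Below is a sketch in Python, where check stands for whatever round-trip action fits your system and raise_alarm for whatever alerting you already have. Both names are placeholders for this sketch, and the scheduling itself (cron, a job scheduler, etc.) is left out.

```python
import logging
from typing import Callable

logger = logging.getLogger("heartbeat")


def run_heartbeat(check: Callable[[], bool], raise_alarm: Callable[[str], None]) -> bool:
    """Run one round-trip check and raise an alarm if it fails or throws."""
    try:
        ok = check()
        reason = "round-trip check returned a failure"
    except Exception as exc:
        # Connectivity problems, timeouts and bugs all end up here.
        ok = False
        reason = f"round-trip check raised {exc!r}"
    if ok:
        logger.info("Heartbeat OK")
    else:
        raise_alarm(reason)
    return ok
```

The important design choice is that an exception counts as a failure just like an explicit False, so the job does not need to know in advance how the system might break.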
Deploying to production is not the last step. You need to be proactive and make sure that what you have deployed works as expected, without any surprises. Also, there are many strategies to lower the risk when deploying new features to production.
Reddit discussion: https://www.reddit.com/r/programming/comments/ijrzha/deployed_to_production_is_not_enough/