and why you should not care too much about your servers’ uptime
During my career I have often been involved in operations for critical digital properties, whether B2B or B2C Web Applications, Mobile Applications or APIs. On the basis of this experience, customers now invite me to support handover procedures between the project-based teams that build a solution and the service operation team in charge of supporting it. I act as a supervisor, or sometimes explicitly as a trainer, for this process and for the service operation team in general.
I am expected to help define monitoring probes and checklists that detect incidents and identify the root cause of each one, and to guide the team in writing procedures that let support staff quickly restore the service, in the form of actionable instructions (“if incident A happens, execute activity 1 to restore service; if incident B happens, execute activity 2 instead”).
With (luckily) some exceptions, this expectation is much the same in companies with a small technological footprint (e.g. the ICT department of a mechanical factory) as in big technology-based companies (e.g. digital agencies and software factories).
It comes as quite a surprise to these customers when I start explaining that there will be no “root cause” for the majority of their incidents, and that there will be no immediately actionable procedure for their operators to execute.
The second surprise comes when I tell them that their project team will stay permanently involved with the service operation team in running the solution for the following months or years, and that they should have budgeted and planned for this.
The final surprise comes when I describe how irrelevant it is to monitor whether the servers they are using are up or down.
No “root cause” = many “contributing causes”
In modern digital architectures we implement and manage complex distributed systems. In complex systems you will usually not find a single, simple root cause of an issue; in the majority of cases an incident originates from the sum of multiple factors, each of which alone might cause no harm.
For each kind of issue (let’s say an API that is not available) there are many definitions. Is it not available through the public hostname pointing to the CDN? Not working at the hosting origin? Is one specific entry point or entity affected, every request, or the majority of them? Is it returning an error, nothing, wrong data, a wrong data format, or timing out? For each of these error definitions I can easily think of hundreds of different scenarios (election of a new MongoDB primary node in a remote region of a multi-region setup, increasing network latency between the application and data layers; high CPU on the database server; errors in the messaging system; a public or private DNS error; a false positive on a Web Application Firewall; errors in the data set; increased latency on an eventually consistent data set; exceptions in the asynchronous service that takes care of the projections for a CQRS implementation; etc.). For each scenario there are tens of possible causes: considering only the election of a new MongoDB primary, it could be due to network latency on the previous primary, to saturated CPU on the previous primary, to a crash of the OS on the VM during an update, to a restart of the VM caused by underlying hardware errors, etc.
Let’s imagine that the reported incident is the unavailability of a specific section of a Web Site or Mobile Application. This could be caused by an increased response time on a specific entry point of an API; the API could be working perfectly, just 6% slower than before.
We could say that the “root cause” of the incident is the increased response time of the API. But if we look in more detail and try to answer the question “why did the response time increase?”, we could find a change on the Web Site in the parameters passed to the API, causing a cache bust that bypasses the CDN or causing the data query on the backend not to use the indexes that were defined. This change had already been tested in the QA environment, where it increased the response time by only 2%. Concurrently, a data import increased the size of the dataset; this too was tested in a test environment before being applied to production, and the increase was only 1%. Unfortunately, an incident in a specific availability zone of the Cloud Provider caused the database server to fail over to a different location, with higher network latency to the services hosting the APIs; this emergency condition had been correctly tested in the past and was known to increase the response time by 3%. Each of these three events was designed and tested not to cause any incident, but the sum of the three happening at the same time produced the 6% increase in response time that caused the visible errors for end users.
But is the problem really the 6% increase in response time? What if you finally discover that, at the same time, a new release of the Web Site reduced the time-out before the call to the API is abandoned, and that without that change the 6% increase would not have caused any visible impact? In a situation like this, what is the “root cause” of the incident?
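The arithmetic of this scenario can be sketched in a few lines. The percentages come from the example above; the baseline response time and the two time-out values are hypothetical numbers, chosen only so that the interaction becomes visible.

```python
# Hypothetical values, assumed only for illustration (not from a real system).
BASELINE_MS = 1000      # assumed response time before any change
OLD_TIMEOUT_MS = 1100   # assumed client time-out before the new Web Site release
NEW_TIMEOUT_MS = 1050   # assumed client time-out after the new Web Site release

# Percentage increases taken from the scenario described in the text.
FACTORS = {
    "changed API parameters": 2,
    "larger dataset after import": 1,
    "database failover to another zone": 3,
}


def response_time_ms(increase_pct: int) -> int:
    """Response time after adding a percentage increase to the baseline."""
    return BASELINE_MS * (100 + increase_pct) // 100


def times_out(increase_pct: int, timeout_ms: int) -> bool:
    """True when the slowed-down call exceeds the client time-out."""
    return response_time_ms(increase_pct) > timeout_ms


def analyse() -> dict:
    """Check each factor alone, then all factors together, against both time-outs."""
    total_pct = sum(FACTORS.values())
    return {
        "each_factor_alone_fails": {
            name: times_out(pct, NEW_TIMEOUT_MS) for name, pct in FACTORS.items()
        },
        "combined_increase_pct": total_pct,
        "combined_fails_with_new_timeout": times_out(total_pct, NEW_TIMEOUT_MS),
        "combined_fails_with_old_timeout": times_out(total_pct, OLD_TIMEOUT_MS),
    }


if __name__ == "__main__":
    for key, value in analyse().items():
        print(key, "=", value)
```

With these assumed numbers no single factor breaks the call, and even the combined slowdown would have been harmless under the old time-out. The incident only exists as the intersection of four independent changes.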
The importance of Business Context and Solution Architecture
During my training sessions I always try to describe the architecture of the overall solution, listing each component and service and how they relate to each other. I try to keep a strong focus on the business needs that the solution tries to address and on the role each of its services plays in this.
Only when the overall solution, its list of modules and their relationships are clear do I start to describe, for each service, which kinds of issues could happen and which contexts could lead to them; then what to check (and with which tools), which considerations to make, and finally which actions are possible, with their risks and consequences.
In doing this, the focus is again on the business impact of each kind of failure, not on the number of instances or hosts or on their technical setup and configuration.
For an operation team to be able to support a solution, it is critical that the team fully understands the business value of each element of the solution and the impact of every incident on business operations.
The problems the operation team has to consider are not of the kind “one server out of tens, in a balanced pool serving a stateless application, is not responding correctly”. The operation team has to consider how the business operates and look at incidents from that perspective. If a server is not responding but the Web Application hosted by the pool it belongs to is responding perfectly, there is nothing critical; a degradation in the response time of that Web Application is instead a critical alert to manage, even if infrastructure metrics report all servers as being in perfect health.
We should shift our view from infrastructure (and application) monitoring to transaction and real user monitoring. Why shift the view? Because in a system with five VMs serving a REST API behind a CDN there is no reason to concentrate on the failure of one server: it will not have any effect. What we need to be alerted about is that real users are seeing a trend of increasing response time, with the average moving from 2 seconds to 3 seconds. Users may not even have noticed it yet, but it requires attention and intervention before it becomes a serious performance issue (and therefore, from an end-user perspective, an availability issue).
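As a minimal sketch of this shift in perspective, the check below raises an alert on a rising trend in user-facing response time and treats a missing server as informational only. The function names, the window size and the 25% threshold are assumptions made for illustration; a real setup would take its samples from a real user monitoring tool.

```python
from statistics import mean
from typing import Sequence


def response_time_alert(
    samples_s: Sequence[float], window: int = 5, max_increase_ratio: float = 1.25
) -> bool:
    """Compare the average of the latest window with the window just before it.

    samples_s holds response times in seconds, in chronological order.
    Returns True when the recent average exceeds the previous average
    by more than max_increase_ratio (an assumed threshold).
    """
    if len(samples_s) < 2 * window:
        return False  # not enough data to compare two windows
    previous = mean(samples_s[-2 * window:-window])
    recent = mean(samples_s[-window:])
    return recent > previous * max_increase_ratio


def evaluate(servers_up: int, servers_total: int, samples_s: Sequence[float]) -> str:
    """Classify the situation from the user's point of view first."""
    if response_time_alert(samples_s):
        return "alert: user-facing response time is degrading"
    if servers_up < servers_total:
        return "info: reduced capacity, no user impact observed"
    return "ok"


if __name__ == "__main__":
    stable = [2.0] * 10
    degrading = [2.0] * 5 + [3.0] * 5
    print(evaluate(4, 5, stable))      # one server down, users unaffected
    print(evaluate(5, 5, degrading))   # all servers up, users getting slower
```

The ordering of the checks is the design choice that matters here: the user-facing signal decides the severity, and host status only adds context.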
Automation of failure detection and reaction, and the role of the Operation Team
Honestly, I do not believe in detailed run books when it comes to complex systems, but I do believe in educated, rational people who have a methodology for carrying out analysis and making decisions freely (and I believe junior people trained in the methodology can do that).
For each solution we can define a few standard incident cases, with standard checks to execute and, when the checks give the expected result, a procedure to restore service; but the majority of incidents will not match these cases. So the key is for operation teams to understand how the components are connected to each other and which tools they have for looking at what is happening, in order to take educated decisions on how to react.
Where we have predefined actions to be taken on specific events in the system, they should be fully automated (and some of this automation comes out of the box with certain Cloud Services or with Kubernetes clusters and their auto-healing mechanisms).
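The split between expected and unexpected events can be sketched as a simple dispatcher: known event types map to automated actions, and anything else is handed to people for analysis. The event names and remediation functions are hypothetical placeholders; in practice they would call the cloud provider or the orchestrator.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


# Hypothetical remediation actions, assumed for illustration only.
def restart_instance(event: Event) -> str:
    return f"restarted {event['target']}"


def scale_out_pool(event: Event) -> str:
    return f"added capacity to {event['target']}"


# Expected events, each with a predefined and fully automated reaction.
AUTOMATED_ACTIONS: Dict[str, Callable[[Event], str]] = {
    "health_check_failed": restart_instance,
    "pool_saturated": scale_out_pool,
}


def handle_event(event: Event, escalations: List[Event]) -> str:
    """Run the automated action for a known event, otherwise escalate it."""
    action = AUTOMATED_ACTIONS.get(event["type"])
    if action is None:
        # Unexpected event: no run book entry, people have to analyse it.
        escalations.append(event)
        return "escalated to a human for analysis"
    return action(event)


if __name__ == "__main__":
    queue: List[Event] = []
    print(handle_event({"type": "health_check_failed", "target": "api-3"}, queue))
    print(handle_event({"type": "latency_shift", "target": "api"}, queue))
    print(len(queue), "event(s) waiting for analysis")
```

Every entry added to the table over time is one fewer "if this happens, do this" instruction that an operator has to carry out by hand.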
The project team, having in-depth knowledge of the business and the technology of the project, should work constantly to enhance the health checks and therefore the auto-healing procedures. This has nothing to do with a separate operational support team, except as a source in a continuous flow: the support team reports which incidents have been seen, and the project team carries out problem analysis to identify any optimisation in auto-healing and failure detection. It is a never-ending process, which is the reason many people today strongly suggest having permanent multidisciplinary teams that take care of a specific group of projects for their whole life, from pre-sales to decommissioning, rather than having different people from different teams cover various projects in sequence.
Where an operation and support team can come into play is in communicating with customers and in gathering the most comprehensive view of what is happening on the platform, in order to take (or better, to help the project team take) educated decisions on corrective actions for unexpected issues, while understanding the risk, in terms of business impact, of each action. All the expected issues should already have been subject to automated correction.
When asked to support handover phases, I try to emphasise the following two organisational and cultural aspects:
- Support Teams and the people responsible for operations put too much emphasis on having run books in the form “if this happens, do this”. I do not believe this works on complex systems; they should work more on reaching a methodological approach to analysis and on understanding the architecture and the business, rather than on learning instructions to copy/paste.
- Project teams do not consider problem management and continuous improvement to be their duty once a project is in its operation phase, and expect the Support and Operation Team to take care of this. The only cases managed by the project team are “bug fixes” for bugs found in the code, but I think this is only a small part of managing something that is in operation.