Distributed Systems Observability

A recent project my team and I worked on involved re-architecting a globally distributed system for deployment in the public cloud. We learnt a lot completing this project, the most important lesson being that it never ends up being a ‘lift and shift’ exercise. Many times we had to decide whether to leave something as-is that was not as optimal as it should be, or to change it during the project, potentially impacting agreed timelines. Ultimately, the decision always ended up being to go ahead and make the improvement. I am a big fan of not falling into the trap of ‘never time to do it right, always time to fix it later’.

Something else I learnt a lot about during this project is the importance of being able to observe complex system behaviors, ideally in as close to real time as possible. This is ever more important these days as the paradigm shifts to containers and serverless. Combine this with a globally distributed system and bring elements like auto-scaling into the mix and you have got a challenge on your hands in terms of system observability.

So what is observability and is it the same as monitoring the service? The definition of the term as it applies to distributed systems seems to mean different things to different people. I really like the definition that Cindy Sridharan uses in the book Distributed Systems Observability (O’Reilly, 2018):

In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of the following facts:

  • No complex system is ever fully healthy.
  • Distributed systems are pathologically unpredictable.
  • It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
  • Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
  • Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.

No complex system is ever fully healthy.
At first glance, this might look like a bold claim, but it is absolutely true. There will always be a component that is performing sub-optimally, or a component that is currently failed over to a secondary instance. The key thing here is that when issues occur, action can be taken, ideally automatically and otherwise manually, to address the issue and keep the overall system stable and within any agreed performance indicators.
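
To make the automatic-versus-manual idea a little more concrete, here is a minimal sketch in Python. It is not taken from the project; the `Indicator` type, its fields, and the action names are all hypothetical, and it assumes an indicator where a higher value is worse (latency, for example).

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    """A single measured performance indicator (hypothetical shape)."""
    name: str
    value: float           # latest measured value
    threshold: float       # agreed limit; a value above it counts as a breach
    auto_remediable: bool  # True if an automated fix exists for this indicator


def decide_action(indicator: Indicator) -> str:
    """Return 'ok', 'auto_remediate', or 'page_on_call' for one indicator."""
    if indicator.value <= indicator.threshold:
        return "ok"
    # Breach: prefer the automated path, fall back to a human.
    return "auto_remediate" if indicator.auto_remediable else "page_on_call"


if __name__ == "__main__":
    checks = [
        Indicator("p99_latency_ms", 120.0, 200.0, True),
        Indicator("queue_depth", 950.0, 500.0, True),
        Indicator("replication_lag_s", 45.0, 30.0, False),
    ]
    for check in checks:
        print(check.name, decide_action(check))
```

In a real system the decision would feed an orchestration or paging tool; the point of the sketch is only that every indicator has an agreed limit and a known owner for the breach.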

Distributed systems are pathologically unpredictable.
Consider a large-scale cloud service with differing traffic profiles each day. Such a system may perform very well with one traffic profile and sub-optimally with another. In this example, again, knowing that an issue exists is critical. Some of these issues can be difficult to spot if the relevant observability functionality has not been built in. Performance issues in production especially can stay hidden if the right observability tools are not in place and reviewed constantly.
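
One way to stop a slow traffic profile from hiding inside a healthy-looking global average is to break latency down per profile. The following is a rough sketch under my own assumptions: the profile labels, the `(profile, latency_ms)` tuple shape, and the choice of a nearest-rank 95th percentile are illustrative, not something the project prescribed.

```python
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def p95_by_profile(requests):
    """Group (profile, latency_ms) pairs and return the p95 latency per profile."""
    grouped = defaultdict(list)
    for profile, latency_ms in requests:
        grouped[profile].append(latency_ms)
    return {profile: percentile(values, 95) for profile, values in grouped.items()}


if __name__ == "__main__":
    observed = [
        ("interactive", 20), ("interactive", 25), ("interactive", 22),
        ("batch", 100), ("batch", 300),
    ]
    print(p95_by_profile(observed))
```

A production system would normally get this from its metrics backend rather than from raw samples, but the grouping idea is the same.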

It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
This is especially true of complex distributed systems, and in my opinion it is definitely impossible to test every failure scenario in a very complex system. However, the key failure scenarios that can be identified must be tested, with mitigations put in place as necessary. For anything else, monitoring points should be in place to detect as many issues as possible.

Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
There will always be issues that are not caught by monitoring. Sometimes these are minor with no customer impact, sometimes not. When these issues occur, it is important to learn from them and make the necessary updates to detect them should they occur again. System monitoring points should be defined early in the project lifecycle and tested multiple times throughout development.

Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.
Perhaps one of the most critical points here. When problems occur, engineers will need the necessary information to be able to debug effectively. Consider a service crash in production where you don’t get a core dump, and service logs have been rotated to save disk space. When issues occur, you must ensure that the necessary forensics are available to diagnose the issue.
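
As a small illustration of keeping forensics around, the sketch below uses Python's standard `logging.handlers.RotatingFileHandler` so that rotation retains a bounded number of older log files instead of discarding them. The logger name, size limit, and backup count are assumptions chosen for the example; the right values depend on your disk budget and how far back you typically need to look.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_service_logger(path, max_bytes=10_000_000, backups=5):
    """Return a logger that rotates by size and keeps `backups` older files."""
    logger = logging.getLogger("service")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # When the file would exceed max_bytes it is renamed to path.1, path.2, ...
    # and only the newest `backups` files are kept.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_service_logger("service.log")
    log.info("service started")
```

Local retention like this only covers the gap until logs are shipped somewhere durable; it does not help if the host itself is lost.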

So, observability is not something that we add in the final stages of a project, but something that must be thought of as a feature of a distributed system from the beginning of the project. It should also be a team concern, not just an operational concern.

Observability must be designed, and the service architecture must facilitate that design. Observability must also be tested, something that can be neglected when the team is heads-down trying to deliver user-visible features with a customer benefit. That is not to suggest that observability doesn’t have a customer benefit. In fact, it is critically important not to be blind in production to issues such as higher-than-normal latency that might be hurting the customer experience. In a future post, I’ll go more in-depth into the types of observability which I believe should be built in from the start.