Distributed Systems Observability

A recent project my team and I worked on involved re-architecting a globally distributed system to facilitate deployment in the public cloud. We learnt a lot completing this project, the most important lesson being that it never ends up being a ‘lift and shift’ exercise. Many times we faced the decision to either leave something as-is that was not as optimal as it should be, or change it during the project and potentially impact agreed timelines. Ultimately, the decision always ended up being to go ahead and make the improvement. I am a big fan of not falling into the trap of ‘never time to do it right, always time to fix it later’.

Something else I learnt a lot about during this project is the importance of being able to observe complex system behaviors, ideally in as close to real time as possible. This is ever more important as the paradigm shifts to containers and serverless. Combine this with a globally distributed system, add elements like auto-scaling into the mix, and you have a real challenge on your hands in terms of system observability.

So what is observability and is it the same as monitoring the service? The definition of the term as it applies to distributed systems seems to mean different things to different people. I really like the definition that Cindy Sridharan uses in the book Distributed Systems Observability (O’Reilly, 2018):

In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of the following facts:

  • No complex system is ever fully healthy.
  • Distributed systems are pathologically unpredictable.
  • It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
  • Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
  • Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.

No complex system is ever fully healthy.
At first glance, this might look like a bold claim, but it is absolutely true. There will always be a component that is performing in a sub-optimal fashion, or a component that is currently failed over to a secondary instance. The key thing is that when issues occur, action can be taken, ideally automatically but otherwise manually, to address the issue and keep the overall system stable and within any agreed performance indicators.
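
To make the idea concrete, here is a minimal sketch (not from the project described above) of a health endpoint that reports per-component status rather than a single pass/fail, so a system that is degraded but still serving is visible as such. The component names and statuses are hypothetical, and it uses only the Python standard library.

    # Hypothetical health endpoint: reports per-component status so that a
    # degraded-but-stable system is distinguishable from a fully healthy one.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def check_components():
        # In a real service these would actively probe dependencies
        # (database, cache, downstream APIs) rather than return constants.
        return {
            "database": "ok",
            "cache": "degraded",  # e.g. currently running on a fail-over instance
            "downstream_api": "ok",
        }

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            components = check_components()
            overall = "ok" if all(s == "ok" for s in components.values()) else "degraded"
            body = json.dumps({"status": overall, "components": components}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()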

Distributed systems are pathologically unpredictable.
Consider a large-scale cloud service with differing traffic profiles each day. Such a system may perform very well with one traffic profile and sub-optimally with another. Here again, knowing an issue exists is critical. These types of issues can be difficult to spot if the relevant observability functionality has not been built in. Performance issues in production in particular can stay hidden if the right observability tools are not in place and being reviewed constantly.
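
As an illustration of the kind of built-in instrumentation I mean, the sketch below records per-endpoint request latency as a histogram using the Prometheus Python client. The endpoint name, port, and simulated work are all hypothetical; the point is that latency shifts between traffic profiles show up in the histogram rather than being hidden in an average.

    # Sketch only: per-endpoint latency histogram via prometheus_client.
    import random
    import time

    from prometheus_client import Histogram, start_http_server

    REQUEST_LATENCY = Histogram(
        "request_latency_seconds",
        "Request latency in seconds",
        ["endpoint"],  # label per endpoint so different traffic can be compared
    )

    @REQUEST_LATENCY.labels(endpoint="/orders").time()
    def handle_order():
        # Stand-in for real request handling work.
        time.sleep(random.uniform(0.01, 0.2))

    if __name__ == "__main__":
        start_http_server(9100)  # metrics exposed for scraping on :9100/metrics
        while True:
            handle_order()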

It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
This is especially true of complex distributed systems; in my opinion it is impossible to test every failure scenario in a sufficiently complex system. However, the key failure scenarios that can be identified must be tested, with mitigations put in place as necessary. For anything else, monitoring points should be in place to detect as many issues as possible.
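
As a hypothetical example of exercising one such identified scenario, the test below simulates a slow downstream dependency timing out and asserts that the caller degrades gracefully instead of failing outright. The function and names are illustrative only, not from a real codebase.

    # Hypothetical test of one identified failure scenario: a downstream
    # dependency timing out. The caller should fall back, not fail the request.
    import unittest
    from unittest import mock

    def fetch_recommendations(client, timeout=0.5):
        """Return recommendations, falling back to an empty list on timeout."""
        try:
            return client.get("/recommendations", timeout=timeout)
        except TimeoutError:
            return []  # degrade gracefully rather than failing the whole page

    class FailureScenarioTest(unittest.TestCase):
        def test_dependency_timeout_degrades_gracefully(self):
            client = mock.Mock()
            client.get.side_effect = TimeoutError("simulated slow dependency")
            self.assertEqual(fetch_recommendations(client), [])

    if __name__ == "__main__":
        unittest.main()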

Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
There will always be issues that are not caught by monitoring. Sometimes these are minor with no customer impact, sometimes not. When these issues do occur, it is important to learn from them and make the necessary updates so that they are detected should they occur again. System monitoring points should be defined early in the project lifecycle, and tested multiple times throughout the development lifecycle.

Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.
Perhaps one of the most critical points here. When problems occur, engineers need the necessary information to debug effectively. Consider a service crash in production where you don’t get a core dump, and the service logs have already been rotated to save disk space. You must ensure that the necessary forensics are available to diagnose an issue when it occurs.
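
One inexpensive way to improve the odds that useful forensics survive is structured logging with a request identifier, so whatever log lines remain after rotation still carry enough context to reconstruct what happened. The sketch below is a generic example using only the Python standard library; the logger name and fields are hypothetical.

    # Sketch: structured (JSON) log lines carrying a request ID for forensics.
    import json
    import logging
    import uuid

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            return json.dumps({
                "level": record.levelname,
                "message": record.getMessage(),
                "request_id": getattr(record, "request_id", None),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    request_id = str(uuid.uuid4())
    logger.info("charging card", extra={"request_id": request_id})
    logger.error("payment provider returned 503", extra={"request_id": request_id})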

So, observability is not something we add in the final stages of a project; it must be treated as a feature of a distributed system from the beginning. It should also be a team concern, not just an operational concern.

Observability must be designed, and that design must be supported by the service architecture. Observability must also be tested, something that can be neglected when the team is heads-down trying to deliver user-visible features with a customer benefit. That is not to suggest that observability doesn’t have a customer benefit: it is critically important not to be blind in production to issues like higher-than-normal latency that might be negatively impacting customer experience. In a future post, I’ll go more in-depth into the types of observability that I believe should be built in from the start.

Thoughts on AWS re:Invent 2018

I’ve just returned from AWS re:Invent 2018, Amazon Web Services’ yearly conference showcasing new services, features, and improvements to the AWS cloud. This was the 7th year of re:Invent, and my first time attending.

The scale of the conference is staggering – held across six different Las Vegas hotels over five days, with almost 60,000 attendees this year. I expected queues, and got them. Overall, though, the conference was well organized logistically. Provided I queued at least 30 minutes beforehand, I was able to make it to 95% of the sessions I planned on attending across the week.

In terms of the sessions themselves, most were very good. Over the week, I attended sixteen different sessions, made up of talks, demos, chalk talks, and hands-on sessions.

Two of my favorite sessions were ‘Optimizing Costs as you Scale on AWS’ and ‘AIOps: Steps Towards Autonomous Operations’. The former described the five pillars of cost optimization – Right Sizing, Increasing Elasticity, Picking the Right Pricing Model, Matching Usage to Storage Class, and Measuring and Monitoring. These may seem obvious, but they can easily be forgotten when, for example, a project starts as a POC that then becomes production, or a team is not yet familiar with AWS and with how costs can increase as an application’s usage scales up in production. This session also included insights from an AWS customer who talked through how they had applied and governed this model in their organization, which was interesting to compare and contrast with how I’ve seen it done in the past.

I also attended numerous sessions on SageMaker, AWS’s managed machine learning service (think AML on steroids). Now that I have attended a hands-on lab, I am more confident about starting to play around with SageMaker and exploring some of the ideas I have where it could be applied. I looked at it earlier this year while completing my Master’s thesis, but ended up using Amazon Machine Learning instead in the interest of time (AML is a lot simpler to get up and running). AWS also announced Amazon SageMaker Ground Truth, which can be used to streamline the labeling process for machine learning models via a combination of human and automated labelling. One other cool announcement around ML was the launch of AWS Marketplace for Machine Learning, where you can browse 150+ pre-created algorithms and models that can be deployed directly to SageMaker. Someone may have already solved your problem!

If I were to retrospectively give myself some advice for attending re:Invent, it would be:

  1. Try to organize sessions by hotel. Moving between hotels can take a long time (especially at certain points of the day, due to Las Vegas traffic), so arranging your sessions so that you are in the same hotel for most of the day is beneficial. Helpfully, there is a regular shuttle between conference venues.
  2. Don’t assume you will make every session. Colleagues who had previously been to re:Invent gave me this advice, but I still assumed I would make everything. Traffic, queues or something else will inevitably disrupt your schedule at some point during the week.
  3. Leave time for lunch! Easy to forget when you’ve got a menu of exciting talks to attend. AWS provided a grab-n-go lunch option, which was very handy for picking something up between sessions.

If I had one criticism of re:Invent, it would be that some of the talks labelled as advanced did not go as deep into the technical detail as I expected. I thought the hands-on labs did a good job here, though, especially the two I attended on AWS SageMaker.

Overall, re:Invent is a significant investment in the attendees you send (tickets are not cheap, not to mention accommodation, food, etc. – remember, it’s held in Vegas), but it is a good idea if you are taking first steps with AWS, looking at going deeper or optimizing your usage, or thinking about migrating existing on-premises services to the public cloud.

See here for a good summary of all the re:Invent announcements, as well as the keynote videos.