Distributed Systems Observability

This post was also featured in Issue #103 of the Distributed Systems Newsletter.

A recent project my team and I worked on involved the re-architecture of a globally distributed system to facilitate a deployment in public cloud. We learnt a lot completing this project, the most important lesson being that it never ends up being a ‘lift and shift’ exercise. Many times we faced a decision: leave something as-is that was not as optimal as it should be, or change it during the project, potentially impacting agreed timelines. Ultimately, the decision always ended up being to go ahead and make the improvement. I am a big fan of not falling into the trap of ‘never time to do it right, always time to fix it later’.

Something else I learnt a lot about during this project is the importance of being able to observe complex system behaviors, ideally in as close to real time as possible. This is ever more important these days as the paradigm shifts to containers and serverless. Combine this with a globally distributed system and bring elements like auto-scaling into the mix and you have got a challenge on your hands in terms of system observability.

So what is observability, and is it the same as monitoring the service? The definition of the term as it applies to distributed systems seems to mean different things to different people. I really like the definition that Cindy Sridharan uses in her book Distributed Systems Observability (O’Reilly, 2018):

In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained, and evolved in acknowledgment of the following facts:

  • No complex system is ever fully healthy.
  • Distributed systems are pathologically unpredictable.
  • It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
  • Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
  • Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.

No complex system is ever fully healthy.
At first glance, this might look like a bold claim, but it is absolutely true. There will always be a component performing in a sub-optimal fashion, or a component currently failed over to a secondary instance. The key thing is that when issues occur, action can be taken (ideally automatically, otherwise manually) to address them and ensure the overall system remains stable and within any agreed performance indicators.
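One way to embrace the ‘never fully healthy’ reality is to report service health as an aggregate of component states rather than a single green light. A minimal sketch – the component names and states here are illustrative assumptions, not a real health model:

```python
# Aggregate per-component health into an overall service status.
# States and names are illustrative; real systems have richer signals.
def service_status(components):
    """components: dict mapping component name -> 'ok', 'degraded' or 'failed'."""
    states = list(components.values())
    if any(s == "failed" for s in states):
        return "unhealthy"
    if any(s == "degraded" for s in states):
        # Still within agreed performance indicators, but needs attention.
        return "degraded"
    return "healthy"
```

A dashboard built on something like this surfaces the sub-optimal components immediately, instead of hiding them behind a single healthy/unhealthy boolean.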

Distributed systems are pathologically unpredictable.
Consider a large scale cloud service with differing traffic profiles each day. Such a system may perform very well with one traffic profile, and perform sub-optimally with another. In this example, again knowing an issue exists is critical. Some of these types of issues can be difficult to spot if the relevant observability functionality has not been built-in. Performance issues in production especially can be hidden if the right observability tools are not in place and constantly reviewed.

It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
This is especially true of complex distributed systems; in my opinion it is impossible to test every failure scenario in a very complex system. However, the key failure scenarios that can be identified must be tested, and mitigations put in place as necessary. For anything else, monitoring points should be in place to detect as many issues as possible.

Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
There will always be issues that occur which are not caught in monitoring. Sometimes these are minor with no customer impact, sometimes not. It is important when these issues occur to learn from them, and make the necessary updates to detect them should they occur again. System monitoring points should be defined early in the project lifecycle, and tested multiple times throughout the development lifecycle.

Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.
Perhaps one of the most critical points here. When problems occur, engineers will need the necessary information to be able to debug effectively. Consider a service crash in production where you don’t get a core dump, and service logs have been rotated to save disk space. When issues occur, you must ensure that the necessary forensics are available to diagnose the issue.

So, observability is not something that we add in the final stages of a project, but something that must be thought of as a feature of a distributed system from the beginning of the project. It should also be a team concern, not just an operational concern.
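One way to treat observability as a feature is to unit test the instrumentation itself, just like any other code. A minimal sketch, with a hypothetical in-memory metrics recorder standing in for a real client such as statsd or Prometheus:

```python
import time

class Metrics:
    """Hypothetical in-memory recorder; a real client would ship samples out."""
    def __init__(self):
        self.samples = []

    def timing(self, name, millis):
        self.samples.append((name, millis))

def handle_request(metrics, work):
    start = time.perf_counter()
    result = work()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Emit the latency metric that dashboards and alerts will depend on;
    # a test can assert this sample is actually recorded.
    metrics.timing("request.latency_ms", elapsed_ms)
    return result
```

A test then asserts that handle_request records exactly one ‘request.latency_ms’ sample, so a later refactor cannot silently drop the instrumentation the operations team depends on.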

Observability must be designed, and the service architecture must facilitate it. Observability must also be tested, something that can be neglected when the team is heads-down trying to deliver user-visible features with a customer benefit. That is not to suggest observability doesn’t have a customer benefit of its own; in fact, it is critically important not to be blind in production to issues like higher-than-normal latency that might be negatively impacting customer experience. In a future post, I’ll go more in-depth into the types of observability that I believe should be built in from the start.

Using a Doomsday Clock to Track Technical Debt Risk

Every software team has technical debt, and those who say they don’t are lying. Even for new software, there are always items in the backlog that need attention, be they architecture trade-offs or areas of the code that are not as easy to maintain as they should be. Unless you have unlimited time and resources to deliver a project – which in reality is never – you will always have items like these in the backlog, alongside the new features that need to be implemented.

Mostly, but not always, technical debt items are deprioritized in favor of new features that generate visible outcomes, value for the customer, and revenue for the business. In my opinion, this is OK – technical debt in software projects is a fact of life, and as long as it is not recklessly introduced, and there is a plan to address it later, it is fine.

It is also good to look at how technical debt gets introduced. In my experience, it is mostly down to time constraints, i.e. a delivery deadline that means trade-offs must be made. Martin Fowler introduced us to the Technical Debt Quadrant, which is a nice way of looking at how technical debt gets introduced. You would hope that you never end up anywhere in the top left.

Technical Debt Quadrant (Martin Fowler)

There are a few different ways of tracking technical debt, such as keeping items as labeled stories in your JIRA backlog, or using a separate technical debt register. The most important thing is that you actually track these items – and wherever you track them, it is critical to continuously review and prioritize them. It is also key that you address items as you iterate on new releases of your software.

When you do not address technical debt and use all your team’s work cycles to add new features (and likely new technical debt too), you will come to a tipping point. You will find it takes ever longer to add new features, or worse, some technical debt items may begin to impact your production software – think of that performance trade-off you made a few years ago when you were sure the workload would never reach this scale – now it has, and customers are being impacted. So, neglecting technical debt items that have the potential to be very impactful to your customer base is not a good idea, and these are the type of items I will discuss here.

Recently I was reading about the Doomsday clock. If you are not familiar with this:

Founded in 1945 by University of Chicago scientists who had helped develop the first atomic weapons in the Manhattan Project, the Bulletin of the Atomic Scientists created the Doomsday Clock two years later, using the imagery of apocalypse (midnight) and the contemporary idiom of nuclear explosion (countdown to zero) to convey threats to humanity and the planet. The decision to move (or to leave in place) the minute hand of the Doomsday Clock is made every year by the Bulletin’s Science and Security Board in consultation with its Board of Sponsors, which includes 13 Nobel laureates. The Clock has become a universally recognized indicator of the world’s vulnerability to catastrophe from nuclear weapons, climate change, and disruptive technologies in other domains.

So I thought, why not take this model and use it to track not existential risks to humanity, but technical debt items that pose a known catastrophic risk to a software product or service, be it a complex desktop or mobile application or a large cloud service. The items I am considering here are not small issues such as ‘I made a change and ignored the two failing unit tests’. While those are still important, the items I am thinking about are things that would cause a catastrophic failure of your software in a production environment should a certain condition or set of conditions arise. Let’s take two examples.

For the first example, let us consider a popular desktop application that relies on a third-party library to operate successfully. From inception, the application has used the free version of the library, and there has always been an item in the backlog to migrate to the enterprise version to ensure long-term support. Now there is a hard end-of-support date six months from now, and you need to migrate to the enterprise version before then to continue receiving security patches – which are frequent – or you risk exposing your customer base.

For the second example, consider a popular cloud service. The service uses a particular relational database that is key to the operation of the service and the customer value it provides. For some time, the scaling limits of this database have been known, and due to growth and expansion into international markets, these limits are closer than ever.

The main thing here is that I am talking about known technical debt items that will cause catastrophe at some point in the future. It is important here to draw the distinction between those and unknown items to which teams will always need to be reactive.

The method I had in mind for tracking such items, taking the Doomsday Clock analogy, was as follows:

  1. You take your top X (in order of priority) technical debt items – big-hitting items like those described above; you might have 5, you might even have 10.
  2. The doomsday clock starts at the same number of minutes from midnight as you have items – e.g. if you have 5 items you start at 11:55pm.
  3. Each time one of these items causes a real issue, or an issue is deemed imminent, move the time 1 minute closer to midnight. Moving the clock closer to midnight should be decided by your most senior engineers and architects.
  4. The closer you get to midnight, the more danger you are in of having these items affect your customer base or revenue.
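
The four steps above can be sketched in code. This is purely illustrative – the class and method names are my own invention, not an existing tool:

```python
class DebtClock:
    """Track your top technical debt items as minutes to midnight."""

    def __init__(self, items):
        # Step 2: start as many minutes from midnight as there are items,
        # e.g. 5 items -> 11:55pm.
        self.items = list(items)
        self.minutes_to_midnight = len(self.items)

    def advance(self, reason):
        # Step 3: called when an item causes a real issue, or an issue is
        # deemed imminent. In practice the decision to move the clock
        # belongs to your most senior engineers and architects.
        if self.minutes_to_midnight > 0:
            self.minutes_to_midnight -= 1
        return self.minutes_to_midnight

    def time(self):
        # Step 4: the closer to midnight, the more danger you are in.
        if self.minutes_to_midnight == 0:
            return "midnight"
        return "11:{:02d}pm".format(60 - self.minutes_to_midnight)
```

With five tracked items the clock reads 11:55pm; each advance moves it one minute closer to midnight.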

Reaching midnight manifests in a catastrophic production issue, unhappy customers, and potential loss of revenue. In my opinion, executives, especially those without an engineering background, can easily grasp the severity of a situation when it is explained using the method above. This method also keeps Product Management, or whoever decides your road-map, focused on the key items that need to be addressed – it is easy to fix small items like that failing unit test and think you are addressing technical debt, but in reality you are just fooling yourself and your team.

How does your team track these items?


If you are an Iron Maiden fan, their song ‘2 Minutes to Midnight’ is a reference to the Doomsday clock being set to 2 minutes to midnight in 1953, the closest it had been at that time, after the US and Soviet Union tested H-bombs within nine months of one another.

What Facebook Knows About You


There’s yet more hype this week around Facebook and privacy, coming out of the release of a new feature ‘Off-Facebook Activity’, which is now available in some regions, Ireland being one of them. This new feature allows you to view (and clear) activity from non-Facebook entities. So, this is basically information about third-party websites or applications that share your visit history with Facebook.

For example, you visit the Harvey Norman website and buy a laptop. Harvey Norman shares this information with Facebook, and the next time you visit Facebook you see an advertisement for a laptop bag. This is one of the main ways Facebook targets advertising. Now, by going to Settings -> Your Facebook Information -> Off-Facebook Activity, you can see each site that has shared information with Facebook in this way. Most normal users aren’t even aware that this is happening, and that sites they visit completely independently of Facebook will drive the ads they see on the platform.

When I checked this on my own profile, I was not surprised to see that 152 apps and websites had shared information about my browsing habits with Facebook. The most recent activity was from Microsoft, where I had recently been looking to buy a new Surface Pro on Microsoft.com:


This is a step in the right direction in terms of transparency of this behavior, and I like the fact that I can now remove this data if I choose to. But what else does Facebook know about me?

For a while now, Facebook has provided the ability to request a download of all of the information that it stores about you as a user of the platform. All you need to do is request it, and about an hour or so later you’ll receive a link to download a compressed (ZIP) file that contains a treasure trove of your personal information.

To generate your download:

  • Go to Settings
  • Go to Your Facebook Information
  • Go to Download Your Information
  • Under Request Copy, select Create File

I decided to give this a try to see exactly what information Facebook has collected from my 12 years of being an active user. The file itself can be large, mine was around 500MB. But what exactly does Facebook store about me? It intrigued me to think that all this data is sitting in some Facebook data center, so I wanted to know exactly what was there. Let’s delve into the download and see exactly the type of information that Facebook has stored on me long term.

The structure of the downloaded file looks something like the below, with a bunch of folders each containing information relating to a specific area:


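If you want a quick overview of your own archive before digging in, a short script can count the files under each top-level folder. A sketch, assuming only that the download is a ZIP laid out in folders like ‘ads’ and ‘about_you’:

```python
import io
import zipfile
from collections import Counter

def summarize_download(zip_bytes):
    """Count files per top-level folder in a Facebook data download ZIP."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, count real files only
            top_level = info.filename.split("/")[0]
            counts[top_level] += 1
    return dict(counts)
```

Running this over the archive gives a quick map of where the bulk of the data lives (in my case, photos and videos).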
I spent a while digging through the information. There are quite a few areas that concerned me. Firstly, the ‘ads’ folder. This contained three files:

  • ads_interests – a large list of what Facebook perceives my ad interests to be.
  • advertisers_who_uploaded_a_contact_list_with_your_information – a list of advertisers who uploaded a list to Facebook with my email address.
  • advertisers_you’ve_interacted_with – a list of every ad I’ve ever clicked on within Facebook.

The information stored here is very valuable to Facebook in terms of its advertising business – for example, let’s say I clicked on a craft beer ad (which I often do), and a new craft beer business wants to target relevant users in my region; I would then be highly likely to be in that list of targeted users based on the information that Facebook has on me. This approach to targeted advertising contributed to Facebook generating over $16 billion in advertising revenue in the final quarter of 2018 alone.

What else do we have in the download? Digging further, I discovered that the following information was present:

  • Every event from Facebook that I have ever been invited to, attended, or set up.
  • My friends list along with all the friend requests I have ever made or rejected.
  • A list of all the groups I’ve ever joined.
  • Every page and comment I have ever liked on Facebook.
  • Every Messenger thread I have ever been involved in, with all the private conversation content.
  • Everything I’ve ever posted to my Facebook profile.
  • Within the ‘about_you’ folder, I found a file called ‘your_address_books’ which contained all the contacts and phone numbers from my iPhone – this was alarming as I never remember allowing any application or Facebook access to this data.
  • All photos and videos including all my photo album content came in the download (this explains the large size).

My ‘location’ folder was empty, as I had disabled location tracking on Facebook long ago, but if you haven’t, this folder would contain a list of the locations (including GPS coordinates) from which you have ever logged on to Facebook.

What’s the bottom line here? Facebook stores a crap load of data about you and uses it to drive its advertising business. Like it or not, that’s the truth. If someone had access to the ZIP file that I downloaded, they could likely build a complete profile on me, see all my previous private conversations with friends, access friends’ phone numbers, see ads that I clicked on, and also determine sites that I have visited separately from Facebook.

There are a few things you can do to ensure that you lock down your advertising settings, which I recommend that you do:

  • Clear your Off-Facebook Activity regularly.
  • Turn off Location History.
  • In Ad Settings, set ‘Ads based on data from partners’ to ‘Not Allowed’.
  • In Ad Settings, set ‘Ads based on your activity on Facebook Company Products that you see elsewhere’ to ‘Not Allowed’.
  • In Ad Settings, set ‘Ads that include your social actions’ to ‘No One’.

These can help, but ultimately Facebook is constantly updating a profile on you based on your browsing activity. We all take Facebook usage at face value, but we forget that at the end of the day, Facebook is a business and is using all of our personal data to drive one of its main revenue sources – advertising.

I am reminded of my favorite comedian Bill Hicks’ thoughts on advertising.

Transitioning from an Individual Contributor to a Leader

I made a transition from an individual contributor to a leadership role 6 years ago, in February 2013. If you’ve made this transition in the software development world, you probably know that it can be difficult and there can be a few things that people may struggle with at the beginning. If you are thinking about moving to a leadership role, or have recently moved, this post may help. Here I’ll share some of my thoughts and some of the advice that helped me successfully make the transition.

An element of my new role I recall struggling with very early on was the feeling that I was no longer making a tangible contribution to a software development project, i.e. committing code to Git or testing other developers’ code. It took me a while to realize that I needed to begin focusing on the project delivery as a whole and ensuring it was successful, and not necessarily get down into the weeds unless I really had to. The other thing a new manager needs to do early on is trust his or her team, and some may struggle with this, especially if they have come from a technical lead position in the team. This can manifest in a dictatorial style of management, which is not good for the team or the business, and ultimately will not end in a successful outcome.

I’d like to share a few key pieces of advice I was given early in my career that have helped me make the transition from individual contributor.

“Be a leader, not a manager.”

This is rather cliché, and if you have ever attended any management training over the years, or read certain authors, you’ve probably heard it a lot. In my opinion, ‘manager’ is a strictly human-resources-centric term – managers handle general people management items like holding 1-1 meetings, running performance reviews, tracking leave balances, and perhaps annoying developers when they are 5 minutes late back from lunch. I picture someone with a clipboard and a pen every time I hear the word ‘manager’.

A leader is engaged day-to-day in the projects their team is working on. They know what is going on in each one – not necessarily every line of code, but they are familiar with each feature and its status. They are technical and can weigh in on technical discussions if required (although they generally don’t, as they trust their team’s ability to make the right decisions, and to learn from their mistakes when they don’t). They keep a continual eye on quality – the quality of ongoing projects, and quality processes and how they can be improved. Leaders care about people’s development and ensuring that their team members are working towards their ultimate career goals (for employees who have them; for those who don’t, that’s OK too). Leaders build a strong working relationship with their direct reports. Leaders are also looking to the future – what are the technologies their team should be investing in? What opportunities are we not exploiting? How can we do what we do better?

“If your team is successful, you will be successful.”

This is a simple piece of advice, yet very, very powerful. As an individual contributor, you can ensure that your assigned work is completed to the highest level of quality, but the project may still fail due to another area not being implemented successfully, or some external dependency not being fulfilled. As a leader, you need to ensure you have oversight of all areas that may impact or impede your team’s progress, and be actively working to ensure that any unknowns are clarified, any blockers are removed, any external issues are resolved quickly, and that your team members are focused on what they do best. At times, you will also need to protect your team from external distractions (e.g. getting pulled into unplanned work owned by another team). In my experience, your team will be in a better position to execute successfully if you are actively looking at these items (daily), and by extension you will be successful.

“Your network is important.”

I’ve been told this many times over the years, and only realized its importance a few years ago when I wanted to transition to a new role. It’s easy to neglect this, especially if you are new to a leadership role, but I would stress the importance of growing your network with relevant contacts, as you never know what opportunities those contacts may present in the future, or what synergies you can create between, for example, different teams based in the same location.


There will be hard times, and there will always be challenges.

As a software development leader, people will look to you for answers that you don’t always have. You will need to assimilate and recall large amounts of information. You will need to be able to account for your team’s time when needed. You will need to explain your failures. You will always need to champion the team’s achievements, stepping back and giving your team the credit they deserve. You will need to continually learn about new technologies and understand them, to enable you to take part in technical discussions. You will need to manage demanding executives, other leaders, stakeholders and customers, and always protect your team in the process.

In a challenging time, I always remember the Winston Churchill quote:

If You’re Going Through Hell, Keep Going.

One of my favorite TED talks, by Richard St. John, in which he outlines his thoughts on the secrets of success, captures the essence of day-to-day leadership excellently, I think – you basically have to persist through CRAP: Criticism, Rejection, Assholes, and Pressure.

The key for me to any leadership position is ensuring that you are continually learning. Remember that in any situation, good or bad, you can learn – especially from the bad situations.

Thoughts on AWS re:Invent 2018

I’ve just returned from AWS re:Invent 2018, Amazon Web Services’ yearly conference showcasing new services, features, and improvements to the AWS cloud. This was the 7th year of re:Invent, and my first time attending.

The scale of the conference is staggering – held across six different Las Vegas hotels over five days, with almost 60,000 attendees this year. I expected queues, and got them. Overall, though, the conference was logistically well organized. Provided I queued at least 30 minutes beforehand, I was able to make it to 95% of the sessions I planned on attending across the week.

In terms of the sessions themselves, most were very good. Over the week, I attended sixteen different sessions, made up of talks, demos, chalk talks, and hands-on sessions.

Two of my favorite sessions were ‘Optimizing Costs as you Scale on AWS’ and ‘AIOps: Steps Towards Autonomous Operations’. The former described the 5 pillars of cost optimization – Right Sizing, Increasing Elasticity, Picking the Right Pricing Model, Matching Usage to Storage Class, and Measuring and Monitoring. These may seem obvious, but they can often be forgotten, for example when a POC becomes production, or when a team is not too familiar with AWS and how costs can increase as an application’s usage scales up in production. This session also included insights from an AWS customer who talked through how they had applied and governed this model in their organization, which was interesting to compare and contrast with how I’ve seen it done in the past.

I also attended numerous sessions on SageMaker, AWS’s managed machine learning service (think AML on steroids). I’m looking forward to starting to play around with SageMaker; now that I have attended a hands-on lab, I am more confident beginning to look at some of the ideas I have where it could be applied. I looked at it earlier this year while completing my master’s thesis, but ended up using Amazon Machine Learning instead in the interest of time (AML is a lot simpler to get up and running). AWS also announced Amazon SageMaker Ground Truth, which can be used to streamline the labeling process for machine learning models, via human labeling and automated labeling. One other cool announcement around ML was the launch of AWS Marketplace for Machine Learning, where you can browse 150+ pre-created algorithms and models that can be deployed directly to SageMaker. Someone may have already solved your problem!

If I was to retrospectively give myself some advice for attending re:Invent, it would be:

  1. Try to organize sessions by hotel. Moving between hotels can take a long time (especially at some points of the day, due to Las Vegas traffic). Organizing your sessions so that you are in the same hotel for most of the day can be beneficial. A good thing, though, is that there is a regular shuttle between conference venues.
  2. Don’t assume you will make every session. Colleagues who had previously been to re:Invent gave me this advice, but I still assumed I would make everything. Traffic, queues or something else will inevitably disrupt your schedule at some point during the week.
  3. Leave time for lunch! Easy to forget when you’ve got a menu of exciting talks to attend. AWS provided a grab-n-go lunch option, which made it very handy to grab something between sessions.

If I had one criticism of re:Invent, it would be that some of the talks labelled as advanced did not go as deep as I expected into the technical detail. I thought the hands-on labs did a good job of this though, especially the two I attended on AWS SageMaker.

Overall, re:Invent is a significant investment in the attendees you send (tickets are not cheap, not to mind accommodation, food etc. – remember it’s held in Vegas), but a good idea if you are taking first steps with AWS, looking at going deeper or optimizing your usage, or thinking about migrating existing on-premises services to the public cloud.

See here for a good summary of all the re:Invent announcements, as well as the keynote videos.

Connecting to the SharePoint 2013 REST API from C#

Today I was updating an internal application we use for grabbing lots of terminology data from SharePoint lists and exporting it as TBX files for import into CAT tools etc.

This was required as the SharePoint on which it was hosted previously was upgraded from 2010 to 2013.

A small job I thought.

Then I discovered that the ASMX web services in SharePoint I used to grab the data previously are deprecated in SharePoint 2013. This is probably not a surprise to anyone in the know, but SharePoint happens to be one of my pet hates, so its development is not something I tend to keep up to date with.

Anyway, I had to re-jig our application to use the SharePoint REST API, and I thought I’d provide the code here for connecting, as it took a little bit of figuring out.

The below (after you fill in your SharePoint URL, username, password, domain, and the name of the list you want to extract data from) will connect and pull back the list contents into an XmlDocument object that you can parse.

// Requires: using System; using System.IO; using System.Net; using System.Xml;
XmlNamespaceManager xmlnspm = new XmlNamespaceManager(new NameTable());
Uri sharepointUrl = new Uri("SHAREPOINT URL");

// Namespaces used in the Atom feed that the REST API returns.
xmlnspm.AddNamespace("atom", "http://www.w3.org/2005/Atom");
xmlnspm.AddNamespace("d", "http://schemas.microsoft.com/ado/2007/08/dataservices");
xmlnspm.AddNamespace("m", "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata");

NetworkCredential cred = new NetworkCredential("USERNAME", "PASSWORD", "DOMAIN");

// Build the GET request for the list contents.
HttpWebRequest listRequest = (HttpWebRequest)HttpWebRequest.Create(sharepointUrl.ToString() + "_api/lists/getByTitle('" + "LIST NAME" + "')/items");
listRequest.Method = "GET";
listRequest.Accept = "application/atom+xml";
listRequest.ContentType = "application/atom+xml;type=entry";
listRequest.Credentials = cred;

// Execute the request and load the Atom response into an XmlDocument.
HttpWebResponse listResponse = (HttpWebResponse)listRequest.GetResponse();
StreamReader listReader = new StreamReader(listResponse.GetResponseStream());
XmlDocument listXml = new XmlDocument();
listXml.LoadXml(listReader.ReadToEnd());

// Example: print the Title field of each list item using the registered namespaces.
foreach (XmlNode entry in listXml.SelectNodes("//atom:entry", xmlnspm))
{
    XmlNode title = entry.SelectSingleNode(".//d:Title", xmlnspm);
    if (title != null)
    {
        Console.WriteLine(title.InnerText);
    }
}
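
As an aside, once you have the response, the Atom feed can be parsed outside C# too. A Python sketch using the same three namespaces; the d:Title field and the feed shape here are assumptions about how your particular list data comes back:

```python
import xml.etree.ElementTree as ET

# The same namespaces the C# code registers above.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

def list_item_titles(atom_xml):
    """Return the d:Title value of each atom:entry in a list response."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", NS):
        # The field lives under content/m:properties in each entry.
        title = entry.find(".//d:Title", NS)
        if title is not None and title.text:
            titles.append(title.text)
    return titles
```

The same XPath-style lookups apply whichever language you parse it in: iterate the atom:entry elements, then pull the d: fields you need.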