AWS SageMaker

I have been playing around with AWS SageMaker a bit more recently. This is Amazon’s managed machine learning service that allows you to build and run machine learning models in the AWS public cloud. The nice thing about it is that you can productionize a machine learning solution very quickly, because the operational aspects – namely hosting the model and scaling an endpoint to allow inferences against it – are handled for you. So-called ‘MLOps’ has almost become a field of its own, so abstracting all this complexity away and focusing on the core of the problem you are trying to solve is very beneficial. Of course, like everything else in the public cloud, this comes at a monetary cost, but it is well worth it if you don’t have specialists in this area, or just want to do a fast proof of concept.

I will discuss here the basic flow of creating a model in SageMaker – of course, some of these steps are general and would be part of any machine learning project. The first piece of setup is to head over to AWS and create a new Jupyter Notebook instance in SageMaker; this is where the logic for training the model and deploying the ML endpoint will reside.

Assuming you have identified the problem you are trying to solve, you will need to identify the dataset you will use for training and evaluating the model. You will want to read the AWS documentation for the algorithm you choose, as it will likely require the data to be in a specific format for the training process. I have found that many of the built-in algorithms in SageMaker require data in different formats, which has been a bit frustrating. I recommend looking at the AWS SageMaker examples repository, as it has detailed, walk-through examples for all the available algorithms, many of which solve real-world problems.

Once you have the dataset gathered and in the correct format, and you have identified the algorithm you want to use, the next step is to kick off a training job. Your data will most likely be stored on AWS S3, and as usual you would split it into training data and data you will use later for model evaluation. Make sure that the S3 bucket where you store your data is located in the same AWS region as your Jupyter Notebook instance, or you may run into issues. SageMaker makes it very easy to kick off a training job, so let’s walk through an example, starting with getting the data into S3.
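A minimal sketch of uploading the prepared train/test splits using the SageMaker Python SDK’s session helper – the local file names and key prefixes below are placeholders:

import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()  # or your own bucket, in the same region as the notebook

# Upload the prepared train/test CSV files (file names here are placeholders)
train_s3_uri = session.upload_data("train.csv", bucket=bucket, key_prefix="rcf/train")
test_s3_uri = session.upload_data("test.csv", bucket=bucket, key_prefix="rcf/test")
print(train_s3_uri)  # e.g. s3://<bucket-name>/rcf/train/train.csv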

Here, I’m setting up a new training job for some experiments I was doing around anomaly detection using the Random Cut Forest (RCF) algorithm provided by AWS SageMaker. This is an unsupervised algorithm for detecting anomalous data points within a dataset.


[Screenshot SageMaker1: setting up the RCF training job in the notebook]
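A rough sketch of what that setup might look like with the SageMaker Python SDK – this assumes the v1-style parameter names, and the bucket, prefix, hyperparameter values, and dummy dataset below are illustrative placeholders:

import numpy as np
import sagemaker
from sagemaker import RandomCutForest

bucket = "my-sagemaker-bucket"    # placeholder, in the same region as the notebook
prefix = "rcf-anomaly-detection"  # placeholder key prefix
execution_role = sagemaker.get_execution_role()

# Configure the training job: instance type and count, input/output
# locations on S3, and the RCF-specific hyperparameters
rcf = RandomCutForest(
    role=execution_role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    data_location="s3://{}/{}/".format(bucket, prefix),
    output_path="s3://{}/{}/output".format(bucket, prefix),
    num_samples_per_tree=512,
    num_trees=50,
)

# Dummy single-feature dataset purely for illustration; record_set()
# converts it to the RecordIO-protobuf format the algorithm expects
training_data = np.random.rand(1000, 1).astype("float32")
rcf.fit(rcf.record_set(training_data))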

Above we are specifying things like the EC2 instance type we want the training to execute on, the number of EC2 instances, and the input and output locations of our data. The final parameters, the number of samples per tree and the number of trees, are specific to the RCF algorithm. These are known as hyperparameters. Each algorithm has its own hyperparameters that can be tuned – for example, see here for the list available when using RCF. When the above is executed, the training process starts and you will see some output in the console. Note that you will be charged for the model training time; once the job completes, you will see the number of seconds you have been billed for.

At this point you have a model, but now you want to productionize it and allow inferences to be run against it. Of course, it is not as easy as train and deploy – I am completely ignoring testing/validation of the model and tuning based on that, as here I just want to show how effective SageMaker is at abstracting away the operational aspects of deploying a model. With SageMaker, you can deploy an endpoint, which is essentially your model hosted on a server behind an API that allows queries to be run against it, with a prediction returned to the requester. The endpoint can be spun up in a few lines of code:


[Screenshot SageMaker2: deploying the model to an endpoint]
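A minimal sketch, continuing from the estimator above – the instance type and count here are illustrative choices:

# Deploy the trained model behind a real-time HTTPS endpoint
rcf_inference = rcf.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
)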

Once you get confirmation that the endpoint is deployed – this will generally take a few minutes – you can use the predict function to run some inference, for example:


[Screenshot SageMaker3: running inference against the endpoint]
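Something along these lines should work, again assuming the v1-style SDK serializer helpers and continuing from the predictor returned by deploy():

import numpy as np
from sagemaker.predictor import csv_serializer, json_deserializer

# Tell the predictor how to serialize requests and parse responses
rcf_inference.content_type = "text/csv"
rcf_inference.serializer = csv_serializer
rcf_inference.deserializer = json_deserializer

# Score a few new data points - higher scores suggest likely anomalies
new_data = np.random.rand(5, 1).astype("float32")
results = rcf_inference.predict(new_data)
print(results)  # e.g. {'scores': [{'score': 0.71}, ...]}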

Once you are done playing around with your model and endpoint, don’t forget to stop your Jupyter Notebook instance (you don’t need to delete it) and to destroy any endpoints that you have created, or you will continue to be charged.
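A quick sketch of tearing down the endpoint from within the notebook using the session helper:

import sagemaker

# Delete the hosted endpoint so it no longer incurs charges
sagemaker.Session().delete_endpoint(rcf_inference.endpoint)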

Conclusions

AWS SageMaker is powerful in that it puts the ability to create machine learning models and set up endpoints to serve requests against them in anybody’s hands. It is still a complex beast that requires knowledge of the machine learning process in order for you to be successful. However, in terms of being able to train a model quickly and put it into production, it is a very cool offering from AWS. You also get benefits like autoscaling of your endpoints should you need to scale up to meet demand. There is a lot to learn about SageMaker, and I’m barely scratching the surface here, but if you are interested in ML I highly recommend you take a look.

Using a Doomsday Clock to Track Technical Debt Risk

Every software team has technical debt, and those who say they don’t are lying. Even for new software, there are always items in the backlog that need attention, be they architecture trade-offs or areas of the code which are not as easy to maintain as they should be. Unless you have unlimited time and resources to deliver a project, which in reality never happens, you will always have items such as these in the backlog that need to be addressed, on top of the new features that need to be implemented. Mostly, but not always, technical debt items are deprioritized in favor of new features that generate visible outcomes, value for the customer, and revenue for the business. In my opinion, this is OK – technical debt in software projects is a fact of life, and as long as it is not recklessly introduced, and there is a plan to address it later, it is fine. It is worth looking at how technical debt gets introduced. In my experience, it is mostly down to time constraints, i.e. a delivery deadline that means trade-offs must be made. Martin Fowler introduced us to the Technical Debt Quadrant, which is a nice way of looking at how technical debt gets introduced. You would hope that you never end up anywhere in the top left.


[Image: Martin Fowler’s Technical Debt Quadrant]

There are a few different ways of tracking technical debt, such as keeping items as labeled stories in your JIRA backlog, or using a separate technical debt register. The most important thing is that you actually track these items – and wherever you track them, it is critical to continuously review and prioritize them. It is also key that you address items as you iterate on new releases of your software. When you do not address technical debt and use all your team’s work cycles to add new features (and also likely new technical debt), you will come to a tipping point. You will find it takes ever longer to add new features, or worse, some technical debt items may begin to impact your production software – think of that performance trade-off you made a few years ago when you were sure the workload would never reach this scale; now it has, and customers are being impacted. So, neglecting technical debt items that have the potential to be very impactful to your customer base is not a good idea, and these are the types of items I will discuss here.

Recently I was reading about the Doomsday Clock. If you are not familiar with it:

Founded in 1945 by University of Chicago scientists who had helped develop the first atomic weapons in the Manhattan Project, the Bulletin of the Atomic Scientists created the Doomsday Clock two years later, using the imagery of apocalypse (midnight) and the contemporary idiom of nuclear explosion (countdown to zero) to convey threats to humanity and the planet. The decision to move (or to leave in place) the minute hand of the Doomsday Clock is made every year by the Bulletin’s Science and Security Board in consultation with its Board of Sponsors, which includes 13 Nobel laureates. The Clock has become a universally recognized indicator of the world’s vulnerability to catastrophe from nuclear weapons, climate change, and disruptive technologies in other domains.

So I thought, why not take this model and use it to track not existential risks to humanity, but technical debt items that pose a known catastrophic risk to a software product or service, be it a complex desktop or mobile application or a large cloud service. The items I am considering here are not small issues such as ‘I made a change and ignored the two failing unit tests‘. While those issues are still important, the items I am thinking about here are things that would cause a catastrophic failure of your software in a production environment should a certain condition or set of conditions arise. Let’s take two examples.

For the first example, let us consider a popular desktop application that relies on a third-party library to operate successfully. From inception, the application has used the free version of the library, and there has always been an item in the backlog to migrate to the enterprise version to ensure long-term support. Now there is a hard end-of-support date six months from now, and you need to migrate to the enterprise version before that date to continue receiving the regular security patches, or you risk exposing your customer base.

For the second example, consider a popular cloud service. The service uses a particular relational database that is key to the operation of the service and the customer value it provides. For some time, the scaling limits of this database have been known, and due to growth and expansion into international markets, those limits are closer than ever.

The main thing here is that I am talking about known technical debt items that will cause catastrophe at some point in the future. It is important here to draw the distinction between those and unknown items to which teams will always need to be reactive.

The method I had in mind for tracking such items, taking the Doomsday Clock analogy, was as follows:

  1. You take your top X (in order of priority) technical debt items – big-hitting items like those described above; you might have 5, you might even have 10.
  2. The doomsday clock starts at the same number of minutes from midnight as you have items – e.g. if you have 5 items you start at 11:55pm.
  3. Each time one of these items causes a real issue, or an issue is deemed imminent, move the time 1 minute closer to midnight. Moving the clock closer to midnight should be decided by your most senior engineers and architects.
  4. The closer you get to midnight, the more danger you are in of having these items affect your customer base or revenue.

Reaching midnight manifests in a catastrophic production issue, unhappy customers, and potential loss of revenue. In my opinion, executives, especially those without an engineering background, can easily grasp the severity of a situation if you explain it using the above method. This method also keeps Product Management, or whoever decides your road-map, focused on the key items that need to be addressed – it is easy to address the small items, like fixing that unit test, and think you are addressing technical debt, but in reality you are just fooling yourself and your team.
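To make the bookkeeping concrete, here is a toy sketch in Python – the class name, item names, and advance() method are made up purely for illustration:

from dataclasses import dataclass, field

@dataclass
class DebtDoomsdayClock:
    """Toy model of the technical debt doomsday clock described above."""
    items: list                      # top-priority, big-hitting debt items
    minutes_to_midnight: int = field(init=False)

    def __post_init__(self):
        # Start as many minutes from midnight as there are tracked items
        self.minutes_to_midnight = len(self.items)

    def advance(self, reason: str) -> None:
        # Called when an item causes a real issue, or an issue is deemed
        # imminent; the decision should sit with senior engineers/architects
        self.minutes_to_midnight = max(0, self.minutes_to_midnight - 1)
        print("{}: now {} minute(s) to midnight".format(reason, self.minutes_to_midnight))

clock = DebtDoomsdayClock([
    "third-party library end of support",
    "relational database scaling limit",
])
clock.advance("database hit 90% of its connection limit at peak")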

How does your team track these items?

Aside

If you are an Iron Maiden fan, their song ‘2 Minutes to Midnight’ is a reference to the Doomsday Clock being set to 2 minutes to midnight in 1953, the closest it had been at that time, after the US and Soviet Union tested H-bombs within nine months of one another.

Wanderlust in Lockdown

I should be travelling to Düsseldorf in a few weeks, but in the current Covid-19 infected climate that’s not going to happen. This trip was to celebrate my birthday, which now I’ll be celebrating in lockdown at home instead. Things have changed so fast, and it really makes you appreciate things that you previously wouldn’t have thought twice about – small things like going out to dinner, and slightly bigger things like exploring Germany for a week for your birthday.

I’ve been fascinated with Germany since my first visit, and if I wasn’t Irish, I think my next most preferable nationality would be German, followed by Dutch. Düsseldorf was supposed to be our base for a week in Germany in April, and we had a few more cities on the list we were also going to visit, namely Mönchengladbach, Duisburg, Essen, and maybe Dortmund. We have traveled to Düsseldorf previously, in 2016, and loved it so much we always said we would go back. This year was supposed to be our chance, but it is not to be, at least in the first half of the year.

The city of Düsseldorf is compact, and in some ways reminds me of Cork, easily explorable for the most part on foot. I loved walking around the Altstadt (old town) on our first visit, sampling the local beers and food, and taking in the atmosphere, especially on Wednesday, when it’s traditional to go for drinks after work. Düsseldorf’s main city park, the Hofgarten, is not to be missed, and neither is the Rheinpromenade, the promenade which runs along the River Rhine in the old town and is lined with bars and restaurants. From the Rheinpromenade you will find it hard to miss the Rheinturm (Rhine Tower); this huge structure is visible from almost anywhere on the promenade. At the top, there is an observation deck and a revolving restaurant which serves excellent food (expect to pay for it though), with superb views of the city and beyond. Düsseldorf is also the home of one of my favorite varieties of German beer – ‘Altbier’, named for the old style of brewing used in the production process. Altbier is native to this part of Germany, and I highly recommend a few pints of ‘Frankenheim Alt’ if you find yourself in the region. I’ve been buying this beer online frequently since I first visited Düsseldorf, but there’s nothing like the taste of it while actually there. Hopefully we ‘flatten the curve’ enough for me to make a trip this year! For now I’ll need to satisfy myself with this picture from my 2016 trip.



We had also planned to go back to Holland and visit Amsterdam (again!), Utrecht and Maastricht in May, but that’s also not likely now. At the moment I’m satisfying my wanderlust by following lots of European travel Instagrammers, and using Google Maps and Earth to plan our next trips.

On the plus side of being in lockdown for a few weeks, I’ve pulled out the electric guitar again and found a suitable online course. Hopefully this time I have the patience to stick with it. I’m also studying for my ‘Amazon Web Services – Certified Solutions Architect’ exam, and playing around with Amazon’s managed machine learning service, AWS SageMaker (expect a post on this later). I’ve also ordered a Dutch language audio course, so I can try speaking it on my next trip to Holland. Maybe a bonus of the lockdown will be brushing up on old skills, or learning new ones. We’re staying positive anyway and viewing this as an opportunity to learn.