Tag Archives: Chaos Engineering

Introduction to Azure Chaos Studio

Some time ago, I investigated the concept of chaos engineering. The principle behind Chaos Engineering is a very simply one: since your software is likely to encounter hostile conditions in the wild, why not introduce those conditions while (and when) you can control them, and then deal with the fallout then, instead of at 3am on a Sunday.

At the time, I was trying to deal with an on-site issue where the connection seemed to be randomly dropping. In the end, I solved this by writing something similar to Polly – albeit a much simpler version.

Microsoft have recently released a preview of something called Chaos Studio. It’s very much in its infancy now, but what is there looks very interesting.

The product is essentially divided into two sections: targets and experiments. Targets represent the thing that you intend to wrought chaos upon, and experiments are how that chaos will be wrought.

Scope

For this test, I’m going to use a VM. That’s mainly because what you can do with this product is currently limited to VMs, AKS, and Redis.

Create a VM and Check Availability

The first step is to create a VM. To be honest, it doesn’t matter what the VM is, because all we’ll be doing is switching it off. Start by checking the availability – you should be able to do that in Logs – and you should notice 100% availability, unless something has gone catastrophically wrong with your deployment.

Targets

The next step is to configure our target. In chaos studio, select Targets and pick the new VM:

Not that you’ve enabled the targets, you’ll need to grant permission to the chaos studio for the VMs. Inside the VM blade, select Access Control:

If you don’t grant this access, you’ll get a permissions error when you run the experiment. The next step is to create the experiment. In Chaos Studio, select Experiments and then Create:

This will bring up a screen similar to the following:

Let’s discuss a little the concepts here: we have step, branch, and fault. A step is a sequential action that you will execute, whilst a branch is a parallel action; that is, actions in different branches can happen at the same time. A fault is what you actually do – so the fault is the chaos! Let’s add a fault:

This asks me two things, what do I want the fault to happen on (you can only select targets that have previously been created) and what do I want the fault to be. In my case, I’ve created a two step process that turns the machine off, waits a minute, then turns it off again:

Now that the experiment is created, you can start it. You get a warning at this point that basically says “it’s your foot, and you’re currently pointing a high powered rifle at it!”:

If you now run this, and it’s worth bearing in mind that there’s no simulation here – if you do this on production infrastructure it will shut it down for you, then you’ll see the update of it running:

You can drill down into the details to see exactly what it’s doing, what stage, etc.:

The experiment kills the machine for 1 minute, then waits for a minute, then kills it again. If you have a look at the availability graph, you should be able to see that:

Summary

So far, I’m pretty impressed with this tool. When they’ve finished (and by that, I mean, they’ve given the ability to create your own chaos, and have expanded the targets to cover the entire Azure ecosystem), it’s going to be a really interesting testing tool.

References

Azure Friday Introduction to Chaos Studio

Chaos Monkey – Part 4 – Creating an Asp.Net 6 Application that Caches an Error

This is a really strange post, but it’s a line up for a different post; however, I felt it made sense to be a post in its own right – it follows on from a trend I have of creating things that break on purpose. For example, here’s a post from a few years ago where I discussed how you might force a machine to run out of memory.

In this case, I’m creating a simple application that runs fine, but at a random point, it generates an error, which it caches, and then is broken until the application is restarted.

Why?

I’m working on some alerting and resilience experiments at the minute, and having an unstable application is useful for those tests. Also, this is not an unusual scenario – I mean, obviously, writing an application that purposes crashes after it’s broken, and from then on, is unusual; but having an application that does this somewhere in your estate may not be so unusual.

How

I’ve set-up a bog standard Asp.Net MVC 6 application. I then installed the following package:

Install-Package System.Runtime.Caching

Finally, I changed the default Privacy controller action to potentially crash:

public IActionResult Privacy()
{
    string result = Crash();
    return View(model: result);
}

Here, I’m feeding a string into the privacy view as its model. The Crash method has a 1 in 10 chance of caching an error:

        private string Crash()
        {
            if (!_memoryCache.TryGetValue("Error", out string errorCache))
            {
                if (_random.Next(10) == 1)
                {
                    _memoryCache.Set("Error", "Now broken!");
                    return "Now broken";
                }
            }
            else
            {
                throw new Exception("Some exception");
            }

            return "Working fine";
        }

I then just display the model in the view (privacy.cshtml):

@model string
@{
    ViewData["Title"] = "Privacy Policy";
}
<h1>@ViewData["Title"]</h1>
<h1>@Model</h1>

<p>Use this page to detail your site's privacy policy.</p>

Now, if you run it, somewhere between 2 and 15 times, you’re likely to see it break, and need to restart to fix.