
.NET app resiliency with Polly

Intro

One aspect of application development that is often overlooked, especially by beginner developers, is application resilience.
Tutorials tend to focus on the happy path of execution, omitting the details of potential errors that can occur.

Example

Consider the following, somewhat simplified, example:

[HttpPost]
public async Task<ActionResult<ResponseModel>> CreateOrderAsync(OrderModel orderModel)
{
    var cart = await _cartService.GetCartItemsAsync(UserId);
    if (cart.Items.Count == 0)
    {
        return new ResponseModel
        {
            // ... omitted for brevity ...
        };
    }

    var orderEntries = cart.Items.Select(c => c.ToDbModel(UserId)).ToList();
    var order = new Order
    {
        UserId = UserId,
        DatePlaced = DateTime.UtcNow,
        Entries = orderEntries,
        CartIdempotencyToken = cart.IdempotencyToken
    };

    _context.Orders.Add(order);
    await _context.SaveChangesAsync();
    // The user should no longer have the items in their cart after they've placed an order
    await _cartService.EmptyAsync(UserId, cart.IdempotencyToken);
    return new ResponseModel
    {
        // ... omitted for brevity ...
    };
}

Apart from the absence of some obvious error handling (what happens if the user’s cart can’t be found?), the code looks decent enough at first glance. It retrieves the entities from the CartService, maps them to database entities and stores them as part of an Order entity.

I’ve tested it, it works!

Sure enough, the code is algorithmically correct – it does exactly what you’ve asked it to do. You have tested it with various inputs and concluded that no matter what data you give it, the processing will be done correctly. So what’s the problem?

Async – state, trapped in time

The request/response model and async/await make the code look linear. It’s pretty obvious where the data is coming from and where it goes. But if we are not mindful of the nature of asynchronous processing, it’s easy to miss a very important detail – asynchronous processing is stretched out in time and usually involves third-party resources that can potentially fail.

This service is not alone in the world – in this case it interacts with the CartService (which may make calls to a microservice over the network) and the database. It becomes pretty obvious that the author of this code example was focused on an ideal scenario, in which both of them are always available and never return errors. The reality, however, is a lot more complicated: there may be network problems, the service may become unreachable, or it may straight up refuse to process the request correctly due to issues of its own (for example, upstream service connectivity problems).

Although the result of the happy path is correct, we haven’t even thought about a plethora of potential issues:

  1. What is the time budget for this endpoint? Could it be that after a certain period it’s better to simply give up on processing the request and return an error telling the client to try again later (a timeout)?
  2. What happens if the cart data request fails? Is it safe for us to retry it? How many times? What retry intervals are safe to use without overwhelming the upstream service?
  3. What if the database store operation fails? What kind of response should the user get? Can we retry it too (for example, if it failed due to a network problem)?
  4. What if the cart clearing operation fails? Is clearing it essential, or can we keep it in the worst case? Can we retry it?

OK, it’s complicated – is there a better way?

Sure is! Polly comes to the rescue!
Polly is a resilience and transient-fault-handling library that lets us easily express policies that help deal with various issues.
With Polly, it becomes very easy to describe retries, timeouts, caching, and many other policies or combinations of them.

Building and using policies

One thing you should decide right away is whether your policy is going to be synchronous or asynchronous, because depending on your choice of policy builder method you will get back either a Policy or an AsyncPolicy instance, and mixing the two can be quite challenging.
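
To make the distinction concrete, here is a minimal sketch of the two builder variants, using a timeout policy as the example:

// Synchronous variant – returns a Policy instance, executed with .Execute(...)
Policy syncTimeout = Policy.Timeout(3);

// Asynchronous variant – returns an AsyncPolicy instance, executed with .ExecuteAsync(...)
AsyncPolicy asyncTimeout = Policy.TimeoutAsync(3);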

Usually, though, you’ll be making asynchronous calls through your policies, so let’s use that as an example.

// Let's build our simple timeout policy
// This policy will timeout after 3 seconds
var timeoutPolicy = Policy.TimeoutAsync(3);

// Note that this also supports optimistic cancellation
var res = await timeoutPolicy.ExecuteAsync(ct => TestAsync(ct), CancellationToken.None);

OK, so what is going on in this example? We build a policy that specifies a timeout rule, and on the next line we use that policy to call an asynchronous method named TestAsync(...). This method supports optimistic cancellation (we explicitly notify it when it’s time to stop through the CancellationToken), and we are making use of that. AsyncPolicy.ExecuteAsync has an overload that gives us access to an internal CancellationToken of the policy, and we can pass that to our method to achieve the desired result. However, notice how I’ve passed CancellationToken.None as the second parameter? That’s right – if you wish, Polly also lets you pass your own CancellationToken, which will be linked to the internal one to terminate the execution even sooner. Pretty awesome!
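
For completeness, here is a rough sketch of what TestAsync might look like when it honors the token. The body is an assumption made for illustration; only the signature is implied by the example above.

// A hypothetical implementation that supports optimistic cancellation:
// it checks the token itself instead of relying on being forcefully aborted
private static async Task<string> TestAsync(CancellationToken cancellationToken)
{
    for (var i = 0; i < 10; i++)
    {
        // Throws OperationCanceledException as soon as the policy's
        // (or the caller's) token is cancelled
        cancellationToken.ThrowIfCancellationRequested();
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }

    return "Done";
}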

Basic retries

As discussed earlier, Polly supports a lot of things out of the box, but for now let’s focus on the most basic example – retries with exponential backoff.
From the official Polly wiki:

// Retry a specified number of times, using a function to 
// calculate the duration to wait between retries based on 
// the current retry attempt (allows for exponential backoff)
// In this case will wait for
//  2 ^ 1 = 2 seconds then
//  2 ^ 2 = 4 seconds then
//  2 ^ 3 = 8 seconds then
//  2 ^ 4 = 16 seconds then
//  2 ^ 5 = 32 seconds
Policy
  .Handle<SomeExceptionType>()
  .WaitAndRetryAsync(5, retryAttempt => 
    TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) 
  );

In this example, we can see a policy that will retry your code at most 5 times, increasing the delay between calls each time. This is very useful when you don’t want to overwhelm the upstream servers with retries. The exponential backoff mechanism allows your system to balance out and find a suitable rate of calls to upstream servers, even if they are experiencing temporary problems or load spikes. Do note that it is beneficial to introduce some randomness (jitter) into the retry policy to avoid all of the retries happening at the same time. This also partially reduces the possibility of your service causing a denial of service for the upstream server. More on that in the next chapter.
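
A minimal way to add that jitter could look like this – the jitterer instance and the 0–1000 ms range are illustrative choices, not part of the official snippet above:

// Shared Random instance used to spread the retries out in time
var jitterer = new Random();

var retryWithJitter = Policy
  .Handle<SomeExceptionType>()
  .WaitAndRetryAsync(5, retryAttempt =>
    // Exponential backoff plus up to one second of random jitter,
    // so that concurrent callers don't all retry at the exact same moment
    TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
      + TimeSpan.FromMilliseconds(jitterer.Next(0, 1000))
  );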

Circuit breaker

Sometimes, when the rate of failures is too high, it’s a good idea to give the upstream servers some time to recover and partially degrade the functionality of your own application.

Imagine this scenario: we have a factory that makes car engines. They travel through various assembly steps on a conveyor belt and are then checked at the end to ensure quality. If the manufacturing yield is high enough (say, only 1 in 10,000 engines is defective), simply removing the defective part is a good enough solution. On the other hand, if the failure rate is above 30%, something is definitely wrong and it’s worth stopping the whole conveyor for an inspection.

With web services we can do exactly the same thing – if we see that the failure rate of our requests is too high, maybe it’s not worth making a request at all? Let’s give our upstream servers some time to deal with whatever issue they are having, while degrading our application a little bit.
This may not be suitable for all scenarios, but if the call provides non-critical functionality (say, recommendations for a purchased product in an e-shop), it is useful to temporarily disable that feature while showing the users a pop-up explaining that the service is experiencing some temporary high load.

The circuit breaker policy does exactly that – it allows us to temporarily stop making upstream calls when the failure rate is above a certain threshold, or when a certain number of consecutive exceptions of a handled type occur.

var policy = Policy
  .Handle<HttpRequestException>()
  .CircuitBreaker(
    exceptionsAllowedBeforeBreaking: 2, 
    durationOfBreak: TimeSpan.FromMinutes(1)
  );

In this example, if two consecutive calls through this policy throw exceptions of type HttpRequestException, the circuit will break and stay broken for one minute, meaning that any call made through this policy in that interval will throw a BrokenCircuitException. If the application developer handles this exception properly, the degradation can be managed gracefully, returning some kind of meaningful response to the client that describes what exactly has happened.
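
As a rough sketch of such graceful handling (the recommendations client, method name and response shape below are made up purely for illustration):

try
{
    // _recommendationsClient and GetRecommendations are hypothetical names –
    // any call routed through the circuit breaker policy would look similar
    var recommendations = policy.Execute(() => _recommendationsClient.GetRecommendations(productId));
    return Ok(recommendations);
}
catch (BrokenCircuitException)
{
    // The circuit is currently open – skip the feature instead of failing the whole request
    return Ok(new { Recommendations = Array.Empty<string>(), Degraded = true });
}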

Policy wrapping

I won’t be explaining all of the policy variants available, but by now I hope you have already seen how powerful these are. But wait, there is more! You can wrap one policy in another to achieve even more complex behavior.

Consider this example with two separate policies:

// Timeout policy,
// requests cancellation when the execution time exceeds a specified amount.
var timeoutPolicy = Policy.TimeoutAsync(3);

// Fallback policy
// If an exception of a specified type occurs during method execution,
// it will return a predefined result instead.
// Note: Polly's timeout policy reports an elapsed timeout by throwing
// TimeoutRejectedException, so that is the type we handle here.
var fallbackPolicy = Policy<string>
    .Handle<SomeException>()
    .Or<TimeoutRejectedException>()
    .FallbackAsync("Fallback result");

The first one simply requests cancellation of the method as discussed earlier, while the other one is a bit more interesting – if the method throws an exception of type SomeException, or the timeout policy signals a timeout via TimeoutRejectedException, it returns the predefined result "Fallback result" instead. But what if we could combine these two? Can we do that? Easy!

var combined = fallbackPolicy.WrapAsync(timeoutPolicy);
var result = await combined.ExecuteAsync(...);

And that’s it – now we have a policy that will either return the original result of the method, or a fallback result if the operation times out or throws SomeException. The order of the wraps affects the behavior, so pay close attention to it, because getting it wrong may give you unexpected results. In the example above, fallbackPolicy is the outer one (it operates on the results returned or exceptions thrown by timeoutPolicy), while timeoutPolicy operates on the result of the method passed to .ExecuteAsync(...).
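
To make the shapes concrete, a call through the combined policy could look like this – FetchGreetingAsync is a made-up method that returns Task<string> and honors the supplied CancellationToken:

// Returns the method's own result if it finishes within 3 seconds,
// otherwise "Fallback result" comes back instead
var result = await combined.ExecuteAsync(
    ct => FetchGreetingAsync(ct),
    CancellationToken.None);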

Integrations with HttpClient

With the Microsoft.Extensions.Http.Polly package installed, you can call the .AddPolicyHandler(...) method on your IHttpClientBuilder instances to handle some trivial cases, like responses with 5XX or 408 status codes, and retry with a chosen strategy, lifting this concern from the layers that use those HttpClients. The Polly.Extensions.Http package even provides that behavior out of the box with its HttpPolicyExtensions.HandleTransientHttpError().
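
A typical registration might look roughly like this – the "catalog" client name and the retry parameters are placeholders, not prescriptions:

services.AddHttpClient("catalog")
    .AddPolicyHandler(HttpPolicyExtensions
        // Handles HttpRequestException, 5XX and 408 responses
        .HandleTransientHttpError()
        .WaitAndRetryAsync(3, retryAttempt =>
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))));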

Some tools have built-in resilience mechanisms

Not only can you use Polly to get resilience – if you look closely at some of the tools you are already using, you might discover that they provide resilience mechanisms too. One notable example can be found in EF Core when using MS SQL Server.
Let’s take a look at the configuration:

// Startup.cs from any ASP.NET Core Web API
public class Startup
{
    // Other code ...
    public void ConfigureServices(IServiceCollection services)
    {
        // ...
        services.AddDbContext<CatalogContext>(options =>
        {
            options.UseSqlServer(Configuration["ConnectionString"],
                sqlServerOptionsAction: sqlOptions =>
                {
                    sqlOptions.EnableRetryOnFailure(
                        maxRetryCount: 10,
                        maxRetryDelay: TimeSpan.FromSeconds(30),
                        errorNumbersToAdd: null);
                });
        });
    }
    // ...
}

In this example, failed database operations will be retried at most 10 times, with a maximum delay of 30 seconds between retries. errorNumbersToAdd specifies additional SQL Server error numbers that should be treated as transient and handled by this retry strategy.
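
The same strategy can also be invoked explicitly, which EF Core requires when you combine retries with your own transactions. A rough sketch, reusing the _context and order names from the earlier example:

// Route the whole unit of work through the execution strategy,
// so the retry logic can replay it as a unit
var strategy = _context.Database.CreateExecutionStrategy();
await strategy.ExecuteAsync(async () =>
{
    using var transaction = await _context.Database.BeginTransactionAsync();
    _context.Orders.Add(order);
    await _context.SaveChangesAsync();
    await transaction.CommitAsync();
});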

Final thoughts

First and foremost – get to know your tools. Some of them already provide near effortless ways to improve your app stability.

Second – make sure to focus not only on the happy path of execution, but also to carefully plan failure and graceful-degradation strategies.

I hope that with this brief introduction to resilience policies I’ve persuaded you to go through your code and identify spots for improvement – places where potential errors are simply dismissed and not handled properly.

