Resiliency Patterns in Microservice Architecture

I had been planning to write an article on this topic for a while, and after my last Black Friday experience I decided to finally do it. The topic is the importance of resilience and fault tolerance in a microservice architecture, and how we can provide them.

The story

In my previous articles, I have always mentioned the advantages that a microservice architecture brings to a system. If you are reading this article, I guess you already have some experience with microservices. I have been working with the microservice architecture for the last 3 years. Yes, I am in that transformation world too.

In life, we already know that there is no such thing as perfect (is there?). Everything that looks perfect brings a new challenge. We know and accept most of the problems and responsibilities that come with it. Of course, sometimes we cannot foresee these problems.

That is how the microservice architecture turned out for me. We accepted some of the challenges that distributed systems bring, but we could not foresee others. Yes, microservices are naturally resilient to some of the faults that can occur. In a monolithic application, it is hard to ignore that a single error can affect the entire flow of the application. Simple examples of such errors are third-party APIs that do not respond, network splits, and ineffective use of infrastructure resources. For these reasons, we try to build our applications in small pieces so that the entire application flow is not affected by such errors. That way we can give these small pieces fault tolerance when an error occurs, which matters especially today, when a failure means lost money.

In summary, with the microservice approach, we have totally or partially prevented the entire system flow from being affected by a single error. (Of course, this is just one of the advantages.) I guess the key question here is: “What should we do to make our applications, which are built from smaller pieces, more resilient and fault tolerant against the kinds of errors that can occur?”

In this article, I will talk about some patterns and implementations that provide resilience in our applications, such as the circuit breaker, the retry mechanism and fallback operations, based on my experience in the microservice adventure at the company I work for.

Importance of the circuit breaker

We know that in the microservice world, applications usually work together with one another or with some remote service. So it is inevitable that we will face network splits, timeouts and transient errors. At this point, we need a circuit breaker to prevent the application from repeatedly trying to execute an operation that is likely to fail.
How?
If you look at the above diagram, the circuit breaker basically has 3 modes.
  • Closed: In this mode, the circuit is closed and all requests are executed.
  • Open: In this mode, the circuit is open and requests are rejected right away, which prevents the application from repeatedly trying to execute an operation that keeps failing.
  • Half-Open: In this mode, the circuit breaker lets a few operations through to identify whether the error still occurs. If errors occur, the circuit breaker opens again; if not, it closes.

With the terminology out of the way, let’s implement a sample circuit breaker.

First, create a class called “CircuitBreakerOptions”.

We will use this class to pass in the options. The “ExceptionThreshold” property specifies when the circuit breaker opens, and the “SuccessThresholdWhenCircuitBreakerHalfOpenStatus” property determines when it closes again. The “DurationOfBreak” property determines how long the circuit breaker remains in open mode.
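The original listing is not reproduced here, so as a rough illustration the options could look like the sketch below. It is written in Python rather than the sample project's .NET code; the snake_case names, the default values and the validation are assumptions made for this sketch, chosen only to mirror the three properties described above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CircuitBreakerOptions:
    """Settings for one circuit breaker (hypothetical Python counterpart)."""

    # Number of failures after which the breaker opens (assumed default).
    exception_threshold: int = 5
    # Successes required in half-open mode before the breaker closes again.
    success_threshold_when_half_open: int = 5
    # How long the breaker stays open before a trial call is allowed.
    duration_of_break: timedelta = timedelta(minutes=5)

    def __post_init__(self) -> None:
        # Reject settings that would make the breaker meaningless.
        if self.exception_threshold < 1 or self.success_threshold_when_half_open < 1:
            raise ValueError("thresholds must be at least 1")
        if self.duration_of_break <= timedelta(0):
            raise ValueError("duration_of_break must be positive")
```

Making the options immutable is a design choice of the sketch: the same options object can then be shared between calls without one caller changing the thresholds for another.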

During the lifecycle of the application, the properties of the “CircuitBreakerOptions” class define when the circuit breaker opens. To do this, we have to store the errors that occur in the application and compare their count against the “ExceptionThreshold” value.

Let’s define the “CircuitBreakerStateEnum” enum and the “CircuitBreakerStateModel” class as below.

We defined the states of the circuit breaker in the enum. We will use the “CircuitBreakerStateModel” class to store events that occur in the application, such as exceptions and successes.
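As a minimal sketch of what such an enum and model might hold (in Python, not the project's .NET code), consider the following. The field names are assumptions made for illustration; the text above only says that the model records exceptions and successes.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class CircuitBreakerState(Enum):
    """The three modes of the circuit breaker."""

    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreakerStateModel:
    """Per-function bookkeeping (all field names are hypothetical)."""

    state: CircuitBreakerState = CircuitBreakerState.CLOSED
    exception_attempt: int = 0  # failures counted so far
    success_attempt: int = 0    # successes counted while half-open
    last_exception: Optional[Exception] = None
    last_state_changed_utc: Optional[datetime] = None
```

A fresh model starts in the closed state with both counters at zero, which matches the idea that all requests are executed until errors start to accumulate.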

Now let’s create the part that we will use to store the “CircuitBreakerStateModel”.

The only thing the “CircuitBreakerStateStore” class does is keep a “CircuitBreakerStateModel” per function in memory. We will use its other methods to update or delete the state of a function's operation.
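An in-memory, per-function store can be sketched roughly as follows. This is a Python illustration under assumptions: the class name, the method names and the string key per function are hypothetical, and the store is kept generic so the sketch stays self-contained.

```python
import threading
from typing import Callable, Dict, Generic, Optional, TypeVar

T = TypeVar("T")


class InMemoryStateStore(Generic[T]):
    """Keeps one state object per function key, in process memory only."""

    def __init__(self) -> None:
        self._states: Dict[str, T] = {}
        self._lock = threading.Lock()  # guards the dictionary across threads

    def get_or_add(self, key: str, factory: Callable[[], T]) -> T:
        """Return the state for key, creating it with factory on first use."""
        with self._lock:
            if key not in self._states:
                self._states[key] = factory()
            return self._states[key]

    def update(self, key: str, state: T) -> None:
        """Replace the stored state for key."""
        with self._lock:
            self._states[key] = state

    def remove(self, key: str) -> Optional[T]:
        """Delete and return the state for key, or None if there was none."""
        with self._lock:
            return self._states.pop(key, None)
```

Because the state lives in process memory, every instance of a service keeps its own view of each circuit; sharing it between instances would need an external store, which is outside this sketch.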

Now we can look at the circuit breaker itself. Let’s create a new class called “CircuitBreakerHelper” and implement it as below.

Everything happens in the “ExecuteAsync” method. First, we check whether the circuit breaker for the given function is open. If it is not, we invoke the function inside a try-catch block. If an error occurs, we catch it, increase the exception count in the “Trip” method and check it against the exception threshold. If the threshold is reached, the circuit breaker switches to the open state and the state-change date is updated on the model.

The second flow in “ExecuteAsync” handles an open circuit breaker: we check whether its break duration has expired. When it has, we do not close the circuit breaker right away. Instead, we take a lock and execute the operation once more on a single thread to see whether the errors are still occurring. In the “Reset” method, we check the count of successful operations and decide whether to close the circuit breaker.
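The two flows described above can be sketched as follows. This is a simplified Python illustration, not the project's “CircuitBreakerHelper”: the class, the constructor parameters, the injectable clock and the string state names are all assumptions. It also tracks a single operation, counts failures cumulatively while closed, and uses a plain flag where the text describes a lock, because asyncio runs coroutines on one thread.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, exception_threshold: int, success_threshold: int,
                 break_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._exception_threshold = exception_threshold
        self._success_threshold = success_threshold
        self._break_seconds = break_seconds
        self._clock = clock  # injectable so the break can be tested without sleeping
        self.state = "closed"  # "closed", "open" or "half_open"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0
        self._trial_in_progress = False

    async def execute(self, func: Callable[[], Awaitable[T]]) -> T:
        # First flow: the circuit is closed, so just run the operation.
        if self.state == "closed":
            try:
                return await func()
            except Exception:
                self._trip()
                raise

        # Second flow: fail fast while the break lasts or a trial is running.
        break_over = self._clock() - self._opened_at >= self._break_seconds
        if not break_over or self._trial_in_progress:
            raise CircuitBreakerOpenError("circuit is open; call rejected")

        # The break is over: let exactly one trial call through.
        self._trial_in_progress = True
        self.state = "half_open"
        try:
            result = await func()
        except Exception:
            self._trip()
            raise
        else:
            self._reset()
            return result
        finally:
            self._trial_in_progress = False

    def _trip(self) -> None:
        # Count the failure; open on a failed trial or when the threshold is reached.
        self._failures += 1
        if self.state == "half_open" or self._failures >= self._exception_threshold:
            self.state = "open"
            self._successes = 0
            self._opened_at = self._clock()

    def _reset(self) -> None:
        # Count the success; close once enough trial calls have succeeded.
        self._successes += 1
        if self._successes >= self._success_threshold:
            self.state = "closed"
            self._failures = 0
            self._successes = 0
```

With these hypothetical parameters, `CircuitBreaker(exception_threshold=5, success_threshold=5, break_seconds=300)` would reject calls for five minutes once five failures have been counted, and would close again only after five trial calls succeed.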

So, how do we use it?

In the above usage, when the exception count reaches the threshold of “5”, the circuit breaker stops executing the function for “5” minutes. This ensures that infrastructure resources are not used unnecessarily and protects the application from some cascading failures.

Well! Retry Mechanism?

In my opinion, retry operations are important, especially when we work with remote resources. In many cases, an operation that failed succeeds on the second or third attempt.

Especially in distributed systems, retry operations are one of the best options that we can use against transient faults.

So how can we implement it?

Create a class called “RetryMechanismOptions”.

We will use this class to pass parameters to the retry operations. The “RetryPolicies” enum defines the back-off scenarios; in this implementation, we will only implement the “Linear” policy. The “RetryCount” property determines how many times the operation is retried.
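A rough Python sketch of these options is shown below. Only “Linear”, “RetryCount” and “Interval” come from the article; the second enum member, the snake_case names and the default values are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class RetryPolicies(Enum):
    """Back-off scenarios."""

    LINEAR = "linear"
    # Hypothetical second member, listed only to show why an enum is used.
    EXPONENTIAL = "exponential"


@dataclass(frozen=True)
class RetryMechanismOptions:
    """Parameters for the retry operation (defaults are assumed values)."""

    retry_policy: RetryPolicies = RetryPolicies.LINEAR
    retry_count: int = 3                        # retries after the first attempt
    interval: timedelta = timedelta(seconds=5)  # wait between attempts

    def __post_init__(self) -> None:
        if self.retry_count < 0:
            raise ValueError("retry_count must not be negative")
        if self.interval < timedelta(0):
            raise ValueError("interval must not be negative")
```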

Now let’s create an abstract class called “RetryMechanismBase”.

We will perform the retry operation in the “ExecuteAsync” method, using the parameters from the “RetryMechanismOptions” class, and handle back-offs in the concrete classes. The “IsTransient” method decides whether an exception that occurs in the application is transient.

NOTE: If we want, we can let the user inject the transient exception types used by the “IsTransient” method.

Now we can implement a retry strategy. Let’s create a new class called “RetryLinearMechanismStrategy” and implement it as below.

In the “HandleBackOff” method, we delay the task by the “Interval” value set in the “RetryMechanismOptions” class.
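The base class and the linear strategy can be sketched together as follows. This is a Python illustration under assumptions, not the project's code: the constructor takes plain values instead of an options object, the transient exception types are injectable (as the note above suggests) with `ConnectionError` and `TimeoutError` as assumed defaults, and "linear" is read as waiting the same fixed interval before every retry.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class RetryMechanismBase(ABC):
    def __init__(self, retry_count: int, interval_seconds: float,
                 transient_errors: Tuple[Type[BaseException], ...] = (
                     ConnectionError, TimeoutError)) -> None:
        self.retry_count = retry_count
        self.interval_seconds = interval_seconds
        self.transient_errors = transient_errors

    async def execute(self, func: Callable[[], Awaitable[T]]) -> T:
        """Run func, retrying up to retry_count times on transient errors."""
        attempt = 0
        while True:
            try:
                return await func()
            except Exception as error:
                attempt += 1
                # Give up when retries are used up or the error is permanent.
                if attempt > self.retry_count or not self.is_transient(error):
                    raise
                await self.handle_back_off(attempt)

    def is_transient(self, error: BaseException) -> bool:
        return isinstance(error, self.transient_errors)

    @abstractmethod
    async def handle_back_off(self, attempt: int) -> None:
        """Wait before the next attempt; concrete strategies decide how long."""


class RetryLinearMechanismStrategy(RetryMechanismBase):
    async def handle_back_off(self, attempt: int) -> None:
        # Same fixed delay before every retry.
        await asyncio.sleep(self.interval_seconds)
```

With `retry_count=3` the operation runs at most four times in total: the first attempt plus three retries. Errors that are not in the transient list are raised immediately, so a permanent failure does not waste the retry budget.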

Now we need a wrapper class to make the retry operations simple to use. So let’s create a class called “RetryHelper”.

We are done. So, how can we use it?

In the usage above, if a web-based transient error occurs in the application, the operation is retried 3 times at 5-second intervals. Hence the request is not lost on the first failure.

What if things do not go as planned? Fallbacks!

I guess we can say that a fallback is a backup strategy. In my opinion, fallback strategies are very important when we design a microservice architecture.

Imagine that we are working on an e-commerce website. When an order is created on the website, the payment is processed through X bank’s API. Let’s assume that the payment could not be processed through X bank’s API. What happens now? The system has retried the operation and the payment still cannot be processed. In such situations, fallback strategies become important: instead of X bank’s API, maybe we can process the payment through another bank’s API.

To summarize, fallback operations are what we decide to do when the services we use are unavailable.

We have looked at the circuit breaker, the retry mechanism and fallback operations. So how can we use these patterns together?

In the method above, we use the retry operations first. If the operation still fails, we pass the function delegate to the circuit breaker. If an exception occurs in the circuit breaker as well, we call the fallback function.
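The order described here (retry, then circuit breaker, then fallback) can be sketched in Python as below. The function name and its parameters are hypothetical; `retry` and `circuit_breaker` stand for any async callables that take the operation and run it under their own policy, so the sketch does not depend on a particular implementation of either pattern.

```python
import asyncio
from typing import Any, Awaitable, Callable

Operation = Callable[[], Awaitable[Any]]
Executor = Callable[[Operation], Awaitable[Any]]


async def execute_with_resilience(func: Operation,
                                  retry: Executor,
                                  circuit_breaker: Executor,
                                  fallback: Operation) -> Any:
    """Try func with retries, then through the breaker, then fall back."""
    try:
        # 1. Retry the operation according to the retry policy.
        return await retry(func)
    except Exception:
        try:
            # 2. Retries are exhausted: hand the operation to the breaker,
            #    which records the failure or rejects the call when open.
            return await circuit_breaker(func)
        except Exception:
            # 3. The breaker failed too: use the backup strategy.
            return await fallback()
```

In the payment example, `func` would call X bank's API and `fallback` would call the other bank's API. Note that the sketch swallows the original exceptions once the fallback succeeds; logging them before falling back would be a sensible addition.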

For better illustration, I tried to draw a sequence diagram as follows.

As I mentioned at the beginning of this article, I had wanted to write it for a long time, and now I finally have. I hope it helps anyone who needs information about resilience in a microservice architecture.

I have tried to talk about the importance of resilience and fault tolerance when designing a microservice architecture, and also about how to implement them.

In conclusion, in addition to how we design our applications, designing how they stay resilient in unexpected situations is also very important.

Sample project: https://github.com/GokGokalp/Luffy

References

https://docs.microsoft.com/en-us/azure/architecture/patterns/retry
https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker

Published in .NET, Architectural, Microservices

8 Comments

  1. Met.

    An excellent article. Seeing an article in Turkish that goes beyond the beginner level is really encouraging, and the topics you choose are very good too. I hope you keep going.

  2. MxLabs

    I think this is one of the best articles in the microservice architecture category 🙂 Frankly, we ran into the same problem. We got past it completely with the Polly library. You can also use the Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback patterns with Polly.

    A nice article. Well done.

    • Hello, thank you for your kind comment. Yes, I also looked into Polly at one point. Let's see whether a time will come when these troubles of ours end completely. :))

  3. Kerim

    Great as always. Thanks for sharing.

  4. Sefer ALGAN

    I believe it would be even better if you added a short note at the start of each topic, such as learning outcomes and prerequisites.
    For example: to understand this topic better, you need to know x and y.

    • Thank you for your suggestion, Sefer. I will take it into account in my future articles.
