
Resilience Patterns in Go with failsafe-go

Published: 7. April 2026  •  go

Applications often depend on databases, internal APIs, queues, and third-party systems. These dependencies fail, slow down, or become overloaded. So an application needs ways to handle those failures gracefully, without crashing or causing a bad experience for users.

failsafe-go helps you implement resilience patterns in Go, such as retries, circuit breakers, fallbacks, timeouts, hedging, caching, rate limiting, bulkheads, and adaptive limiters. It provides a consistent API for defining these patterns and composing them together to build robust applications.

Install

go get github.com/failsafe-go/failsafe-go

failsafe-go uses the same pattern for all policies. The With function takes one or more policies and returns an executor.

executor := failsafe.With(fallbackPolicy, retryPolicy, breaker)
result, err := executor.Get(first.Fetch)

You call Get or Run on that executor to execute your code synchronously with the attached policies. Get is for functions that return a value and an error, and Run is for functions that only return an error.

Async methods like GetAsync and RunAsync are also available. They return an ExecutionResult, which contains a channel (Done()) to wait for completion plus methods to retrieve the result and error once the execution is done.

Additional methods include RunWithExecution and GetWithExecution, which pass a failsafe.Execution object into the function for more control and observability. For example, in a retry policy, the execution object can tell you how many attempts have been made so far.

These executor calls return a result and/or an error depending on which method you use. The error can come from the function itself, or it can be a wrapped policy error if the execution fails due to retries, timeouts, or other conditions. For example, the retry policy returns ErrExceeded when the maximum number of attempts is reached without success.

The WithContext method allows you to pass a context that can be used for cancellation and deadlines. This is especially important for policies like timeouts and hedging, which need to be able to cancel in-flight executions.

Retry

Retry handles transient failures by trying the same operation again. Retries should usually be paired with a backoff strategy that spaces out attempts over time, so they do not flood an already struggling dependency with traffic. A retry policy also should not retry every operation that fails; the most important part of designing one is deciding which errors are safe to retry and which should fail immediately.

Common use cases for retry are outbound HTTP calls, database connections, and message publishing. Be careful with non-idempotent operations: depending on the failure, the operation may have succeeded even though the response was lost, and retrying it would result in duplicate work. Common solutions are idempotency keys or another deduplication mechanism.
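To illustrate the idempotency-key idea (all names here are hypothetical, not part of failsafe-go), a server can remember the result for each key, so a retried request returns the stored result instead of repeating the work:

```go
package main

import (
	"fmt"
	"sync"
)

// dedupStore remembers the result of each idempotency key so that a
// retried request returns the stored result instead of redoing the work.
type dedupStore struct {
	mu      sync.Mutex
	results map[string]string
}

func newDedupStore() *dedupStore {
	return &dedupStore{results: make(map[string]string)}
}

// Do runs fn only on the first call for a given key; retries with the
// same key get the recorded result, so duplicates have no side effects.
func (s *dedupStore) Do(key string, fn func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.results[key]; ok {
		return r
	}
	r := fn()
	s.results[key] = r
	return r
}

func main() {
	store := newDedupStore()
	calls := 0
	charge := func() string {
		calls++
		return fmt.Sprintf("charge-id-%d", calls)
	}
	// The client retries with the same idempotency key after a lost response.
	first := store.Do("order-42", charge)
	second := store.Do("order-42", charge)
	fmt.Println(first, second, calls) // same result, one real charge
}
```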

Retries can improve success rates, but they also increase latency and may add load during an outage.

[Flow: call dependency → success? — yes: return result; no: attempts left? — yes: wait with backoff and retry; no: return error]

The following example retries any non-nil error, aborts on known non-retryable failures, and emits retry lifecycle events.

  retryPolicy := retrypolicy.NewBuilder[rolloutPlan]().
    HandleIf(func(_ rolloutPlan, err error) bool {
      return err != nil
    }).
    AbortOnErrors(errInvalidConfig, circuitbreaker.ErrOpen).
    WithMaxAttempts(3).
    WithBackoff(20*time.Millisecond, 80*time.Millisecond).
    WithJitter(5 * time.Millisecond).
    OnRetryScheduled(func(event failsafe.ExecutionScheduledEvent[rolloutPlan]) {
      fmt.Printf("  retry scheduled: attempt=%d delay=%s\n", event.Attempts()+1, event.Delay)
    }).
    OnRetry(func(event failsafe.ExecutionEvent[rolloutPlan]) {
      fmt.Printf("  retrying after: %v\n", event.LastError())
    }).
    OnAbort(func(event failsafe.ExecutionEvent[rolloutPlan]) {
      fmt.Printf("  retry aborted on: %v\n", event.LastError())
    }).
    Build()

retry_circuit_fallback.go


Circuit Breaker

Circuit breaking stops requests from repeatedly hitting a dependency that is already failing. Unlike a retry, which always calls the dependency, a circuit breaker can stop calls to the dependency when it detects a problem. This can help the dependency to recover.

The breaker usually has three states: closed, open, and half-open. In the closed state, calls are allowed and failures are counted. When the failure threshold is reached, the breaker moves to open and rejects calls immediately. After a delay, it moves to half-open, allows a probe call, and then closes again on success or reopens on failure.

A circuit breaker can be useful in scenarios where a dependency is hard down or times out under load. Breakers reduce load during incidents, but choosing the right thresholds is important to avoid opening too early or too late.

[States: Closed → (failures exceed threshold) → Open → (delay expires) → HalfOpen → (probe succeeds) → Closed, or (probe fails) → Open]

The following example opens after two handled failures, transitions to half-open after a short delay, and logs each state change.

  breaker := circuitbreaker.NewBuilder[rolloutPlan]().
    HandleErrors(errUpstreamUnavailable).
    WithFailureThreshold(2).
    WithSuccessThreshold(1).
    WithDelay(120 * time.Millisecond).
    OnStateChanged(func(event circuitbreaker.StateChangedEvent) {
      fmt.Printf("  breaker state: %s -> %s\n", event.OldState, event.NewState)
    }).
    OnOpen(func(event circuitbreaker.StateChangedEvent) {
      fmt.Printf("  breaker opened after %d upstream failures\n", event.Metrics().Failures())
    }).
    OnHalfOpen(func(event circuitbreaker.StateChangedEvent) {
      fmt.Println("  breaker is probing the planner again")
    }).
    OnClose(func(event circuitbreaker.StateChangedEvent) {
      fmt.Println("  breaker closed after a healthy probe")
    }).
    Build()

retry_circuit_fallback.go


Fallback

The fallback policy can return a predefined result or error when the primary execution fails. For example, a fallback can return a cached value or a default response. When using a fallback, it is important to make sure that the caller can tell that this result came from a fallback and not the primary execution.

[Flow: primary call → success? — yes: return primary result; no: fallback handler → return degraded result]

The fallback policy most often makes sense as the outermost policy in a composition of multiple policies, so it can catch failures from retries, circuit breakers, timeouts, and other inner policies. This way the caller gets a gracefully degraded response instead of an error when something goes wrong. See the section below on policy composition for more on how to combine multiple policies together.

The following code snippet returns a degraded rollout plan and adjusts the message based on the last failure.

  fallbackPolicy := fallback.NewBuilderWithFunc(func(exec failsafe.Execution[rolloutPlan]) (rolloutPlan, error) {
    note := "served a cached rollout plan after retries were exhausted"
    if errors.Is(exec.LastError(), circuitbreaker.ErrOpen) {
      note = "served a cached rollout plan because the breaker is open"
    }
    return rolloutPlan{
      Service: "checkout-api",
      Region:  "us-east-1",
      Source:  "degraded-cache",
      Note:    note,
    }, nil
  }).
    HandleErrors(retrypolicy.ErrExceeded, circuitbreaker.ErrOpen).
    OnFallbackExecuted(func(event failsafe.ExecutionDoneEvent[rolloutPlan]) {
      fmt.Printf("  fallback served: %s\n", event.Result.Source)
    }).
    Build()

retry_circuit_fallback.go


Timeout

Timeout sets an upper bound for how long an execution is allowed to run. When setting a timeout, make sure that the value is based on real measurements in production. If it's too low, you will see a lot of timeouts that look like failures but are actually just normal latency.

Timeouts are essential for keeping a service responsive when dependencies are slow or unresponsive. They prevent requests from hanging indefinitely and allow the system to degrade gracefully.

[Flow: start call → finished before deadline? — yes: return result; no: cancel execution → return timeout error or fallback]

The following code snippet puts a hard 180 millisecond deadline around a replica probe and records when the deadline is exceeded.

  timeoutPolicy := timeout.NewBuilder[probeResult](180 * time.Millisecond).
    OnTimeoutExceeded(func(event failsafe.ExecutionDoneEvent[probeResult]) {
      fmt.Printf("  timeout fired after %s\n", event.ElapsedTime().Round(time.Millisecond))
    }).
    Build()

hedge_timeout_async.go

Hedge

Hedging starts a second equivalent request when the first request looks slow. This policy is only useful when you have multiple backend targets to call in parallel, such as replicas or caches. It first sends the request to one target, then launches a second request to another target if the first one does not finish within the configured delay. In the common case, the caller gets the first successful result, and the other request is canceled.

This pattern is a good fit for latency-sensitive reads. Avoid it for non-idempotent operations, because you end up doing duplicate work or can create side effects multiple times. Real-world examples include replicated search backends, read replicas, cache clusters, and geo-distributed lookup services.

[Flow: start primary read → slow? — no: primary returns; yes: launch hedge read → whichever finishes first → return fastest result]

The following example launches one additional read after 30 milliseconds, attaches a hedge budget, and logs when the hedge is created.

  hedgeBudget := budget.NewBuilder().WithMaxRate(1.0).WithMinConcurrency(0).Build()

  hedgePolicy := hedgepolicy.NewBuilderWithDelay[probeResult](30 * time.Millisecond).
    WithMaxHedges(1).
    WithBudget(hedgeBudget).
    OnHedge(func(event failsafe.ExecutionEvent[probeResult]) {
      fmt.Printf("  hedge launched: attempts=%d hedges=%d\n", event.Attempts(), event.Hedges())
    }).
    Build()

hedge_timeout_async.go



If the protected function needs to know whether it is running as the original request or as a hedge, use GetWithExecution or RunWithExecution and inspect the execution metadata. This is useful when you want the first attempt to prefer a primary replica and a hedge to go to a secondary replica, or when you want to tag hedge traffic separately in logs and metrics.

When you use GetWithExecution, the failsafe.Execution object passed into the function exposes hedge-specific methods such as IsHedge, which reports whether the current attempt is a hedge, and Hedges, which returns how many hedges have been started:

  note := "primary"
  if exec.IsHedge() {
    note = fmt.Sprintf("hedge-%d", exec.Hedges())
  }

shared.go

The important limitation is that the original attempt cannot know in advance whether a hedge will be launched later. It only knows that it is not itself a hedge. If you need to observe hedge creation centrally, use OnHedge on the policy builder.

Cache

Cache is a straightforward pattern where a result is stored and reused for subsequent calls with the same cache key. Operations that are expensive or have stable results are good candidates for caching. These operations either return the same result for the same input or change so infrequently that stale reads are acceptable.

The key to effective caching is choosing an appropriate amount of time that data can be cached. Some data is safe to cache forever because it never changes. Other data may be safe to cache for only a short time, such as a few seconds or minutes, because it changes infrequently.

[Flow: request → cache hit? — yes: return cached result; no: call dependency → store result → return fresh result]

The following example caches only non-error snapshots that contain assets, records hits and misses, and reuses the same cache key across calls.

Note that the cache policy from failsafe-go is not a cache implementation itself, but a policy that can be used with any cache that implements the cachepolicy.Cache interface. This design allows you to use your existing cache or choose one that fits your needs, while still benefiting from the policy features like cache key management, hit/miss listeners, and conditional caching.

The usage pattern is a bit different from the other policies, because we need to provide the cache key for each execution. This can be done through a context (ContextWithCacheKey), which allows the cache policy to look up the right value for each call.

  cache := newMemoryCache[configSnapshot]()
  controlPlaneCalls := 0

  cacheExecutor := failsafe.With(cachepolicy.NewBuilder[configSnapshot](cache).
    CacheIf(func(result configSnapshot, err error) bool {
      return err == nil && len(result.Assets) > 0
    }).
    OnCacheMiss(func(event failsafe.ExecutionEvent[configSnapshot]) {
      fmt.Printf("  cache miss on attempt %d\n", event.Attempts())
    }).
    OnResultCached(func(event failsafe.ExecutionEvent[configSnapshot]) {
      fmt.Printf("  cached snapshot from %s\n", event.LastResult().Source)
    }).
    OnCacheHit(func(event failsafe.ExecutionDoneEvent[configSnapshot]) {
      fmt.Printf("  cache hit for %s\n", event.Result.Service)
    }).
    Build())

  ctx := cachepolicy.ContextWithCacheKey(context.Background(), "snapshot:checkout-api")
  loader := func(exec failsafe.Execution[configSnapshot]) (configSnapshot, error) {
    controlPlaneCalls++
    return configSnapshot{
      Service: "checkout-api",
      Assets:  []string{"feature-flags", "routing-rules", "slo-budgets"},
      Source:  fmt.Sprintf("control-plane call %d", controlPlaneCalls),
    }, nil
  }

  first, err := cacheExecutor.WithContext(ctx).GetWithExecution(loader)
  if err != nil {
    fmt.Printf("  first snapshot error: %v\n", err)
  } else {
    fmt.Printf("  first snapshot source: %s\n", first.Source)
  }

  second, err := cacheExecutor.WithContext(ctx).GetWithExecution(loader)
  if err != nil {
    fmt.Printf("  second snapshot error: %v\n", err)
  } else {
    fmt.Printf("  second snapshot source: %s\n", second.Source)
  }

  fmt.Printf("  supplier was called %d time(s)\n", controlPlaneCalls)

cache_showcase.go


Rate Limiter

Rate limiting is a preventive control that limits the number of operations that can be executed over a time window. It is used to protect downstream systems. For example, if a system can only handle 100 requests per second, rate limiting helps you enforce that limit by rejecting or delaying requests that exceed the threshold.

Rate limiting is also a useful pattern when you want to enforce fair usage policies, for example in a public API, or when you want to enforce different pricing tiers. Basic users might be allowed only 100 requests a day, while premium users can make 1,000 requests a day.

[Flow: incoming request → token available? — yes: allow execution; no: reject or wait]

The following example allows two immediate executions every 120 milliseconds, logs when capacity is exhausted, and then succeeds again after the refill interval.

  limiter := ratelimiter.NewBurstyBuilder[string](2, 120*time.Millisecond).
    OnRateLimitExceeded(func(event failsafe.ExecutionEvent[string]) {
      fmt.Printf("  rate limited at attempt %d\n", event.Attempts())
    }).
    Build()

  executor := failsafe.With(limiter)
  for attempt := 1; attempt <= 3; attempt++ {
    result, err := executor.Get(func() (string, error) {
      return fmt.Sprintf("request-%d", attempt), nil
    })
    if err != nil {
      fmt.Printf("  request %d rejected: %v\n", attempt, err)
      continue
    }
    fmt.Printf("  request %d accepted with %s\n", attempt, result)
  }

  time.Sleep(130 * time.Millisecond)
  result, err := executor.Get(func() (string, error) {
    return "request-after-refill", nil
  })
  if err != nil {
    fmt.Printf("  refill request failed: %v\n", err)
  } else {
    fmt.Printf("  refill request accepted with %s\n", result)
  }

limits_showcase.go


Bulkhead

Bulkhead is very similar to a rate limiter, but instead of limiting the number of executions over time, it limits the number of concurrent executions. Like a rate limiter, a bulkhead protects a downstream system from being overwhelmed by too much traffic. Bulkheads are especially useful for resources with limited concurrency, such as a database connection pool or a third-party API with strict concurrency limits.

Requests that exceed the concurrency limit are either rejected immediately or queued briefly until capacity is available.
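In Go, the underlying mechanism is essentially a counting semaphore, which a buffered channel models directly. A sketch of the fast-rejection variant:

```go
package main

import "fmt"

// bulkhead limits concurrent executions with a buffered channel used
// as a counting semaphore.
type bulkhead struct {
	permits chan struct{}
}

func newBulkhead(maxConcurrent int) *bulkhead {
	return &bulkhead{permits: make(chan struct{}, maxConcurrent)}
}

// TryAcquire takes a permit without blocking; it reports false when
// the bulkhead is full, which maps to fast rejection.
func (b *bulkhead) TryAcquire() bool {
	select {
	case b.permits <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a permit to the bulkhead.
func (b *bulkhead) Release() { <-b.permits }

func main() {
	b := newBulkhead(2)
	fmt.Println(b.TryAcquire(), b.TryAcquire()) // two slots available
	fmt.Println(b.TryAcquire())                 // full: rejected fast
	b.Release()
	fmt.Println(b.TryAcquire()) // slot freed: accepted again
}
```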

[Flow: incoming work → free slot? — yes: execute inside bulkhead; no: reject quickly]

The following example uses one permit, forces saturation by pre-acquiring that permit, and then shows the request succeeding again after the permit is released.

  gate := bulkhead.NewBuilder[string](1).
    OnFull(func(event failsafe.ExecutionEvent[string]) {
      fmt.Println("  bulkhead is full")
    }).
    Build()

  if err := gate.AcquirePermit(context.Background()); err != nil {
    fmt.Printf("  failed to prefill bulkhead: %v\n", err)
    return
  }

  _, err := failsafe.With(gate).Get(func() (string, error) {
    return "worker-slot-1", nil
  })
  fmt.Printf("  while saturated: %v\n", err)

  gate.ReleasePermit()

  result, err := failsafe.With(gate).Get(func() (string, error) {
    return "worker-slot-1", nil
  })
  if err != nil {
    fmt.Printf("  recovered bulkhead failed: %v\n", err)
  } else {
    fmt.Printf("  recovered bulkhead accepted %s\n", result)
  }

limits_showcase.go


Adaptive Limiter

An adaptive limiter is a more sophisticated version of a bulkhead that adjusts its concurrency limit based on observed conditions. Instead of having a fixed number of concurrent executions, an adaptive limiter can increase or decrease the limit in response to changes in latency, error rates, or other signals. When an overload is detected, the limiter reduces the concurrency limit, and when conditions improve, it can increase the limit.
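One simple scheme in this family is additive increase / multiplicative decrease (AIMD): grow the limit slowly while things are healthy, cut it sharply on overload. The following sketch illustrates that idea only; failsafe-go's adaptive limiter uses a more sophisticated, latency-aware algorithm:

```go
package main

import "fmt"

// aimdLimit adjusts a concurrency limit with additive increase /
// multiplicative decrease, bounded between min and max.
type aimdLimit struct {
	limit, min, max int
}

// OnResult grows the limit by one on success and halves it when an
// overload signal (timeout, queueing, rejection) is observed.
func (a *aimdLimit) OnResult(overloaded bool) {
	if overloaded {
		a.limit /= 2
		if a.limit < a.min {
			a.limit = a.min
		}
		return
	}
	if a.limit < a.max {
		a.limit++
	}
}

func main() {
	l := &aimdLimit{limit: 8, min: 1, max: 20}
	l.OnResult(true) // overload: back off quickly
	fmt.Println("after overload:", l.limit)
	for i := 0; i < 3; i++ {
		l.OnResult(false) // healthy: probe upward slowly
	}
	fmt.Println("after recovery:", l.limit)
}
```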

[Flow: observe latency and load → adjust concurrency limit → under current limit? — yes: allow execution; no: reject execution]

The following example starts with a concurrency limit of one, rejects work while the only permit is held, and then allows work again after that permit is released.

  limiter := adaptivelimiter.NewBuilder[string]().
    WithLimits(1, 3, 1).
    WithRecentWindow(time.Second, 2*time.Second, 50).
    OnLimitExceeded(func(event failsafe.ExecutionEvent[string]) {
      fmt.Printf("  adaptive limiter rejected attempt %d\n", event.Attempts())
    }).
    Build()

  heldPermit, err := limiter.AcquirePermit(context.Background())
  if err != nil {
    fmt.Printf("  failed to acquire warm-up permit: %v\n", err)
    return
  }

  fmt.Printf("  limit=%d inflight=%d queued=%d\n", limiter.Limit(), limiter.Inflight(), limiter.Queued())

  _, err = failsafe.With(limiter).Get(func() (string, error) {
    return "background-sync", nil
  })
  fmt.Printf("  saturated execution: %v\n", err)

  heldPermit.Drop()

  result, err := failsafe.With(limiter).Get(func() (string, error) {
    return "background-sync", nil
  })
  if err != nil {
    fmt.Printf("  post-release execution failed: %v\n", err)
  } else {
    fmt.Printf("  post-release execution accepted %s\n", result)
  }

limits_showcase.go


Adaptive Throttler

Adaptive throttling is similar to circuit breaking, but instead of rejecting all traffic when it detects a problem, it sheds load more gradually by rejecting a percentage of requests based on recent failure rates. It looks at recent outcomes and adjusts the rejection rate over time, allowing some traffic through even when conditions are bad, which can help a struggling dependency recover without being completely cut off.

Use it when a backend is returning too many errors and sending every request through would only deepen the outage. This is a good fit for large traffic flows where partial shedding is better than a binary open-or-closed decision. Avoid this pattern when the dependency is completely down and every request will fail anyway, because a circuit breaker may be simpler. The trade-off is that finer-grained load shedding requires good signals and careful tuning.
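One well-known formula for this kind of gradual shedding comes from the client-side throttling technique in Google's SRE book: reject with probability max(0, (requests - k × accepts) / (requests + 1)), where k > 1 lets some excess traffic through so the backend can recover. failsafe-go's exact computation may differ, but the shape of the idea looks like this:

```go
package main

import "fmt"

// rejectionRate computes a probabilistic rejection rate from recent
// request and accept counts, following the client-side throttling
// formula from Google's SRE book:
//   max(0, (requests - k*accepts) / (requests + 1))
// A multiplier k > 1 allows some excess traffic through.
func rejectionRate(requests, accepts, k float64) float64 {
	r := (requests - k*accepts) / (requests + 1)
	if r < 0 {
		return 0
	}
	return r
}

func main() {
	// Healthy: everything accepted, nothing rejected.
	fmt.Printf("healthy:   %.2f\n", rejectionRate(100, 100, 1.5))
	// Degraded: 40% of recent requests succeeded; shed some load.
	fmt.Printf("degraded:  %.2f\n", rejectionRate(100, 40, 1.5))
	// Hard down: no successes; shed almost everything, but not all.
	fmt.Printf("hard down: %.2f\n", rejectionRate(100, 0, 1.5))
}
```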

[Flow: request arrives → observe recent failure rate → should reject? — yes: throttle request; no: send to dependency]

The following example shows an adaptive throttler that treats HTTP 503 as a failure signal and gradually raises its rejection rate when failures stay above the threshold.

  throttler := adaptivethrottler.NewBuilder[int]().
    HandleResult(503).
    WithFailureRateThreshold(0.2, 1, time.Minute).
    WithMaxRejectionRate(1.0).
    Build()

  executor := failsafe.With(throttler)
  for attempt := 1; attempt <= 8; attempt++ {
    result, err := executor.Get(func() (int, error) {
      return 503, nil
    })
    if errors.Is(err, adaptivethrottler.ErrExceeded) {
      fmt.Printf("  attempt %d rejected with rejection rate %.2f\n", attempt, throttler.RejectionRate())
      return
    }
    fmt.Printf("  attempt %d recorded result %d, rejection rate %.2f\n", attempt, result, throttler.RejectionRate())
  }

  fmt.Printf("  throttler never rejected within the sample window; final rate %.2f\n", throttler.RejectionRate())

limits_showcase.go


Composing Policies

In the previous examples, we have seen each policy on its own, but in a real service you often need to combine multiple policies. For example, you might want retries with backoff and jitter, but also a circuit breaker to stop retrying when the dependency is down and a fallback to return a cached value when retries are exhausted or the breaker is open.

failsafe-go makes it very convenient to compose policies. You can use With to combine multiple policies into one executor, and the order of the policies in the With call determines how they interact with each other.

The following example configures a retry, circuit breaker, and fallback policy. Fallback is outermost, retry sits inside it, and the breaker is the innermost policy in that stack.

executor := failsafe.With(fallbackPolicy, retryPolicy, breaker).
    WithContext(ctx).
    OnDone(func(event failsafe.ExecutionDoneEvent[rolloutPlan]) {
        if event.Error != nil {
            fmt.Printf("  done in %s after %d attempts with error=%v\n", event.ElapsedTime().Round(time.Millisecond), event.Attempts(), event.Error)
            return
        }
        fmt.Printf("  done in %s after %d attempts with source=%s\n", event.ElapsedTime().Round(time.Millisecond), event.Attempts(), event.Result.Source)
    })

plan, err := executor.GetWithExecution(first.Fetch)


The order matters because outer policies observe the results of inner ones. A good rule is to decide first what the caller should see, then compose from the outside in.

You can find more information about policy composition in the official docs: Policies Overview and Policy Composition.

HTTP Support

failsafe-go provides a convenient way to apply resilience policies to HTTP clients and servers through the failsafehttp package.

On the client side, it provides a RoundTripper that can wrap any existing transport and apply policies to all outgoing requests.

The following example wraps a transport with an HTTP-aware retry policy, then sends a request to a test server that returns two 503 responses before succeeding.

  var attempts atomic.Int32
  server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    attempt := attempts.Add(1)
    if attempt <= 2 {
      w.WriteHeader(http.StatusServiceUnavailable)
      _, _ = w.Write([]byte("control plane warming up"))
      return
    }
    w.WriteHeader(http.StatusOK)
    _, _ = w.Write([]byte("configuration applied"))
  }))
  defer server.Close()

  retryPolicy := failsafehttp.NewRetryPolicyBuilder().
    WithBackoff(20*time.Millisecond, 80*time.Millisecond).
    OnRetryScheduled(func(event failsafe.ExecutionScheduledEvent[*http.Response]) {
      status := 0
      if resp := event.LastResult(); resp != nil {
        status = resp.StatusCode
      }
      fmt.Printf("  retrying outbound request: status=%d next-attempt=%d delay=%s\n", status, event.Attempts()+1, event.Delay)
    }).
    Build()

  client := &http.Client{
    Transport: failsafehttp.NewRoundTripper(nil, retryPolicy),
  }

  req, err := http.NewRequest(http.MethodGet, server.URL, nil)
  if err != nil {
    fmt.Printf("  request build failed: %v\n", err)
    return
  }

  resp, err := client.Do(req)
  if err != nil {
    fmt.Printf("  outbound request failed: %v\n", err)
    return
  }
  defer resp.Body.Close()

  body, err := io.ReadAll(resp.Body)
  if err != nil {
    fmt.Printf("  response read failed: %v\n", err)
    return
  }

  fmt.Printf("  final HTTP status=%d body=%q attempts=%d\n", resp.StatusCode, string(body), attempts.Load())

http_showcase.go



On the server side, you can use failsafehttp.NewHandler to wrap an existing http.Handler with policies that protect your endpoints. This is useful for applying timeouts, bulkheads, or adaptive limiters to incoming requests.

The following example wraps a slow handler with a timeout policy, uses httptest to exercise it locally, and prints the final status and body.

  timeoutPolicy := timeout.NewBuilder[*http.Response](150 * time.Millisecond).Build()

  handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    select {
    case <-time.After(250 * time.Millisecond):
      w.WriteHeader(http.StatusOK)
      _, _ = w.Write([]byte("inventory refreshed"))
    case <-r.Context().Done():
      fmt.Printf("handler canceled: %v\n", r.Context().Err())
    }
  })

  protected := failsafehttp.NewHandler(handler, timeoutPolicy)
  server := httptest.NewServer(protected)
  defer server.Close()

  resp, err := http.Get(server.URL)
  if err != nil {
    fmt.Printf("request failed: %v\n", err)
    return
  }
  defer resp.Body.Close()

  body, err := io.ReadAll(resp.Body)
  if err != nil {
    fmt.Printf("read failed: %v\n", err)
    return
  }

  fmt.Printf("status=%d body=%q\n", resp.StatusCode, strings.TrimSpace(string(body)))

main.go


The second argument to NewHandler is variadic, so you can attach multiple policies to the same handler. For example, you could add a bulkhead to limit concurrent requests and a fallback to return a default response when the server is overloaded.

Wrapping Up

Resilience patterns are essential for building robust systems. failsafe-go provides a rich set of policies that you can compose to handle retries, timeouts, load shedding, and more. The key is to understand the trade-offs of each pattern and design your policies around the specific failure modes of your dependencies.