
Resilience Patterns in Go with failsafe-go

Published: 7. April 2026  •  go

Applications often depend on databases, internal APIs, queues, and third-party systems. These dependencies fail, slow down, or become overloaded. So an application needs ways to handle those failures gracefully, without crashing or causing a bad experience for users.

failsafe-go helps you implement resilience patterns in Go, such as retries, circuit breakers, fallbacks, timeouts, hedging, caching, rate limiting, bulkheads, and adaptive limiters. It provides a consistent API for defining these patterns and composing them together to build robust applications.

Install

go get github.com/failsafe-go/failsafe-go

failsafe-go uses the same pattern for all policies. The With function takes one or more policies and returns an executor.

executor := failsafe.With(fallbackPolicy, retryPolicy, breaker)
result, err := executor.Get(first.Fetch)

You call Get or Run on that executor to execute your code synchronously with the attached policies. Get is for functions that return a value and an error, and Run is for functions that only return an error.

Async methods like GetAsync and RunAsync are also available. They return an ExecutionResult, which contains a channel (Done()) to wait for completion plus methods to retrieve the result and error once the execution is done.

Additional methods include RunWithExecution and GetWithExecution, which pass a failsafe.Execution object into the function for more control and observability. For example, in a retry policy, the execution object can tell you how many attempts have been made so far.

These executor calls return a result and/or an error depending on which method you use. The error can come from the function itself, or it can be a wrapped policy error if the execution fails due to retries, timeouts, or other conditions. For example, the retry policy returns ErrExceeded when the maximum number of attempts is reached without success.

The WithContext method allows you to pass a context that can be used for cancellation and deadlines. This is especially important for policies like timeouts and hedging, which need to be able to cancel in-flight executions.

Retry

Retry handles transient failures by trying the same operation again. Retries should usually be paired with a backoff strategy that spaces out attempts over time, so they do not flood an already struggling dependency with traffic. A retry policy also should not retry every operation that fails; the most important part of designing one is deciding which errors are safe to retry and which should fail immediately.

Common use cases for retry are outbound HTTP calls, database connections, and message publishing. Be careful with non-idempotent operations: depending on the failure, the operation may have succeeded even though the response was lost, and retrying it would result in duplicate work. Common solutions are idempotency keys or another deduplication mechanism.
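To illustrate the idempotency-key idea (all names here are hypothetical, not part of failsafe-go), a server can remember the result for each key, so a retried request returns the stored result instead of repeating the work:

```go
package main

import (
	"fmt"
	"sync"
)

// dedupStore remembers the result of each idempotency key so that a
// retried request returns the stored result instead of redoing the work.
type dedupStore struct {
	mu      sync.Mutex
	results map[string]string
}

func newDedupStore() *dedupStore {
	return &dedupStore{results: make(map[string]string)}
}

// Do runs fn only on the first call for a given key; retries with the
// same key get the recorded result, so duplicates have no side effects.
func (s *dedupStore) Do(key string, fn func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.results[key]; ok {
		return r
	}
	r := fn()
	s.results[key] = r
	return r
}

func main() {
	store := newDedupStore()
	calls := 0
	charge := func() string {
		calls++
		return fmt.Sprintf("charge-id-%d", calls)
	}
	// The client retries with the same idempotency key after a lost response.
	first := store.Do("order-42", charge)
	second := store.Do("order-42", charge)
	fmt.Println(first, second, calls) // same result, one real charge
}
```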

Retries can improve success rates, but they also increase latency and may add load during an outage.

[Flow: call dependency → success? — yes: return result; no: attempts left? — yes: wait with backoff and retry; no: return error]

The following example retries any non-nil error, aborts on known non-retryable failures, and emits retry lifecycle events.

  retryPolicy := retrypolicy.NewBuilder[rolloutPlan]().
    HandleIf(func(_ rolloutPlan, err error) bool {
      return err != nil
    }).
    AbortOnErrors(errInvalidConfig, circuitbreaker.ErrOpen).
    WithMaxAttempts(3).
    WithBackoff(20*time.Millisecond, 80*time.Millisecond).
    WithJitter(5 * time.Millisecond).
    OnRetryScheduled(func(event failsafe.ExecutionScheduledEvent[rolloutPlan]) {
      fmt.Printf("  retry scheduled: attempt=%d delay=%s\n", event.Attempts()+1, event.Delay)
    }).
    OnRetry(func(event failsafe.ExecutionEvent[rolloutPlan]) {
      fmt.Printf("  retrying after: %v\n", event.LastError())
    }).
    OnAbort(func(event failsafe.ExecutionEvent[rolloutPlan]) {
      fmt.Printf("  retry aborted on: %v\n", event.LastError())
    }).
    Build()

retry_circuit_fallback.go


Circuit Breaker

Circuit breaking stops requests from repeatedly hitting a dependency that is already failing. Unlike a retry, which always calls the dependency, a circuit breaker can stop calls to the dependency when it detects a problem. This can help the dependency to recover.

The breaker usually has three states: closed, open, and half-open. In the closed state, calls are allowed and failures are counted. When the failure threshold is reached, the breaker moves to open and rejects calls immediately. After a delay, it moves to half-open, allows a probe call, and then closes again on success or reopens on failure.

A circuit breaker can be useful in scenarios where a dependency is hard down or times out under load. Breakers reduce load during incidents, but choosing the right thresholds is important to avoid opening too early or too late.

[States: Closed → (failures exceed threshold) → Open → (delay expires) → HalfOpen → (probe succeeds) → Closed, or (probe fails) → Open]

The following example opens after two handled failures, transitions to half-open after a short delay, and logs each state change.

  breaker := circuitbreaker.NewBuilder[rolloutPlan]().
    HandleErrors(errUpstreamUnavailable).
    WithFailureThreshold(2).
    WithSuccessThreshold(1).
    WithDelay(120 * time.Millisecond).
    OnStateChanged(func(event circuitbreaker.StateChangedEvent) {
      fmt.Printf("  breaker state: %s -> %s\n", event.OldState, event.NewState)
    }).
    OnOpen(func(event circuitbreaker.StateChangedEvent) {
      fmt.Printf("  breaker opened after %d upstream failures\n", event.Metrics().Failures())
    }).
    OnHalfOpen(func(event circuitbreaker.StateChangedEvent) {
      fmt.Println("  breaker is probing the planner again")
    }).
    OnClose(func(event circuitbreaker.StateChangedEvent) {
      fmt.Println("  breaker closed after a healthy probe")
    }).
    Build()

retry_circuit_fallback.go


Fallback

The fallback policy can return a predefined result or error when the primary execution fails. For example, a fallback can return a cached value or a default response. When using a fallback, it is important to make sure that the caller can tell that this result came from a fallback and not the primary execution.

[Flow: primary call → success? — yes: return primary result; no: fallback handler → return degraded result]

The fallback policy most often makes sense as the outermost policy in a composition of multiple policies, so it can catch failures from retries, circuit breakers, timeouts, and other inner policies. This way the caller gets a gracefully degraded response instead of an error when something goes wrong. See the section below on policy composition for more on how to combine multiple policies together.

The following code snippet returns a degraded rollout plan and adjusts the message based on the last failure.

  fallbackPolicy := fallback.NewBuilderWithFunc(func(exec failsafe.Execution[rolloutPlan]) (rolloutPlan, error) {
    note := "served a cached rollout plan after retries were exhausted"
    if errors.Is(exec.LastError(), circuitbreaker.ErrOpen) {
      note = "served a cached rollout plan because the breaker is open"
    }
    return rolloutPlan{
      Service: "checkout-api",
      Region:  "us-east-1",
      Source:  "degraded-cache",
      Note:    note,
    }, nil
  }).
    HandleErrors(retrypolicy.ErrExceeded, circuitbreaker.ErrOpen).
    OnFallbackExecuted(func(event failsafe.ExecutionDoneEvent[rolloutPlan]) {
      fmt.Printf("  fallback served: %s\n", event.Result.Source)
    }).
    Build()

retry_circuit_fallback.go


Timeout

Timeout sets an upper bound for how long an execution is allowed to run. When setting a timeout, make sure that the value is based on real measurements in production. If it's too low, you will see a lot of timeouts that look like failures but are actually just normal latency.

Timeouts are essential for keeping a service responsive when dependencies are slow or unresponsive. They prevent requests from hanging indefinitely and allow the system to degrade gracefully.

[Flow: start call → finished before deadline? — yes: return result; no: cancel execution → return timeout error or fallback]

The following code snippet puts a hard 180 millisecond deadline around a replica probe and records when the deadline is exceeded.

  timeoutPolicy := timeout.NewBuilder[probeResult](180 * time.Millisecond).
    OnTimeoutExceeded(func(event failsafe.ExecutionDoneEvent[probeResult]) {
      fmt.Printf("  timeout fired after %s\n", event.ElapsedTime().Round(time.Millisecond))
    }).
    Build()

hedge_timeout_async.go

Hedge

Hedging starts a second equivalent request when the first request looks slow. This policy is only useful when you have multiple backend targets to call in parallel, such as replicas or caches. It first sends the request to one target, then launches a second request to another target if the first one does not finish within the configured delay. In the common case, the caller gets the first successful result, and the other request is canceled.

This pattern is a good fit for latency-sensitive reads. Avoid it for non-idempotent operations, because you end up doing duplicate work or can create side effects multiple times. Real-world examples include replicated search backends, read replicas, cache clusters, and geo-distributed lookup services.

[Flow: start primary read → slow? — no: primary returns; yes: launch hedge read → whichever finishes first → return fastest result]

The following example launches one additional read after 30 milliseconds, attaches a hedge budget, and logs when the hedge is created.

  hedgeBudget := budget.NewBuilder().WithMaxRate(1.0).WithMinConcurrency(0).Build()

  hedgePolicy := hedgepolicy.NewBuilderWithDelay[probeResult](30 * time.Millisecond).
    WithMaxHedges(1).
    WithBudget(hedgeBudget).
    OnHedge(func(event failsafe.ExecutionEvent[probeResult]) {
      fmt.Printf("  hedge launched: attempts=%d hedges=%d\n", event.Attempts(), event.Hedges())
    }).
    Build()

hedge_timeout_async.go



If the protected function needs to know whether it is running as the original request or as a hedge, use GetWithExecution or RunWithExecution and inspect the execution metadata. This is useful when you want the first attempt to prefer a primary replica and a hedge to go to a secondary replica, or when you want to tag hedge traffic separately in logs and metrics.

When you use GetWithExecution, the failsafe.Execution object passed into the function exposes hedge-specific methods such as IsHedge, which reports whether the current attempt is a hedge, and Hedges, which returns how many hedges have been started:

  note := "primary"
  if exec.IsHedge() {
    note = fmt.Sprintf("hedge-%d", exec.Hedges())
  }

shared.go

The important limitation is that the original attempt cannot know in advance whether a hedge will be launched later. It only knows that it is not itself a hedge. If you need to observe hedge creation centrally, use OnHedge on the policy builder.

Cache

Cache is a straightforward pattern where a result is stored and reused for subsequent calls with the same cache key. Operations that are expensive or have stable results are good candidates for caching. These operations either return the same result for the same input or change so infrequently that stale reads are acceptable.

The key to effective caching is choosing an appropriate amount of time that data can be cached. Some data is safe to cache forever because it never changes. Other data may be safe to cache for only a short time, such as a few seconds or minutes, because it changes infrequently.

[Flow: request → cache hit? — yes: return cached result; no: call dependency → store result → return fresh result]

The following example caches only non-error snapshots that contain assets, records hits and misses, and reuses the same cache key across calls.

Note that the cache policy from failsafe-go is not a cache implementation itself, but a policy that can be used with any cache that implements the cachepolicy.Cache interface. This design allows you to use your existing cache or choose one that fits your needs, while still benefiting from the policy features like cache key management, hit/miss listeners, and conditional caching.

The usage pattern is a bit different from the other policies, because we need to provide the cache key for each execution. This can be done through a context (ContextWithCacheKey), which allows the cache policy to look up the right value for each call.

  cache := newMemoryCache[configSnapshot]()
  controlPlaneCalls := 0

  cacheExecutor := failsafe.With(cachepolicy.NewBuilder[configSnapshot](cache).
    CacheIf(func(result configSnapshot, err error) bool {
      return err == nil && len(result.Assets) > 0
    }).
    OnCacheMiss(func(event failsafe.ExecutionEvent[configSnapshot]) {
      fmt.Printf("  cache miss on attempt %d\n", event.Attempts())
    }).
    OnResultCached(func(event failsafe.ExecutionEvent[configSnapshot]) {
      fmt.Printf("  cached snapshot from %s\n", event.LastResult().Source)
    }).
    OnCacheHit(func(event failsafe.ExecutionDoneEvent[configSnapshot]) {
      fmt.Printf("  cache hit for %s\n", event.Result.Service)
    }).
    Build())

  ctx := cachepolicy.ContextWithCacheKey(context.Background(), "snapshot:checkout-api")
  loader := func(exec failsafe.Execution[configSnapshot]) (configSnapshot, error) {
    controlPlaneCalls++
    return configSnapshot{
      Service: "checkout-api",
      Assets:  []string{"feature-flags", "routing-rules", "slo-budgets"},
      Source:  fmt.Sprintf("control-plane call %d", controlPlaneCalls),
    }, nil
  }

  first, err := cacheExecutor.WithContext(ctx).GetWithExecution(loader)
  if err != nil {
    fmt.Printf("  first snapshot error: %v\n", err)
  } else {
    fmt.Printf("  first snapshot source: %s\n", first.Source)
  }

  second, err := cacheExecutor.WithContext(ctx).GetWithExecution(loader)
  if err != nil {
    fmt.Printf("  second snapshot error: %v\n", err)
  } else {
    fmt.Printf("  second snapshot source: %s\n", second.Source)
  }

  fmt.Printf("  supplier was called %d time(s)\n", controlPlaneCalls)

cache_showcase.go


Rate Limiter

Rate limiting is a preventive control that limits the number of operations that can be executed over a time window. It is used to protect downstream systems. For example, if a system can only handle 100 requests per second, rate limiting helps you enforce that limit by rejecting or delaying requests that exceed the threshold.

Rate limiting is also a useful pattern when you want to enforce fair usage policies, for example in a public API, or when you want to enforce different pricing tiers. Basic users might be allowed only 100 requests a day, while premium users can make 1,000 requests a day.

[Flow: incoming request → token available? — yes: allow execution; no: reject or wait]

The following example allows two immediate executions every 120 milliseconds, logs when capacity is exhausted, and then succeeds again after the refill interval.

  limiter := ratelimiter.NewBurstyBuilder[string](2, 120*time.Millisecond).
    OnRateLimitExceeded(func(event failsafe.ExecutionEvent[string]) {
      fmt.Printf("  rate limited at attempt %d\n", event.Attempts())
    }).
    Build()

  executor := failsafe.With(limiter)
  for attempt := 1; attempt <= 3; attempt++ {
    result, err := executor.Get(func() (string, error) {
      return fmt.Sprintf("request-%d", attempt), nil
    })
    if err != nil {
      fmt.Printf("  request %d rejected: %v\n", attempt, err)
      continue
    }
    fmt.Printf("  request %d accepted with %s\n", attempt, result)
  }

  time.Sleep(130 * time.Millisecond)
  result, err := executor.Get(func() (string, error) {
    return "request-after-refill", nil
  })
  if err != nil {
    fmt.Printf("  refill request failed: %v\n", err)
  } else {
    fmt.Printf("  refill request accepted with %s\n", result)
  }

limits_showcase.go


Bulkhead

Bulkhead is very similar to a rate limiter, but instead of limiting the number of executions over time, it limits the number of concurrent executions. Like a rate limiter, a bulkhead protects a downstream system from being overwhelmed by too much traffic. Bulkheads are especially useful for resources with limited concurrency, such as a database connection pool or a third-party API with strict concurrency limits.

Requests that exceed the concurrency limit are either rejected immediately or queued briefly until capacity is available.
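In Go, the underlying mechanism is essentially a counting semaphore, which a buffered channel models directly. A sketch of the fast-rejection variant:

```go
package main

import "fmt"

// bulkhead limits concurrent executions with a buffered channel used
// as a counting semaphore.
type bulkhead struct {
	permits chan struct{}
}

func newBulkhead(maxConcurrent int) *bulkhead {
	return &bulkhead{permits: make(chan struct{}, maxConcurrent)}
}

// TryAcquire takes a permit without blocking; it reports false when
// the bulkhead is full, which maps to fast rejection.
func (b *bulkhead) TryAcquire() bool {
	select {
	case b.permits <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a permit to the bulkhead.
func (b *bulkhead) Release() { <-b.permits }

func main() {
	b := newBulkhead(2)
	fmt.Println(b.TryAcquire(), b.TryAcquire()) // two slots available
	fmt.Println(b.TryAcquire())                 // full: rejected fast
	b.Release()
	fmt.Println(b.TryAcquire()) // slot freed: accepted again
}
```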

[Flow: incoming work → free slot? — yes: execute inside bulkhead; no: reject quickly]

The following example uses one permit, forces saturation by pre-acquiring that permit, and then shows the request succeeding again after the permit is released.

  gate := bulkhead.NewBuilder[string](1).
    OnFull(func(event failsafe.ExecutionEvent[string]) {
      fmt.Println("  bulkhead is full")
    }).
    Build()

  if err := gate.AcquirePermit(context.Background()); err != nil {
    fmt.Printf("  failed to prefill bulkhead: %v\n", err)
    return
  }

  _, err := failsafe.With(gate).Get(func() (string, error) {
    return "worker-slot-1", nil
  })
  fmt.Printf("  while saturated: %v\n", err)

  gate.ReleasePermit()

  result, err := failsafe.With(gate).Get(func() (string, error) {
    return "worker-slot-1", nil
  })
  if err != nil {
    fmt.Printf("  recovered bulkhead failed: %v\n", err)
  } else {
    fmt.Printf("  recovered bulkhead accepted %s\n", result)
  }

limits_showcase.go


Adaptive Limiter

An adaptive limiter is a more sophisticated version of a bulkhead that adjusts its concurrency limit based on observed conditions. Instead of having a fixed number of concurrent executions, an adaptive limiter can increase or decrease the limit in response to changes in latency, error rates, or other signals. When an overload is detected, the limiter reduces the concurrency limit, and when conditions improve, it can increase the limit.
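One simple scheme in this family is additive increase / multiplicative decrease (AIMD): grow the limit slowly while things are healthy, cut it sharply on overload. The following sketch illustrates that idea only; failsafe-go's adaptive limiter uses a more sophisticated, latency-aware algorithm:

```go
package main

import "fmt"

// aimdLimit adjusts a concurrency limit with additive increase /
// multiplicative decrease, bounded between min and max.
type aimdLimit struct {
	limit, min, max int
}

// OnResult grows the limit by one on success and halves it when an
// overload signal (timeout, queueing, rejection) is observed.
func (a *aimdLimit) OnResult(overloaded bool) {
	if overloaded {
		a.limit /= 2
		if a.limit < a.min {
			a.limit = a.min
		}
		return
	}
	if a.limit < a.max {
		a.limit++
	}
}

func main() {
	l := &aimdLimit{limit: 8, min: 1, max: 20}
	l.OnResult(true) // overload: back off quickly
	fmt.Println("after overload:", l.limit)
	for i := 0; i < 3; i++ {
		l.OnResult(false) // healthy: probe upward slowly
	}
	fmt.Println("after recovery:", l.limit)
}
```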

[Flow: observe latency and load → adjust concurrency limit → under current limit? — yes: allow execution; no: reject execution]

The following example starts with a concurrency limit of one, rejects work while the only permit is held, and then allows work again after that permit is released.

  limiter := adaptivelimiter.NewBuilder[string]().
    WithLimits(1, 3, 1).
    WithRecentWindow(time.Second, 2*time.Second, 50).
    OnLimitExceeded(func(event failsafe.ExecutionEvent[string]) {
      fmt.Printf("  adaptive limiter rejected attempt %d\n", event.Attempts())
    }).
    Build()

  heldPermit, err := limiter.AcquirePermit(context.Background())
  if err != nil {
    fmt.Printf("  failed to acquire warm-up permit: %v\n", err)
    return
  }

  fmt.Printf("  limit=%d inflight=%d queued=%d\n", limiter.Limit(), limiter.Inflight(), limiter.Queued())

  _, err = failsafe.With(limiter).Get(func() (string, error) {
    return "background-sync", nil
  })
  fmt.Printf("  saturated execution: %v\n", err)

  heldPermit.Drop()

  result, err := failsafe.With(limiter).Get(func() (string, error) {
    return "background-sync", nil
  })
  if err != nil {
    fmt.Printf("  post-release execution failed: %v\n", err)
  } else {
    fmt.Printf("  post-release execution accepted %s\n", result)
  }

limits_showcase.go


Adaptive Throttler

Adaptive throttling is similar to circuit breaking, but instead of rejecting all traffic when it detects a problem, it sheds load more gradually by rejecting a percentage of requests based on recent failure rates. It looks at recent outcomes and adjusts the rejection rate over time, allowing some traffic through even when conditions are bad, which can help a struggling dependency recover without being completely cut off.

Use it when a backend is returning too many errors and sending every request through would only deepen the outage. This is a good fit for large traffic flows where partial shedding is better than a binary open-or-closed decision. Avoid this pattern when the dependency is completely down and every request will fail anyway, because a circuit breaker may be simpler. The trade-off is that finer-grained load shedding requires good signals and careful tuning.
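One well-known formula for this kind of gradual shedding comes from the client-side throttling technique in Google's SRE book: reject with probability max(0, (requests - k × accepts) / (requests + 1)), where k > 1 lets some excess traffic through so the backend can recover. failsafe-go's exact computation may differ, but the shape of the idea looks like this:

```go
package main

import "fmt"

// rejectionRate computes a probabilistic rejection rate from recent
// request and accept counts, following the client-side throttling
// formula from Google's SRE book:
//   max(0, (requests - k*accepts) / (requests + 1))
// A multiplier k > 1 allows some excess traffic through.
func rejectionRate(requests, accepts, k float64) float64 {
	r := (requests - k*accepts) / (requests + 1)
	if r < 0 {
		return 0
	}
	return r
}

func main() {
	// Healthy: everything accepted, nothing rejected.
	fmt.Printf("healthy:   %.2f\n", rejectionRate(100, 100, 1.5))
	// Degraded: 40% of recent requests succeeded; shed some load.
	fmt.Printf("degraded:  %.2f\n", rejectionRate(100, 40, 1.5))
	// Hard down: no successes; shed almost everything, but not all.
	fmt.Printf("hard down: %.2f\n", rejectionRate(100, 0, 1.5))
}
```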

[Flow: request arrives → observe recent failure rate → should reject? — yes: throttle request; no: send to dependency]

The following example shows an adaptive throttler that treats HTTP 503 as a failure signal and gradually raises its rejection rate when failures stay above the threshold.

  throttler := adaptivethrottler.NewBuilder[int]().
    HandleResult(503).
    WithFailureRateThreshold(0.2, 1, time.Minute).
    WithMaxRejectionRate(1.0).
    Build()

  executor := failsafe.With(throttler)
  for attempt := 1; attempt <= 8; attempt++ {
    result, err := executor.Get(func() (int, error) {
      return 503, nil
    })
    if errors.Is(err, adaptivethrottler.ErrExceeded) {
      fmt.Printf("  attempt %d rejected with rejection rate %.2f\n", attempt, throttler.RejectionRate())
      return
    }
    fmt.Printf("  attempt %d recorded result %d, rejection rate %.2f\n", attempt, result, throttler.RejectionRate())
  }

  fmt.Printf("  throttler never rejected within the sample window; final rate %.2f\n", throttler.RejectionRate())

limits_showcase.go


Composing Policies

In the previous examples, we have seen each policy on its own, but in a real service you often need to combine multiple policies. For example, you might want retries with backoff and jitter, but also a circuit breaker to stop retrying when the dependency is down and a fallback to return a cached value when retries are exhausted or the breaker is open.

failsafe-go makes it very convenient to compose policies. You can use With to combine multiple policies into one executor, and the order of the policies in the With call determines how they interact with each other.

The following example configures a retry, circuit breaker, and fallback policy. Fallback is outermost, retry sits inside it, and the breaker is the innermost policy in that stack.

executor := failsafe.With(fallbackPolicy, retryPolicy, breaker).
    WithContext(ctx).
    OnDone(func(event failsafe.ExecutionDoneEvent[rolloutPlan]) {
        if event.Error != nil {
            fmt.Printf("  done in %s after %d attempts with error=%v\n", event.ElapsedTime().Round(time.Millisecond), event.Attempts(), event.Error)
            return
        }
        fmt.Printf("  done in %s after %d attempts with source=%s\n", event.ElapsedTime().Round(time.Millisecond), event.Attempts(), event.Result.Source)
    })

plan, err := executor.GetWithExecution(first.Fetch)


The order matters because outer policies observe the results of inner ones. A good rule is to decide first what the caller should see, then compose from the outside in.

You can find more information about policy composition in the official docs: Policies Overview and Policy Composition.

HTTP Support

failsafe-go provides a convenient way to apply resilience policies to HTTP clients and servers through the failsafehttp package.

On the client side, it provides a RoundTripper that can wrap any existing transport and apply policies to all outgoing requests.

The following example wraps a transport with an HTTP-aware retry policy, then sends a request to a test server that returns two 503 responses before succeeding.

  var attempts atomic.Int32
  server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    attempt := attempts.Add(1)
    if attempt <= 2 {
      w.WriteHeader(http.StatusServiceUnavailable)
      _, _ = w.Write([]byte("control plane warming up"))
      return
    }
    w.WriteHeader(http.StatusOK)
    _, _ = w.Write([]byte("configuration applied"))
  }))
  defer server.Close()

  retryPolicy := failsafehttp.NewRetryPolicyBuilder().
    WithBackoff(20*time.Millisecond, 80*time.Millisecond).
    OnRetryScheduled(func(event failsafe.ExecutionScheduledEvent[*http.Response]) {
      status := 0
      if resp := event.LastResult(); resp != nil {
        status = resp.StatusCode
      }
      fmt.Printf("  retrying outbound request: status=%d next-attempt=%d delay=%s\n", status, event.Attempts()+1, event.Delay)
    }).
    Build()

  client := &http.Client{
    Transport: failsafehttp.NewRoundTripper(nil, retryPolicy),
  }

  req, err := http.NewRequest(http.MethodGet, server.URL, nil)
  if err != nil {
    fmt.Printf("  request build failed: %v\n", err)
    return
  }

  resp, err := client.Do(req)
  if err != nil {
    fmt.Printf("  outbound request failed: %v\n", err)
    return
  }
  defer resp.Body.Close()

  body, err := io.ReadAll(resp.Body)
  if err != nil {
    fmt.Printf("  response read failed: %v\n", err)
    return
  }

  fmt.Printf("  final HTTP status=%d body=%q attempts=%d\n", resp.StatusCode, string(body), attempts.Load())

http_showcase.go



On the server side, you can use failsafehttp.NewHandler to wrap an existing http.Handler with policies that protect your endpoints. This is useful for applying timeouts, bulkheads, or adaptive limiters to incoming requests.

The following example wraps a slow handler with a timeout policy, uses httptest to exercise it locally, and prints the final status and body.

  timeoutPolicy := timeout.NewBuilder[*http.Response](150 * time.Millisecond).Build()

  handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    select {
    case <-time.After(250 * time.Millisecond):
      w.WriteHeader(http.StatusOK)
      _, _ = w.Write([]byte("inventory refreshed"))
    case <-r.Context().Done():
      fmt.Printf("handler canceled: %v\n", r.Context().Err())
    }
  })

  protected := failsafehttp.NewHandler(handler, timeoutPolicy)
  server := httptest.NewServer(protected)
  defer server.Close()

  resp, err := http.Get(server.URL)
  if err != nil {
    fmt.Printf("request failed: %v\n", err)
    return
  }
  defer resp.Body.Close()

  body, err := io.ReadAll(resp.Body)
  if err != nil {
    fmt.Printf("read failed: %v\n", err)
    return
  }

  fmt.Printf("status=%d body=%q\n", resp.StatusCode, strings.TrimSpace(string(body)))

main.go


The second argument to NewHandler is variadic, so you can attach multiple policies to the same handler. For example, you could add a bulkhead to limit concurrent requests and a fallback to return a default response when the server is overloaded.

Wrapping Up

Resilience patterns are essential for building robust systems. failsafe-go provides a rich set of policies that you can compose to handle retries, timeouts, load shedding, and more. The key is to understand the trade-offs of each pattern and design your policies around the specific failure modes of your dependencies.