Applications often depend on databases, internal APIs, queues, and third-party systems. These dependencies fail, slow down, or become overloaded. So an application needs ways to handle those failures gracefully, without crashing or causing a bad experience for users.
failsafe-go helps you implement resilience patterns in Go, such as retries, circuit breakers, fallbacks, timeouts, hedging, caching, rate limiting, bulkheads, and adaptive limiters. It provides a consistent API for defining these patterns and composing them together to build robust applications.
Install ¶
go get github.com/failsafe-go/failsafe-go
failsafe-go uses the same pattern for all policies. The With function takes one or more policies and returns an executor.
executor := failsafe.With(fallbackPolicy, retryPolicy, breaker)
result, err := executor.Get(first.Fetch)
You call Get or Run on that executor to execute your code synchronously with the attached policies. Get is for functions that return a value and an error, and Run is for functions that only return an error.
Async methods like GetAsync and RunAsync are also available. They return an ExecutionResult, which contains a channel (Done()) to wait for completion plus methods to retrieve the result and error once the execution is done.
Additional methods include RunWithExecution and GetWithExecution, which pass a failsafe.Execution object into the function for more control and observability. For example, in a retry policy, the execution object can tell you how many attempts have been made so far.
These executor calls return a result and/or an error depending on which method you use. The error can come from the function itself, or it can be a wrapped policy error if the execution fails due to retries, timeouts, or other conditions. For example, the retry policy returns ErrExceeded when the maximum number of attempts is reached without success.
The WithContext method allows you to pass a context that can be used for cancellation and deadlines. This is especially important for policies like timeouts and hedging, which need to be able to cancel in-flight executions.
Retry ¶
Retry handles transient failures by trying the same operation again. Retries should usually be paired with a backoff strategy that spaces out attempts over time so they do not send a flood of traffic to an already struggling dependency. Not every failed operation should be retried, so the most important part of a retry policy is deciding which errors are safe to retry and which ones should fail immediately.
Common use cases for retry are outbound HTTP calls, database connections, and message publishing. When using retry, be careful with non-idempotent operations: depending on the failure, the operation may have succeeded even though the response was lost, and retrying it would result in duplicate work. Common solutions are idempotency keys or another deduplication mechanism.
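As an illustration of the idempotency-key approach (the header name and key format here are common conventions, not part of failsafe-go):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newIdempotencyKey returns a random token. Any unique string works; the
// important part is that it stays the same across retries of one operation.
func newIdempotencyKey() string {
	b := make([]byte, 16)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	// Generate the key once per logical operation, then reuse it on the
	// original attempt and on every retry so the server can deduplicate.
	key := newIdempotencyKey()
	req, _ := http.NewRequest(http.MethodPost, "https://example.com/orders", nil)
	req.Header.Set("Idempotency-Key", key)

	fmt.Println(req.Header.Get("Idempotency-Key") == key) // prints "true"
}
```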
Retries can improve success rates, but they also increase latency and may add load during an outage.
The following example retries any non-nil error, aborts on known non-retryable failures, and emits retry lifecycle events.
- HandleIf defines which errors should trigger a retry. In this case, any non-nil error will be retried. The parameters are the last successful result (if any) and the error that occurred.
- AbortOnErrors stops retrying for known fatal conditions. In this example, invalid config responses and an open circuit fail immediately.
- WithMaxAttempts sets the maximum number of attempts, including the initial try. In this example, it allows for two retries after the initial attempt.
- WithBackoff configures the backoff strategy. Here it starts with a 20ms delay and doubles it up to a maximum of 80ms.
- WithJitter adds random jitter to the delay to prevent thundering herd problems when many requests fail at the same time. Here it adds up to 5ms of random jitter.
- OnRetryScheduled fires before the next attempt is queued.
- OnRetry runs when a retry attempt starts.
- OnAbort runs when the retry policy gives up early because of an abort condition.
retryPolicy := retrypolicy.NewBuilder[rolloutPlan]().
	HandleIf(func(_ rolloutPlan, err error) bool {
		return err != nil
	}).
	AbortOnErrors(errInvalidConfig, circuitbreaker.ErrOpen).
	WithMaxAttempts(3).
	WithBackoff(20*time.Millisecond, 80*time.Millisecond).
	WithJitter(5 * time.Millisecond).
	OnRetryScheduled(func(event failsafe.ExecutionScheduledEvent[rolloutPlan]) {
		fmt.Printf(" retry scheduled: attempt=%d delay=%s\n", event.Attempts()+1, event.Delay)
	}).
	OnRetry(func(event failsafe.ExecutionEvent[rolloutPlan]) {
		fmt.Printf(" retrying after: %v\n", event.LastError())
	}).
	OnAbort(func(event failsafe.ExecutionEvent[rolloutPlan]) {
		fmt.Printf(" retry aborted on: %v\n", event.LastError())
	}).
	Build()
Other options include:
- HandleErrors to specify a list of error values or types that should trigger retries.
- HandleResult to retry based on the returned value instead of the error.
- WithMaxRetries to specify the maximum number of retries instead of the maximum number of attempts.
- WithMaxDuration to stop retrying after a total time limit, in addition to any attempt or retry limit.
- WithDelay, WithRandomDelay, and WithDelayFunc to use fixed, random, or computed delays between attempts. For example, WithDelayFunc could be useful for implementing a retry that honors Retry-After headers from an HTTP response.
- WithJitterFactor as an alternative to time-based jitter.
- WithBudget to limit outstanding retries across a system.
- AbortOnResult and AbortIf to stop retrying for specific outcomes.
- ReturnLastFailure to return the last result and error instead of an exceeded error wrapper.
- OnRetriesExceeded to add a listener when the retry limit is reached.
Circuit Breaker ¶
Circuit breaking stops requests from repeatedly hitting a dependency that is already failing. Unlike a retry, which always calls the dependency, a circuit breaker can stop calls to the dependency when it detects a problem. This gives the dependency a chance to recover.
The breaker usually has three states: closed, open, and half-open. In the closed state, calls are allowed and failures are counted. When the failure threshold is reached, the breaker moves to open and rejects calls immediately. After a delay, it moves to half-open, allows a probe call, and then closes again on success or reopens on failure.
A circuit breaker can be useful in scenarios where a dependency is hard down or times out under load. Breakers reduce load during incidents, but choosing the right thresholds is important to avoid opening too early or too late.
The following example opens after two handled failures, transitions to half-open after a short delay, and logs each state change.
- HandleErrors defines which errors count as breaker failures. In this case, only errUpstreamUnavailable contributes to opening the breaker.
- WithFailureThreshold sets the number of failures required to open the breaker. Here it opens after two handled failures.
- WithSuccessThreshold sets how many successful half-open probe calls are needed before the breaker closes again. Here a single successful probe is enough.
- WithDelay sets how long the breaker stays open before transitioning to half-open. Here it waits 120 milliseconds.
- OnStateChanged records every breaker state transition.
- OnOpen runs when the breaker opens.
- OnHalfOpen runs when the breaker allows probe traffic again.
- OnClose runs when the breaker returns to the healthy closed state after a successful probe.
breaker := circuitbreaker.NewBuilder[rolloutPlan]().
	HandleErrors(errUpstreamUnavailable).
	WithFailureThreshold(2).
	WithSuccessThreshold(1).
	WithDelay(120 * time.Millisecond).
	OnStateChanged(func(event circuitbreaker.StateChangedEvent) {
		fmt.Printf(" breaker state: %s -> %s\n", event.OldState, event.NewState)
	}).
	OnOpen(func(event circuitbreaker.StateChangedEvent) {
		fmt.Printf(" breaker opened after %d upstream failures\n", event.Metrics().Failures())
	}).
	OnHalfOpen(func(event circuitbreaker.StateChangedEvent) {
		fmt.Println(" breaker is probing the planner again")
	}).
	OnClose(func(event circuitbreaker.StateChangedEvent) {
		fmt.Println(" breaker closed after a healthy probe")
	}).
	Build()
Other options include:
- HandleErrorTypes, HandleResult, and HandleIf to define failures by error type, return value, or predicate.
- WithFailureThresholdRatio, WithFailureThresholdPeriod, and WithFailureRateThreshold for ratio-based or time-window-based opening rules.
- WithDelayFunc to compute the open-state delay dynamically, for example from a Retry-After header.
- WithSuccessThresholdRatio to require a ratio of successful probe calls before closing.
- OnSuccess and OnFailure for additional logging and metrics around handled outcomes.
Fallback ¶
The fallback policy can return a predefined result or error when the primary execution fails. For example, a fallback can return a cached value or a default response. When using a fallback, it is important to make sure that the caller can tell that this result came from a fallback and not the primary execution.
The fallback policy most often makes sense as the outermost policy in a composition of multiple policies, so it can catch failures from retries, circuit breakers, timeouts, and other inner policies. This way the caller gets a gracefully degraded response instead of an error when something goes wrong. See the section below on policy composition for more on how to combine multiple policies together.
The following code snippet returns a degraded rollout plan and adjusts the message based on the last failure.
- NewBuilderWithFunc creates the fallback from a function, which is useful when the replacement value depends on the execution context or the last failure.
- HandleErrors limits the fallback to specific policy failures. Here it only activates for retrypolicy.ErrExceeded and circuitbreaker.ErrOpen.
- OnFallbackExecuted runs after the fallback supplies a replacement result.
fallbackPolicy := fallback.NewBuilderWithFunc(func(exec failsafe.Execution[rolloutPlan]) (rolloutPlan, error) {
	note := "served a cached rollout plan after retries were exhausted"
	if errors.Is(exec.LastError(), circuitbreaker.ErrOpen) {
		note = "served a cached rollout plan because the breaker is open"
	}
	return rolloutPlan{
		Service: "checkout-api",
		Region:  "us-east-1",
		Source:  "degraded-cache",
		Note:    note,
	}, nil
}).
	HandleErrors(retrypolicy.ErrExceeded, circuitbreaker.ErrOpen).
	OnFallbackExecuted(func(event failsafe.ExecutionDoneEvent[rolloutPlan]) {
		fmt.Printf(" fallback served: %s\n", event.Result.Source)
	}).
	Build()
Other options include:
- NewWithResult and NewBuilderWithResult to always return a fixed fallback result.
- NewWithError and NewBuilderWithError to convert failures into a specific fallback error.
- NewWithFunc when you want the fallback policy directly without further builder customization.
- HandleErrorTypes, HandleResult, and HandleIf to trigger the fallback from error types, return values, or predicates.
- OnSuccess and OnFailure to instrument what the fallback handled.
Timeout ¶
Timeout sets an upper bound for how long an execution is allowed to run. When setting a timeout, make sure that the value is based on real measurements in production. If it's too low, you will see a lot of timeouts that look like failures but are actually just normal latency.
Timeouts are essential for keeping a service responsive when dependencies are slow or unresponsive. They prevent requests from hanging indefinitely and allow the system to degrade gracefully.
The following code snippet puts a hard 180 millisecond deadline around a replica probe and records when the deadline is exceeded.
- timeout.NewBuilder[probeResult](180 * time.Millisecond) sets the timeout duration for the execution.
- OnTimeoutExceeded records when the protected execution ran past the deadline.
timeoutPolicy := timeout.NewBuilder[probeResult](180 * time.Millisecond).
	OnTimeoutExceeded(func(event failsafe.ExecutionDoneEvent[probeResult]) {
		fmt.Printf(" timeout fired after %s\n", event.ElapsedTime().Round(time.Millisecond))
	}).
	Build()
Hedge ¶
Hedging starts a second equivalent request when the first request looks slow. This policy is only useful when you have multiple backend targets to call in parallel, such as replicas or caches. It first sends the request to one target, then launches a second request to another target if the first one does not finish within the configured delay. In the common case, the caller gets the first successful result, and the other request is canceled.
This pattern is a good fit for latency-sensitive reads. Avoid it for non-idempotent operations, because you end up doing duplicate work or can create side effects multiple times. Real-world examples include replicated search backends, read replicas, cache clusters, and geo-distributed lookup services.
The following example launches one additional read after 30 milliseconds, attaches a hedge budget, and logs when the hedge is created.
- NewBuilderWithDelay sets a fixed delay before each hedge is launched.
- WithMaxHedges caps how many extra concurrent attempts the policy may create. Here it launches at most one additional hedge, so there are at most two concurrent executions of the protected function.
- WithBudget caps duplicate work across executions so hedging does not expand unchecked under load.
- OnHedge runs when the extra request is launched.
hedgeBudget := budget.NewBuilder().WithMaxRate(1.0).WithMinConcurrency(0).Build()

hedgePolicy := hedgepolicy.NewBuilderWithDelay[probeResult](30 * time.Millisecond).
	WithMaxHedges(1).
	WithBudget(hedgeBudget).
	OnHedge(func(event failsafe.ExecutionEvent[probeResult]) {
		fmt.Printf(" hedge launched: attempts=%d hedges=%d\n", event.Attempts(), event.Hedges())
	}).
	Build()
Other options include:
- NewBuilderWithDelayFunc to compute the hedge delay dynamically, for example from recent latency.
- NewWithDelayQuantile to hedge when executions exceed an observed latency quantile such as p95.
- CancelOnResult, CancelOnErrors, CancelOnErrorTypes, and CancelIf to control which outcome cancels outstanding hedges.
- CancelOnSuccess when you want successful outcomes specifically to cancel outstanding hedges.
If the protected function needs to know whether it is running as the original request or as a hedge, use GetWithExecution or RunWithExecution and inspect the execution metadata. This is useful when you want the first attempt to prefer a primary replica and a hedge to go to a secondary replica, or when you want to tag hedge traffic separately in logs and metrics.
When you use GetWithExecution, the failsafe.Execution object passed into the function exposes the following hedge-specific methods:
- IsHedge returns true only for the additional hedged attempts, not for the original attempt.
- Hedges returns how many hedge attempts have been started so far for the overall execution.
note := "primary"
if exec.IsHedge() {
	note = fmt.Sprintf("hedge-%d", exec.Hedges())
}
The important limitation is that the original attempt cannot know in advance whether a hedge will be launched later. It only knows that it is not itself a hedge. If you need to observe hedge creation centrally, use OnHedge on the policy builder.
Cache ¶
Cache is a straightforward pattern where a result is stored and reused for subsequent calls with the same cache key. Operations that are expensive or have stable results are good candidates for caching. These operations either return the same result for the same input or change so infrequently that stale reads are acceptable.
The key to effective caching is choosing an appropriate amount of time that data can be cached. Some data is safe to cache forever because it never changes. Other data may be safe to cache for only a short time, such as a few seconds or minutes, because it changes infrequently.
The following example caches only non-error snapshots that contain assets, records hits and misses, and reuses the same cache key across calls.
- cachepolicy.NewBuilder[configSnapshot](cache) creates a cache policy for the supplied cache backend.
- CacheIf limits caching to successful snapshots that contain usable data.
- cachepolicy.ContextWithCacheKey sets the cache key per execution. Here the key is snapshot:checkout-api, which is what makes repeated calls hit the same cached value.
- OnCacheMiss records when the loader had to run because no cached value was present.
- OnResultCached records when a fresh result is written into the cache.
- OnCacheHit records when a cached result is served instead of calling the loader again.
Note that the cache policy from failsafe-go is not a cache implementation itself, but a policy that can be used with any cache that implements the cachepolicy.Cache interface. This design allows you to use your existing cache or choose one that fits your needs, while still benefiting from the policy features like cache key management, hit/miss listeners, and conditional caching.
The usage pattern is a bit different from the other policies, because we need to provide the cache key for each execution. This can be done through a context (ContextWithCacheKey), which allows the cache policy to look up the right value for each call.
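The example below relies on a newMemoryCache helper; a minimal map-backed version might look like the following. The Get and Set method shapes are an assumption about what the cachepolicy.Cache interface expects, so verify them against the package docs before relying on this sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// memoryCache is a minimal, concurrency-safe cache intended to satisfy
// failsafe-go's cachepolicy.Cache interface (assumed Get/Set shape).
type memoryCache[R any] struct {
	mu      sync.RWMutex
	entries map[string]R
}

func newMemoryCache[R any]() *memoryCache[R] {
	return &memoryCache[R]{entries: map[string]R{}}
}

// Get returns the cached value for key and whether it was present.
func (c *memoryCache[R]) Get(key string) (R, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	value, ok := c.entries[key]
	return value, ok
}

// Set stores value under key.
func (c *memoryCache[R]) Set(key string, value R) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = value
}

func main() {
	cache := newMemoryCache[string]()
	cache.Set("snapshot:checkout-api", "cached-snapshot")
	value, ok := cache.Get("snapshot:checkout-api")
	fmt.Println(value, ok) // prints "cached-snapshot true"
}
```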
cache := newMemoryCache[configSnapshot]()
controlPlaneCalls := 0

cacheExecutor := failsafe.With(cachepolicy.NewBuilder[configSnapshot](cache).
	CacheIf(func(result configSnapshot, err error) bool {
		return err == nil && len(result.Assets) > 0
	}).
	OnCacheMiss(func(event failsafe.ExecutionEvent[configSnapshot]) {
		fmt.Printf(" cache miss on attempt %d\n", event.Attempts())
	}).
	OnResultCached(func(event failsafe.ExecutionEvent[configSnapshot]) {
		fmt.Printf(" cached snapshot from %s\n", event.LastResult().Source)
	}).
	OnCacheHit(func(event failsafe.ExecutionDoneEvent[configSnapshot]) {
		fmt.Printf(" cache hit for %s\n", event.Result.Service)
	}).
	Build())

ctx := cachepolicy.ContextWithCacheKey(context.Background(), "snapshot:checkout-api")
loader := func(exec failsafe.Execution[configSnapshot]) (configSnapshot, error) {
	controlPlaneCalls++
	return configSnapshot{
		Service: "checkout-api",
		Assets:  []string{"feature-flags", "routing-rules", "slo-budgets"},
		Source:  fmt.Sprintf("control-plane call %d", controlPlaneCalls),
	}, nil
}

first, err := cacheExecutor.WithContext(ctx).GetWithExecution(loader)
if err != nil {
	fmt.Printf(" first snapshot error: %v\n", err)
} else {
	fmt.Printf(" first snapshot source: %s\n", first.Source)
}

second, err := cacheExecutor.WithContext(ctx).GetWithExecution(loader)
if err != nil {
	fmt.Printf(" second snapshot error: %v\n", err)
} else {
	fmt.Printf(" second snapshot source: %s\n", second.Source)
}

fmt.Printf(" supplier was called %d time(s)\n", controlPlaneCalls)
Other options include:
- WithKey to set a fixed default cache key on the policy instead of providing one through context.
- cachepolicy.New when the key always comes from the execution context and you do not need builder customization.
Rate Limiter ¶
Rate limiting is a preventive control that limits the number of operations that can be executed over a time window. It is used to protect downstream systems. For example, if a system can only handle 100 requests per second, rate limiting helps you enforce that limit by rejecting or delaying requests that exceed the threshold.
Rate limiting is also a useful pattern when you want to enforce fair usage policies, for example in a public API, or when you want to enforce different pricing tiers. Basic users might be allowed only 100 requests a day, while premium users can make 1,000 requests a day.
The following example allows two immediate executions every 120 milliseconds, logs when capacity is exhausted, and then succeeds again after the refill interval.
- ratelimiter.NewBurstyBuilder[string](2, 120*time.Millisecond) configures the burst size and refill period. Here it allows two executions to go through immediately, then refills the bucket every 120 milliseconds.
- OnRateLimitExceeded runs when an execution is rejected because no capacity is available.
limiter := ratelimiter.NewBurstyBuilder[string](2, 120*time.Millisecond).
	OnRateLimitExceeded(func(event failsafe.ExecutionEvent[string]) {
		fmt.Printf(" rate limited at attempt %d\n", event.Attempts())
	}).
	Build()

executor := failsafe.With(limiter)
for attempt := 1; attempt <= 3; attempt++ {
	result, err := executor.Get(func() (string, error) {
		return fmt.Sprintf("request-%d", attempt), nil
	})
	if err != nil {
		fmt.Printf(" request %d rejected: %v\n", attempt, err)
		continue
	}
	fmt.Printf(" request %d accepted with %s\n", attempt, result)
}

time.Sleep(130 * time.Millisecond)
result, err := executor.Get(func() (string, error) {
	return "request-after-refill", nil
})
if err != nil {
	fmt.Printf(" refill request failed: %v\n", err)
} else {
	fmt.Printf(" refill request accepted with %s\n", result)
}
Other options include:
- NewSmooth, NewSmoothBuilder, and NewSmoothBuilderWithMaxRate to spread executions more evenly instead of allowing bursts.
- NewBursty when you want the limiter directly without extra builder configuration.
- WithMaxWaitTime to wait for capacity instead of failing immediately with ratelimiter.ErrExceeded.
Bulkhead ¶
Bulkhead is very similar to a rate limiter, but instead of limiting the number of executions over time, it limits the number of concurrent executions. Like a rate limiter, a bulkhead helps protect a downstream system from receiving too much traffic and becoming overwhelmed. Bulkheads are especially useful for protecting resources that have limited concurrency, such as a database connection pool or a third-party API with strict concurrency limits.
Requests that exceed the concurrency limit are either rejected immediately or queued briefly until capacity is available.
The following example uses one permit, forces saturation by pre-acquiring that permit, and then shows the request succeeding again after the permit is released.
- bulkhead.NewBuilder[string](1) sets the concurrency limit to one in-flight execution.
- OnFull runs when an execution is rejected because all permits are in use.
gate := bulkhead.NewBuilder[string](1).
	OnFull(func(event failsafe.ExecutionEvent[string]) {
		fmt.Println(" bulkhead is full")
	}).
	Build()

if err := gate.AcquirePermit(context.Background()); err != nil {
	fmt.Printf(" failed to prefill bulkhead: %v\n", err)
	return
}

_, err := failsafe.With(gate).Get(func() (string, error) {
	return "worker-slot-1", nil
})
fmt.Printf(" while saturated: %v\n", err)

gate.ReleasePermit()
result, err := failsafe.With(gate).Get(func() (string, error) {
	return "worker-slot-1", nil
})
if err != nil {
	fmt.Printf(" recovered bulkhead failed: %v\n", err)
} else {
	fmt.Printf(" recovered bulkhead accepted %s\n", result)
}
Other options include:
- WithMaxWaitTime to queue briefly for a permit instead of failing immediately with bulkhead.ErrFull.
- bulkhead.New when you only need a fixed concurrency cap with default behavior.
Adaptive Limiter ¶
An adaptive limiter is a more sophisticated version of a bulkhead that adjusts its concurrency limit based on observed conditions. Instead of having a fixed number of concurrent executions, an adaptive limiter can increase or decrease the limit in response to changes in latency, error rates, or other signals. When an overload is detected, the limiter reduces the concurrency limit, and when conditions improve, it can increase the limit.
The following example starts with a concurrency limit of one, rejects work while the only permit is held, and then allows work again after that permit is released.
- WithLimits sets the minimum, maximum, and initial concurrency limits. Here the limiter starts at one concurrent execution and may grow to three.
- WithRecentWindow configures the min and max durations of the recent sampling window, along with the minimum number of samples that must be collected before adjusting the limit based on recent conditions. Here the limiter looks at the last 50 executions over a window of 1 to 2 seconds to decide whether to adjust the limit.
- OnLimitExceeded runs when the limiter rejects work because current concurrency is already at the computed limit.
limiter := adaptivelimiter.NewBuilder[string]().
	WithLimits(1, 3, 1).
	WithRecentWindow(time.Second, 2*time.Second, 50).
	OnLimitExceeded(func(event failsafe.ExecutionEvent[string]) {
		fmt.Printf(" adaptive limiter rejected attempt %d\n", event.Attempts())
	}).
	Build()

heldPermit, err := limiter.AcquirePermit(context.Background())
if err != nil {
	fmt.Printf(" failed to acquire warm-up permit: %v\n", err)
	return
}
fmt.Printf(" limit=%d inflight=%d queued=%d\n", limiter.Limit(), limiter.Inflight(), limiter.Queued())

_, err = failsafe.With(limiter).Get(func() (string, error) {
	return "background-sync", nil
})
fmt.Printf(" saturated execution: %v\n", err)

heldPermit.Drop()
result, err := failsafe.With(limiter).Get(func() (string, error) {
	return "background-sync", nil
})
if err != nil {
	fmt.Printf(" post-release execution failed: %v\n", err)
} else {
	fmt.Printf(" post-release execution accepted %s\n", result)
}
Other options include:
- WithMaxLimitFactor, WithMaxLimitFactorDecay, WithMaxLimitFunc, and WithMaxLimitStabilizationWindow to control how much headroom the limiter may add above current inflight work.
- WithRecentWindow, WithRecentQuantile, WithBaselineWindow, and WithCorrelationWindow to tune how the limiter detects overload from latency and throughput trends.
- WithQueueing and WithMaxWaitTime to absorb short spikes before rejecting with adaptivelimiter.ErrExceeded.
- BuildPrioritized with adaptivelimiter.NewPrioritizer or NewPrioritizerBuilder when you want prioritized rejection and shared queue calibration across limiters.
- OnLimitChanged and WithLogger for additional instrumentation and debugging.
Adaptive Throttler ¶
Adaptive throttling is similar to circuit breaking, but instead of rejecting all traffic when it detects a problem, it sheds load more gradually by rejecting a percentage of requests based on recent failure rates. It looks at recent outcomes and adjusts the rejection rate over time, allowing some traffic through even when conditions are bad, which can help a struggling dependency recover without being completely cut off.
Use it when a backend is returning too many errors and sending every request through would only deepen the outage. This is a good fit for large traffic flows where partial shedding is better than a binary open-or-closed decision. Avoid this pattern when the dependency is completely down and every request will fail anyway, because a circuit breaker may be simpler. The trade-off is finer-grained load shedding, but it requires good signals and careful tuning.
The following example shows an adaptive throttler that treats HTTP 503 as a failure signal and gradually raises its rejection rate when failures stay above the threshold.
- HandleResult(503) defines which returned results count as failures. Here an HTTP 503 response is treated as overload.
- WithFailureRateThreshold(0.2, 1, time.Minute) sets the failure-rate threshold, the minimum number of executions before throttling begins, and the observation window. In this case, if at least 20% of requests return 503 over a one-minute window, the throttler starts probabilistically rejecting requests.
- WithMaxRejectionRate(1.0) caps how aggressively the throttler may shed traffic. In this example it may reject up to 100% of requests once the rejection rate climbs high enough.
throttler := adaptivethrottler.NewBuilder[int]().
	HandleResult(503).
	WithFailureRateThreshold(0.2, 1, time.Minute).
	WithMaxRejectionRate(1.0).
	Build()

executor := failsafe.With(throttler)
for attempt := 1; attempt <= 8; attempt++ {
	result, err := executor.Get(func() (int, error) {
		return 503, nil
	})
	if errors.Is(err, adaptivethrottler.ErrExceeded) {
		fmt.Printf(" attempt %d rejected with rejection rate %.2f\n", attempt, throttler.RejectionRate())
		return
	}
	fmt.Printf(" attempt %d recorded result %d, rejection rate %.2f\n", attempt, result, throttler.RejectionRate())
}
fmt.Printf(" throttler never rejected within the sample window; final rate %.2f\n", throttler.RejectionRate())
Other options include:
- HandleErrors, HandleErrorTypes, and HandleIf to treat specific errors or predicates as failure signals.
- BuildPrioritized when you want the throttler to reject lower-priority work first.
- OnSuccess and OnFailure to observe the signals that drive throttling decisions.
Composing Policies ¶
In the previous examples, we have seen each policy on its own, but in a real service you often need to combine multiple policies. For example, you might want retries with backoff and jitter, but also a circuit breaker to stop retrying when the dependency is down and a fallback to return a cached value when retries are exhausted or the breaker is open.
failsafe-go makes it very convenient to compose policies. You can use With to combine multiple policies into one executor, and the order of the policies in the With call determines how they interact with each other.
The following example configures a retry, circuit breaker, and fallback policy. Fallback is outermost, retry sits inside it, and the breaker is the innermost policy in that stack.
- failsafe.With composes policies from left to right, making fallbackPolicy outermost and breaker innermost.
- WithContext attaches a request-scoped context to the composed executor.
- OnDone runs after the overall execution completes. In this example it records elapsed time, attempt count, and either the error or the result source.
- GetWithExecution passes the execution object into the protected function so the function can inspect attempt metadata while the policies run.
executor := failsafe.With(fallbackPolicy, retryPolicy, breaker).
	WithContext(ctx).
	OnDone(func(event failsafe.ExecutionDoneEvent[rolloutPlan]) {
		if event.Error != nil {
			fmt.Printf(" done in %s after %d attempts with error=%v\n", event.ElapsedTime().Round(time.Millisecond), event.Attempts(), event.Error)
			return
		}
		fmt.Printf(" done in %s after %d attempts with source=%s\n", event.ElapsedTime().Round(time.Millisecond), event.Attempts(), event.Result.Source)
	})

plan, err := executor.GetWithExecution(first.Fetch)
Other options include:
- Compose to add another same-result-type policy step by step instead of passing everything to one With call.
- WithAny and ComposeAny to mix result-agnostic shared policies, such as a common circuit breaker, with more specific result types.
- Get, Run, and RunWithExecution when you want synchronous execution without returning a value, or you do not need the execution object.
The order matters because outer policies observe the results of inner ones. A good rule is to decide first what the caller should see, then compose from the outside in.
You can find more information about policy composition in the official docs: Policies Overview and Policy Composition.
HTTP Support ¶
failsafe-go provides a convenient way to apply resilience policies to HTTP clients and servers through the failsafehttp package.
On the client side, it provides a RoundTripper that can wrap any existing transport and apply policies to all outgoing requests.
The following example wraps a transport with an HTTP-aware retry policy, then sends a request to a test server that returns two 503 responses before succeeding.
- failsafehttp.NewRetryPolicyBuilder creates an HTTP-aware retry builder that already knows about common retryable transport errors, retryable HTTP responses, and Retry-After headers.
- WithBackoff adds exponential backoff between retries.
- OnRetryScheduled records the response status, next attempt number, and delay before the retry is sent.
- failsafehttp.NewRoundTripper attaches the policy to the client transport so the behavior is centralized.
var attempts atomic.Int32
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	attempt := attempts.Add(1)
	if attempt <= 2 {
		w.WriteHeader(http.StatusServiceUnavailable)
		_, _ = w.Write([]byte("control plane warming up"))
		return
	}
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte("configuration applied"))
}))
defer server.Close()

retryPolicy := failsafehttp.NewRetryPolicyBuilder().
	WithBackoff(20*time.Millisecond, 80*time.Millisecond).
	OnRetryScheduled(func(event failsafe.ExecutionScheduledEvent[*http.Response]) {
		status := 0
		if resp := event.LastResult(); resp != nil {
			status = resp.StatusCode
		}
		fmt.Printf(" retrying outbound request: status=%d next-attempt=%d delay=%s\n", status, event.Attempts()+1, event.Delay)
	}).
	Build()

client := &http.Client{
	Transport: failsafehttp.NewRoundTripper(nil, retryPolicy),
}

req, err := http.NewRequest(http.MethodGet, server.URL, nil)
if err != nil {
	fmt.Printf(" request build failed: %v\n", err)
	return
}

resp, err := client.Do(req)
if err != nil {
	fmt.Printf(" outbound request failed: %v\n", err)
	return
}
defer resp.Body.Close()

body, err := io.ReadAll(resp.Body)
if err != nil {
	fmt.Printf(" response read failed: %v\n", err)
	return
}
fmt.Printf(" final HTTP status=%d body=%q attempts=%d\n", resp.StatusCode, string(body), attempts.Load())
Other options include:
- The full retry builder surface, such as HandleIf, HandleErrors, HandleResult, WithMaxAttempts, WithMaxRetries, WithDelayFunc, OnRetry, OnRetriesExceeded, AbortOnErrors, and ReturnLastFailure.
- failsafehttp.NewRequest when you want to wrap a single request and client instead of installing a transport-wide round tripper.
- failsafehttp.DelayFunc when you want another delay-capable policy, such as a circuit breaker, to honor Retry-After headers.
- failsafehttp.NewRoundTripperWithLevel and failsafehttp.NewHandlerWithLevel when you need to propagate priorities through HTTP.
On the server side, you can use failsafehttp.NewHandler to wrap an existing http.Handler with policies that protect your endpoints. This is useful for applying timeouts, bulkheads, or adaptive limiters to incoming requests.
The following example wraps a slow handler with a timeout policy, uses httptest to exercise it locally, and prints the final status and body.
- failsafehttp.NewHandler wraps an existing http.Handler with one or more policies.
- timeout.NewBuilder[*http.Response] creates a server-side timeout for the whole request-handling execution.
- The handler listens on r.Context().Done() so canceled work stops early instead of continuing in the background.
timeoutPolicy := timeout.NewBuilder[*http.Response](150 * time.Millisecond).Build()

handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	select {
	case <-time.After(250 * time.Millisecond):
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("inventory refreshed"))
	case <-r.Context().Done():
		fmt.Printf("handler canceled: %v\n", r.Context().Err())
	}
})

protected := failsafehttp.NewHandler(handler, timeoutPolicy)
server := httptest.NewServer(protected)
defer server.Close()

resp, err := http.Get(server.URL)
if err != nil {
	fmt.Printf("request failed: %v\n", err)
	return
}
defer resp.Body.Close()

body, err := io.ReadAll(resp.Body)
if err != nil {
	fmt.Printf("read failed: %v\n", err)
	return
}
fmt.Printf("status=%d body=%q\n", resp.StatusCode, strings.TrimSpace(string(body)))
Other options include:
- failsafehttp.NewHandlerWithExecutor when you want to build the executor once and reuse it across handlers.
- failsafehttp.NewHandlerWithLevel when you want to extract request priority or adaptive-limiter level information from headers.
The second argument to NewHandler is variadic, so you can attach multiple policies to the same handler. For example, you could add a bulkhead to limit concurrent requests and a fallback to return a default response when the server is overloaded.
Wrapping Up ¶
Resilience patterns are essential for building robust systems. failsafe-go provides a rich set of policies that you can compose to handle retries, timeouts, load shedding, and more. The key is to understand the trade-offs of each pattern and design your policies around the specific failure modes of your dependencies.