Implement & monitor circuit breakers in OpenShift Service Mesh

In a distributed microservices architecture, the failure of one service can cascade, leading to system-wide outages. To build resilient and fault-tolerant applications, we must isolate failures and prevent them from spreading. The circuit breaker is a critical design pattern that addresses this challenge by temporarily blocking traffic to a service that it detects as unhealthy, giving it time to recover.

This guide shows you how to configure, trigger, and monitor a circuit breaker using Red Hat OpenShift Service Mesh 3.0. By the end, you'll have a hands-on understanding of how to use Istio's outlier detection to automatically improve your application's stability on OpenShift.

Prerequisites

Before you begin, ensure your environment is fully prepared. This guide assumes you have the following setup:

An OpenShift Container Platform cluster: You will need access to a cluster running version 4.16 or newer with administrator privileges.
Command-line tools: The OpenShift CLI (oc) and Kubernetes CLI (kubectl) must be installed and configured to connect to your cluster.
Red Hat OpenShift Service Mesh and the Bookinfo sample application: You need a project (for example, bookinfo) where the OpenShift Service Mesh control plane is installed and the Bookinfo sample application is deployed. The application's product page must be accessible via the Istio ingress gateway.
If you need to set this up, follow the official Red Hat documentation to install OpenShift Service Mesh and deploy the Bookinfo application. (Complete sections 2.1 through 2.5.3 of the tutorial.)
Kiali for monitoring: This tutorial uses Kiali to visualize the circuit breaker's status. Ensure you have configured access to the Kiali console.
To set this up, follow the official documentation to expose and access the Kiali console. (Complete sections 4.1.1 through 4.1.3).

Step-by-step instructions

Follow these steps to deploy the application, configure the circuit breaker, and monitor the results.

Step 1: Preparation

First, verify that the Bookinfo application is running correctly and that all pods are in a Running state.

oc get pods -n bookinfo

You should see output similar to this, with pods for productpage, details, ratings, and three versions of reviews, along with the istio-ingressgateway pod.

NAME                                   READY   STATUS    RESTARTS      AGE
details-v1-7c799b8b4b-7npbl            2/2     Running   0             9d
istio-ingressgateway-7bb7fb8fd-8sbxr   1/1     Running   0             9d
productpage-v1-f8479c768-s72st         2/2     Running   0             9d
ratings-v1-7fccfc8b8b-dr6xp            2/2     Running   4 (9d ago)    18d
reviews-v1-8cc49957f-gswj6             2/2     Running   0             9d
reviews-v2-5bf9856f5c-bcswn            2/2     Running   0             9d
reviews-v3-6d8f75d44c-fqmzf            2/2     Running   3 (17h ago)   17h
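
Instead of reading the pod list by eye, you can ask the cluster to block until every pod reports Ready. The following is a minimal sketch, not part of the original procedure; the function name wait_for_pods and the 120-second default timeout are illustrative assumptions.

```shell
# Block until every pod in the namespace reports the Ready condition,
# or return non-zero when the timeout expires.
# Args: namespace (default: bookinfo), timeout (default: 120s, an assumed value)
wait_for_pods() {
  local ns="${1:-bookinfo}" timeout="${2:-120s}"
  oc wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}

# Usage: wait_for_pods bookinfo 180s
```

A non-zero exit status means at least one pod did not become Ready in time, which is worth resolving before you continue.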

Generate load and inspect the Kiali graph for traffic (see Figure 1).

while true; do
  echo "$(date) - Status: $(curl -s -o /dev/null -w '%{http_code}' http://istio-ingressgateway-bookinfo.<yourdomainName>/productpage)"
  sleep 1
done

Figure 1: Kiali graph showing the traffic flow for bookinfo.

Step 2: Configure the circuit breaker

Circuit breaking is configured in Istio using a DestinationRule. We will apply a policy that monitors the reviews service. Specifically, we'll target the v3 subset. If an instance in this subset returns even a single 5xx error, the Envoy proxy will "eject" it from the load-balancing pool for 300 seconds.

Apply the following DestinationRule manifest:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
  namespace: bookinfo
spec:
  host: reviews
  subsets:
  - labels:
      version: v1
    name: v1
    trafficPolicy:
      loadBalancer:
        simple: ROUND_ROBIN
  - labels:
      version: v2
    name: v2
    trafficPolicy:
      loadBalancer:
        simple: RANDOM
  - labels:
      version: v3
    name: v3
    trafficPolicy:
      connectionPool:
        http:
          http1MaxPendingRequests: 1
          maxRequestsPerConnection: 1
        tcp:
          maxConnections: 1
      outlierDetection:
        baseEjectionTime: 300s
        consecutive5xxErrors: 1
        interval: 1s
        maxEjectionPercent: 100

consecutive5xxErrors: 1: Ejects an instance after a single 5xx response.
interval: 1s: How often Envoy runs the ejection analysis.
baseEjectionTime: 300s: The minimum time an instance stays ejected (300 seconds).
maxEjectionPercent: 100: Allows every instance in the subset to be ejected at the same time.
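
To see why the ejection lasts 300 seconds in this tutorial, it helps to know how Envoy derives the actual ejection time: roughly the base ejection time multiplied by the number of times the host has been ejected, capped by a maximum ejection time. The sketch below is a simplified model, assuming Envoy's documented default cap of 300 seconds and ignoring jitter and the gradual reduction of the multiplier for hosts that stay healthy. The function name ejection_seconds is hypothetical.

```shell
# Simplified model of Envoy's ejection duration, in seconds.
# Args: base ejection time, number of ejections so far, cap (assumed default: 300)
ejection_seconds() {
  local base="$1" count="$2" max="${3:-300}"
  # The cap is never smaller than the base ejection time.
  if [ "$max" -lt "$base" ]; then max="$base"; fi
  local t=$(( base * count ))
  if [ "$t" -gt "$max" ]; then t="$max"; fi
  echo "$t"
}

ejection_seconds 300 1   # tutorial setting, first ejection: prints 300
ejection_seconds 300 3   # tutorial setting, third ejection: prints 300 (capped)
ejection_seconds 30 3    # a smaller base of 30s, third ejection: prints 90
```

With baseEjectionTime set to 300s, the first ejection already reaches the assumed cap, so every ejection in this tutorial lasts about 300 seconds.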

Next, we will update our traffic policy to route requests only to the v1 and v3 versions of the reviews service. Apply the following VirtualService manifest to implement this rule:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
  namespace: bookinfo
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 50
    - destination:
        host: reviews
        subset: v2
      weight: 0
    - destination:
        host: reviews
        subset: v3
      weight: 50
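
One way to apply both manifests and confirm that they exist is sketched below. It assumes you saved them locally as destinationrule-reviews.yaml and virtualservice-reviews.yaml (hypothetical file names) and that server-generated metadata fields such as uid and resourceVersion are not present in the files. The function name apply_reviews_policy is also illustrative.

```shell
# Apply the DestinationRule and VirtualService, then list both resources.
# Args: namespace (default: bookinfo)
# The file names below are assumptions; adjust them to match your files.
apply_reviews_policy() {
  local ns="${1:-bookinfo}"
  oc apply -n "$ns" \
    -f destinationrule-reviews.yaml \
    -f virtualservice-reviews.yaml || return 1
  # Confirm that both resources named "reviews" now exist.
  oc get destinationrule/reviews virtualservice/reviews -n "$ns"
}

# Usage: apply_reviews_policy bookinfo
```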

Step 3: Enable detailed circuit breaker metrics

By default, Red Hat OpenShift Service Mesh collects a minimal set of statistics from its Envoy proxies to reduce resource consumption and improve performance. The specific metric we need to monitor our circuit breaker, envoy_cluster_outlier_detection_ejections_active, is not included in this default set. Refer to the documentation for more details.

To enable it, we must add an annotation to our application's pods. This annotation instructs the Envoy sidecar to include additional metrics, specifically those related to outbound cluster statistics.

The following is an abbreviated example of the reviews-v3 deployment manifest, modified to include the required annotation. Note the new annotations section under spec.template.metadata.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: reviews
    version: v3
  name: reviews-v3
  namespace: bookinfo
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: reviews
      version: v3
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      # --- ANNOTATION ADDED HERE ---
      annotations:
        proxy.istio.io/config: |
          proxyStatsMatcher:
            inclusionPrefixes:
            - "cluster.outbound"
            - "cluster_manager"
            - "listener_manager"
            - "server"
            - "cluster.xds-grpc"
      # -----------------------------
      creationTimestamp: null
      labels:
        app: reviews
        version: v3
    spec:
     ........

Important: For this tutorial, the annotation must be applied to all deployments involved in the requests (productpage, reviews-v1, reviews-v2, reviews-v3).
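
Editing four deployments by hand is repetitive, so you may prefer to patch them in a loop. The following is a sketch under a few assumptions: the deployment names (productpage-v1, reviews-v1, reviews-v2, reviews-v3) are inferred from the pod names shown in Step 1, the helper name annotate_proxy_stats is hypothetical, and oc patch is given a YAML merge patch. Changing the pod template triggers a rollout, so the new pods start with the annotation in place.

```shell
# Merge patch that adds the proxy stats annotation to the pod template.
PATCH=$(cat <<'EOF'
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          proxyStatsMatcher:
            inclusionPrefixes:
            - "cluster.outbound"
            - "cluster_manager"
            - "listener_manager"
            - "server"
            - "cluster.xds-grpc"
EOF
)

# Patch every named deployment in the given namespace.
# Args: namespace, then one or more deployment names
annotate_proxy_stats() {
  local ns="$1" deploy
  shift
  for deploy in "$@"; do
    oc patch deployment "$deploy" -n "$ns" --type merge -p "$PATCH" || return 1
  done
}

# Usage:
# annotate_proxy_stats bookinfo productpage-v1 reviews-v1 reviews-v2 reviews-v3
```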

Step 4: Generate traffic

With the configuration in place, generate a steady stream of user traffic. Run the following loop in a new terminal window. It sends a request to /productpage every second and prints the HTTP status code of each response.

while true; do
  echo "$(date) - Status: $(curl -s -o /dev/null -w '%{http_code}' http://istio-ingressgateway-bookinfo.<yourdomainName>/productpage)"
  sleep 1
done
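
If you want a summary rather than a scrolling log, a bounded variant of the loop can tally the status codes. This is an optional sketch; the function names summarize_statuses and probe, and the default of 30 requests, are illustrative assumptions.

```shell
# Tally status codes read from stdin (one per line) as "code=count" lines.
summarize_statuses() {
  sort | uniq -c | awk '{ printf "%s=%s\n", $2, $1 }'
}

# Send N requests to a URL, one per second, then print the tally.
# Args: URL, number of requests (default: 30, an assumed value)
probe() {
  local url="$1" n="${2:-30}" i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
    sleep 1
    i=$(( i + 1 ))
  done | summarize_statuses
}

# Usage:
# probe "http://istio-ingressgateway-bookinfo.<yourdomainName>/productpage" 60
```

Comparing the tally before and after the failure in the next step gives a quick picture of how many requests were affected.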

In the Kiali service graph (Figure 2), observe the traffic flow. You will see that requests to the reviews service are split between the v1 and v3 versions, and that only the reviews:v3 workload calls the ratings service.

Figure 2: Kiali graph showing the traffic flow for Bookinfo using only the reviews v1 and v3 versions.

Step 5: Simulate a service failure

Now, we will deliberately cause the reviews:v3 service to fail. This will generate the 5xx errors needed to trip the circuit breaker we configured earlier. A direct way to simulate a critical failure is to terminate the main process within the container, causing the pod to crash and become temporarily unavailable.

Execute the kill 1 command inside the reviews container of that specific pod. This command terminates the main application process, causing the container to exit with an error. See Figure 3.

oc exec -n bookinfo reviews-v3-6d8f75d44c-fqmzf -c reviews -- kill 1

Figure 3: OpenShift Container Platform pod terminal used to execute the kill command.

Immediately after running this command, look at your traffic generation terminal from Step 4. You will see the output change from 200 to 503 (Service Unavailable) as the Envoy proxy attempts to route requests to the now-unresponsive pod.

OpenShift will automatically restart the crashed pod, but during this failure window, our circuit breaker will detect the 5xx errors and trip.
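
The pod name in the oc exec command is specific to one cluster, so yours will differ. A sketch that looks up the pod by its labels (app=reviews and version=v3, as set in the deployment manifest) might look like the following. The function names reviews_v3_pod and crash_reviews_v3 are hypothetical.

```shell
# Print the name of the first reviews v3 pod in the namespace.
# Args: namespace (default: bookinfo)
reviews_v3_pod() {
  oc get pods -n "${1:-bookinfo}" -l app=reviews,version=v3 \
    -o jsonpath='{.items[0].metadata.name}'
}

# Terminate the main process in the reviews container of that pod.
# Args: namespace (default: bookinfo)
crash_reviews_v3() {
  local ns="${1:-bookinfo}" pod
  pod="$(reviews_v3_pod "$ns")" || return 1
  if [ -z "$pod" ]; then
    echo "no reviews v3 pod found in $ns" >&2
    return 1
  fi
  oc exec -n "$ns" "$pod" -c reviews -- kill 1
}

# Usage: crash_reviews_v3 bookinfo
```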

Step 6: Monitor the circuit breaker in the console

While the traffic generation script is running, let's observe the circuit breaker in action.

Navigate to the Observe → Metrics section in your OpenShift web console.
In the Expression field of the PromQL UI, enter the following query. It returns every Envoy cluster in the bookinfo namespace that currently has at least one ejected host; in this tutorial, that is the reviews v3 cluster.

envoy_cluster_outlier_detection_ejections_active{namespace='bookinfo'} >0

Select Run queries.

You should see a graph where the value is 1, as shown in Figure 4. This indicates that the single instance of reviews:v3 has been ejected. The value will periodically drop to 0 for a brief moment before returning to 1 as the 300-second ejection period expires and the circuit is immediately re-tripped by the next failed request.

Figure 4: Metrics showing the active outlier detection ejections.
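
You can also cross-check the console metric against the raw Envoy statistics. Outlier detection is enforced by the calling sidecar, so the counters for the reviews clusters live in the productpage pod's proxy. The sketch below assumes the productpage pod carries the label app=productpage, that the sidecar container is named istio-proxy, and that pilot-agent is available inside it. The function names are hypothetical.

```shell
# Keep only outlier detection counters for the reviews clusters.
filter_outlier_stats() {
  grep 'outlier_detection' | grep 'reviews'
}

# Dump Envoy stats from the productpage sidecar and filter them.
# Args: namespace (default: bookinfo)
show_outlier_stats() {
  local ns="${1:-bookinfo}" pod
  pod="$(oc get pods -n "$ns" -l app=productpage \
    -o jsonpath='{.items[0].metadata.name}')" || return 1
  oc exec -n "$ns" "$pod" -c istio-proxy -- \
    pilot-agent request GET stats | filter_outlier_stats
}

# Usage: show_outlier_stats bookinfo
```

While the circuit is open, the ejections_active counter for the v3 cluster should be non-zero, which matches the value of 1 described for the console graph.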

In the Kiali service graph, you can now see the circuit breaker in action. Traffic to the reviews service is being routed exclusively to the healthy v1 version. The path to reviews:v3 shows an open circuit breaker, and no traffic is flowing to it or its downstream ratings service. See Figure 5.

Figure 5: Kiali graph showing the circuit breaker is open for the reviews:v3.

You will now observe that requests routed to the reviews:v1 service succeed without issue (Figure 6).

Figure 6: Bookinfo page showing success for v1 review service.

Any traffic intended for reviews:v3 will result in an error, as shown in Figure 7. This happens because the circuit breaker is active for its 300-second ejection period, blocking calls to the v3 pod even though it is running.

Figure 7: Bookinfo page showing error for v3 review service.

Understanding key outlier detection metrics

While ejections_active is perfect for seeing the real-time state, Envoy provides a rich set of metrics for a deeper understanding of your circuit breaker's behavior. According to the official Envoy proxy documentation, these statistics give you a more complete picture for monitoring and tuning.

Wrap up

Congratulations! You have successfully configured a circuit breaker for a microservice, simulated a failure by terminating the application process, and monitored the circuit's state in real time using metrics in the OpenShift console. This powerful resilience pattern is a fundamental tool for building robust, self-healing applications with Red Hat OpenShift Service Mesh.
