Debugging "no healthy upstream" errors in Istio service mesh

The Istio service mesh offers a multitude of capabilities at network layer 7 (L7) for defining traffic routing, security, and application monitoring in a cloud environment. However, given the complexity of cloud-based networks, the host of components involved, and the difficulty of visualizing the effects of changes made through Istio, it's hard to debug the unpopular "no healthy upstream" error messages that often show up in Envoy logs.

This article offers some pain relief in the form of quick guidance on how to respond to emergency calls demanding a resolution to "no healthy upstream" error messages and related complaints such as "Applications in the mesh are not available" or "Istio is broken."

In my experience, 90% of these issues are caused by configuration problems in either the network or Istio. This article shows some troubleshooting tools you can use to identify such problems quickly, in the context of two recent cases that a Red Hat customer escalated to us.

It's important to understand a few aspects of this customer's architecture. The customer runs services in separate Red Hat OpenShift clusters, some of which are in the customer's own on-premises infrastructure, while others span several countries in the EU region. Each OpenShift cluster has its own instance of Red Hat OpenShift Service Mesh, Red Hat's productized distribution of Istio.

Kubernetes services make both intra-mesh and inter-mesh requests, but in this customer's configuration a service always makes a local call. Integration and routing between services in the different clusters are performed by the mesh via a set of VirtualService, DestinationRule, and ServiceEntry resources that redirect the local call to a remote service.
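
As a rough illustration of this pattern, the following sketch shows a ServiceEntry that registers a remote destination and a VirtualService that redirects calls made to the local service host toward it. The resource names, port, and protocol are illustrative assumptions rather than the customer's actual configuration, which also involved TLS (port 443) and DestinationRule settings not shown here:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: remote-service-destination       # assumed name
spec:
  hosts:
  - remote-service-destination.remote-namespace.ocp4.customdomain.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 80                            # simplified to plain HTTP for clarity
    name: http
    protocol: HTTP
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: redirect-local-to-remote          # assumed name
spec:
  hosts:
  - service-destination.mynamespace.svc.cluster.local
  http:
  - route:
    - destination:
        # Calls to the local host are sent to the remote service instead
        host: remote-service-destination.remote-namespace.ocp4.customdomain.com
        port:
          number: 80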

A duplicate service

In our first real-life example, the customer complained that the service mesh was somehow causing cluster-to-cluster communications to fail, and reported the "no healthy upstream" message.

To identify a problem related to Istio configuration, I always use Istio's Kiali console to visualize the network state and pinpoint where issues are occurring. Kiali allows you to "play back" network behavior, a nice feature that is very helpful if you're dealing with a problem that is not occurring right now. Whether or not I discover the problematic service there, I turn next to checking the logs of the Envoy proxy via either Kiali or OpenShift (using an oc logs <pod_name> -c istio-proxy command). The aim in both cases is to find the service for which the "no healthy upstream" error appears.

In this case, Kiali showed that 95% of the traffic to the destination service, service-destination.mynamespace.svc.cluster.local, was failing. My next resource was the istioctl command, which provides a quick view of the state of the Envoy proxy and whether its configuration has been updated correctly by Istio:

$ istioctl proxy-status
NAME CDS LDS EDS RDS PILOT VERSION
...
service-source-v1-74f955bd84-9lmnf.mynamespace SYNCED SYNCED SYNCED SYNCED istiod-86798869b8-bqw7c 1.5.0
...

Having confirmed from the output that the mesh had kept all the relevant services' Envoy proxies up to date, I then checked the cluster names configured on the Envoy proxy of the client service's pod. I focused only on the clusters related to the outbound service host for which the logs showed the "no healthy upstream" message:

$ istioctl proxy-config cluster -i istio-system service-source-v1-74f955bd84-9lmnf.mynamespace --fqdn service-destination.mynamespace.svc.cluster.local -o json | jq -r .[].name
outbound|80||service-destination.mynamespace.svc.cluster.local
inbound|80|9180-tcp|service-destination.mynamespace.svc.cluster.local
outbound|80|v1|service-destination.mynamespace.svc.cluster.local
outbound|80|v2|service-destination.mynamespace.svc.cluster.local

In the output, I noticed that the Istio configuration had defined two versions (v1 and v2) of the destination service, each with its own Envoy cluster. I then checked the available endpoints for the v2 cluster, outbound|80|v2|service-destination.mynamespace.svc.cluster.local:

$ istioctl proxy-config endpoints service-source-v1-74f955bd84-9lmnf.mynamespace --cluster "outbound|80|v2|service-destination.mynamespace.svc.cluster.local"
ENDPOINT STATUS OUTLIER CHECK CLUSTER
172.17.0.28:9180 HEALTHY OK outbound|80|v2|service-destination.mynamespace.svc.cluster.local
172.17.0.29:9180 HEALTHY OK outbound|80|v2|service-destination.mynamespace.svc.cluster.local

Then I proceeded to check the endpoints for v1. However, for the v1 cluster, outbound|80|v1|service-destination.mynamespace.svc.cluster.local, the mesh had no endpoints, and therefore no pods:

$ istioctl proxy-config endpoints service-source-v1-74f955bd84-9lmnf.mynamespace --cluster "outbound|80|v1|service-destination.mynamespace.svc.cluster.local"
ENDPOINT STATUS OUTLIER CHECK CLUSTER

This misconfiguration caused the "no healthy upstream" errors.

Checking the VirtualService for the destination, I noticed that 5% of the traffic was routed to v2, which agreed with what I had seen in Kiali, while 95% was routed to v1. That explains why the customer saw 95% of requests failing with the "no healthy upstream" message.
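
The following is a minimal sketch of roughly what such a weighted routing setup looks like, with subsets defined in a DestinationRule. The weights and version names come from the behavior described above; the resource names, labels, and overall structure are assumptions for illustration:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: service-destination              # assumed name
spec:
  host: service-destination.mynamespace.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-destination              # assumed name
spec:
  hosts:
  - service-destination.mynamespace.svc.cluster.local
  http:
  - route:
    - destination:
        host: service-destination.mynamespace.svc.cluster.local
        subset: v1                        # no pods matched this subset
      weight: 95
    - destination:
        host: service-destination.mynamespace.svc.cluster.local
        subset: v2
      weight: 5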

All the customer needed to do to fix the problem was to deploy service v1 or update the VirtualService to distribute all requests to v2.

Duplicate Envoy clusters

In a follow-up escalation, the "no healthy upstream" issue came up again. We followed the same troubleshooting approach as in the previous example, but in this case there was no VirtualService.

We saw multiple ServiceEntry definitions for multiple country destinations. We found it puzzling that requests to all country destinations, apart from the one reported, were routed correctly. The following check verified that all clusters were reported as healthy except remote-service-destination.remote-namespace.ocp4.customdomain.com:

$ oc exec istio-egressgateway-6567f7d756-4gvh8 -- curl localhost:15000/clusters | egrep 'health|remote-service-destination.remote-namespace.ocp4.customdomain.com'

In this case, I looked in the output for an entry like:

outbound|443||remote-service-destination.remote-namespace.ocp4.customdomain.com::10.128.2.21:443::health_flags::/failed_active_hc

Having verified that the cluster was unhealthy, I turned to Istiod to check whether any errors were being reported against this cluster. The Istiod pod's logs reported:

2022-05-13T12:27:51.262316Z info ads Push finished: 5.709679819s {
  "ProxyStatus": {
    "pilot_duplicate_envoy_clusters": {
      "outbound|443||remote-service-destination.remote-namespace.ocp4.customdomain.com": {
        "proxy": "e2e-871-remote-namespace-c58f7f7f6-vljr6.e2e-871-remote-namespace",
        "message": "Duplicate cluster outbound|443||remote-service-destination.remote-namespace.ocp4.customdomain.com found while pushing CDS"
      }
    },

This output indicates that Istiod was trying to apply a cluster configuration for which there was a duplicate. This information prompted me to check the applied ServiceEntry resources, which quickly revealed that there had been duplicate definitions for remote-service-destination.remote-namespace.ocp4.customdomain.com.
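
To illustrate, two ServiceEntry resources like the following, applied in the same mesh and declaring the same host and port, can trigger the pilot_duplicate_envoy_clusters warning. The resource names and port details here are assumptions for the sake of the example, not the customer's actual resources:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: remote-destination-primary       # assumed name, applied by one team
spec:
  hosts:
  - remote-service-destination.remote-namespace.ocp4.customdomain.com
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: remote-destination-duplicate     # assumed name, later definition for the same host
spec:
  hosts:
  - remote-service-destination.remote-namespace.ocp4.customdomain.com
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS

Removing or consolidating the redundant definition lets Istiod push a single, consistent cluster configuration.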

Istio administrative tools reveal the source of errors

A "no healthy upstream" error can be caused by Istio, a misconfigured network device, or actual network outages. Thus, it is difficult for the mesh operations team to pinpoint the cause or even predict its occurrence, because DevOps teams may unwittingly apply an incorrect configuration. This article provided guidance on how to establish the cause of such errors, determining whether they are or are not due to Istio configuration.

Even greater benefits can be realized when, as with the customer in these examples, the operations team applies observability monitoring and alerting for such occurrences, so that it can become aware of an issue early and inform the relevant teams before an escalation occurs.
