Troubleshooting "no healthy upstream" errors in Istio service mesh

December 23, 2022
Stelios Kousouris
Related topics:
Kubernetes, Microservices, Service Mesh
Related products:
Red Hat OpenShift Service Mesh


    Istio service mesh offers a multitude of solutions at network layer 7 (L7) to define traffic routing, security, and application monitoring in a cloud environment. However, given the complexity of cloud-based networks, the host of devices involved, and the difficulty of visualizing the changes Istio actually applies, it's hard to debug the unpopular "no healthy upstream" error messages that often show up in Envoy logs.

    This article attempts some pain relief in the form of quick guidance on how to respond to emergency calls demanding a resolution to "no healthy upstream" error messages and related errors such as "Applications in the Mesh are not available" or "Istio is broken."

    In my experience, 90% of these issues are caused by configuration problems in either the network or Istio. This article shows some troubleshooting tools you can use to identify such problems quickly, in the context of two recent cases that a Red Hat customer escalated to us.

    It's important to understand a few aspects of this customer's architecture. The customer is running services in separate Red Hat OpenShift clusters, some of which are in the customer's own on-premises infrastructure, while others span several countries in the EU region. Each OpenShift cluster has its own instance of a Red Hat OpenShift Service Mesh, Red Hat's productized Istio service.

    Kubernetes services make both intramesh and intermesh requests. But a service in this customer's configuration always makes a local call. Integration and routing between services in the different clusters are performed by the mesh via a set of VirtualService, DestinationRule, and ServiceEntry resources that redirect the local call to a remote service.

    A duplicate service

    In our first real-life example, the customer complained that the service mesh somehow was causing cluster-to-cluster communications to fail, and reported the "no healthy upstream" message.

    To identify a problem related to Istio configuration, I always use Istio's Kiali console to visualize the network state and pinpoint where issues are occurring. Kiali allows you to "play back" network behavior, a feature that is very helpful if you're dealing with a problem that is not occurring right now. Whether or not I discover the problematic service, I next check the logs of the Envoy proxy via either Kiali or OpenShift (using an oc logs <pod_name> -c istio-proxy command). The aim in both cases is to find the service for which the "no healthy upstream" error appears.

    In this case, Kiali showed that 95% of the traffic to the destination service service-destination.mynamespace.svc.cluster.local was failing. My next resource was the istioctl command, which can provide a quick view of the state of the Envoy proxy and whether its configuration was updated correctly by Istio:

    $ istioctl proxy-status
    NAME CDS LDS EDS RDS PILOT VERSION
    ...
    service-source-v1-74f955bd84-9lmnf.mynamespace SYNCED SYNCED SYNCED SYNCED istiod-86798869b8-bqw7c 1.5.0
    ...
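
    When many proxies are listed, scanning the columns by eye is error-prone. The following is a minimal sketch, not part of the original investigation, that flags rows whose CDS, LDS, EDS, or RDS column is not SYNCED. The helper name find_unsynced and the second sample row (example-stale-pod) are hypothetical, and the parser assumes whitespace-separated columns as shown above, so a multi-word status such as NOT SENT would need extra handling.

```python
# Hypothetical helper: flag proxies whose xDS columns are not all SYNCED.
# Assumes the whitespace-separated column layout shown in the article.

XDS_COLUMNS = ("CDS", "LDS", "EDS", "RDS")


def find_unsynced(proxy_status_text):
    """Return {proxy_name: [columns that are not SYNCED]} for rows needing attention."""
    lines = [ln for ln in proxy_status_text.splitlines() if ln.strip()]
    header = lines[0].split()
    unsynced = {}
    for row in lines[1:]:
        fields = row.split()
        if len(fields) < len(XDS_COLUMNS) + 1:
            continue  # skip elided or malformed rows such as "..."
        status = dict(zip(header[1:], fields[1:]))
        bad = [col for col in XDS_COLUMNS if status.get(col) != "SYNCED"]
        if bad:
            unsynced[fields[0]] = bad
    return unsynced


# The first data row comes from the article; the second is an invented example.
SAMPLE = """\
NAME CDS LDS EDS RDS PILOT VERSION
...
service-source-v1-74f955bd84-9lmnf.mynamespace SYNCED SYNCED SYNCED SYNCED istiod-86798869b8-bqw7c 1.5.0
example-stale-pod.mynamespace SYNCED SYNCED STALE SYNCED istiod-86798869b8-bqw7c 1.5.0
"""

print(find_unsynced(SAMPLE))  # {'example-stale-pod.mynamespace': ['EDS']}
```

    An empty result corresponds to the situation in this case, where every relevant proxy was in sync and the investigation could move on to the proxy's own configuration.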

    Getting confirmation from the output that the mesh managed to keep all relevant service Envoy proxies up to date, I then checked the cluster names configured on the Envoy proxy of the client service pod. I focused only on the clusters related to the outbound service host for which logs showed the "no healthy upstream" message:

    $ istioctl proxy-config cluster -i istio-system service-source-v1-74f955bd84-9lmnf.mynamespace --fqdn service-destination.mynamespace.svc.cluster.local -o json | jq -r '.[].name'
    outbound|80||service-destination.mynamespace.svc.cluster.local
    inbound|80|9180-tcp|service-destination.mynamespace.svc.cluster.local
    outbound|80|v1|service-destination.mynamespace.svc.cluster.local
    outbound|80|v2|service-destination.mynamespace.svc.cluster.local
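
    Each name in this output packs four pipe-separated fields: direction, port, subset (or port name for inbound clusters), and host. As a rough illustration, the sketch below splits the names listed above and collects the outbound subsets. The EnvoyCluster type and parse_cluster_name function are names invented for this example; special clusters whose names contain no pipes are out of scope and would raise a ValueError.

```python
from typing import NamedTuple


class EnvoyCluster(NamedTuple):
    direction: str
    port: int
    subset: str  # empty for the default cluster; a port name for inbound clusters
    host: str


def parse_cluster_name(name):
    """Split a 'direction|port|subset|host' cluster name into its four fields."""
    direction, port, subset, host = name.split("|")
    return EnvoyCluster(direction, int(port), subset, host)


# Cluster names taken from the output above.
NAMES = [
    "outbound|80||service-destination.mynamespace.svc.cluster.local",
    "inbound|80|9180-tcp|service-destination.mynamespace.svc.cluster.local",
    "outbound|80|v1|service-destination.mynamespace.svc.cluster.local",
    "outbound|80|v2|service-destination.mynamespace.svc.cluster.local",
]

subsets = sorted(
    c.subset
    for c in map(parse_cluster_name, NAMES)
    if c.direction == "outbound" and c.subset
)
print(subsets)  # ['v1', 'v2']
```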

    In the output, I noticed that the Istio configuration had defined two versions (v1 and v2) of the destination service, each with its own Envoy cluster, for example outbound|80|v2|service-destination.mynamespace.svc.cluster.local. I then checked the available endpoints for the v2 cluster:

    $ istioctl proxy-config endpoints service-source-v1-74f955bd84-9lmnf.mynamespace --cluster "outbound|80|v2|service-destination.mynamespace.svc.cluster.local"
    ENDPOINT STATUS OUTLIER CHECK CLUSTER
    172.17.0.28:9180 HEALTHY OK outbound|80|v2|service-destination.mynamespace.svc.cluster.local
    172.17.0.29:9180 HEALTHY OK outbound|80|v2|service-destination.mynamespace.svc.cluster.local

    Then I proceeded to check the endpoints for v1. However, for the v1 cluster, outbound|80|v1|service-destination.mynamespace.svc.cluster.local, the mesh had no endpoints, and therefore no pods:

    $ istioctl proxy-config endpoints service-source-v1-74f955bd84-9lmnf.mynamespace --cluster "outbound|80|v1|service-destination.mynamespace.svc.cluster.local"
    ENDPOINT STATUS OUTLIER CHECK CLUSTER

    This misconfiguration caused the "no healthy upstream" errors.

    Checking the VirtualService for the destination, I noticed that it routed 5% of the traffic to v2 and 95% to v1. This split agreed with what I had seen in Kiali, and it explains why the customer saw 95% of requests fail with the "no healthy upstream" message: every request routed to v1 reached an Envoy cluster with no endpoints.

    All the customer needed to do to fix the problem was to deploy service v1 or update the VirtualService to distribute all requests to v2.
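
    The arithmetic behind the 95% failure rate can be sketched as follows. This is an illustrative calculation only: the function unroutable_share is a hypothetical helper, the weights are the 95/5 split from the VirtualService described above, and the endpoint counts are the ones reported by the istioctl proxy-config endpoints checks (two for v2, none for v1).

```python
def unroutable_share(route_weights, healthy_endpoints):
    """Percentage of traffic routed to subsets that have no healthy endpoints."""
    total = sum(route_weights.values())
    dead = sum(
        weight
        for subset, weight in route_weights.items()
        if healthy_endpoints.get(subset, 0) == 0
    )
    return 100 * dead / total


weights = {"v1": 95, "v2": 5}    # VirtualService route weights from this case
endpoints = {"v1": 0, "v2": 2}   # healthy endpoints seen per subset

print(unroutable_share(weights, endpoints))       # 95.0
print(unroutable_share({"v2": 100}, endpoints))   # 0.0 once all traffic goes to v2
```

    The second call mirrors one of the two fixes: once the VirtualService sends all requests to v2, no traffic lands on an empty cluster.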

    Duplicate Envoy clusters

    In a follow-up escalation, the "no healthy upstream" issue came up again. We followed the same troubleshooting approach as in the previous example, but in this case there was no VirtualService.

    We saw multiple ServiceEntry definitions for multiple country destinations. We found it puzzling that all country destinations, apart from the one reported, had requests directed correctly. The following check verified that all clusters were reported as healthy except remote-service-destination.remote-namespace.ocp4.customdomain.com:

    $ oc exec istio-egressgateway-6567f7d756-4gvh8 -- curl localhost:15000/clusters | grep -E 'health|remote-service-destination.remote-namespace.ocp4.customdomain.com'

    In the output, I looked for an entry like:

    outbound|443||remote-service-destination.remote-namespace.ocp4.customdomain.com::10.128.2.21:443::health_flags::/failed_active_hc
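
    Entries of this form follow a cluster::address::health_flags::flags layout, so filtering them can be automated. The sketch below is a hedged example rather than the procedure used in the escalation: unhealthy_hosts is a hypothetical helper, the second and third sample lines are invented for illustration, and the code assumes that a host with no failed checks reports the value healthy.

```python
def unhealthy_hosts(clusters_text):
    """Return (cluster, address, flags) for hosts whose health_flags are not 'healthy'."""
    marker = "::health_flags::"
    results = []
    for line in clusters_text.splitlines():
        if marker not in line:
            continue  # ignore statistics lines that carry no health information
        left, flags = line.strip().split(marker, 1)
        cluster, address = left.split("::", 1)
        if flags != "healthy":
            results.append((cluster, address, flags))
    return results


# First line from the article; the other two are invented examples.
SAMPLE = """\
outbound|443||remote-service-destination.remote-namespace.ocp4.customdomain.com::10.128.2.21:443::health_flags::/failed_active_hc
outbound|443||example-other-destination.remote-namespace.ocp4.customdomain.com::10.128.2.22:443::health_flags::healthy
outbound|443||example-other-destination.remote-namespace.ocp4.customdomain.com::10.128.2.22:443::cx_active::0
"""

for cluster, address, flags in unhealthy_hosts(SAMPLE):
    print(cluster, address, flags)
```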

    Having verified that the cluster was unhealthy, I turned to Istiod to check whether any errors were being reported against this cluster. The Istiod pod reported:

    2022-05-13T12:27:51.262316Z info ads Push finished: 5.709679819s {
      "ProxyStatus": {
        "pilot_duplicate_envoy_clusters": {
          "outbound|443||remote-service-destination.remote-namespace.ocp4.customdomain.com": {
            "proxy": "e2e-871-remote-namespace-c58f7f7f6-vljr6.e2e-871-remote-namespace",
            "message": "Duplicate cluster outbound|53||remote-service-destination.remote-namespace.ocp4.customdomain.com found while pushing CDS"
          }
        },

    This output indicates that Istiod was trying to apply a cluster configuration for which there was a duplicate. This information prompted me to check the applied ServiceEntry resources, which quickly revealed duplicate definitions for remote-service-destination.remote-namespace.ocp4.customdomain.com.
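
    A check for duplicated ServiceEntry hosts can be scripted. The following is a minimal sketch under stated assumptions: duplicate_hosts is a hypothetical helper, the resource names and namespaces in the sample are invented, and the input is assumed to be a list of ServiceEntry objects in the shape Kubernetes returns them as JSON (for instance, the items of a JSON listing). It compares hosts only, which is a simplification, because the duplicate reported by Istiod is per host and port.

```python
from collections import defaultdict


def duplicate_hosts(service_entries):
    """Map each host declared by more than one ServiceEntry to the declaring resources."""
    owners = defaultdict(list)
    for entry in service_entries:
        meta = entry["metadata"]
        resource = f'{meta["namespace"]}/{meta["name"]}'
        for host in entry["spec"].get("hosts", []):
            owners[host].append(resource)
    return {host: names for host, names in owners.items() if len(names) > 1}


HOST = "remote-service-destination.remote-namespace.ocp4.customdomain.com"

# Invented resources for illustration; only the host name comes from the article.
ENTRIES = [
    {"metadata": {"namespace": "example-ns", "name": "example-entry-a"},
     "spec": {"hosts": [HOST]}},
    {"metadata": {"namespace": "example-ns", "name": "example-entry-b"},
     "spec": {"hosts": [HOST]}},
    {"metadata": {"namespace": "example-ns", "name": "example-entry-c"},
     "spec": {"hosts": ["example-other-destination.ocp4.customdomain.com"]}},
]

print(duplicate_hosts(ENTRIES))
```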

    Istio administrative tools reveal the source of errors

    A "no healthy upstream" error can be caused by Istio configuration, a misconfigured network device, or an actual network outage. It is therefore difficult for the mesh operations team to pinpoint the cause or predict its occurrence, especially because DevOps teams may unwittingly apply an incorrect configuration. This article provided guidance on how to establish the cause of such errors and determine whether they are due to Istio configuration.

    Even greater benefits can be realized when, as in the case of the customer in this example, the operations team applies observability monitoring and alerts against such occurrences, so that the team learns of the issue early and can inform the relevant teams before an escalation occurs.

    Last updated: September 20, 2023
