Build a dynamic E2E test quarantine system with Prometheus and Grafana

June 29, 2026
Denis Moskalenko
Related topics: Go, Kubernetes, Observability
Related products: Red Hat OpenShift

    If you run end-to-end (E2E) tests on a Kubernetes operator, you've seen the pattern: a test that passes 80% of the time still fails often enough to block continuous integration (CI), waste developer hours, and train your team to reflexively /retest. Without historical data, you can't distinguish a flaky test from a regression. Without automation, the only remedy is a human noticing and filing a ticket.

    This guide shows you how to build a complete quarantine system backed by a Prometheus-compatible time-series database and Grafana, running on a long-lived cluster that provides continuous observability into your test suite's health.

    What you'll build

    A Grafana dashboard showing per-test health with automated quarantine decisions, Jira ticket creation, and a self-healing feedback loop, all powered by industry-standard Prometheus metrics.

    Prerequisites:

    • A long-lived OpenShift/Kubernetes cluster (or any cluster that stays up)
    • Periodic E2E test runs producing JUnit XML (Prow periodic jobs or scheduled GitHub Actions)
    • helm and kubectl or oc access to the cluster
    • A Jira project for tracking quarantined tests (optional but recommended)

    Step 1: Deploy Prometheus

    Deploy a dedicated Prometheus instance for test analytics. On OpenShift you might already have a cluster monitoring stack, but a separate instance keeps test data isolated and gives you control over retention.

    bash
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus prometheus-community/prometheus \
      --namespace e2e-analytics --create-namespace \
      --set server.retention=90d \
      --set server.persistentVolume.size=20Gi \
      --set server.resources.requests.memory=512Mi \
      --set server.resources.requests.cpu=250m \
      --set alertmanager.enabled=false \
      --set prometheus-node-exporter.enabled=false \
      --set kube-state-metrics.enabled=false \
      --set prometheus-pushgateway.enabled=false \
      --set 'server.extraFlags[0]=web.enable-remote-write-receiver' \
      --set 'serverFiles.prometheus\.yml.storage.tsdb.out_of_order_time_window=720h' \
      --set server.securityContext.runAsNonRoot=true \
      --set server.securityContext.runAsUser=null \
      --set server.securityContext.fsGroup=null \
      --set server.containerSecurityContext.allowPrivilegeEscalation=false \
      --set "server.containerSecurityContext.capabilities.drop={ALL}" \
      --set server.containerSecurityContext.runAsNonRoot=true \
      --set server.containerSecurityContext.seccompProfile.type=RuntimeDefault

    The --web.enable-remote-write-receiver flag enables the remote-write endpoint so our ingester can push data in. The out_of_order_time_window storage config allows the ingester to backfill historical data (required when first loading past results).

    OpenShift SCC note

    The default Prometheus Helm chart sets runAsUser: 65534 and fsGroup: 65534, which are rejected by OpenShift's restricted-v2 SCC. The securityContext overrides in the install command clear those defaults so the container runs with the UID that OpenShift assigns.

    Verify it's running:

    kubectl -n e2e-analytics get pods -l app.kubernetes.io/name=prometheus
    kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &

    Test the query endpoint:

    curl -s 'http://localhost:9090/api/v1/query?query=up'

    The primary endpoints include:

    • Remote-write ingest: http://prometheus-server.e2e-analytics.svc:80/api/v1/write
    • PromQL query: http://prometheus-server.e2e-analytics.svc:80/api/v1/query
    • Range query: http://prometheus-server.e2e-analytics.svc:80/api/v1/query_range

    Alternative (OpenShift)

    If you're on OpenShift 4.x, you can use the built-in user workload monitoring instead. Enable it in the cluster-monitoring-config ConfigMap, and your metrics are automatically available via the Thanos Querier at https://thanos-querier.openshift-monitoring.svc:9091. This gives you Prometheus without deploying anything extra.

    Step 2: Define the metric schema

    Instead of SQL tables, we define Prometheus metrics with labels. This is the interface contract: any component that writes these metrics (GCS scraper, Pushgateway, future sources) is compatible.

    Metrics:

    | Metric name | Type | Labels | Description |
    |---|---|---|---|
    | e2e_test_result | Gauge (0/1) | test, suite, job, build_id, commit_sha, branch | 1 = passed, 0 = failed |
    | e2e_test_duration_seconds | Gauge | test, suite, job, build_id, branch | Test execution duration (seconds) |
    | e2e_test_error_info | Gauge (1) | test, suite, error_category, error_message | Error classification (info metric) |

    Label schema:

    e2e_test_result{
      test="TestOperator/components/group_1/dashboard/validate_config",
      suite="e2e-operator",
      job="periodic-ci-operator-main-e2e",
      build_id="1234567890",
      commit_sha="abc123f",
      branch="main"
    } 0  # 0 = failed, 1 = passed

    Each test execution produces one e2e_test_result sample per test case. The timestamp is the run time. This gives us a time-series of pass/fail per test that PromQL can aggregate over any window.

    Step 3: Build the JUnit ingester

    Create a Go binary that parses JUnit XML, converts results to Prometheus metrics, and pushes them via remote-write to Prometheus.

    go
    package main
    import (
      "bytes"
      "encoding/xml"
      "fmt"
      "net/http"
      "os"
      "time"
      "github.com/golang/snappy"
      "github.com/prometheus/prometheus/prompb"
    )
    // Note: prompb types use gogoproto and have their own Marshal() method.
    // Do NOT use google.golang.org/protobuf/proto; it requires ProtoReflect(),
    // which gogoproto types don't implement.
    type JUnitTestSuite struct {
      XMLName    xml.Name        `xml:"testsuite"`
      Name       string          `xml:"name,attr"`
      Timestamp  string          `xml:"timestamp,attr"`
      TestCases  []JUnitTestCase `xml:"testcase"`
      Properties []Property      `xml:"properties>property"`
    }
    type JUnitTestCase struct {
      Name    string        `xml:"name,attr"`
      Time    float64       `xml:"time,attr"`
      Failure *JUnitFailure `xml:"failure"`
      Error   *JUnitFailure `xml:"error"`
    }
    type JUnitFailure struct {
      Message string `xml:"message,attr"`
      Body    string `xml:",chardata"`
    }
    type Property struct {
      Name  string `xml:"name,attr"`
      Value string `xml:"value,attr"`
    }
    func junitToTimeSeries(suite JUnitTestSuite, prowJob, buildID string) []prompb.TimeSeries {
      commitSHA := extractProperty(suite.Properties, "commit.sha")
      branch := extractProperty(suite.Properties, "branch")
      if branch == "" {
        branch = "main"
      }
      runTS := parseTimestamp(suite.Timestamp)
      tsMs := runTS.UnixMilli()
      var series []prompb.TimeSeries
      for _, tc := range suite.TestCases {
        passed := tc.Failure == nil && tc.Error == nil
        var resultValue float64
        if passed {
          resultValue = 1
        }
        // e2e_test_result metric
        series = append(series, prompb.TimeSeries{
          Labels: []prompb.Label{
            {Name: "__name__", Value: "e2e_test_result"},
            {Name: "test", Value: tc.Name},
            {Name: "suite", Value: suite.Name},
            {Name: "job", Value: prowJob},
            {Name: "build_id", Value: buildID},
            {Name: "commit_sha", Value: commitSHA},
            {Name: "branch", Value: branch},
          },
          Samples: []prompb.Sample{
            {Value: resultValue, Timestamp: tsMs},
          },
        })
        // e2e_test_duration_seconds metric
        series = append(series, prompb.TimeSeries{
          Labels: []prompb.Label{
            {Name: "__name__", Value: "e2e_test_duration_seconds"},
            {Name: "test", Value: tc.Name},
            {Name: "suite", Value: suite.Name},
            {Name: "job", Value: prowJob},
            {Name: "build_id", Value: buildID},
            {Name: "branch", Value: branch},
          },
          Samples: []prompb.Sample{
            {Value: tc.Time, Timestamp: tsMs},
          },
        })
      }
      return series
    }
    func remoteWrite(endpoint string, series []prompb.TimeSeries) error {
      req := &prompb.WriteRequest{Timeseries: series}
      data, err := req.Marshal()
      if err != nil {
        return fmt.Errorf("marshaling write request: %w", err)
      }
      compressed := snappy.Encode(nil, data)
      httpReq, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(compressed))
      if err != nil {
        return fmt.Errorf("creating request: %w", err)
      }
      httpReq.Header.Set("Content-Type", "application/x-protobuf")
      httpReq.Header.Set("Content-Encoding", "snappy")
      httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
      resp, err := http.DefaultClient.Do(httpReq)
      if err != nil {
        return fmt.Errorf("sending remote write: %w", err)
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
        return fmt.Errorf("remote write returned %d", resp.StatusCode)
      }
      return nil
    }
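
The ingester above calls extractProperty and parseTimestamp without showing them. A minimal sketch of what they might look like follows. The zone-less fallback layout and the fall-back-to-current-time behavior are assumptions for illustration, not part of the original ingester.

```go
package main

import (
	"fmt"
	"time"
)

// Property mirrors the JUnit <property name="..." value="..."/> element.
type Property struct {
	Name  string
	Value string
}

// extractProperty returns the value of the first property called name,
// or "" when the suite does not carry it.
func extractProperty(props []Property, name string) string {
	for _, p := range props {
		if p.Name == name {
			return p.Value
		}
	}
	return ""
}

// parseTimestamp tries RFC 3339 first, then a zone-less layout that some
// JUnit writers emit (interpreted as UTC). Assumption: when neither layout
// matches, the current time is used so the sample is still ingested.
func parseTimestamp(raw string) time.Time {
	for _, layout := range []string{time.RFC3339, "2006-01-02T15:04:05"} {
		if t, err := time.Parse(layout, raw); err == nil {
			return t.UTC()
		}
	}
	return time.Now().UTC()
}

func main() {
	props := []Property{{Name: "commit.sha", Value: "abc123f"}}
	fmt.Println(extractProperty(props, "commit.sha"))
	fmt.Println(parseTimestamp("2026-06-09T06:00:00Z").UnixMilli())
}
```

Falling back to the current time keeps ingestion going, but it can misplace backfilled runs on the timeline. Returning an error instead is a reasonable alternative if you prefer to skip such files.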

    Deploy as a CronJob that fetches JUnit artifacts from your Google Cloud Storage (GCS) bucket:

    yaml
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: junit-ingester
      namespace: e2e-analytics
    spec:
      schedule: "0 */4 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: ingester
                image: quay.io/your-org/junit-ingester:latest
                env:
                - name: REMOTE_WRITE_ENDPOINT
                  value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
                - name: GCS_BUCKET
                  value: "test-platform-results"
                - name: PROW_JOB
                  value: "periodic-ci-operator-main-e2e"
              restartPolicy: OnFailure

    Quick validation

    After the first ingestion, verify data is flowing.

    Query for any test results:

    curl -s 'http://localhost:9090/api/v1/query?query=e2e_test_result' | jq '.data.result | length'

    Check a specific test:

    curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=e2e_test_result{test=~".*dashboard.*"}' | jq .

    Step 4: Set up Grafana

    Deploy Grafana and point it at Prometheus as the data source.

    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update
    helm install grafana grafana/grafana \
      --namespace e2e-analytics \
      --set persistence.enabled=true \
      --set persistence.size=5Gi \
      --set adminPassword="$(openssl rand -base64 16)" \
      --set "datasources.datasources\\.yaml.apiVersion=1" \
      --set "datasources.datasources\\.yaml.datasources[0].name=E2E Metrics" \
      --set "datasources.datasources\\.yaml.datasources[0].type=prometheus" \
      --set "datasources.datasources\\.yaml.datasources[0].url=http://prometheus-server.e2e-analytics.svc:80" \
      --set "datasources.datasources\\.yaml.datasources[0].access=proxy" \
      --set "datasources.datasources\\.yaml.datasources[0].isDefault=true" \
      --set securityContext.runAsNonRoot=true \
      --set securityContext.runAsUser=null \
      --set securityContext.fsGroup=null \
      --set containerSecurityContext.allowPrivilegeEscalation=false \
      --set "containerSecurityContext.capabilities.drop={ALL}" \
      --set containerSecurityContext.runAsNonRoot=true \
      --set containerSecurityContext.seccompProfile.type=RuntimeDefault \
      --set initChownData.enabled=false

    OpenShift SCC note

    As with Prometheus, these flags clear the chart's default runAsUser and fsGroup and disable the init chown container (which tries to run as root). On non-OpenShift clusters, these overrides are harmless.

    Expose Grafana

    On OpenShift:

    oc -n e2e-analytics create route edge grafana --service=grafana --port=3000

    Or port-forward for local access:

    kubectl -n e2e-analytics port-forward svc/grafana 3000:80 &

    Dashboard panels (PromQL)

    Panel 1: Per-test flake rate (30-day rolling window).

    # Since e2e_test_result is 1 (pass) or 0 (fail), sum_over_time counts passes
    sort_desc(
      1 - (
        sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
        /
        sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
      )
    )

    Select Table as the panel type and configure the following columns: Test Name, Flake Rate, and Total Runs.

    PromQL note

    An earlier version of this query used count_over_time(e2e_test_result{...} == 1 [30d]) to count passes. This is invalid PromQL because the == 1 comparison produces an instant vector, and count_over_time requires a range vector selector. Because e2e_test_result uses 1/0 encoding, sum_over_time directly gives the pass count, making the query both correct and simpler.
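
To make the arithmetic behind the 1/0 encoding concrete, here is a small sketch of the same calculation in Go. The function name flakeRate is hypothetical, and the slice stands in for the samples that fall inside the query window.

```go
package main

import "fmt"

// flakeRate mirrors 1 - sum_over_time / count_over_time for samples that are
// 1 (pass) or 0 (fail). ok is false when the window holds no samples.
func flakeRate(samples []float64) (rate float64, ok bool) {
	if len(samples) == 0 {
		return 0, false
	}
	var passes float64
	for _, s := range samples {
		passes += s // summing 0/1 values counts the passes
	}
	return 1 - passes/float64(len(samples)), true
}

func main() {
	runs := []float64{1, 0, 1, 1} // three passes, one failure
	if rate, ok := flakeRate(runs); ok {
		fmt.Printf("flake rate: %.2f\n", rate) // prints "flake rate: 0.25"
	}
}
```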

    Panel 2: Flake rate time series (per test, daily resolution).

    # Daily flake rate for a specific test (use $test variable)
    1 - (
      sum by (test) (sum_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
      /
      sum by (test) (count_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
    )

    Display as a Time Series panel. Add a threshold line at 0.2 (20%) to show the quarantine boundary.

    Panel 3: Test health heatmap.

    # Pass rate per test per day (for heatmap)
    sum by (test) (sum_over_time(e2e_test_result{branch="main"}[1d]))
    /
    sum by (test) (count_over_time(e2e_test_result{branch="main"}[1d]))

    Panel 4: Regression detection.

    # Tests with 0% pass rate in the last 4 days (potential regression, not flake)
    (
      sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d]))
      /
      sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))
    ) == 0

    Tests matching this pattern that previously had a low flake rate are regressions: the code broke, not the test.
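
The distinction between a flaky test and a regression can be sketched as a pure function over a test's pass/fail history. This sketch swaps the 4-day time window for a count of trailing runs, which is an assumption made to keep the example self-contained. The name classify and its parameters are hypothetical.

```go
package main

import "fmt"

// classify labels a test from its history (oldest run first, true = passed).
// recent is the number of trailing runs that must all fail to call it a
// regression; minRuns and threshold gate the decision.
func classify(history []bool, recent, minRuns int, threshold float64) string {
	if len(history) < minRuns {
		return "insufficient-data"
	}
	failures := 0
	for _, passed := range history {
		if !passed {
			failures++
		}
	}
	if float64(failures)/float64(len(history)) < threshold {
		return "healthy"
	}
	if recent > 0 && recent <= len(history) {
		allFailed := true
		for _, passed := range history[len(history)-recent:] {
			if passed {
				allFailed = false
				break
			}
		}
		if allFailed {
			return "regression" // consistently red: likely broken code
		}
	}
	return "flaky" // fails often, but still passes sometimes
}

func main() {
	history := make([]bool, 20)
	for i := range history {
		history[i] = i < 15 // 15 passes followed by 5 failures
	}
	fmt.Println(classify(history, 4, 10, 0.20))
}
```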

    Panel 5: Test duration trends.

    # Average duration per test over time
    avg by (test) (avg_over_time(e2e_test_duration_seconds{branch="main"}[1d]))

    Alert rules

    Configure Grafana alerting to fire when a test crosses the quarantine threshold:

    # Grafana alert rule (configured via UI or provisioning)
    name: Test Flake Rate Exceeded
    condition: flake_rate > 0.2
    expr: |
      (
        1 - (
          sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
          /
          sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
        )
      ) > 0.2
      and
      sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d])) >= 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Test {{ $labels.test }} flake rate exceeded 20%"

    Step 5: Build the quarantine controller (Go)

    The quarantine controller queries Prometheus via PromQL, identifies flaky tests, excludes regressions, and outputs a quarantine config.

    package main
    import (
      "context"
      "encoding/json"
      "fmt"
      "net/http"
      "net/url"
      "os"
      "time"
    )
    const (
      flakeThreshold         = 0.20
      minRunsForDecision     = 10
      windowDays             = 30
      quarantineDurationDays = 30
    )
    type PromQueryResult struct {
      Status string `json:"status"`
      Data   struct {
        ResultType string `json:"resultType"`
        Result     []struct {
          Metric map[string]string `json:"metric"`
          Value  [2]interface{}    `json:"value"`
        } `json:"result"`
      } `json:"data"`
    }
    type QuarantineEntry struct {
      Name           string  `json:"name"`
      Reason         string  `json:"reason"`
      FlakeRate      float64 `json:"flake_rate"`
      TotalRuns      int     `json:"total_runs"`
      FailedRuns     int     `json:"failed_runs"`
      Jira           string  `json:"jira,omitempty"`
      QuarantinedAt  string  `json:"quarantined_at"`
      ReEnableAfter  string  `json:"re_enable_after"`
    }
    type QuarantineConfig struct {
      Version int                        `json:"version"`
      Updated string                     `json:"updated"`
      Tests   map[string]QuarantineEntry `json:"tests"`
    }
    func queryFlakeRates(ctx context.Context, promURL string) (map[string]float64, error) {
      query := fmt.Sprintf(`
        1 - (
          sum by (test) (sum_over_time(e2e_test_result{branch="main"}[%dd]))
          /
          sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))
        )
      `, windowDays, windowDays)
      resp, err := http.Get(fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)))
      if err != nil {
        return nil, fmt.Errorf("querying flake rates: %w", err)
      }
      defer resp.Body.Close()
      var result PromQueryResult
      if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return nil, fmt.Errorf("decoding response: %w", err)
      }
      rates := make(map[string]float64)
      for _, r := range result.Data.Result {
        testName := r.Metric["test"]
        // Value is [timestamp, "value_string"]
        if valStr, ok := r.Value[1].(string); ok {
          var val float64
          fmt.Sscanf(valStr, "%f", &val)
          rates[testName] = val
        }
      }
      return rates, nil
    }
    func queryRunCounts(ctx context.Context, promURL string) (map[string]int, error) {
      query := fmt.Sprintf(`sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))`, windowDays)
      resp, err := http.Get(fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)))
      if err != nil {
        return nil, fmt.Errorf("querying run counts: %w", err)
      }
      defer resp.Body.Close()
      var result PromQueryResult
      if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return nil, fmt.Errorf("decoding response: %w", err)
      }
      counts := make(map[string]int)
      for _, r := range result.Data.Result {
        testName := r.Metric["test"]
        if valStr, ok := r.Value[1].(string); ok {
          var val int
          fmt.Sscanf(valStr, "%d", &val)
          counts[testName] = val
        }
      }
      return counts, nil
    }
    func isRegression(ctx context.Context, promURL, testName string) (bool, error) {
      // A regression = 0% pass rate in recent window (all runs failed).
      // Uses sum_over_time/count_over_time instead of last_over_time subquery,
      // which is unreliable with high-cardinality build_id labels.
      query := fmt.Sprintf(
        `(sum(sum_over_time(e2e_test_result{test="%s", branch="main"}[4d])) /
        sum(count_over_time(e2e_test_result{test="%s", branch="main"}[4d])))`,
        testName, testName,
      )
      resp, err := http.Get(fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)))
      if err != nil {
        return false, err
      }
      defer resp.Body.Close()
      var result PromQueryResult
      if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return false, err
      }
      // If the pass rate is 0, all recent runs failed; it's likely a regression.
      for _, r := range result.Data.Result {
        if valStr, ok := r.Value[1].(string); ok {
          var val float64
          fmt.Sscanf(valStr, "%f", &val)
          if val == 0 {
            return true, nil
          }
        }
      }
      return false, nil
    }
    func buildQuarantineConfig(ctx context.Context, promURL string) (*QuarantineConfig, error) {
      flakeRates, err := queryFlakeRates(ctx, promURL)
      if err != nil {
        return nil, err
      }
      runCounts, err := queryRunCounts(ctx, promURL)
      if err != nil {
        return nil, err
      }
      now := time.Now().UTC()
      cfg := &QuarantineConfig{
        Version: 1,
        Updated: now.Format(time.RFC3339),
        Tests:   make(map[string]QuarantineEntry),
      }
      for testName, rate := range flakeRates {
        runs := runCounts[testName]
        if rate < flakeThreshold || runs < minRunsForDecision {
          continue
        }
        regression, err := isRegression(ctx, promURL, testName)
        if err != nil {
          return nil, fmt.Errorf("checking regression for %s: %w", testName, err)
        }
        if regression {
          continue // Don't quarantine regressions
        }
        failedRuns := int(rate*float64(runs) + 0.5) // round: the float product can land just below the integer
        cfg.Tests[testName] = QuarantineEntry{
          Name:          testName,
          Reason:        fmt.Sprintf("Flake rate %.0f%% over %dd (%d/%d failed)", rate*100, windowDays, failedRuns, runs),
          FlakeRate:     rate,
          TotalRuns:     runs,
          FailedRuns:    failedRuns,
          QuarantinedAt: now.Format(time.RFC3339),
          ReEnableAfter: now.AddDate(0, 0, quarantineDurationDays).Format(time.RFC3339),
        }
      }
      return cfg, nil
    }
    func main() {
      promURL := os.Getenv("PROMETHEUS_URL")
      if promURL == "" {
        promURL = "http://prometheus-server.e2e-analytics.svc:80"
      }
      ctx := context.Background()
      cfg, err := buildQuarantineConfig(ctx, promURL)
      if err != nil {
        fmt.Fprintf(os.Stderr, "error: %v\n", err)
        os.Exit(1)
      }
      data, _ := json.MarshalIndent(cfg, "", "  ")
      fmt.Println(string(data))
    }

    Deploy as a daily CronJob (yaml):

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: quarantine-controller
      namespace: e2e-analytics
    spec:
      schedule: "0 6 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: controller
                image: quay.io/your-org/quarantine-controller:latest
                env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus-server.e2e-analytics.svc:80"
                - name: JIRA_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: jira-credentials
                      key: token
                - name: JIRA_SERVER
                  value: "https://redhat.atlassian.net"
                - name: JIRA_PROJECT
                  value: "PROJECT"
                - name: GIT_REPO
                  value: "repository"
                - name: GITHUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: github-credentials
                      key: token
              restartPolicy: OnFailure

    The controller:

    1. Queries Prometheus for flake rates via PromQL
    2. Excludes regressions (tests with a 0% pass rate over the last 4 days)
    3. Outputs a quarantine JSON config
    4. Creates Jira tickets for newly quarantined tests
    5. Commits the config to Git (pull request or direct push)
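
The controller code above covers the first three steps. For the Jira step, the sketch below builds the JSON body that the Jira REST API's create-issue endpoint (POST /rest/api/2/issue) accepts. The issue type "Bug", the label "e2e-quarantine", and the summary wording are assumptions. Sending the request and authenticating are left out because they depend on your Jira deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type jiraKey struct {
	Key string `json:"key"`
}

type jiraName struct {
	Name string `json:"name"`
}

type jiraFields struct {
	Project     jiraKey  `json:"project"`
	IssueType   jiraName `json:"issuetype"`
	Summary     string   `json:"summary"`
	Description string   `json:"description"`
	Labels      []string `json:"labels,omitempty"`
}

type jiraIssue struct {
	Fields jiraFields `json:"fields"`
}

// newQuarantineIssue returns the request body for one quarantined test.
// "Bug" and "e2e-quarantine" are placeholder choices, not fixed values.
func newQuarantineIssue(project, testName, reason string) ([]byte, error) {
	issue := jiraIssue{Fields: jiraFields{
		Project:     jiraKey{Key: project},
		IssueType:   jiraName{Name: "Bug"},
		Summary:     "Quarantined flaky E2E test: " + testName,
		Description: reason,
		Labels:      []string{"e2e-quarantine"},
	}}
	return json.Marshal(issue)
}

func main() {
	body, err := newQuarantineIssue("PROJECT", "TestOperator/dashboard", "Flake rate 35% over 30d (7/20 failed)")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The key returned by Jira would then be stored in the entry's jira field so that later cycles can look the ticket up.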

    Step 6: Wire the test runner

    Your E2E test runner loads the quarantine config and skips active entries. The quarantine controller exports this JSON:

    {
      "version": 1,
      "updated": "2026-06-09T06:00:00Z",
      "tests": {
        "TestOperator/components/group_1/dashboard/validate_config": {
          "name": "TestOperator/components/group_1/dashboard/validate_config",
          "reason": "Flake rate 35% over 30d (7/20 failed)",
          "flake_rate": 0.35,
          "total_runs": 20,
          "failed_runs": 7,
          "jira": "JIRA-60123",
          "quarantined_at": "2026-05-15T06:00:00Z",
          "re_enable_after": "2026-06-14T06:00:00Z"
        }
      }
    }

    At test startup, load the config and build a skip regex (Go):

    func buildSkipRegex(cfg *QuarantineConfig) string {
        var patterns []string
        for name := range cfg.Tests {
            segments := strings.Split(name, "/")
            escaped := make([]string, len(segments))
            for i, seg := range segments {
                escaped[i] = "^" + regexp.QuoteMeta(seg) + "$"
            }
            patterns = append(patterns, strings.Join(escaped, "/"))
        }
        return strings.Join(patterns, "|")
    }

    Pass the result to go test -skip (bash):

    SKIP_REGEX=$(quarantine-tool build-skip-regex --config tests/e2e/quarantine.json)
    go test ./tests/e2e/... \
      -v -timeout 60m \
      -skip "$SKIP_REGEX"
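
One way the loading step could look end to end is sketched below. It parses the quarantine JSON, keeps only entries whose re_enable_after is still in the future, and builds the pattern the same way as buildSkipRegex. The names activeSkipRegex and mustTime are hypothetical, and sorting the patterns is an addition so the output is stable across runs. The -skip flag requires Go 1.20 or later.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"sort"
	"strings"
	"time"
)

type entry struct {
	ReEnableAfter string `json:"re_enable_after"`
}

type config struct {
	Tests map[string]entry `json:"tests"`
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// activeSkipRegex returns a -skip pattern for entries still quarantined at now.
func activeSkipRegex(raw []byte, now time.Time) (string, error) {
	var cfg config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", fmt.Errorf("parsing quarantine config: %w", err)
	}
	var patterns []string
	for name, e := range cfg.Tests {
		expiry, err := time.Parse(time.RFC3339, e.ReEnableAfter)
		if err != nil || !now.Before(expiry) {
			continue // expired or malformed entry: let the test run
		}
		segments := strings.Split(name, "/")
		for i, seg := range segments {
			segments[i] = "^" + regexp.QuoteMeta(seg) + "$"
		}
		patterns = append(patterns, strings.Join(segments, "/"))
	}
	sort.Strings(patterns) // map iteration order is random; keep output stable
	return strings.Join(patterns, "|"), nil
}

func main() {
	raw := []byte(`{"tests":{
	  "TestOperator/dashboard/validate_config":{"re_enable_after":"2026-06-14T06:00:00Z"},
	  "TestOperator/old/case":{"re_enable_after":"2026-01-01T00:00:00Z"}}}`)
	pattern, err := activeSkipRegex(raw, mustTime("2026-06-09T06:00:00Z"))
	if err != nil {
		panic(err)
	}
	fmt.Println(pattern) // prints "^TestOperator$/^dashboard$/^validate_config$"
}
```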

    Step 7: Close the feedback loop

    The system is self-healing by design:

    • Quarantined tests expire. After quarantineDurationDays (30 days in the controller above), the entry is removed and the test runs again in CI.
    • If the test is still flaky, the next analysis cycle re-quarantines it (with a fresh Jira ticket reference).
    • If someone fixes the test, it passes consistently and is never re-quarantined.
    • Jira resolution check: The controller queries Jira for resolved tickets and proactively un-quarantines those tests early.

    The controller's cleanup logic (runs every cycle):

    func cleanupExpired(cfg *QuarantineConfig) {
        now := time.Now().UTC()
        for name, entry := range cfg.Tests {
            expiry, _ := time.Parse(time.RFC3339, entry.ReEnableAfter)
            if now.After(expiry) {
                delete(cfg.Tests, name)
            }
        }
    }
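
The Jira resolution check from the list above is not shown in the article. A rough sketch follows, assuming the controller has already fetched each issue as JSON from Jira's issue endpoint and that a status category key of "done" means resolved. The function names and the test-to-ticket map are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jiraStatusResponse picks out fields.status.statusCategory.key from an issue.
type jiraStatusResponse struct {
	Fields struct {
		Status struct {
			StatusCategory struct {
				Key string `json:"key"`
			} `json:"statusCategory"`
		} `json:"status"`
	} `json:"fields"`
}

// isResolved reports whether the issue JSON is in the "done" status category.
func isResolved(issueJSON []byte) (bool, error) {
	var resp jiraStatusResponse
	if err := json.Unmarshal(issueJSON, &resp); err != nil {
		return false, fmt.Errorf("decoding Jira issue: %w", err)
	}
	return resp.Fields.Status.StatusCategory.Key == "done", nil
}

// releaseResolved deletes entries (test name -> ticket key) whose ticket is
// resolved and returns how many were released.
func releaseResolved(tests map[string]string, resolved func(ticket string) bool) int {
	released := 0
	for name, ticket := range tests {
		if ticket != "" && resolved(ticket) {
			delete(tests, name) // deleting during range is safe in Go
			released++
		}
	}
	return released
}

func main() {
	done, _ := isResolved([]byte(`{"fields":{"status":{"statusCategory":{"key":"done"}}}}`))
	tests := map[string]string{"TestA": "JIRA-1", "TestB": "JIRA-2"}
	n := releaseResolved(tests, func(ticket string) bool { return done && ticket == "JIRA-1" })
	fmt.Println(n, len(tests))
}
```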

    Step 8: Add PR visibility

    Add a CI check that posts a comment on every pull request (PR) showing the current quarantine status (yaml):

    name: Quarantine Status
    on:
      pull_request:
        branches: [main]
    jobs:
      quarantine-status:
        runs-on: ubuntu-latest
        permissions:
          pull-requests: write
        steps:
          - uses: actions/checkout@v4
          - name: Post quarantine table
            env:
              GH_TOKEN: ${{ github.token }}
            run: |
              CONFIG="tests/e2e/quarantine.json"
              COUNT=$(jq '.tests | length' "$CONFIG")
              if [ "$COUNT" -eq 0 ]; then exit 0; fi
              echo "## Quarantined E2E Tests" > /tmp/comment.md
              echo "**${COUNT}** tests are quarantined and will be skipped." >> /tmp/comment.md
              echo "" >> /tmp/comment.md
              echo "| Test | Jira | Flake Rate | Expires |" >> /tmp/comment.md
              echo "|------|------|-----------|---------|" >> /tmp/comment.md
              jq -r '.tests[] | "| \(.name) | \(.jira // "-") | \(.flake_rate * 100 | floor)% | \(.re_enable_after // "-") |"' \
                "$CONFIG" >> /tmp/comment.md
              gh pr comment "${{ github.event.number }}" --body-file /tmp/comment.md

    Step 9: Connect to your CI pipeline

    The system above is useful only when real test results flow into it. This section shows how to wire it to the three CI platforms most relevant to OpenShift projects.

    Option A: OpenShift CI (Prow)

    OpenShift CI stores all job artifacts in a public GCS bucket (test-platform-results). After every E2E run, Prow uploads $ARTIFACT_DIR contents to:

      gs://test-platform-results/
      logs/{periodic-job-name}/{build_id}/          # periodic jobs
      pr-logs/pull/{org}_{repo}/{pr}/{job}/{build_id}/  # presubmit jobs

    Your test runner must produce JUnit XML inside $ARTIFACT_DIR. Most Go test harnesses support this via gotestsum --junitfile or a wrapper that converts go test -json output. If you use a Makefile, a common pattern is:

    makefile
    ifdef ARTIFACT_DIR
    export JUNIT_OUTPUT_PATH = ${ARTIFACT_DIR}/junit_report.xml
    endif

    The ingester CronJob (Step 3) scrapes this bucket using the GCS JSON API (no auth needed for public buckets):

    # List recent builds for a job
    curl -s "https://storage.googleapis.com/storage/v1/b/test-platform-results/o?\
    prefix=logs/periodic-ci-my-org-my-operator-main-e2e/&delimiter=/"
    # Download JUnit from a specific build
    curl -s "https://storage.googleapis.com/test-platform-results/\
    logs/{job}/{build_id}/artifacts/{workflow}/e2e/artifacts/junit_report.xml"
    # Get run metadata (timestamp, commit SHA)
    curl -s "https://storage.googleapis.com/test-platform-results/\
    logs/{job}/{build_id}/started.json"
    # {"timestamp":1765889560, "repo-commit":"abc123f", ...}
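
On the ingester side, reading started.json could look like the sketch below. It decodes only the two fields shown in the sample output above. The type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// started holds the fields of started.json that the ingester needs.
type started struct {
	Timestamp  int64  `json:"timestamp"`   // Unix seconds
	RepoCommit string `json:"repo-commit"` // commit SHA under test
}

// parseStarted returns the run time (UTC) and commit SHA from started.json.
func parseStarted(raw []byte) (time.Time, string, error) {
	var s started
	if err := json.Unmarshal(raw, &s); err != nil {
		return time.Time{}, "", fmt.Errorf("decoding started.json: %w", err)
	}
	if s.Timestamp == 0 {
		return time.Time{}, "", fmt.Errorf("started.json has no timestamp")
	}
	return time.Unix(s.Timestamp, 0).UTC(), s.RepoCommit, nil
}

func main() {
	runTime, sha, err := parseStarted([]byte(`{"timestamp":1765889560,"repo-commit":"abc123f"}`))
	if err != nil {
		panic(err)
	}
	// Remote-write samples carry millisecond timestamps.
	fmt.Println(runTime.UnixMilli(), sha)
}
```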

    The ingester maps GCS metadata to Prometheus labels:

    | Prometheus label | GCS source |
    |---|---|
    | test | <testcase name="..."> in junit_report.xml |
    | suite | <testsuite name="..."> in junit_report.xml |
    | job | Path segment (the Prow job name) |
    | build_id | Path segment (numeric build ID) |
    | commit_sha | started.json repo-commit |
    | branch | main for periodics; PR number for presubmits |
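
The path-based rows of this mapping can be sketched as a small parser over the object path, following the two bucket layouts shown earlier. The runRef type and the parseObjectPath name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// runRef carries the labels that are derived from the GCS object path.
type runRef struct {
	Job     string
	BuildID string
	Branch  string
}

// parseObjectPath handles both layouts:
//   logs/{job}/{build_id}/...                          (periodic)
//   pr-logs/pull/{org}_{repo}/{pr}/{job}/{build_id}/... (presubmit)
func parseObjectPath(p string) (runRef, bool) {
	parts := strings.Split(strings.Trim(p, "/"), "/")
	switch {
	case len(parts) >= 3 && parts[0] == "logs":
		return runRef{Job: parts[1], BuildID: parts[2], Branch: "main"}, true
	case len(parts) >= 6 && parts[0] == "pr-logs" && parts[1] == "pull":
		// Per the table, presubmits use the PR number as the branch label.
		return runRef{Job: parts[4], BuildID: parts[5], Branch: parts[3]}, true
	}
	return runRef{}, false
}

func main() {
	ref, ok := parseObjectPath("logs/periodic-ci-operator-main-e2e/1234567890/started.json")
	fmt.Println(ref.Job, ref.BuildID, ref.Branch, ok)
	// prints "periodic-ci-operator-main-e2e 1234567890 main true"
}
```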

    Important: For accurate flake detection, scrape periodic jobs (which run on main without code changes), not presubmit jobs (which mix test flakes with actual regressions introduced by PRs).
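
    The periodic/presubmit distinction can be read straight from the artifact path. The following Python sketch assumes the two path layouts shown earlier; labels_from_path and the kind field are hypothetical helpers, not part of the label set.

```python
def labels_from_path(path: str) -> dict:
    """Derive job, build_id, and branch from an artifact path.

    Periodic:  logs/{job}/{build_id}/...
    Presubmit: pr-logs/pull/{org}_{repo}/{pr}/{job}/{build_id}/...
    """
    parts = path.strip("/").split("/")
    if parts[0] == "logs" and len(parts) >= 3:
        return {"job": parts[1], "build_id": parts[2],
                "branch": "main", "kind": "periodic"}
    if parts[:2] == ["pr-logs", "pull"] and len(parts) >= 6:
        # For presubmits the branch label carries the PR number.
        return {"job": parts[4], "build_id": parts[5],
                "branch": parts[3], "kind": "presubmit"}
    raise ValueError(f"unrecognized artifact path: {path!r}")


def usable_for_flake_detection(path: str) -> bool:
    """Only periodic runs should feed flake statistics."""
    return labels_from_path(path)["kind"] == "periodic"
```

    Filtering on the path before ingestion keeps PR-induced failures out of the flake-rate calculations entirely.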

    Option B: Konflux and Tekton pipelines

    Konflux uses Tekton pipelines. The integration approach is a post-task in your E2E pipeline that pushes results directly, so no GCS scraping is needed.

    Define a Tekton Task that pushes the results once the E2E tests have finished:

    # Tekton task that pushes JUnit results to Prometheus after E2E tests
    apiVersion: tekton.dev/v1
    kind: Task
    metadata:
      name: push-test-metrics
      namespace: e2e-analytics
    spec:
      params:
        - name: junit-path
          description: Path to JUnit XML file
        - name: job-name
          description: Pipeline/job identifier
        - name: build-id
          description: PipelineRun UID or build number
        - name: commit-sha
          description: Git commit SHA
      steps:
        - name: push-metrics
          image: quay.io/your-org/junit-ingester:latest
          env:
            - name: REMOTE_WRITE_ENDPOINT
              value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
          command:
            - /junit-ingester
            - --file=$(params.junit-path)
            - --job=$(params.job-name)
            - --build-id=$(params.build-id)
            - --commit-sha=$(params.commit-sha)
            - --branch=main

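    Wire it into your E2E pipeline as a finally task so that it runs whether tests pass or fail. Each Tekton task runs in its own pod, so the JUnit file must live on a workspace shared by both tasks (workspace bindings are not shown in these snippets):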

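    apiVersion: tekton.dev/v1
    kind: Pipeline
    spec:
      params:
        - name: git-revision   # referenced by the finally task below
          type: string
      tasks:
        - name: run-e2e
          taskRef:
            name: e2e-tests
          # ... test config ...
      finally:
        - name: push-metrics
          taskRef:
            name: push-test-metrics
          params:
            - name: junit-path
              # Tekton skips this finally task if run-e2e never wrote the result,
              # so run-e2e should emit junit-path before the tests can fail.
              value: "$(tasks.run-e2e.results.junit-path)"
            - name: job-name
              value: "konflux-my-operator-e2e"
            - name: build-id
              value: "$(context.pipelineRun.uid)"
            - name: commit-sha
              value: "$(params.git-revision)"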

    The advantage over GCS scraping: results arrive in Prometheus within seconds of the test run completing, not on a four-hour CronJob schedule.

    Option C: Local or ad-hoc runs

    For testing the system or running one-off analyses, you can push results from a local make e2e-test run:

    # Run E2E tests with JUnit output
    ARTIFACT_DIR=/tmp/e2e-results make e2e-test
    # Expose Prometheus locally (skip this when running in-cluster)
    kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &
    PF_PID=$!
    sleep 2  # give the port-forward a moment to start listening
    # Push results to Prometheus
    /path/to/junit-ingester \
      --file /tmp/e2e-results/junit_report.xml \
      --job "local-e2e" \
      --build-id "$(date +%s)" \
      --commit-sha "$(git rev-parse HEAD)" \
      --branch "$(git rev-parse --abbrev-ref HEAD)" \
      --remote-write-endpoint http://localhost:9090/api/v1/write
    kill "$PF_PID"  # stop the port-forward

    This is useful for validating the pipeline end-to-end before deploying the CronJob or Tekton task.

    Why exclude regressions from quarantine?

    A regression means the code broke. Quarantining the test hides the bug. The system detects regressions by looking for a step-function pattern: mostly passing before a specific commit, then consistently failing after. These are flagged in Grafana but never auto-quarantined.
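
    A step-function check of this kind might look like the following Python sketch. The article does not give the controller's exact rule, so the function name and all three thresholds (min_before, min_after, healthy_rate) are illustrative assumptions.

```python
from typing import Optional


def find_regression_point(results: list[int], min_before: int = 5,
                          min_after: int = 3,
                          healthy_rate: float = 0.8) -> Optional[int]:
    """Return the index of the first run of a step-function failure, or None.

    results is ordered oldest to newest, 1 for pass and 0 for fail.
    A regression here means: the newest runs all failed, there are at
    least min_after of them, and the runs before that point passed at
    healthy_rate or better.
    """
    # Walk back from the newest run to find where the unbroken failures start.
    start = len(results)
    while start > 0 and results[start - 1] == 0:
        start -= 1
    before, after = results[:start], results[start:]
    if len(after) < min_after or len(before) < min_before:
        return None
    if sum(before) / len(before) < healthy_rate:
        return None  # already unhealthy before the step: treat as flaky
    return start
```

    The returned index identifies the first consistently failing run. Looking up that run's commit_sha label points at the commit to investigate.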

    Why automatic expiry?

    Without expiry, quarantined tests become permanent exclusions. The re_enable_after field forces accountability: either fix the test within the window, or it returns to CI and gets re-evaluated. This prevents the quarantine list from growing unbounded.
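
    The expiry check itself is small. This Python sketch assumes re_enable_after is an ISO 8601 date string (YYYY-MM-DD) and that an entry expires on that date rather than the day after. Both are assumptions, as are the function name and the test field used in the usage example.

```python
from datetime import date


def split_expired(entries: list[dict], today: date) -> tuple[list[dict], list[dict]]:
    """Partition quarantine entries into (still_active, expired).

    An entry counts as expired once today is on or after its
    re_enable_after date.
    """
    active, expired = [], []
    for entry in entries:
        deadline = date.fromisoformat(entry["re_enable_after"])
        (expired if today >= deadline else active).append(entry)
    return active, expired
```

    Calling split_expired(entries, date.today()) on each controller cycle yields the tests that should return to CI for re-evaluation.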

    Grafana dashboard layout

    Organize your dashboard into four rows:

    • Row 1: Overview
      • Stat panel: total tests, quarantined count, overall suite pass rate
      • Pie chart: healthy / flaky / regression breakdown
      • PromQL: count(count by (test) (e2e_test_result{branch="main"})) for total tests
    • Row 2: Flake leaderboard
      • Table: top 20 flakiest tests with rates, run counts
      • Time series: flake rate trend for selected test (variable dropdown)
      • PromQL: see Panel 1 and Panel 2 above
    • Row 3: Regressions
      • Table: tests where all recent runs failed (0% pass rate in last four days)
      • PromQL: (sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d])) / sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))) == 0
    • Row 4: Quarantine management
      • Table: loaded from quarantine JSON (or an e2e_quarantine_active metric the controller pushes)
      • Stat panel: tests expiring in next seven days
      • Log panel: quarantine/un-quarantine events timeline

    Operational runbook

    Follow these standard procedures to triage skipped tests, investigate failure causes, and manage the lifecycle of your quarantined suite.

    A test was quarantined. What do I do?

    1. Check the Jira ticket linked in the quarantine entry.
    2. Open the Grafana dashboard, select the test from the dropdown, and look at the time-series panel.
    3. Identify the pattern: is it an intermittent flake (random), or did the failures start at a specific commit?
    4. Fix the test, verify that it passes in three or more consecutive runs, and close the Jira ticket.
    5. The controller will un-quarantine it on the next cycle.
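
    The "three or more consecutive runs" check in step 4 can be expressed as a short Python helper. The names are hypothetical and the default of three simply mirrors the runbook. The controller's real un-quarantine rule may differ.

```python
def trailing_passes(results: list[int]) -> int:
    """Count consecutive passes at the end of a run history (1 = pass, 0 = fail)."""
    count = 0
    for value in reversed(results):
        if value != 1:
            break
        count += 1
    return count


def ready_to_unquarantine(results: list[int], required: int = 3) -> bool:
    """True once the newest `required` runs have all passed."""
    return trailing_passes(results) >= required
```

    Here results is the test's history ordered oldest to newest, so a single recent failure resets the count.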

    How do I swap the data store?

    Because every component speaks a standard Prometheus interface (remote write for ingestion, PromQL over HTTP for queries), swapping the back end is a configuration change:

    1. Ingester: Point REMOTE_WRITE_ENDPOINT at the new endpoint (such as Thanos receiver or Mimir).
    2. Grafana: Update the data source URL.
    3. Quarantine controller: Update PROMETHEUS_URL.

    All PromQL queries, dashboards, and alert rules work unchanged. That's the point of conforming to the standard.

    Moving from reactive debugging to data-driven pipelines

    Automating your test quarantine system moves your team from reactive troubleshooting to a data-driven pipeline. Prometheus metrics give you the run history needed to tell intermittent flakiness from true code regressions before a broken pull request blocks your main branch. The loop isolates flaky tests early, enforces accountability through explicit expiry dates, and reduces the manual toil of triaging failures.
