Build a dynamic E2E test quarantine system with Prometheus and Grafana

If you run end-to-end (E2E) tests on a Kubernetes operator, you've seen the pattern: a test that passes 80% of the time still fails often enough to block continuous integration (CI), waste developer hours, and train your team to reflexively /retest. Without historical data, you can't distinguish a flaky test from a regression. Without automation, the only remedy is a human noticing and filing a ticket.

This guide shows you how to build a complete quarantine system backed by a Prometheus-compatible time-series database and Grafana, running on a long-lived cluster that provides continuous observability into your test suite's health.

What you'll build

A Grafana dashboard showing per-test health with automated quarantine decisions, Jira ticket creation, and a self-healing feedback loop, all powered by industry-standard Prometheus metrics.

Prerequisites:

A long-lived OpenShift/Kubernetes cluster (or any cluster that stays up)
Periodic E2E test runs producing JUnit XML (Prow periodic jobs or scheduled GitHub Actions)
helm and kubectl or oc access to the cluster
A Jira project for tracking quarantined tests (optional but recommended)

Step 1: Deploy Prometheus

Deploy a dedicated Prometheus instance for test analytics. On OpenShift you might already have a cluster monitoring stack, but a separate instance keeps test data isolated and gives you control over retention.

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus \
  --namespace e2e-analytics --create-namespace \
  --set server.retention=90d \
  --set server.persistentVolume.size=20Gi \
  --set server.resources.requests.memory=512Mi \
  --set server.resources.requests.cpu=250m \
  --set alertmanager.enabled=false \
  --set prometheus-node-exporter.enabled=false \
  --set kube-state-metrics.enabled=false \
  --set prometheus-pushgateway.enabled=false \
  --set 'server.extraFlags[0]=web.enable-remote-write-receiver' \
  --set 'serverFiles.prometheus\.yml.storage.tsdb.out_of_order_time_window=720h' \
  --set server.securityContext.runAsNonRoot=true \
  --set server.securityContext.runAsUser=null \
  --set server.securityContext.fsGroup=null \
  --set server.containerSecurityContext.allowPrivilegeEscalation=false \
  --set "server.containerSecurityContext.capabilities.drop={ALL}" \
  --set server.containerSecurityContext.runAsNonRoot=true \
  --set server.containerSecurityContext.seccompProfile.type=RuntimeDefault

The --web.enable-remote-write-receiver flag enables the remote-write endpoint so our ingester can push data in. The out_of_order_time_window storage config allows the ingester to backfill historical data (required when first loading past results).

OpenShift SCC note

The default Prometheus Helm chart sets runAsUser: 65534 and fsGroup: 65534, which are rejected by OpenShift's restricted-v2 SCC. These securityContext overrides clear these defaults so the container runs with the UID assigned by OpenShift.

Verify it's running:

kubectl -n e2e-analytics get pods -l app.kubernetes.io/name=prometheus
kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &

Test the query endpoint:

curl -s 'http://localhost:9090/api/v1/query?query=up'

The primary endpoints include:

Remote-write ingest: http://prometheus-server.e2e-analytics.svc:80/api/v1/write
PromQL query: http://prometheus-server.e2e-analytics.svc:80/api/v1/query
Range query: http://prometheus-server.e2e-analytics.svc:80/api/v1/query_range

Alternative (OpenShift)

If you're on OpenShift 4.x, you can use the built-in user workload monitoring instead. Enable it in the cluster-monitoring-config ConfigMap, and your metrics are automatically available via the Thanos Querier at https://thanos-querier.openshift-monitoring.svc:9091. This gives you Prometheus without deploying anything extra.

Step 2: Define the metric schema

Instead of SQL tables, we define Prometheus metrics with labels. This is the interface contract: any component that writes these metrics (the GCS scraper, a push gateway, or a future source) is compatible.

Metrics:

Metric name	Type	Labels	Description
`e2e_test_result`	Gauge (0/1)	`test`, `suite`, `job`, `build_id`, `commit_sha`, `branch`	1 = passed, 0 = failed
`e2e_test_duration_seconds`	Gauge	`test`, `suite`, `job`, `build_id`, `branch`	Test execution duration
`e2e_test_error_info`	Gauge (1)	`test`, `suite`, `error_category`, `error_message`	Error classification (info metric)

Label schema:

e2e_test_result{
  test="TestOperator/components/group_1/dashboard/validate_config",
  suite="e2e-operator",
  job="periodic-ci-operator-main-e2e",
  build_id="1234567890",
  commit_sha="abc123f",
  branch="main"
} 0  # 0 = failed, 1 = passed

Each test execution produces one e2e_test_result sample per test case. The timestamp is the run time. This gives us a time-series of pass/fail per test that PromQL can aggregate over any window.

Step 3: Build the JUnit ingester

Create a Go binary that parses JUnit XML, converts results to Prometheus metrics, and pushes them via remote-write to Prometheus.

go
package main

import (
  "bytes"
  "encoding/xml"
  "fmt"
  "net/http"
  "os"
  "time"

  "github.com/golang/snappy"
  "github.com/prometheus/prometheus/prompb"
)

// Note: prompb types use gogoproto and have their own Marshal() method.
// Do NOT use google.golang.org/protobuf/proto: it requires ProtoReflect(),
// which gogoproto types don't implement.

type JUnitTestSuite struct {
  XMLName    xml.Name        `xml:"testsuite"`
  Name       string          `xml:"name,attr"`
  Timestamp  string          `xml:"timestamp,attr"`
  TestCases  []JUnitTestCase `xml:"testcase"`
  Properties []Property      `xml:"properties>property"`
}
type JUnitTestCase struct {
  Name    string        `xml:"name,attr"`
  Time    float64       `xml:"time,attr"`
  Failure *JUnitFailure `xml:"failure"`
  Error   *JUnitFailure `xml:"error"`
}
type JUnitFailure struct {
  Message string `xml:"message,attr"`
  Body    string `xml:",chardata"`
}
type Property struct {
  Name  string `xml:"name,attr"`
  Value string `xml:"value,attr"`
}
func junitToTimeSeries(suite JUnitTestSuite, prowJob, buildID string) []prompb.TimeSeries {
  commitSHA := extractProperty(suite.Properties, "commit.sha")
  branch := extractProperty(suite.Properties, "branch")
  if branch == "" {
    branch = "main"
  }
  runTS := parseTimestamp(suite.Timestamp)
  tsMs := runTS.UnixMilli()
  var series []prompb.TimeSeries
  for _, tc := range suite.TestCases {
    passed := tc.Failure == nil && tc.Error == nil
    var resultValue float64
    if passed {
      resultValue = 1
    }
    // e2e_test_result metric
    series = append(series, prompb.TimeSeries{
      Labels: []prompb.Label{
        {Name: "__name__", Value: "e2e_test_result"},
        {Name: "test", Value: tc.Name},
        {Name: "suite", Value: suite.Name},
        {Name: "job", Value: prowJob},
        {Name: "build_id", Value: buildID},
        {Name: "commit_sha", Value: commitSHA},
        {Name: "branch", Value: branch},
      },
      Samples: []prompb.Sample{
        {Value: resultValue, Timestamp: tsMs},
      },
    })
    // e2e_test_duration_seconds metric
    series = append(series, prompb.TimeSeries{
      Labels: []prompb.Label{
        {Name: "__name__", Value: "e2e_test_duration_seconds"},
        {Name: "test", Value: tc.Name},
        {Name: "suite", Value: suite.Name},
        {Name: "job", Value: prowJob},
        {Name: "build_id", Value: buildID},
        {Name: "branch", Value: branch},
      },
      Samples: []prompb.Sample{
        {Value: tc.Time, Timestamp: tsMs},
      },
    })
  }
  return series
}
func remoteWrite(endpoint string, series []prompb.TimeSeries) error {
  req := &prompb.WriteRequest{Timeseries: series}
  data, err := req.Marshal()
  if err != nil {
    return fmt.Errorf("marshaling write request: %w", err)
  }
  compressed := snappy.Encode(nil, data)
  httpReq, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(compressed))
  if err != nil {
    return fmt.Errorf("creating request: %w", err)
  }
  httpReq.Header.Set("Content-Type", "application/x-protobuf")
  httpReq.Header.Set("Content-Encoding", "snappy")
  httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
  resp, err := http.DefaultClient.Do(httpReq)
  if err != nil {
    return fmt.Errorf("sending remote write: %w", err)
  }
  defer resp.Body.Close()
  if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
    return fmt.Errorf("remote write returned %d", resp.StatusCode)
  }
  return nil
}

Deploy as a CronJob that fetches JUnit artifacts from your Google Cloud Storage (GCS) bucket:

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: junit-ingester
  namespace: e2e-analytics
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: ingester
            image: quay.io/your-org/junit-ingester:latest
            env:
            - name: REMOTE_WRITE_ENDPOINT
              value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
            - name: GCS_BUCKET
              value: "test-platform-results"
            - name: PROW_JOB
              value: "periodic-ci-operator-main-e2e"
          restartPolicy: OnFailure

Quick validation

After the first ingestion, verify data is flowing.

Query for any test results:

curl -s 'http://localhost:9090/api/v1/query?query=e2e_test_result' | jq '.data.result | length'

Check a specific test:

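# -G with --data-urlencode avoids curl's URL globbing on { } and encodes the selector
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=e2e_test_result{test=~".*dashboard.*"}' | jq .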

Step 4: Set up Grafana

Deploy Grafana and point it at Prometheus as the data source.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
  --namespace e2e-analytics \
  --set persistence.enabled=true \
  --set persistence.size=5Gi \
  --set adminPassword="$(openssl rand -base64 16)" \
  --set "datasources.datasources\\.yaml.apiVersion=1" \
  --set "datasources.datasources\\.yaml.datasources[0].name=E2E Metrics" \
  --set "datasources.datasources\\.yaml.datasources[0].type=prometheus" \
  --set "datasources.datasources\\.yaml.datasources[0].url=http://prometheus-server.e2e-analytics.svc:80" \
  --set "datasources.datasources\\.yaml.datasources[0].access=proxy" \
  --set "datasources.datasources\\.yaml.datasources[0].isDefault=true" \
  --set securityContext.runAsNonRoot=true \
  --set securityContext.runAsUser=null \
  --set securityContext.fsGroup=null \
  --set containerSecurityContext.allowPrivilegeEscalation=false \
  --set "containerSecurityContext.capabilities.drop={ALL}" \
  --set containerSecurityContext.runAsNonRoot=true \
  --set containerSecurityContext.seccompProfile.type=RuntimeDefault \
  --set initChownData.enabled=false

OpenShift SCC note

As with Prometheus, these overrides clear the default runAsUser and fsGroup values. They also disable the init chown container, which tries to run as root. On non-OpenShift clusters these overrides are harmless.

Expose Grafana

On OpenShift:

oc -n e2e-analytics create route edge grafana --service=grafana --port=3000

Or port-forward for local access:

kubectl -n e2e-analytics port-forward svc/grafana 3000:80 &

Dashboard panels (PromQL)

Panel 1: Per-test flake rate (30-day rolling window).

# Since e2e_test_result is 1 (pass) or 0 (fail), sum_over_time counts passes
sort_desc(
  1 - (
    sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
    /
    sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
  )
)

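Select Table as the panel type and configure the following columns: Test Name, Flake Rate, and Total Runs. The query above returns only the flake rate per test. To populate Total Runs, add the count_over_time denominator as a second query and combine the two results on the test label.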

PromQL note

An earlier version of this query used count_over_time(e2e_test_result{...} == 1 [30d]) to count passes. This is invalid PromQL because the == 1 comparison produces an instant vector, and count_over_time requires a range vector selector. Because e2e_test_result uses 1/0 encoding, sum_over_time directly gives the pass count, making the query both correct and simpler.
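
To make the arithmetic concrete, here is a small Go sketch of what the query computes for one test. It is an illustration only: the function name is hypothetical, and Prometheus does this work server-side.

```go
package main

import "fmt"

// flakeRate mirrors 1 - sum_over_time/count_over_time for a single test.
// Each sample is 1 (pass) or 0 (fail), so the sum is the number of passes.
// The boolean is false when there are no samples, the case in which PromQL
// would return no series at all.
func flakeRate(samples []float64) (float64, bool) {
	if len(samples) == 0 {
		return 0, false
	}
	var passes float64
	for _, s := range samples {
		passes += s
	}
	return 1 - passes/float64(len(samples)), true
}

func main() {
	// 20 runs in the window: 13 passes followed by 7 failures.
	samples := make([]float64, 0, 20)
	for i := 0; i < 20; i++ {
		if i < 13 {
			samples = append(samples, 1)
		} else {
			samples = append(samples, 0)
		}
	}
	rate, ok := flakeRate(samples)
	fmt.Printf("flake rate %.2f (has data: %v)\n", rate, ok)
}
```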

Panel 2: Flake rate time series (per test, daily resolution).

# Daily flake rate for a specific test (use $test variable)
1 - (
  sum by (test) (sum_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
  /
  sum by (test) (count_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
)

Display as a Time Series panel. Add a threshold line at 0.2 (20%) to show the quarantine boundary.

Panel 3: Test health heatmap.

# Pass rate per test per day (for heatmap)
sum by (test) (sum_over_time(e2e_test_result{branch="main"}[1d]))
/
sum by (test) (count_over_time(e2e_test_result{branch="main"}[1d]))

Panel 4: Regression detection.

# Tests with 0% pass rate in the last 4 days (potential regression, not flake)
(
  sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d]))
  /
  sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))
) == 0

Tests matching this pattern that previously had a low flake rate are likely regressions: the code broke, not the test.

Panel 5: Test duration trends.

# Average duration per test over time
avg by (test) (avg_over_time(e2e_test_duration_seconds{branch="main"}[1d]))

Alert rules

Configure Grafana alerting to fire when a test crosses the quarantine threshold:

# Grafana alert rule (configured via UI or provisioning)
name: Test Flake Rate Exceeded
condition: flake_rate > 0.2
expr: |
  (
    1 - (
      sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
      /
      sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
    )
  ) > 0.2
  and
  sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d])) >= 10
for: 0m
labels:
  severity: warning
annotations:
  summary: "Test {{ $labels.test }} flake rate exceeded 20%"

Step 5: Build the quarantine controller (Go)

The quarantine controller queries Prometheus via PromQL, identifies flaky tests, excludes regressions, and outputs a quarantine config.

package main
import (
  "context"
  "encoding/json"
  "fmt"
  "net/http"
  "net/url"
  "os"
  "time"
)
const (
  flakeThreshold         = 0.20
  minRunsForDecision     = 10
  windowDays             = 30
  quarantineDurationDays = 30
)
type PromQueryResult struct {
  Status string `json:"status"`
  Data   struct {
    ResultType string `json:"resultType"`
    Result     []struct {
      Metric map[string]string `json:"metric"`
      Value  [2]interface{}    `json:"value"`
    } `json:"result"`
  } `json:"data"`
}
type QuarantineEntry struct {
  Name           string  `json:"name"`
  Reason         string  `json:"reason"`
  FlakeRate      float64 `json:"flake_rate"`
  TotalRuns      int     `json:"total_runs"`
  FailedRuns     int     `json:"failed_runs"`
  Jira           string  `json:"jira,omitempty"`
  QuarantinedAt  string  `json:"quarantined_at"`
  ReEnableAfter  string  `json:"re_enable_after"`
}
type QuarantineConfig struct {
  Version int                        `json:"version"`
  Updated string                     `json:"updated"`
  Tests   map[string]QuarantineEntry `json:"tests"`
}
func queryFlakeRates(ctx context.Context, promURL string) (map[string]float64, error) {
  query := fmt.Sprintf(`
    1 - (
      sum by (test) (sum_over_time(e2e_test_result{branch="main"}[%dd]))
      /
      sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))
    )
  `, windowDays, windowDays)
  resp, err := http.Get(fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)))
  if err != nil {
    return nil, fmt.Errorf("querying flake rates: %w", err)
  }
  defer resp.Body.Close()
  var result PromQueryResult
  if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
    return nil, fmt.Errorf("decoding response: %w", err)
  }
  rates := make(map[string]float64)
  for _, r := range result.Data.Result {
    testName := r.Metric["test"]
    // Value is [timestamp, "value_string"]
    if valStr, ok := r.Value[1].(string); ok {
      var val float64
      fmt.Sscanf(valStr, "%f", &val)
      rates[testName] = val
    }
  }
  return rates, nil
}
func queryRunCounts(ctx context.Context, promURL string) (map[string]int, error) {
  query := fmt.Sprintf(`sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))`, windowDays)
  resp, err := http.Get(fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)))
  if err != nil {
    return nil, fmt.Errorf("querying run counts: %w", err)
  }
  defer resp.Body.Close()
  var result PromQueryResult
  if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
    return nil, fmt.Errorf("decoding response: %w", err)
  }
  counts := make(map[string]int)
  for _, r := range result.Data.Result {
    testName := r.Metric["test"]
    if valStr, ok := r.Value[1].(string); ok {
      var val int
      fmt.Sscanf(valStr, "%d", &val)
      counts[testName] = val
    }
  }
  return counts, nil
}
func isRegression(ctx context.Context, promURL, testName string) (bool, error) {
  // A regression = 0% pass rate in recent window (all runs failed).
  // Uses sum_over_time/count_over_time instead of last_over_time subquery,
  // which is unreliable with high-cardinality build_id labels.
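  // %q quotes and escapes the test name so that quotes or backslashes in it
  // cannot break the PromQL string literal.
  query := fmt.Sprintf(
    `(sum(sum_over_time(e2e_test_result{test=%q, branch="main"}[4d])) /
    sum(count_over_time(e2e_test_result{test=%q, branch="main"}[4d])))`,
    testName, testName,
  )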
  resp, err := http.Get(fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)))
  if err != nil {
    return false, err
  }
  defer resp.Body.Close()
  var result PromQueryResult
  if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
    return false, err
  }
  // If the pass rate is 0, every recent run failed, so this is likely a regression.
  for _, r := range result.Data.Result {
    if valStr, ok := r.Value[1].(string); ok {
      var val float64
      fmt.Sscanf(valStr, "%f", &val)
      if val == 0 {
        return true, nil
      }
    }
  }
  return false, nil
}
func buildQuarantineConfig(ctx context.Context, promURL string) (*QuarantineConfig, error) {
  flakeRates, err := queryFlakeRates(ctx, promURL)
  if err != nil {
    return nil, err
  }
  runCounts, err := queryRunCounts(ctx, promURL)
  if err != nil {
    return nil, err
  }
  now := time.Now().UTC()
  cfg := &QuarantineConfig{
    Version: 1,
    Updated: now.Format(time.RFC3339),
    Tests:   make(map[string]QuarantineEntry),
  }
  for testName, rate := range flakeRates {
    runs := runCounts[testName]
    if rate < flakeThreshold || runs < minRunsForDecision {
      continue
    }
    regression, err := isRegression(ctx, promURL, testName)
    if err != nil {
      return nil, fmt.Errorf("checking regression for %s: %w", testName, err)
    }
    if regression {
      continue // Don't quarantine regressions
    }
    failedRuns := int(rate*float64(runs) + 0.5) // round to nearest; truncating a float product can undercount
    cfg.Tests[testName] = QuarantineEntry{
      Name:          testName,
      Reason:        fmt.Sprintf("Flake rate %.0f%% over %dd (%d/%d failed)", rate*100, windowDays, failedRuns, runs),
      FlakeRate:     rate,
      TotalRuns:     runs,
      FailedRuns:    failedRuns,
      QuarantinedAt: now.Format(time.RFC3339),
      ReEnableAfter: now.AddDate(0, 0, quarantineDurationDays).Format(time.RFC3339),
    }
  }
  return cfg, nil
}
func main() {
  promURL := os.Getenv("PROMETHEUS_URL")
  if promURL == "" {
    promURL = "http://prometheus-server.e2e-analytics.svc:80"
  }
  ctx := context.Background()
  cfg, err := buildQuarantineConfig(ctx, promURL)
  if err != nil {
    fmt.Fprintf(os.Stderr, "error: %v\n", err)
    os.Exit(1)
  }
  data, _ := json.MarshalIndent(cfg, "", "  ")
  fmt.Println(string(data))
}
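
The query helpers above ignore the status field of the response and parse values with fmt.Sscanf, which silently leaves a zero on failure. A stricter decoder might look like the sketch below. The names promVector and parseVector are hypothetical; the response shape is the standard Prometheus instant-query JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// promVector models the parts of an instant-query response used here.
type promVector struct {
	Status string `json:"status"`
	Error  string `json:"error"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// parseVector decodes a response body into a map keyed by the given label.
// Unlike fmt.Sscanf, it reports malformed values instead of storing zero.
func parseVector(body []byte, label string) (map[string]float64, error) {
	var v promVector
	if err := json.Unmarshal(body, &v); err != nil {
		return nil, fmt.Errorf("decoding response: %w", err)
	}
	if v.Status != "success" {
		return nil, fmt.Errorf("query failed: %s", v.Error)
	}
	out := make(map[string]float64, len(v.Data.Result))
	for _, r := range v.Data.Result {
		// Each value is [timestamp, "value_string"].
		s, ok := r.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("unexpected value type for %q", r.Metric[label])
		}
		f, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", s, err)
		}
		out[r.Metric[label]] = f
	}
	return out, nil
}

func main() {
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
	  {"metric":{"test":"TestOperator/dashboard"},"value":[1765889560,"0.35"]}]}}`)
	rates, err := parseVector(body, "test")
	fmt.Println(rates, err)
}
```

Both queryFlakeRates and queryRunCounts could share this decoder, converting the float to an int for run counts.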

Deploy as a daily CronJob (yaml):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: quarantine-controller
  namespace: e2e-analytics
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: controller
            image: quay.io/your-org/quarantine-controller:latest
            env:
            - name: PROMETHEUS_URL
              value: "http://prometheus-server.e2e-analytics.svc:80"
            - name: JIRA_TOKEN
              valueFrom:
                secretKeyRef:
                  name: jira-credentials
                  key: token
            - name: JIRA_SERVER
              value: "https://redhat.atlassian.net"
            - name: JIRA_PROJECT
              value: "PROJECT"
            - name: GIT_REPO
              value: "repository"
            - name: GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: github-credentials
                  key: token
          restartPolicy: OnFailure

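The full controller performs these steps (the listing above covers the first three; the Jira and Git integration is not shown):

Queries Prometheus for flake rates via PromQL
Excludes regressions (tests whose runs all failed in the last four days)
Outputs a quarantine JSON config
Creates Jira tickets for newly quarantined tests
Commits the config to Git (pull request or direct push)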

Step 6: Wire the test runner

Your E2E test runner loads the quarantine config and skips active entries. The quarantine controller exports this JSON:

{
  "version": 1,
  "updated": "2026-06-09T06:00:00Z",
  "tests": {
    "TestOperator/components/group_1/dashboard/validate_config": {
      "name": "TestOperator/components/group_1/dashboard/validate_config",
      "reason": "Flake rate 35% over 30d (7/20 failed)",
      "flake_rate": 0.35,
      "total_runs": 20,
      "failed_runs": 7,
      "jira": "JIRA-60123",
      "quarantined_at": "2026-05-15T06:00:00Z",
      "re_enable_after": "2026-06-14T06:00:00Z"
    }
  }
}

At test startup, load the config and build a skip regex (Go):

func buildSkipRegex(cfg *QuarantineConfig) string {
    var patterns []string
    for name := range cfg.Tests {
        segments := strings.Split(name, "/")
        escaped := make([]string, len(segments))
        for i, seg := range segments {
            escaped[i] = "^" + regexp.QuoteMeta(seg) + "$"
        }
        patterns = append(patterns, strings.Join(escaped, "/"))
    }
    return strings.Join(patterns, "|")
}

Pass the result to go test -skip (bash):

SKIP_REGEX=$(quarantine-tool build-skip-regex --config tests/e2e/quarantine.json)
go test ./tests/e2e/... \
  -v -timeout 60m \
  -skip "$SKIP_REGEX"

Step 7: Close the feedback loop

The system is self-healing by design:

Quarantined tests expire. After quarantineDurationDays (30 days in the controller above), the entry's re_enable_after date passes, the entry is removed, and the test runs again in CI.
If the test is still flaky, the next analysis cycle re-quarantines it (with a fresh Jira ticket reference).
If someone fixes the test, it passes consistently and is never re-quarantined.
Jira resolution check: The controller queries Jira for resolved tickets and proactively un-quarantines those tests early.

The controller's cleanup logic (runs every cycle):

func cleanupExpired(cfg *QuarantineConfig) {
    now := time.Now().UTC()
    for name, entry := range cfg.Tests {
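        expiry, err := time.Parse(time.RFC3339, entry.ReEnableAfter)
        if err != nil {
            // A malformed date would otherwise parse as the zero time and
            // silently drop the entry; keep it for a human to review.
            continue
        }
        if now.After(expiry) {
            delete(cfg.Tests, name)
        }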
    }
}
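
The Jira resolution check is described above but not shown. One way to structure it is behind a small interface, so the release logic can be tested without a Jira server. Everything named here (ticketChecker, releaseResolved, fakeChecker) is hypothetical, and a real implementation would wrap whichever Jira client you use.

```go
package main

import (
	"fmt"
	"sort"
)

// quarantined holds the one field of QuarantineEntry this sketch needs.
type quarantined struct {
	Jira string
}

// ticketChecker is a hypothetical abstraction over a Jira client.
type ticketChecker interface {
	IsResolved(key string) (bool, error)
}

// releaseResolved removes entries whose linked ticket is resolved and
// returns the names of the tests it released, sorted for stable output.
func releaseResolved(tests map[string]quarantined, tc ticketChecker) ([]string, error) {
	var released []string
	for name, e := range tests {
		if e.Jira == "" {
			continue // no ticket linked; only expiry can release this test
		}
		done, err := tc.IsResolved(e.Jira)
		if err != nil {
			return released, fmt.Errorf("checking %s: %w", e.Jira, err)
		}
		if done {
			delete(tests, name) // deleting during range is safe in Go
			released = append(released, name)
		}
	}
	sort.Strings(released)
	return released, nil
}

// fakeChecker stands in for a real client in this example.
type fakeChecker map[string]bool

func (f fakeChecker) IsResolved(key string) (bool, error) { return f[key], nil }

func main() {
	tests := map[string]quarantined{
		"TestA": {Jira: "PROJ-1"},
		"TestB": {Jira: "PROJ-2"},
	}
	released, err := releaseResolved(tests, fakeChecker{"PROJ-1": true})
	fmt.Println(released, len(tests), err)
}
```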

Step 8: Add PR visibility

Add a CI check that posts a comment on every pull request (PR) showing the current quarantine status (yaml):

name: Quarantine Status
on:
  pull_request:
    branches: [main]
jobs:
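  quarantine-status:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - name: Post quarantine table
        env:
          GH_TOKEN: ${{ github.token }}  # the gh CLI needs a token to post the comment
        run: |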
          CONFIG="tests/e2e/quarantine.json"
          COUNT=$(jq '.tests | length' "$CONFIG")
          if [ "$COUNT" -eq 0 ]; then exit 0; fi
          echo "## Quarantined E2E Tests" > /tmp/comment.md
          echo "**${COUNT}** tests are quarantined and will be skipped." >> /tmp/comment.md
          echo "" >> /tmp/comment.md
          echo "| Test | Jira | Flake Rate | Expires |" >> /tmp/comment.md
          echo "|------|------|-----------|---------|" >> /tmp/comment.md
          jq -r '.tests[] | "| \(.name) | \(.jira // "-") | \(.flake_rate * 100 | floor)% | \(.re_enable_after // "-") |"' \
            "$CONFIG" >> /tmp/comment.md
          gh pr comment "${{ github.event.number }}" --body-file /tmp/comment.md

Step 9: Connect to your CI pipeline

The system above is useful only when real test results flow into it. This section shows three ways to feed it: OpenShift CI (Prow), Konflux and Tekton pipelines, and local or ad-hoc runs.

Option A: OpenShift CI (Prow)

OpenShift CI stores all job artifacts in a public GCS bucket (test-platform-results). After every E2E run, Prow uploads $ARTIFACT_DIR contents to:

gs://test-platform-results/
  logs/{periodic-job-name}/{build_id}/                # periodic jobs
  pr-logs/pull/{org}_{repo}/{pr}/{job}/{build_id}/    # presubmit jobs

Your test runner must produce JUnit XML inside $ARTIFACT_DIR. Most Go test harnesses support this via gotestsum --junitfile or a wrapper that converts go test -json output. If you use a Makefile, a common pattern is:

makefile
ifdef ARTIFACT_DIR
export JUNIT_OUTPUT_PATH = ${ARTIFACT_DIR}/junit_report.xml
endif

The ingester CronJob (Step 3) scrapes this bucket using the GCS JSON API (no auth needed for public buckets):

# List recent builds for a job
curl -s "https://storage.googleapis.com/storage/v1/b/test-platform-results/o?\
prefix=logs/periodic-ci-my-org-my-operator-main-e2e/&delimiter=/"
# Download JUnit from a specific build
curl -s "https://storage.googleapis.com/test-platform-results/\
logs/{job}/{build_id}/artifacts/{workflow}/e2e/artifacts/junit_report.xml"
# Get run metadata (timestamp, commit SHA)
curl -s "https://storage.googleapis.com/test-platform-results/\
logs/{job}/{build_id}/started.json"
# {"timestamp":1765889560, "repo-commit":"abc123f", ...}
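
On the Go side, the ingester has to decode these two responses. The sketch below assumes the listing response carries the per-build directories in its prefixes array, which is how the GCS JSON API reports delimiter matches, and that started.json has the two fields shown above. Function and type names are illustrative, and pagination (nextPageToken) is not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"time"
)

// gcsListing models the part of an objects.list response used here.
type gcsListing struct {
	Prefixes []string `json:"prefixes"`
}

// startedJSON models the fields of started.json shown in the example above.
type startedJSON struct {
	Timestamp  int64  `json:"timestamp"`
	RepoCommit string `json:"repo-commit"`
}

// buildIDs extracts the last path segment of each prefix, for example
// "logs/my-job/101/" yields "101".
func buildIDs(listing []byte) ([]string, error) {
	var l gcsListing
	if err := json.Unmarshal(listing, &l); err != nil {
		return nil, fmt.Errorf("decoding listing: %w", err)
	}
	ids := make([]string, 0, len(l.Prefixes))
	for _, p := range l.Prefixes {
		ids = append(ids, path.Base(p)) // path.Base ignores the trailing slash
	}
	return ids, nil
}

// parseStarted returns the run start time and commit SHA from started.json.
func parseStarted(body []byte) (time.Time, string, error) {
	var s startedJSON
	if err := json.Unmarshal(body, &s); err != nil {
		return time.Time{}, "", fmt.Errorf("decoding started.json: %w", err)
	}
	return time.Unix(s.Timestamp, 0).UTC(), s.RepoCommit, nil
}

func main() {
	ids, _ := buildIDs([]byte(`{"prefixes":["logs/my-job/100/","logs/my-job/101/"]}`))
	fmt.Println(ids)
	ts, sha, _ := parseStarted([]byte(`{"timestamp":1765889560,"repo-commit":"abc123f"}`))
	fmt.Println(ts.Format(time.RFC3339), sha)
}
```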

The ingester maps GCS metadata to Prometheus labels:

Prometheus label	GCS source
`test`	`<testcase name="...">` in `junit_report.xml`
`suite`	`<testsuite name="...">` in `junit_report.xml`
`job`	Path segment (the Prow job name)
`build_id`	Path segment (numeric build ID)
`commit_sha`	`started.json` `repo-commit`
`branch`	`main` for periodics. PR number for presubmits

Important: For accurate flake detection, scrape periodic jobs (which run on main without code changes), not presubmit jobs (which mix test flakes with actual regressions introduced by PRs).
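
The path-derived labels in the table can be computed with a small parser. This is a sketch under the two path layouts shown earlier; the runLabels type and labelsFromPath function are illustrative names, and other layouts are rejected rather than guessed.

```go
package main

import (
	"fmt"
	"strings"
)

// runLabels holds the labels that come from the artifact path alone.
type runLabels struct {
	Job     string
	BuildID string
	Branch  string
}

// labelsFromPath maps a build directory to labels:
//   logs/{job}/{build_id}/                            -> branch "main"
//   pr-logs/pull/{org}_{repo}/{pr}/{job}/{build_id}/  -> branch is the PR number
func labelsFromPath(p string) (runLabels, error) {
	parts := strings.Split(strings.Trim(p, "/"), "/")
	switch {
	case len(parts) == 3 && parts[0] == "logs":
		return runLabels{Job: parts[1], BuildID: parts[2], Branch: "main"}, nil
	case len(parts) == 6 && parts[0] == "pr-logs" && parts[1] == "pull":
		return runLabels{Job: parts[4], BuildID: parts[5], Branch: parts[3]}, nil
	}
	return runLabels{}, fmt.Errorf("unrecognized artifact path %q", p)
}

func main() {
	l, err := labelsFromPath("logs/periodic-ci-operator-main-e2e/1234567890/")
	fmt.Printf("%+v %v\n", l, err)
}
```

Filtering on the logs/ prefix before ingesting is one way to enforce the periodic-only rule above.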

Option B: Konflux and Tekton pipelines

Konflux uses Tekton pipelines. The integration approach is a post-task in your E2E pipeline that pushes results directly, so no GCS scraping is needed.

Define a Tekton Task that pushes the results after the E2E tests finish:

# Tekton task that pushes JUnit results to Prometheus after E2E tests
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: push-test-metrics
  namespace: e2e-analytics
spec:
  params:
    - name: junit-path
      description: Path to JUnit XML file
    - name: job-name
      description: Pipeline/job identifier
    - name: build-id
      description: PipelineRun UID or build number
    - name: commit-sha
      description: Git commit SHA
  steps:
    - name: push-metrics
      image: quay.io/your-org/junit-ingester:latest
      env:
        - name: REMOTE_WRITE_ENDPOINT
          value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
      command:
        - /junit-ingester
        - --file=$(params.junit-path)
        - --job=$(params.job-name)
        - --build-id=$(params.build-id)
        - --commit-sha=$(params.commit-sha)
        - --branch=main

Wire it into your E2E pipeline as a finally task (runs whether tests pass or fail):

apiVersion: tekton.dev/v1
kind: Pipeline
spec:
  tasks:
    - name: run-e2e
      taskRef:
        name: e2e-tests
      # ... test config ...
  finally:
    - name: push-metrics
      taskRef:
        name: push-test-metrics
      params:
        - name: junit-path
          value: "$(tasks.run-e2e.results.junit-path)"
        - name: job-name
          value: "konflux-my-operator-e2e"
        - name: build-id
          value: "$(context.pipelineRun.uid)"
        - name: commit-sha
          value: "$(params.git-revision)"

The advantage over GCS scraping: results arrive in Prometheus within seconds of the test run completing, not on a four-hour CronJob schedule.

Option C: Local or ad-hoc runs

For testing the system or running one-off analyses, you can push results from a local make e2e-test run:

# Run E2E tests with JUnit output
ARTIFACT_DIR=/tmp/e2e-results make e2e-test
# Push results to Prometheus (via port-forward or in-cluster)
kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &
/path/to/junit-ingester \
  --file /tmp/e2e-results/junit_report.xml \
  --job "local-e2e" \
  --build-id "$(date +%s)" \
  --commit-sha "$(git rev-parse HEAD)" \
  --branch "$(git rev-parse --abbrev-ref HEAD)" \
  --remote-write-endpoint http://localhost:9090/api/v1/write

This is useful for validating the pipeline end-to-end before deploying the CronJob or Tekton task.
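
The Tekton task and the local command both drive the ingester through command-line flags, which the Step 3 listing does not define. A sketch of the flag handling, using only the standard flag package, might look like this. The ingestConfig type, the parseArgs function, and the choice to default the endpoint from REMOTE_WRITE_ENDPOINT are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// ingestConfig collects the options used in the Tekton and local examples.
type ingestConfig struct {
	File, Job, BuildID, CommitSHA, Branch, Endpoint string
}

// parseArgs parses flags; getenv is injected so the function is testable.
// The flag package accepts both -name and --name, with or without "=".
func parseArgs(args []string, getenv func(string) string) (ingestConfig, error) {
	var c ingestConfig
	fs := flag.NewFlagSet("junit-ingester", flag.ContinueOnError)
	fs.StringVar(&c.File, "file", "", "path to the JUnit XML report")
	fs.StringVar(&c.Job, "job", "", "job name used for the job label")
	fs.StringVar(&c.BuildID, "build-id", "", "build identifier")
	fs.StringVar(&c.CommitSHA, "commit-sha", "", "Git commit SHA")
	fs.StringVar(&c.Branch, "branch", "main", "branch label")
	fs.StringVar(&c.Endpoint, "remote-write-endpoint",
		getenv("REMOTE_WRITE_ENDPOINT"), "Prometheus remote-write URL")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.File == "" || c.Endpoint == "" {
		return c, fmt.Errorf("--file and a remote-write endpoint are required")
	}
	return c, nil
}

func main() {
	// A real binary would pass os.Args[1:]; fixed arguments keep this
	// example runnable as-is.
	args := []string{
		"--file", "/tmp/e2e-results/junit_report.xml",
		"--job", "local-e2e",
		"--remote-write-endpoint", "http://localhost:9090/api/v1/write",
	}
	cfg, err := parseArgs(args, os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```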

Why exclude regressions from quarantine?

A regression means the code broke. Quarantining the test hides the bug. The system detects regressions by looking for a step-function pattern: mostly passing before a specific commit, then consistently failing after. These are flagged in Grafana but never auto-quarantined.
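
The step-function pattern can be checked directly on a test's chronological results. The sketch below is one possible heuristic, not the controller's implementation (which uses the simpler four-day pass-rate query). Both thresholds are assumptions you would tune.

```go
package main

import "fmt"

// looksLikeRegression reports whether results (oldest first, true = pass)
// end in an unbroken run of at least minTrailingFailures failures that is
// preceded by a history with a pass rate of at least minPriorPassRate.
func looksLikeRegression(results []bool, minTrailingFailures int, minPriorPassRate float64) bool {
	// Count the unbroken run of failures at the end of the history.
	trailing := 0
	for i := len(results) - 1; i >= 0 && !results[i]; i-- {
		trailing++
	}
	prior := results[:len(results)-trailing]
	if trailing < minTrailingFailures || len(prior) == 0 {
		return false // too few failures, or no earlier history to compare with
	}
	passes := 0
	for _, ok := range prior {
		if ok {
			passes++
		}
	}
	return float64(passes)/float64(len(prior)) >= minPriorPassRate
}

func main() {
	step := []bool{true, true, true, true, true, true, true, true, false, false, false, false}
	flaky := []bool{true, false, true, true, false, true, false, true, true, false}
	fmt.Println(looksLikeRegression(step, 3, 0.9))  // healthy history, then only failures
	fmt.Println(looksLikeRegression(flaky, 3, 0.9)) // failures scattered through the history
}
```

A test that has never passed has no prior history, so this heuristic does not classify it; the sketch leaves that case for a human to triage.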

Why automatic expiry?

Without expiry, quarantined tests become permanent exclusions. The re_enable_after field forces accountability: either fix the test within the window, or it returns to CI and gets re-evaluated. This prevents the quarantine list from growing unbounded.

Grafana dashboard layout

Organize your dashboard into four rows:

Row 1: Overview
- Stat panel: total tests, quarantined count, overall suite pass rate
- Pie chart: healthy / flaky / regression breakdown
- PromQL: count(count by (test) (e2e_test_result{branch="main"})) for total tests
Row 2: Flake leaderboard
- Table: top 20 flakiest tests with rates, run counts
- Time series: flake rate trend for selected test (variable dropdown)
- PromQL: see Panel 1 and Panel 2 above
Row 3: Regressions
- Table: tests where all recent runs failed (0% pass rate in last four days)
- PromQL: (sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d])) / sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))) == 0
Row 4: Quarantine management
- Table: loaded from quarantine JSON (or an e2e_quarantine_active metric the controller pushes)
- Stat panel: tests expiring in next seven days
- Log panel: quarantine/un-quarantine events timeline

Operational runbook

Follow these standard procedures to triage skipped tests, investigate failure causes, and manage the lifecycle of your quarantined suite.

A test was quarantined. What do I do?

Check the Jira ticket linked in the quarantine entry.
Open the Grafana dashboard, select the test from the dropdown, and look at the time-series panel.
Identify the pattern: is it an intermittent flake (random), or did it start at a specific commit?
Fix the test, verify it passes in three or more consecutive runs, and close the Jira ticket.
The controller will un-quarantine it on the next cycle.

How do I swap the data store?

Because everything conforms to the Prometheus protocol, swapping is a config change:

Ingester: Point REMOTE_WRITE_ENDPOINT at the new endpoint (such as Thanos receiver or Mimir).
Grafana: Update the data source URL.
Quarantine controller: Update PROMETHEUS_URL.

All PromQL queries, dashboards, and alert rules work unchanged. That's the point of conforming to the standard.

Moving from reactive debugging to data-driven pipelines

Automating your test quarantine system moves your development team away from reactive troubleshooting and toward a data-driven pipeline. Backing your test infrastructure with Prometheus metrics provides clear historical trends to help differentiate between intermittent flakiness and true code regressions before a broken pull request blocks your main branch. This self-healing loop isolates broken tests early, forcing accountability through explicit expiry dates while directly reducing manual developer toil and increasing team velocity.
