If you run end-to-end (E2E) tests on a Kubernetes operator, you've seen the pattern: a test that passes 80% of the time still fails often enough to block continuous integration (CI), waste developer hours, and train your team to reflexively /retest. Without historical data, you can't distinguish a flaky test from a regression. Without automation, the only remedy is a human noticing and filing a ticket.
This guide shows you how to build a complete quarantine system backed by a Prometheus-compatible time-series database and Grafana, running on a long-lived cluster that provides continuous observability into your test suite's health.
What you'll build
A Grafana dashboard showing per-test health with automated quarantine decisions, Jira ticket creation, and a self-healing feedback loop, all powered by industry-standard Prometheus metrics.
Prerequisites:
- A long-lived OpenShift/Kubernetes cluster (or any cluster that stays up)
- Periodic E2E test runs producing JUnit XML (Prow periodic jobs or scheduled GitHub Actions)
- helm and kubectl or oc access to the cluster
- A Jira project for tracking quarantined tests (optional but recommended)
Step 1: Deploy Prometheus
Deploy a dedicated Prometheus instance for test analytics. On OpenShift you might already have a cluster monitoring stack, but a separate instance keeps test data isolated and gives you control over retention.
bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus \
--namespace e2e-analytics --create-namespace \
--set server.retention=90d \
--set server.persistentVolume.size=20Gi \
--set server.resources.requests.memory=512Mi \
--set server.resources.requests.cpu=250m \
--set alertmanager.enabled=false \
--set prometheus-node-exporter.enabled=false \
--set kube-state-metrics.enabled=false \
--set prometheus-pushgateway.enabled=false \
--set 'server.extraFlags[0]=web.enable-remote-write-receiver' \
--set 'serverFiles.prometheus\.yml.storage.tsdb.out_of_order_time_window=720h' \
--set server.securityContext.runAsNonRoot=true \
--set server.securityContext.runAsUser=null \
--set server.securityContext.fsGroup=null \
--set server.containerSecurityContext.allowPrivilegeEscalation=false \
--set "server.containerSecurityContext.capabilities.drop={ALL}" \
--set server.containerSecurityContext.runAsNonRoot=true \
--set server.containerSecurityContext.seccompProfile.type=RuntimeDefault
The --web.enable-remote-write-receiver flag enables the remote-write endpoint so the ingester can push data in. The out_of_order_time_window storage setting lets the ingester backfill historical data, which is required when you first load past results.
OpenShift SCC note
The default Prometheus Helm chart sets runAsUser: 65534 and fsGroup: 65534, which are rejected by OpenShift's restricted-v2 SCC. These securityContext overrides clear these defaults so the container runs with the UID assigned by OpenShift.
Verify it's running:
kubectl -n e2e-analytics get pods -l app.kubernetes.io/name=prometheus
kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &
Test the query endpoint:
curl -s 'http://localhost:9090/api/v1/query?query=up'
The primary endpoints include:
- Remote-write ingest: http://prometheus-server.e2e-analytics.svc:80/api/v1/write
- PromQL query: http://prometheus-server.e2e-analytics.svc:80/api/v1/query
- Range query: http://prometheus-server.e2e-analytics.svc:80/api/v1/query_range
Alternative (OpenShift)
If you're on OpenShift 4.x, you can use the built-in user workload monitoring instead. Enable it in the cluster-monitoring-config ConfigMap, and your metrics are automatically available via the Thanos Querier at https://thanos-querier.openshift-monitoring.svc:9091. This gives you Prometheus without deploying anything extra.
Step 2: Define the metric schema
Instead of SQL tables, we define Prometheus metrics with labels. This is the interface contract: any component that writes these metrics (GCS scraper, push gateway, future sources) is compatible.
Metrics:
| Metric name | Type | Labels | Description |
|---|---|---|---|
| e2e_test_result | Gauge (0/1) | test, suite, job, build_id, commit_sha, branch | 1 = passed, 0 = failed |
| e2e_test_duration_seconds | Gauge | test, suite, job, build_id, branch | Test execution duration |
| e2e_test_error_info | Gauge (1) | test, suite, error_category, error_message | Error classification (info metric) |
Label schema:
e2e_test_result{
  test="TestOperator/components/group_1/dashboard/validate_config",
  suite="e2e-operator",
  job="periodic-ci-operator-main-e2e",
  build_id="1234567890",
  commit_sha="abc123f",
  branch="main"
} 0 # 0 = failed, 1 = passed
Each test execution produces one e2e_test_result sample per test case. The sample timestamp is the run time. This gives a time series of pass/fail values per test that PromQL can aggregate over any window.
Step 3: Build the JUnit ingester
Create a Go binary that parses JUnit XML, converts results to Prometheus metrics, and pushes them via remote-write to Prometheus.
go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"net/http"
	"os"
	"time"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// Note: prompb types use gogoproto and have their own Marshal() method.
// Do NOT use google.golang.org/protobuf/proto: it requires ProtoReflect(),
// which gogoproto types don't implement.

type JUnitTestSuite struct {
	XMLName    xml.Name        `xml:"testsuite"`
	Name       string          `xml:"name,attr"`
	Timestamp  string          `xml:"timestamp,attr"`
	TestCases  []JUnitTestCase `xml:"testcase"`
	Properties []Property      `xml:"properties>property"`
}

type JUnitTestCase struct {
	Name    string        `xml:"name,attr"`
	Time    float64       `xml:"time,attr"`
	Failure *JUnitFailure `xml:"failure"`
	Error   *JUnitFailure `xml:"error"`
}

type JUnitFailure struct {
	Message string `xml:"message,attr"`
	Body    string `xml:",chardata"`
}

type Property struct {
	Name  string `xml:"name,attr"`
	Value string `xml:"value,attr"`
}

// junitToTimeSeries converts one parsed JUnit suite into remote-write series.
// extractProperty and parseTimestamp are helpers that this listing does not show.
func junitToTimeSeries(suite JUnitTestSuite, prowJob, buildID string) []prompb.TimeSeries {
	commitSHA := extractProperty(suite.Properties, "commit.sha")
	branch := extractProperty(suite.Properties, "branch")
	if branch == "" {
		branch = "main"
	}
	runTS := parseTimestamp(suite.Timestamp)
	tsMs := runTS.UnixMilli()

	var series []prompb.TimeSeries
	for _, tc := range suite.TestCases {
		passed := tc.Failure == nil && tc.Error == nil
		var resultValue float64 // stays 0 when the test failed
		if passed {
			resultValue = 1
		}

		// e2e_test_result metric
		series = append(series, prompb.TimeSeries{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "e2e_test_result"},
				{Name: "test", Value: tc.Name},
				{Name: "suite", Value: suite.Name},
				{Name: "job", Value: prowJob},
				{Name: "build_id", Value: buildID},
				{Name: "commit_sha", Value: commitSHA},
				{Name: "branch", Value: branch},
			},
			Samples: []prompb.Sample{
				{Value: resultValue, Timestamp: tsMs},
			},
		})

		// e2e_test_duration_seconds metric
		series = append(series, prompb.TimeSeries{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "e2e_test_duration_seconds"},
				{Name: "test", Value: tc.Name},
				{Name: "suite", Value: suite.Name},
				{Name: "job", Value: prowJob},
				{Name: "build_id", Value: buildID},
				{Name: "branch", Value: branch},
			},
			Samples: []prompb.Sample{
				{Value: tc.Time, Timestamp: tsMs},
			},
		})
	}
	return series
}

func remoteWrite(endpoint string, series []prompb.TimeSeries) error {
	req := &prompb.WriteRequest{Timeseries: series}
	data, err := req.Marshal()
	if err != nil {
		return fmt.Errorf("marshaling write request: %w", err)
	}
	// Remote-write bodies are snappy-compressed protobuf.
	compressed := snappy.Encode(nil, data)

	httpReq, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(compressed))
	if err != nil {
		return fmt.Errorf("creating request: %w", err)
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return fmt.Errorf("sending remote write: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
		return fmt.Errorf("remote write returned %d", resp.StatusCode)
	}
	return nil
}
Deploy as a CronJob that fetches JUnit artifacts from your Google Cloud Storage (GCS) bucket:
yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: junit-ingester
  namespace: e2e-analytics
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: ingester
            image: quay.io/your-org/junit-ingester:latest
            env:
            - name: REMOTE_WRITE_ENDPOINT
              value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
            - name: GCS_BUCKET
              value: "test-platform-results"
            - name: PROW_JOB
              value: "periodic-ci-operator-main-e2e"
          restartPolicy: OnFailure
Quick validation
After the first ingestion, verify data is flowing.
Query for any test results:
curl -s 'http://localhost:9090/api/v1/query?query=e2e_test_result' | jq '.data.result | length'
Check a specific test (-G with --data-urlencode keeps curl from treating the braces in the selector as URL globbing):
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=e2e_test_result{test=~".*dashboard.*"}' | jq .
Step 4: Set up Grafana
Deploy Grafana and point it at Prometheus as the data source.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
--namespace e2e-analytics \
--set persistence.enabled=true \
--set persistence.size=5Gi \
--set adminPassword="$(openssl rand -base64 16)" \
--set "datasources.datasources\\.yaml.apiVersion=1" \
--set "datasources.datasources\\.yaml.datasources[0].name=E2E Metrics" \
--set "datasources.datasources\\.yaml.datasources[0].type=prometheus" \
--set "datasources.datasources\\.yaml.datasources[0].url=http://prometheus-server.e2e-analytics.svc:80" \
--set "datasources.datasources\\.yaml.datasources[0].access=proxy" \
--set "datasources.datasources\\.yaml.datasources[0].isDefault=true" \
--set securityContext.runAsNonRoot=true \
--set securityContext.runAsUser=null \
--set securityContext.fsGroup=null \
--set containerSecurityContext.allowPrivilegeEscalation=false \
--set "containerSecurityContext.capabilities.drop={ALL}" \
--set containerSecurityContext.runAsNonRoot=true \
--set containerSecurityContext.seccompProfile.type=RuntimeDefault \
--set initChownData.enabled=false
OpenShift SCC note
As with Prometheus, these overrides clear the default runAsUser and fsGroup values and disable the init chown container, which tries to run as root. On non-OpenShift clusters these overrides are harmless.
Expose Grafana
On OpenShift:
oc -n e2e-analytics create route edge grafana --service=grafana --port=3000
Or port-forward for local access:
kubectl -n e2e-analytics port-forward svc/grafana 3000:80 &
Dashboard panels (PromQL)
Panel 1: Per-test flake rate (30-day rolling window).
# Since e2e_test_result is 1 (pass) or 0 (fail), sum_over_time counts passes
sort_desc(
  1 - (
    sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
    /
    sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
  )
)
Select Table as the panel type and configure the following columns: Test Name, Flake Rate, and Total Runs.
PromQL note
An earlier version of this query used count_over_time(e2e_test_result{...} == 1 [30d]) to count passes. This is invalid PromQL because the == 1 comparison produces an instant vector, and count_over_time requires a range vector selector. Because e2e_test_result uses 1/0 encoding, sum_over_time directly gives the pass count, making the query both correct and simpler.
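The arithmetic behind the flake-rate query can be checked outside Prometheus. The sketch below mirrors 1 - sum_over_time / count_over_time for a slice of 0/1 samples; the function name flakeRate is hypothetical and exists only for illustration.

```go
package main

import "fmt"

// flakeRate mirrors 1 - sum_over_time/count_over_time for 0/1 samples,
// where 1 means the run passed and 0 means it failed.
// ok is false when there are no samples, since the rate is undefined.
func flakeRate(samples []float64) (rate float64, ok bool) {
	if len(samples) == 0 {
		return 0, false
	}
	var passes float64
	for _, s := range samples {
		passes += s // summing 0/1 values counts the passes
	}
	return 1 - passes/float64(len(samples)), true
}

func main() {
	// 20 runs: 13 passes followed by 7 failures.
	samples := make([]float64, 20)
	for i := 0; i < 13; i++ {
		samples[i] = 1
	}
	rate, _ := flakeRate(samples)
	fmt.Printf("flake rate: %.2f\n", rate) // prints "flake rate: 0.35"
}
```

With 13 passes out of 20 runs, the pass rate is 0.65 and the flake rate is 0.35, which is above the 20% quarantine boundary used later in this guide.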
Panel 2: Flake rate time series (per test, daily resolution).
# Daily flake rate for a specific test (use $test variable)
1 - (
  sum by (test) (sum_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
  /
  sum by (test) (count_over_time(e2e_test_result{test="$test", branch="main"}[1d]))
)
Display as a Time Series panel. Add a threshold line at 0.2 (20%) to show the quarantine boundary.
Panel 3: Test health heatmap.
# Pass rate per test per day (for heatmap)
sum by (test) (sum_over_time(e2e_test_result{branch="main"}[1d]))
/
sum by (test) (count_over_time(e2e_test_result{branch="main"}[1d]))
Panel 4: Regression detection.
# Tests with 0% pass rate in the last 4 days (potential regression, not flake)
(
  sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d]))
  /
  sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))
) == 0
Tests matching this pattern that previously had a low flake rate are regressions: the code broke, not the test.
Panel 5: Test duration trends.
# Average duration per test over time
avg by (test) (avg_over_time(e2e_test_duration_seconds{branch="main"}[1d]))
Alert rules
Configure Grafana alerting to fire when a test crosses the quarantine threshold:
# Grafana alert rule (configured via UI or provisioning)
name: Test Flake Rate Exceeded
condition: flake_rate > 0.2
expr: |
  (
    1 - (
      sum by (test) (sum_over_time(e2e_test_result{branch="main"}[30d]))
      /
      sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d]))
    )
  ) > 0.2
  and
  sum by (test) (count_over_time(e2e_test_result{branch="main"}[30d])) >= 10
for: 0m
labels:
  severity: warning
annotations:
  summary: "Test {{ $labels.test }} flake rate exceeded 20%"
Step 5: Build the quarantine controller (Go)
The quarantine controller queries Prometheus via PromQL, identifies flaky tests, excludes regressions, and outputs a quarantine config.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
	"time"
)

const (
	flakeThreshold         = 0.20
	minRunsForDecision     = 10
	windowDays             = 30
	quarantineDurationDays = 30
)

// PromQueryResult is the subset of the Prometheus instant-query response
// that the controller reads.
type PromQueryResult struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

type QuarantineEntry struct {
	Name          string  `json:"name"`
	Reason        string  `json:"reason"`
	FlakeRate     float64 `json:"flake_rate"`
	TotalRuns     int     `json:"total_runs"`
	FailedRuns    int     `json:"failed_runs"`
	Jira          string  `json:"jira,omitempty"`
	QuarantinedAt string  `json:"quarantined_at"`
	ReEnableAfter string  `json:"re_enable_after"`
}

type QuarantineConfig struct {
	Version int                        `json:"version"`
	Updated string                     `json:"updated"`
	Tests   map[string]QuarantineEntry `json:"tests"`
}

func queryFlakeRates(ctx context.Context, promURL string) (map[string]float64, error) {
	query := fmt.Sprintf(`
		1 - (
			sum by (test) (sum_over_time(e2e_test_result{branch="main"}[%dd]))
			/
			sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))
		)
	`, windowDays, windowDays)

	// Bind the request to ctx so callers can cancel or time out the query.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)), nil)
	if err != nil {
		return nil, fmt.Errorf("building flake rate request: %w", err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("querying flake rates: %w", err)
	}
	defer resp.Body.Close()

	var result PromQueryResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, fmt.Errorf("decoding response: %w", err)
	}

	rates := make(map[string]float64)
	for _, r := range result.Data.Result {
		testName := r.Metric["test"]
		// Value is [timestamp, "value_string"]
		if valStr, ok := r.Value[1].(string); ok {
			var val float64
			fmt.Sscanf(valStr, "%f", &val)
			rates[testName] = val
		}
	}
	return rates, nil
}

func queryRunCounts(ctx context.Context, promURL string) (map[string]int, error) {
	query := fmt.Sprintf(`sum by (test) (count_over_time(e2e_test_result{branch="main"}[%dd]))`, windowDays)

	// Bind the request to ctx so callers can cancel or time out the query.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)), nil)
	if err != nil {
		return nil, fmt.Errorf("building run count request: %w", err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("querying run counts: %w", err)
	}
	defer resp.Body.Close()

	var result PromQueryResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, fmt.Errorf("decoding response: %w", err)
	}

	counts := make(map[string]int)
	for _, r := range result.Data.Result {
		testName := r.Metric["test"]
		if valStr, ok := r.Value[1].(string); ok {
			var val int
			fmt.Sscanf(valStr, "%d", &val)
			counts[testName] = val
		}
	}
	return counts, nil
}

func isRegression(ctx context.Context, promURL, testName string) (bool, error) {
	// A regression = 0% pass rate in recent window (all runs failed).
	// Uses sum_over_time/count_over_time instead of last_over_time subquery,
	// which is unreliable with high-cardinality build_id labels.
	// %q quotes and escapes the test name; PromQL string literals follow
	// Go's escaping rules, so names containing quotes stay valid.
	query := fmt.Sprintf(
		`(sum(sum_over_time(e2e_test_result{test=%q, branch="main"}[4d])) /
		sum(count_over_time(e2e_test_result{test=%q, branch="main"}[4d])))`,
		testName, testName,
	)

	// Bind the request to ctx so callers can cancel or time out the query.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("%s/api/v1/query?query=%s", promURL, url.QueryEscape(query)), nil)
	if err != nil {
		return false, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var result PromQueryResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return false, err
	}

	// If the pass rate is 0, all recent runs failed: likely a regression.
	for _, r := range result.Data.Result {
		if valStr, ok := r.Value[1].(string); ok {
			var val float64
			fmt.Sscanf(valStr, "%f", &val)
			if val == 0 {
				return true, nil
			}
		}
	}
	return false, nil
}

func buildQuarantineConfig(ctx context.Context, promURL string) (*QuarantineConfig, error) {
	flakeRates, err := queryFlakeRates(ctx, promURL)
	if err != nil {
		return nil, err
	}
	runCounts, err := queryRunCounts(ctx, promURL)
	if err != nil {
		return nil, err
	}

	now := time.Now().UTC()
	cfg := &QuarantineConfig{
		Version: 1,
		Updated: now.Format(time.RFC3339),
		Tests:   make(map[string]QuarantineEntry),
	}

	for testName, rate := range flakeRates {
		runs := runCounts[testName]
		if rate < flakeThreshold || runs < minRunsForDecision {
			continue
		}
		regression, err := isRegression(ctx, promURL, testName)
		if err != nil {
			return nil, fmt.Errorf("checking regression for %s: %w", testName, err)
		}
		if regression {
			continue // Don't quarantine regressions
		}
		// Round to the nearest integer: plain truncation can drop a failure
		// when the float product lands just below a whole number.
		failedRuns := int(rate*float64(runs) + 0.5)
		cfg.Tests[testName] = QuarantineEntry{
			Name:          testName,
			Reason:        fmt.Sprintf("Flake rate %.0f%% over %dd (%d/%d failed)", rate*100, windowDays, failedRuns, runs),
			FlakeRate:     rate,
			TotalRuns:     runs,
			FailedRuns:    failedRuns,
			QuarantinedAt: now.Format(time.RFC3339),
			ReEnableAfter: now.AddDate(0, 0, quarantineDurationDays).Format(time.RFC3339),
		}
	}
	return cfg, nil
}

func main() {
	promURL := os.Getenv("PROMETHEUS_URL")
	if promURL == "" {
		promURL = "http://prometheus-server.e2e-analytics.svc:80"
	}
	ctx := context.Background()
	cfg, err := buildQuarantineConfig(ctx, promURL)
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		os.Exit(1)
	}
	data, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		os.Exit(1)
	}
	fmt.Println(string(data))
}
Deploy as a daily CronJob (yaml):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: quarantine-controller
  namespace: e2e-analytics
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: controller
            image: quay.io/your-org/quarantine-controller:latest
            env:
            - name: PROMETHEUS_URL
              value: "http://prometheus-server.e2e-analytics.svc:80"
            - name: JIRA_TOKEN
              valueFrom:
                secretKeyRef:
                  name: jira-credentials
                  key: token
            - name: JIRA_SERVER
              value: "https://redhat.atlassian.net"
            - name: JIRA_PROJECT
              value: "PROJECT"
            - name: GIT_REPO
              value: "repository"
            - name: GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: github-credentials
                  key: token
          restartPolicy: OnFailure
The controller:
- Queries Prometheus for flake rates via PromQL
- Excludes regressions (consecutive trailing failures)
- Outputs a quarantine JSON config
- Creates Jira tickets for newly quarantined tests
- Commits the config to Git (pull request or direct push)
Step 6: Wire the test runner
Your E2E test runner loads the quarantine config and skips active entries. The quarantine controller exports this JSON:
{
"version": 1,
"updated": "2026-06-09T06:00:00Z",
"tests": {
"TestOperator/components/group_1/dashboard/validate_config": {
"name": "TestOperator/components/group_1/dashboard/validate_config",
"reason": "Flake rate 35% over 30d (7/20 failed)",
"flake_rate": 0.35,
"total_runs": 20,
"failed_runs": 7,
"jira": "JIRA-60123",
"quarantined_at": "2026-05-15T06:00:00Z",
"re_enable_after": "2026-06-14T06:00:00Z"
}
}
}
At test startup, load the config and build a skip regex (Go):
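// Requires the regexp, sort, and strings packages.
func buildSkipRegex(cfg *QuarantineConfig) string {
	var patterns []string
	for name := range cfg.Tests {
		// go test splits -skip patterns on "/" and matches each element
		// against one subtest level, so anchor every segment separately.
		segments := strings.Split(name, "/")
		escaped := make([]string, len(segments))
		for i, seg := range segments {
			escaped[i] = "^" + regexp.QuoteMeta(seg) + "$"
		}
		patterns = append(patterns, strings.Join(escaped, "/"))
	}
	// Map iteration order is random; sort so the output is stable across runs.
	sort.Strings(patterns)
	return strings.Join(patterns, "|")
}
Pass the result to go test -skip (bash):
SKIP_REGEX=$(quarantine-tool build-skip-regex --config tests/e2e/quarantine.json)
go test ./tests/e2e/... \
  -v -timeout 60m \
  -skip "$SKIP_REGEX"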
Step 7: Close the feedback loop
The system is self-healing by design:
- Quarantined tests expire. After quarantine_duration_days, the entry is removed and the test runs again in CI.
- If the test is still flaky, the next analysis cycle re-quarantines it (with a fresh Jira ticket reference).
- If someone fixes the test, it passes consistently and is never re-quarantined.
- Jira resolution check: The controller queries Jira for resolved tickets and proactively un-quarantines those tests early.
The controller's cleanup logic (runs every cycle):
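func cleanupExpired(cfg *QuarantineConfig) {
	now := time.Now().UTC()
	for name, entry := range cfg.Tests {
		expiry, err := time.Parse(time.RFC3339, entry.ReEnableAfter)
		if err != nil {
			// A failed parse yields the zero time, which would look expired.
			// Keep the entry rather than silently un-quarantining the test.
			continue
		}
		if now.After(expiry) {
			delete(cfg.Tests, name)
		}
	}
}
Step 8: Add PR visibility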
Add a CI check that posts a comment on every pull request (PR) showing the current quarantine status (yaml):
name: Quarantine Status
on:
  pull_request:
    branches: [main]
jobs:
  quarantine-status:
    runs-on: ubuntu-latest
    # The job needs to read the repository and write PR comments.
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - name: Post quarantine table
        env:
          # The gh CLI reads its credentials from GH_TOKEN.
          GH_TOKEN: ${{ github.token }}
        run: |
          CONFIG="tests/e2e/quarantine.json"
          COUNT=$(jq '.tests | length' "$CONFIG")
          if [ "$COUNT" -eq 0 ]; then exit 0; fi
          echo "## Quarantined E2E Tests" > /tmp/comment.md
          echo "**${COUNT}** tests are quarantined and will be skipped." >> /tmp/comment.md
          echo "" >> /tmp/comment.md
          echo "| Test | Jira | Flake Rate | Expires |" >> /tmp/comment.md
          echo "|------|------|-----------|---------|" >> /tmp/comment.md
          jq -r '.tests[] | "| \(.name) | \(.jira // "-") | \(.flake_rate * 100 | floor)% | \(.re_enable_after // "-") |"' \
            "$CONFIG" >> /tmp/comment.md
          gh pr comment "${{ github.event.number }}" --body-file /tmp/comment.md
Step 9: Connect to your CI pipeline
The system above is useful only when real test results flow into it. This section shows how to wire it to the three CI platforms most relevant to OpenShift projects.
Option A: OpenShift CI (Prow)
OpenShift CI stores all job artifacts in a public GCS bucket (test-platform-results). After every E2E run, Prow uploads $ARTIFACT_DIR contents to:
gs://test-platform-results/
  logs/{periodic-job-name}/{build_id}/                 # periodic jobs
  pr-logs/pull/{org}_{repo}/{pr}/{job}/{build_id}/     # presubmit jobs
Your test runner must produce JUnit XML inside $ARTIFACT_DIR. Most Go test harnesses support this via gotestsum --junitfile or a wrapper that converts go test -json output. If you use a Makefile, a common pattern is:
makefile
ifdef ARTIFACT_DIR
export JUNIT_OUTPUT_PATH = ${ARTIFACT_DIR}/junit_report.xml
endif
The ingester CronJob (Step 3) scrapes this bucket using the GCS JSON API (no auth needed for public buckets):
# List recent builds for a job
curl -s "https://storage.googleapis.com/storage/v1/b/test-platform-results/o?\
prefix=logs/periodic-ci-my-org-my-operator-main-e2e/&delimiter=/"
# Download JUnit from a specific build
curl -s "https://storage.googleapis.com/test-platform-results/\
logs/{job}/{build_id}/artifacts/{workflow}/e2e/artifacts/junit_report.xml"
# Get run metadata (timestamp, commit SHA)
curl -s "https://storage.googleapis.com/test-platform-results/\
logs/{job}/{build_id}/started.json"
# {"timestamp":1765889560, "repo-commit":"abc123f", ...}
The ingester maps GCS metadata to Prometheus labels:
| Prometheus label | GCS source |
|---|---|
| test | <testcase name="..."> in junit_report.xml |
| suite | <testsuite name="..."> in junit_report.xml |
| job | Path segment (the Prow job name) |
| build_id | Path segment (numeric build ID) |
| commit_sha | started.json repo-commit |
| branch | main for periodics, PR number for presubmits |
Important: For accurate flake detection, scrape periodic jobs (which run on main without code changes), not presubmit jobs (which mix test flakes with actual regressions introduced by PRs).
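The mapping in the table above can be sketched in Go for the periodic-job case. The function name labelsFromGCS and the runLabels struct are hypothetical; the path layout (logs/{job}/{build_id}/...) and the started.json fields (timestamp, repo-commit) are the ones shown earlier in this section.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// startedJSON holds the two started.json fields the ingester needs.
type startedJSON struct {
	Timestamp  int64  `json:"timestamp"`
	RepoCommit string `json:"repo-commit"`
}

// runLabels carries the per-run label values for the Prometheus series.
type runLabels struct {
	Job, BuildID, CommitSHA, Branch string
	StartedAt                       time.Time
}

// labelsFromGCS derives run-level labels from a periodic job's object path
// ("logs/{job}/{build_id}/...") and the contents of its started.json.
func labelsFromGCS(objectPath string, started []byte) (runLabels, error) {
	parts := strings.Split(strings.Trim(objectPath, "/"), "/")
	if len(parts) < 3 || parts[0] != "logs" {
		return runLabels{}, fmt.Errorf("unexpected periodic job path %q", objectPath)
	}
	var s startedJSON
	if err := json.Unmarshal(started, &s); err != nil {
		return runLabels{}, fmt.Errorf("parsing started.json: %w", err)
	}
	return runLabels{
		Job:       parts[1],
		BuildID:   parts[2],
		CommitSHA: s.RepoCommit,
		Branch:    "main", // periodic jobs run against main
		StartedAt: time.Unix(s.Timestamp, 0).UTC(),
	}, nil
}

func main() {
	labels, err := labelsFromGCS(
		"logs/periodic-ci-operator-main-e2e/1234567890/started.json",
		[]byte(`{"timestamp":1765889560,"repo-commit":"abc123f"}`),
	)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", labels)
}
```

Presubmit paths under pr-logs/ are rejected on purpose here, in line with the advice to scrape periodic jobs only.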
Option B: Konflux and Tekton pipelines
Konflux uses Tekton pipelines. The integration approach is a post-task in your E2E pipeline that pushes results directly, no GCS scraping needed.
Add a step to your Tekton PipelineRun that runs after E2E tests:
# Tekton task that pushes JUnit results to Prometheus after E2E tests
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: push-test-metrics
  namespace: e2e-analytics
spec:
  params:
    - name: junit-path
      description: Path to JUnit XML file
    - name: job-name
      description: Pipeline/job identifier
    - name: build-id
      description: PipelineRun UID or build number
    - name: commit-sha
      description: Git commit SHA
  steps:
    - name: push-metrics
      image: quay.io/your-org/junit-ingester:latest
      env:
        - name: REMOTE_WRITE_ENDPOINT
          value: "http://prometheus-server.e2e-analytics.svc:80/api/v1/write"
      command:
        - /junit-ingester
        - --file=$(params.junit-path)
        - --job=$(params.job-name)
        - --build-id=$(params.build-id)
        - --commit-sha=$(params.commit-sha)
        - --branch=main
Wire it into your E2E pipeline as a finally task (runs whether tests pass or fail):
apiVersion: tekton.dev/v1
kind: Pipeline
spec:
  tasks:
    - name: run-e2e
      taskRef:
        name: e2e-tests
      # ... test config ...
  finally:
    - name: push-metrics
      taskRef:
        name: push-test-metrics
      params:
        - name: junit-path
          value: "$(tasks.run-e2e.results.junit-path)"
        - name: job-name
          value: "konflux-my-operator-e2e"
        - name: build-id
          value: "$(context.pipelineRun.uid)"
        - name: commit-sha
          value: "$(params.git-revision)"
The advantage over GCS scraping: results arrive in Prometheus within seconds of the test run completing, not on a four-hour CronJob schedule.
Option C: Local or ad-hoc runs
For testing the system or running one-off analyses, you can push results from a local make e2e-test run:
# Run E2E tests with JUnit output
ARTIFACT_DIR=/tmp/e2e-results make e2e-test
# Push results to Prometheus (via port-forward or in-cluster)
kubectl -n e2e-analytics port-forward svc/prometheus-server 9090:80 &
/path/to/junit-ingester \
  --file /tmp/e2e-results/junit_report.xml \
  --job "local-e2e" \
  --build-id "$(date +%s)" \
  --commit-sha "$(git rev-parse HEAD)" \
  --branch "$(git rev-parse --abbrev-ref HEAD)" \
  --remote-write-endpoint http://localhost:9090/api/v1/write
This is useful for validating the pipeline end-to-end before deploying the CronJob or Tekton task.
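The ingester's command-line handling is not shown in Step 3. One way it could parse the flags used in the Tekton task and the local example is sketched below with Go's standard flag package. The options struct, the parseOptions name, and the fallback to the REMOTE_WRITE_ENDPOINT environment variable are assumptions for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// options holds the ingester's command-line settings.
type options struct {
	File, Job, BuildID, CommitSHA, Branch, Endpoint string
}

// parseOptions parses the ingester flags. getenv is injected so the
// environment fallback for the endpoint can be exercised in tests.
func parseOptions(args []string, getenv func(string) string) (options, error) {
	var o options
	fs := flag.NewFlagSet("junit-ingester", flag.ContinueOnError)
	fs.StringVar(&o.File, "file", "", "path to the JUnit XML file")
	fs.StringVar(&o.Job, "job", "", "job identifier (job label)")
	fs.StringVar(&o.BuildID, "build-id", "", "build identifier (build_id label)")
	fs.StringVar(&o.CommitSHA, "commit-sha", "", "Git commit SHA (commit_sha label)")
	fs.StringVar(&o.Branch, "branch", "main", "branch name (branch label)")
	fs.StringVar(&o.Endpoint, "remote-write-endpoint",
		getenv("REMOTE_WRITE_ENDPOINT"), "Prometheus remote-write URL")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.File == "" || o.Endpoint == "" {
		return o, errors.New("--file and a remote-write endpoint are required")
	}
	return o, nil
}

func main() {
	// Sample arguments; a real binary would pass os.Args[1:] instead.
	opts, err := parseOptions([]string{
		"--file", "/tmp/e2e-results/junit_report.xml",
		"--job", "local-e2e",
		"--remote-write-endpoint", "http://localhost:9090/api/v1/write",
	}, os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", opts)
}
```

The flag package accepts both the --file=value form used in the Tekton task and the --file value form used in the local example, so one parser covers both call sites.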
Why exclude regressions from quarantine?
A regression means the code broke. Quarantining the test hides the bug. The system detects regressions by looking for a step-function pattern: mostly passing before a specific commit, then consistently failing after. The controller in Step 5 approximates this pattern with a simpler check, a 0% pass rate over the last four days. These tests are flagged in Grafana but never auto-quarantined.
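The step-function idea can be expressed directly on a chronological pass/fail history. The sketch below is illustrative only: the function name looksLikeRegression and the trailing and minPriorPassRate parameters are hypothetical, and it is not the PromQL check that the controller in this guide runs.

```go
package main

import "fmt"

// looksLikeRegression reports whether a chronologically ordered pass/fail
// history shows a step-function pattern: the last `trailing` runs all failed
// while the runs before them passed at least minPriorPassRate of the time.
func looksLikeRegression(history []bool, trailing int, minPriorPassRate float64) bool {
	if trailing <= 0 || len(history) <= trailing {
		return false // not enough history to compare before and after
	}
	split := len(history) - trailing
	prior, recent := history[:split], history[split:]
	for _, passed := range recent {
		if passed {
			return false // a recent pass means failures are intermittent
		}
	}
	passes := 0
	for _, passed := range prior {
		if passed {
			passes++
		}
	}
	return float64(passes)/float64(len(prior)) >= minPriorPassRate
}

func main() {
	// Ten passes followed by four failures: a clean step.
	step := []bool{true, true, true, true, true, true, true, true, true, true, false, false, false, false}
	// Failures scattered among passes: a flake.
	flaky := []bool{true, false, true, true, false, true, false, false}
	fmt.Println(looksLikeRegression(step, 4, 0.8))  // true
	fmt.Println(looksLikeRegression(flaky, 2, 0.8)) // false
}
```

In the flaky history, the last two runs failed, but the earlier runs passed only four times out of six, so the history does not look like a clean break at a single commit.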
Why automatic expiry?
Without expiry, quarantined tests become permanent exclusions. The re_enable_after field forces accountability: either fix the test within the window, or it returns to CI and gets re-evaluated. This prevents the quarantine list from growing unbounded.
Grafana dashboard layout
Organize your dashboard into four rows:
- Row 1: Overview
  - Stat panel: total tests, quarantined count, overall suite pass rate
  - Pie chart: healthy / flaky / regression breakdown
  - PromQL: count(count by (test) (e2e_test_result{branch="main"})) for total tests
- Row 2: Flake leaderboard
  - Table: top 20 flakiest tests with rates, run counts
  - Time series: flake rate trend for selected test (variable dropdown)
  - PromQL: see Panel 1 and Panel 2 above
- Row 3: Regressions
  - Table: tests where all recent runs failed (0% pass rate in last four days)
  - PromQL: (sum by (test) (sum_over_time(e2e_test_result{branch="main"}[4d])) / sum by (test) (count_over_time(e2e_test_result{branch="main"}[4d]))) == 0
- Row 4: Quarantine management
  - Table: loaded from quarantine JSON (or an e2e_quarantine_active metric the controller pushes)
  - Stat panel: tests expiring in next seven days
  - Log panel: quarantine/un-quarantine events timeline
Operational runbook
Follow these standard procedures to triage skipped tests, investigate failure causes, and manage the lifecycle of your quarantined suite.
A test was quarantined. What do I do?
- Check the Jira ticket linked in the quarantine entry.
- Open the Grafana dashboard, select the test from the dropdown, look at the time-series panel.
- Identify the pattern: intermittent flake (random), or did it start at a specific commit?
- Fix the test, verify it passes in three or more consecutive runs, close the Jira ticket.
- The controller will un-quarantine it on the next cycle.
How do I swap the data store?
Because everything conforms to the Prometheus protocol, swapping is a config change:
- Ingester: Point REMOTE_WRITE_ENDPOINT at the new endpoint (such as a Thanos receiver or Mimir).
- Grafana: Update the data source URL.
- Quarantine controller: Update PROMETHEUS_URL.
All PromQL queries, dashboards, and alert rules work unchanged. That's the point of conforming to the standard.
Moving from reactive debugging to data-driven pipelines
Automating your test quarantine system moves your development team away from reactive troubleshooting and toward a data-driven pipeline. Backing your test infrastructure with Prometheus metrics provides clear historical trends to help differentiate between intermittent flakiness and true code regressions before a broken pull request blocks your main branch. This self-healing loop isolates broken tests early, forcing accountability through explicit expiry dates while directly reducing manual developer toil and increasing team velocity.