
This series shows how to use Red Hat OpenShift metrics in a real-life performance testing scenario. I used these metrics to run performance tests on the Service Binding Operator, and we used the results to tune the operator for acceptance into the Developer Sandbox for Red Hat OpenShift.

In Part 3, I showed you how we set up OpenShift's monitoring stack to collect runtime metrics for our testing scenarios. I also shared a collector script that ensures the results are preserved on a node that won't crash. Now, we can look at the performance metrics we'll use and how to gather the data we need.


CPU and memory usage

As mentioned in Part 3, some metrics have to be collected for the duration of the test. The most important metrics required by the Developer Sandbox team were the CPU and memory usage of both OpenShift nodes and the tested operators. We also wanted to see the number of resources left in the cluster after the test was done.

Unfortunately, we ran into problems during our first attempts to run the stress tests: the OpenShift cluster crashed when one of its worker nodes went down. Naturally, it was important to know what caused the failure. I needed a more granular view of CPU and memory usage, so I had to collect data from the individual workloads deployed on those nodes.

By inspecting the cluster's own Grafana instance, I identified the workloads that were under load in our scenario. I then added them, along with the cluster's nodes, to the watch list of the collector script I shared in the previous article. Table 1 shows these namespaces and workloads.

Table 1: Workloads to watch for CPU and memory usage.
  • openshift-apiserver: apiserver-* pods
  • openshift-kube-apiserver: kube-apiserver-* pods
  • openshift-monitoring: prometheus-k8s-* pods
  • openshift-operators: service-binding-operator-* pods, rhoas-operator-* pods
  • openshift-operator-lifecycle-manager: catalog-operator-* pods, olm-operator-* pods
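As an illustration of what "watching" these workloads means, per-pod CPU and memory usage can be pulled from the in-cluster Prometheus using the standard cAdvisor metrics. The following sketch only builds the PromQL query strings; the namespace and pod pattern are an example targeting the Service Binding Operator pods from Table 1:

```shell
# Hypothetical sketch: building PromQL queries for per-pod CPU and memory usage
# of one workload group from Table 1. The metric names are the standard cAdvisor
# metrics exposed by OpenShift's monitoring stack.
ns="openshift-operators"
pods="service-binding-operator-.*"

# CPU: per-pod usage rate over a 5-minute window
cpu_query="sum(rate(container_cpu_usage_seconds_total{namespace=\"$ns\",pod=~\"$pods\"}[5m])) by (pod)"
# Memory: per-pod working set
mem_query="sum(container_memory_working_set_bytes{namespace=\"$ns\",pod=~\"$pods\"}) by (pod)"

echo "$cpu_query"
echo "$mem_query"
```

On a live cluster, each query string can then be submitted to Prometheus's /api/v1/query endpoint, for example with curl and a bearer token from `oc whoami -t`.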

Resources created in the cluster

Another one of the metrics requested by the Developer Sandbox team was the number of resources created in the cluster during the test. There were two ways to get that information:

  • If the Prometheus instance was available (meaning the node on which it is deployed had not crashed), I could use a simple Prometheus query:
    sort_desc(cluster:usage:resources:sum)
  • If the Prometheus instance was not available, I had to employ a brute-force approach, using the oc tool to count the instances of each resource type.
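The brute-force approach relies on `oc get <kind> -o name` printing exactly one line per resource, so a line count equals a resource count. A minimal sketch of the idea, with a hard-coded sample standing in for real `oc` output:

```shell
# Hypothetical sketch of the brute-force count: `oc get <kind> -o name` prints
# one line per resource, so counting lines counts resources. The sample output
# below stands in for a real `oc get pods --all-namespaces -o name` call.
sample_output="pod/app-1
pod/app-2
pod/db-1"

count=$(echo "$sample_output" | wc -l)
echo "pods: $count"
```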

Service Binding Operator performance and developer experience

The final set of metrics was related to the performance of the Service Binding Operator itself, from a developer's perspective. Specifically, we wanted to know how long it took to perform the binding after the ServiceBinding resource was created.

The typical situation for a developer using the Service Binding Operator is to have a backing service and a running application that they want to bind together. The developer sends a ServiceBinding request and expects the Service Binding Operator to perform the binding. The scenario can be split into the following sequence of steps, each shown with the way to retrieve its timestamp:

  1. The ServiceBinding request is sent, processed by OpenShift, and created internally as a resource (.metadata.creationTimestamp of the ServiceBinding resource).
  2. The Service Binding Operator picks up the resource while watching for it and processes it. (This is the first "Reconciling ServiceBinding" message for the particular ServiceBinding resource in the Service Binding Operator logs.)
  3. Based on the content of the resource, the Service Binding Operator performs the binding. It collects bindable information from the backing service and injects it into the application.
  4. The application is re-deployed with the bound information injected into the Deployment resource (.status.conditions[] | select(.type=="Available") | select(.status=="True").lastTransitionTime).
  5. The ServiceBinding resource is marked as done. (A "Done" message is sent for the particular ServiceBinding resource in the Service Binding Operator logs.)
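The resource-side timestamps from the steps above can be extracted with jq. The sketch below runs those filters against a made-up ServiceBinding document; in practice, the input comes from `oc get sbr -o json`:

```shell
# Hypothetical sketch: pulling timestamps out of a ServiceBinding with jq.
# The JSON document here is made up for illustration.
sb_json='{"metadata":{"name":"sb-1","creationTimestamp":"2021-06-01T10:00:00Z"},"status":{"conditions":[{"type":"Ready","status":"True","lastTransitionTime":"2021-06-01T10:00:12Z"}]}}'

# Step 1: when OpenShift created the resource
created=$(echo "$sb_json" | jq -r '.metadata.creationTimestamp')
# Ready condition: when the binding was reported complete on the resource
ready=$(echo "$sb_json" | jq -r '.status.conditions[] | select(.type=="Ready").lastTransitionTime')

echo "created=$created ready=$ready"
```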

I defined the following metrics to evaluate the developer experience in this scenario:

  • Time to Ready: The time between the creation of the ServiceBinding resource (1) and completion of the binding (5). That can be further split into the following:
    • Time to Reconcile: The time between the creation of the ServiceBinding resource (1) and when it is picked up by the Service Binding Operator (2).
    • Reconcile to Ready: The time between when the Service Binding Operator picks up the ServiceBinding (2) and completes the binding (5).
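Given the timestamps from steps (1), (2), and (5), the three metrics are simple subtractions. A minimal sketch with made-up timestamps (in practice, (1) comes from the ServiceBinding metadata and (2) and (5) from the operator log, as the collection script later in this article does):

```shell
# Hypothetical sketch: computing the three metrics from collected timestamps.
# All timestamp values here are made up for illustration.
created="2021-06-01T10:00:00Z"   # (1) ServiceBinding .metadata.creationTimestamp
reconciled="1622541605.123456"   # (2) first "Reconciling ServiceBinding" log ts (epoch seconds)
all_done="1622541612.654321"     # (5) "Done" log ts (epoch seconds)

# Convert the RFC 3339 creation timestamp to epoch seconds (GNU date)
created_s=$(date -d "$created" +%s)

# Fractional-second arithmetic via awk
time_to_reconcile=$(awk "BEGIN{printf \"%.6f\", $reconciled - $created_s}")
reconcile_to_ready=$(awk "BEGIN{printf \"%.6f\", $all_done - $reconciled}")
time_to_ready=$(awk "BEGIN{printf \"%.6f\", $all_done - $created_s}")

echo "Time to Reconcile:  ${time_to_reconcile}s"
echo "Reconcile to Ready: ${reconcile_to_ready}s"
echo "Time to Ready:      ${time_to_ready}s"
```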

At the time of the performance evaluation, these metrics were collected neither by OpenShift's monitoring stack nor by the Service Binding Operator itself, so I had to derive them from the data I could get: the metadata (especially timestamps) that OpenShift keeps for the user's backing service and application, along with the Service Binding Operator logs. I wrote the following script to collect the necessary data from OpenShift and compute the metrics after the test is complete:

#!/bin/bash
if [ -z "$QUAY_NAMESPACE" ]; then
    echo "QUAY_NAMESPACE environment variable needs to be set to a non-empty value"
    exit 1
fi

DT=$(date "+%F_%T")
RESULTS=results-$DT
mkdir -p $RESULTS

USER_NS_PREFIX=${1:-zippy}

# Resource counts
resource_counts(){
    echo -n "$1;"
    # All resource counts from user namespaces
    echo -n "$(oc get $1 --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace | grep $USER_NS_PREFIX | wc -l)"
    echo -n ";"
    # All resource counts from all namespaces
    echo "$(oc get $1 --all-namespaces -o name | wc -l)"
}

# Dig various timestamps out
timestamps(){
    SBR_JSON=$1
    DEPLOYMENTS_JSON=$2
    SBO_LOG=$3
    RESULTS=$4

    jq -rc '((.metadata.namespace) + ";" + (.metadata.name) + ";" + (.metadata.creationTimestamp) + ";" + (.status.conditions[] | select(.type=="Ready").lastTransitionTime))' $SBR_JSON > $RESULTS/tmp.csv
    echo "ServiceBinding;Created;ReconciledTimestamp;Ready;AllDoneTimestamp" > $RESULTS/sbr-timestamps.csv
    for i in $(cat $RESULTS/tmp.csv); do
        ns=$(echo -n $i | cut -d ";" -f1)
        name=$(echo -n $i | cut -d ";" -f2)
        echo -n $ns/$name;
        echo -n ";";
        echo -n $(date -d $(echo -n $i | cut -d ";" -f3) "+%F %T");
        echo -n ";";
        log=$(cat $SBO_LOG | grep $ns)
        date -d @$(echo $log | jq -rc 'select(.msg | contains("Reconciling")).ts' | head -n1) "+%F %T.%N" | tr -d "\n"
        echo -n ";";
        echo -n $(date -d $(echo -n $i | cut -d ";" -f4) "+%F %T");
        echo -n ";";
        done_ts=$(echo $log | jq -rc 'select(.msg | contains("Done")) | select(.retry==false).ts')
        if [ -n "$done_ts" ]; then
            date -d "@$done_ts" "+%F %T.%N"
        else
            echo ""
        fi
    done >> $RESULTS/sbr-timestamps.csv
    rm -f $RESULTS/tmp.csv

    jq -rc '((.metadata.namespace) + ";" + (.metadata.name) + ";" + (.metadata.creationTimestamp) + ";" + (.status.conditions[] | select(.type=="Available") | select(.status=="True").lastTransitionTime)) + ";" + (.metadata.managedFields[] | select(.manager=="manager").time)' $DEPLOYMENTS_JSON > $RESULTS/tmp.csv
    echo "Namespace;Deployment;Deployment_Created;Deployment_Available;Deployment_Updated_by_SBO;SB_Name;SB_created;SB_ReconciledTimestamp;SB_Ready;SB_AllDoneTimestamp" > $RESULTS/binding-timestamps.csv
    for i in $(cat $RESULTS/tmp.csv); do
        NS=$(echo -n $i | cut -d ";" -f1);
        echo -n $NS;
        echo -n ";";
        echo -n $(echo -n $i | cut -d ";" -f2);
        echo -n ";";
        echo -n $(date -d $(echo -n $i | cut -d ";" -f3) "+%F %T");
        echo -n ";";
        echo -n $(date -d $(echo -n $i | cut -d ";" -f4) "+%F %T");
        echo -n ";";
        echo -n $(date -d $(echo -n $i | cut -d ";" -f5) "+%F %T");
        echo -n ";";
        cat $RESULTS/sbr-timestamps.csv | grep $NS
    done >> $RESULTS/binding-timestamps.csv
    rm -f $RESULTS/tmp.csv
}

# Collect timestamps
{
# ServiceBinding resources in user namespaces
oc get sbr --all-namespaces -o json | jq -r '.items[] | select(.metadata.namespace | contains("'$USER_NS_PREFIX'"))' > $RESULTS/service-bindings.json

# Deployment resources in user namespaces
oc get deploy --all-namespaces -o json | jq -r '.items[] | select(.metadata.namespace | contains("'$USER_NS_PREFIX'"))' > $RESULTS/deployments.json

# Service Binding Operator log
oc logs $(oc get $(oc get pods -n openshift-operators -o name | grep service-binding-operator) -n openshift-operators -o jsonpath='{.metadata.name}') -n openshift-operators > $RESULTS/service-binding-operator.log

timestamps $RESULTS/service-bindings.json $RESULTS/deployments.json $RESULTS/service-binding-operator.log $RESULTS
} &

# Collect resource counts
{
RESOURCE_COUNTS_OUT=$RESULTS/resource-count.csv
echo "Resource;UserNamespaces;AllNamespaces" > $RESOURCE_COUNTS_OUT
for i in $(cat resources.list); do
    resource_counts $i >> $RESOURCE_COUNTS_OUT;
done
} &

wait

Next steps

This article introduced the metrics we collected to performance-test the Service Binding Operator for acceptance into the Developer Sandbox for Red Hat OpenShift. I showed how we collected both metrics of interest to the Developer Sandbox team, and additional metrics for our team specifically. Look for Part 5, the final article in this series, where I will present the testing rounds and their results.

Last updated: September 19, 2023