This series shows how to use Red Hat OpenShift metrics in a real-life performance testing scenario. I used these metrics to run performance tests on the Service Binding Operator. We used the results to performance-tune the Service Binding Operator for acceptance into the Developer Sandbox for Red Hat OpenShift.
In Part 3, I showed you how we set up OpenShift's monitoring stack to collect runtime metrics for our testing scenarios. I also shared a collector script that ensures the results are preserved on a node that won't crash. Now, we can look at the performance metrics we'll use and how to gather the data we need.
Read the whole series:
- Part 1: Performance requirements
- Part 2: The test environment
- Part 3: Collecting runtime metrics
- Part 4: Gathering performance metrics
- Part 5: Test rounds and results (August 5)
CPU and memory usage
As mentioned in Part 3, some metrics have to be collected for the duration of the test. The most important metrics required by the Developer Sandbox team were the CPU and memory usage of both OpenShift nodes and the tested operators. We also wanted to see the number of resources left in the cluster after the test was done.
Unfortunately, we ran into problems during our first attempts to run the stress tests: the OpenShift cluster effectively crashed when one of its worker nodes went down. Naturally, it was important to know what caused the failure. I needed a more granular view of CPU and memory usage, so I had to collect data from the workloads deployed on those nodes.
By watching and inspecting the cluster's own Grafana instance, I identified several resources that were heavily loaded and stressed by our scenario. I then added them, together with the cluster's nodes, to the watch list of the collector script I shared in the previous article.
The stressed workloads came from a handful of namespaces, and I included them to be watched for CPU and memory usage. Table 1 shows these namespaces and workloads.
Namespace | Workloads
---|---
openshift-apiserver | apiserver-* pods
openshift-kube-apiserver | kube-apiserver-* pods
openshift-monitoring | prometheus-k8s-* pods
openshift-operators | service-binding-operator-* pods
openshift-operator-lifecycle-manager |
Table 1: Namespaces and workloads watched for CPU and memory usage.
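As a quick sanity check outside of the collector script, the current CPU and memory consumption of these workloads can also be inspected directly from the command line with the `oc` client. A minimal example, assuming a logged-in session with rights to read cluster metrics:

```bash
# Spot-check current CPU and memory usage of the pods in one of the watched namespaces
oc adm top pods -n openshift-operators

# The same view for the cluster nodes
oc adm top nodes
```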
Resources created in the cluster
Another one of the metrics requested by the Developer Sandbox team was the number of resources created in the cluster during the test. There were two ways to get that information:
- If the Prometheus instance was available (meaning the node on which it was deployed had not crashed), I could use a simple Prometheus query: `sort_desc(cluster:usage:resources:sum)`
- If the Prometheus instance was not available, I had to employ a brute-force approach, using the `oc` tool to count the instances of each resource type.
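For illustration, here is how both approaches look from a terminal. This is a minimal sketch, not part of the original tooling; it assumes a default OpenShift 4.x monitoring stack (the `thanos-querier` route in `openshift-monitoring`) and a logged-in `oc` session:

```bash
# Run the Prometheus query through the thanos-querier route
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  "https://$HOST/api/v1/query" --data-urlencode 'query=sort_desc(cluster:usage:resources:sum)'

# Brute-force fallback: count one resource type with the oc tool
oc get deployments --all-namespaces -o name | wc -l
```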
Service Binding Operator performance and developer experience
The final set of metrics was related to the performance of the Service Binding Operator itself, from a developer's perspective. Specifically, we wanted to know how long it took to perform the binding after the `ServiceBinding` resource was created.
The typical situation for a developer using the Service Binding Operator is to have a backing service and an application running that the developer wants to bind together. So, the developer sends a `ServiceBinding` request and expects the binding to be done by the Service Binding Operator. The scenario can be split into the following sequence of steps; each step is shown along with the way to retrieve the respective timestamp:
1. The `ServiceBinding` request is sent, processed by OpenShift, and created internally as a resource (the `.metadata.creationTimestamp` of the `ServiceBinding` resource).
2. The Service Binding Operator, which watches for the resource, picks it up and processes it (the first "Reconciling ServiceBinding" message for the particular `ServiceBinding` resource in the Service Binding Operator logs).
3. Based on the content of the resource, the Service Binding Operator performs the binding: it collects bindable information from the backing service and injects it into the application.
4. The application is re-deployed with the bound information injected into the `Deployment` resource (`.status.conditions[] | select(.type=="Available") | select(.status=="True").lastTransitionTime`).
5. The `ServiceBinding` resource is marked as done (a "Done" message for the particular `ServiceBinding` resource in the Service Binding Operator logs).
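For example, timestamps (1) and (4) can be read straight from the resources with `oc` and `jq`; the names `my-binding` and `my-app` below are placeholders:

```bash
# (1) When the ServiceBinding resource was created
oc get sbr my-binding -o jsonpath='{.metadata.creationTimestamp}'

# (4) When the re-deployed application last became Available
oc get deployment my-app -o json | \
  jq -r '.status.conditions[] | select(.type=="Available") | select(.status=="True").lastTransitionTime'
```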
I defined the following metrics to evaluate the developer experience in this scenario:
- Time to Ready: The time between the creation of the `ServiceBinding` resource (1) and the completion of the binding (5). This can be further split into the following:
  - Time to Reconcile: The time between the creation of the `ServiceBinding` resource (1) and the moment it is picked up by the Service Binding Operator (2).
  - Reconcile to Ready: The time between the moment the Service Binding Operator picks up the `ServiceBinding` (2) and the completion of the binding (5).
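Given the timestamps, each metric is a simple difference. A minimal sketch with hypothetical values:

```bash
# Hypothetical timestamps for one ServiceBinding
created="2021-08-05T10:00:00Z"   # (1) .metadata.creationTimestamp
ready="2021-08-05T10:00:42Z"     # (5) binding marked as done

# Time to Ready, in seconds
echo $(( $(date -d "$ready" +%s) - $(date -d "$created" +%s) ))
```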
At the time of the performance evaluation, these metrics were not collected by OpenShift's monitoring stack or by the Service Binding Operator, so I had to dig the information out of the data I could get. As shown earlier, I derived what I needed from the metadata that OpenShift keeps about the user's backing service and application (especially timestamps), along with the Service Binding Operator logs. I wrote the following script to collect the necessary data from OpenShift and compute the metrics after the test is complete:
#!/bin/bash

if [ -z "$QUAY_NAMESPACE" ]; then
  echo "QUAY_NAMESPACE environment variable needs to be set to a non-empty value"
  exit 1
fi

DT=$(date "+%F_%T")
RESULTS=results-$DT
mkdir -p $RESULTS

USER_NS_PREFIX=${1:-zippy}

# Resource counts
resource_counts(){
  echo -n "$1;"
  # Resource count from user namespaces
  echo -n "$(oc get $1 --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace | grep $USER_NS_PREFIX | wc -l)"
  echo -n ";"
  # Resource count from all namespaces
  echo "$(oc get $1 --all-namespaces -o name | wc -l)"
}

# Dig various timestamps out
timestamps(){
  SBR_JSON=$1
  DEPLOYMENTS_JSON=$2
  SBO_LOG=$3
  RESULTS=$4
  # Namespace, name, creation timestamp, and Ready timestamp of each ServiceBinding
  jq -rc '((.metadata.namespace) + ";" + (.metadata.name) + ";" + (.metadata.creationTimestamp) + ";" + (.status.conditions[] | select(.type=="Ready").lastTransitionTime))' $SBR_JSON > $RESULTS/tmp.csv
  echo "ServiceBinding;Created;ReconciledTimestamp;Ready;AllDoneTimestamp" > $RESULTS/sbr-timestamps.csv
  for i in $(cat $RESULTS/tmp.csv); do
    ns=$(echo -n $i | cut -d ";" -f1)
    name=$(echo -n $i | cut -d ";" -f2)
    echo -n $ns/$name
    echo -n ";"
    echo -n $(date -d $(echo -n $i | cut -d ";" -f3) "+%F %T")
    echo -n ";"
    # Timestamp of the first "Reconciling" message for the ServiceBinding in the operator log
    log=$(grep $ns $SBO_LOG)
    date -d @$(echo "$log" | jq -rc 'select(.msg | contains("Reconciling")).ts' | head -n1) "+%F %T.%N" | tr -d "\n"
    echo -n ";"
    echo -n $(date -d $(echo -n $i | cut -d ";" -f4) "+%F %T")
    echo -n ";"
    # Timestamp of the final (non-retried) "Done" message, if any
    done_ts=$(echo "$log" | jq -rc 'select(.msg | contains("Done")) | select(.retry==false).ts')
    if [ -n "$done_ts" ]; then
      date -d "@$done_ts" "+%F %T.%N"
    else
      echo ""
    fi
  done >> $RESULTS/sbr-timestamps.csv
  rm -f $RESULTS/tmp.csv
  # Namespace, name, creation, Available, and SBO-update timestamps of each Deployment
  jq -rc '((.metadata.namespace) + ";" + (.metadata.name) + ";" + (.metadata.creationTimestamp) + ";" + (.status.conditions[] | select(.type=="Available") | select(.status=="True").lastTransitionTime)) + ";" + (.metadata.managedFields[] | select(.manager=="manager").time)' $DEPLOYMENTS_JSON > $RESULTS/tmp.csv
  echo "Namespace;Deployment;Deployment_Created;Deployment_Available;Deployment_Updated_by_SBO;SB_Name;SB_created;SB_ReconciledTimestamp;SB_Ready;SB_AllDoneTimestamp" > $RESULTS/binding-timestamps.csv
  for i in $(cat $RESULTS/tmp.csv); do
    NS=$(echo -n $i | cut -d ";" -f1)
    echo -n $NS
    echo -n ";"
    echo -n $(echo -n $i | cut -d ";" -f2)
    echo -n ";"
    echo -n $(date -d $(echo -n $i | cut -d ";" -f3) "+%F %T")
    echo -n ";"
    echo -n $(date -d $(echo -n $i | cut -d ";" -f4) "+%F %T")
    echo -n ";"
    echo -n $(date -d $(echo -n $i | cut -d ";" -f5) "+%F %T")
    echo -n ";"
    # Append the matching ServiceBinding timestamps collected above
    grep $NS $RESULTS/sbr-timestamps.csv
  done >> $RESULTS/binding-timestamps.csv
  rm -f $RESULTS/tmp.csv
}

# Collect timestamps
{
  # ServiceBinding resources in user namespaces
  oc get sbr --all-namespaces -o json | jq -r '.items[] | select(.metadata.namespace | contains("'$USER_NS_PREFIX'"))' > $RESULTS/service-bindings.json
  # Deployment resources in user namespaces
  oc get deploy --all-namespaces -o json | jq -r '.items[] | select(.metadata.namespace | contains("'$USER_NS_PREFIX'"))' > $RESULTS/deployments.json
  # Service Binding Operator log
  oc logs $(oc get $(oc get pods -n openshift-operators -o name | grep service-binding-operator) -n openshift-operators -o jsonpath='{.metadata.name}') -n openshift-operators > $RESULTS/service-binding-operator.log
  timestamps $RESULTS/service-bindings.json $RESULTS/deployments.json $RESULTS/service-binding-operator.log $RESULTS
} &

# Collect resource counts
{
  RESOURCE_COUNTS_OUT=$RESULTS/resource-count.csv
  echo "Resource;UserNamespaces;AllNamespaces" > $RESOURCE_COUNTS_OUT
  # resources.list holds the resource types to count, one per line
  for i in $(cat resources.list); do
    resource_counts $i >> $RESOURCE_COUNTS_OUT
  done
} &

wait
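Assuming the script is saved as `collect-metrics.sh` (the name is arbitrary, and `my-quay-org` below stands in for a real value), a typical invocation looks like this; the optional first argument overrides the default `zippy` prefix that identifies user namespaces:

```bash
# Results land in a timestamped results-* directory
QUAY_NAMESPACE=my-quay-org ./collect-metrics.sh zippy
```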
Next steps
This article introduced the metrics we collected to performance-test the Service Binding Operator for acceptance into the Developer Sandbox for Red Hat OpenShift. I showed how we collected both the metrics of interest to the Developer Sandbox team and the additional metrics our own team needed. Look for Part 5, the final article in this series, where I will present the testing rounds and their results.