Red Hat Gluster Storage

OK, so you watched:

https://www.redhat.com/en/about/videos/architecting-and-performance-tuning-efficient-gluster-storage-pools

You put in the time and architected an efficient and performant GlusterFS deployment. Your users are reading and writing, applications are humming along, and Gluster is keeping your data safe.

Now what?

Well, congratulations, you just completed the sprint! Now it's time for the marathon.

The often forgotten component of performance tuning is monitoring. You put in all that work up front to get your cluster performing and your users happy; now how do you ensure that this continues, and possibly improves? The answer is continued monitoring and profiling of your storage cluster, your clients, and a deeper look at your workload. In this blog we will look at the different metrics you can monitor in your storage cluster, identify which of them are important to monitor, how often to monitor them, and different ways to accomplish this.

I will break down the metrics we will be looking at into a few different categories:

  1. System resources
  2. Cluster resources
  3. Client resources

System resources are the typical things you would monitor on any Linux system and include:

  • CPU
  • RAM
  • Network
  • Disk
  • Processes

System resources should be monitored on all nodes in the cluster and should be looked at on an individual basis.

Next, we have cluster resources. Cluster resources include:

  • Gluster Peers
  • Gluster Volumes
  • Geo-Replication

Cluster resources should only be monitored from one node. The current GlusterD implementation has some limitations on running commands from multiple nodes at the same time, and monitoring cluster-wide state from multiple nodes is redundant anyway. Running a command from multiple nodes at the same time can lead to anything from incidental issues, like temporary warnings that another node holds a lock and the command cannot be completed, to more serious lost locks where all gluster commands are blocked until the GlusterD service is restarted on the node holding the lock. I want to reiterate: gluster volume (and similar) commands MUST be run from only one node.
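
One simple way to enforce this on the node you monitor from is to funnel every gluster CLI call through a lock so that scheduled jobs can never overlap. Below is a minimal sketch of such a wrapper; the script name, lock file path, and 60-second timeout are my own choices, not anything Gluster ships, and it only serializes commands on the local node:

    #!/bin/bash
    # gluster-cmd.sh - hypothetical wrapper that serializes gluster CLI calls on this node
    # Usage: gluster-cmd.sh volume status
    LOCKFILE=/var/run/gluster-monitor.lock

    # Wait up to 60 seconds for the lock, then skip this run rather than pile up.
    exec 200>"$LOCKFILE"
    flock --wait 60 200 || { echo "could not get the gluster CLI lock, skipping" >&2; exit 1; }

    gluster "$@"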

The last piece is keeping an eye on your clients. This may or may not be something you want to monitor along with your storage cluster; there are pluses and minuses to doing so. It can be useful to see what the Gluster processes are doing on the client side, as they can consume CPU, memory, and network. On clients I look at:

  • CPU
  • RAM
  • Network
  • Gluster processes
  • Available space
  • IOPs / throughput on Gluster mounts
  • Number of clients in use

In this blog, I will give suggestions on which commands to run to see resource usage, how often to run them, and on which systems they should be run. I will give a few examples of how to accomplish this, but as everyone's environment is different and various tools can be used, I will try not to be too tool specific.

Now that we've gone through the overview, let's get into the weeds a bit. Let's start with monitoring the resources of your Gluster storage cluster nodes. To expand a bit further on the server resources I listed above, here are the data points we will look at for each resource group, along with a possible way to check the usage of each resource:

  • CPU
    • CPU usage (percent utilization for each CPU)
      • top -> 1
    • Context switches
      • sar -w
    • System load
      • sar -q
    • Where CPU resources are being used (system/user/iowait/steal)
      • sar -u
  • RAM
    • RAM usage
      • free -m  -> available column, be sure to account for free-able memory
    • SWAP usage
      • free -m
    • RAM used by Gluster processes
      • ps aux | grep gluster -> be sure to look at RSS (resident/actual usage) not VSZ (virtual/shared memory usage)
  • Network
    • Send statistics
      • ifstat
    • Receive statistics
      • ifstat
    • Dropped packets
      • ifconfig <device>
    • Retransmits
      • netstat -s | grep -i retrans
  • Disk
    • LVM thinpool usage
      • lvdisplay -> Allocated pool data
    • LVM metadata usage
      • lvdisplay -> Allocated metadata
    • AWAIT
      • iostat -d -m -x
    • %Utilized (IOPs)
      • iostat -d -m -x
    • Used space of bricks
      • df -h
  • Processes
    • Look for hot threads
      • top -H -> look for threads pegging a CPU at 100% for more than a few seconds

The system-level commands I have used as examples are just normal, everyday Linux commands that most admins should know. It's up to you how often you want to monitor these data points; I would look at them every few minutes if this is something you need to keep a tight eye on. If you are not resource constrained, I would move out to tens of minutes or even longer. Even though these commands are lightweight and well tested, they still take some system resources, and one of the keys to successful monitoring is to ensure that your monitoring doesn't affect your cluster's performance.
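
To make that concrete, here is a minimal sketch of a per-node sampler you could run from cron every ten minutes or so. The log path, the glusterd/glusterfsd/glusterfs process names I filter on, and the /bricks mount point are assumptions you would adapt to your own layout and tooling:

    #!/bin/bash
    # sample-system.sh - hypothetical per-node sampler; appends one snapshot per run
    LOG=/var/log/gluster-monitor/system-$(hostname -s).log
    mkdir -p "$(dirname "$LOG")"

    {
      echo "==== $(date -Is) ===="
      sar -u 1 3                               # CPU: user/system/iowait/steal
      sar -q 1 3                               # run queue length and load average
      sar -w 1 3                               # context switches per second
      free -m                                  # RAM and swap; watch the available column
      iostat -d -m -x 1 3                      # await and %util per disk
      df -h /bricks/*                          # brick usage; adjust to your brick mounts
      ps -C glusterd,glusterfsd,glusterfs -o pid,rss,vsz,comm   # Gluster process memory (RSS)
    } >> "$LOG" 2>&1

A crontab entry such as */10 * * * * /usr/local/bin/sample-system.sh matches the "tens of minutes" cadence; tighten or loosen it based on how resource constrained your nodes are.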

Next, we will look at the cluster-level commands. These are mostly gluster commands, and I would like to reiterate again that they should only be executed from one node! Cluster-wide commands tend to be a bit more invasive on the system, so we should run them considerably less frequently than the system-level commands. The cluster-level commands I like to keep an eye on are:

  • Gluster Peers
    • gluster peer status
  • Gluster Volumes
    • gluster volume status
    • gluster volume rebalance <VOLNAME> status
  • Geo Replication
    • gluster volume geo-replication status -> appending detail adds some nice extra info
  • Bitrot
    • gluster volume bitrot <VOLNAME> scrub status
  • Snapshots
    • gluster snapshot status
  • Self Heal
    • gluster volume heal <VOLNAME> info
    • gluster volume heal <VOLNAME> info split-brain
  • Quota
    • gluster volume quota <VOLNAME> list

Again, these commands should be run less often and only from one node at a time. The less invasive commands, which include peer status, volume status, and quota list, can be run more often, maybe every 30-120 minutes. The more invasive commands I would run less often, maybe 4-6 times per day. You can choose to run these more often; just remember you don't want your monitoring commands adding additional load to your cluster.
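
As a rough illustration of that cadence, here is a sketch of a cron schedule for the single node you monitor from. It reuses the hypothetical gluster-cmd.sh wrapper from earlier, and the log directory, times, and <VOLNAME> placeholder are all things you would adjust for your environment:

    # /etc/cron.d/gluster-monitor - hypothetical schedule, install on ONE node only
    # Less invasive status checks: hourly
    0  * * * *   root  /usr/local/bin/gluster-cmd.sh peer status                >> /var/log/gluster-monitor/peer.log  2>&1
    5  * * * *   root  /usr/local/bin/gluster-cmd.sh volume status              >> /var/log/gluster-monitor/vol.log   2>&1
    # More invasive checks: four times a day (replace <VOLNAME> with your volume name)
    15 */6 * * * root  /usr/local/bin/gluster-cmd.sh volume heal <VOLNAME> info >> /var/log/gluster-monitor/heal.log  2>&1
    30 */6 * * * root  /usr/local/bin/gluster-cmd.sh snapshot status            >> /var/log/gluster-monitor/snap.log  2>&1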

The last group is client-side commands. This can be a bit of a grey area: if you are using Gluster as a backing store for an application, you may want to monitor your application and your storage cluster separately. I am just listing out the things I would look at; you can implement these however you choose. Client-side commands are a mix of cluster level and system level, and how often they should be run depends on which group they fall under:

  • CPU
    • CPU usage (percent utilization for each CPU)
      • top -> 1
    • Context switches
      • sar -w
    • System load
      • sar -q
    • Where CPU resources are being used (system/user/iowait/steal)
      • sar -u
  • RAM
    • RAM usage
      • free -m  -> available column, be sure to account for free-able memory
    • SWAP usage
      • free -m
    • RAM used by Gluster processes
      • ps aux | grep gluster -> be sure to look at RSS (resident/actual usage) not VSZ (virtual/shared memory usage)
  • Network
    • Send statistics
      • ifstat
    • Receive statistics
      • ifstat
    • Dropped packets
      • ifconfig <device>
    • Retransmits
      • netstat -s | grep -i retrans
  • Gluster processes
    • ps aux -> look at RSS / memory usage
    • top -H -> again, look for hot threads
  • Available space
    • df -h
  • IOPs / throughput on Gluster mounts
    • This can be a tricky one; you could set up a script that runs a small read/write test to measure what kind of throughput you are getting on the volume (see the sketch after this list)
  • Number of clients in use
    • gluster volume status <VOLNAME> clients
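
For the read/write test mentioned above, a pair of timed dd runs against a scratch file on the mount is often enough to spot a trend. This is only a rough sketch; the /mnt/glustervol mount point, the 512 MB test size, and the hidden test file name are assumptions to adjust, and dropping the page cache needs root and briefly affects other processes on the client:

    #!/bin/bash
    # gluster-io-check.sh - hypothetical once-a-day throughput probe on a Gluster mount
    MOUNT=/mnt/glustervol                 # assumed FUSE mount point
    TESTFILE="$MOUNT/.monitor-io-test"

    # Write test: dd prints an MB/s figure on its last line; fsync so data reaches the servers
    dd if=/dev/zero of="$TESTFILE" bs=1M count=512 conv=fsync 2>&1 | tail -n 1

    # Read test: drop the client page cache first so we actually read back over the network
    sync; echo 3 > /proc/sys/vm/drop_caches
    dd if="$TESTFILE" of=/dev/null bs=1M 2>&1 | tail -n 1

    rm -f "$TESTFILE"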

Monitoring clients is less important in general NAS use cases, but if you are hyperconverged and/or use Gluster to back a mission-critical application, then client-side monitoring becomes more important. For the system-level resources, I would adhere to the same guidelines I detailed above. For the cluster-level commands, I would look at these a few times a day; if you want to run actual read/write tests, once a day should suffice.

I hope this blog provides enough detail about what to monitor, and how often, across your cluster, system, and client resources. I'll leave you with a few key ideas I try to adhere to when monitoring my clusters:

  1. Don't let your monitoring in any way interfere with your cluster's performance.
  2. Run cluster-wide monitoring commands less often than system commands as they are much more expensive.
  3. Run your monitoring commands as few times as possible while still effectively keeping an eye on resources.
  4. Only run cluster commands (the gluster commands here) from one node at a time!

Thanks for reading!

-b


