
Red Hat OpenShift Data Foundation topology considerations

June 19, 2024
James Force, Travis Nielsen
Related topics:
Hybrid Cloud, Kubernetes
Related products:
Red Hat OpenShift Data Foundation


    There are many different factors to consider when using Red Hat OpenShift Data Foundation to set up a storage cluster. This article dives into the specifics of building a production-grade, resilient, and scalable deployment of OpenShift Data Foundation. To design and build out a suitable deployment architecture, we need to dig into the following points:

    • What does the underlying platform environment look like in terms of architecture/topology/design?
    • What underlying storage is available and how is that presented?
    • Replication options.

    To start, if you're new to OpenShift Data Foundation, please see the architecture documentation to get a high-level understanding of what the operator provides and is doing. Under the covers, OpenShift Data Foundation uses Rook and Ceph. If you are unfamiliar with Ceph, it is a software-defined storage system designed to run on commodity hardware, where multiple physical servers with attached disks are clustered together. It's worth reviewing this Introduction to Ceph and spending some time getting to grips with its core architecture and components beforehand.

    When running Ceph on Kubernetes, OpenShift Data Foundation automates the deployment, but it is not a silver bullet. Understanding how the cluster is spread across the underlying infrastructure and where the actual storage comes from is of paramount importance for a successful design and deployment.

    As a reference, for supportability specifics, the knowledge-centered service (KCS) article Supported configurations for Red Hat OpenShift Data Foundation 4.X should clear up any confusion about what can and can't be done.

    Topology

    Regardless of how the storage is presented, be that through some existing dynamic storage class or directly attached disks, it is critical to have a fault tolerant topology. You need to first consider:

    • How many storage nodes you want in the cluster
    • How many OSDs (disks) per node
    • Where the nodes are located

    The above points are the main elements which define the underlying Ceph configuration and thus the placement groups and controlled replication under scalable hashing (CRUSH) maps that get created by the operator.

    If you just install OpenShift Data Foundation without much thought and label seemingly random nodes as storage nodes, it will work, although the topology may be poor. For example, you may have multiple nodes in the same availability zone and fail to spread the data evenly across failure domains. OpenShift Data Foundation relies on building its topology based on node labels. The node labels that are used to form the topology in hierarchical order from highest to lowest are:

    topology.kubernetes.io/region
    topology.kubernetes.io/zone
    topology.rook.io/datacenter
    topology.rook.io/room
    topology.rook.io/pod
    topology.rook.io/pdu
    topology.rook.io/row
    topology.rook.io/rack
    topology.rook.io/chassis

    The region, zone, and rack labels are more commonly used, with OpenShift Data Foundation primarily utilizing zone and rack labels to determine the topology of Ceph pools. While other labels contribute to the CRUSH hierarchy for OSDs, the pools created by OpenShift Data Foundation do not utilize these other labels as failure domains, unlike zones and racks.

    The Kubernetes region and zone labels are common node labels. Depending on your cloud or underlying platform, you may find your nodes already have logical region/zone labels. The topology labels help paint a correct picture of your cluster's deployment and give Rook the information needed to lay out the Ceph topology. It usually makes most sense to leverage the Kubernetes region and zone labels where you can. For VMware, you may want to consider specifying regions and zones configuration for a cluster on vSphere if you haven't already. This will give your storage cluster the information it needs to deploy the Ceph cluster.

    When the node labels are missing, the OpenShift Data Foundation operator makes a best effort to determine the failure domains automatically. The failure domain essentially defines how the data will be spread. For example, if zone labels are present on the nodes for three different zones, Ceph will spread the data replicas across the three zones. By adding node labels, you ensure optimal data resiliency instead of relying on a distribution that has no topology information and would place the data on random nodes.
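
    As a minimal sketch, assuming hypothetical node names (worker-1, worker-3, worker-5), zone labels can be applied and then checked with oc before the StorageCluster is created:

    # Apply zone labels to the intended storage nodes (node names are placeholders)
    oc label node worker-1 topology.kubernetes.io/zone=zone1
    oc label node worker-3 topology.kubernetes.io/zone=zone2
    oc label node worker-5 topology.kubernetes.io/zone=zone3

    # Confirm the labels Rook will use to build the CRUSH hierarchy
    oc get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone -L topology.rook.io/rack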

    Zone-based example

    As an example, Figure 1 illustrates a topology containing three zones, with two servers per zone and, initially, a single disk per node (with the potential to scale by adding disks per node). The topology, and thus the failure domain, is based upon the topology.kubernetes.io/zone label of the nodes.

    Figure 1: An example of a 3 zone, 6 node topology.

    In a Ceph style layout, it would look like this:

    .
    └── region
       ├── zone (topology.kubernetes.io/zone=zone1)
       │   ├── host: ocs-deviceset-0-data
       │   │   └── osd
       │   └── host: ocs-deviceset-4-data
       │       └── osd
       ├── zone (topology.kubernetes.io/zone=zone2)
       │   ├── host: ocs-deviceset-1-data
       │   │   └── osd
       │   └── host: ocs-deviceset-3-data
       │       └── osd
       └── zone (topology.kubernetes.io/zone=zone3)
           ├── host: ocs-deviceset-2-data
           │   └── osd
           └── host: ocs-deviceset-5-data
               └── osd

    Elaborating on this example, the default block pool has a total replica count of 3 with 1 replica per zone (in this example, "zone" is the defined failure domain). When data is written to the cluster it will be stored on one OSD in each zone, so you have a total of 3 replicas of the data spread across the zones. 

    Flexible scaling

    Flexible scaling is a feature that, if enabled in the StorageCluster, sets the failureDomain to host. This allows replicas to be distributed evenly across all nodes, regardless of how those nodes are distributed across zones or racks. This contrasts with the default behavior, in which OpenShift Data Foundation uses zones if they are already defined or automatically defines three racks to simulate a topology. Flexible scaling must be enabled during cluster creation because it directly shapes the Ceph pool creation and the CRUSH rule associated with each pool. Modifying flexibleScaling in flight is therefore not possible; changing it after the fact will not update the relevant resources or the pool's CRUSH rule.
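
    Because the setting cannot be changed later, it is worth confirming what an existing cluster was created with. A quick check, assuming the default StorageCluster name ocs-storagecluster (the field and resource names here are as documented for recent releases; adjust to your environment):

    # Is flexible scaling enabled on this cluster?
    oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.flexibleScaling}{"\n"}'

    # Which failure domain did the operator settle on?
    oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.failureDomain}{"\n"}'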

    If the CRUSH rule for a pool is based on a host failure domain, the distribution of data can look random depending on where the storage nodes sit. With a 3 storage node setup in which the nodes are pinned to specific zones, this is not a huge concern, as a replica will live on each node. However, if you plan to run more than 3 storage nodes with host as the failure domain in the pool's CRUSH rule, distribution will be random at the node level. See Figure 2. If any zone or other topology labels are defined, the topology will be ignored at those levels; only host-level distribution of data is enforced with the flexible scaling feature.

    Figure 2: An example of a 6 node host topology.

    In a Ceph style layout, it would look like this:

    .
    └── 
       ├── host: ocs-deviceset-0-data
       │   └── osd
       ├── host: ocs-deviceset-1-data
       │   └── osd
       ├── host: ocs-deviceset-2-data
       │   └── osd
       ├── host: ocs-deviceset-3-data
       │   └── osd
       ├── host: ocs-deviceset-4-data
       │   └── osd
       └── host: ocs-deviceset-5-data
           └── osd

    Verifying the topology

    The following commands let you validate that a pool is using the expected host, zone, or other level for its hierarchy. Run them from within the rook-ceph toolbox container. See How to configure the Rook-Ceph Toolbox.

    To enable the rook-ceph toolbox pod (if not already enabled/running):

    oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

    Enter the pod:

    oc rsh -n openshift-storage $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)

    Display the OSD tree:

    ceph osd df tree

    Display the pool with the replication factor and CRUSH rule:

    ceph osd pool ls detail

    List all the CRUSH rules and the associated bucket for each pool:

    ceph osd crush rule dump
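
    Putting these together, you can also zero in on a single pool and confirm which bucket type its CRUSH rule uses as the failure domain. A sketch from inside the toolbox pod, assuming the default block pool name (Rook normally names the rule after the pool):

    # Which CRUSH rule does the pool use?
    ceph osd pool get ocs-storagecluster-cephblockpool crush_rule

    # Dump that rule and look for the bucket type in its chooseleaf step,
    # e.g., "type": "zone" or "type": "host"
    ceph osd crush rule dump ocs-storagecluster-cephblockpool | grep '"type"'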

    Underlying storage

    If you're using bare metal (actual raw storage) with the local storage operator, the topology can be quite simple to define so that it matches your physical layout of hosts, racks, and zones. In other environments such as VMware, where devices are provisioned using a dynamic StorageClass, things may require further consideration. In VMware, if you're deploying using the default StorageClass, it will likely be backed by either vSAN or VMFS. With either of these, you could find yourself in a situation where replication is happening on top of replication. The extra level of replication will reduce performance, so it's worth spending some time looking at where the underlying storage for OpenShift Data Foundation is being presented from and what that means.
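
    A quick way to see where the OSDs get their storage is to look at the device sets in the StorageCluster and at the provisioner behind that StorageClass. A sketch, again assuming the default StorageCluster name:

    # Which StorageClass backs the OSD PVCs?
    oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.storageDeviceSets[*].dataPVCTemplate.spec.storageClassName}{"\n"}'

    # Which provisioner (vSphere CSI, local storage, etc.) sits behind it?
    oc get storageclass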

    Replication policies/tiers

    Replication requirements should be evaluated, as not all workloads that require storage will be the same. You may wish to introduce a tier system that represents how the data is replicated. Some storage requirements may just be "scratch space," so there is no point keeping more copies of the data than needed. Other application data may be quite important, and you will want stronger assurance that it is fully replicated. With that in mind, you may want to define multiple storage classes, each with a different pool, such as bronze/silver/gold tiers of replication.
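
    As a hedged sketch of what a lower tier might look like, the following CephBlockPool keeps only two replicas (the pool name is hypothetical; check the replica 2 support statement linked later in this article before relying on it for real data):

    apiVersion: ceph.rook.io/v1
    kind: CephBlockPool
    metadata:
      name: bronze-replica2-pool   # hypothetical "bronze" tier pool
      namespace: openshift-storage
    spec:
      failureDomain: zone
      replicated:
        size: 2                    # two copies instead of the default three

    Saved to a file and applied with oc apply -f, this creates an extra pool alongside the defaults; a corresponding StorageClass would then point the RBD CSI provisioner at it so workloads can request the bronze tier explicitly.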

    In a simple default setup you'd have three nodes, each with a single disk, and the replica count for each pool would be three. For each write in this setup, every OSD keeps a copy of the data. If these were 2TB disks, you'd have 2TB usable out of 6TB total. If you had multiple pools with different replica counts, the usable space would depend on how heavily each pool is used.
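
    To see how replication translates into raw versus usable capacity on a live cluster, the toolbox gives you a quick view (the per-pool MAX AVAIL column already accounts for each pool's replication):

    # Raw cluster capacity plus per-pool usage and usable space
    ceph df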

    Out of the box, you'll find a default instance of each of the following resources:

    • CephBlockPool
    • CephFilesystem 
    • CephObjectStore 

    The failureDomain of all of these default pools will be consistent with the zones, racks, etc., created in the topology described previously. An example pool for block storage would look like this (the metadata shown reflects the default pool name created by the operator):

    apiVersion: ceph.rook.io/v1
    kind: CephBlockPool
    metadata:
      name: ocs-storagecluster-cephblockpool  # default block pool created by the operator
      namespace: openshift-storage
    spec:
      enableRBDStats: true
      failureDomain: zone
      replicated:
        replicasPerFailureDomain: 1
        size: 3
        targetSizeRatio: 0.49

    This example ensures that three copies of the data exist in total, with each copy stored in a different zone (the failureDomain), so if you lose one zone you still have two copies of the data.
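
    The same settings can be read back from the custom resources themselves without entering the toolbox. A sketch, assuming the default block pool name created by the operator:

    # List the default data resources created by OpenShift Data Foundation
    oc get cephblockpool,cephfilesystem,cephobjectstore -n openshift-storage

    # Inspect a pool's failure domain and replica count
    oc get cephblockpool ocs-storagecluster-cephblockpool -n openshift-storage -o jsonpath='{.spec.failureDomain} {.spec.replicated.size}{"\n"}'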

    For support details on different replication options see: Data Availability and Integrity considerations for replica 2 pools and supported configurations. 

    Conclusion

    An OpenShift Data Foundation storage cluster requires careful consideration of various factors, including the underlying platform environment, available storage, and replication options. Understanding the topology and how data is distributed across nodes and failure domains is crucial for achieving a resilient design. By leveraging node labels and configuring appropriate replication policies, you can ensure that your deployment is both fault-tolerant and scalable. 

    Last updated: June 20, 2024
