Red Hat Gluster Storage

Rsync is a particularly tough workload for GlusterFS because, with its defaults, it exercises some of the worst-case operations for the file system. GlusterFS is the core of Red Hat Gluster Storage, a scale-out storage solution. Gluster is an open, software-defined storage (SDS) platform designed to scale out and handle data-intensive tasks across many servers in physical, virtual, or cloud deployments. Since GlusterFS is a POSIX-compatible distributed file system, getting the best performance from rsync requires some tuning on both sides.

In this post, I will go through some of these pain points and the tunables that work around them.  Getting rsync to run as fast on GlusterFS as it would on a local file system is not really feasible given GlusterFS's architecture, but below I describe how to get as close as possible.

1)  The main issue with rsync and GlusterFS is that rsync uses the "write new then rename" idiom when creating files.  This means that for every file created, GlusterFS is forced to rename the file, which is by far its most expensive file operation (FOP).  The Gluster distributed hash table (DHT) developers recognized the issue with "write new then rename" workloads and added a couple of tunables to help. The following is from the documentation:

"With the file-lookup mechanisms we already have in place, it's not necessary to move a file from one brick to another when it's renamed - even across directories. It will still be found, albeit a little less efficiently. The first client to look for it after the rename will add a linkfile, which every other client will follow from then on. Also, every client that has found the file once will continue to find it based on its cached location, without any network traffic at all. Because the extra lookup cost is small, and the movement cost might be very large, DHT renames the file "in place" on its current brick instead (taking advantage of the fact that directories exist everywhere).

This optimization is further extended to handle cases where renames are very common. For example, rsync and similar tools often use a "write new then rename" idiom in which a file "xxx" is actually written as ".xxx.1234" and then moved into place only after its contents have been fully written. To make this process more efficient, DHT uses a regular expression to separate the permanent part of a file's name (in this case "xxx") from what is likely to be a temporary part (the leading "." and trailing ".1234"). That way, after the file is renamed it will be in its correct hashed location - which it wouldn't be otherwise if "xxx" and ".xxx.1234" hash differently - and no linkfiles or broadcast lookups will be necessary.

In fact, there are two regular expressions available for this purpose - cluster.rsync-hash-regex and cluster.extra-hash-regex. As its name implies, rsync-hash-regex defaults to the pattern that rsync uses, while extra-hash-regex can be set by the user to support a second tool using the same temporary-file idiom."

For example:

# gluster v set testvol cluster.rsync-hash-regex none
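If a second tool with its own temporary-file naming is in play, cluster.extra-hash-regex can be set instead. As a sketch only (the pattern below is an illustration, not a recommendation; per the documentation quoted above, it should capture the permanent part of a ".xxx.1234"-style name in its parenthesized group):

# gluster v set testvol cluster.extra-hash-regex '^\.(.+)\.[0-9]+$'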

Note that setting rsync-hash-regex to none will cause many files to be placed on incorrect subvolumes, creating a lot of link files until a rebalance is executed.  The link files add a small amount of overhead when those files are accessed; while a rebalance is not necessary immediately, it's a good idea to run one at some point after using rsync with this tunable.
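When you do decide to rebalance, the standard volume commands apply (testvol matches the earlier example):

# gluster volume rebalance testvol start
# gluster volume rebalance testvol status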

2)  Rsync defaults to a pretty small request size, and small requests are also a weak point for GlusterFS.  GlusterFS tends to perform best with request sizes over 64KB; 1MB tends to provide the best performance, and with request sizes under 4KB things really start to degrade.  Rsync does have a tunable to change this behavior: --block-size.  Rsync's default block size is 2KB, which really hurts performance when rsyncing to/from GlusterFS.  Also note that the maximum block size for rsync is 128KB:

#define MAX_BLOCK_SIZE ((int32)1 << 17)

When rsyncing to/from GlusterFS, I suggest using a block size of 128KB. Some older versions of rsync support block sizes up to 512MB; if you have one of those, I suggest using 1MB.  You can set the block size with the following option, which forces the block size used in rsync's delta-transfer algorithm to a fixed value. (The value is normally selected based on the size of each file being updated; see the Rsync Technical Report for details.)

-B, --block-size=BLOCKSIZE

For example:

# rsync -vrah /gluster-mount/ /home/bturner/Downloads/ --progress --block-size=131072

You can also look at the following option (see the rsync man page):

-W, --whole-file

This option disables rsync’s delta-transfer algorithm, which causes all transferred files to be sent whole. The transfer may be faster if this option is used when the bandwidth between the source and destination machines is higher than the bandwidth to disk (especially when the "disk" is actually a networked filesystem). This is the default when both the source and destination are specified as local paths, but only if no batch-writing option is in effect.

The --whole-file option can be used together with or instead of the block-size option. I suggest testing to see which works best for your data set.
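As a quick sketch (reusing the paths from the earlier example), a whole-file transfer that skips the delta-transfer algorithm entirely looks like this:

# rsync -vrah -W /gluster-mount/ /home/bturner/Downloads/ --progress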

3)  Next, we come to the --inplace option, which actually changes how rsync behaves.  It accomplishes something similar to the GlusterFS regex tunable discussed above, except it's implemented on the rsync side instead of the GlusterFS side.  The following information is from the man page:

--inplace update destination files in-place

This option changes how rsync transfers a file when its data needs to be updated: instead of the default method of creating a new copy of the file and moving it into place when it is complete, rsync instead writes the updated data directly to the destination file.

This has several effects:

  • Hard links are not broken. This means the new data will be visible through other hard links to the destination file. Moreover, attempts to copy differing source files onto a multiply-linked destination file will result in a "tug of war" with the destination data changing back and forth.
  • In-use binaries cannot be updated (either the OS will prevent this from happening, or binaries that attempt to swap-in their data will misbehave or crash).
  • The file’s data will be in an inconsistent state during the transfer and will be left that way if the transfer is interrupted or if an update fails.
  • A file that rsync cannot write to cannot be updated. While a super user can update any file, a normal user needs to be granted write permission for the open of the file for writing to be successful.
  • The efficiency of rsync’s delta-transfer algorithm may be reduced if some data in the destination file is overwritten before it can be copied to a position later in the file. This does not apply if you use --backup, since rsync is smart enough to use the backup file as the basis file for the transfer.

WARNING: you should not use this option to update files that are being accessed by others, so be careful when choosing to use this for a copy.

This option is useful for transferring large files with block-based changes or appended data, and also on systems that are disk bound, not network bound. It can also help keep a copy-on-write filesystem snapshot from diverging the entire contents of a file that only has minor changes.

The option implies --partial (since an interrupted transfer does not delete the file), but conflicts with --partial-dir and --delay-updates. Prior to rsync 2.6.4 --inplace was also incompatible with --compare-dest and --link-dest.

I recommend using the GlusterFS tunable when you have a changing data set or when you don't want to mess with rsync's default operation.  I usually use either the GlusterFS tunable or --inplace. I haven't tried using both at the same time, but I expect that since --inplace avoids "write new then rename" entirely, the GlusterFS tunable would have no effect.
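As a sketch of what an --inplace run might look like, combined with the 128KB block size from earlier (the paths are illustrative, with the GlusterFS mount as the destination this time, since that is where the renames would otherwise hurt):

# rsync -vrah --inplace --block-size=131072 /home/bturner/Downloads/ /gluster-mount/ --progress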

4)  Other workarounds:

  • Tar up the directory and use scp (a sketch follows this list). In some cases, geo-replication tars up data and sends it as one file, which can reduce the number of round trips that go over the wire as well as avoid the rename FOP.
  • This can be sped up with the parallel-untar utility from Ben England.
  • Rsync to a local directory and copy to GlusterFS.
  • Use geo-replication.
  • Use cp with the proper flags to keep whatever metadata/xattrs/etc. you need (also sketched below).
  • Use some other application that does not follow the "write new then rename" workflow.  Remember, it's the renames that really kill performance here, so an application that avoids them will fare much better.
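Here is a minimal sketch of the tar and cp workarounds. The host name and paths are placeholders, and the first line pipes tar over ssh rather than scp'ing an intermediate tarball; the second uses cp's archive mode, which preserves ownership, permissions, timestamps, and (with GNU cp) extended attributes:

# tar -czf - -C /data . | ssh gluster-client 'tar -xzf - -C /gluster-mount'
# cp -a /staging/. /gluster-mount/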

Try these tips to see if they increase the performance of your rsync workloads. If you know of any tip I missed, please let me know in the comments!

Last updated: August 11, 2018