
Open vSwitch (OVS) is a feature-rich implementation of a software-defined networking (SDN) L2 switch. The revalidator process, run by Open vSwitch's ovs-vswitchd daemon, is responsible for the following:

  • Making sure the installed datapath flows match the configured OpenFlow rules
  • Updating OpenFlow packet and byte statistics
  • Removing idle datapath flows

However, the OVS implementation details are not well documented or understood. The goal of this article is to provide insight into the operation of the revalidator.

Revalidator threads

The ovs-vswitchd daemon creates several revalidator threads when a datapath initializes, which happens when the first bridge for that datapath is created. If you use both the kernel (system) datapath and the userspace (netdev) datapath, two sets of revalidator threads are created.

The number of total threads created depends on the n-revalidator-threads setting. The documentation states:

other_config : n-revalidator-threads: optional  string,  containing  an
               integer, at least 1

       Attempts to specify the number of threads for software datapaths
       to use for revalidating flows in the  datapath.  Some  datapaths
       may  choose to ignore this and will set to a sensible option for
       the datapath type.

       Typically, there is a direct correlation between the  number  of
       revalidator  threads,  and  the  number  of flows allowed in the
       datapath. The default is the number of cpu cores divided by four
       plus  one.  If  n-handler-threads is set, the default changes to
       the number of cpu cores minus the number of handler threads.

       This configuration is per datapath. If you have  more  than  one
       software  datapath  (e.g.  some  system  bridges and some netdev
       bridges), then the total number of threads is n-revalidator-threads
       times the number of software datapaths.
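The documented defaults can be condensed into a small Python sketch. This is illustrative only (the function name is an assumption, and the real logic lives in OVS's C code), but it mirrors the two documented formulas:

```python
def default_revalidator_threads(n_cores, n_handler_threads=None):
    """Approximate the documented default for n-revalidator-threads.

    By default: the number of CPU cores divided by four, plus one.
    If n-handler-threads is set, the default becomes the number of
    cores minus the number of handler threads.
    """
    if n_handler_threads is not None:
        return max(1, n_cores - n_handler_threads)
    return n_cores // 4 + 1
```

For example, on a 22-core machine with n-handler-threads forced to 17, this formula yields 5 revalidator threads.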

If you would like to change the number of revalidator threads, you can enter commands such as these:

# cat /var/log/openvswitch/ovs-vswitchd.log | grep n-revalidator
2022-08-30T09:07:09.023Z|00029|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 17, setting n-revalidator-threads to 5

# pidstat -t -p `pidof ovs-vswitchd` | grep revalidator
05:07:19 AM   976         -    267439    0.00    0.00    0.00    0.00    0.00    12  |__revalidator19
05:07:19 AM   976         -    267440    0.00    0.00    0.00    0.00    0.00    21  |__revalidator20
05:07:19 AM   976         -    267441    0.00    0.00    0.00    0.00    0.00     9  |__revalidator21
05:07:19 AM   976         -    267442    0.00    0.00    0.00    0.00    0.00     0  |__revalidator22
05:07:19 AM   976         -    267443    0.00    0.00    0.00    0.00    0.00    27  |__revalidator23

# ovs-vsctl set Open_vSwitch . other-config:n-revalidator-threads=10
#

# cat /var/log/openvswitch/ovs-vswitchd.log | grep n-revalidator
2022-08-30T09:07:09.023Z|00029|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 17, setting n-revalidator-threads to 5
2022-08-30T09:08:22.216Z|00039|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 17, setting n-revalidator-threads to 10

# pidstat -t -p `pidof ovs-vswitchd` | grep revalidator
05:08:29 AM   976         -    267543    0.00    0.00    0.00    0.00    0.00    27  |__revalidator41
05:08:29 AM   976         -    267544    0.00    0.00    0.00    0.00    0.00     7  |__revalidator42
05:08:29 AM   976         -    267545    0.00    0.00    0.00    0.00    0.00    10  |__revalidator43
05:08:29 AM   976         -    267546    0.00    0.00    0.00    0.00    0.00    14  |__revalidator44
05:08:29 AM   976         -    267547    0.00    0.00    0.00    0.00    0.00    26  |__revalidator45
05:08:29 AM   976         -    267548    0.00    0.00    0.00    0.00    0.00    11  |__revalidator46
05:08:29 AM   976         -    267549    0.00    0.00    0.00    0.00    0.00    21  |__revalidator47
05:08:29 AM   976         -    267550    0.00    0.00    0.00    0.00    0.00    24  |__revalidator48
05:08:29 AM   976         -    267551    0.00    0.00    0.00    0.00    0.00    12  |__revalidator49
05:08:29 AM   976         -    267552    0.00    0.00    0.00    0.00    0.00     9  |__revalidator50

How the revalidator threads operate

What do the revalidator threads do? The following comment in Open vSwitch's ofproto-dpif-upcall.c file provides an initial answer:

 /* A thread that processes datapath flows, updates OpenFlow statistics, and
  * updates or removes them if necessary.
  *
  * Revalidator threads operate in two phases: "dump" and "sweep". In between
  * each phase, all revalidators sync up so that all revalidator threads are
  * either in one phase or the other, but not a combination.
  *
  *     During the dump phase, revalidators fetch flows from the datapath and
  *     attribute the statistics to OpenFlow rules. Each datapath flow has a
  *     corresponding ukey which caches the most recently seen statistics. If
  *     a flow needs to be deleted (for example, because it is unused over a
  *     period of time), revalidator threads may delete the flow during the
  *     dump phase. The datapath is not guaranteed to reliably dump all flows
  *     from the datapath, and there is no mapping between datapath flows to
  *     revalidators, so a particular flow may be handled by zero or more
  *     revalidators during a single dump phase. To avoid duplicate attribution
  *     of statistics, ukeys are never deleted during this phase.
  *
  *     During the sweep phase, each revalidator takes ownership of a different
  *     slice of umaps and sweeps through all ukeys in those umaps to figure out
  *     whether they need to be deleted. During this phase, revalidators may
  *     fetch individual flows which were not dumped during the dump phase to
  *     validate them and attribute statistics.
  */

This all happens in the revalidator threads created by udpif_start_threads(). The entry point for all these threads is udpif_revalidator(). I will describe this later in the "Triggering revalidation" section.

The first revalidator thread acts as the leader and decides when to run a new dump and sweep. So how does it determine when to do so? The following criteria determine the answer:

  • The minimum time between two runs is 5ms.
  • The delay time is the lesser of the ofproto_max_idle and ofproto_max_revalidator values.
  • The ofproto_max_idle value defaults to 10,000ms when not configured. To configure it, set the max-idle variable in the Open_vSwitch database. For example:
    $ ovs-vsctl set Open_vSwitch . other_config:max-idle=20000
  • The ofproto_max_revalidator value defaults to 500ms when not configured. To configure it, set the max-revalidator variable in the Open_vSwitch database. For example:
    $ ovs-vsctl set Open_vSwitch . other_config:max-revalidator=1000
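Putting these timing rules together, the delay before the next run can be sketched as follows (a sketch of the documented behavior, not actual OVS code; the function name is illustrative):

```python
def revalidator_delay_ms(max_idle=10000, max_revalidator=500):
    """Delay before the next dump/sweep run: the lesser of max-idle
    and max-revalidator, but never below the 5ms minimum between runs."""
    return max(5, min(max_idle, max_revalidator))
```

With the defaults, the revalidator therefore wakes up every 500ms unless a udpif->reval_seq change wakes it earlier.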

The documentation describes these two parameters as follows:

other_config : max-idle: optional string,  containing  an  integer,  at
               least 500

       The  maximum  time (in ms) that idle flows will remain cached in
       the datapath. Internally OVS will check the validity and  activ‐
       ity  for  datapath  flows regularly and may expire flows quicker
       than this number, based on real time network conditions.  Tweak‐
       ing  this  value  is  discouraged  unless  you know exactly what
       you're doing.

       The default is 10000.
other_config : max-revalidator: optional string, containing an integer,
               at least 100

       The  maximum time (in ms) that revalidator threads will wait be‐
       fore executing flow revalidation. Note that this is maximum  al‐
       lowed  value.  Actual timeout used by OVS is minimum of max-idle
       and max-revalidator values. Tweaking this value  is  discouraged
       unless you know exactly what you're doing.

       The default is 500.

If the timer has expired or the thread is awakened earlier by a udpif->reval_seq change, the dump phase starts, as explained in the previous documentation. The udpif->reval_seq change happens through a call to udpif_revalidate().

The leader thread has the following tasks. All other threads will remain idle during these calculations:

  1. Determine the flow_limit. This value is dynamically adjusted based on the configured other_config:flow-limit value (default 200,000) and the time the most recent revalidator round took. The initial setting is the lesser value of the configured other_config:flow-limit and 10,000. Here are the formulas for resetting flow_limit:
          if duration > 2000 ms:
             flow_limit /= (duration in seconds)
          elif duration > 1300 ms:
             flow_limit *= 0.75
          elif duration < 1000 ms AND
               flow_limit < n_flows__last_run * 1000 / duration:
             flow_limit += 1000
    
          flow_limit = MIN(other_config:flow-limit, MAX(flow_limit, 1000))
  2. Get the current number of flows in the datapath and set n_flows to that value.
  3. Set the max_n_flows to the maximum ever seen: That is, set max_n_flows to n_flows if n_flows is greater.
  4. Reset avg_n_flows to reflect the number of flows at the current run: In other words, set it to (avg_n_flows + n_flows) / 2.
  5. Create a flow dump. The dump is in terse mode if the datapath enables and supports unique flow identifiers (UFIDs). Otherwise, a full dump happens. A dump in terse mode returns only the UFID and statistics. The following comment from dpif.h explains the sequence of events and functions called during a flow dump:
      /* Flow dumping interface
       * ======================
       *
       * This interface allows iteration through all of the flows currently installed
       * in a datapath.  It is somewhat complicated by two requirements:
       *
       *    - Efficient support for dumping flows in parallel from multiple threads.
       *
       *    - Allow callers to avoid making unnecessary copies of data returned by
       *      the interface across several flows in cases where the dpif
       *      implementation has to maintain a copy of that information anyhow.
       *      (That is, allow the client visibility into any underlying batching as
       *      part of its own batching.)
       *
       *
       * Usage
       * -----
       *
       * 1. Call dpif_flow_dump_create().
       * 2. In each thread that participates in the dump (which may be just a single
       *    thread if parallelism isn't important):
       *        (a) Call dpif_flow_dump_thread_create().
       *        (b) Call dpif_flow_dump_next() repeatedly until it returns 0.
       *        (c) Call dpif_flow_dump_thread_destroy().
       * 3. Call dpif_flow_dump_destroy().
       *
       * All error reporting is deferred to the call to dpif_flow_dump_destroy().
       */
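The flow_limit adjustment in step 1 can be written as a small Python sketch. This only mirrors the formulas documented above (the real logic is in ofproto-dpif-upcall.c and uses integer arithmetic; names here are illustrative):

```python
def adjust_flow_limit(flow_limit, duration_ms, n_flows_last_run,
                      configured_limit=200000):
    """Recompute flow_limit based on how long the last round took."""
    if duration_ms > 2000:
        # Slow round: scale the limit down by the duration in seconds.
        flow_limit = int(flow_limit / (duration_ms / 1000))
    elif duration_ms > 1300:
        flow_limit = flow_limit * 3 // 4
    elif (0 < duration_ms < 1000
          and flow_limit < n_flows_last_run * 1000 / duration_ms):
        flow_limit += 1000
    # Clamp between 1000 and the configured other_config:flow-limit.
    return min(configured_limit, max(flow_limit, 1000))
```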

NOTE: The upcoming "Using upcall/show to see internal variables" section explains how to view the values of these dynamic variables.

After the tasks are complete, the leader thread is the last thread to hit ovs_barrier_block(), following which all revalidator threads continue by calling revalidate().

revalidate() does the following:

  1. Call dpif_flow_dump_thread_create().
  2. Call dpif_flow_dump_next() to get a maximum of REVALIDATE_MAX_BATCH (50) flows, and return if zero flows are received.
  3. Determine the max_idle timeout.
  4. For each batch where n_dp_flows >= flow_limit, increment the upcall_flow_limit_hit coverage counter.
  5. For each flow in the batch, follow these five steps:
    1. Call ukey_acquire(), which tries to find the (struct udpif_key *) ukey corresponding to the datapath flow.

      If the ukey exists but cannot be locked, some other thread is using it, and revalidation for this round is skipped. If the error is EBUSY (the lock can't be taken), the function increments the upcall_ukey_contention coverage counter. All other acquisition errors call log_unexpected_flow() and queue a delete operation for this flow via delete_op_init__().

      If the ukey is not found at all, an entry is created so that the flow will be re-evaluated. This handles the case where OVS restarts: existing datapath flows are re-evaluated rather than blindly deleted.

    2. Based on the ukey->dump_seq value, determine whether we have already looked at this entry. If so, increment one of two coverage counters (dumped_duplicate_flow or dumped_new_flow) and continue to the next datapath flow.

    3. If the ukey state <= UKEY_OPERATIONAL, change the value to UKEY_OPERATIONAL. If the value is larger than UKEY_OPERATIONAL, the entry is in some delete state. Log the information and continue with the next entry.

    4. If the flow has been idle for max_idle milliseconds or we are in recovery mode, mark the ukey state as UKEY_DELETE. Otherwise, call revalidate_ukey(), which marks the ukey state as UKEY_KEEP, UKEY_DELETE, or UKEY_MODIFY. See the  "The revalidate_ukey function" section for more details.

    5. If the ukey state is not marked as UKEY_KEEP—that is, the state is UKEY_DELETE or UKEY_MODIFY—add the desired datapath operation to the operation queue. For UKEY_DELETE, also set the ukey state to UKEY_EVICTING.

  6. Now that the batch is completed, push all the operations to the datapath by calling push_dp_ops(). For UKEY_DELETE, this step also updates the statistics and sets the ukey state to UKEY_EVICTED.
  7. Try to receive another batch of flows and repeat this sequence on them. If no more flows are retrieved, continue.

The following comment from ofproto-dpif-upcall.c explains the duration of the flows:

    /* In normal operation we want to keep flows around until they have
     * been idle for 'ofproto_max_idle' milliseconds.  However:
     *
     *     - If the number of datapath flows climbs above 'flow_limit',
     *       drop that down to 100 ms to try to bring the flows down to
     *       the limit.
     *
     *     - If the number of datapath flows climbs above twice
     *       'flow_limit', delete all the datapath flows as an emergency
     *       measure.  (We reassess this condition for the next batch of
     *       datapath flows, so we will recover before all the flows are
     *       gone.) */

NOTE: The number of datapath flows is compared for each batch, but the number is read from the datapath only every 100ms, so these actions might not be applied to all the flows.
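The idle-timeout behavior described in that comment can be condensed into a sketch (illustrative; the actual code sets an internal "kill them all" flag rather than returning 0):

```python
def effective_max_idle(n_dp_flows, flow_limit, max_idle=10000):
    """Per-batch idle timeout: above flow_limit, drop to 100ms; above
    twice flow_limit, treat every flow as expired (modeled here as 0)."""
    if n_dp_flows > 2 * flow_limit:
        return 0
    if n_dp_flows > flow_limit:
        return 100
    return max_idle
```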

At this point, the thread has exited from revalidate(). It waits until all threads finish revalidate(). Once this happens, the sweep phase starts with a call to revalidator_sweep().

The revalidator_sweep() function carries out the following tasks:

  1. Loop over all ukey entries in the threads' responsible umaps. For each entry, follow these three steps:
    1. Try to lock the ukey's mutex. If the lock cannot be taken, skip the entry in this round.
    2. If the ukey state is UKEY_OPERATIONAL and the entry missed the revalidate_ukey() call or dump phase, call revalidate_ukey() and push_dp_ops(), just as in the dump phase.
    3. If the new ukey state is UKEY_EVICTED, call ukey_delete(), which removes the ukey from the umap, sets the ukey state to UKEY_DELETED, and frees the data structure using a read-copy-update (RCU) mechanism.

Now we are done until the next revalidate timeout or udpif->reval_seq change, which starts another dump and sweep phase.
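The per-thread slicing of umaps during the sweep can be illustrated with a simple strided split. This is a sketch under assumptions: the shard count and the exact partitioning scheme in OVS's C code may differ, but the key property is that every umap is owned by exactly one thread.

```python
N_UMAPS = 512  # assumed shard count; OVS shards ukeys into fixed umaps

def umaps_for_thread(thread_index, n_threads, n_umaps=N_UMAPS):
    """Give each revalidator a disjoint, strided slice of the umaps so
    every umap is swept by exactly one thread."""
    return list(range(thread_index, n_umaps, n_threads))
```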

The revalidate_ukey function

If needed, this function revalidates the datapath flow and updates the OpenFlow statistics.

The first thing this function does is make sure we need to revalidate flows. OpenFlow revalidation is triggered through a call to udpif_revalidate(), as described in the upcoming section, "Triggering revalidation." The decision is made by comparing udpif->reval_seq with ukey->reval_seq. If they are different, the system needs a revalidation.

Let's explain this function with this pseudocode:

def revalidate_ukey(udpif, ukey):

  result = UKEY_DELETE

  IF udpif->reval_seq != ukey->reval_seq:
    IF !used                                            #[1]
       OR duration < (other_config:max-revalidator / 2) #[2]
       OR flow_pps > other_config:min-revalidate-pps:   #[3]

      IF xlate_ukey():
        IF ukey->actions == xlate_actions:
          result = UKEY_KEEP
        ELSE:
          result = UKEY_MODIFY
      ELSE:
        # Delete the ukey, as it cannot be translated
    ELSE:
      # Delete the ukey; too expensive to revalidate

  ELIF !n_packets OR ukey->xcache OR !populate_xcache():
    result = UKEY_KEEP

  IF result != UKEY_DELETE:
    update_statistics                                   #[4]
    ukey->reval_seq = udpif->reval_seq

  RETURN result

The following list explains the lines marked by numbers on the right side of the code:

  1. When used is not set, the datapath flow was programmed, but no data has passed through it yet.
  2. This calculation determines whether it's likely that the function can do the whole revalidation in max-revalidator time.
  3. The other_config:min-revalidate-pps configuration option is documented as follows:
    other_config : min-revalidate-pps: optional string, containing an
                   integer, at least 1
    
           Set minimum pps that flow must have in order to  be  revalidated
           when  revalidation duration exceeds half of max-revalidator con‐
           fig variable.
    
           The default is 5.
  4. Another part of the code updates the statistics for the UKEY_DELETE state.
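The three numbered conditions above can be combined into a small predicate. This is an illustrative sketch, not OVS API; the names and defaults follow the documentation quoted above:

```python
def should_full_revalidate(used, reval_duration_ms, flow_pps,
                           max_revalidator=500, min_revalidate_pps=5):
    """Revalidate fully if the flow was never hit, if revalidation is
    still fast, or if the flow is busy enough to be worth the cost."""
    return (not used
            or reval_duration_ms < max_revalidator / 2
            or flow_pps > min_revalidate_pps)
```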

Triggering revalidation

As seen earlier, calling udpif_revalidate() triggers a revalidation; it simply issues the call seq_change(udpif->reval_seq). udpif_revalidate() is called only from the type_run() function in ofproto-dpif.c, and only when backer->need_revalidate is non-zero. This value also records the reason for the revalidation. However, the recorded reason might not be 100% accurate, because multiple revalidations could be triggered before type_run() is called.

ofproto-dpif.h defines the following revalidation reasons:

enum revalidate_reason {
    REV_RECONFIGURE = 1,       /* Switch configuration changed. */
    REV_STP,                   /* Spanning tree protocol port status change. */
    REV_RSTP,                  /* RSTP port status change. */
    REV_BOND,                  /* Bonding changed. */
    REV_PORT_TOGGLED,          /* Port enabled or disabled by CFM, LACP, ...*/
    REV_FLOW_TABLE,            /* Flow table changed. */
    REV_MAC_LEARNING,          /* Mac learning changed. */
    REV_MCAST_SNOOPING,        /* Multicast snooping changed. */
};

For each reason, a corresponding coverage counter can help administrators debug why revalidation happens. Note that only non-zero coverage counters are shown in the following example:

  # ovs-appctl coverage/show | grep -E "rev_|Event"
  Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=d9180580:
  rev_port_toggled           0.0/sec     0.000/sec        0.0000/sec   total: 1
  rev_flow_table             0.0/sec     0.000/sec        2.7761/sec   total: 9995

Using upcall/show to see internal variables

The variables we discussed in the previous sections can be read using the ovs-appctl command. Here is a sample output:

$ ovs-appctl upcall/show
system@ovs-system:
  flows         : (current 10000) (avg 9999) (max 10000) (limit 172000)
  dump duration : 9ms
  ufid enabled : true

  19: (keys 2001)
  20: (keys 1956)
  21: (keys 1980)
  22: (keys 1982)
  23: (keys 2081)

These values are related to the current run, or the last run if the system is waiting for the next run to start. The displayed names match up with internal variables as follows:

  • current: n_flows
  • avg: avg_n_flows
  • max: max_n_flows
  • limit: Dynamically calculated flow_limit
  • dump duration: The total time of the most recent dump and sweep phase
  • xx: (keys <nm>): xx is the OVS thread ID of the revalidator thread, which you can see with one of the following commands:
    # pidstat -t -p `pidof ovs-vswitchd`
    # top -H -o +%CPU -p $(pidof ovs-vswitchd) -b -n 1
    The keys value represents the number of ukeys mapped to this thread's umaps.
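If you monitor these values over time, a small parser for the upcall/show output can be handy. This is a hypothetical helper (the regular expression assumes the output format shown above):

```python
import re

def parse_upcall_show(text):
    """Extract the flow counters from `ovs-appctl upcall/show` output."""
    m = re.search(r'flows\s+:\s+\(current (\d+)\) \(avg (\d+)\) '
                  r'\(max (\d+)\) \(limit (\d+)\)', text)
    if not m:
        return None
    keys = ('current', 'avg', 'max', 'limit')
    return dict(zip(keys, map(int, m.groups())))
```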

Getting insights over time

Now that we know all of the internals, it would be nice to visualize some of them. Recently, a patch introduced some user statically-defined tracing (USDT) probes and a script to enable this visualization.

The script shows real-time plots of the various variables. Figure 1 shows an example. The dot on the n_flows graph indicates a triggered run.

Figure 1: A plot of actual_time shows spikes and sudden plunges.


Here is an explanation of some of the events shown in the graph:

  • The first drop in actual_time took place because we added an OpenFlow rule to the existing 1,000. Immediately after that drop, all existing 1,000 flows were invalidated, taking about 9ms, as shown in the line for n_reval_flows.
  • The second drop of actual_time came because I added 1,000 flows with individual ovs-vsctl commands, causing a lot of udpif_revalidate() invocations (i.e., short instances of actual_wait). In addition, this change caused the flow_limit to grow further.

When you exit the script, it shows a similar graph reflecting data from the full run. Figure 2 shows an example.

Figure 2: The final plot displays data from the full run.


There is also an option to save all the events and run the script offline to generate the graphs. In addition, the script always dumps the statistics to the console in CSV format:

$ ./reval_monitor.py -r captured_reval.events
- Reading events from "captured_reval.events"...
ts_start, ts_complete, n_flows, n_reval_flows, avg_n_flows, max_n_flows, flow_limit, dump_duration, poll_wait, actual_wait
344704473543451, 344704473963582, 1000, 0, 999, 1000, 67000, 1, 500, 500.62
344704974588221, 344704975002936, 1000, 0, 999, 1000, 68000, 1, 500, 499.62
344705474624368, 344705475047878, 1000, 0, 999, 1000, 69000, 1, 500, 499.62
344705974669681, 344705981642974, 1000, 0, 999, 1000, 70000, 7, 500, 493.57
344706475208230, 344706475330465, 0, 0, 499, 1000, 70000, 1, 500, 500.52
344706975854978, 344706975929932, 0, 0, 249, 1000, 70000, 1, 500, 500.45
344707476384490, 344707476431131, 0, 0, 124, 1000, 70000, 1, 500, 500.53
344707976957291, 344707977003421, 0, 0, 62, 1000, 70000, 1, 500, 499.45
344708476457462, 344708476503563, 0, 0, 31, 1000, 70000, 1, 500, 500.53
344708977030540, 344708977078802, 0, 0, 15, 1000, 70000, 1, 500, 500.52
344709477603353, 344709477650819, 0, 0, 7, 1000, 70000, 1, 500, 107.16
344709584813142, 344709584856818, 0, 0, 3, 1000, 70000, 1, 500, 5.19
344709590051075, 344709590093409, 0, 0, 1, 1000, 70000, 1, 500, 5.18
344709595270855, 344709595313071, 0, 0, 0, 1000, 70000, 1, 500, 5.18
344709600488455, 344709600530112, 0, 0, 0, 1000, 70000, 1, 500, 5.18
344709605705511, 344709605748460, 0, 0, 0, 1000, 70000, 1, 500, 5.15
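The CSV output lends itself to offline analysis. Here is a hypothetical helper that parses it and derives the interval between runs, assuming the timestamps are nanosecond-resolution (consistent with the ~500 ms poll_wait in the sample above):

```python
import csv
import io

def read_reval_events(csv_text):
    """Parse the script's CSV output and compute the interval between
    successive runs in milliseconds (timestamps assumed in nanoseconds)."""
    rows = list(csv.DictReader(io.StringIO(csv_text), skipinitialspace=True))
    intervals = []
    for prev, cur in zip(rows, rows[1:]):
        delta_ns = int(cur['ts_start']) - int(prev['ts_start'])
        intervals.append(delta_ns / 1e6)  # nanoseconds -> milliseconds
    return rows, intervals
```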


Summary

You should now be able to analyze the datapath flow behavior of ovs-vswitchd on any running system. This article was motivated by the fact that the OVS implementation details are not well documented or understood; we hope it has provided clarity and insight into the operation of the revalidator. Please feel free to comment below. We always welcome your feedback.