Open vSwitch (OVS) is a rich implementation of a software-defined networking (SDN) L2 switch. OVS's revalidator process, run by the ovs-vswitchd daemon, is responsible for the following:
- Making sure the installed datapath flows match the configured OpenFlow rules
- Updating OpenFlow packet and byte statistics
- Removing idle datapath flows
However, the OVS implementation details are not well documented or understood. The goal of this article is to provide insight into the operation of the revalidator.
Revalidator threads
The ovs-vswitchd daemon creates several revalidator threads when a datapath initializes, that is, when the first bridge for that datapath is created. If you use both the kernel (system) datapath and the userspace (netdev) datapath, two sets of revalidator threads are created.
The total number of threads created depends on the n-revalidator-threads
setting. The documentation states:
other_config : n-revalidator-threads: optional string, containing an
integer, at least 1
Attempts to specify the number of threads for software datapaths
to use for revalidating flows in the datapath. Some datapaths
may choose to ignore this and will set to a sensible option for
the datapath type.
Typically, there is a direct correlation between the number of
revalidator threads, and the number of flows allowed in the
datapath. The default is the number of cpu cores divided by four
plus one. If n-handler-threads is set, the default changes to
the number of cpu cores minus the number of handler threads.
This configuration is per datapath. If you have more than one
software datapath (e.g. some system bridges and some netdev
bridges), then the total number of threads is n-handler-threads
times the number of software datapaths.
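The documented defaults can be captured in a small model. This is an illustrative sketch of the rules quoted above, not OVS code; the function name is mine:

```python
def default_revalidator_threads(n_cores, n_handler_threads=None):
    """Model of the documented default for n-revalidator-threads.

    With no explicit n-handler-threads, the default is the number of
    CPU cores divided by four plus one. If n-handler-threads is set,
    it becomes the number of cores minus the handler threads.
    """
    if n_handler_threads is None:
        return n_cores // 4 + 1
    return n_cores - n_handler_threads

# On a 16-core machine with no explicit handler-thread setting:
print(default_revalidator_threads(16))      # 5
# On the same machine with n-handler-threads=12:
print(default_revalidator_threads(16, 12))  # 4
```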
If you would like to change the number of revalidator threads, you can enter commands such as these:
# cat /var/log/openvswitch/ovs-vswitchd.log | grep n-revalidator
2022-08-30T09:07:09.023Z|00029|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 17, setting n-revalidator-threads to 5
# pidstat -t -p `pidof ovs-vswitchd` | grep revalidator
05:07:19 AM 976 - 267439 0.00 0.00 0.00 0.00 0.00 12 |__revalidator19
05:07:19 AM 976 - 267440 0.00 0.00 0.00 0.00 0.00 21 |__revalidator20
05:07:19 AM 976 - 267441 0.00 0.00 0.00 0.00 0.00 9 |__revalidator21
05:07:19 AM 976 - 267442 0.00 0.00 0.00 0.00 0.00 0 |__revalidator22
05:07:19 AM 976 - 267443 0.00 0.00 0.00 0.00 0.00 27 |__revalidator23
# ovs-vsctl set Open_vSwitch . other-config:n-revalidator-threads=10
#
# cat /var/log/openvswitch/ovs-vswitchd.log | grep n-revalidator
2022-08-30T09:07:09.023Z|00029|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 17, setting n-revalidator-threads to 5
2022-08-30T09:08:22.216Z|00039|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 17, setting n-revalidator-threads to 10
# pidstat -t -p `pidof ovs-vswitchd` | grep revalidator
05:08:29 AM 976 - 267543 0.00 0.00 0.00 0.00 0.00 27 |__revalidator41
05:08:29 AM 976 - 267544 0.00 0.00 0.00 0.00 0.00 7 |__revalidator42
05:08:29 AM 976 - 267545 0.00 0.00 0.00 0.00 0.00 10 |__revalidator43
05:08:29 AM 976 - 267546 0.00 0.00 0.00 0.00 0.00 14 |__revalidator44
05:08:29 AM 976 - 267547 0.00 0.00 0.00 0.00 0.00 26 |__revalidator45
05:08:29 AM 976 - 267548 0.00 0.00 0.00 0.00 0.00 11 |__revalidator46
05:08:29 AM 976 - 267549 0.00 0.00 0.00 0.00 0.00 21 |__revalidator47
05:08:29 AM 976 - 267550 0.00 0.00 0.00 0.00 0.00 24 |__revalidator48
05:08:29 AM 976 - 267551 0.00 0.00 0.00 0.00 0.00 12 |__revalidator49
05:08:29 AM 976 - 267552 0.00 0.00 0.00 0.00 0.00 9 |__revalidator50
How the revalidator threads operate
What do the revalidator threads do? The following comment in Open vSwitch's ofproto-dpif-upcall.c
file provides an initial answer:
/* A thread that processes datapath flows, updates OpenFlow statistics, and
* updates or removes them if necessary.
*
* Revalidator threads operate in two phases: "dump" and "sweep". In between
* each phase, all revalidators sync up so that all revalidator threads are
* either in one phase or the other, but not a combination.
*
* During the dump phase, revalidators fetch flows from the datapath and
* attribute the statistics to OpenFlow rules. Each datapath flow has a
* corresponding ukey which caches the most recently seen statistics. If
* a flow needs to be deleted (for example, because it is unused over a
* period of time), revalidator threads may delete the flow during the
* dump phase. The datapath is not guaranteed to reliably dump all flows
* from the datapath, and there is no mapping between datapath flows to
* revalidators, so a particular flow may be handled by zero or more
* revalidators during a single dump phase. To avoid duplicate attribution
* of statistics, ukeys are never deleted during this phase.
*
* During the sweep phase, each revalidator takes ownership of a different
* slice of umaps and sweeps through all ukeys in those umaps to figure out
* whether they need to be deleted. During this phase, revalidators may
* fetch individual flows which were not dumped during the dump phase to
* validate them and attribute statistics.
*/
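The dump/sweep synchronization described above can be illustrated with a short Python sketch using threading.Barrier. This only models the phase hand-off (OVS itself implements this in C with its own barrier primitive); the thread count and logging are illustrative:

```python
import threading

# Each revalidator alternates between a dump phase and a sweep phase.
# A barrier between the phases guarantees that no thread starts
# sweeping while another is still dumping, and vice versa.
N_REVALIDATORS = 4
N_ROUNDS = 2
barrier = threading.Barrier(N_REVALIDATORS)
log = []
log_lock = threading.Lock()

def revalidator(thread_id):
    for _ in range(N_ROUNDS):
        with log_lock:
            log.append("dump")   # dump-phase work would happen here
        barrier.wait()           # sync: everyone finishes dumping first
        with log_lock:
            log.append("sweep")  # sweep-phase work would happen here
        barrier.wait()           # sync: everyone finishes sweeping first

threads = [threading.Thread(target=revalidator, args=(i,))
           for i in range(N_REVALIDATORS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# In every round, all "dump" entries precede all "sweep" entries.
print(log[:4], log[4:8])  # ['dump', 'dump', 'dump', 'dump'] ['sweep', 'sweep', 'sweep', 'sweep']
```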
This all happens in the revalidator threads created by udpif_start_threads(). The entry point for all these threads is udpif_revalidator(). I will describe the revalidation trigger later in the "Triggering revalidation" section.
The leader, the first revalidator thread, decides when to run a new dump and sweep. The following criteria determine the timing:
- The minimum time between two runs is 5 ms.
- The delay time is the lesser of the ofproto_max_idle and ofproto_max_revalidator variables.
- The ofproto_max_idle value defaults to 10,000 ms when not configured. To configure it, set the max-idle variable in the Open_vSwitch database. For example:

$ ovs-vsctl set Open_vSwitch . other_config:max-idle=20000

- The ofproto_max_revalidator value defaults to 500 ms when not configured. To configure it, set the max-revalidator variable in the Open_vSwitch database. For example:

$ ovs-vsctl set Open_vSwitch . other_config:max-revalidator=1000
The documentation describes these two parameters as follows:
other_config : max-idle: optional string, containing an integer, at
least 500
      The maximum time (in ms) that idle flows will remain cached
      in the datapath. Internally OVS will check the validity and
      activity for datapath flows regularly and may expire flows
      quicker than this number, based on real time network
      conditions. Tweaking this value is discouraged unless you
      know exactly what you're doing.
      The default is 10000.

other_config : max-revalidator: optional string, containing an
integer, at least 100
      The maximum time (in ms) that revalidator threads will wait
      before executing flow revalidation. Note that this is the
      maximum allowed value. The actual timeout used by OVS is the
      minimum of the max-idle and max-revalidator values. Tweaking
      this value is discouraged unless you know exactly what
      you're doing.
      The default is 500.
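Combining the criteria above, the wait time before the next dump/sweep round can be sketched as follows. This is a simplified model of the timing rules, not the actual OVS code:

```python
def revalidator_wait_ms(max_idle=10000, max_revalidator=500):
    """Delay before the next dump/sweep round: the lesser of the two
    timeouts, but never below the 5 ms minimum between runs."""
    MIN_INTERVAL_MS = 5
    return max(MIN_INTERVAL_MS, min(max_idle, max_revalidator))

print(revalidator_wait_ms())              # 500 (defaults: min(10000, 500))
print(revalidator_wait_ms(max_idle=200))  # 200 (max-idle below max-revalidator)
```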
If the timer has expired or the thread is awakened earlier by a udpif->reval_seq
change, the dump phase starts, as explained in the previous documentation. The udpif->reval_seq
change happens through a call to udpif_revalidate()
.
The leader thread has the following tasks; all other threads remain idle during these calculations:

- Determine the flow_limit. This value is dynamically adjusted based on the configured other_config:flow-limit value (default 200,000) and the time the most recent revalidator round took. The initial setting is the lesser of the configured other_config:flow-limit and 10,000. Here are the formulas for resetting flow_limit:

  if duration > 2000 ms:
      flow_limit /= (duration in seconds)
  elif duration > 1300 ms:
      flow_limit *= 0.75
  elif duration < 1000 ms AND flow_limit < n_flows_last_run * 1000 / duration:
      flow_limit += 1000
  flow_limit = MIN(other_config:flow-limit, MAX(flow_limit, 1000))

- Get the current number of flows in the datapath and set n_flows to that value.
- Set max_n_flows to the maximum ever seen: that is, set max_n_flows to n_flows if n_flows is greater.
- Reset avg_n_flows to reflect the number of flows at the current run: in other words, set it to (avg_n_flows + n_flows) / 2.
- Create a flow dump. The dump is in terse mode if the datapath supports and enables unique flow identifiers (UFIDs); otherwise, a full dump happens. A dump in terse mode returns only the UFID and statistics. The following comment from dpif.h explains the sequence of events and functions called during a flow dump:
/* Flow dumping interface
* ======================
*
* This interface allows iteration through all of the flows currently installed
* in a datapath. It is somewhat complicated by two requirements:
*
* - Efficient support for dumping flows in parallel from multiple threads.
*
* - Allow callers to avoid making unnecessary copies of data returned by
* the interface across several flows in cases where the dpif
* implementation has to maintain a copy of that information anyhow.
* (That is, allow the client visibility into any underlying batching as
* part of its own batching.)
*
*
* Usage
* -----
*
* 1. Call dpif_flow_dump_create().
* 2. In each thread that participates in the dump (which may be just a single
* thread if parallelism isn't important):
* (a) Call dpif_flow_dump_thread_create().
* (b) Call dpif_flow_dump_next() repeatedly until it returns 0.
* (c) Call dpif_flow_dump_thread_destroy().
* 3. Call dpif_flow_dump_destroy().
*
* All error reporting is deferred to the call to dpif_flow_dump_destroy().
*/
NOTE: The upcoming "Using upcall/show to see internal variables" section explains how to view the values of these dynamic variables.
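The flow_limit formulas from the first task can also be turned into a runnable sketch. This is a simplified model of the documented behavior using integer arithmetic, not the actual OVS code:

```python
def adjust_flow_limit(flow_limit, duration_ms, n_flows_last_run,
                      configured_limit=200000):
    """Recompute flow_limit from the duration of the last revalidator run."""
    if duration_ms > 2000:
        # Very slow run: scale the limit down by the duration in seconds.
        flow_limit //= duration_ms // 1000
    elif duration_ms > 1300:
        # Slightly slow run: back off by 25%.
        flow_limit = flow_limit * 3 // 4
    elif (duration_ms < 1000
          and flow_limit < n_flows_last_run * 1000 // max(duration_ms, 1)):
        # Fast run with many flows: allow the limit to grow.
        flow_limit += 1000
    # Clamp between 1000 and the configured other_config:flow-limit.
    return min(configured_limit, max(flow_limit, 1000))

print(adjust_flow_limit(10000, 3000, 5000))  # 3333 (slow run, limit drops)
print(adjust_flow_limit(10000, 500, 20000))  # 11000 (fast, busy run, limit grows)
```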
After the tasks are complete, the leader thread is the last thread to hit ovs_barrier_block()
, following which all revalidator threads continue by calling revalidate()
.
revalidate() does the following:

- Call dpif_flow_dump_thread_create().
- Call dpif_flow_dump_next() to get a maximum of REVALIDATE_MAX_BATCH (50) flows, and return if zero flows are received.
- Determine the max_idle timeout.
- For each batch where n_dp_flows >= flow_limit, increment the upcall_flow_limit_hit coverage counter.
- For each flow in the batch, follow these five steps:
  - Call ukey_acquire(), which tries to find the (struct udpif_key *) ukey corresponding to the datapath flow. If the ukey exists but cannot be locked, some other thread is using it, and the revalidation for this round is skipped. If the error is EBUSY (the lock cannot be taken), the function increments the upcall_ukey_contention coverage counter. All other acquisition errors call log_unexpected_flow() and add a delete operation for this flow through delete_op_init__(). If the ukey is not found at all, an entry is created so that the ukey will be re-evaluated. This algorithm handles the case where OVS restarts, so we are not blindly deleting all the datapath flows.
  - Based on the ukey->dump_seq value, determine whether we have already looked at this entry. If so, increment one of two coverage counters (dumped_duplicate_flow or dumped_new_flow) and continue with the next datapath flow.
  - If the ukey state is <= UKEY_OPERATIONAL, change the value to UKEY_OPERATIONAL. If the value is larger than UKEY_OPERATIONAL, the entry is in some delete state; log the information and continue with the next entry.
  - If the flow has been idle for max_idle milliseconds or we are in recovery mode, mark the ukey state as UKEY_DELETE. Otherwise, call revalidate_ukey(), which marks the ukey state as UKEY_KEEP, UKEY_DELETE, or UKEY_MODIFY. See the "The revalidate_ukey function" section for more details.
  - If the ukey state is not marked as UKEY_KEEP, that is, the state is UKEY_DELETE or UKEY_MODIFY, add the desired datapath operation to the operation queue. For UKEY_DELETE, also set the ukey state to UKEY_EVICTING.
- Now that the batch is completed, push all the operations to the datapath by calling push_dp_ops(). For UKEY_DELETE, this step also updates the statistics and sets the ukey state to UKEY_EVICTED.
- Try to receive another batch of flows and repeat this sequence on them. If no more flows are retrieved, continue.
The following comment from ofproto-dpif-upcall.c explains how long flows are kept in the datapath:
/* In normal operation we want to keep flows around until they have
* been idle for 'ofproto_max_idle' milliseconds. However:
*
* - If the number of datapath flows climbs above 'flow_limit',
* drop that down to 100 ms to try to bring the flows down to
* the limit.
*
* - If the number of datapath flows climbs above twice
* 'flow_limit', delete all the datapath flows as an emergency
* measure. (We reassess this condition for the next batch of
* datapath flows, so we will recover before all the flows are
* gone.) */
NOTE: The number of datapath flows is compared against these thresholds for each batch, but the number is read from the datapath only every 100 ms, so these actions might not be taken on all the flows.
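The policy in that comment can be sketched as a small helper. This is an illustration of the described thresholds; the function and parameter names are mine:

```python
def batch_max_idle(n_dp_flows, flow_limit, ofproto_max_idle=10000):
    """Pick the idle timeout (ms) applied to a batch of dumped flows."""
    if n_dp_flows > 2 * flow_limit:
        return 0      # emergency: delete all datapath flows
    if n_dp_flows > flow_limit:
        return 100    # over the limit: expire idle flows aggressively
    return ofproto_max_idle  # normal operation

print(batch_max_idle(5000, 10000))   # 10000
print(batch_max_idle(15000, 10000))  # 100
print(batch_max_idle(25000, 10000))  # 0
```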
At this point, the thread has exited from revalidate()
. It waits until all threads finish revalidate()
. Once this happens, the sweep phase starts with a call to revalidator_sweep()
.
The revalidator_sweep() function carries out the following tasks:

- Loop over all ukey entries in the thread's responsible umaps. For each entry, follow these three steps:
  - Try to lock the ukey's mutex. If the lock cannot be taken, skip the entry in this round.
  - If the ukey state is UKEY_OPERATIONAL and the entry missed the revalidate_ukey() call or the dump phase, call revalidate_ukey() and push_dp_ops(), just as in the dump phase.
  - If the new ukey state is UKEY_EVICTED, call ukey_delete(), which removes the ukey from the umap, sets the ukey state to UKEY_DELETED, and frees the data structure using the read, copy, update (RCU) mechanism.
Now we are done until the next revalidate timeout or udpif->reval_seq
change, which starts another dump and sweep phase.
The revalidate_ukey function
If needed, this function revalidates the datapath flow and updates the OpenFlow statistics.
The first thing this function does is make sure we need to revalidate flows. OpenFlow revalidation is triggered through a call to udpif_revalidate()
, as described in the upcoming section, "Triggering revalidation." The decision is made by comparing udpif->reval_seq
with ukey->reval_seq
. If they are different, the system needs a revalidation.
Let's explain this function with pseudocode:

def revalidate_ukey(udpif, ukey):
    result = UKEY_DELETE
    IF udpif->reval_seq != ukey->reval_seq:
        IF !used                                              #[1]
           OR duration < (other_config:max-revalidator / 2)   #[2]
           OR flow_pps > other_config:min-revalidate-pps:     #[3]
            IF xlate_ukey():
                IF ukey->actions == xlate_actions:
                    result = UKEY_KEEP
                ELSE:
                    result = UKEY_MODIFY
            ELSE:
                # Delete the ukey, as it cannot be translated
        ELSE:
            # Delete the ukey; too expensive to revalidate
    ELIF !n_packets OR ukey->xcache OR !populate_xcache():
        result = UKEY_KEEP
    IF result != UKEY_DELETE:
        update_statistics                                     #[4]
        ukey->reval_seq = udpif->reval_seq
    RETURN result
The following list explains the lines marked by numbers on the right side of the code:
- [1] When used is not set, the datapath flow was programmed, but no data has passed through it.
- [2] This calculation determines whether it's likely that the function can do the whole revalidation within the max-revalidator time.
- [3] The other_config:min-revalidate-pps configuration option is documented as follows:

other_config : min-revalidate-pps: optional string, containing an
integer, at least 1
      Set minimum pps that flow must have in order to be
      revalidated when revalidation duration exceeds half of
      max-revalidator config variable.
      The default is 5.

- [4] Another part of the code updates the statistics for the UKEY_DELETE state.
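Conditions [1] through [3] combine into an "is revalidation worth it" test, which can be modeled as follows. This is a simplified sketch; the names are mine, and the exact time window OVS uses to compute the packets-per-second rate differs:

```python
def should_revalidate(used, duration_ms, n_packets_since_last,
                      max_revalidator=500, min_revalidate_pps=5):
    """Decide whether a flow is worth fully revalidating."""
    if not used:                             # [1] flow has seen no traffic yet
        return True
    if duration_ms < max_revalidator / 2:    # [2] last round was fast enough
        return True
    # [3] Only busy flows justify the translation cost on slow rounds.
    flow_pps = n_packets_since_last * 1000 / max(duration_ms, 1)
    return flow_pps > min_revalidate_pps

print(should_revalidate(True, 200, 1))    # True  (fast round)
print(should_revalidate(True, 600, 1))    # False (slow round, idle flow)
print(should_revalidate(True, 600, 100))  # True  (slow round, busy flow)
```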
Triggering revalidation
As seen before, calling udpif_revalidate() can trigger a revalidation; it simply issues the call seq_change(udpif->reval_seq). udpif_revalidate() is called only from the type_run() function in ofproto-dpif.c, and only if backer->need_revalidate is non-zero. This value also records the reason for the revalidation. However, the reason might not be 100% accurate because multiple revalidations could be triggered before type_run() is called.
ofproto-dpif.h
defines the following revalidation reasons:
enum revalidate_reason {
REV_RECONFIGURE = 1, /* Switch configuration changed. */
REV_STP, /* Spanning tree protocol port status change. */
REV_RSTP, /* RSTP port status change. */
REV_BOND, /* Bonding changed. */
REV_PORT_TOGGLED, /* Port enabled or disabled by CFM, LACP, ...*/
REV_FLOW_TABLE, /* Flow table changed. */
REV_MAC_LEARNING, /* Mac learning changed. */
REV_MCAST_SNOOPING, /* Multicast snooping changed. */
};
For each type, a corresponding coverage counter might help administrators debug why revalidation happens. Note that only non-zero coverage counters are shown in the following example:
# ovs-appctl coverage/show | grep -E "rev_|Event"
Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=d9180580:
rev_port_toggled 0.0/sec 0.000/sec 0.0000/sec total: 1
rev_flow_table 0.0/sec 0.000/sec 2.7761/sec total: 9995
Using upcall/show to see internal variables
The variables we discussed in the previous sections can be read using the ovs-appctl
command. Here is a sample output:
$ ovs-appctl upcall/show
system@ovs-system:
flows : (current 10000) (avg 9999) (max 10000) (limit 172000)
dump duration : 9ms
ufid enabled : true
19: (keys 2001)
20: (keys 1956)
21: (keys 1980)
22: (keys 1982)
23: (keys 2081)
These values are related to the current run, or the last run if the system is waiting for the next run to start. The displayed names match up with internal variables as follows:

- current: n_flows
- avg: avg_n_flows
- max: max_n_flows
- limit: the dynamically calculated flow_limit
- dump duration: the total time of the most recent dump and sweep phase
- xx: (keys <nm>): xx is the OVS thread ID of the revalidator thread, and the keys value represents the number of ukeys mapped to this thread's umaps. You can see the thread IDs with one of the following commands:

# pidstat -t -p `pidof ovs-vswitchd`
# top -H -o +%CPU -p $(pidof ovs-vswitchd) -b -n 1
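If you want to track these counters from a script, the upcall/show output can be parsed with a few regular expressions. This is a minimal sketch; the parser and its dictionary layout are my own, and the exact output format may vary between OVS versions:

```python
import re

def parse_upcall_show(text):
    """Extract flow counters from `ovs-appctl upcall/show` output."""
    stats = {}
    m = re.search(r"flows\s+:\s+\(current (\d+)\) \(avg (\d+)\) "
                  r"\(max (\d+)\) \(limit (\d+)\)", text)
    if m:
        stats.update(zip(("current", "avg", "max", "limit"),
                         map(int, m.groups())))
    m = re.search(r"dump duration\s+:\s+(\d+)ms", text)
    if m:
        stats["dump_duration_ms"] = int(m.group(1))
    # One "<tid>: (keys <n>)" line per revalidator thread.
    stats["keys"] = {int(t): int(k)
                     for t, k in re.findall(r"(\d+): \(keys (\d+)\)", text)}
    return stats

sample = """system@ovs-system:
  flows         : (current 10000) (avg 9999) (max 10000) (limit 172000)
  dump duration : 9ms
  ufid enabled  : true
  19: (keys 2001)
  20: (keys 1956)
"""
print(parse_upcall_show(sample)["limit"])  # 172000
```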
Getting insights over time
Now that we know all of the internals, it would be nice to visualize some of them. Recently, a patch introduced some user statically-defined tracing (USDT) probes and a script to enable this visualization.
The script shows real-time plots of the various variables. Figure 1 shows an example. The dot on the n_flows graph indicates a triggered run.
Here is an explanation of some of the events shown in the graph:
- The first drop in actual_time took place because we added an OpenFlow rule to the existing 1,000. Immediately after that drop, all existing 1,000 flows were invalidated, taking about 9 ms, as shown in the line for n_reval_flows.
- The second drop in actual_time came because I added 1,000 flows with individual ovs-vsctl commands, causing a lot of udpif_revalidate() invocations (i.e., short instances of actual_wait). In addition, this change caused the flow_limit to grow further.
When you exit the script, it shows a similar graph reflecting data from the full run. Figure 2 shows an example.
There is also an option to save all the events and run the script offline to generate the graphs. In addition, the script always dumps the statistics to the console in CSV format:
$ ./reval_monitor.py -r captured_reval.events
- Reading events from "captured_reval.events"...
ts_start, ts_complete, n_flows, n_reval_flows, avg_n_flows, max_n_flows, flow_limit, dump_duration, poll_wait, actual_wait
344704473543451, 344704473963582, 1000, 0, 999, 1000, 67000, 1, 500, 500.62
344704974588221, 344704975002936, 1000, 0, 999, 1000, 68000, 1, 500, 499.62
344705474624368, 344705475047878, 1000, 0, 999, 1000, 69000, 1, 500, 499.62
344705974669681, 344705981642974, 1000, 0, 999, 1000, 70000, 7, 500, 493.57
344706475208230, 344706475330465, 0, 0, 499, 1000, 70000, 1, 500, 500.52
344706975854978, 344706975929932, 0, 0, 249, 1000, 70000, 1, 500, 500.45
344707476384490, 344707476431131, 0, 0, 124, 1000, 70000, 1, 500, 500.53
344707976957291, 344707977003421, 0, 0, 62, 1000, 70000, 1, 500, 499.45
344708476457462, 344708476503563, 0, 0, 31, 1000, 70000, 1, 500, 500.53
344708977030540, 344708977078802, 0, 0, 15, 1000, 70000, 1, 500, 500.52
344709477603353, 344709477650819, 0, 0, 7, 1000, 70000, 1, 500, 107.16
344709584813142, 344709584856818, 0, 0, 3, 1000, 70000, 1, 500, 5.19
344709590051075, 344709590093409, 0, 0, 1, 1000, 70000, 1, 500, 5.18
344709595270855, 344709595313071, 0, 0, 0, 1000, 70000, 1, 500, 5.18
344709600488455, 344709600530112, 0, 0, 0, 1000, 70000, 1, 500, 5.18
344709605705511, 344709605748460, 0, 0, 0, 1000, 70000, 1, 500, 5.15
Summary
You should now be able to analyze the datapath flow behavior of ovs-vswitchd on any running system. This article was inspired by the fact that OVS implementation details are not well documented or understood; we hope we have provided clarity and insight into the operation of the revalidator. Please feel free to comment below. We always welcome your feedback.