Across numerous industries, accurate timing is a common requirement. Applications read the system clock and expect it to reflect the actual real-world time. But how accurate does it actually need to be? And is knowing the approximate time better than having no idea what time it is?
Looking at our PTP operator more specifically, its responsibilities are threefold: managing the receiving and transmitting of time, controlling the system clock, and keeping metrics and raising events based on the clocks.
What is time?
When we talk about time, we often just look at a wristwatch and want the time to be whatever it reads, but that can be misleading. In the context of computing, we want the system time to be a value that increases continuously in step with real-world time. Most importantly, we want the time to be the same everywhere. Unfortunately, having exactly the same time everywhere is a very hard problem to solve, so we settle for approximately the same time everywhere.
So where does the time on a computer's system clock come from? Most commonly, it's set through the network time protocol (NTP), using either chronyd or ntpd, which provide millisecond-level accuracy. This works by a client querying the time on a remote server, and then setting the system time to match. Obviously the speed of light is finite, so by the time the packets have travelled from the server to the client, the time has changed on the server, which can lead to differences on the order of milliseconds.
When NTP isn't good enough, you can use precision time protocol (PTP), which can achieve accuracy to within 100 nanoseconds. This works similarly to NTP, setting the client time based on the server time, but it estimates the end-to-end packet delay and factors that in when setting the time locally.
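To make the delay estimate concrete, here is a minimal sketch of the arithmetic behind a PTP-style delay request-response exchange. It assumes the network path takes the same time in both directions, and the names used (Exchange, mean_path_delay, offset_from_server) are hypothetical, chosen only for illustration. The server timestamps a Sync message when it sends it (t1), the client timestamps its arrival (t2), the client timestamps a Delay_Req message when it sends it (t3), and the server timestamps its arrival (t4).

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    """Four timestamps from one exchange, in nanoseconds (illustrative type)."""
    t1: int  # server clock: Sync sent
    t2: int  # client clock: Sync received
    t3: int  # client clock: Delay_Req sent
    t4: int  # server clock: Delay_Req received


def mean_path_delay(x: Exchange) -> float:
    # The clock offset cancels out when both directions are added together.
    # Assumption: the path delay is symmetric.
    return ((x.t2 - x.t1) + (x.t4 - x.t3)) / 2


def offset_from_server(x: Exchange) -> float:
    # The path delay cancels out when the two directions are subtracted.
    # A positive result means the client clock is ahead of the server.
    return ((x.t2 - x.t1) - (x.t4 - x.t3)) / 2


# Example: the client runs 1500 ns ahead and each direction takes 500 ns.
exchange = Exchange(t1=1_000_000, t2=1_002_000, t3=1_010_000, t4=1_009_000)
delay_ns = mean_path_delay(exchange)
offset_ns = offset_from_server(exchange)
```

In this example the client would subtract offset_ns from its clock to line up with the server. If the two directions do not take the same time, half of the difference shows up as an error in the estimated offset.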
Alternatively, you can use the time broadcast by a global navigation satellite system (GNSS), essentially using that time as a source of truth. Remember, the most important thing is that we have the same time everywhere, and when using GNSS, you could be reading the same broadcasts from different geographical locations, so you should get approximately the same time. One limitation of GNSS is that it's vulnerable to jamming or spoofing. Most clients using PTP as the source of time are connected (directly or through one or more PTP boundary clocks) to a local grandmaster that uses GNSS as the source of time.
Yet another factor is holdover. Some hardware is capable of maintaining a certain level of accuracy for some duration without a new time source.
Failing over
We've usually operated on the assumption that if a customer is choosing to use GNSS as their time source, they absolutely require a high level of accuracy. If the clock falls outside the specified tolerance, the consumer application must be notified and the condition handled accordingly. In the case of a momentary loss of GNSS, the hardware's holdover capabilities maintain a level of accuracy. Unfortunately, in the case of something long-term (signal jamming, for example), the result would be the loss of a source of accurate time.
The solution we developed was to use GNSS as the primary source of time, but in the event that GNSS is lost and the hardware exits holdover, to fail over to a configured NTP server or pool as the time source. This allows us to keep the system clock accurate to within 1 millisecond. Again, an approximate time is often better than having no idea what time it is.
Making it work
Once we knew what we wanted to do, making it work was a fun challenge. The first challenge was that while our PTP operator managed PTP and GNSS, it did not manage NTP at all. NTP is enabled or disabled by the Machine Config operator (MCO), and we just required users of the PTP operator to have it disabled on their system.
In keeping with the self-contained design of the PTP operator, we chose to add support for running a containerized chronyd to the PTP operator, allowing it to be configured through a ptpconfig custom resource definition (CRD).
The next step was adding precise process controls within the PTP operator, essentially allowing it to enable or disable individual processes without reconfiguring and restarting everything. This was necessary for phc2sys, which we would disable by terminating it without restarting, and for chronyd, which we would disable by setting it to offline while keeping it running in the background to allow faster failover.
In addition, we improved the processing of ts2phc logs to allow us to determine when ts2phc was going to SERVO_UNLOCKED (PHC in freerun), which was our failover criterion, or SERVO_LOCKED_STABLE (PHC locked to time source or in holdover), which was our recovery criterion.
Once all these were in place, we added an internal state machine. This used the ts2phc log processing work we did to trigger either failover or recovery, and then to enable or disable each of phc2sys and chronyd, with exactly one of them enabled at any time, based on whether GNSS (using phc2sys) or NTP (using chronyd) was the time source.
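The behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the operator's actual code: the class FailoverStateMachine, the Source enum, and the set_phc2sys and set_chronyd callbacks are all hypothetical names, and the servo state is assumed to arrive as a plain string already extracted from the ts2phc logs.

```python
from enum import Enum
from typing import Callable


class Source(Enum):
    GNSS = "gnss"  # system clock disciplined by phc2sys
    NTP = "ntp"    # system clock disciplined by chronyd


class FailoverStateMachine:
    """Keeps exactly one of phc2sys and chronyd enabled (hypothetical sketch)."""

    def __init__(self, set_phc2sys: Callable[[bool], None],
                 set_chronyd: Callable[[bool], None]) -> None:
        # The callbacks stand in for the operator's process controls.
        self._set_phc2sys = set_phc2sys
        self._set_chronyd = set_chronyd
        self.source = Source.GNSS  # GNSS is the primary source
        self._apply()

    def _apply(self) -> None:
        # Disable the inactive process before enabling the active one,
        # so the two never discipline the system clock at the same time.
        if self.source is Source.GNSS:
            self._set_chronyd(False)
            self._set_phc2sys(True)
        else:
            self._set_phc2sys(False)
            self._set_chronyd(True)

    def on_servo_state(self, state: str) -> Source:
        # Failover: the PHC is in freerun while GNSS is the active source.
        if state == "SERVO_UNLOCKED" and self.source is Source.GNSS:
            self.source = Source.NTP
            self._apply()
        # Recovery: the PHC is locked again while NTP is the active source.
        elif state == "SERVO_LOCKED_STABLE" and self.source is Source.NTP:
            self.source = Source.GNSS
            self._apply()
        # Any other state leaves the current source unchanged.
        return self.source
```

Passing the process controls in as callbacks keeps the decision logic separate from how each process is actually stopped or taken offline, which also makes the transitions easy to exercise in isolation.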
Finally, we updated our event handling so that the system clock continues to be reported as locked after failing over from GNSS to NTP as the source of time for the system clock.
Going forward
Now that we've integrated NTP with our PTP operator, the next step is likely working with the Open Radio Access Network (O-RAN) community to add event types for NTP. Further work may also be considered to add support for GNSS to PTP failover, or potentially to look at other criteria for controlling failover or recovery.