In mid-July 2024, a faulty configuration update caused a significant global IT disruption, leading to transportation delays, point-of-sale issues, telecommunications outages and more. Affected machines entered a boot loop or boot recovery mode, rendering them inoperative. The incident underscores the critical need for robust automated recovery mechanisms in IT infrastructure. One answer being developed by the open source community is Greenboot, which is currently available in Red Hat Enterprise Linux (RHEL) for Edge, Red Hat In-Vehicle Operating System and Fedora IoT.
Greenboot is designed to automatically recover from failed upgrades by integrating with systemd and atomically updated distros. It acts as a guardian for your system’s health, so that if an update goes awry, the system can automatically roll back to a previously working state. Here is how Greenboot can help prevent outages similar to July’s global disruption.
Automated health checks and rollbacks
Greenboot performs health checks every time the system boots. It runs scripts that verify that critical components and services are functioning as expected. If any required health check fails, Greenboot can automatically roll back the system to a previous, stable state, minimizing downtime.
Customizable health checks
Administrators can define custom health checks tailored to their specific system needs. These checks fall into two categories: mandatory checks, which must pass for the system to be considered healthy, and optional checks, whose failure is logged but does not trigger a rollback.
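As an illustration, here is a minimal sketch of what a custom mandatory check might look like. The directory being probed (/var/tmp), the HEALTHCHECK_DIR variable and the function name are assumptions made for this example and are not part of Greenboot itself.

```shell
#!/bin/bash
# Sketch of a custom health check: fail if a directory is not writable.
# The probed directory and the helper name are illustrative assumptions.

# check_dir_writable DIR: succeed only if a temporary file can be created in DIR.
check_dir_writable() {
    local dir="$1" probe
    if probe=$(mktemp "$dir/.healthcheck.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        echo "OK: $dir is writable"
        return 0
    fi
    echo "FAIL: $dir is not writable" >&2
    return 1
}

# The status of the last command becomes the script's exit status;
# a non-zero exit status marks the check as failed.
check_dir_writable "${HEALTHCHECK_DIR:-/var/tmp}"
```

Installed as an executable script among the mandatory checks, a failure here would mark the boot as unhealthy. Installed as an optional check, the same failure would only be logged.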
Integration with systemd and OSTree
By leveraging systemd for service management and OSTree for versioning the operating system image, Greenboot provides a powerful, integrated solution for maintaining system health. The ability to create and manage bootable system snapshots enables Greenboot to roll back effectively if an update causes issues.
How Greenboot works
Greenboot follows a structured approach to manage system health:
System boot
Health check outcomes
Rollback mechanism
System boot
During boot, Greenboot runs the health check scripts located in /etc/greenboot/check/required.d and /etc/greenboot/check/wanted.d. The scripts in required.d must pass for the boot to be considered successful, while failures in wanted.d are logged but do not trigger a rollback.
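The difference between the two directories can be sketched as follows. This is a simplified illustration, not Greenboot's actual implementation: the function names are hypothetical, and the sketch assumes that checks are executable files ending in .sh.

```shell
#!/bin/bash
# Simplified illustration of required vs. wanted checks (not Greenboot's code).

# run_check_dir DIR: run every executable *.sh file in DIR.
# Returns 0 if all of them passed, 1 if at least one failed.
run_check_dir() {
    local dir="$1" script failed=0
    for script in "$dir"/*.sh; do
        [ -x "$script" ] || continue   # skips non-executables and an unmatched glob
        if ! "$script"; then
            echo "check failed: $script" >&2
            failed=1
        fi
    done
    return "$failed"
}

# evaluate_boot BASE: a required failure fails the boot,
# a wanted failure is only reported.
evaluate_boot() {
    local base="$1"
    if ! run_check_dir "$base/check/required.d"; then
        echo "boot status: RED"
        return 1
    fi
    run_check_dir "$base/check/wanted.d" || echo "warning: a wanted check failed" >&2
    echo "boot status: GREEN"
}
```

Calling evaluate_boot /etc/greenboot would evaluate the directories named above. Pointing it at a scratch directory makes it possible to try the behavior without touching a real system.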
Health check outcomes
Success: If all required health checks pass, Greenboot executes any scripts in /etc/greenboot/green.d to finalize the boot process.
Failure: If any required health check fails, Greenboot runs scripts in /etc/greenboot/red.d to attempt corrective actions before rebooting. If the issue persists after several retries, Greenboot triggers an OSTree rollback to the previous stable version.
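The two outcomes can be modeled with a small decision function. This is only a sketch of the behavior described above: the function name and the idea of passing the remaining retry count as an argument are assumptions for illustration, and the real retry limit is not specified here.

```shell
#!/bin/bash
# Simplified model of the health check outcomes (not Greenboot's real code).

# handle_outcome REQUIRED_STATUS RETRIES_LEFT
#   REQUIRED_STATUS: 0 if all required checks passed, non-zero otherwise
#   RETRIES_LEFT:    hypothetical number of reboots still allowed before rollback
handle_outcome() {
    local required_status="$1" retries_left="$2"
    if [ "$required_status" -eq 0 ]; then
        echo "run green.d"                       # success hooks finalize the boot
    elif [ "$retries_left" -gt 0 ]; then
        echo "run red.d, then reboot"            # failure hooks run, then retry
    else
        echo "roll back to previous deployment"  # retries exhausted
    fi
}

handle_outcome 0 3   # prints: run green.d
handle_outcome 1 2   # prints: run red.d, then reboot
handle_outcome 1 0   # prints: roll back to previous deployment
```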
Rollback mechanism
In the event of repeated failures, Greenboot uses rpm-ostree rollback --reboot to revert the system to the last known good state. This allows the system to recover from failed updates without manual intervention. If the system cannot reach Linux userspace at all, the bootloader maintains a boot counter so that it can roll back automatically after a number of failed boots.
                  +--------------------------+
    +------------>|       System Boot        |
    |             +------------+-------------+
    |                          |
+---+----+                     |
| reboot |                     |
+---+----+                     |
    ^                          |
    |                          v
    |             +--------------------------+
    |    Yes      |   boot_counter == -1 ?   |
    +-------------+------------+-------------+
                               |
                               | No
                               v
                  +--------------------------+
                  |  Continue boot process   |
                  +------------+-------------+
                               |
                               v
                  +--------------------------+
                  |  greenboot-healthcheck   |
                  +------------+-------------+
                               |
                               v
                  +--------------------------+
                  | Run health check scripts |
                  | in `required.d` and      |
                  | `wanted.d` directories   |
                  +------------+-------------+
                               |
                               v
                  +--------------------------+
                  |   Any required script    |
                  |         failed ?         |
                  +------------+-------------+
                               |
               +---------------+-----------------+
               | No                              | Yes
               v                                 v
+------------------------------+    +--------------------------+
| Boot successful              |    | Call `redboot.target`    |
+--------------+---------------+    +------------+-------------+
               |                                 |
               v                                 v
+------------------------------+    +--------------------------+
| Reach `boot-complete.target` |    | redboot-task-runner      |
+--------------+---------------+    | runs `/usr/libexec/      |
               |                    | greenboot/greenboot red` |
               v                    +------------+-------------+
+------------------------------+                 |
| greenboot-grub2-set-success  |                 v
| unsets `boot_counter` and    |    +--------------------------+
| sets `boot_success` to 1     |    | Run scripts in `red.d`   |
+--------------+---------------+    +------------+-------------+
               |                                 |
               v                                 v
+------------------------------+    +--------------------------+
| greenboot-task-runner runs   |    | greenboot-status.service |
| `/usr/libexec/greenboot/     |    | creates MOTD with error  |
| greenboot green` to run      |    | details                  |
| scripts in `green.d`         |    +------------+-------------+
+--------------+---------------+                 |
               |                                 v
               v                    +--------------------------+
+------------------------------+    | redboot-auto-reboot      |
| greenboot-status.service     |    | checks if manual         |
| creates MOTD with success    |    | intervention is needed,  |
| message                      |    | if not, reboots system   |
+------------------------------+    +--------------------------+
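The boot counter that appears at the top of the diagram can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the bootloader's actual script: the variable name follows the diagram, while the function name, the output format and the exact decrement logic are hypothetical.

```shell
#!/bin/bash
# Simplified model of a bootloader boot counter (illustrative assumption only).

# bootloader_step COUNTER: print the next counter value and the entry to boot.
#   empty COUNTER -> no update is being verified, boot the default entry
#   COUNTER > 0   -> decrement and try the default (updated) entry again
#   COUNTER <= 0  -> attempts exhausted: mark as -1 and boot the fallback entry
bootloader_step() {
    local counter="$1"
    if [ -z "$counter" ]; then
        echo "counter=unset boot=default"
    elif [ "$counter" -gt 0 ]; then
        echo "counter=$((counter - 1)) boot=default"
    else
        echo "counter=-1 boot=fallback"
    fi
}

bootloader_step ""   # prints: counter=unset boot=default
bootloader_step 2    # prints: counter=1 boot=default
bootloader_step 0    # prints: counter=-1 boot=fallback
```

Because this logic lives in the bootloader in this model, it still applies when a failed update prevents the health checks from running at all.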
Conclusion
Greenboot offers a robust, automated solution to manage system health and recover from update failures, reducing the risk of downtime and operational disruptions. By leveraging customizable health checks and integration with systemd and OSTree, Greenboot can effectively mitigate the impact of faulty updates, helping your systems to remain reliable and resilient. Whether managing a few servers, many edge devices or many vehicles, Greenboot is an essential tool for maintaining uptime and system integrity in today’s complex IT environments.