Featured image for How to design robust state machines

Backend services and microservices typically use state machines to maintain the state of their resources, whether the state machine is defined explicitly or not. To create a well-written service, you must expressly and clearly define its state machine, so that users know what to expect and can ensure that tests cover all possible (and impossible) transitions.

In this article, you'll learn important guidelines for building state machines. You'll also get an introduction to stateswitch, a Go library that you can use to organize a state machine's transition logic.

What is a state machine?

When we discuss state machines in this context, we refer to deterministic finite state machines, which are widely used in mathematics and computer sciences to model problems. State machines have a single start state and one or more accepting, or final, states. Transitions between the states describe the input or event required for moving from one state to another. When designing and developing services or micro-services, it is useful to model the resources that those services manage as state machines.

As an example, consider a state machine for a user subscription, as illustrated in Figure 1.

Diagram of an example state machine.
Figure 1. An example of a state machine.

This example shows a few kinds of states and transitions. There is a start state, creating, and a final state, canceled. The backend handles the transition from creating to created once processing is complete, making creating an intermediate state. You should use intermediate states when a transition cannot occur within a single database transaction. This avoids races and indicates to users that an operation is in progress. Explicit user actions trigger the transitions to and from suspended and to canceled.

Implementing state machines

Visualizing a state machine, as we've done here, can help you determine whether it is robust. The example diagram in Figure 1 should raise some questions: Can a user cancel a suspended subscription, for instance? And is there an error state for subscriptions with an unrecoverable error?

Here are a few general points to keep in mind when designing state machines for microservices:

  • A resource's state should generally be stored with that resource's metadata—in an SQL database or key-value store, for example. All changes to the resource, including the state change, should be made to in-memory objects and committed in a single transaction at the end of the operation to avoid races.
  • Maintain a state-info property with human-readable text explaining the state and the reasons why the resource might have transitioned there. For example, the state-info corresponding to the canceled state might be "Subscription canceled per user request."
  • Each intermediate state should have a timeout defined so that the service can recover from stalled or aborted operations. You can keep track of the timeout duration with a state-updated-at property containing a timestamp updated when the state changes.
  • At the start of each API action, check to see if the action applies during the current state. If not, return an appropriate error to the user, such as "409 Conflict" in REST.

Stateswitch: Organizing transition logic

You should avoid using lengthy sequences of if-else statements to define transitions, as such code is prone to bugs. Instead, you can use stateswitch, a library written in Go, to organize the transition logic. With stateswitch, you define each transition by specifying the source states, a destination state, the condition upon which to transition, and a method to call upon transitioning.

As an example, consider the Red Hat OpenShift assisted installer. This service exposes REST and Kubernetes APIs that allow users to discover hosts, group them into OpenShift clusters, and monitor cluster installations. The service has state machines for host and cluster resources.

Use APIs to expose service states

A final point to consider is using APIs to expose states to users. In REST, it is natural to include the state in the resource's properties. The user sees a concise single-string value, and a user interface (UI) can easily display it. On the downside, the state's meaning might not be apparent to users who don't know how your state machine works. Another downside is that scripts might rely on the order of your state machine's transitions, such that even adding a state could represent a breaking API change.

You could represent the state properties in OpenAPI format as follows:

      state:
        type: string
        description: State of the subscription.
        enum:
          - creating
          - created
          - suspended
          - canceled
      state_info:
        type: string
        description: Additional information pertaining to the subscription state.
      state_updated_at:
        type: string
        format: date-time
        description: The last time that the subscription state was updated.

Kubernetes-native APIs would typically use conditions to expose the state. In our example state machine, the conditions might be active and suspended. The Kubernetes controller would set the active condition to TRUE if the state is created or suspended, and would set the suspended condition to TRUE if the state is suspended.

Users and scripts can precisely examine the conditions they are interested in without needing to understand the underlying state machine, and new states can be added without breaking the API. The downside is that the state is not expressed concisely, which can cause problems if it's displayed in a UI. Of course, conditions can be added to a REST API or a concise state added to a Kubernetes API.

Conclusion

State machines help engineers visualize a resource's behavior, find gaps or inconsistencies, organize code, and improve test coverage. They also allow users to understand the machine's behavior and know what to expect. In this article, you've learned important guidelines for implementing a state machine, and have seen how the stateswitch library can help.

Last updated: September 20, 2023