Reliability Nightmares: The Coloring Book

Máirín Duffy, Jeremy Eder, Irit Goihman, David Martin, and Craig Robinson
English

Overview

Gino's family-owned restaurant is in trouble: its reputation is suffering due to poor quality food and bad service. On top of that, the owner, Leo, is surprised to learn that he’ll be hosting his sister’s upcoming wedding reception.

Join Chef Cookie Cache as she applies site reliability engineering (SRE) principles to help improve Gino's restaurant operations in time for the wedding.

This coloring book is for developers who want to improve the operability of their codebase. It walks through five key tenets of operable software, with fun examples of each:

  • Observability
  • Safety and self-healing
  • Scalability
  • Shifting left
  • Zero downtime upgrades

How do you score on the operability quiz? Download Reliability Nightmares and find out!

Excerpt

The first challenge is to set up observability for the restaurant. This is foundational for improving Gino’s service.

  1. We need to break down the silos between each role with regular open communication.
  2. We must agree on the level of service we provide.
  3. We enforce accountability by measuring causes of issues and working to address them.
  4. We need to prioritize our work...
  5. ...so we can keep our customers happy!

You need to know that your equipment works, that your food supplies are fresh and stocked, and that customers are served in a timely manner. All of this requires data.

How do we do it?

To accomplish this, we’ll define service-level objectives (SLOs) for Gino’s. SLOs are measurable goals that reflect what your customers expect. We’ll collect data to measure whether we meet them.

We’ve had a lot of dishes sent back lately because they weren’t served right away and got cold. Can we set an SLO for that?
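
One way to picture an SLO for cold dishes is the minimal Python sketch below. It is only an illustration, not the book's own answer: the 120-second threshold, the 95% target, the sample serve times, and the names `ServiceLevelObjective`, `compliance`, and `is_met` are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ServiceLevelObjective:
    """A hypothetical SLO: a target share of dishes served within a time limit."""
    name: str
    threshold_seconds: float  # a dish counts as "good" if served within this time
    target_ratio: float       # fraction of dishes that must be "good"


def compliance(slo: ServiceLevelObjective, serve_times: Sequence[float]) -> float:
    """Return the fraction of dishes served within the SLO threshold."""
    if not serve_times:
        return 1.0  # no dishes served means nothing missed the objective
    good = sum(1 for t in serve_times if t <= slo.threshold_seconds)
    return good / len(serve_times)


def is_met(slo: ServiceLevelObjective, serve_times: Sequence[float]) -> bool:
    """Return True when the measured compliance reaches the target."""
    return compliance(slo, serve_times) >= slo.target_ratio


# Assumed values: 95% of dishes reach the table within 120 seconds of plating.
hot_dish_slo = ServiceLevelObjective("hot dishes served promptly", 120.0, 0.95)

# Made-up measurements: seconds between plating and serving for ten dishes.
sample = [45, 90, 130, 60, 200, 75, 110, 95, 80, 100]

ratio = compliance(hot_dish_slo, sample)
print(f"{hot_dish_slo.name}: {ratio:.0%} on time, SLO met: {is_met(hot_dish_slo, sample)}")
```

In this sketch, two of the ten sample dishes took longer than 120 seconds, so the measured compliance falls short of the assumed 95% target. The point is that the objective is stated as a number customers care about, and the collected data shows whether it is being met.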
