
97 Things Every SRE Should Know (part 1)

Mike Wyer

[Cover image: “Site Reliability Engineering” floating over the cover of the book “97 Things Every SRE Should Know”]

This is a new series of articles concerning the book 97 Things Every SRE Should Know curated by Emil Stolarsky and Jaime Woo.

It’s a collection of articles and essays by a bunch of different SREs and SRE-adjacent folks, and I think everyone with any professional interest in “SRE” (Site Reliability Engineering), in any form, should read this book.

That said, some of the articles are controversial, or only apply to certain situations and viewpoints. So I’d like to add my perspective to their words.

  1. This is an interesting exercise for me in reviewing SRE thoughts and practices from 2020.
  2. Hopefully I can share some additional ideas which clarify my understanding of SRE and help make it more accessible to other people.

Why should you care what I think about any of this? Fair question.

As per my CV, I’ve done a bunch of different SRE roles in a bunch of different teams and organizations, from massive multinational corporations to tiny startups. And I like sharing what I’ve learned along the way.

For each of the 97 things in the book, I’m going to make my own summary of the contents, then some commentary.

1- Site Reliability Engineering in Six Words

tl;dr: “SRE” is hard to describe, but SREs should:

  1. Measure
  2. Analyze
  3. Decide
  4. Act
  5. Reflect
  6. Repeat

This goes back to one of the foundational maxims of SRE:

If you can’t measure it, you can’t improve it.

Pedants will immediately point out that this is not true: you can, of course, improve something without measuring it. However, you can’t prove to anyone else that you have improved it, or by how much, so it’s good practice to measure first and then make changes.
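To make the measure-first discipline concrete, here is a minimal sketch of comparing the same metric before and after a change. Everything here is invented for illustration (the metric, the nearest-rank percentile method, and the sample latencies are my assumptions, not from the book):

```python
def p95(samples: list[float]) -> float:
    """95th-percentile latency of a list of samples (nearest-rank method)."""
    ordered = sorted(samples)
    # Nearest-rank index for the 95th percentile.
    idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[idx]

# Measure: baseline request latencies (ms) captured before the change.
before = [120, 135, 128, 410, 132, 125, 390, 130, 127, 460]

# Act: deploy the change, then measure again with the same method.
after = [118, 122, 119, 150, 121, 117, 145, 120, 116, 155]

improvement = p95(before) - p95(after)
print(f"p95 before: {p95(before):.0f} ms, after: {p95(after):.0f} ms, "
      f"improvement: {improvement:.0f} ms")
```

The point is not the percentile math but the symmetry: the “after” measurement only proves anything because it was taken the same way as the “before” one.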

I’m not sure condensing SRE down to these particular six words is especially helpful, as you still have to explain what they mean in context.

I think it would be clearer to describe practices rather than trying to get the smallest sound bite:

  • Collect and use evidence to make decisions that support actions.
  • Aim for continuous improvement.

2- Do We Know Why We Really Want Reliability?

tl;dr: “Reliability” is not well-defined, which allows all sorts of logical fallacies when discussing it.

The basic premise here is that (anecdotally) reliability itself is never a competitive advantage, and only becomes significant when customer losses (due to unreliability) exceed customer gains. The conclusion: if you are gaining customers anyway, you don’t need reliability; and if you are losing customers, increasing reliability won’t win them back.

Cynically, I would point out that this submission was from Microsoft, so of course they don’t see the point of reliability.

However, I think there is a deeper point (which is not made clearly in the article):

SRE is not just about reliability. “Reliability” by itself is a nebulous goal, and SREs do a lot of work which is both tangible and valuable. While that may often include developing a specific, concrete, and actionable definition of “reliability” for the situation at hand, SRE typically also involves:

  • Observability
  • Operability (scaling, automating, and managing cognitive load for operations)
  • Incident management
  • Capacity management

and more.

So it’s worth considering SRE even if you don’t think you have a reliability problem (although you can’t be sure of that until you measure it), and increased reliability isn’t the only outcome from doing SRE work.
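As a small illustration of what “measure it” can mean in practice, here is the kind of arithmetic SREs do against a service-level objective. The 99.9% target and the request counts are invented figures, not from the book:

```python
# Hypothetical monthly figures; the SLO target and counts are invented.
slo_target = 0.999             # 99.9% availability objective
total_requests = 10_000_000
failed_requests = 4_200

availability = 1 - failed_requests / total_requests
error_budget = 1 - slo_target  # fraction of requests allowed to fail
budget_used = (failed_requests / total_requests) / error_budget

print(f"availability: {availability:.4%}")
print(f"error budget used: {budget_used:.1%}")
```

Until you run numbers like these against real traffic, “we don’t have a reliability problem” is a guess, not a measurement.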