An SRE Manifesto

Site Reliability Engineering rendered as copper blocks

“An”, not “the”, SRE Manifesto: there are other ways to SRE, but this is mine.

This is a collection of guiding principles, maxims, beliefs, and behaviours.

A lot of SRE is identifying and solving problems.

Problems that look the same may not be the same. By all means, try the same solution, but be prepared for it not to work in a different situation. [Law of Non-Universality]

Not all problems can be solved right now. [Limoncelli Exclusion Principle]

Be clear about which problem you are solving.

Make mindful and conscious choices about which problem to solve.

Make sure your solution, however clever, solves the right problem.

All generalisations have some exceptions, including this one.

Give things helpful names. [Rule 7]

If you ever find yourself in a situation where you are sure the name of a thing is not important, you are wrong. Feel that outrage and defensiveness. Stoke that personal flame of ire. Use that energy to choose a helpful name.

Be curious. Curiosity is a fantastic quality in an SRE. How does this work? Why did it fail in this way? If this is true, what else might be true?

Learn from everything. Other people. Other industries.

Start small. Understand the structure of the problem. Then go big. Understand the scale of the problem. Then fill the gaps.

Hindsight is wonderful. Make the most of it to learn and make better choices in future.

Never criticise anyone (including yourself) for not knowing something. We are learning all the time, which means there are always things we don’t know yet.

Find a balance between caring about the quality of your code and avoiding hurt feelings when that code is inevitably modified or replaced.

Simplify where possible.

Use abstractions wisely. When a system is complex, add abstractions to simplify it. When an abstraction is no longer needed, remove it.

YAGNI (You Ain’t Gonna Need It) is not the whole story. Don’t add features you are not going to use immediately, but also write your code in a way that makes it easy to add them later.

DRY (Don’t Repeat Yourself) is not the whole story either. Sometimes repetition is a reasonable pragmatic trade-off. When removing duplicates or hard-coded values, be careful to only merge things which express exactly the same intention and semantics. Don’t merge two different quantities which just happen to have the same value right now.
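A minimal sketch of that last point. The names and values here are hypothetical, invented purely for illustration: two settings that both happen to be 30 today, but express different intentions.

```python
# Hypothetical settings for illustration only.
# Both are 30 today, but they mean different things and will change
# for different reasons.
HTTP_TIMEOUT_SECONDS = 30   # how long to wait for an upstream response
LOG_RETENTION_DAYS = 30     # how long to keep logs before deleting them


def request_deadline(start: float) -> float:
    """Return the time (in seconds) at which a request started at `start` times out."""
    return start + HTTP_TIMEOUT_SECONDS


def is_log_expired(age_days: int) -> bool:
    """Return True when a log file is older than the retention period."""
    return age_days > LOG_RETENTION_DAYS
```

Merging these into a single shared constant would remove a duplicated value, but the first time someone shortens the timeout, the log retention would quietly change with it.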

Unused code paths are the “Here be dragons” of prod. The fewer times a code path executes, the more likely it is to hide undiscovered bugs and unwanted behaviour.

Declarative code is not magic. All programming and config languages can put the right values in the right memory addresses. Let the requirements, constraints, and problem domain determine the viable solution(s) for the problem.

Do not let your familiarity with a particular tool or language prevent you from exploring other options when it is not a good fit for the current problem.

Ignore language / editor / OS / framework zealots. They are not listening to you about your specific problem, so don’t listen to their blithe assurance that their solution will work. The same goes for SRE zealots who insist they are the prophets of the One True SRE Way.

Assemble a good tool kit that works for you.

When making new tools, learn from the lessons of the great toolmakers who have gone before:

Do one thing and do it well.
No surprises.
Present a consistent interface.
Include helpful documentation.
Support --help.
Provide feedback during slow operations.
Have a dry-run or interactive option for destructive operations.
- Bonus point: run in dry-run by default so users exploring the tool do not accidentally change something.
Ensure it is always possible to undo destructive changes.
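A few of the points above can be sketched in a small Python tool. This is one possible shape, not a definitive implementation: the tool name `prune-files` and the `--apply` flag are hypothetical, and the sketch covers only help text, dry-run by default, and per-item feedback. It does not attempt undo.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # argparse adds -h/--help automatically, built from the help strings below.
    parser = argparse.ArgumentParser(
        prog="prune-files",  # hypothetical tool name
        description="Delete the named files. Runs as a dry run unless --apply is given.",
    )
    parser.add_argument("paths", nargs="+", help="files to delete")
    parser.add_argument(
        "--apply",  # hypothetical flag: opt in to the destructive behaviour
        action="store_true",
        help="actually delete; without this flag the tool only reports",
    )
    return parser


def run(argv, delete=os.remove, report=print) -> int:
    """Process each path and return how many were (or would be) deleted.

    `delete` and `report` are injectable so the behaviour can be tested
    without touching the filesystem.
    """
    args = build_parser().parse_args(argv)
    total = len(args.paths)
    for i, path in enumerate(args.paths, start=1):
        if args.apply:
            delete(path)
            report(f"[{i}/{total}] deleted {path}")
        else:
            # Dry run is the default: say what would happen, change nothing.
            report(f"[{i}/{total}] would delete {path} (dry run)")
    return total
```

Calling `run(["a.txt", "b.txt"])` only reports what would be deleted; the user has to ask for `--apply` before anything changes.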

Forensics and accountability. It’s useful to know who changed what, and when, so we can find out why and what problem they were solving.
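Version control history often answers these questions already. Where it does not, a change record can be as small as the sketch below. The field names and the `record_change` helper are assumptions made for illustration, not a standard format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    who: str   # the person or service account that made the change
    what: str  # what was changed
    when: str  # ISO 8601 timestamp in UTC
    why: str   # the problem being solved, or a ticket reference


def record_change(who, what, why, now=None) -> str:
    """Return one JSON line describing a change, ready to append to a log."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = ChangeRecord(who=who, what=what, when=timestamp, why=why)
    return json.dumps(asdict(record), sort_keys=True)
```

Making `why` a required argument is a deliberate choice: the reason is the first thing to be forgotten and the hardest to reconstruct later.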

Remember Chesterton’s Fence. You should not remove something until you fully understand why it was put there in the first place.

Try extra hard to not delete the wrong thing. This is not the same as trying to delete the right thing.
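One way to read this: before deleting, check for the ways the target could be wrong, rather than only checking that it looks right. A minimal sketch, assuming POSIX-style absolute paths and a hypothetical helper name `is_safe_to_delete`:

```python
from pathlib import PurePosixPath


def is_safe_to_delete(target: str, allowed_root: str) -> bool:
    """Return True only if `target` is strictly inside `allowed_root`."""
    if not target or not allowed_root:
        return False  # an unset variable must never turn into "delete everything"
    root = PurePosixPath(allowed_root)
    path = PurePosixPath(target)
    if not root.is_absolute() or not path.is_absolute():
        return False  # relative paths depend on where the tool happens to run
    if root == PurePosixPath("/"):
        return False  # "anything under /" is not a meaningful guard
    if ".." in path.parts:
        return False  # refuse anything that could climb out of the root
    # Strictly inside: the root itself is not an acceptable target.
    return root in path.parents
```

This is purely lexical: it does not resolve symlinks or touch the filesystem, so it is a first line of defence rather than a complete one.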

Meta discussion of Site Reliability Engineering as a term and a role / function

Why is “SRE” attractive? Some folks are looking for hyper-linear scaling of support. Others want to deliver on 99.9% SLAs. SRE sounds cooler than “ops”. Some teams/orgs may just be indulging in DevOps / FAANG cosplay.

Don’t be snooty. Rather than dismissing a particular practice or approach as “bad” or “not SRE”, try to understand the context and offer novel suggestions for improvement. Find ways to be more SRE.

Remember SRE grew from other approaches and practices under a particular set of constraints and conditions.

SRE for larger teams is different than solo or small-team SRE. Solutions that work for N+2 team members will not work for smaller teams.

Avoid the “No true SRE” fallacy. All efforts to be “more SRE” are worthwhile. There is no threshold at which one changes from non-SRE to SRE, apart from job titles. And job titles have little to no effect on running systems.
