
97 Things Every SRE Should Know (part 3)

Mike Wyer

Part three of my review/commentary on 97 Things Every SRE Should Know

6 - Infrastructure: It’s where the power is

tl;dr: Understanding a production stack and being able to triage and diagnose complex failures is a difficult, rewarding, and sometimes underrated skill.

Charity Majors is good with a sound-bite, and there is one line that really stood out to me:

Infra engineers have never seen an abstraction we trust to work as designed.

All software has bugs; every system has failure modes; and every tool that promises to solve all your problems will end up trading one problem for a different set of problems.

I have seen incredible (literally: not credible) claims about AIOps, MLOps, MLDevOps, and other AI-assisted tooling. I’ve also seen way too many otherwise-sensible managers and decision-makers believe the vendor’s hype over the concrete evidence provided by their own engineers.

She also touches on the topic I call the physics of software engineering:

… software operates according to the laws of scientific realism.

and

At the base of every technical pile sits the speed of light,

There are some things that are just physically impossible, and it’s important to be aware of limitations around latency, bandwidth, lock contention, scaling complexity, etc.
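
A back-of-the-envelope calculation makes the point. The numbers below are approximations (the speed of light in fibre, the London-to-New York distance), but no amount of clever engineering gets you below the physical floor:

```go
package main

import "fmt"

func main() {
	// Light travels ~300,000 km/s in vacuum; the refractive index of
	// optical fibre (~1.5) brings that down to roughly 200,000 km/s.
	const fibreKmPerSec = 200_000.0

	// Approximate great-circle distance London -> New York, in km.
	const distanceKm = 5_570.0

	// A round trip covers the distance twice, and that is before any
	// routing, queuing, serialization, or processing delay is added.
	minRTTMs := 2 * distanceKm / fibreKmPerSec * 1000
	fmt.Printf("theoretical minimum RTT: %.1f ms\n", minRTTMs) // ~55.7 ms
}
```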

7 - Thinking About Resilience

tl;dr: There are many techniques that trade complexity and cost for better availability and latency.

The article groups the various techniques into broad categories:

  • Load reduction
  • Latency reduction
  • Load adaptation
  • Resilience
  • Meta-techniques

You could just as easily talk about horizontal scaling, cost management, capacity planning, CQRS and traffic segregation, operability, and many other distributed systems approaches and concepts.
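
To make just one of those buckets concrete, here is a minimal load-reduction sketch of my own (not from the book): cap the number of in-flight requests and shed the excess rather than letting queues build up. The limit is a placeholder you would have to tune.

```go
package shedding

import "net/http"

// ShedLoad caps concurrent in-flight requests and sheds the rest with
// a 503 instead of queuing them indefinitely. The limit is a
// placeholder that needs tuning against real capacity.
func ShedLoad(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit) // counting semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // slot available: serve the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // saturated: fail fast and cheaply
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
		}
	})
}
```

Wrapped around an existing handler (e.g. http.ListenAndServe(":8080", ShedLoad(100, mux))), the failure mode changes from slow collapse under overload to fast, explicit rejection.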

It’s definitely a good idea to be aware of as many approaches and techniques as possible, although it can take a lot of experimentation and experience to find the best one(s) to apply in a given situation.

If your current platform does not support all of these approaches as options, you may struggle to respond effectively to all the problems you are likely to encounter as an SRE.

In all my years of SREing, I have never regretted having:

  • SLOs
  • Rate Limits
  • Access Controls
  • Fine-grained load balancer configs
  • Timeouts
  • Retries
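
To make the last two of those concrete, here is a rough sketch combining a hard deadline with bounded retries and exponential backoff. The fetch function, the 2-second deadline, the 3 attempts, and the 50ms base backoff are all hypothetical placeholders.

```go
package resilience

import (
	"context"
	"errors"
	"time"
)

// CallWithRetries puts a hard deadline on the whole operation and
// retries a bounded number of times with exponential backoff.
func CallWithRetries(ctx context.Context, fetch func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	backoff := 50 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := fetch(ctx)
		if err == nil {
			return nil
		}
		if attempt == 3 {
			return err // out of attempts; surface the last error
		}
		select {
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between attempts
		case <-ctx.Done():
			return errors.Join(err, ctx.Err()) // deadline beat the backoff
		}
	}
}
```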

I have had mixed results from:

  • Autoscaling
  • Circuit breakers
  • Queuing
  • Failover (hot/cold setups)
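
Part of the reason results are mixed is that these techniques come with tuning knobs that are easy to get wrong. This deliberately minimal circuit-breaker sketch (my own illustration; real implementations add half-open probing, sliding windows, and more) shows how much behaviour hangs off two arbitrary constants:

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// Breaker fails fast once too many consecutive calls have failed.
// The hard part in practice is choosing FailLimit and Cooldown.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time

	FailLimit int           // consecutive failures before opening
	Cooldown  time.Duration // how long to stay open
}

var ErrOpen = errors.New("circuit open")

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen // fail fast while the breaker is open
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.FailLimit {
			b.openUntil = time.Now().Add(b.Cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // any success resets the count
	return nil
}
```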

8 - Observability in the Development Cycle

tl;dr: The earlier you can catch bugs in the development cycle, the cheaper they are to fix. Observability can help with that, but won’t do it for you.

I do want to amplify Charity Majors and Liz Fong-Jones as experts in the field of observability (and far more visible and influential than me), while also adding the small disclaimer that I don’t completely agree with everything they have said.

This article touches on a bunch of different ideas, all of which could be explored in more depth (and I assume the book Observability Engineering does exactly that).

  • The earlier you catch bugs, the more context the developer has for understanding both the intended and unintended behaviours.
  • Don’t just log everything to your observability platform: it’s inefficient and expensive, and could end up costing more than your serving stack.
  • Use developer tools (IDEs, debuggers, static analysis) for finding problems in code; observability tools are for finding component problems in systems.
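
On the cost point: even crude sampling, where errors are always kept but routine lines are not, can make a big difference. The sketch below is illustrative only; the 1% rate is a made-up placeholder, and real systems often sample dynamically, per endpoint or per trace.

```go
package sampling

import (
	"log"
	"math/rand"
)

// Log always emits errors but only a fraction of routine debug lines,
// keeping observability spend roughly proportional to the sample rate.
func Log(isError bool, msg string) {
	const debugSampleRate = 0.01 // placeholder: keep 1% of debug lines
	if isError || rand.Float64() < debugSampleRate {
		log.Print(msg)
	}
}
```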

Now, sometimes you can identify a specific bug in a specific line of code from a wonky metric or abnormal trace, but don’t expect that to be the general case.

I’ve certainly seen cases where high-cardinality metrics and verbose logs ended up costing far more than we ever charged the customer for their resources, and in several cases more than it cost us to host those resources. A useful guideline is to spend about 10% of the total cost of the system on monitoring. That should be sufficient to detect enough problems, fast enough, to maintain your SLO.

I would also advocate for monitoring your dev environment: the more bugs you catch there, the fewer make it into production. And it’s just as important to test and verify your monitoring code and configs as your serving code and configs.

If you can detect alerting conditions in dev and test them, you will have more confidence in the alerts in prod.
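
One low-tech way to do that is to pull the alerting condition out into a plain function and unit-test it. The predicate and the 0.1% threshold below are hypothetical stand-ins for whatever your alerting config actually evaluates.

```go
package alerts

import "testing"

// errorRateAlert is a stand-in for the predicate an alerting config
// evaluates; the 0.1% threshold is a made-up example.
func errorRateAlert(errs, total float64) bool {
	return total > 0 && errs/total > 0.001
}

// Running this in dev/CI exercises the alert condition long before it
// has to fire in prod.
func TestErrorRateAlert(t *testing.T) {
	cases := []struct {
		name        string
		errs, total float64
		want        bool
	}{
		{"healthy", 1, 10_000, false},
		{"breaching", 50, 10_000, true},
		{"no traffic", 0, 0, false},
	}
	for _, c := range cases {
		if got := errorRateAlert(c.errs, c.total); got != c.want {
			t.Errorf("%s: got %v, want %v", c.name, got, c.want)
		}
	}
}
```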

