Part four of my review/commentary on 97 Things Every SRE Should Know
9 - There Is No Magic
tl;dr: It’s all just code. As an engineer, it’s ok to look at the implementation of different layers to understand how they work.
This one is pretty straightforward: library code isn’t always correct, predictable, useful, or well-documented.
Hell, your own code may not be all of those things either. So read the code (when it is available) to understand how it works, how to set arguments correctly, or how to avoid particular gotchas.
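As a small illustration of "just go and read it", here is a sketch using Python's standard `inspect` module to find where a library function lives and print the start of its implementation. The helper name `peek_source` is mine, and `json.dumps` is only a convenient example target; this works for pure-Python code, while functions implemented in C will raise a `TypeError` instead.

```python
import inspect
import json


def peek_source(obj, max_lines=15):
    """Return (file path, starting line number, first lines of source) for obj.

    Only works for objects defined in Python source files; built-ins written
    in C have no Python source and make inspect raise TypeError.
    """
    path = inspect.getsourcefile(obj)
    lines, start = inspect.getsourcelines(obj)
    return path, start, "".join(lines[:max_lines])


if __name__ == "__main__":
    path, start, snippet = peek_source(json.dumps)
    print(f"{path}:{start}")              # where the implementation lives
    print(inspect.signature(json.dumps))  # the arguments it really accepts
    print(snippet)                        # the first few lines of the code
```

The signature alone often answers the "how do I set this argument" question; the source answers the "what does it actually do with it" question.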
10 - How Wikipedia Is Served to You
tl;dr: Wikipedia uses a CDN (Content Delivery Network) and multiple layers of caching to avoid expensive re-rendering of articles.
In contrast to #9, this article does claim to use magic and wizardry, thus proving that there is more than one way to SRE.
And in solidarity with #9, Wikimedia use open source software so they (and we) have the code for everything they run.
Caches are great for speeding things up, but they introduce a bunch of new problems: consistency and invalidation (famously and notoriously hard, though in practice often not as hard as reputed) and capacity management (especially backend load when the cache is restarted or breaks).
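To make those two problems concrete, here is a toy in-process cache, a sketch of the general idea and nothing like Wikimedia's actual stack. `TTLCache`, `render`, and `backend_calls` are names I made up. Invalidation is an explicit call, and the counter shows what the backend sees when the cache is emptied (the "restart" case).

```python
import time


class TTLCache:
    """Tiny time-to-live cache in front of an expensive render function."""

    def __init__(self, render, ttl_seconds=60.0, clock=time.monotonic):
        self._render = render      # the expensive backend call (hypothetical)
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so behaviour can be tested
        self._entries = {}         # key -> (expires_at, value)
        self.backend_calls = 0     # how often we had to hit the backend

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh hit: the backend is not touched
        # Miss or expired entry: re-render and store with a new expiry.
        self.backend_calls += 1
        value = self._render(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop one entry, e.g. after the underlying article was edited."""
        self._entries.pop(key, None)

    def clear(self):
        """Simulate a cache restart: every key becomes a miss again."""
        self._entries.clear()


if __name__ == "__main__":
    cache = TTLCache(lambda title: f"<html>{title}</html>", ttl_seconds=300)
    for _ in range(1000):
        cache.get("Main_Page")
    print("backend calls while warm:", cache.backend_calls)
    cache.clear()
    cache.get("Main_Page")
    print("backend calls after restart:", cache.backend_calls)
```

With one key this looks harmless. With millions of keys, the moment after `clear()` is when the backend has to absorb all the traffic the cache was hiding.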
11 - Why You Should Understand (a Little) About TCP
tl;dr: Good ideas that work in isolation may interact badly when combined in the same system.
Sure, learn some TCP/IP. Things like MTU and fragmentation matter. Knowing the nominal behaviours of TCP/IP can help with spotting when something doesn’t look right, and (ideally) isolating the problem as much as possible to understand it.
This article is mostly about the bad interaction of Nagle’s Algorithm with delayed ACK in the large-write case.
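For reference, two common mitigations for this kind of stall, sketched in Python. This is my illustration and not necessarily the fix the article lands on: either turn off Nagle's algorithm on the socket with the standard `TCP_NODELAY` option, or hand the kernel the whole message in a single write so there is no small trailing piece waiting on an ACK. The helper names are hypothetical.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled.

    With TCP_NODELAY set, small segments are sent immediately instead of
    being held back until earlier data has been acknowledged.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def send_in_one_write(sock, parts):
    """Join the pieces of a message and send them with a single sendall().

    Avoids the write-write-read pattern where a later small write sits
    behind Nagle while the peer delays its ACK.
    """
    sock.sendall(b"".join(parts))


if __name__ == "__main__":
    sock = make_nodelay_socket()
    enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    print("TCP_NODELAY enabled:", enabled)
    sock.close()
```

Disabling Nagle trades fewer stalls for potentially more small packets, so batching writes in the application is often the gentler option when you control the sender.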
Another takeaway from this article would be: Latency can come from weird places; it can be hard to tell which component or behaviour’s assumptions have been violated when performance suddenly tanks for no apparent reason.
12 - The Importance of a Management Interface
tl;dr: Don’t put all your comms in one channel.
Good operability means being able to control your components, even when they are misbehaving or overloaded.
I’ve worked on systems where the admin interface was just another path in the main serving loop. This is simple and easy to implement, but it leaves you locked out if the server can’t properly start or something breaks the routing.
The same goes for all non-serving traffic: observability data, control signals, config, etc. If the main serving path is broken for any reason, it’s really useful to still be able to get metrics or make admin changes (to help recover).
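Here is a minimal sketch of that separation using only Python's standard library: the serving path and the admin path get their own listener, port, and thread, so the admin endpoint keeps answering even if the serving listener is gone. The handler names, the `/healthz` path, and the helper functions are my own illustrative choices, not something from the article.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class QuietHandler(BaseHTTPRequestHandler):
    """Shared helpers: send a small body, and keep request logging quiet."""

    def reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass


class ServingHandler(QuietHandler):
    def do_GET(self):
        self.reply(200, b"hello from the serving path\n")


class AdminHandler(QuietHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.reply(200, b"ok\n")
        else:
            self.reply(404, b"not found\n")


def start_server(handler, host="127.0.0.1", port=0):
    """Start a listener on its own thread; port 0 lets the OS pick a port."""
    server = ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port, path):
    """GET a path from a local listener and return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


if __name__ == "__main__":
    serving = start_server(ServingHandler)
    admin = start_server(AdminHandler)
    # Take the serving path away entirely; the admin listener is unaffected.
    serving.shutdown()
    serving.server_close()
    print(fetch(admin.server_address[1], "/healthz"))
    admin.shutdown()
    admin.server_close()
```

Separate threads in one process only go so far: a real deployment would probably also want the admin listener on a different network interface, and some way to reach it when the whole process is wedged.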