Tales from the SRE trenches: What can SRE do for you?

This is the fourth part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.

Previous articles can be found here: part 1, part 2, and part 3.

As a side note, I am writing this from Heathrow Airport, about to board my plane to EZE. Buenos Aires, here I come!


So, what does SRE actually do?

Enough about how to keep the holy wars under control and how to get work away from Ops. Let's talk about some of the things that SRE does.

Obviously, SRE runs your service, performing all the traditional SysAdmin duties needed for your cat pictures to reach their consumers: provisioning, configuration, resource allocation, etc.

They use specialised tools to monitor the service, and get alerted as soon as a problem is detected. They are also the ones waking up to fix your bugs in the middle of the night.

But that is not the whole story: reliability is not only impacted by new launches. Usage grows suddenly, hardware fails, networks lose packets, solar flares flip your bits... When working at scale, things are going to break, and often.

You want these breakages to affect your availability as little as possible. There are three strategies you can apply to this end: minimising the impact of each failure, recovering quickly, and avoiding repetition.

And of course, the best strategy of them all: preventing outages from happening at all.

Minimizing the impact of failures

A single failure that takes the whole service down will severely affect your availability, so the first SRE strategy is to make sure your service is fragmented and distributed across what are called "failure domains". If a data center catches fire, you are going to have a bad day, but it is a lot better if only a fraction of your users depend on that one data center, while the others keep happily browsing cat pictures.

SREs spend a lot of their time planning and deploying systems that span the globe to maximise redundancy while keeping latencies at reasonable levels.
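
To make the idea concrete, here is a minimal sketch (in Python, with hypothetical domain names) of how users could be deterministically mapped to failure domains, so that losing one domain only affects the users assigned to it:

```python
import hashlib

# Hypothetical failure domains: independent regions or data centers.
FAILURE_DOMAINS = ["us-east", "us-west", "eu-west", "asia-east"]

def domain_for_user(user_id: str) -> str:
    """Deterministically map a user to a single failure domain."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return FAILURE_DOMAINS[int(digest, 16) % len(FAILURE_DOMAINS)]

if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    # If "us-east" catches fire, only the users mapped to it have a bad day;
    # everyone else keeps browsing cat pictures.
    affected = sum(1 for u in users if domain_for_user(u) == "us-east")
    print(f"users affected by losing us-east: {affected} of {len(users)}")
```

In practice the assignment also has to respect latency (serve users from a nearby location) and spare capacity, which is exactly the planning work described above.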

Recovering quickly

Many times, retrying after a failure is actually the best option. So another strategy is to automatically restart, within milliseconds, any piece of your system that fails. This way, fewer users are affected, while a human has time to investigate and fix the real problem.
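
As a toy illustration of the retry idea (not how any particular system implements it), a failing call can be retried a few times with exponential backoff before a human ever needs to know:

```python
import random
import time

def call_with_retries(func, attempts=3, base_delay=0.05):
    """Retry a flaky call with exponential backoff and jitter.

    Transient failures are absorbed here in milliseconds; anything that
    keeps failing is re-raised so a human can investigate the real problem.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Back off a little longer each time, with jitter so that many
            # clients do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

def flaky_fetch():
    """A stand-in backend call that fails transiently about 30% of the time."""
    if random.random() < 0.3:
        raise ConnectionError("transient backend error")
    return "cat-picture-bytes"

if __name__ == "__main__":
    print(call_with_retries(flaky_fetch))
```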

If a human needs to intervene, it is crucial that they get notified as fast as possible, and that they have quick access to all the information needed to solve the problem: detailed documentation of the production environment, meaningful and extensive monitoring, play-books[^playbook], etc. After all, at 2 AM you probably don't remember which country the database shard lives in, or what the exact sequence of commands is to redirect all the traffic to a different location.

[^playbook]: A play-book is a document containing critical information needed when dealing with a specific alert: possible causes, hints for troubleshooting, links to more documentation. Some monitoring systems will automatically add a play-book URL to every alert sent, and you should be doing it too.

Implementing monitoring and automated alerts, writing documentation and play-books, and practising for disaster are other areas where SREs devote much effort.
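
For instance (a hypothetical structure, not any specific monitoring system's API), every alert definition can carry a link to its play-book so the URL arrives together with the page:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """A hypothetical alert definition that ships its play-book with the page."""
    name: str
    condition: str     # human-readable description of what triggers the page
    playbook_url: str  # the first thing the on-caller opens at 2 AM

FRONTEND_ERRORS = AlertRule(
    name="FrontendHighErrorRate",
    condition="error ratio above 5% for 10 minutes",
    playbook_url="https://wiki.example.com/playbooks/frontend-high-error-rate",
)
```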

Avoiding repetition

Outages happen, pages ring, problems get fixed, but each incident should be a chance for learning and improving. A page should require a human to think about a problem and find a solution. If the same problem keeps appearing, the human is not needed any more: the problem needs to be fixed at the root.

Another key aspect of dealing with incidents is writing post-mortems[^postmortem]. Every incident should have one, and they should be tools for learning, not finger-pointing. Post-mortems can be an excellent tool if people are honest about their mistakes, if they are not used to blame other people, and if issues are followed up with bug reports and discussions.

[^postmortem]: Similarly to their real-life counterparts, a post-mortem is the process of examining a dead body (the outage), gutting it, and trying to understand what caused its demise.

Preventing outages

Of course nobody can prevent hard drives from failing, but there are certain classes of outages that can be forecasted with careful analysis.

Usage spikes can bring a service down, but an SRE team will make sure the systems are load-tested at higher-than-normal rates. They can also be ready to scale the service quickly, provided the monitoring system alerts as soon as a certain threshold is reached.
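
As a rough sketch of the load-testing idea (a toy to point at a hypothetical staging copy of the service, never at production you care about), you can fire a burst of requests well above normal traffic and watch the error rate:

```python
import concurrent.futures
import time
import urllib.request

def load_test(url: str, total_requests: int = 500, concurrency: int = 50) -> float:
    """Send a burst of requests and report the error rate.

    Real load tests ramp up gradually, track latency distributions, and run
    against capacity that is isolated from real users.
    """
    def one_request(_):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total_requests)))
    elapsed = time.monotonic() - start
    error_rate = 1 - sum(results) / len(results)
    print(f"{total_requests} requests in {elapsed:.1f}s, error rate {error_rate:.1%}")
    return error_rate

# Example (hypothetical endpoint):
#   load_test("http://staging.example.com/healthz", total_requests=1000)
```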

Monitoring is actually the key here: by measuring relevant metrics for all the components of a system and following their trends over time, SREs can avoid many outages entirely. Latencies growing out of acceptable bounds, disks filling up, and progressive degradation of components are all examples of typical problems that are automatically monitored (and alerted on) in an SRE team.
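
A minimal sketch of the trend-following idea, assuming daily disk-usage samples and a simple linear fit, projects when a disk will fill up so the alert can fire weeks before the outage:

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until a disk fills, from (day, used_gb) samples.

    A deliberately simple least-squares trend; real monitoring systems fit
    over longer windows and alert well before the projected fill date.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
             / sum((x - mean_x) ** 2 for x, _ in samples))
    if slope <= 0:
        return float("inf")  # usage is flat or shrinking
    last_day, last_used = samples[-1]
    return (capacity_gb - last_used) / slope

# Example: roughly 2 GB of growth per day on a 500 GB disk.
usage = [(day, 300 + 2 * day) for day in range(14)]
print(f"disk projected to fill in about {days_until_full(usage, 500):.0f} days")
```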


This is getting pretty long, so the next post will be the last one of this series, with some extra tips from SRE experience that I am sure can be applied in many places.

Tales from the SRE trenches: Coda

This is the fifth (and last) part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.

Previous articles can be found here: part 1, part 2, part 3, and part 4.


Other lessons learned

Apart from the SRE guidelines I have described in previous posts, there are a few other practices I learned at Google that I believe could benefit most operations teams out there (and developers too!).

Automate everything

It's worth repeating: it is not sustainable to do things manually. When your manual work grows linearly with your usage, either your service is not growing enough, or you will need to hire more people than there are available in the market.

Involve Ops in systems design

Nobody knows the challenges of a real production environment better than Ops, and getting early feedback from your operations team can save you a lot of trouble down the road. They might give you insights about scalability, capacity planning, redundancy, and the very real limitations of hardware.

Give Ops time and space to innovate

There is so much emphasis on limiting the amount of operational load put on SRE because that freed-up time is not only used for creating monitoring dashboards or writing documentation.

As I said in the beginning, SRE is an engineering team. And when engineers are not running around putting out fires all day long, they get creative. Many interesting innovations come from SREs even when there is no requirement for them.

Write design documents

Design documents are not only useful for writing software. Fleshing out the details of any system on paper is a very useful way to find problems early in the process, to communicate to your team exactly what you are building, and to convince them why it is a great idea.

Review your code, version your changes

This is ubiquitous at Google, and not exclusive to SRE: everything is committed to source control, from big systems to small scripts and configuration files. It might not seem worth it at the beginning, but having the complete history of your production environment is invaluable.

And with source control, another rule comes into play: no change is committed without first being reviewed by a peer. It can be frustrating to find a reviewer and wait for approval for every small change, and sometimes reviews will be suboptimal, but once you get used to it, the benefits greatly outweigh any annoyances.

Before touching, measure

Monitoring is not only for sending alerts; it is an essential tool of SRE. It allows you to gauge your availability, evaluate trends, forecast growth, and perform forensic analysis.

Also, very importantly, it should give you the data needed to make decisions based on reality and not on guesses. Before optimising, find out where the bottleneck really is; do not decide which database instance to use until you have seen the utilisation over the past few months; and so on.
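
As a small example of "measure first" (with synthetic data standing in for months of real samples), looking at the tail of the utilisation distribution rather than the average is often what changes the decision:

```python
import random

def percentile(values, p):
    """Nearest-rank percentile of a list of samples."""
    ranked = sorted(values)
    rank = max(1, int(round(p / 100 * len(ranked))))
    return ranked[rank - 1]

# Synthetic stand-in for hourly CPU utilisation (%) over the past three months.
random.seed(42)
cpu_samples = [min(100, max(0, random.gauss(35, 15))) for _ in range(24 * 90)]

# The average looks comfortable; sizing decisions should look at the tail.
print(f"mean: {sum(cpu_samples) / len(cpu_samples):.0f}%")
print(f"p99:  {percentile(cpu_samples, 99):.0f}%")
```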

And speaking of that... We really need to have a chat about your current monitoring system. But that's for another day.


I hope I did not bore you too much with these walls of text! Some people told me directly that they were enjoying these posts, so I will soon be writing more about related topics. In particular, I would like to write about modern monitoring and Prometheus.

I would love to hear your comments, questions, and criticisms.