Tales from the SRE trenches

A few weeks ago, I was offered the opportunity to give a guest talk at the Romanian Association for Better Software.

RABS is a group of people interested in improving the trade that regularly holds events where invited speakers give presentations on a wide array of topics. The speakers are usually pretty high-profile, so this is quite a responsibility! To make things more interesting, much of the target audience works on enterprise software, on Windows platforms. Definitely outside my comfort zone!

Considering all this, we decided the topic was going to be about Site Reliability Engineering (SRE), concentrating on some aspects of it which I believe could be useful independently of the kind of company you are working for.

I finally gave the talk last Monday, and the audience seemed to enjoy it, so I am going to post my notes here. Hopefully some other people will like them too.

Why should I care?

I prepared this thinking of an audience of software engineers, so why would anyone want to hear about this idea that only seems to be about making the life of the operations people better?

The thing is, having your work as a development team supported by an SRE team will also benefit you. This is not about empowering Ops to hit you harder when things blow apart, but about having a team that is your partner. A partner that will help you grow, handle the complexities of a production environment so you can concentrate on cool features, and get out of the way when things are running fine.

A development team may seem to only care about adding features that will drive more and more users to your service. But an unreliable service is a service that loses users, so you should care about reliability. And what could be better than having a team that has Reliability in its name?

What is SRE?

SRE stands for Site Reliability Engineering: Reliability Engineering applied to "sites". Wikipedia defines Reliability Engineering as:

[..] engineering that emphasizes dependability in the life-cycle management of a product.

This is, historically, a branch of engineering that made it possible to build devices that work as expected even when their components are inherently unreliable. It focused on improving component reliability, establishing minimum requirements and expectations, and making heavy use of statistics to predict failures and understand underlying problems.
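
As a rough illustration of how unreliable parts can add up to a reliable whole, here is a minimal sketch of the textbook redundancy calculation. It assumes that component failures are completely independent, which is a strong assumption that rarely holds exactly in practice. The function name and the 0.9 figure are made up for the example.

```python
def parallel_reliability(component_reliability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` redundant components works.

    Assumes every component works with the same probability and that
    failures are independent of each other (an idealization).
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("component_reliability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system fails only if every single replica fails.
    probability_all_fail = (1.0 - component_reliability) ** replicas
    return 1.0 - probability_all_fail


if __name__ == "__main__":
    # Hypothetical component that works 90% of the time.
    for replicas in (1, 2, 3):
        print(replicas, round(parallel_reliability(0.9, replicas), 6))
```

Under these assumptions, each extra replica of a 90%-reliable component cuts the chance of total failure by a factor of ten. Real systems fall short of this because failures tend to be correlated.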

SRE started as a concept at Google about 12 years ago, when Ben Treynor joined the company and created the SRE team from a group of 7 production engineers. There is no good definition of what Site Reliability Engineering means; while the term and some of its ideas are clearly inspired by the more traditional RE, he defines SRE with these words[1]:

Fundamentally, it's what happens when you ask a software engineer to design an operations function.

Only hire coders

After reading that quote, it is not surprising that the first item in the SRE checklist[2] is to only hire people who can code properly for SRE roles. Writing software is a key part of being an SRE. But this does not mean that there is no separation between development and operations, nor that SRE is a fancy(er) name for DevOps[3].

It means treating operations as a software engineering problem, using software to solve problems that used to be solved by hand, implementing rigorous testing and code reviewing, and taking decisions based on data, not just hunches.

It also implies that SREs can understand the product they are supporting, and that there is a common ground and respect between SREs and software engineers (SWEs).

There are many things that make SRE what it is, and some of these only make sense within a special kind of company like Google: many different development and operations teams, service growth that can't be matched by hiring, and more importantly, a firm commitment from top management to implement these drastic rules.

Therefore, my focus here is not to preach on how everybody should adopt SRE, but to extract some of the most useful ideas that can be applied in a wider array of situations. Nevertheless, I will first try to give an overview of how SRE works at Google.

That's it for today. In the next post I will talk about how to end the war between developers and SysAdmins. Stay tuned!

All the articles in the series: part 2, part 3, part 4, and part 5.

[1] http://www.site-reliability-engineering.info/2014/04/what-is-site-reliability-engineering.html
[2] SRE checklist extracted from Treynor's talk at SREcon14: https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre
[3] By the way, I am still not sure what DevOps means; it seems that everyone has a different definition for it.