Google already has 8 products that have 1+ billion users. Have you ever asked yourself how Google is able to manage all those services? How do its teams work to keep everything in an optimal state? How does it deploy new functionality?

In the early aughts, Google was already aware that one of the biggest problems it faced in managing its services was the tussle between the development people and the operations people.

The dev teams wrote code and sent it to the ops teams for deployment. The latter tried to make it work; if they failed, they sent it back to the former. As a general rule, operators did not know much about coding and developers did not know much about operating practices.

Developers were worried about finishing code and deploying new functionalities, whereas operators paid attention to reliability and keeping things running smoothly. In short, they had different objectives: functionality and stability respectively.

The concept of site reliability engineering (SRE) was developed by Ben Treynor at Google to solve this problem and encourage reliability, responsibility and innovation in products.
The basic principles, which its creator explains in a talk, can be summed up as follows:

The error budget

How do you solve the conflict between the dev team, which wants to deploy new functionality and see users adopt its products, and the ops team, which does not want everything to blow up on its watch? The solution that SRE has reached is to allocate an error budget.

To this end, the people in charge of the business or the product must set the availability goal they want the system to have. Once they have done this, one minus the availability goal is what is known as the error budget.

If the system has to be available 99.99% of the time, it can be unavailable 0.01% of the time. This unavailability time is the budget. It can be spent on launches or on any other tasks the team deems appropriate, provided it is not exceeded.
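The arithmetic above can be sketched in a few lines of Python. The function names and the 30-day measurement window are assumptions made for illustration; the article does not say over what period the budget is measured.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a fraction: one minus the availability goal."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in the range (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Convert the error budget into minutes of unavailability.

    window_days is an assumed measurement window, not something the
    article prescribes.
    """
    window_minutes = window_days * 24 * 60
    return error_budget(slo) * window_minutes
```

With these assumptions, `allowed_downtime_minutes(0.9999)` comes to roughly 4.32 minutes per 30 days, which shows how small the budget becomes as the goal approaches 100%.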

The error budget is admissible because 100% availability is the wrong reliability goal for almost everything. In most cases, the user will not see the difference between a 100% available system and, for example, a 99.999% available system.

Another thing to keep in mind is that the closer the availability goal is to 100%, the more effort and time will have to be spent. This is more expensive and technically more complex, and users will usually not notice it.

Furthermore, bear in mind that the greater the efforts to make the system more reliable above what it actually needs to be, the more heavily penalised the product will be because the deployment of new functionalities will be slowed down.

When the error budget is exceeded, deployments have to be frozen and time spent stabilising the system and making improvements until the service level objective (SLO), that is, the availability goal, is met again.
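A minimal sketch of such a freeze rule follows. The `BudgetStatus` type, the function names and the 30-day window are hypothetical; a real policy would also need to decide how downtime is measured, which the article does not cover.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    budget_minutes: float  # total unavailability allowed in the window
    spent_minutes: float   # unavailability already observed in the window

    @property
    def remaining_minutes(self) -> float:
        return self.budget_minutes - self.spent_minutes


def budget_status(slo: float, downtime_minutes: float,
                  window_days: int = 30) -> BudgetStatus:
    """Compare observed downtime with the budget implied by the SLO.

    window_days is an assumed measurement window.
    """
    budget = (1.0 - slo) * window_days * 24 * 60
    return BudgetStatus(budget_minutes=budget, spent_minutes=downtime_minutes)


def deployments_allowed(status: BudgetStatus) -> bool:
    """Deployments are frozen once the budget has been used up."""
    return status.remaining_minutes > 0
```

For example, with a 99.9% goal and 10 minutes of downtime in the window, `deployments_allowed(budget_status(0.999, 10.0))` is true; with 50 minutes of downtime it is false and releases would stop.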

Monitoring

SRE teams usually deal with three types of monitoring results:

Problem solving

Things go wrong sooner or later, and we need to be prepared when they do. This is why SRE teams spend most of their time building fault-tolerant systems. They do it by following two approaches: graceful degradation and defence in depth.

These automatic responses provide systems with high availability. In the end, user experience is not significantly affected by faults and there is enough time to fix them without users experiencing a problem.
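The article does not describe how these responses are implemented. As one hedged illustration of degrading instead of failing, the sketch below serves the last known good value when a backend call fails. All names here (`fetch_with_fallback`, `healthy_backend`, `failing_backend`) are hypothetical.

```python
from typing import Callable, Dict, Tuple


def fetch_with_fallback(key: str,
                        primary: Callable[[str], str],
                        cache: Dict[str, str],
                        default: str = "") -> Tuple[str, bool]:
    """Return (value, degraded).

    On success the fresh value is cached and degraded is False.
    On failure the last cached value (or a default) is served and
    degraded is True, so callers can flag the response as stale.
    """
    try:
        value = primary(key)
    except Exception:
        return cache.get(key, default), True
    cache[key] = value
    return value, False


def healthy_backend(key: str) -> str:
    """Stand-in for a backend call that succeeds."""
    return f"fresh:{key}"


def failing_backend(key: str) -> str:
    """Stand-in for a backend call that fails."""
    raise ConnectionError("backend unavailable")
```

The user still gets an answer while the backend is down, and the `degraded` flag leaves a trace that operators can act on.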

Capacity planning

Demand forecasting and capacity planning are essential to guarantee the defence in depth of services. Unfortunately, most of the time they are not done. Nevertheless, it is vital to benchmark the services, measure how they behave under high loads and know what excess capacity they have during peaks in demand.

By doing this, demand can be forecast and capacity planned to prevent multiple emergencies from arising, service outages from happening and sleepless nights from becoming the norm.
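A rough sketch of that reasoning is shown below. It assumes compound monthly growth and a capacity figure obtained from a benchmark; both the growth model and the function names are illustrative assumptions, not a method the article prescribes.

```python
from typing import Optional


def forecast_peak(current_peak_rps: float, monthly_growth: float,
                  months: int) -> float:
    """Project peak demand, assuming compound monthly growth."""
    return current_peak_rps * (1.0 + monthly_growth) ** months


def headroom(capacity_rps: float, peak_rps: float) -> float:
    """Fraction of benchmarked capacity still spare at the given peak."""
    return (capacity_rps - peak_rps) / capacity_rps


def months_until_exhausted(current_peak_rps: float, monthly_growth: float,
                           capacity_rps: float,
                           max_months: int = 120) -> Optional[int]:
    """First month in which forecast peak exceeds capacity, if any."""
    for month in range(max_months + 1):
        if forecast_peak(current_peak_rps, monthly_growth, month) > capacity_rps:
            return month
    return None
```

With a current peak of 1,000 requests per second, 10% monthly growth and a benchmarked capacity of 2,000 requests per second, `months_until_exhausted(1000, 0.10, 2000)` returns 8, which gives the team a concrete deadline for adding capacity.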

SRE vs DevOps

The term DevOps does not have a uniform definition. The theory arises from a great idea: to have the development people working side by side with the operations team. In practice, however, there seems to be great variability in how the industry interprets it.

DevOps is a set of practices and a model designed to bring down the barriers between developers and operators. It reduces organisational silos, accepts faults as something normal, implements changes gradually, makes the most of tools and automation, and tries to measure all that can be measured.

Google has been refining the definition of SRE for the past 15 years to include all the pieces we have already seen: the right balance between the SRE team and the dev team; free mobility between teams; the limits to the operating load; error budgets; and everything else.

The result of all this is that SRE is a highly detailed implementation of the DevOps proposal. It shrinks silos by sharing ownership of the product between the SRE team and the dev team. It treats faults as something normal through error budgets and blameless post-mortem analyses.

Changes are made gradually by means of canary deployments on a small portion of the system. All possible work is automated, and the effort made and the reliability of the system are measured.
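One way a canary rollout might be sketched: a fixed percentage of requests is routed to the new version, and the rollout continues only if the canary's error rate stays close to the baseline. The hashing scheme, the function names and the 1% tolerance are assumptions chosen for illustration.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically send about canary_percent% of ids to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < canary_percent


def should_promote(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   tolerance: float = 0.01) -> bool:
    """Promote only if the canary error rate is within tolerance of baseline.

    With no canary traffic there is no evidence, so the answer is False.
    """
    if canary_requests == 0:
        return False
    canary_rate = canary_errors / canary_requests
    baseline_rate = (baseline_errors / baseline_requests
                     if baseline_requests else 0.0)
    return canary_rate <= baseline_rate + tolerance
```

Hashing the request or user id keeps routing stable, so the same caller sees the same version throughout the rollout instead of flipping between the two.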

Google has created a website that presents its methodology so that we can all learn from its experience and adopt SRE in our daily work. SRE is currently used by companies as varied as Facebook, Dell, Atlassian, Twitter, Apple, Oracle, Dropbox, Amazon and Microsoft.

Clearly, to implement SRE each company will have to make specific adaptations so that it fits its ecosystem and meets its needs. Any change requires some effort and a certain type of attitude, but it is worth it because SRE is an excellent tool to help us digitally transform our companies.

It makes all participants feel they are part of the same team and aligns their objectives toward a common goal. This methodological framework solves many of the problems we face when we develop, manage and maintain our digital products.
