Google already has 8 products with more than 1 billion users each. Have you ever asked yourself how Google is able to manage all those services? How do its teams work to keep everything in an optimal state? How does it deploy new functionalities?
When Google launched its browser in the early aughts, it was already aware that one of the biggest problems in managing its services was the tussle between the development people and the operations people.
The dev teams wrote code and sent it to the ops teams for deployment. The latter tried to make it work; if they failed, they sent it back to the former. As a general rule, operators did not know much about coding and developers did not know much about operating practices.
Developers were worried about finishing code and deploying new functionalities, whereas operators paid attention to reliability and keeping things running smoothly. In short, they had different objectives: functionality and stability respectively.
The concept of site reliability engineering (SRE) was developed by Ben Treynor at Google to solve this problem and encourage reliability, responsibility and innovation in products.
The basic principles – which its creator explains in this speech – can be summed up as follows:
- Only hire people who know how to program, since the first obligation of a site reliability engineer (SRE) is to write code. Putting someone who develops in charge of operations has a predictable consequence: they will try to automate their work.
- Define the Service-Level Objective (SLO) for each service – which will typically be its level of availability.
- Use SLOs to measure the performance of every service and make reports about them.
- Use an error budget as a launch criterion (we will further explain this concept below).
- Both SREs and developers belong to the same staff pool and are all treated as developers. There is no line separating them. Developers are invited to try the job of SRE for a while and to stay in it if they so desire.
- Any surplus ops work will fall on the development team.
- Limit the operational load of each SRE to 50% of their time in order for them to at least spend the remaining 50% automating things and improving reliability.
- The SRE team shares 5% of the ops work with the dev team. If functionalities are added that make the system unstable, the SRE team may send the product back to the dev team because it is not ready to support it. Developers will support the product on a full-time basis for as long as it is not ready to be supported in production.
- An on-call team is made up of 8 engineers at 1 location – or 6 engineers at 2 locations – at the very least. Having enough engineers guarantees that each one carries an acceptable workload without burning out.
- Every engineer may deal with at most 2 incidents per shift. Solving the immediate problem may only take a few minutes, but the full resolution process takes far longer: it includes the post-mortem analysis, its review, agreeing on a set of follow-up actions, entering them into the ticketing system, and so on. Two incidents therefore fill a shift, and on average the engineer cannot take care of anything else.
- Every event is subjected to a post-mortem analysis.
- The purpose of these post-mortem analyses is not to blame people: they focus not on individuals but on the process and the technology, since in most cases when something goes wrong the problem lies in the system, the process, the environment or the technology stack.
The error budget
How is the conflict resolved between the dev team, which wants to deploy new functionalities and get them into the hands of users, and the ops team, which does not want everything to blow up on its watch? The solution SRE has reached is to allocate an error budget to each service.
To this end, the people in charge of the business or the product must set the availability goal they want the system to have. Once they have done this, one minus the availability goal is what is known as the error budget.
If the system has to be available 99.99% of the time, this means it can be unavailable 0.01% of the time. This unavailability time is the budget. It can be spent on launches or on whatever other tasks the team deems appropriate – provided it is not exceeded.
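As a hedged illustration of this arithmetic (the function name and the 30-day month are assumptions for the example, not Google tooling), an availability target can be converted into allowed downtime per month:

```python
# Illustrative sketch: turn an availability target into an error budget,
# expressed as allowed downtime per 30-day month.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def error_budget_minutes(availability_target: float) -> float:
    """Return the monthly downtime allowed by an availability target.

    The error budget is simply 1 minus the availability goal.
    """
    budget_fraction = 1.0 - availability_target
    return budget_fraction * MINUTES_PER_MONTH

# 99.99% availability leaves about 4.3 minutes of downtime per month;
# 99.9% leaves about 43 minutes.
print(round(error_budget_minutes(0.9999), 1))  # 4.3
print(round(error_budget_minutes(0.999), 1))   # 43.2
```

Each extra "nine" in the target divides the budget by ten, which is why the cost of approaching 100% grows so quickly.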
The error budget is acceptable because 100% availability is the wrong reliability goal for almost everything. In most cases, users cannot tell the difference between a system that is 100% available and one that is, say, 99.999% available.
Another thing to keep in mind is that the closer the availability goal gets to 100%, the more effort and time it demands. It is more expensive and technically more complex, and users will usually not notice the difference.
Furthermore, the harder a team works to make the system more reliable than it actually needs to be, the more heavily the product is penalised, because the deployment of new functionalities slows down.
When the error budget is exceeded, deployments have to be frozen and time spent stabilising the system and making improvements to ensure the SLO is achieved.
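This launch criterion can be sketched as a simple gate; the function and threshold names here are hypothetical, not a description of Google's actual release tooling:

```python
# Hypothetical sketch of the error budget as a launch criterion: releases
# are allowed only while the downtime consumed this period stays under
# the budget; once the budget is spent, launches freeze.

def may_launch(downtime_minutes_used: float, budget_minutes: float) -> bool:
    """Return True while there is error budget left to spend on launches."""
    return downtime_minutes_used < budget_minutes

# With the ~4.32-minute monthly budget of a 99.99% availability target:
print(may_launch(2.0, 4.32))   # True  -> budget remains, keep deploying
print(may_launch(5.0, 4.32))   # False -> freeze and stabilise the system
```

The point of the gate is that it is mechanical: nobody has to argue about whether a launch is "too risky"; the budget decides.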
SRE teams usually deal with three types of monitoring results:
- First of all there are alerts, which tell people that they should act immediately. They are triggered when something is happening – or about to happen – and someone should do something fast to reverse the situation.
- Tickets are the second kind of result. In this case, it is also necessary that a person take action – but not immediately; maybe in a few hours or even days.
- The third category consists of logs. They are information that, by and large, nobody needs to look at, but that is available for diagnostic or forensic purposes.
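The three categories above can be read as the answers to two questions: does a human need to act, and how urgently? The routing sketch below makes that explicit; the names and the two-flag test are assumptions for illustration:

```python
# Illustrative routing of monitoring output into the three categories
# described above: alerts, tickets and logs.

from enum import Enum

class Response(Enum):
    ALERT = "page a human immediately"
    TICKET = "a human acts within hours or days"
    LOG = "recorded for diagnosis; nobody is notified"

def route(needs_human: bool, urgent: bool) -> Response:
    """Map a monitoring event to one of the three output types."""
    if needs_human and urgent:
        return Response.ALERT
    if needs_human:
        return Response.TICKET
    return Response.LOG

print(route(needs_human=True, urgent=True).name)    # ALERT
print(route(needs_human=True, urgent=False).name)   # TICKET
print(route(needs_human=False, urgent=False).name)  # LOG
```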
Things go wrong, and sooner or later we need to be prepared for when they do. This is why SRE teams spend most of their time building fault-tolerant systems, and they do it by following two approaches: graceful degradation, and defence in depth.
- Graceful degradation is the system’s capacity to tolerate faults without collapsing entirely. For example, if a user’s network is slow, a website can stop serving non-essential content so that the most critical processes keep running.
- Defence in depth consists of fixing faults automatically so that they do not affect users. Mean time to repair can be markedly improved by automating the diagnosis. The different system layers are designed to tolerate points of failure without anybody having to do anything – or even become aware of them.
These automatic responses give systems high availability. In the end, the user experience is not significantly affected by faults, and there is enough time to fix them before users notice a problem.
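The website example above can be sketched as follows; the page structure and the failing backend call are hypothetical, invented for the illustration:

```python
# A minimal sketch of graceful degradation: when a non-critical dependency
# is slow or failing, serve the critical content and drop the rest.

def fetch_recommendations() -> list:
    # Stand-in for a slow, non-critical backend call that times out.
    raise TimeoutError("recommendation service too slow")

def render_page() -> dict:
    page = {"checkout": "ok"}          # critical path: always served
    try:
        page["recommendations"] = fetch_recommendations()
    except TimeoutError:
        page["recommendations"] = []   # degrade: omit optional content
    return page

print(render_page())  # {'checkout': 'ok', 'recommendations': []}
```

The design choice is that the failure of the optional part is absorbed where it happens, so the critical path never sees it.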
Demand forecasting and capacity planning are essential to guarantee the defence in depth of services. Unfortunately, most of the time they are not done. Nevertheless, it is vital to benchmark services, measure how they behave under high load and know how much spare capacity they have during demand peaks.
By doing this, demand can be forecast and capacity planned to prevent multiple emergencies from arising, service outages from happening and sleepless nights from becoming the norm.
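One hedged example of such a capacity calculation, with invented figures: given a forecast peak load and a measured per-replica capacity, how many replicas does the service need, keeping spares so it survives failures and upgrades (the common "N+2" rule of thumb)?

```python
# Illustrative capacity-planning arithmetic. The load figures and the N+2
# redundancy rule are example assumptions, not measurements of any real
# service.

import math

def replicas_needed(peak_qps: float, qps_per_replica: float,
                    spares: int = 2) -> int:
    """Replicas to serve the forecast peak, plus spares for failures."""
    return math.ceil(peak_qps / qps_per_replica) + spares

# A forecast peak of 12,000 queries/s on replicas benchmarked at 2,500
# queries/s each needs 5 serving replicas, plus 2 spares.
print(replicas_needed(peak_qps=12_000, qps_per_replica=2_500))  # 7
```

Running this calculation regularly, against fresh benchmarks and fresh forecasts, is what turns a demand peak into a non-event instead of an outage.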
SRE vs DevOps
The term DevOps does not have a uniform definition. The theory arises from a great idea: To have the development people working side by side with the operations team. In practice, however, there seems to be great variability in how the industry interprets it.
DevOps is a set of practices and a model designed to bring down the barriers between developers and operators. It reduces organisational silos, accepts faults as something normal, implements changes gradually, makes the most of tools and automation, and tries to measure all that can be measured.
Google has been refining the definition of SRE for the past 15 years to include all the pieces we have already seen: the right balance between the SRE team and the dev team; free mobility between teams; the limits to the operating load; error budgets; and everything else.
The result of all this is that SRE is a highly detailed implementation of the DevOps proposal. It shrinks silos by sharing ownership of the product between the SRE team and the dev team. It treats failure as normal by means of error budgets and blameless post-mortem analyses.
Changes are made gradually by means of canary deployments to a small portion of the system. All work that can be automated is automated, and both the effort spent and the reliability of the system are measured.
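A canary rollout of this kind can be sketched as follows; the traffic stages and the rollback rule are assumptions for the example, not a description of Google's deployment system:

```python
# Illustrative canary rollout: a small, growing fraction of traffic goes to
# the new version, and the rollout reverses if the canary's error rate is
# worse than the baseline's.

CANARY_STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of traffic on the canary

def next_stage(current: float, canary_error_rate: float,
               baseline_error_rate: float) -> float:
    """Advance the rollout one stage, or roll back to 0 on regression."""
    if canary_error_rate > baseline_error_rate:
        return 0.0  # roll back: the new version is less reliable
    later = [stage for stage in CANARY_STAGES if stage > current]
    return later[0] if later else current  # hold at 1.0 once fully rolled out

# Healthy canary at 1% of traffic advances to 5%:
print(next_stage(0.01, canary_error_rate=0.001, baseline_error_rate=0.002))  # 0.05
# Regressing canary at 5% rolls back to 0:
print(next_stage(0.05, canary_error_rate=0.010, baseline_error_rate=0.002))  # 0.0
```

Because only a small fraction of users ever sees a bad release, a failed canary spends very little of the error budget.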
Google has created a website showing its methodology so that we can all learn from its experience and adopt SRE in our daily work. SRE is currently used by companies as varied as Facebook, Dell, Atlassian, Twitter, Apple, Oracle, Dropbox, Amazon and Microsoft.
It is evident that in order to implement SRE, each company will have to make specific adaptations so that it fits its ecosystem and meets its needs. Any change requires some effort and a certain attitude, but it is worth it, because SRE is an excellent tool to help us digitally transform our companies.
It makes all participants feel they are part of the same team and aligns their objectives toward a common goal. This methodological framework solves many of the problems we face when we develop, manage and maintain our digital products.