October 17, 2023
There is a constant tension between development and operations. Development wants to develop code, deliver applications and build features as quickly as possible, while operations wants to focus on cost, availability, scalability and change management.
Although the indirect costs of the split between development and operations can be subtle, they are often greater than the direct costs. There are ways around this conundrum, however, and they can lead to greater efficiency, greater scale, and lower costs. One such way is Site Reliability Engineering, or SRE.
Originating at Google in 2003, SRE is responsible for a combination of the following within a broader engineering organization: system availability, latency, performance, change management, monitoring, emergency response, and capacity planning. To do so, it leans on automation, system design, and improvements to system resilience.
When you consider the costs that can arise from the split between development and operations, it’s important to understand a few things about the two teams, which differ considerably in background, skill set, and incentives. They use different vocabulary to describe situations, they carry different assumptions about both risk and the possibilities for technical solutions, and they have different assumptions about the target level of product stability.
In fact, the split between the groups can easily lead to problems with communication, goals, and eventually, trust and respect. That’s where SRE, a methodology for business continuity, can help.
But it takes more than merely creating an SRE team composed of development and operations engineers. Creating a successful team requires a cultural shift in thinking, one that fosters collaborative work and aligns people to goals that are business-focused rather than team-focused.
Each team member needs to ask: What is the business trying to do, and what role does my specific discipline play? What, in fact, is the overall objective?
From a best-practices standpoint, business leaders embarking on such SRE-led team building must be cognizant of the all-too-common territorial behavior that can exist in such organizations, as well as the tendency for people to zero in on their own specific work. For that reason, members of an SRE-led team should be selected for their collaborative and empathetic qualities as much as for their technical skills.
The key to the success of any project is the coming together of people, process, and technology. Hitachi’s guiding principles from the very beginning have been Wa (Harmony, representing people), Makoto (Sincerity, representing process), and Kaitakusha-seishin (Pioneering Spirit, representing technology). These principles are part of the SRE culture.
The SRE-led culture cannot be implemented from the top down. It must involve people at every level, allow them to play to their strengths, and encourage everyone to do their best. It emphasizes the concept of Wa, where the common goal is more important than an individual’s ambitions or desires. Wa is about harmony, not conformity. It is the orchestration of individual strengths that leads to a greater whole.
The SRE-led culture is also about communication and alignment between teams with common goals. Shared metrics like service level objectives (SLOs) keep everyone’s focus aligned on customer satisfaction. Runbooks codify information so everyone feels confident responding to incidents. Makoto is about sincerity and transparency in communications.
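To make the idea of a shared metric concrete, here is a minimal sketch of how an availability SLO might be checked. Everything in it is an illustrative assumption rather than anything prescribed by Hitachi or by a particular monitoring tool: the `Slo` class, the function names, the request-based definition of availability, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability SLO shared by development and operations."""
    name: str
    target: float  # fraction of requests that must succeed, e.g. 0.999


def availability(good_requests: int, total_requests: int) -> float:
    """Measured availability as the fraction of successful requests."""
    if total_requests == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return good_requests / total_requests


def allowed_failures(slo: Slo, total_requests: int) -> int:
    """Number of failed requests the target tolerates in the window."""
    return round((1 - slo.target) * total_requests)


def meets_slo(slo: Slo, good_requests: int, total_requests: int) -> bool:
    """True if the observed failures stay within what the target allows."""
    failed = total_requests - good_requests
    return failed <= allowed_failures(slo, total_requests)


if __name__ == "__main__":
    # Assumed example: a 99.9% target measured over one million requests.
    slo = Slo(name="checkout-availability", target=0.999)
    print(allowed_failures(slo, 1_000_000))
    print(meets_slo(slo, good_requests=999_200, total_requests=1_000_000))
```

Because the target is below 100%, it leaves an explicit allowance for failure that both teams can see and reason about, rather than treating every failed request as a breach.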
One of the cultural changes that an SRE-led team must accept is that failure is normal. 100% availability is not possible, given the complexity of the cloud and our dependencies on things that are outside our control. When a failure occurs, look for systemic causes instead of finding fault with individuals and assigning blame. If someone makes a bad decision, what information or tools would help them make the right decision in the future? An SRE-led approach should foster a blameless culture, where individuals feel free to take risks and experiment to improve operations, knowing that they will not be blamed or punished. At Hitachi, we call that Kaitakusha-seishin (Pioneering Spirit).
Your concerns may sound like this: “I’ve got this large development team with a lot of development going on,” or “I’ve got large ops teams with workloads in production that are already being managed.” So, you ask, how should we start thinking about this?
For starters, you need to look for end-to-end lifecycle cloud management services that enable you to design, build, observe, operate, and manage workloads seamlessly across people, processes, and technology. Do that, and you can provide access to the right tools, technology, and services. This is in line with the new Hitachi Application Reliability Services we introduced last week. The services within the portfolio establish guardrails that enable secure and efficient public and private cloud workloads through an automated factory delivery model. As a result, organizations can extract value from workloads on a continuous basis, reducing cost per transaction while increasing revenue per transaction.
With a thoughtful SRE integration plan and the proper cloud management services, organizations can begin designing cost-effective cloud environments that are resilient and self-correcting, accelerating their digital and hybrid cloud transformations.