Cloud Reliability & the Rise of Engineering-Led Ops

Premkumar Balasubramanian

Premkumar Balasubramanian is Senior Vice President and CTO, Digital Solutions, Hitachi Vantara

Premkumar Balasubramanian (“Prem”) leads Hitachi Vantara’s Technology office globally and is responsible for strategizing and supporting all its go-to-market (GTM) pursuits. This includes offering shaping, architecting repeatable solutions for customers, and providing technology and thought leadership in modern engineering principles, application modernization, cloud engineering, data and analytics. Prem spent the early part of his career developing device drivers, network sniffers and protocols, and improving the reliability of applications. He is a native of Chennai, India, now residing in Dallas with two amazing kids and a wonderful wife. He enjoys gardening, chess and photography.


August 29, 2023

We’ve all seen how DevOps has helped organizations become more secure and agile. And just as DevOps transformed the development world, we’re on the cusp of another transformation, this time in the way that IT organizations handle modern workloads from the edge to the cloud.

Consider the changes that were ushered in by DevOps. Rather than a fragmented development approach that featured separate teams working on distinct areas, like security, development, infrastructure, quality assurance and support, DevOps promoted collaboration. At the same time, it eliminated the handoffs around tasks like testing and infrastructure provisioning, folding each into automated workflows.

The upshot: individual teams developed a better understanding of other departments’ needs, while a single delivery pipeline emerged that streamlined what had been a disconnected, often unwieldy process. Teams became more agile and built better, more secure products.

DevOps demonstrated the need to test assumptions and think rigorously about how to best manage applications in a modern architecture and leverage contemporary engineering processes for maximum benefit.

That was just one step in the ongoing history of development, and the status quo worked as long as applications didn’t need to meet “real-time” demands. But as more organizations gravitated toward an always-on business model to provide anywhere, anytime access, they also shifted to the cloud to gain greater agility. Response times had to shrink, leaving little room for error or iteration in the interactions between development and operations.

The divide between development and operations still exists today, and the rapid pace of the “move to cloud” has made the need to address this challenge more urgent. Cloud workloads are constantly being run, developed, deployed and updated. The always-on nature of the cloud requires an integrated, always-on approach between application engineering and operations.

In the context of the cloud, we’re reaching the point where we shouldn’t need to treat Engineering and Operations as separate activities. Indeed, Engineering-led Operations is the first big step toward a model where operations no longer exists as a separate, independent organization.

In this emerging world, wouldn’t it be better to just automate the more mundane tasks, plug them into your pipeline and allow engineers to automatically look at what’s happening in production to fix any problems? Imagine code that can fix itself with the right actions as soon as a problem is detected – or possibly before it becomes a problem. Support tickets would not need to be raised and alerts would not need to get triggered. That is the vision behind Engineering-led Operations.
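That vision, detection wired directly to an automated fix rather than a ticket or an alert, can be sketched in a few lines. This is a minimal, hypothetical illustration; the check, action and threshold names are placeholders, not any particular tool’s API.

```python
# Hypothetical sketch of engineering-led auto-remediation: health checks
# drive paired remediation actions directly, with no ticket or alert.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Remediation:
    check: Callable[[], bool]    # returns True when the service is unhealthy
    action: Callable[[], None]   # automated fix, e.g. restart or roll back


def reconcile(rules: List[Remediation]) -> int:
    """Run every check; apply the paired action for each failing check."""
    fixed = 0
    for rule in rules:
        if rule.check():
            rule.action()
            fixed += 1
    return fixed


# Illustrative wiring: restart a worker when its queue is backed up.
queue_depth = 5_000  # pretend this came from a metrics query


def queue_backed_up() -> bool:
    return queue_depth > 1_000


def restart_worker() -> None:
    print("restarting worker")  # stand-in for a real orchestration call


reconcile([Remediation(check=queue_backed_up, action=restart_worker)])
```

In practice the reconcile loop would run continuously against production telemetry, which is the sense in which engineers “automatically look at what’s happening in production.”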

Similarly, before the advent of DevOps, we relied upon manual intervention for things like code quality checks and security compliance. Nowadays, most of those routines have been converted into automatic rules plugged into tools that are part of the regular DevOps pipeline. For instance, very few would view testing as an independent process anymore. It’s all part of development.

The same logic applies to operations, which is increasingly part of the development lifecycle, driving an engineering-led evolution.

Cusp of Something Big

As an industry, embracing Engineering-led Ops sooner rather than later would bring greater agility to any cloud environment. Today, most organizations still struggle with engineering a good cloud setup. As they grapple with the complexity of cloud management, Engineering-led Ops can help solve the challenge.

Let’s look more closely at how an engineering-led approach might look in practice.

Today, dev teams have a features backlog, and everything goes into it. All feature requests are competing, regardless of their impact on the bottom line. Lost in the massive backlog might be a feature that is critical to the workload’s stability. This stability feature indirectly impacts revenue, but might be de-prioritized in favor of a feature identified as revenue generating.

With Engineering-led Ops, engineering and ops teams agree on error budgets, service level objectives and burn rates. When these exceed allowable norms, both teams de-prioritize new feature requests and prioritize reliability-related features until production is stable.
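The gating signal here is the burn rate: how fast the error budget implied by a service level objective is being consumed. A sketch of the arithmetic, with illustrative numbers and function names of my own choosing:

```python
# Hypothetical sketch: error budget and burn rate for an availability SLO.
# A burn rate above 1.0 means the budget is being spent faster than allowed.


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail (e.g. 0.001 for a 99.9% SLO)."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


# Example: a 99.9% availability SLO with 50 failures in 10,000 requests.
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}")  # ~5x faster than the budget allows
if rate > 1.0:
    print("pause new features; prioritize reliability work")
```

A burn rate of 5 is the kind of signal on which both teams, by prior agreement, would de-prioritize feature work until production is stable again.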

Engineering-led Operations brings infrastructure, applications and data closer together, making infrastructure almost touchless. It is not only about bringing DevOps-style agility to operations; it also means integrating the two functions around shared goals and processes.

All this is part of a broader trend toward function integration, where modern software engineering becomes less about building products and services and more about building modern businesses. Unfortunately, too many organizations still manage infrastructure, data and applications in silos.

An Unreliable Cloud is an Expensive Cloud

In the cloud era, the challenge of how to run and sustain workloads efficiently has become the litmus test for success or failure, particularly for “always on” businesses where customers must have access anytime and anywhere they want. When glitches crop up, outages threaten both your credibility and your revenue.

Each news-breaking, cloud-related outage reminds us that the cost of being unreliable is often exponentially higher than the cost of being reliable. Consider what happened when Facebook suffered a six-hour outage in October 2021. The problem – a configuration change to its routers – took down the social network along with Instagram, Messenger, WhatsApp, and Oculus VR. The company’s estimated financial loss: at least $60 million.

An integrated, engineering-led approach can help avoid disconnects like this. A constant feedback loop keeps everyone working on the project involved in the decision-making process and up to date on shared practices, shared goals and any changes that need to be made.

There are many other examples of how reliability impacts cost and credibility, but they all support the adoption of Engineering-led Ops. In a world of “everything-as-code”, we’re running out of reasons to delay.
