The origins of Customer Reliability Engineering
Customer Reliability Engineering (CRE) is a practice that was initiated by Google in 2016, following the success and value brought by their Site Reliability Engineering (SRE) team.
The Google SRE team is responsible for empowering product development teams with tools and processes to maximize the reliability and resilience of their Cloud infrastructure.
In other words, they concentrate their efforts on the Google Cloud Platform (IaaS and PaaS) availability from a customer standpoint, throughout maintenance, upgrades, and other operational activities.
If you’re interested in Site Reliability Engineering, you can find some interesting articles here.
But as we all know, a solution is only as reliable as its weakest link. Amazon, Microsoft, Google, and others have built strong credibility with customers for their cloud services by communicating exhaustively and transparently about availability and by aiming for some of the highest SLA commitments (99.99% or higher for their most critical “Compute” resources, which works out to less than one hour of downtime per year).
Service credits, usually starting at 10% of your monthly bill, back this commitment and increase customer confidence in the Cloud Service Provider's (CSP) ability to deliver.
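To make the "less than one hour per year" figure concrete, here is a minimal sketch of the arithmetic in Python. The function name and the list of SLA levels are illustrative assumptions, not values taken from any provider's contract.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes = 525,600.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) that still meets the SLA.

    Hypothetical helper for illustration only.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Example SLA levels chosen for illustration.
    for sla in (99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99%, the allowance is roughly 52.6 minutes per year, which is where the "less than 1h" figure comes from.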
The issue with current cloud application adoption
Feeling slightly comforted on that front, customers now turn all their attention (and anxiety) to the application that runs on top of their cloud infrastructure.
And while a plethora of new technologies has emerged over the past years to help with application reliability, it is often unclear to customers whether and how they can leverage them for a particular application, especially for legacy applications with a heavy on-premises history.
Customers are still dependent on the same vendor resources they’ve received for the past few decades: product documentation, white papers, professional services, support.
These resources are critical, but they are not enough to support the safe, fast pace that customers expect from cloud deployments, or to let them anticipate a given level of reliability. Vendors need to show their customers that they share their operational fate. But how?
Enter Customer Reliability Engineering.
What is the Customer Reliability Engineering value proposition?
Let’s take some industry trends and state-of-the-art practices to try to imagine how this CRE team would impact Axway product development and delivery.
In that scenario, the Customer Reliability Engineering team is ultimately responsible for the smooth customer experience with the Axway cloud solutions.
This implies owning a dynamic roadmap with a set of strategic customers to help them realize value from their investments. This covers the definition of a path to growth, leveraging the full extent of what Axway solutions can offer them.
The execution of this roadmap triggers technical advisory needs around architecture, deployment, maintenance, and operations of the Axway cloud ecosystem.
The outcomes of this work directly translate into improved adoption and quality of delivery, as the CRE team is closely partnering with Product and Engineering.
Axway internal teams
What is the impact on Axway internal teams? It’s simple. Product Management, Product Engineering, Engineering Operations (responsible for the Axway SaaS and Managed Cloud environments), and CRE teams establish and act according to a “social contract.”
This contract materializes in the form of a Production Readiness Review document (or PRR) that verifies the alignment with three key concepts:
- The “error budget” is dictated by the Service Level Objectives (SLOs) that the Product team defines and communicates to internal teams and customers.
This error budget determines how much downtime is allowed for upgrades and maintenance. The more of it you use, and the earlier you use it, the less is left should you urgently need it. This forces the team to plan wisely and, more importantly, to be creative so that the “budget” is only spent when necessary.
Note that I’m not recommending limiting application deployments, as this would go against one of the main benefits of the Cloud and DevOps practices. Instead, organize the patch/upgrade process so that the service remains available from a customer perspective.
- The focus is on “Everything as code,” and by “everything” I mean application and infrastructure deployment, testing, and operations.
There is no room for manual work: it wastes time, it is error-prone and, worst of all, it is not predictable. And how can our customers feel comfortable with a solution if they can’t predict the outcomes of some standard operations?
- The ability to push back on software delivery that doesn’t respect the contract. This is a big change we’re already embracing within Axway, with our Engineering Operations team being a gatekeeper for products’ General Availability (GA) status.
Working with their Product and Engineering counterparts, they built a standard set of requirements for our products to be considered “cloud-ready.” New product releases are assessed against this list of prerequisites before they are allowed to reach GA.
From a customer engagement point of view, product releases that bring in new complex features or lead to a different architecture will require the approval of at least two customers (early adopters) to move forward with GA.
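The error budget described in the first concept above can be sketched as a small Python class. This is a minimal illustration under stated assumptions: the `ErrorBudget` name, the 30-day period, and the 99.9% SLO are hypothetical and are not Axway's actual objectives.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Tracks downtime allowed by an SLO over one period (hypothetical model)."""

    slo_percent: float                     # e.g. 99.9, an assumed example value
    period_minutes: float = 30 * 24 * 60   # assumed 30-day period: 43,200 minutes
    consumed_minutes: float = 0.0

    @property
    def total_minutes(self) -> float:
        # The budget is the share of the period the SLO allows to be unavailable.
        return self.period_minutes * (1 - self.slo_percent / 100)

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - self.consumed_minutes

    def record_downtime(self, minutes: float) -> None:
        """Deduct an outage or a maintenance window from the budget."""
        if minutes < 0:
            raise ValueError("minutes must be non-negative")
        self.consumed_minutes += minutes

    def can_afford(self, planned_minutes: float) -> bool:
        """Return True if a planned operation fits in what is left."""
        return planned_minutes <= self.remaining_minutes


if __name__ == "__main__":
    budget = ErrorBudget(slo_percent=99.9)
    print(f"Total budget: {budget.total_minutes:.1f} minutes")

    budget.record_downtime(30)  # an early upgrade that took the service down
    print(f"Remaining: {budget.remaining_minutes:.1f} minutes")

    # A second 20-minute window no longer fits; a zero-downtime upgrade does.
    print("20-minute maintenance fits:", budget.can_afford(20))
    print("Zero-downtime upgrade fits:", budget.can_afford(0))
```

The point of the sketch is the trade-off: once an early upgrade has spent most of the budget, only operations that keep the service available still fit, which is why the patch/upgrade process should be designed around zero customer-visible downtime.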
How will this play out in real life?
One thing is sure: none of this is going to happen overnight, or at the same time for all Axway products. But an opportunity lies in front of us: our Managed Cloud offering for MFT, B2B, and Amplify API Management Platform.
The Managed Cloud services already define a list of availability objectives based on your support level of choice (silver, gold, or platinum).
This is an appropriate first step in aligning all teams behind a shared business-oriented goal. As a second step, the CRE team will start engaging with customers and consolidating their perspectives and requirements.
This new set of SLOs will augment the standard set of objectives for Product and Engineering, leading us to establish a framework and methodology to drive increased success around customer adoption.
Ideally, we want to reach a point where a new product release is handed over seamlessly from Product to Engineering to Engineering Operations and on to customers, with the right level of quality and packaging.
This would ensure that existing Axway Managed Cloud (AMC) customers are always on the latest and greatest version of our products, without any impact on their business.
It would also make a strong impression on non-AMC customers in terms of adoption and comfort. And we all know how important a first impression is in establishing a successful relationship.
Learn more about Axway’s Managed Cloud Services.