Site Reliability Engineering: The most important software feature is reliability

Reliability is the most important software feature. Both management and the software team at FinEasy AG realise this when customers suddenly start complaining about long response times, and even outages, in the Saving Agent app. The company has to act quickly to avoid lost sales, dissatisfied customers and possibly irreparable damage to its image.

Disclaimer: This is a fictitious example; proper names, persons and entities have been made up.

FinEasy AG, based in central Switzerland, is a new player in the Swiss FinTech market and has ambitious goals: not only does it want to become the leading payment service provider in Switzerland, but it also wants to support private customers with useful services in the areas of asset management, investment and savings advice. In the course of its endeavours, FinEasy developed the Saving Agent app, in which users document their income and expenditure and receive a customised catalogue of measures for their personal savings strategy.

The service quickly became a success, but recently the application, which was developed and runs in the cloud, has stopped working reliably, and the first users have already complained about long response times and isolated failures. Management is alarmed and calls Martin, a software engineer and the product owner of Saving Agent, into the office: he is to fix the problem and also find a way to ensure the reliability of the application in the longer term. This is no easy task: if the application does not work properly again soon, FinEasy AG faces lost sales, a large number of dissatisfied customers and almost irreparable damage to its image. Development of Saving Agent would also slow down, a nightmare scenario given the long roadmap that FinEasy AG envisages for the application and the large backlog of planned features. Martin realises how much pressure he is now under. How is he supposed to overcome all of these hurdles alone?

SRE improves software reliability

Martin decides to bring an external expert on board and thinks of the specialists from Swisscom, because they are very familiar with Site Reliability Engineering, or SRE for short. SRE is a service management model developed by Google. It comprises software-based methods and practices for building highly scalable and reliable software systems. Especially in cloud-native environments, SRE also helps to balance the release of new features against reliability for users. SRE thus ties in directly with the DevOps approach and describes concrete procedures for putting the DevOps concept into practice in a specific workflow. With a comprehensive service portfolio, the Customer Reliability Engineers (CREs for short) from Swisscom assist companies with implementing SRE and help them to sustainably improve the reliability of their applications.

Benjamin Treynor Sloss, inventor of SRE at Google, talks about the history of Site Reliability Engineering.

One thing is clear from the start: the reliability of Saving Agent, and of any other applications FinEasy develops in the future, can only be achieved jointly. Swisscom, Martin and his staff must all contribute. Commitment goals that are binding for both sides are therefore agreed in advance. They answer two questions: what must Swisscom and FinEasy AG each contribute so that applications run reliably, and where do the responsibilities lie?

The CRE then gains an overview of the application – from various perspectives: together with Martin, he not only evaluates the business goals that FinEasy is pursuing with the application, but also takes an in-depth look at the overall application architecture and its dependencies. In terms of operational readiness, he also carries out a risk assessment and develops measures to meet business requirements in terms of availability and reliability.

Software development according to clear rules

All of these points are recorded in an Application Reliability Review. This serves as the basis for Martin and his CRE to define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) together. SLOs describe the targets for the desired reliability of a piece of software. SLIs are the metrics that show how reliable the system actually is, for example the share of all requests that are answered successfully.
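
As a rough illustration of how an SLI relates to an SLO, the following Python sketch computes an availability SLI as the share of successful requests and compares it with a target. The names RequestCounts, availability_sli and meets_slo, the request numbers and the 99.9% target are assumptions made up for this example; they are not taken from FinEasy or from the Swisscom service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Request totals for one measurement window (hypothetical structure)."""
    successful: int
    failed: int


def availability_sli(counts: RequestCounts) -> float:
    """Return the share of successful requests, between 0.0 and 1.0."""
    total = counts.successful + counts.failed
    if total == 0:
        # No traffic means nothing failed; treat the window as fully available.
        return 1.0
    return counts.successful / total


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Made-up numbers: 100,000 requests in the window, 180 of them failed.
window = RequestCounts(successful=99_820, failed=180)
sli = availability_sli(window)
print(f"SLI: {sli:.4%}, SLO met: {meets_slo(sli, slo_target=0.999)}")
```

With these invented figures the SLI comes to 99.82%, which is below the assumed 99.9% target, so the SLO would count as missed for that window.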

SLIs and SLOs play a fundamental role in Site Reliability Engineering. SLIs are best understood as metrics for user satisfaction: they track and measure the success of the user journey and show whether the SLOs are actually being met, or whether the application risks leaving its users dissatisfied.

SLIs and SLOs also determine the error budget available to software engineers for the further development of an application. The error budget is the amount of unreliability the SLO still tolerates: a 99.9% target, for instance, leaves a budget of 0.1% failed requests. New services or software features may be rolled out as long as the application stays within its SLO. If it produces too many errors or is down for longer than the error budget allows, no further rollouts may take place until the application is back within the SLOs.
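
The rollout rule can be sketched as a simple error budget check. The error budget is whatever the SLO leaves over: with a target of 99.9%, 0.1% of requests may fail before the budget is used up. The function names, the one-million-request window and the failure counts below are hypothetical and only illustrate the arithmetic; a real release policy would be agreed between the teams involved.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window under the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all: any failure overspends it.
        return 0.0 if failed_requests == 0 else float("-inf")
    return (budget - failed_requests) / budget


def rollout_allowed(slo_target: float, total_requests: int,
                    failed_requests: int) -> bool:
    """Allow new rollouts only while some error budget is left."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


# Assumed example: 99.9% target over 1,000,000 requests (budget of about 1,000 failures).
for failed in (400, 1_200):
    left = budget_remaining(0.999, 1_000_000, failed)
    allowed = rollout_allowed(0.999, 1_000_000, failed)
    print(f"{failed} failures: {left:.0%} of budget left, rollout allowed: {allowed}")
```

In this sketch, 400 failures leave roughly 60% of the budget and rollouts continue, while 1,200 failures overspend the budget and block further rollouts until reliability recovers.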

Site Reliability Engineering with Swisscom

  • Swisscom has many years of experience in operating its own as well as external cloud platforms and applications.
  • Thanks to their Swisscom Customer Reliability Engineer, customers have access to a broad base of employees with extensive technology expertise and many best practice approaches.
  • The Swisscom SRE Service includes end-to-end monitoring of the application. This provides the CRE with all of the necessary information to detect problems at an early stage or to carry out root cause analyses.
  • The SRE/CRE model from Swisscom is hybrid. This means that, unlike other providers in the Swiss market, Swisscom operates in a platform-independent manner.

External support

Now that it’s clear which SLOs are being pursued and which SLIs are being used, Swisscom can provide professional support to Martin in operating Saving Agent: Swisscom integrates FinEasy’s monitoring into its Reliability Management System and sets up alerting. In the event of future incidents, a Customer Reliability Engineer from Swisscom is on hand to advise Martin. The CRE is notified as soon as the service targets agreed for Saving Agent are breached, identifies the root cause of the event, and informs the relevant operations teams about the incident.
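
How such a notification might be triggered can be sketched in a few lines of Python. This is not how the Swisscom Reliability Management System works internally, which the article does not describe; Window, check_windows, the notify callback and all figures are hypothetical, and a production setup would rely on a dedicated monitoring and alerting stack rather than a loop like this.

```python
from typing import Callable, Iterable, List, NamedTuple


class Window(NamedTuple):
    """Request counts for one monitoring interval (hypothetical structure)."""
    label: str
    successful: int
    total: int


def check_windows(windows: Iterable[Window], slo_target: float,
                  notify: Callable[[str], None]) -> List[str]:
    """Call notify for every window whose SLI falls below the SLO target.

    Returns the labels of the breaching windows.
    """
    breached = []
    for window in windows:
        if window.total == 0:
            # No traffic in this interval, so there is nothing to evaluate.
            continue
        sli = window.successful / window.total
        if sli < slo_target:
            notify(f"SLO breach in {window.label}: "
                   f"SLI {sli:.2%} is below target {slo_target:.2%}")
            breached.append(window.label)
    return breached


# Made-up five-minute intervals; print stands in for paging the engineer on call.
intervals = [
    Window("10:00", successful=9_995, total=10_000),
    Window("10:05", successful=9_950, total=10_000),
    Window("10:10", successful=0, total=0),
]
check_windows(intervals, slo_target=0.999, notify=print)
```

Passing the notification channel in as a callback keeps the check itself independent of how the engineer is actually reached, whether by pager, chat message or ticket.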

Every incident is followed by a post-mortem, which records the incident, its effects and the measures taken to resolve it, as well as the follow-up measures that should prevent it from happening again. In this way, SRE also develops preventive measures that contribute to better software reliability. Monthly reports on SLO violations and comprehensive quarterly reports also help Martin, together with Swisscom, to further improve the reliability of Saving Agent and keep repair times as short as possible.

Martin can breathe a sigh of relief. Thanks to the help of Swisscom, he was not only able to resolve the long response times and outages of Saving Agent; he now knows that he will also receive professional support in the future when it comes to operating the application. Site Reliability Engineering and working with appropriate professionals help him to ensure the reliability of one of FinEasy AG’s most important business applications.
