To lay the foundations for a successful inquiry into Site Reliability Engineering (SRE), and how it can inform the way we deliver and maintain our software and run our Technical Operations at Eagle Eye, Steve Rothwell (CTO and founder of Eagle Eye) asked the senior leadership team to attend the Linux Academy and obtain our first-level Site Reliability Engineering qualifications.
Steve wanted the entire leadership team, technical or otherwise, to be on board with the thinking behind the approach, and with its language and concepts.
Here I interview our CEO, Tim Mason, to hear about his experience and thoughts after completing the initial SRE training.
Q. Tim, many of our technical people have enjoyed learning on the Linux Academy platform, but you are our first non-technical guinea pig. How was it for you?
I attended the Linux Academy like a kid at a new school, but found it fast-moving and approachable.
Q. From the course, can you describe SRE for us?
So, the purpose of Site Reliability Engineering (SRE) is to “enable the making of better software faster”. The idea that the constraint is the stability of the platform, rather than the amount of development delivered, was new to me but completely logical once you get it.
Q. Any immediate insights for the retail world?
It sort of explains why there is so little development on retailers’ point-of-sale (POS) systems: not because the new functionality isn’t needed, but because reliability is the be-all and end-all. It also explains why the introduction of a flexible marketing platform like Eagle Eye AIR, linked to the POS by APIs, can have such a super-charging effect.
Q. With your retail experience, I guess that customer service has always been on your mind?
I was more familiar with the concept of service levels. As a grocer by background, with particular exposure to fresh foods, I had years of experience of managing customer availability (a.k.a. service level) whilst managing product waste. Early in my career, my colleague’s response to finding empty shelves of coleslaw during a snap heat wave one Easter was a formative experience. I also recognised the objective that the measurement of performance should reflect customer experience. In supermarket terms, Tesco did this in the early 2000s, switching to actual on-shelf availability as experienced by our own Tesco.com pickers, rather than relying on supply chain systems reporting, which measured what should be on shelf rather than what was on shelf.
Q. How do you think the concepts of SRE would help us maintain our service levels to customers?
Breaking service levels down into Service Level Agreements (SLAs), Objectives (SLOs) and, crucially, Indicators (SLIs) seems a great step forward in delivering on agreements with clients on a continued and reliable basis. The holy grail is lead, not lag, indicators.
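As a rough illustration of how the three terms relate, here is a minimal Python sketch. The class, the function names and the target figures are hypothetical and chosen only for this example; they are not Eagle Eye's actual targets or tooling. The SLI is what is measured, the SLO is the internal objective, and the SLA is the looser promise made to the client.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical targets for one service; the figures are illustrative only."""
    sla_target: float  # what is promised to the client, e.g. 0.995
    slo_target: float  # stricter internal objective, e.g. 0.999


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def assess(sli: float, level: ServiceLevel) -> str:
    """Compare the measured indicator with the agreement and the objective."""
    if sli < level.sla_target:
        return "SLA breached"
    if sli < level.slo_target:
        return "SLO missed, SLA still met"
    return "within SLO"


if __name__ == "__main__":
    level = ServiceLevel(sla_target=0.995, slo_target=0.999)
    sli = availability_sli(good_requests=9_985, total_requests=10_000)
    print(f"SLI = {sli:.4f}: {assess(sli, level)}")
```

Setting the objective tighter than the agreement is what gives a team an early warning: the SLO can be missed while the client-facing SLA is still being met.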
Q. What does that mean for our Engineering and Technical teams?
Well, I also recognised elements of Lean Operations on the course, particularly in the area of “right first time”. Logically, this will be as beneficial for the production of software as it was for the production of cars etc.
The checklist for making Engineering and Operations more effective speaks to this:
- Shared ownership
- Same tooling
- Same techniques
- No-fault post-mortems (better described as our Agile retrospectives)
- No two failures the same
It seems that SRE and its concern for service provides a very handy way of managing the risk of innovation through something called an Error Budget, which allows you to calibrate your pace of new feature releases against the data coming back from the various service metrics. The more you focus on minimising errors in the live software, the more capacity, or budget, you have to release innovative features.
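To make the Error Budget idea concrete, here is a minimal Python sketch. It assumes a request-based availability objective; the function names, the figures and the simple "release only while budget remains" rule are hypothetical choices for illustration, not a description of how any particular team runs its releases.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    The budget is the number of failures the objective tolerates:
    (1 - slo_target) * total requests in the measurement window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def can_release(budget_remaining: float, threshold: float = 0.0) -> bool:
    """Hypothetical policy: ship new features only while budget is left."""
    return budget_remaining > threshold


if __name__ == "__main__":
    # A 99.9% objective over 100,000 requests tolerates about 100 failures.
    remaining = error_budget_remaining(0.999, good=99_960, total=100_000)
    print(f"budget remaining: {remaining:.0%}, release: {can_release(remaining)}")
```

With 40 failures against roughly 100 allowed, about 60% of the budget is left and releases can continue. With 150 failures the budget is overspent, and under this policy the effort would shift from new features to reliability work.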
Q. Were there any other concepts which informed your thinking?
Well, finally, I was taken by the concept of toil. Again, this resonated with my past, where we had tried to identify all the back office processes undertaken by store staff, with the aim of removing them to create more time for those staff to serve customers. It is our duty as managers to champion the removal of ‘bad’ work, i.e. toil, to improve the jobs of our colleagues and make our organisations as efficient as possible.
In summary, it was two hours well spent and I was proud to download my Linux Academy certification.
Q. I’m concerned our Software Engineers and Technical Teams might be worried about you taking more technical courses and wanting to join our teams.
Ha, you have nothing to worry about. What those people do is awe-inspiring and I wouldn’t know where to start, but I am glad we have so many talented individuals as the beating heart of our business, applying their technical knowledge to ensure we deliver on our strategic goals. To be honest, this is how Steve and I work together: I have a lifelong interest in doing better marketing, and he has a lifelong interest in applying technology to do better marketing. Both of us want to see our clients doing a better job for their customers, and it’s that which unites us and drives us on.
Thank you so much Tim for taking the time to articulate your understanding of SRE and its potential implications for us as a business.
As we continue to seek ideas to improve our processes, learn from others, and build what we learn into the ways we work, it’s vital that we have a culture of inter-departmental understanding and support so that we feel we are all heading in the same direction.
We will be opening the Linux Academy SRE course to others across the business so that we develop a common understanding as we take our first steps towards augmenting our Software Development Lifecycle, Technical Operations, Quality Control and Support with the disciplines of SRE.