Jack Maher

Expert in Digital Transformation, including DevOps, Lean, and Agile practices through the full value stream. Digital Transformation Evangelist · DevOps Ambassador · Author · Consultant · Educator · Coach · Speaker.

Site Reliability Engineering is one of the hottest and fastest-growing roles in today’s IT talent market


In January 2019 LinkedIn called the SRE role the 2nd most promising job in the US, behind only Data Scientist, with these statistics:

  • Median Base Salary: $200,000
  • Job Openings (YoY Growth): 1,400+ (72%)
  • Career Advancement Score (out of 10): 9

From <https://blog.linkedin.com/2019/january/10/linkedins-most-promising-jobs-of-2019>

Read the definitive book on the SRE role for free here:  https://landing.google.com/sre/books/


What Does a Site Reliability Engineer Do?

Today’s Site Reliability Engineer spends about half of their time on the ops side: making sure that issues are addressed and appropriately resolved, being on call, and troubleshooting and intervening when problems arise. The other half of their time is spent on new feature implementation, automation, and performance monitoring.

As with many “overnight success” stories, this one has been years in the making.

How the SRE Concept Was Born

While Google really put the role on the map in today’s context, the foundational principles and the first technical implementations of resilience, recoverability, and reliability in a highly scalable environment emerged at Bell Labs more than 20 years ago, and almost concurrently at Google circa 2003.

In the heyday of Bell Labs, its teams were managing exponentially growing service offerings and business volume as services like 800 numbers, voicemail, and other network services became part of everyone’s lives, and then exploded with cellular services. Scaling was something they were already pretty good at. They had a lot of practice managing systems through periodic sustained spikes, such as special events (e.g. natural disasters) and even predictable massive surge events (e.g. Mother’s Day). But resilience took on a whole new meaning when systems had to withstand dramatic changes and hardware or software failures, in many cases with a requirement to self-diagnose and self-correct, because the network services were distributed and many ran in “lights-out” facilities (meaning completely automated, with no people on premises).

A Personal Experience

“Necessity is the mother of invention,” and so they invented, as evidenced by the many patents awarded to these teams, such as this one, which includes my co-author Carmen DeArdo. He shares his firsthand experiences in our book Standing On Shoulders: A Leader’s Guide to Digital Transformation. Later, when he and I worked together at Nationwide, we drew on his Bell Labs experience and on the organization’s current needs to develop the initial prototype of this role for our DevOps implementation.

Today’s SRE role builds on that foot-in-each-camp approach: one foot in operations and the other in software development.

The Foundations of SRE 

The Site Reliability Engineering domain rests on four foundations: Scalability, Availability, Incident Response, and Automation. In practice, SREs need to balance their time. One half of their focus goes to operational issues and opportunities for improvement, assessed both qualitatively and quantitatively; the other half goes to developing new features and capabilities, such as application telemetry and in-production testing, and to meeting evolving needs.

The Site Reliability Engineer role demands a pi-shaped skillset, with skills and experience in both the development and operations spheres. Organizations therefore need career path planning, along with supporting skill development and hands-on experience, to grow the population of people who can fill the role.

Professional Development Options for SRE

In anticipation of this and other skill development needs, the DevOps Institute has kicked off the SKILup initiative to increase focus on and to deliver skill advancement information and opportunities.

We also need to educate our organizations about how we think about work. That starts with making all work visible and extends the collaborative approach broadly, so that a failure to maintain service level objectives triggers an inquiry into root cause and a resolution that is based on data and aligned with mission and value.
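To make the idea of a data-based service level objective check concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `check_slo` function, the `SloReport` type, and the 99.9% default target are hypothetical names and values chosen for this example, not something prescribed by any particular SRE toolchain.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float  # fraction of requests that succeeded
    target: float        # SLO target, expressed as a fraction
    met: bool            # True when availability is at or above the target


def check_slo(total_requests: int, failed_requests: int,
              target: float = 0.999) -> SloReport:
    """Compare measured availability against an SLO target.

    The 99.9% default is an assumed example value, not a recommendation.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    availability = (total_requests - failed_requests) / total_requests
    return SloReport(availability, target, availability >= target)


if __name__ == "__main__":
    # Hypothetical numbers for one reporting period.
    report = check_slo(total_requests=1_000_000, failed_requests=2_000)
    if not report.met:
        print(f"SLO missed: {report.availability:.4%} < {report.target:.4%}; "
              "open a root cause inquiry")
```

A check like this keeps the conversation grounded in measurements: the inquiry begins because the numbers fell below the agreed objective, not because someone is looking for blame.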

Global Lynx, as a REP of the DevOps Institute, offers the Site Reliability Engineering Foundation course, which you can explore to learn more and to earn a globally accepted SRE Foundation certification.
