Staffing levels and support coverage, part 3

                August 28, 2023

                What's a 7x math error among friends?

Last week we talked about a way to achieve 24/7 support coverage on a budget, through a combination of careful staffing and leveraging outside assistance in the form of contractors. But that’s not going to be enough forever, at least not if your customer base keeps growing globally. Eventually you’re going to need to consider a true 24/7 support organization.
24/7
Maybe your customer base is large enough, or your product is critical enough, that the contractor/escalation process described last week just isn’t going to cut it. You need to have a fully trained engineer standing by at all times to handle issues, whether they’re coming in at 9AM on a Monday or 2AM on a Saturday. You need a full follow-the-sun support coverage model, but that is far easier said than done.
Let’s start off by looking at those eye-popping numbers I mentioned a few weeks ago. 
First, a disclaimer: I can’t believe nobody called me out on this, but I completely flubbed the calculation. 24 x 7 x 365 isn’t right, of course. It’s 24 hours per day, 365 days per year. Sorry about that!
There are 8,760 hours in a year (24 x 365), and a 40-hour workweek gives you 2,080 hours per year per engineer. Even before we take into account time off, we’re looking at 4.2 full-time employees to cover those 8,760 hours. Adding in time off adds more than 13% to your staffing needs, by my very rough calculation:
4 weeks vacation = 160 hours
5 days sick time = 40 hours
10 holidays = 80 hours
280 hours time off / 2080 hours total = ~13.5%
4.2 * 113.5% = 4.8
So that 4.2 people has already become 4.8 to account for time off. Of course you can lower that number by expecting more than 40 hours/week of work, but that will run you headlong into burnout sooner rather than later.
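For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the arithmetic above. The function names are made up for illustration, and the defaults assume 8-hour days and 5-day weeks. It also includes a slightly stricter variant that divides by the hours an engineer actually works after time off, which nudges the result up a little.

```python
# Rough staffing arithmetic for round-the-clock coverage.
# All names here are illustrative, not from any library.

HOURS_PER_YEAR = 24 * 365        # 8,760 hours that need coverage
WORK_HOURS_PER_YEAR = 40 * 52    # 2,080 hours per engineer at 40 hours/week


def time_off_hours(vacation_weeks=4, sick_days=5, holidays=10, hours_per_day=8):
    """Total paid time off per engineer per year, in hours (5-day weeks assumed)."""
    vacation = vacation_weeks * 5 * hours_per_day
    return vacation + (sick_days + holidays) * hours_per_day


def baseline_fte():
    """Engineers needed if nobody ever took a day off."""
    return HOURS_PER_YEAR / WORK_HOURS_PER_YEAR


def fte_with_time_off(time_off):
    """Same rough method as the text: scale the baseline up by the time-off fraction."""
    return baseline_fte() * (1 + time_off / WORK_HOURS_PER_YEAR)


def fte_by_available_hours(time_off):
    """Stricter variant: divide by the hours each engineer actually works."""
    return HOURS_PER_YEAR / (WORK_HOURS_PER_YEAR - time_off)


off = time_off_hours()
print(f"Time off per engineer: {off} hours")
print(f"Baseline: {baseline_fte():.1f} FTE")
print(f"With time off (rough): {fte_with_time_off(off):.1f} FTE")
print(f"With time off (stricter): {fte_by_available_hours(off):.1f} FTE")
```

Either way you land just under five people, and that is still before any of the real-world complications.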
And of course time off isn’t the only thing that will drive that number higher. Consider:
People aren’t going to perfectly coordinate their time off. People will want to take vacations around the same time in the summer and winter. People will get sick in clumps. Most of the team will be taking holidays off, and so on. This number is an absolute lower bound, not a practical staffing suggestion.
There is no flex with this minimum team size: there will be one (or zero) people available at any given time to work, which leaves no room for ticket surges or other high-volume support incidents.
If you’re hiring exclusively in one country, that gives you only a few time zones to work with (depending on which country we’re talking about; I’ll assume the US). Are you going to pay extra for folks to work outside their normal hours, or hire outside the country? Each comes with its own headaches.
The larger the team, the more need for coordination and management. As a rule of thumb, you’ll need one manager for every eight or so employees, which at these numbers means at least one manager, and a second once the team grows much past eight (plus a full team owner, who might be one of those managers or another person entirely).
A larger discussion of this will have to wait for another time, but these numbers are assuming that your engineers are working on support issues for their entire work day, every day. If at all possible you’ll want to introduce some variety to the day-to-day, which directly translates into fewer hours on front-line support duty. If you’re dropping the on-duty time to 4-6 hours, add another 50-100% to the above staffing numbers to account for that.
Considerations of burnout and time-to-replace are less pronounced on an individual level as the team gets larger. Losing one person on a ten-person team is much less of an additional burden on the rest of the team than it is when there are only three to begin with. But that doesn’t mean you can disregard them entirely! If the conditions for burnout exist, you are liable to lose a lot more than one engineer if you don’t act quickly to improve the team’s situation.
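The effect of shorter on-duty hours on headcount can be sketched the same way. This is only the pure ratio, assuming a nominal 8-hour workday (my assumption for the sketch) and a hypothetical helper name. It ignores handoff overhead and anything else that eats into a shift.

```python
def on_duty_uplift(on_duty_hours, workday_hours=8):
    """Fractional headcount increase when each engineer spends only
    on_duty_hours of a workday_hours day on front-line support."""
    if not 0 < on_duty_hours <= workday_hours:
        raise ValueError("on_duty_hours must be between 0 and workday_hours")
    return workday_hours / on_duty_hours - 1


for hours in (6, 5, 4):
    print(f"{hours} on-duty hours: +{on_duty_uplift(hours):.0%} staffing")
```

Under that assumption, four on-duty hours doubles the headcount, while six on-duty hours works out closer to a third more on the raw ratio. Treat the 50-100% range as the more conservative planning figure.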
Partial case study
Last week I talked about our coverage model on my last team: US-based support for weekdays and rotating weekend coverage, EU-based contractors to field and escalate off-hours issues. I mentioned that towards the end of that time we expanded into full-time Europe-based support engineers as well, so let’s talk about that a bit. 
Like many smaller companies, we considered using a PEO (Professional Employer Organization) to avoid the necessity of incorporating in every country we hired from, but by the time I left we were still treating those full-time engineers as independent contractors, which was a level of bureaucracy and headache of its own. Each engineer had to submit a bill to the company monthly, it wasn’t necessarily paid for several weeks, and no one involved ever got to forget that, strictly speaking, they weren’t employees in the same way that our US-based team members were.
That said, our EU engineers got us pretty close to a full follow-the-sun support model, but you know what they say about close. We never achieved full in-house coverage, and given our company size and budget that is perhaps not surprising. What kept us from getting the rest of the way there? There were a few major factors.
Off-hours redundancy. We had two engineers who were based in the EU, but that left very little room for error. If one engineer was on vacation, six hours of the day was on the shoulders of just one person.
Six-hour shifts. Unlike our US-based team, where we had three shifts of four hours each, the EU team was on duty for six hours. In the short term it worked okay, since the load was still a lot lighter than it was during US hours, but we knew that wouldn’t always be the case as we expanded our customer base to more EMEA and APAC-based organizations. As we grew, the ticket load during those hours would certainly increase, and burnout would become more and more of an issue.
The swing shift. As I mentioned before, we had our CTO covering evening Pacific time (5PM-11PM PT). While it was good enough for the time, that was another place where we had zero redundancy. The hours it covered were late enough that it was hard to find a US-based engineer to cover them without going far into the evening, even in Hawaii. And the company was understandably hesitant to expand into hiring in APAC, even using a PEO. Worse yet, we’d have had to hire at least two engineers, which would still leave us no better off than our existing EU coverage.
Management. In both cases, EU and EMEA, we had engineers who were multiple hours away from any existing management structure in the US. While we managed to swing it in the EU with one engineer designated as team lead, it was a constant balancing act ensuring they had both sufficient autonomy to cover their own workload and sufficient managerial support to address any performance or other issues. In the longer term, either promoting an engineer from the team or hiring in managers who were located closer to the EU and EMEA engineers would have become necessary.
Boiled down to the essentials, going from the team we had (90% coverage) to the team we needed (100% coverage) would have been a significant leap in team complexity, global team distribution, and overall team cost in salary and overhead. And since most of the actual support needs were during US business hours, it was a hard sell to company leadership to overengineer today’s team, with today’s budget, to have sufficient capacity for our expected future needs. 
Final thoughts
For each of the coverage models, what I’ve discussed is the minimum requirement: if support load is consistently over 1 with your current staffing, then you need to scale up. But if you’re building in enough excess capacity to ensure you’re minimizing burnout and are able to absorb temporary increases in volume as well as scheduled and unscheduled time off, you should find that you’re rarely in a situation where that load is creeping too high for a sustained time. At this point, the question becomes: how do you make sure your support coverage is sufficient not just for today, but for a year from now? That’s when we start getting into the more complicated topic of capacity planning, which is definitely a topic for another time.
