Infrastructure Engineering

Tech Spec Review

Mon, 16 Jan 2023 07:00:00 -0700

As the organization starts to write more Technical Specifications, you’ll eventually want a forum to discuss the key decisions. At most companies, that meeting is the Tech Spec Review.

The Tech Spec Review is a forum to review feedback on new Tech Specs, resolve open points of discussion, and flag new context to be considered before finalizing the design. Secondarily, it’s a valuable forum for keeping the wider organization aware of new and upcoming technology changes.

Related tools

Tech Spec

Related meetings

Incident Review

Other approaches to Tech Spec Reviews

Goals

Drive consistent technical decision making. Much of the value from your technology strategy comes from its consistent application, and this meeting should support consistency. The review is a particularly valuable source of problems to inform your technology strategy
Role model good technical decision-making and discussion. Your organization will learn what good technical decision-making looks like from this meeting. Proactively coach folks who give feedback, reinforcing good feedback (“keep doing that!”) and redirecting ineffective feedback (“in the last meeting, …”)
Prevent teams from pursuing local maxima in ways that are misaligned with the company. For example, a given project might benefit from introducing a new database, but the cost to the company to support business continuity, privacy auditing, and so on might outweigh the project’s benefits
Avoid the Tech Spec Review anti-patterns. Don’t be a domineering review, bottlenecked review, status-oriented review, or an inert review. As a key forum for resolving technical disagreement, there are many ways for Tech Spec Reviews to fail. Avoiding these anti-patterns requires ongoing, proactive attention from the Tech Spec Review’s sponsoring leader

Agenda, Scheduling, and Scaling

The default approach here is to run them on a weekly cadence, sending out Tech Specs for discussion two days ahead of the meeting, requiring all attendees to read the specs before the meeting, and canceling meetings ahead of time when there are no specs to review.

That said, most organizations end up with a fairly custom approach to this meeting. When your organization is small, you can likely do on-demand reviews for each Tech Spec. This allows the team to get comfortable reviewing and being reviewed without the risk of “running over time” and preventing another spec from getting discussed.

As your organization grows, it will typically become hard to schedule all stakeholders into on-demand meetings, and you’ll typically move to a standing meeting. Each standing meeting should discuss one to three reviews, depending on the size of open decisions. You can experiment a bit with format here: you might be able to review five specs in five minutes if it’s just a matter of approving unless there are any additional concerns to flag.

There are many ways to scale this meeting. Some organizations rely on asynchronous review for most specifications, and only bring “controversial” specs to the synchronous review. Some organizations hold multiple Tech Spec Reviews, sharded by area: one for Product Engineering, one for Infrastructure Engineering, and so on. Ultimately, I recommend actively experimenting with your approach based on the specific issues you’re running into with the meeting. There are general solutions, but each company uses this meeting in a somewhat different way, so adopting the standard solution may not work well for your needs.

Roles & Attendance

There are five key roles in the Tech Spec Review:

Facilitator who coordinates the agenda and the conversation. This is generally either a Staff Engineer, a Technical Program Manager, or a partnership of the two
Presenter who has written the Tech Spec being discussed
Notetaker who ensures notes from the discussion are captured
Attendees who share context, ask questions, and participate in the discussion. Some companies restrict attendance because too many folks attend and want to “demonstrate value” by asking questions, or unconstructively inject their personal preferences rather than prioritizing the organization’s perspective. Generally, I think it’s better to allow open attendance and give direct, firm feedback to those who attend unconstructively. If folks feel like they must attend to avoid bad decisions impacting their team, then you should probably consider creating more visibility into Tech Specs outside of this meeting, via either chat or email
Sponsor who provides organizational weight to the meeting through their participation. This is generally either the head of engineering, a Staff Engineer serving as the head of engineering’s right hand, or a manager reporting directly to the head of engineering

Is it working?

Some questions to ask when considering if your current Tech Spec Review is working:

Do you have Tech Specs coming in for review? If not, is it because the review isn’t useful? Is the review too intimidating? Are folks not sure how to submit new specs?
Are too many reviews coming in, such that feedback is slowing down execution? Is there a set of category-wide decisions you could make that would reduce the need for certain kinds of Tech Specs (e.g. auto-approve specs that use the common storage and compute tiers)?
Are reviews generally getting to the right decisions? Are the right concerns being raised, but getting rejected because the presenters don’t engage with feedback? Conversely, is it because the review lacks the necessary authority to succeed in your company?
Are discussions generally on topic? Do some participants routinely derail discussion? How could you prevent that pattern from recurring?
Do attendees enjoy attending?
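The auto-approve idea mentioned in the questions above could be expressed as a simple rule. This is only a sketch: the tier names and the `needs_synchronous_review` function are hypothetical, and a real policy would depend on what your organization considers its common tiers.

```python
# Hypothetical names for the "common" tiers; substitute your organization's own.
COMMON_STORAGE_TIERS = {"standard-sql", "standard-blob"}
COMMON_COMPUTE_TIERS = {"standard-container"}


def needs_synchronous_review(storage_tiers, compute_tiers):
    """Return True when a spec uses anything outside the common tiers."""
    uses_only_common_storage = set(storage_tiers) <= COMMON_STORAGE_TIERS
    uses_only_common_compute = set(compute_tiers) <= COMMON_COMPUTE_TIERS
    return not (uses_only_common_storage and uses_only_common_compute)


# A spec that stays on the common tiers can be auto-approved.
print(needs_synchronous_review(["standard-sql"], ["standard-container"]))
# A spec that introduces a new database still comes to the meeting.
print(needs_synchronous_review(["new-graph-db"], ["standard-container"]))
```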

Incident Review

Thu, 05 Jan 2023 07:00:00 -0700

I’ve never heard of a company with a business that doesn’t also occasionally have things go wrong. Something going wrong might turn into a support ticket, an angry email, or an alert popping up on an on-call engineer’s phone. If there is user or business impact, and an engineer might need to respond, then it becomes an incident.

After the incident, the folks involved in mitigation fill in an Incident Review Template, and that document is discussed in this meeting, the Incident Review.

Related Tools

Incident Review Template

Other Approaches to Incident Review

PagerDuty

Goals

Incident Reviews are a cultural carrier meeting for most engineering organizations. They are a rare meeting where you will see a wide mix of teams and seniority-levels arguing about something that the business cares about deeply: customer and employee impact. A well-run Incident Review helps new employees quickly understand how your culture works when things really matter.

An effective Incident Review facilitates these goals:

Foster and socialize learning about what caused an incident: incidents have a certain inherent rhythm, and the only way to change it is to ensure others are aware. The most valuable thing this meeting does is create awareness of what has actually happened in a given incident, which is the precursor to preventing a repeat
Surface missing context across teams and functions: customer success might mention an impact to users, an infrastructure engineering team might mention that the incident had a wider impact than initially recognized, a product engineering team might explain the business cost of delayed message processing
Inform investments on work that will best contribute to increased reliability: broaden an ongoing investment project to support a new edge case, cancel a previous mitigation effort based on improved understanding of the underlying issue, recognize that similar issues are repeating without being successfully addressed

Anti-goals

Because of this cultural significance, Incident Reviews also have a predictable tendency to become ideological arenas, and to attract participants with ideological goals about the right way to foster reliability, run reviews, etc. Your goal as the senior leader who owns this meeting is to prevent it from becoming an open ideological discussion forum, and to instead focus it on the specific agenda at hand.

Several patterns to be wary of:

Ensuring adherence to documented process: some review meetings become focused on driving adherence to the specified incident response or review process. That is valuable work, but ineffective to conduct in a large, learning-oriented forum. Instead, drive adherence before the meeting
Pedantic or status-oriented: a surprising number of incident discussions end up orienting around policing correct nomenclature rather than encouraging learning and growth. Effective reviews are progress-oriented, with practitioners who explain important context when additive, but don’t orient around policing correctness
Public performance of a one-person play: effective learning meetings don’t spend much time reading materials or reports out loud. The entire time should be devoted to discussion, perhaps with a short initial window for attendees to read the report. Learning is a group activity, whereas readouts are a solitary performance
Public performance of a two-person play: some meetings adopt a consistent chorus across sessions. A certain set of questions, e.g. “How did you first become aware of this issue?”, will be asked and answered at each session, consuming much of the time. That feels useful, but it implicitly silences the wider group, who are not able to contribute their context and encourage group learning

Finally, like any important, large meeting, there may sometimes be individuals who are more focused on their personal ideological goals rather than the meeting’s goals, and it’s your responsibility to either anchor them on the meeting’s goals or get them out of the meeting so work can be done.

Agenda, Scheduling, and Scaling

The agenda for every incident review is discussion of one to two individual incidents or a cluster of related incidents. The agenda should be decided one to two days ahead of the review, and shared out with attendees to allow them to prepare. Because most learning occurs in discussion, I recommend against trying to include more than two incidents (or one batch of related incidents) in a given session.

Run these on a weekly cadence, canceling ahead of time when there are no incidents to review.

If you start to have a backlog of incidents to review, then you have three options:

Batch related incidents if you have a cluster of incidents with shared contributing causes. For example, you might have a streak of incidents related to database instability caused by unindexed queries, which would benefit from one curated, joint discussion rather than treating each as an independent incident
Extend review time for one week to have more incident review bandwidth. This works best when you have a short-term spike in incidents. Generally speaking, it is an organizational smell to permanently extend incident review beyond an hour a week for a large audience, as it’s an expensive investment of time
Stop discussing lower severity incidents in the review. For example, only discuss incidents with “significant” customer or internal impact, coupled with a simple definition of what incidents would fall beneath the line
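Two of these options, batching and a severity cutoff, can be combined when building the agenda. The sketch below is illustrative only: the `Incident` fields, the numeric severity scale (1 is most severe), and the `build_agenda` function are assumptions, not part of any particular incident tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    id: str
    severity: int            # hypothetical scale: 1 is the most severe
    contributing_cause: str  # hypothetical tag assigned while triaging


def build_agenda(backlog, max_severity=2, max_items=2):
    """Drop incidents below the severity line, then batch the rest by cause."""
    eligible = [i for i in backlog if i.severity <= max_severity]
    batches = defaultdict(list)
    for incident in eligible:
        batches[incident.contributing_cause].append(incident)
    # Most severe batch first; larger batches break ties.
    ordered = sorted(
        batches.values(),
        key=lambda batch: (min(i.severity for i in batch), -len(batch)),
    )
    return ordered[:max_items]


backlog = [
    Incident("a", 2, "unindexed query"),
    Incident("b", 2, "unindexed query"),
    Incident("c", 2, "unindexed query"),
    Incident("d", 1, "bad deploy"),
    Incident("e", 4, "dashboard typo"),
]
for batch in build_agenda(backlog):
    print([incident.id for incident in batch])
```

Each entry on the resulting agenda is either a single incident or one batch of related incidents, which keeps the session within the one-to-two item limit described earlier.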

Roles & Attendance

There are five key roles in an Incident Review:

Facilitator who coordinates the agenda and the conversation
Presenter who filled in the Incident Review Template for a given incident
Notetaker who ensures notes from the discussion are captured
Attendees who share context, ask questions, and learn from the discussion
Sponsor who provides organizational weight to the meeting through their participation. This is generally either the head of engineering or the head of infrastructure. It is reasonable for the Sponsor to occasionally miss, but I believe it’s essential for them to attend the majority of incident reviews

The Incident Review’s goals, particularly around learning and surfacing missing context, encourage a wide audience of attendees. I recommend allowing anyone to participate so long as they read, and abide by, the meeting’s goals and anti-goals. Ensuring folks act in accordance with the meeting’s goals is a joint responsibility of the Facilitator and the Sponsor.

Is it working?

Some questions to ask yourself if you’re unsure if your meeting is useful:

Are they getting scheduled? If not, and that’s because you’re truly not having incidents, great! Conversely, if it’s because folks are not filling in the template, then dig into why not. Often these templates get overloaded with many questions to please many stakeholders, and consequently become difficult to use
Are key personnel attending? Particularly the sorts of folks who have important context to bring into the discussion. If the meeting is working, it should be an exceptionally high-leverage opportunity to grow the organization
Are the discussions resulting in a modified reliability strategy or roadmap? If these discussions are driving learning, then they should alter the shape of your roadmap
Do you enjoy attending?

Matthew Clarke

Wed, 11 May 2022 08:15:00 -0700

Interview in May 2022. Learn more about Matthew on his blog, Twitter, and LinkedIn.

Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.

I work at Spotify as a Senior Backend Infrastructure Engineer. My team builds and maintains the tools that enable Spotify engineers to deploy safely and quickly whenever they need to.

We work a lot with Kubernetes, which Spotify uses to deploy and manage most of its websites and backend services. Spotify runs some of the largest multi-tenant Google Kubernetes Engine (GKE) workloads in the world, so this is a large responsibility.

My team builds tools on top of Kubernetes to simplify and create a great developer experience. These tools involve developing and maintaining our deployment tools, aggregating error messages from different Kubernetes resources and displaying them through Backstage (our internal developer portal), supporting developers on Slack with questions they have or problems they’re running into with Kubernetes, and working on our Kubernetes plugin for open source Backstage.

How did you start doing infrastructure engineering work? How have the companies you joined, your location, or your education impacted your path?

I actually started as a software engineer focused on e-commerce; while there are a lot of interesting problems to solve in this space, I found the lack of direct interaction with end-users frustrating. People don’t really care how “well” their online payment is accepted as long as it goes through, so you don’t get valuable feedback often.

My role at the Financial Times was my first real taste of infrastructure engineering. It was a DevOps microservice role focused on identity and e-commerce. My team was responsible for provisioning cloud resources, writing applications, deploying them and monitoring them. There I learned a lot about AWS, Kubernetes, and Cassandra. We used lots of different languages so that we could experiment with what worked for us, including Python, Java, Scala, Node, Go and Elixir, but we mainly settled on Java and Go.

However, throughout all of my roles, I found I gravitated towards building developer tools, whether that was integrating two different build platforms at Cybersource/Visa, adopting Kubernetes at the Financial Times or changing to my current team at Spotify. One of the great things about infrastructure engineering is that you are sitting beside your users every day: they’re your colleagues, and you get to help make their lives easier and get instant feedback about what they like and don’t like.

I have always wanted to have a big impact at the companies that I have worked at, and there is no better way to have an impact than to help increase the productivity of all the other developers at the company. This is also why I love to contribute to open source. By contributing to open source, you can make an impact not just at your company but throughout the whole industry.

What dashboards and metrics do you personally use to stay aware of your software and team’s work?

I use Backstage a lot to keep track of the current state of my team’s services. Backstage provides integrations with monitoring, deployment, CI and tech docs all in one place.

Other than that, I keep track of the various deployment features we provide, such as test environments and automated canary analysis, to get a good idea of what features users find useful.

Recently we have been making the effort to try to quantify and visualize deployment toil, so that we can see if we are moving things in the right direction with our platform offerings.

What would happen over the next month if your infra org were all pulled away onto a secret project and couldn’t do their day to day efforts? Where would things slow down?

I think most things would continue along but probably not very efficiently, trending downwards.

We help developers at Spotify every day by answering their infrastructure questions, helping them get their services set up or debugging production issues, so there would be a lot of unanswered Slack conversations! We are also continually scaling our systems out behind the scenes to continue to support an ever-growing number of users and artists.

Infrastructure engineering organizations have a lot of priorities. A few years ago I tried to define an overarching set of infrastructure priorities and came up with: security, reliability, usability, leverage, cost and latency. Of course, folks immediately started arguing I’d defined the scope too narrowly. How do you figure out what to prioritize working on?

This is a very interesting question. We get a lot of feature requests and feedback, but it is impossible to do everything. We try to focus our time on work that will have a wide impact, usually defining this by how much “toil” we can prevent. Toil for us is users making infrastructure changes or tweaks that should be automated or happen behind the scenes without their interaction. An example of this would be our effort to automate migrations: make the goal clear to users, and provide the tools to perform a migration with as little overhead as possible.

Related to priorities, one topic that I’ve had come up a few times recently is the idea of “Shadow IT”, where other organizations bootstrap an infrastructure project without your knowledge, and then ask you to take over running it once it becomes a burden. How do you deal with other teams asking infrastructure to take over their projects once they’re no longer fun (or often when the original implementer leaves the company)?

Something my team has been struggling with recently is the sheer number of systems and tools we own. Some of these might have been transferred to us like you mention above, but we give the benefit of the doubt and assume it was the best decision the implementor could have made given the information they had at the time.

Still, you can’t support a limitless number of systems and tools. Therefore the questions my team asks are:

Do we really need this tool? / What value is it bringing?
Are we the best people to be supporting this tool? / Instead of supporting this tool could we be doing something more important?

If we can’t justify the tool existing then it is a good candidate for deprecation. If it is valuable but we aren’t the best people to support it or could be working on something more important then perhaps we need to find a new owner, either another team internally or a managed version of the tool.

What’s the single most impactful project you’ve heard of an infra engineering org doing? Why? Was it obviously impactful beforehand?

I would say Backstage fits the mold for this. Before Backstage, different infrastructure teams at Spotify would create their own user interfaces, which was not very efficient. Engineers had folders full of bookmarks and infrastructure engineers would toil away solving problems that other teams already had solutions for.

When Backstage came along the benefit was clear: developers had one portal for all their infrastructure needs, they could search Backstage for docs, datasets, teams, services and runbooks. Infrastructure engineers could embed their interfaces in Backstage and benefit from the large library of utilities and React components the Backstage maintainers had created for common use.

This lightened the load for all the engineers at the company and ultimately improved developer productivity, which is the ultimate goal of an infrastructure engineer.

Your current work focuses heavily on Kubernetes. This is a technology that has an outsized impact on the technology industry, and over the last six years has grown from something perceived as a toy into something widely used at scale. Where do you see the future of Kubernetes going?

Kubernetes is an open-source success story. It is great to see the industry rally around it as a project, including building incredible tools on top of it. Initially, it seemed like the only benefit was container orchestration. However, now we can see the additional benefits of extensibility, which has pushed Kubernetes beyond just containers.

In the future, I’m excited to see where the community goes with handling multiple clusters and whether some patterns emerge there. I also think there will be an emerging trend of workload clusters vs infrastructure-as-code clusters; some Kubernetes clusters will be used to manage your infrastructure through tools like Crossplane, and others will be where your services run.

I also hope we continue to see Kubernetes tools evolve to address the needs of service owners who have services running inside multi-tenant clusters and not just the administrators of the clusters.

Ok, excluding Kubernetes, but are there other technologies or tools that you see advancing the field in a similar way? What about technologies or tools, other than Kubernetes, that you believe will meaningfully advance the field over the next decade?

While I am a contributor, I do think Backstage has the potential to change how developers interact with their infrastructure and allows them to better focus on their code. Backstage has grown from an internal tool at Spotify to an open source CNCF Incubating project with hundreds of adopters and contributors, dozens of tool integrations and several commercial ventures using it as the basis of their products. The ability for a developer to have a single view of the entire software ecosystem at their company, including monitoring, docs, CI/CD and runtime, has been incredibly valuable at Spotify, and I think other organizations are discovering this too.

I am also very excited about eBPF; quite a few different tools are emerging that could enable language-agnostic service-mesh-like features in a microservice environment built on top of it. I like the idea of a service mesh that doesn’t require a sidecar proxy, which has latency and cost overheads. However, I think it still has a pretty steep hill to climb to rival some of the proxy-based service meshes out there.

What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?

I learned a lot from Sarah Wells when we were at the Financial Times; we embarked on a Kubernetes migration fairly ahead of the curve; Sarah gave a great talk on our migration (which is probably why it has been on the Kubernetes homepage for four years now!).

I love to read; some of my recent highlights have been: Network Programming with Go by Adam Woodbeck, Effective Python by Brett Slatkin, A Philosophy of Software Design by John Ousterhout and, of course, Staff Engineer by Will Larson.

I follow quite a few blogs, but the most valuable personally has been Last Week in Kubernetes Development. It can be tough to follow the current development of the Kubernetes codebase as it is such a moving target; this blog summarizes the interesting PRs, merges, deprecations and news, which makes that task a bit easier.

Mahdi Yusuf

Tue, 03 May 2022 14:00:00 -0700

Written interview in May 2022. Learn more about Mahdi on his website, LinkedIn, and his StaffEng podcast interview.

Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.

I am currently a Senior Staff Engineer at 1Password, leading the Server Architecture team. We are involved in our systems’ overall design while pushing for the modernization of legacy systems.

The work encompasses everything from our overall system reliability to a few core components like queues, workers, and data stores. We also spend a decent chunk of time maintaining foundational libraries and service scaffolds that are used throughout the company.

Generally, this includes most of the non-product engineering work.

What dashboards, metrics, and forums do you personally use to stay aware of your organization? Is there a different answer that you would be more proud of? What’s preventing you from that answer being the current answer?

Currently, we are a Datadog shop for our dashboard and metrics. We now use collaborative Datadog notebooks when discussing/investigating new initiatives. We also use Kibana for logging and Bugsnag for error tracking.

I would like to see something that could cut across all three of those places to get a real sense of what is happening across our entire system, without having to jump from platform to platform. One tool to rule them all. The more data sources you can synthesize, the better your understanding of your system can be.

I have been an avid user of Grafana, which delivers on the premise above. It integrates metrics, logs, and traces all in one clean interface.

There were various considerations around sticking with Datadog. In addition to the cost of moving, there was the idea of who would keep this running. I am happy to see there is a managed Grafana being offered by Amazon now. So we may revisit this, when we have more time.

What would happen over the next month if 1Password’s infrastructure org were all pulled away onto a secret project and couldn’t do their day to day efforts? Would the company still run?

Depending on when you ask that, it can vary. But honestly, as much as I would like to say things would grind to a halt, they wouldn’t. It’s a constant effort, but we are always trying to make sure no team is in a position to bring things to a complete halt.

If the infrastructure organization were utterly gone, progress on tasks that have payoff farther in the future would lag behind the rest of the organization’s efforts, eventually impacting the broader organization.

The way I like to look at this work is as necessary investments we need to make today for the future progress of the entire engineering organization. So it’s a constant trade-off with many factors that come into play.

I have never seen this effectively work without dedicated teams focused on issues in the production systems. A new product feature usually trumps fixing something that isn’t a problem…yet. How long and at what speed the company would still run are probably more pertinent questions.

This is something I have been thinking about lately. If I was pressed to really get into my gut and define these prioritizations, it would be tricky but let me try here. Frameworks are great general guidelines when you don’t have context. Still, most of these decisions depend on the organization’s willingness to make said priorities happen and stick with them to see them through to completion.

Given that, I primarily focus on desired outcomes and slowly put problems behind us. Some of these classes of issues come back in various forms (see: scaling and migrations).

Also, knowing you can’t solve them all quickly, let’s get to the actual job of prioritization.

The first thing you need to identify is the severity of these problems. There are classes of problems that you can live with and others that will only get worse if they aren’t given the attention they need. The problems in the latter group aren’t usually a problem today, but if left alone they can be limiting in some way in the future.

Keeping the organization as agile as possible is essential in this regard. I might be conservative, but I always pay off the compounding debt first. Software systems change, but teams always build on top of what is there today.

If a problem has more or less the same impact on the organization six months from now as it does today, it goes down my list of importance. However, if it gets worse as time goes on, it moves higher on my list of importance. This is when compounding is working against you.

Now let’s talk about when compounding is working with you. Suppose something makes each of my engineers lose an hour a week, just one hour. If I eliminate that, I have just saved the company 200 hours a week and reduced toil in the process. These classes of problems aren’t the ones that usually get worse with time; these are typically focused on developer velocity and usability.

So there you go, another framework.
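The two halves of that framework can be sketched in a few lines. Everything here is an illustrative assumption: the `Problem` fields, the impact unit (engineer-hours lost per week), the example problems, and the headcount of 200, which is only what the quoted 200 hours at one hour each implies.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    impact_today: float          # hypothetical unit: engineer-hours lost per week
    impact_in_six_months: float  # estimated impact if the problem is left alone


def growth(problem):
    """How much worse the problem gets if left alone for six months."""
    return problem.impact_in_six_months - problem.impact_today


def prioritize(problems):
    """Compounding problems first; problems with flat impact sink down the list."""
    return sorted(problems, key=growth, reverse=True)


def weekly_hours_saved(engineers, hours_lost_each):
    """Compounding in your favor: a small fix multiplied across every engineer."""
    return engineers * hours_lost_each


problems = [
    Problem("slow local test setup", 200, 200),
    Problem("database nearing capacity", 10, 400),
]
print([p.name for p in prioritize(problems)])
print(weekly_hours_saved(engineers=200, hours_lost_each=1))
```

Note that the slow test setup costs more today, yet the database problem sorts first because its impact grows; the flat problem is the kind of fix whose payoff is the multiplied hours saved.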

You can always say no. Use this one sparingly; often, you will eventually be working with that team in the future, and the road to fame and riches is long.

Frequently these systems are necessary but not in active development. It usually isn’t that bad if you can have some time for hand-off and transition it slowly. Documentation here can be worth its weight in gold. Knowing where the bodies are buried is helpful when things eventually go wrong.

There are always teams that get overburdened with these services with no owners. The burden is much like peanut butter: it’s better when spread around. There is always a team that is the best fit for said service.

Like Spike Lee said, “Do the Right Thing.” If the team is overburdened, you can always assign more headcount to the team.

I will say that leaving these services without clear ownership is a poison pill for your organization. People will shirk responsibility, and zero effort will be put towards these services, sometimes out of mere spite. It is better to assign the service to a team that won’t prioritize it than to give it to no one.

At times I have run into a belief that infrastructure necessarily conflicts with productivity: e.g. we have to reduce productivity to increase reliability. Have you seen a tension between infrastructure and product engineering productivity? Are there ways to reduce that tension?

Absolutely! Measuring it is something you should try to be doing. For example, can you measure how long it takes merge requests to get through review? How long are RFCs in the review state? How many regressions are we seeing after deploying a new piece of infrastructure?
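As a minimal sketch of the first of these measurements, the snippet below computes the median time merge requests spend in review. The `median_review_hours` function and the sample timestamps are made up for illustration; in practice the pairs would come from your code review tool's export or API.

```python
from datetime import datetime
from statistics import median


def median_review_hours(merge_requests):
    """Median hours between a merge request being opened and being merged.

    `merge_requests` is an iterable of (opened_at, merged_at) datetime pairs.
    """
    durations = [
        (merged_at - opened_at).total_seconds() / 3600
        for opened_at, merged_at in merge_requests
    ]
    return median(durations)


# Made-up sample data: three merge requests taking 4, 24, and 8 hours.
sample = [
    (datetime(2022, 5, 2, 9, 0), datetime(2022, 5, 2, 13, 0)),
    (datetime(2022, 5, 3, 9, 0), datetime(2022, 5, 4, 9, 0)),
    (datetime(2022, 5, 4, 10, 0), datetime(2022, 5, 4, 18, 0)),
]
print(median_review_hours(sample))
```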

These things can worsen if infrastructure engineering is too prescriptive without understanding the underlying product work. Embedding infrastructure engineers into product teams can help here. But, again, it’s mostly balancing priorities/perspectives and communicating clearly.

The benefits of embedding are twofold. Infrastructure engineers get first-hand experience of what is slowing down product engineers, which they can take back to their team to improve things, and product engineers get some visibility into how these processes improve reliability in production.

I believe in supporting product engineers to deal with (read: empower to resolve) most of the issues their code causes in production and support them if they need help. But unfortunately, overzealous product engineers create debt faster than they develop products.

Most of this tension usually comes from not getting feedback in the correct stages if you cannot embed engineers into product teams. Writing design specifications can be fantastic and let’s most of the discussion occur before the rubber meets the road.

Ideas are quickly redrawn, maybe even code; architecture and infrastructure, not so effortlessly. Where would you want to give constructive feedback?

There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?

This is true, but data shall set you free. So it’s essential to capture why you are doing something and what you think it will improve. Then follow up with either data or people you impacted with that change.

It’s all about outcomes. If you can’t track those with either data or teams you impacted positively through your efforts, you should probably rethink them. If you are doing this effectively it shouldn’t be too hard to articulate. You are often left to synthesize where engineering has underinvested and figure out if anyone cares.

It’s important to understand that as software systems grow and more people start working on them, they become more complex. Unfortunately, that complexity arrives like a thousand cuts over time, so each change is easy to overlook.

Making sure you don’t succumb to these changes is essential. But unfortunately, I am sure most infrastructure engineers have been in the position where something they wanted to work on was minimized and deprioritized, only to have priorities quickly change when things go splat. Understanding risks and tying those to straightforward trade-offs is vital to communicating with leadership.

What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?

I have found Twitter in general to be a great resource throughout my career. I have met tons of people and learned so much. I often recommend LeadDev to new leads because they have outstanding resources. I am also a big fan of Neal Ford’s works around software architecture. I am also working on something new here called architecturenotes.co where we break down system design with the people that built them. I think this audience would get a kick out of it.

Shawn Wang / swyx

Mon, 11 Apr 2022 08:00:00 -0700

Interview occurred in February, 2022. Read more from Shawn on his blog, twitter, and his book, The Coding Career Handbook.

Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.

I’m currently Head of Developer Experience at Temporal.io, an open source workflow engine for long running, durable processes powering companies as small as 2-person YCombinator startups, to enterprises as large as Stripe, Snap, Datadog, Netflix, Doordash, etc. We are generally responsible for improving the experience of “front line” individual contributor developers, covering their end to end journey from first contact (DevRel) to learning (Docs) to API Design (SDKs) to ecosystem (Community).

The basic insight is that companies ship their org charts (Conway’s law), but developers don’t care what team shipped which when they go through your product, so it makes sense to have someone whose job it is to coordinate and build out developer-facing efforts cohesively.

In Developer Exception Engineering, you wrote a bit about the slipperiness of defining “developer experience,” and how it often varies significantly across companies. How would you explain developer experience to someone unfamiliar with the role?

Developer Experience (DX) is a buzzword at this point, so naturally everyone is co-opting it to represent their particular view of the world, making it extra confusing for anyone that just wants a straight answer. But I’ll give you my best shot here.

At the highest level, the basic dichotomy to be aware of is “internal” vs “external” DX. Most people who come up within one of these two branches may be completely unaware of the other, which contributes to the confusion when people discuss “DX”.

Internal DX teams focus on developer productivity within a company (sometimes called “dev infra”). The math is simple - if you have 50 engineers, and you think it’s possible to improve their productivity by >1% a quarter, then you would be silly not to invest in 1-2 engineers who don’t work on product, but just focus on making everyone else more productive. Scale this up to a 1,000-engineer company and you now have a whole Internal DX org to play with. Such an org can span a wide range of deliverables, from build/test automation to dev environment to code quality. The clearest mental model for identifying Internal DX opportunities I’ve come across is Netflix’s Productivity Engineering team, which is responsible for three major components - from new hire to productive local dev (their Bootcamp and bootstrapping tool, NEWT), from local dev to production (their “build-bake-deploy paved road”), and then from production back to dev (their observability tools like Atlas). The other popular taxonomy to work with is the four Accelerate metrics. Both of these approaches essentially divvy up the software development lifecycle into meaningful chunks, which can then be independently and tangibly improved by internal DX teams.
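To make the “math is simple” claim concrete, here is a minimal sketch of the arithmetic. It assumes the quarterly gain compounds and persists, which the answer above does not specify; the function name is hypothetical and the figures are just the 50 engineers and 1% from the example.

```python
def engineer_equivalents_gained(engineers: int, quarterly_gain: float, quarters: int) -> float:
    """Extra engineer-equivalents of output after compounding a per-quarter gain.

    Assumption: each quarter's improvement sticks and compounds on the last one.
    """
    return engineers * ((1 + quarterly_gain) ** quarters - 1)


# 50 engineers, 1% improvement per quarter, sustained for four quarters.
gain_after_year = engineer_equivalents_gained(50, 0.01, 4)
print(round(gain_after_year, 2))  # 2.03
```

Under those assumptions, a year of 1% quarterly gains is worth roughly two engineers of output, which is in the same range as the 1-2 engineers invested.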

External DX teams focus on improving developer adoption/mindshare/productivity at other companies. Where almost any software company can have Internal DX, it only really makes sense to have External DX if you make something for developers. This means it is a natural fit for devtools companies, but you might be surprised at what companies invest in this. Are Spotify, Notion and Slack devtools companies? No… but they all offer APIs for developers! So they all have DevRel teams. The distinction between DevRel (also known as Dev Advocacy) and DX is another common question. On one hand, traditional DevRel is very heavy on content creation (blogging and speaking, basically, but also demos and workshops), whereas DX has more of a mandate to write (non-core) code and docs to solve problems. I first transitioned from DevRel to DX at Netlify, where eventually it formally covered Advocacy, Integrations, and Documentation. The exact coverage will naturally differ based on the product - for example, Netlify is a closed source SaaS platform, so Advocacy plays a bigger role, whereas Temporal is an open source client-server system, where equal love needs to be given to Community and Docs.

A quick aside for those who often hear DevRel vs DX conflated: DX is supposed to be the superset, but frankly, the lion’s share of DX is still DevRel, for both economic and historical reasons. Economic, because most developers know how to build product, but are terrible at building distribution, so a DX team often contributes the most value by reaching developers despite all the other things on its plate. Historical, because the DevRel to DX transition is a once-in-a-lifetime career upgrade for Dev Advocates to have more impact, just like the Sysadmin to DevOps transition. It all makes sense once you consider that Dev Advocates speak the most to users, but usually have the least power to make fundamental changes to solve their pain, particularly those I term “Developer Exceptions” in that blogpost. Blogposts and talks have a half-life far shorter than docs and tooling/product improvements.

Once you’ve marinated in the various aspects of DX enough, the distinctions start to re-blur once you consider that Internal DX just serves internal customers (and needs to invest in docs and advocacy too), and External DX serves external ones (and needs to tangibly improve productivity too). Both roles require a great deal of empathy with developer problems, and an expansive mental catalog of ways to solve them. Yet the ultimate relevance of either to the outside world matters only to the extent of a typical build-vs-buy decision. Don’t get too hung up on precise definitions in an inherently fuzzy and still-moving field.

When we first discussed this interview, you asked if your experience would be interesting to folks focused on infrastructure engineering. I’ve increasingly come to believe that Developer Experience is a core competency for all folks developing infrastructure software or working on infrastructure projects like large-scale migrations. Should infrastructure teams consider Developer Experience as a core engineering competency? Any ideas why they often don’t?

It’s funny, even though I do DX at a company that serves Infra engineers now (and Temporal enables Infra engineers to offer a dramatically better developer experience to product teams by providing “reliability on rails”), I had never ever viewed it as something Infra engineers themselves should regard as core. For sure, the Derisk-Enable-Finish cycle in that article on Migrations leans on many of the same skills as DX teams - advocacy, docs, tooling. But I’m loath to recommend that it should be “core” in all contexts, because (as we discussed earlier) DX is so broad and hard to define, and I’m always skeptical of people hawking their pet topic as mandatory. A bloated definition of “core” defeats the purpose of defining a “core”.

What I will say is that I think most Infra Engineers could do with more developer empathy, which in most situations simply means putting themselves in the shoes of people with less context and knowledge than them and proactively helping them out by any means necessary. If you do it right, then yes, the developer experience of your users will be better because you took the effort, but it should be done not for altruistic “let’s make them happy” reasons, but rather, selfish ones: your efforts will be more successful if they feel more successful.

Why don’t more infra teams invest in Developer Experience? Honestly, probably because there’s no cultural expectation for them to. It’s common for infrastructure teams to get consumed by the loudest issues surrounding them like incidents and infrastructure costs such that they end up much more focused on their obligations to computers than their obligations to other engineers.

What are the top three tools or techniques that you use in Developer Experience that infrastructure engineering teams should consider adopting?

Journey Mapping: exhaustively enumerate every concept, system, or API capability your user should know (thereby letting people know what they don’t know). Pick 2 main axes of concern and map them out in 2D space - clustering related concepts together. Draw a small core of “must know” concepts where everyone should start (letting people know what they don’t need to know). Identify and highlight FAQs. Then let them find their way based on their needs. This contrasts with a “one size fits all” linear path. (see example)
Pitch Sizing: Be prepared to explain/define your system in one sentence (pique interest), one paragraph (by desired requirements or by pitching the problem), a 10 minute presentation, or a 30 minute demo. Logical/technical arguments are best supplemented with Cialdini persuasion principles. Practice this when you don’t yet need it (eg at internal demo day/lunch) because you will be called upon to do it at the most unexpected times for the most high leverage reasons.
Two phase commit: Knowledge is transferred as both discrete particles and continuous waves. Concretely, some of your users will want a monolithic organized reference, and others will just want diffs. One example rule that implements a “two phase commit” of knowledge - Every feature update should be communicated via a changelog and a doc/wiki update (and, for more impactful updates, a tweet, slack message, blogpost, talk…).
(Bonus) Events: Learning to throw events that people look forward to and enjoy participating in is a huge multiplier on existing DX efforts. (see Community Annealing)

Infrastructure engineering organizations have a lot of priorities. A few years ago I tried to define an overarching set of infrastructure priorities and came up with: security, reliability, usability, leverage, cost and latency. I imagine this is at least equally true for DX teams. How do you figure out what to work on given the wide range you could prioritize?

I think about DX work in terms of concentric circles radiating out from the core product, matching the maturity of the product:

When the product is still being shaped, there is no better time to give feedback on API design.
As the product approaches fully baked, I shift my attention to Docs.
After shipping the product with a complete set of docs, I shift to Content (Advocacy) to get users and to spell out and elaborate whatever doesn’t tonally or structurally fit in docs.
Users come for the content, and stay for the Community, so I start investing in getting to know them, helping them in their adoption, and helping them find, build for, and hire each other.

On and on pushing outward when we can, but looping back inward whenever a new feature or product is launched or a new problem is found.

All of these efforts should be coordinated with the same “map” I described above - shared terminology, shared understanding of core concepts, and a shared reality of neighborhoods and landmarks. However they are not equal in all contexts, because inner circles tend to have higher long term impact (the best docs are the docs I don’t have to read because the product teaches me as I go, the best blogposts are the blogposts I don’t have to look for because the documentation was good enough, etc.), but outer circles have more reach.

What I’ve described is from my experience in my sweet spot at early stage, Series A-C devtool startups, where each program is usually a singleton, but there are advanced versions of this at the larger companies too:

Every SDK can always have more languages and devtooling.
Every conference can be replicated across the major continents.
Every docs effort eventually morphs into a “University” or a certification/education program.

At AWS scale, we also layered the DX circles with language, geographical, and business vertical dimensions. If you wanted a Chinese speaking Telcos specialist in Australia, we had someone for that.

Folks working on infrastructure engineering often have a specific dashboard they look at every morning to get a sense of how the software, system, and organization is operating. Do you have a similar dashboard? What’s on it?

We use a mix of internal BI tools for lagging indicators (active clusters, SDK version adoption) and Common Room for leading indicators (open source and community activity). As long as everything is trending up on a trailing 2-3 month basis I’m not too concerned about checking it every day. Considering that it takes >10 touches for the average person to go from first contact to seriously interested, the natural frequency of consideration cycles makes for extremely long feedback loops.

This is further confounded by the extremely non-ergodic nature of the open source enterprise customer, where one large customer can be worth 5 orders of magnitude more than the median, and take anywhere from a month to two years to convert to a customer.

Most DX metrics are better regarded as a health check that things aren’t broken, rather than proof positive that things are actually working well. If you need more specifics, I’ve received very positive feedback on my piece on Measuring Developer Relations.

OK, I’m going to start turning the conversation towards Temporal for a bit. In every infrastructure team I’ve worked on there’s a team focused on supporting services that offer an API, but it’s often only much later that there’s any support for workflows (by which I mean scheduled, periodic or event-driven tasks) outside of something like Airflow for batch processing. How did Temporal decide to focus on a workflow engine?

There was no decision so much as it was a lifelong obsession born out of decades of distributed systems experience at scale, and solving the same problems over and over again. My basic insight is that everyone converges on the same requirements for reliability, observability, and scalability in their systems, but the tools we have are too low level, so everyone handrolls (poorly) their own distributed system out of these tools. Eventually, large-enough companies build their own workflow engines to slow the wheel-reinvention.

Our cofounders had been working on various iterations of messaging services and workflow engines for the prior ~20 years, at AWS, Google, Microsoft, and finally Uber. They created Temporal’s precursor at Uber, which became a full-time job as the number of applications using it ballooned to 300 in 3 years. This work was open sourced and similar growth was seen at Hashicorp, Coinbase, Airbnb, Doordash, etc. Finally, demand for a hosted solution was so strong, and the Uber-specific tech debt was so high, that they forked the project to start Temporal. So at every step of the journey the market demand drove the next phase of adoption, rather than any one decision.

Temporal is at once a 2 year old startup and a 20 year old team in this sense; and having that much big tech and open source validation gave us a lot of conviction that the industry is hugely underappreciating the use cases of a workflow engine beyond simple scheduled jobs. There’s a bunch of hypey hyperbole thrown around: “distributed system in a box”, “reliability on rails”, “distributed application state platform”, “a new computing primitive”, “service mesh for long running operations” - all of which are true depending on your point of view.

In listening to Stripe’s talk on Temporal and Netflix’s talk on Temporal, both mention writing their own SDK wrapper on Temporal’s SDK. Is it a good or bad sign when your users routinely wrap your SDK?

It’s easy to map out the pros and cons:

It’s good in that it validates that we solve a hard enough problem that people wrap us rather than build us… for now. And it gets us users that closed source SaaS and inextensible “No Code” platforms would not.
It’s bad in that it means our users have a built in facade that makes it easier for them to move off us in the future.
It’s good in that both Stripe and Netflix talk about their wrappers solving company specific problems and providing good defaults for their intended users, that we can later absorb into core once validated enough in “userland”.
It’s bad in that we don’t do some things for them out of the box… yet.

Ultimately I view “being wrapped” as a natural, net positive outcome of any valuable enough devtool. The best thinking I’ve come across on this is Kevin Kwok’s view of platforms vs their ecosystems:

Usecases that are high impact and generally useful should be solved by us, whereas usecases that are lower impact and very specific should be solved by wrappers. We would look to our growing ecosystem to help solve high impact, high niche usecases, and investing in an open source community directly contributes towards this long term advantage.

Since Stripe moved to generating SDKs from their OpenAPI spec, I’ve started to suspect that SDKs are better interfaces to expose to users than APIs themselves. Are SDKs better user-facing interfaces than APIs?

This is very close to my heart! The simple response is that yes, if you can afford to, offering an SDK (or a CLI, by the way) generally provides a better developer experience than just the raw API. The basic argument is that if you don’t provide an SDK, the developer will eventually have to build one for themselves for anything of sufficient complexity. There are a number of problems that can only be solved at the SDK level, including providing more specific types or type inference, inline documentation/autocomplete, and mocking out the API for testing.

However a poorly implemented SDK can also introduce an extra layer of potential bugs and performance issues, constrain advanced users, cause uncertainty about exposed classes and data structures, and add complexity to versioning/upgrades. In scenarios like these, being able to “drop down” a layer to the underlying API is crucial and the platform should not actively obstruct that.

One should also distinguish between “Fat” and “Thin” SDKs. “Thin” SDKs are simple, 1:1 language wrappers over APIs, the kind that can be generated from OpenAPI. “Fat” SDKs do more, often managing state (e.g. AWS’s AppSync SDK creates a local replica of your DynamoDB backed database, and handles offline sync and merge conflicts), or allowing plugins, or as Temporal’s SDK does, offering a deterministic sandbox which can replay events through your code for failure recovery and durable async functions.

In short, the opportunities for “Fat” SDKs to improve developer experience well beyond simple RESTful APIs are greater, at the cost of more engineering (and docs, and support…) to maintain them. Tradeoffs everywhere!
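The “Thin” versus “Fat” distinction can be sketched in a few lines. Everything here is hypothetical and for illustration only: the `/customers/{id}` endpoint, the `Customer` type, and the `Transport` callable are assumptions, not any real vendor’s API. The fat client’s state is a simple read-through cache, far simpler than the offline sync or deterministic replay mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical transport: takes (method, path) and returns a decoded JSON object.
Transport = Callable[[str, str], Dict[str, object]]


@dataclass(frozen=True)
class Customer:
    id: str
    email: str


class ThinClient:
    """Thin SDK: a 1:1 wrapper over the API that adds types but holds no state."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def get_customer(self, customer_id: str) -> Customer:
        raw = self._transport("GET", f"/customers/{customer_id}")
        return Customer(id=str(raw["id"]), email=str(raw["email"]))


class FatClient(ThinClient):
    """Fat SDK: layers client-side state on top (here, a read-through cache)."""

    def __init__(self, transport: Transport) -> None:
        super().__init__(transport)
        self._cache: Dict[str, Customer] = {}

    def get_customer(self, customer_id: str) -> Customer:
        if customer_id not in self._cache:
            self._cache[customer_id] = super().get_customer(customer_id)
        return self._cache[customer_id]


# A fake transport stands in for the network, which is the "mock the API for
# testing" benefit: no HTTP call is made.
calls: List[Tuple[str, str]] = []


def fake_transport(method: str, path: str) -> Dict[str, object]:
    calls.append((method, path))
    return {"id": path.rsplit("/", 1)[-1], "email": "dev@example.com"}


thin = ThinClient(fake_transport)
fat = FatClient(fake_transport)
thin.get_customer("cus_1")
thin.get_customer("cus_1")  # thin client hits the API every time
fat.get_customer("cus_1")
fat.get_customer("cus_1")   # fat client serves the second read from its cache
print(len(calls))  # 3
```

The cache also shows the cost side of the tradeoff: the fat client now has to answer questions about staleness and invalidation that the thin client never raises.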

In The Self Provisioning Runtime, you modify Alfred North Whitehead’s quote saying, “Developer Experience advances by extending the number of important problems our code handles without thinking of them.” That quote gets at the long-term promise of cloud providers, which are slowly making important problems invisible for many users, e.g. my experience is that general awareness of networking is significantly lower than it was a decade ago, which I attribute largely to cloud adoption. In some ways I see Temporal as competing with cloud providers’ own workflow engines. How do you think about competing with the cloud?

I’m excited by it. We certainly have to take it seriously, because Temporal is MIT-licensed, and there is nothing stopping Amazon or Azure from hosting us as a service tomorrow. But at this point dozens of open source companies have faced that threat and survived - by relicensing, and by serving their customers better than the big clouds can. On one hand, this is intimidating, because Amazon theoretically has infinitely more resources to crush us. On the other - I’ve worked at Amazon and seen how hard it is to push through the absolute mountain of conflicting priorities and legacy tech to get anything done compared to tiny startup teams with a fraction of our funding.

This is why topics like developer experience are so important - there are so many more dimensions to building a successful developer infra business than just the commodity operation of software - but I am actually most excited about outcompeting the big clouds by better product strategy and better network effects as those are sustainable and compoundable wins.

I can’t be too specific here but consider how Snowflake made an independent case for itself by being the “Data Cloud”, Cloudflare is doing the same for the decentralized cloud, and Stripe is doing so as payment rails for ecommerce. All are justifiable market positions that the big clouds will not/can not tackle given their current strategies. Temporal happens to occupy a very nice space:

managing lightweight application state, not egress-heavy data
having a small well defined contract with every mission-critical microservice in your company and others’, and
being generally agnostic as to whether humans or machines complete tasks in a given workflow.

I think every startup that competes with big clouds (read: every ambitious Infra startup) will need to carve out a space on which they are the undisputed independent source of truth, at least until the $100 billion valuation stage when the metagame changes once more.

What’s the single most impactful project you’ve heard of an infrastructure engineering team working on? Why? Was it obviously impactful beforehand? Stripe’s Sorbet is an example of a discrete project that I found surprisingly impactful.

Probably the sharding system that became Vitess at YouTube. YouTube is quite simply the biggest social video platform on Earth today, but it faced a horde of well funded competitors in 2005-2010. YouTube was experiencing 2 outages a day due to the extreme load, and could have gone the way of Friendster if those performance issues continued. No Vitess, no YouTube.

Vitess made MySQL scalable for YouTube, then its open sourcing helped Hubspot, Slack, Pinterest, GitHub, Square and more. If database infra counts, then I’d be hard pressed to think of a more impactful project.

What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?

I keep a list of resources for DevTools and Dev Rel/Dev Community in my own space because the list of resources is long for such a young field. Special shoutouts to Beyang’s Guide to Devtools and Mary Thengvall’s Devrel Resources, and to Kelsey Hightower for getting me started Learning in Public and going down the Developer Advocate career path. Scott Hanselman is also a huge mentor to me, both as an early reviewer of my book and through his inclusivity, his ability to make anything in the Microsoft ecosystem interesting, and his ability to cross over into newer platforms like TikTok!

Smruti Patel

Sun, 10 Apr 2022 07:00:00 -0700

Written interview in early February, 2022. Learn more about Smruti on linkedin, twitter and on LeadDev.

Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.

I currently lead the Data Platform group at Stripe – we operate the centralized data lake, and the big data, async & stream processing infrastructure for Stripe’s mission-critical business, while ensuring security, reliability and efficiency. Essentially, we support Stripe’s core money movement & storage, financial reporting & analytics products for our merchants, and empower ML infra to build credit, fraud & risk intelligence.

Prior to my current role, I also led the LEAP organization, which stands for Latency, Efficiency, Access & Attribution and Performance - my vision here was to take those small steps needed to unlock the giant leaps for both our engineering organization internally, and our users using Stripe. To enable that, we developed cross-functional strategies and tools for optimizing our cloud spend, and lowering Stripe’s end-to-end latency through performance tuning.

What dashboards or metrics do you personally use to stay aware of your organization’s work? How often do you check these?

I am one of those personality types, who is facts-oriented, analytical, and leverages data to draw patterns and drive decisions. So yes, metrics are my jam!

I’ve been leading teams for over a decade now. In this period, I’ve learnt that engineers don’t lack motivation. They are here to do their best work. Intrinsic motivation talks about a healthy balance of autonomy, competence and purpose. Let’s assume you have solved the hiring problem. You’ve built an inclusive team of highly skilled engineers with the right domain expertise. We’ll also assume that your management and leadership practices lean toward a healthy culture. A culture which provides the right blend of growth mindset, radical candor and psychological safety for individuals to thrive. So competence and autonomy are more or less solved, but how do we as leaders, then address purpose, the northstar, the why?

That’s where it’s important to think about opportunity cost! We have finite resources, and doing X implies not doing Y. For any software-driven company, our engineering talent, with their productivity, efficiency and impact, is our highest leverage. Hitting the right product-market fit can be extremely time sensitive. The opportunity cost, therefore, of going down a potentially wrong path can be significant.

And so, you need a high fidelity OODA loop to observe, orient, decide, act and react to feedback! And that’s where I leverage metrics heavily to measure and debug engineering velocity:

Precision - What are you shipping and why?
Speed - How frequently are you able to ship?
Quality - What is the failure rate or quality of your software?
Impact - How well does it achieve business goals?

For LEAP, our impact metrics were around measuring overall cloud spend as a function of business volume, or the tail latency (the p99.9) of the most important ChargePath API.
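As a rough illustration of how two such metrics could be computed, here is a minimal sketch. It is not Stripe’s implementation: the function names are hypothetical, the nearest-rank percentile definition is one of several in common use, and all figures are invented. The percentile is passed in per-mille (999 for p99.9) so the rank calculation stays in integer arithmetic.

```python
from typing import Sequence


def tail_latency(samples_ms: Sequence[float], per_mille: int = 999) -> float:
    """Nearest-rank percentile; per_mille=999 means p99.9.

    Returns the smallest sample such that at least per_mille/1000 of all
    samples are at or below it.
    """
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < per_mille <= 1000:
        raise ValueError("per_mille must be in (0, 1000]")
    ordered = sorted(samples_ms)
    rank = (per_mille * len(ordered) + 999) // 1000  # integer ceiling, 1-based
    return ordered[rank - 1]


def spend_per_unit_volume(cloud_spend: float, business_volume: float) -> float:
    """Cloud spend normalized by business volume, so growth alone does not look like waste."""
    if business_volume <= 0:
        raise ValueError("business volume must be positive")
    return cloud_spend / business_volume


# Invented data: 1,000 requests with latencies of 1..1000 ms.
latencies = [float(ms) for ms in range(1, 1001)]
print(tail_latency(latencies))  # 999.0
print(spend_per_unit_volume(1_200_000, 400_000_000))  # 0.003
```

Normalizing spend by volume is the design choice worth noting: absolute spend rising with the business is expected, while spend per unit of volume rising is the signal to investigate.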

For Data Platform: some aspects are easier to measure than others. So here, we have 3 categories - starting from the outer loop of Stripe users, to the inner loop of our direct engineering cohorts and the bridge between the 2 - our executive leadership:

Non-functional requirements to measure strong guarantees of security, reliability, and performance of our systems.
Functional requirements to democratize access to data to enable rich insights for various cohorts that work with data - data scientists, data engineers, ML engineers or Product engineers. This is generally the hardest to measure!
The efficacy of operating Stripe’s business through data efficiency, compliance, and rigorous financial accounting.

I personally look at most of these system & business metrics weekly, to determine overall health of our systems and the broader investment within the organization.

In addition to these, I also look at team health metrics (monthly & quarterly) - like employee engagement, hiring ratios, attrition or transfers, #uplevel readiness.

Several of the areas you’ve worked on, especially efficiency (e.g. infrastructure spend) and performance (e.g. CPU utilization and user-facing latency) are areas of distributed accountability. A system’s efficiency is heavily dependent on the individual parts within the system. How do you set goals for areas of distributed accountability? What have you found effective for reducing the challenges of diffused accountability?

I love this question, and especially your reference to Thinking in Systems, a book which blew my mind a decade ago. Here’s how I’ve come to approach these problems.

Frame the problem. The why

For both Efficiency & user-facing latency, the first thing I did was own the whole problem, from the farm to the table. This was a key, fundamental step because it helped provide a unified direction and sense of purpose for the narrative. I established myself as the accountable and authoritative subject matter expert in framing the problem for the company, through trust and verifiable, clean data.

I had learned from past experience that accountability without authority was a kitchen sink at best and a dull knife at its worst. So I secured executive sponsorship to back this key impactful initiative for Stripe, aligning on the outcomes through charter metrics (eg: overall cloud spend as a function of the business and p99.9 latency of the most popular ChargePath API), and setting expectations on the relative agency of a centralized team in driving those outcomes.

While this was necessary, it was far from sufficient. And that brings us to identifying the key elements of this system - the movers & the shakers.

Identify the elements. The what / who

In order to determine whom to hold accountable, we had to invest a few quarters in doing the gardening: creating a M.E.C.E. attribution of our total cloud spend, down to the last dollar, to a team. This required navigating the notion of organizational hierarchies, supporting reorg workflows, re-attributing and backfilling to support error handling. This can feel toilsome, and be the valley of slow death, but here’s where I’d recommend persisting through, because it will pay dividends when done right, and well.

Once we had attributed every dollar or every time slice, we then used Pareto’s 80-20 rule to focus on the top 5-10 product or platform teams, which provided the highest leverage.
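The attribution and Pareto steps can be sketched as follows. M.E.C.E. stands for mutually exclusive, collectively exhaustive: every line item lands on exactly one owner, and the per-team totals sum back to the full bill. The team names, amounts, and the explicit `unattributed` bucket are assumptions for illustration, not a description of Stripe’s tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

UNATTRIBUTED = "unattributed"  # hypothetical catch-all so no spend is silently dropped


def attribute_spend(line_items: Iterable[Tuple[Optional[str], int]]) -> Dict[str, int]:
    """Assign each (owner, cents) line item to exactly one team."""
    totals: Dict[str, int] = defaultdict(int)
    for owner, cents in line_items:
        totals[owner or UNATTRIBUTED] += cents
    return dict(totals)


def top_spenders(totals: Dict[str, int], share_percent: int = 80) -> List[str]:
    """Largest teams, in order, until they cover share_percent of total spend."""
    grand_total = sum(totals.values())
    picked: List[str] = []
    running = 0
    for team, cents in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        if running * 100 >= share_percent * grand_total:
            break
        picked.append(team)
        running += cents
    return picked


# Invented bill, in cents; one item has no owner tag yet.
items = [("payments", 50_000), ("data", 30_000), ("search", 10_000),
         (None, 6_000), ("docs", 4_000)]
totals = attribute_spend(items)
assert sum(totals.values()) == sum(cents for _, cents in items)  # exhaustive
print(top_spenders(totals))  # ['payments', 'data']
```

Keeping an explicit unattributed bucket makes the gardening work measurable: the goal is to drive that bucket toward zero rather than pretend it does not exist.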

Identify the interconnectedness and the flows. The how

Drucker said, culture eats strategy for breakfast. And a key aspect to changing behavior, especially when accountability is diffused, is to motivate culture. And culture is nothing but the behaviors the system incentivizes or disincentivizes.

We saw that the Hadoop platform team allocated its resources to teams through statically assigned queues, which led to local fragmentation and overall dropped system utilization when those jobs weren’t running. We needed the platform team to implement elasticity and the job runners to release resources – but both needed to be made aware, and then incentivized to prioritize this work.
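A toy model shows why statically assigned queues depress overall utilization. It is a sketch only: the quotas and demand figures are invented, the function names are hypothetical, and a real scheduler has far more moving parts (preemption, job granularity, release latency).

```python
from typing import Dict


def static_utilization(quotas: Dict[str, int], demand: Dict[str, int]) -> float:
    """Each team can use only its own fixed queue; unused quota sits idle."""
    used = sum(min(demand.get(team, 0), quota) for team, quota in quotas.items())
    return used / sum(quotas.values())


def elastic_utilization(quotas: Dict[str, int], demand: Dict[str, int]) -> float:
    """Capacity is pooled; any team can run on whatever is currently free."""
    capacity = sum(quotas.values())
    return min(sum(demand.values()), capacity) / capacity


# Invented numbers: team "a" is starved while "b" and "c" leave capacity idle.
quotas = {"a": 100, "b": 100, "c": 100}
demand = {"a": 180, "b": 20, "c": 40}
print(round(static_utilization(quotas, demand), 2))   # 0.53
print(round(elastic_utilization(quotas, demand), 2))  # 0.8
```

The elastic figure is only reachable if idle jobs actually release what they hold, which is why both the platform team and the job runners had work to do.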

And here’s where I’ve found it immensely useful to implement and operationalize the E.A.S.T. framework for behavioral economics:

Easy to self-serve costs through attribution, rich cost observability tooling and automated customized Nudges providing insights and recommendations on ways to meet their goals
Attractive to incentivize and reward Efficiency efforts by tracking wins, providing badges or company-equivalent means of public recognition
Social by driving ownership and accountability through cohort analysis, leaderboards, public Ops Reviews, and
Timely by introducing LEAP in Eng101 onboarding classes.

Lastly, there are systems where the carrots work better, or the stick. Depending on the urgency of the problem, some levers to drive the latter are setting explicit budgets (eg: cloud spend or headcount, spend budgets for org size of 25+), ensuring that teams have the right company level prioritization for related work, enforcing capacity governance processes or ring-fencing engineering bandwidth to drive centralized optimization.

Are there any processes or forums (like a “quarterly business review” or whatnot) that you’ve found valuable for inspecting execution within your team or across the many teams that share some responsibility for performance and efficiency?

In addition to diffused accountability, the other big challenge with inspecting execution for areas of performance and efficiency is realized impact.

Let’s say, a data team decides to build a resource request portal, to automate away the static allocation and under-utilization of its compute resources. They ship this feature and move on to solve other problems. However, a few months in, they don’t see any change in the overall spend on the Hadoop infrastructure.

These are especially common in the performance and efficiency space, as the evaluation of the problem is based on several hypotheses, and there’s underlying complexity in the causal chain of dependencies. In the above case, we see that resources are under-utilized and waste is high. We concluded that most waste occurs from queue fragmentation in statically assigned compute resources, dynamic allocation will thus reduce fragmentation, hence cost savings! If we’d looked deeper at the data, we might have identified that the issue wasn’t so much in the fragmentation, but in the release of unused resources – a similar but different problem, begging for a different engineering solution.

Given this complexity in diagnosis, I’ve found it extremely useful to establish a contract with relevant teams (or my own): anchor around invariants that need to be true at the end of a certain timeline, or around quantifiable, verifiable metrics. Eg: No product engineering team will miss their p99.9 latency service level agreement for over 48 hours; beyond that, they open an incident to follow due protocol. Or, team X will spend no more than 2% over their monthly allocated spend budget; any variances beyond this will need explicit approval from executive leaders.
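Invariants like these are simple enough to encode as automated checks. The sketch below is one hypothetical way to express the two examples above; the function names and default thresholds are assumptions drawn from the examples, not a description of any real tooling.

```python
def within_budget(actual_usd: float, budget_usd: float, tolerance: float = 0.02) -> bool:
    """True if actual spend is no more than `tolerance` (2% by default) over budget."""
    return actual_usd <= budget_usd * (1 + tolerance)


def latency_breach_needs_incident(hours_over_sla: float, grace_hours: float = 48.0) -> bool:
    """True once a p99.9 latency SLA miss has lasted longer than the agreed grace period."""
    return hours_over_sla > grace_hours


# Hypothetical numbers: a $100K monthly budget and two possible actuals.
ok = within_budget(101_000, 100_000)              # 1% over: inside the contract
needs_approval = not within_budget(103_000, 100_000)  # 3% over: needs explicit approval
```

The value of writing the contract down this way is that the check stays the same whichever engineering solution the team chooses.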

Whether the teams decide to solve problem X or Y, or engineer a solution Foo or Bar, then, is secondary. We shake on the outcomes and invariants - and this fosters both agency & autonomy for the teams to drive results, and also creates owned accountability.

Speaking of accountability, I am a firm believer in ‘Trust _and verify_’. It is crucial then, to create the right ~real-time alerting and feedback loops, to catch early drifts - and I’ve found the weekly Ops reviews to be at the right cadence for these. This is where we want to leverage the exec sponsor for this program, who’ll recognize the right behaviors we want to see amplified, or facilitate deep dives into the incorrect outcomes to dampen their spread.

Lastly, QBRs are a great way to formally view trends in resource management, and related impact. This is also a great time to strategize and prioritize future investment, in line with the organization’s broader goals.

On that same theme, one particular challenge I’ve encountered is the perception that infrastructure efficiency is less important than developer productivity. To the extent that is true, some would argue that it’s illogical to prioritize things like performance and efficiency. How have you dealt with this tension between efficiency (or performance) and developer productivity?

For me, the joy of engineering lies in the solving of constraints, similar to those linear programming problems in Math. Given a system, and some non-functional requirements (eg: availability, reliability, security), how do we seek equilibrium in the system? How do we make the right tradeoffs to sustain that?

At the macro level, it goes back to the opportunity cost for the business. When does it make sense for the business to invest in efficiency or performance? When a company is in its growth stage, its engineering talent is the highest asset and finding the right product-market fit is its highest priority. At that time, and at that scale, developer productivity is higher leverage than efficiency.

But as the business matures, and its organization and the engineering systems evolve, the balance shifts. 4YPs and discounted cash flows also start expecting to yield economies of scale- especially given the compounded nature of money. The CFO is likely to assess marginal revenue per net new employee, or overall margin for the business. And for most SaaS companies running infrastructure on the Cloud, their OPEX is the second largest spend.

At Stripe, I intimately witnessed our burgeoning cloud costs, and thanks to your foresight in investing early, we were largely successful in bending the curve along multiple dimensions of our overall spend. In order to justify and equip engineering teams with the agency to drive their investment, we laid down a generally applicable decision-making framework to translate engineering time to cost savings. For example: Invest 1 dev-week of effort for $10K/month savings. For our own centralized Efficiency team, we placed a high premium on opportunities worth pursuing, eg: 5X cost savings per IC. These help address some of the tension between investment in dev prod efforts vs those catering to Efficiency.
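A rule of thumb like “1 dev-week for $10K/month savings” reduces to a one-line check. The sketch below is illustrative only: the function name is hypothetical, and the default exchange rate simply reuses the example figure above.

```python
def worth_investing(dev_weeks: float, monthly_savings_usd: float,
                    usd_per_dev_week: float = 10_000.0) -> bool:
    """True if expected monthly savings meet the bar of `usd_per_dev_week` per dev-week."""
    return monthly_savings_usd >= dev_weeks * usd_per_dev_week


# Hypothetical projects: (estimated dev-weeks, estimated monthly savings in USD).
tiering_project = worth_investing(dev_weeks=3, monthly_savings_usd=45_000)
cleanup_project = worth_investing(dev_weeks=4, monthly_savings_usd=25_000)
```

Under this rule the first project clears the bar and the second does not, which gives teams a shared basis for saying yes or no without escalating each decision.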

However, at the micro level, depending on the problem you’re solving, you could either improve both systems efficiency and developer productivity, or face situations where “going faster” necessitates spending more. Eg: Take CI costs: if we were to improve and fine-tune our selection set of which tests to run, we’d reduce the dev time spent on running tests and reduce CPU hours, thereby being more efficient. But take build times - let’s say throwing 15% more instances at generating builds reduces average build time from 25 mins to 15 mins. Is it worth it? Yes. But at what point is it not - how about when going from 15 mins down to 12 mins?

In Staff Engineer’s Manage Technical Quality, I argued that folks should focus on pursuing quality through improving hot spots, best practices, and so on. The least recommended solution was running an organization program that requires coordination across the entire engineering organization. This is a point of view that I developed in part during our time working together based on how hard it is to coordinate moving an organization. Do you think I came to the wrong conclusion in recommending folks avoid running organizational programs as much as possible? Any advice to make running organizational programs effective?

I especially love that article on Managing Technical Quality and I wholeheartedly agree on your assessment!

I started my career as a Quality Engineer around 2 decades ago, testing key features like distributed resource scheduling and linked clones for VMware’s control plane management solution to manage VMs. It was prior to the DevOps movement, and most enterprise companies ran these through centralized teams. There were several downsides to that model, stemming primarily from misaligned incentives, which arose due to lack of end-to-end ownership in shipping a high quality product to users. The developers were responsible for checking in code, and the QE for identifying defects and performance bottlenecks. This adversarial engagement created tension, as opposed to a joint commitment to delivering value. There was also a downward spiral of brain drain, due to the system perpetuating implicit second-class citizenship, in its hiring, compensation and talent management frameworks.

Fast forward to recent times, the core tenets of DevSecOps place a high value on end-to-end ownership of engineering – from code deployment to managing maintenance and operations. Systems which embrace this model heavily benefit from your recommendation in the article - which is to focus on the hot spots, drive practices, find leverage points and so on.

As cliche as it is, it all comes down to people! People are at the heart of every engineering problem, and its solution - be it more engineering, practice, process or program. I am of the firm opinion that people want to do the right thing, but they are optimizing for the constraints they are given. The most expedient way to then drive change is to provide awareness of the problem, align incentives, and give them the time and space to prioritize the fixes. For example, if a business leader is pushing their org to release product features at a breakneck pace, it will lead to technical debt or low code quality.

Also, running a program has extremely high overhead– sustainable metrics, weekly executive sponsorship and commitment, ongoing program evaluation. A program, its related scoring or goals evaluation, and associated leaderboards, also create a sense of foreboding - it is akin to being called into the Principal’s office– and shift the balance from the program owners being medics and dependable consults to cops who must be dealt with.

But there are times when a technical program is indeed the right solution – factors here range from the scale of the engineering organization (eg: tracking cloud spend for a group of 1000+ vs 200), to bootstrapping baseline shifts in your overall posture (eg: driving least privilege access to all data) or requiring immediate change to uplevel the entire organization simultaneously (eg: compliance needs like GDPR, India data locality).

I’ve had fair success leading such programs, focusing on:

Early (and often) alignment with key stakeholders on defining the goals and soliciting their buy-in.
Fostering trust and autonomy: trust in the data you leverage to guide ongoing decisions, trust in your intention to meet the mutually beneficial goal, and trust in being an equal, supportive partner throughout the journey. Trust _and verify_.
Effective communication and tight collaboration: create feedback loops to ensure information flows at the right cadence, at the right zoom factor for the right audience.
Giving credit liberally; publicly recognizing the good citizens, or the early adopters.

What are some of the most impactful projects or tools that your teams have rolled out to improve performance or efficiency that were impactful without requiring mass-coordination across many teams?

Efficiency, Performance and to that extent, even Reliability and Security are horizontal programs. For each of these, I’ve found it valuable to establish the right balance of tooling, education & practices to drive organizational behavior and simultaneously land direct improvements by solving real engineering problems. Anchoring on either end of the spectrum disproportionately impacts the end outcome. For example, if you index heavily on laying down patterns and practices for the org to adopt, but don’t build critical infrastructure or land impact by fixing existing systems, it erodes trust and credibility. If you are making point fixes, and landing impact a system at a time, you’re likely not evolving fast enough in a rapidly scaling company.

Keeping that balance in mind, and similar to macro-economic cyclicality, I developed our Efficiency strategy around 3 dimensions:

Pay Less (optimize procurement),
Use Less (optimize utilization), and
Need Less (optimize performance).

Early on, optimizing procurement was the single biggest lever in reducing our cloud spend. Automating Reserved Instances & Savings Plans purchasing, implementing storage tiering for hot/warm/cold data accesses and centrally leading vendor discount negotiation (in collaboration with F&S) significantly dropped the spend/business volume bps.

We then focused on the second bucket - use less- improving utilization. This involved auditing unused/unclaimed stuff, automating brownouts to those unaccessed resources and then releasing the resources to prevent future spend.

Similarly, on the latency side, we rolled out an incident-free Ruby GC optimization without needing to coordinate with Product teams. This change dropped the tail from 4.6 seconds down to 2.9 seconds.

I’ve often considered Efficiency to be an “obvious spot” to partner with a Technical Program Manager (TPM), because it’s such a cross-organizational effort and there’s no finish line: the work just keeps going further. Do you agree? How would you approach involving TPMs in areas like efficiency and performance?

There are 3 key pillars to navigate when running an effective Efficiency & Performance program - the engineering, the organization and the Finance & Strategy.

Engineering comprises the centralized team which drives the execution of the strategy, and related projects serving the end outcomes.
Organization involves the product and other infrastructure teams within engineering, their organizational leaders and the executives leading the business.
Finance & Strategy leads the overall capital allocation at the macro business level, often reporting into the CFO.

A solid TPM can serve as _the_ glue and the singular force operationalizing the strategy and seamlessly bridging all 3 pillars:

Identifying technical inefficiencies in product & infra engineering and creating the Efficiency portfolio of opportunities. This could involve: a. Tracking big scale up, scale down and swings in cloud spend, and enforcing capacity governance processes. b. Tracking platform rate cards (eg: avg cost per vcpu for a Hadoop job) and quantity of resources consumed (eg: #vcore-hours for team X). c. Creating effective feedback loops to bridge the utilization with consumption and budgets.
Enablement & education to motivate change bottom up and shift left the culture of efficiency & performance - Facilitate prioritization conversations across various stakeholders and leadership to unlock resourcing for highest leverage work items. Partner with the centralized Efficiency team, Education and other Infrastructure teams to develop best patterns and practices for developing systems and services efficiently.
Operationalize budget tracking and drive high forecast fidelity by organizing monthly spend budget reviews for: a. Identifying the right set of teams and tracking org-wise budgets vs actuals. b. Evaluating engineering plans for new investments. c. Accounting for budget variances due to delayed execution (eg: Team X budgeted $Y for the month of March for a new feature launch, but came in lower because they encountered issues), or overspend (and identifying critical remediations). d. Enabling identification of potential cost saving opportunities.
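The rate-card tracking mentioned above boils down to multiplying a unit rate by the quantity each team consumed. Here is a minimal sketch of that attribution; the platform names, rates and usage figures are invented for illustration, and real rate cards would come from your own billing and metering data.

```python
def attributed_cost(rate_card: dict[str, float],
                    usage: dict[str, dict[str, float]]) -> dict[str, float]:
    """Attribute cost to each team as the sum of unit rate x units consumed per platform."""
    return {
        team: sum(rate_card[platform] * units for platform, units in consumed.items())
        for team, consumed in usage.items()
    }


# Hypothetical rate card: USD per unit of each platform resource.
rates = {"hadoop_vcore_hour": 0.25, "storage_tb_month": 20.0}
# Hypothetical monthly consumption per team.
monthly_usage = {
    "team-x": {"hadoop_vcore_hour": 10_000, "storage_tb_month": 5},
    "team-y": {"hadoop_vcore_hour": 2_000},
}
costs = attributed_cost(rates, monthly_usage)
```

Separating rate from quantity matters because they have different owners: the platform team can lower the rate, while the consuming team controls the quantity.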

Lastly, a TPM is a core partner to the Engineering Manager and F&S, in identifying and unifying KPIs to tune the OODA loop (observe, orient, decide & act), to make macro or micro refinements to the overall strategy.

There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?

I am very glad you brought this up! With infrastructure, when something’s going wrong, there’s nowhere to hide. But the key challenge is when nothing is going wrong, how do you know it’s actually going right? So, when I think of infrastructure problems, I think of ‘great power comes with great responsibility’. And here’s why.

The beauty of Infrastructure is in its whittling down of essential complexity, through simplified abstractions which bring joy to its users. And through that, key leverage for the business.

Most mid-sized companies looking to scale, start investing in infrastructure engineering teams; with typical hiring ratios being 7-8 Product engineers to every infra IC. This makes it imperative that every infra-eng-week of effort be dedicated to high leverage work. Infra problems also take more rigor to solve, and get right, to avoid thrashing the rest of the engineering organization. Imagine building out a cloud compute abstraction, which changed every quarter, and fanned itself out to 20+ product engineering teams doing daily deploys. It’d be a nightmare!

This combination of complexity, rigor and the expectation of high ROI, makes Infrastructure engineering a very high stakes endeavor :)! Teams which romanticize or idealize the tech, over its customers or business value tend to languish – either due to missing the mark on realized impact, ceased investment due to lost credibility or due to internal employee burnout. And so articulating value, within and without, at every stage of software development is extremely crucial to leading a high performing, value delivering Infra team.

The 3 tenets I’ve found useful are:

1. Know your customer.
2. Bring in a product mindset, whether it involves doing initial market study (eg: evaluating build vs buy options), customer analysis & segmentation (eg: focus on data scientists over business analysts), or even developing a go-to-market strategy (eg: white-glove migration workshops to facilitate Data Locality needs).
3. Measure what matters, not what’s easy to measure (and do the early work to identify what this is!)

At the planning stage, drive precision through alignment and prioritization — are you focusing on the right problem? And for whom? Here, you need to be grounded in the why, before the what. Do the 5 whys exercise, especially if embarking on multi-half infrastructure investments (eg: migrating a monolith to a SOA). And to seek alignment and early feedback, I’ve found the PRFAQs practice from Amazon, quite useful to build trust and credibility with your stakeholders and executive leadership.
At execution, drive focus, speed & quality, leveraging the SPACE metrics whenever applicable. Be extremely paranoid about scoping the problem just so, go deep before you go broad, and aim for vertical slivers of delivering impact vs all-or-nothing. I recently led the data security strategy for Stripe, and the biggest win we had was in the underlying approach. We identified a data access metric, and pivoted from securing one data system at a time, to incrementally driving value and moving the needle. Depending on the culture of your organization, communicate early, and often, through shipped emails, company all-hands or demos. This is a great avenue to seek feedback with your beta users, confirming the validity of your approach.
Finally, ensure that you’re maximizing overall impact: Are folks using what you delivered? Are you actually seeing movement toward your northstar metric? This is when we hone the “what” and validate the “why”. Quite recently, we shipped some work expecting to see change, and moved onto solving other problems. Looking back, we realized that we had needed to build adjacencies to the shipped work to actually capitalize on all the effort. Sometimes, you need to evaluate what an additional 5–10% looks like to realize the most impact; this could be a marketing strategy, a small UX improvement, a small optimization (e.g. making load times much faster), or in many cases, better documentation. _Take that time. Bring it home._

On March 15, 2013, 1,200 Japanese workers converted the Shibuya Station train Line from above ground to underground, in just 3 hours, before the first morning train the next day! I have worked on Infrastructure for nearly 2 decades now – when I think of Infrastructure, I think of this. It is this behind-the-scenes symphony of dedicated, resilient and talented people, working together to keep the masses moving, with zero friction or downtime- THIS gives me joy. And pride.

Asking the same question again but from a different perspective: how does working on something like efficiency or performance impact someone’s career, particularly in terms of getting promoted?

There’s the finite game of uplevels and promotions, and the infinite one of constant learning, development and evolution.

For the former, when we think of individual career impact, there are 3 systems at play – individual career aspirations, the engineering ladder and expectations for different levels/roles, and the business need/team opportunity.

Efficiency/Performance-shaped problems tend to be both broad and deep (eg: spark tuning for bigdata computation or improving your Kafka publish tail latency). To navigate such problems, some key traits and competencies are: highly motivated, proactive problem-solvers who can move with urgency and focus, while balancing critical thinking, comfort in dealing with data-driven diagnosis, hypotheses and analyses, and ability to work cross-organizationally, collaborating with different teams, systems and organizational dynamics. Let’s assume that individuals working on efficiency or performance-shaped problems are inherently motivated and excited about solving such problems.

That brings us to ensuring that there is indeed a strong business need to be solving these problems. A company, in its initial phases, might not want to invest in efficiency and performance, and rightly so as we discussed earlier. If we’ve secured the need and the buy-in, it comes down to demonstrated value - results, results, results! And once you’ve secured results, identify the narrative for the impact driven - what’s the before/after story? What got better? What gets worse, if left unsolved? Understand and evaluate your organization’s leveling rubric, to assess if the complexity or realized impact, are in line with the system’s expectation from someone at that level and role. Eg: At Stripe, we’ve intentionally introduced the “Fixer” archetype for Staff engineers, to create room for, and acknowledge the value of associated impact to the business.

Some things to keep in mind:

Going back to your earlier bit about diffused accountability, we also need to ensure that the individuals working on these problems are well equipped to navigate this aspect of the system (especially, because ICs tend to avoid situations involving potential conflict).
If in the discovery phase of evaluating which problems to solve, identify a rubric for effort and impact (eg: 1 eng-quarter for $X million in annualized savings), and stack rank those opportunities to avoid missing the forest for the trees.
Balance driving value with incoming interrupts, when driving change through the rest of the org – ICs want to code, and solve problems, so leverage partners like the EM and TPM to help ICs get focus time.

Lastly, speaking of the long game, and my own experience: I’ve developed some key strengths through my journey from quality engineering to leading efficiency/performance programs - the ability to seamlessly operate & diagnose varied distributed systems, strong business communication skills, and the ability to effectively drive influence without authority across cross-functional organizations.

What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?

You and yours - infraeng is a great resource to hear from other practitioners, operators and builders!

I love understanding how complex systems at scale fail - for these, I’ve found it extremely valuable to study the AWS Post-Event summaries! I also routinely look forward to shared content from Cindy Sridharan, Liz Fong-Jones and Corey Quinn. I found Designing Data Intensive Applications to be an excellent resource for the basics around dealing with Data.

In terms of engineering management, I’ve learnt a lot from playbooks, frameworks and resources shared by Lara Hogan, Kim Scott, and Brene Brown. My favorite book on leadership is I Too Had a Dream by Dr. Kurien.

Engineering Leadership at the higher levels starts getting fuzzier in time and space, and has less structured content. Here, I do find myself leveraging a lot from my Macro & Micro Economics - and leaning on more holistic modeling of the world around us (Thinking in Systems: A Primer), and our own interpretation of and response to it (Mindset: The New Psychology of Success, Switch: How to Change Things When Change Is Hard, Unlocking Leadership Mindtraps: How to Thrive in Complexity).

Contract Negotiation Checklist

Mon, 04 Apr 2022 07:00:00 -0700

Fork this template on Google Docs

Negotiating contracts is an important part of managing costs, but it’s also something that you only do infrequently. Particularly in an earlier stage company, you might only negotiate one large contract a year. It’s quite hard to get better at something that you do so infrequently, but using a checklist is one way to be consistent in your approach, and to ensure learnings from one negotiation carry over into the next.

Contract Negotiation Checklist

There’s no perfect checklist: you should customize the checklist for your company based on your process, preferences and the experience level of those who are involved. If this feels too heavy, then by all means remove some steps.

More Readings On Vendor Negotiations

How to Use

Fork this template on Google Docs
Follow the template’s checklist
Link your template into an internal repository of all negotiations so folks can find it the next time you’re negotiating this or related contracts
Now you’re done!

Tips

Negotiating contracts is a learned skill, and you’re probably not going to be good at it for the first few. That’s ok! Find someone else with more experience to partner with on your first few, even if it’s just asking a more experienced friend at another company to brainstorm with you through the process
Whenever possible, negotiate knowing how much other companies are paying the vendor you’re talking to. This creates a clear price ceiling to negotiate towards
If the vendor hasn’t hit their quota and is approaching their financial year or quarter, you can almost always make significant progress on terms if you’re willing to move quickly
Everyone should know their role in each negotiation. Sometimes your role is being the difficult, inflexible person! Sometimes your role is being the person who thinks they’re the final decider but is in fact mistaken when your manager “takes over” later, “much to your chagrin”
Not all contracts are equal. Sometimes the absolute number of dollars isn’t high enough to follow a time consuming process. On the other hand, sometimes the numbers are genuinely massive and are worth pulling in even your most senior leadership to get the best possible deal

Efficiency: Managing Infrastructure Costs

Sun, 03 Apr 2022 07:00:00 -0700

In my early career roles, I worked at companies that never worried about their infrastructure costs at all. They were simply too low a cost and growing too slowly for the Finance team to pay much attention to it. This “ignore it until it’s too large to ignore” approach served me well.

Until it didn’t.

Working at Uber, I was caught off guard when a new Director joined and overnight infrastructure costs were recategorized from insignificant to requiring urgent, detailed review every month. Adding the instrumentation and accountability for these costs retroactively was a difficult retrofit. Although I was surprised that time, I’ve come to appreciate that all successful companies go through the transition from ignoring to setting goals on infrastructure costs, and an early focus during my time at Stripe was ensuring we were ready ahead of that shift.

Your job as an infrastructure leader is diagnosing the right mode of operation for your company’s infrastructure costs today, understanding when you’re likely to switch modes, and ensuring you’ve done the prework to make the transition relatively painless.

We’ll explore this topic by digging into:

three distinct operating modes for infrastructure costs: early-stage, growth, and late-stage
concrete tools and tactics such as managing infrastructure costs with cloud-specific reductions, including costs in your Business Review Template, and using a Contract Negotiation Checklist
whether you should spin up a dedicated team working in this area

When you finish reading this, you won’t have your entire efficiency plan worked out, but you will have the high-level pieces, know where you need to dig in, and have a clear approach to communicate to anyone who has been pushing you for a documented approach around infrastructure costs.

Related Interviews

Smruti Patel: Head of Engineering for L.E.A.P. at Stripe

Should you prioritize infrastructure costs?

Before diving into the mechanics of managing infrastructure costs, the first question to answer is whether it’s a valuable use of organizational time to make your current infrastructure spend more efficient. How you think about this will vary a bit depending on whether your company is early-stage, prioritizing growth, or focused on profitability in late-stage.

Early-Stage

Generally speaking, very early-stage companies shouldn’t spend much time thinking about infrastructure costs. You should instead be focused on finding product-market fit for your first product.

Here are two checks you can run to determine if it’s worth reducing your infrastructure costs:

If you were to reduce your infrastructure costs to $0, and it still doesn’t increase your runway by at least two months, then it’s not worth focusing on
If you’re spending less than $2,000/month per employee on infrastructure costs, then it’s probably not a significant priority because your headcount spend will be so much higher

If you’re not violating either of those checks, then keep on ignoring infrastructure spend. If you are exceeding one, and infrastructure costs are a significant part of your overall burn, then invest a sprint into reducing spend, and then resume ignoring it once these checks resume passing.
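To make the two checks concrete, here is a small sketch that computes them from four inputs. The function names and example figures are hypothetical, and it assumes monthly burn already includes infrastructure spend and is larger than it.

```python
def runway_extension_months(cash: float, monthly_burn: float, monthly_infra: float) -> float:
    """Months of runway gained if infrastructure spend dropped to $0.

    Assumes monthly_burn includes infrastructure spend and exceeds it.
    """
    return cash / (monthly_burn - monthly_infra) - cash / monthly_burn


def should_reduce_infra_costs(cash: float, monthly_burn: float,
                              monthly_infra: float, employees: int) -> bool:
    """True if either early-stage check says infrastructure spend deserves attention."""
    extends_runway = runway_extension_months(cash, monthly_burn, monthly_infra) >= 2
    high_per_employee = monthly_infra / employees >= 2_000
    return extends_runway or high_per_employee


# Hypothetical company: $2.4M in the bank, $200K monthly burn, 20 employees.
ignore_it = not should_reduce_infra_costs(2_400_000, 200_000, 20_000, 20)
spend_a_sprint = should_reduce_infra_costs(2_400_000, 200_000, 50_000, 20)
```

In the first case, removing $20K/month of infrastructure adds about 1.3 months of runway and costs $1,000 per employee, so both checks pass and spend can keep being ignored. At $50K/month, runway would extend by four months, so a sprint on cost reduction is justified.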

The one notable exception is if you’re building a low-margin product or product where cost efficiency is a pillar of your long-term strategy. For example, if you’re operating a metrics collection and dashboarding product like Datadog, then efficiency probably is worth considering earlier than usual.

Growth

When you’re prioritizing growth, the primary focus of the engineering organization in a technology company is creating, operating and advancing the products that support the business. Managing costs is important, but even immaculate cost management won’t make your company a success if enough energy isn’t being invested in your product.

The fundamental question to ask is whether infrastructure’s share of cost of goods sold (COGS) is increasing as a percentage of revenue. (The simplest way to think about COGS is as all your non-headcount costs, although a slightly better definition would be all costs to operate your software.)

Start answering this question by plotting revenue and infrastructure costs on a chart to get a sense of how these two numbers are moving. Although logarithmic scales often generate more confusion than they’re worth, in this case it’s usually the only way to see both lines closely enough to understand their slopes within a single chart. You particularly want to understand if either line has experienced an inflection over the past few quarters. If costs have started accelerating without corresponding acceleration of revenue, that’s worth digging into.

Once you’ve looked at the two lines independently to understand their movement, simplify your first chart into a chart showing infrastructure costs as a percentage of revenue. This chart hides some detail but is easier to parse for folks further away from the details. As long as the ratio is going down and your company is focused on growth, then this data should be sufficient to justify your current level of investment into efficiency: if growth is key, and infrastructure costs are not getting in the way, why should you slow down growth to reduce them?
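The ratio chart is easy to derive once you have the two series. The sketch below computes infrastructure cost as a percentage of revenue per quarter and checks whether it is trending down; the function names and quarterly figures are made up for illustration.

```python
def cost_share_of_revenue(revenue: list[float], infra_cost: list[float]) -> list[float]:
    """Infrastructure cost as a percentage of revenue, period by period."""
    return [100.0 * cost / rev for rev, cost in zip(revenue, infra_cost)]


def is_trending_down(ratios: list[float]) -> bool:
    """True if the ratio never increases from one period to the next."""
    return all(later <= earlier for earlier, later in zip(ratios, ratios[1:]))


# Hypothetical quarterly figures in USD.
quarterly_revenue = [1_000_000, 1_500_000, 2_000_000]
quarterly_infra = [100_000, 135_000, 160_000]
ratios = cost_share_of_revenue(quarterly_revenue, quarterly_infra)
healthy = is_trending_down(ratios)
```

Here costs grow every quarter in absolute terms, but the ratio falls from 10% to 8%, which is the pattern that justifies staying focused on growth.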

Late-Stage

Even the best business lines stop growing at some point. Facebook is one of the most valuable businesses in the world, but even they at some point ran out of new users to attract to their platform. Once growth slows, a business naturally starts focusing more on costs, including infrastructure spend.

In those scenarios, the easiest approach is to work with the business to align on two numbers:

Dollars spent on infrastructure overhead per engineer: this includes things like development environments, testing tools, and so on. Determine your starting point by bucketing vendors and non-production infrastructure costs into a chart and plotting them over time divided by headcount. Pick a reasonable point on that line as your target. Refine it by reaching out to industry peers to get a sense of how this number compares to theirs (be sure to pick industry peers in companies that are currently focused on profitability, otherwise their answers won’t be very helpful to you)
Infrastructure dollars spent per N product operations served: anchoring on cost of operating the product. This will vary a bit depending on your product or business, but it might be “$1.00 in infrastructure costs to power every 10,000 searches”, “$2.50 for every 10,000 payments processed”, or “$3.00 for every 10,000 trips scheduled”

In both, the key thing is moving away from anchoring on a percentage of revenue and instead setting a target against the fundamental operations that you support. Thinking of costs as a percentage of revenue works well when you’re growing, but is too abstract and hides too many details once you’re focused on reducing costs.

If you find yourself exceeding those targets, then it’s time to dive into reducing them.
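The two numbers above reduce to simple unit-cost arithmetic. Here is a rough sketch; the function names, the spend figures, and the targets are all hypothetical, and your own targets should come from the alignment process described above.

```python
def cost_per_engineer(overhead_dollars: float, engineers: int) -> float:
    """Non-production infrastructure and vendor spend divided by headcount."""
    return overhead_dollars / engineers


def cost_per_n_operations(infra_dollars: float, operations: int, n: int = 10_000) -> float:
    """Production infrastructure spend per N product operations served."""
    return infra_dollars * n / operations


def exceeds_target(actual: float, target: float) -> bool:
    return actual > target


# Made-up monthly figures for illustration only.
overhead = cost_per_engineer(overhead_dollars=300_000, engineers=150)
per_10k_searches = cost_per_n_operations(infra_dollars=250_000, operations=1_000_000_000)

print(f"overhead per engineer: ${overhead:,.2f}")
print(f"cost per 10,000 searches: ${per_10k_searches:.2f}")
print("over search target:", exceeds_target(per_10k_searches, target=1.00))
```

Because both numbers are anchored on headcount and product operations rather than revenue, they keep moving in a meaningful way even when revenue is flat.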

Tools for Managing Infrastructure Costs

What I’ll introduce here is the fairly common playbook for managing infrastructure costs. As you work through these approaches, your goal is to do as few of them as possible while meeting your efficiency goals. I’ve prefixed a few particularly high return-on-investment tools with a ⭐; if you’re debating where to start, consider starting with them.

⭐ Use your cloud vendor’s cost optimization tools. Every cloud vendor has a program along the lines of AWS Savings Plans or AWS Reserved Instances. These plans allow you to trade usage or spend commitments for reduced pricing. If you aren’t already using these, you can usually reduce your infrastructure costs by 20-40% in a few weeks of work
⭐ Standardize your vendor negotiation process. Beyond a core cloud vendor, many companies have five or six additional large vendor contracts for things like observability, security, or developer productivity. Introducing a structured process for negotiating and renegotiating, like using a Contract Negotiation Checklist, will significantly improve your pricing (as well as visibility into costs)
⭐ Run periodic deep dives on cost. Until you have a dedicated team actively looking at your infrastructure costs, you can usually identify significant cost reductions by periodically taking a week to dig into your biggest infrastructure costs and prioritizing low-hanging fruit. These will usually be accidents, like storing unused data, development environments not getting retired, etc. The key thing is scoping the opportunity to work that the infrastructure team can take on themselves
Update your Tech Spec template to include a section that estimates costs. Many engineers will be unfamiliar with that process, so make sure the template links to examples of how a few representative services estimated their costs. A great example will be onerously detailed, including links to the specific queries and tools to estimate their costs. A template that requires cost estimation without guiding folks through that process will inevitably trend towards make-work rather than a useful discussion
Find the executive sponsor who really cares about infrastructure costs and is willing to push inefficient users to spend less. This is usually your CTO or your CFO. Without an executive sponsor willing to prioritize this efficiency work, you’ll find progress further down this list difficult. If you can’t find a sponsor, that’s usually a good sign that you’re already doing enough to prevent infrastructure costs from becoming a top priority
Find product cost optimizations. There will be significant opportunities to reduce costs by changing how your product works, e.g. improving your data model, changing storage technologies, moving workloads between streaming and batch. However, product changes have a much wider set of stakeholders, which makes these sorts of improvements harder to prioritize. Generally, only try to pursue these if there is a massive opportunity
Pursue a cloud vendor contract discount. Negotiating your cloud contract to include a discount is very doable after you reach a certain level of spend, or are a sufficiently strategic partner, but before you reach that level of spend it’s quite difficult to get a meaningful discount. Is it worth spending three weeks and making multi-year financial commitments to get a six percent discount on your cloud spend? Maybe! It depends on your priorities and your confidence in future spending estimates, but it certainly isn’t worth it to everyone. Conversely, at a certain spend level–think, tens of millions of USD per year–your discount can get much higher without requiring any product-level changes
Set coarse goals on infrastructure costs. Partner with your company’s Finance team to coarsely attribute costs across teams, then set and monitor goals against those costs. Fine-grained goals and cost attribution require a deeper investment into tooling, but most companies can split costs across their production environment, development environments, and data engineering. Once you’ve done that split, you can set a goal and assign that goal to appropriate teams (respectively, something along the lines of infrastructure, developer productivity, and data engineering). This will provide some visibility and pressure on costs without requiring much attribution prework
Review costs in Business Reviews. Once you’ve set those coarse goals around infrastructure spend, ensure your company’s Business Review Template includes a section on their costs. If you run Business Review Meetings, then make sure someone is showing up to ask questions about costs for teams whose costs are missing goal or otherwise accelerating
Expand metadata to facilitate fine-grained goals on infrastructure costs. Implement an approach to Ownership Metadata such that you can assign all usage and storage costs against a specific team. Once you have that ownership metadata maintained, you can go further by generating proactive nudges to teams on following best practices, prioritizing high costs, and helping them identify accelerating spend early

If doing all of these sounds overwhelming, it should! Few companies do all of these, and those that do either operate in a business that is unusually margin sensitive or are spending many millions a year on their infrastructure costs.
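To make the attribution items in that list a bit more concrete, here is a minimal sketch of rolling cost line items up to owning teams and flagging teams that exceed their goal. The record shape, the `owner` field, the team names, and the dollar figures are all assumptions for illustration; real billing exports and ownership metadata will look different.

```python
from collections import defaultdict


def attribute_costs(line_items):
    """Sum costs per owning team; items without an owner land in 'unowned'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("owner") or "unowned"
        totals[owner] += item["cost"]
    return dict(totals)


def teams_over_goal(totals, goals):
    """Return the teams whose attributed cost exceeds their goal, sorted by name."""
    return sorted(team for team, goal in goals.items() if totals.get(team, 0.0) > goal)


# Hypothetical monthly line items and goals.
line_items = [
    {"resource": "prod-db", "owner": "infrastructure", "cost": 42_000.0},
    {"resource": "ci-runners", "owner": "developer-productivity", "cost": 18_000.0},
    {"resource": "warehouse", "owner": "data-engineering", "cost": 30_000.0},
    {"resource": "old-sandbox", "owner": None, "cost": 4_000.0},
    {"resource": "prod-cache", "owner": "infrastructure", "cost": 9_000.0},
]
goals = {
    "infrastructure": 50_000.0,
    "developer-productivity": 20_000.0,
    "data-engineering": 25_000.0,
}

totals = attribute_costs(line_items)
print(totals)
print("over goal:", teams_over_goal(totals, goals))
```

The size of the `unowned` bucket is a useful signal in its own right: the larger it is, the more ownership metadata work remains before fine-grained goals are practical.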

Should You Have a Dedicated Efficiency Team?

Generally, the way I think through spinning out any given area into a dedicated team is described in Trunk and Branches Model, and that applies to efficiency as well. That said, let me add a few caveats to that general approach as it applies here.

Much like managing technical quality, efficiency is an area where you can make significant progress with one-off initiatives. Improving how you use AWS Reserved Instances or renegotiating your vendor contracts can reduce your spend by 30-40% in a week or two. Product-level improvements to your architecture can reduce your spend even more, although they’ll probably take a bit longer.

Because you can make significant progress through one-off initiatives, the default is to wait until late into a company’s growth to spin out a dedicated team, and in most cases that’s the right decision.

The three factors to consider as you think through whether postponing a dedicated team is the best solution for you are:

Is infrastructure efficiency a fundamental strategic pillar for your business?
Are your infrastructure costs, today as an absolute cost, 10x more expensive than a team working to reduce them?
For the past year, have you had pressure to reduce costs but an inability to prioritize the work because other critical work continues to displace efficiency efforts?

If you answer yes to any of those, then you may want to spin out a team earlier than the Trunk and Branches Model suggests. As you start sourcing candidates, it’ll become apparent that this is a bit of a custom role with folks who specifically enjoy working on the problem. Recruiting one or two folks with significant preexisting experience will save you years!

Business Review Template

Wed, 30 Mar 2022 07:00:00 -0700

Fork this template on Google Docs

As your company gets larger and more complex, it’s easy to become embroiled in supporting incoming asks from other teams. That’s important work, but it’s also important that your team is operating effectively and prioritizing your goals in addition to the goals of other teams making requests.

If you’re getting mixed signals on whether your team is doing the right work, the Business Review Template can help cut through the confusion. This written document facilitates an operational review of your team, and even more importantly creates an opportunity for you, your team, and your stakeholders to discuss if you’re focused on the right work.

Example using the Business Review Template

Most companies wind up using a variation of this template by the time they reach a thousand employees, with some starting much earlier. Even if there’s no structured business review process, it’s helpful to start writing them periodically for the area you’re responsible for: think of them as your area’s performance review.

Related Meetings

Business Review Meeting

Other Approaches to Business Reviews

The Kool-Aid Factory’s The Shipping Great Work Issue

How to Use

Fork this template on Google Docs
Find examples of previous business reviews at your company, and if possible ask the authors what was and wasn’t well received in their most recent review
Fill in the template for your team’s area
Iterate on your draft with feedback from your team and manager
Identify the key groups you want feedback from, and create copies for each of those groups. Transparency is important, but transparency too early often mutes the direct feedback that helps you succeed. Give these groups a week or so to provide feedback, including running a Business Review Meeting if that’s something your company finds valuable
Widely publish a clean, readable copy into wherever business reviews are collected, and let anyone who hasn’t gotten a chance to see it so far know where to find it and how to share feedback on it
Now you’re done! (At least until the next one.)

Tips

Writing an effective business review depends first and foremost on understanding the audience you’re writing for, and what that audience cares about. If you’re not sure about the answer to either of those, ask!
Many companies and many teams try to use their business review to solve too many different problems. Your business review should focus on answering only two questions: how well is your area of the business operating? What do you need to do for it to operate better?
Good business reviews are focused on what the reviewers need from the review. Bad business reviews are comprehensive, capturing everything that someone on the team wants reviewers to know
Every metric you include in a business review should be a well-formed metric that includes the current value, the goal, and the trend over time
Avoid delegating the writing of your business review to multiple different folks. Short documents with disjoint authors are hard reads
The Amazon Way of Writing is a helpful set of rules for writing these sorts of documents
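One way to hold yourself to the “well-formed metric” tip above is to treat each metric as a small record that cannot be written down without a current value, a goal, and a trend. The sketch below is only an illustration; the `Metric` class, its field names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    goal: float
    history: list[float]           # observed values, oldest first
    higher_is_better: bool = True

    @property
    def current(self) -> float:
        return self.history[-1]

    def meets_goal(self) -> bool:
        if self.higher_is_better:
            return self.current >= self.goal
        return self.current <= self.goal

    def trend(self) -> str:
        """Compare the current value with the oldest value in the history."""
        if len(self.history) < 2:
            return "unknown"
        delta = self.current - self.history[0]
        if delta == 0:
            return "flat"
        improving = (delta > 0) == self.higher_is_better
        return "improving" if improving else "worsening"

    def summary(self) -> str:
        status = "meeting goal" if self.meets_goal() else "missing goal"
        return f"{self.name}: current {self.current}, goal {self.goal}, {self.trend()} ({status})"


# Made-up metrics for illustration only.
availability = Metric("API availability (%)", goal=99.95, history=[99.80, 99.90, 99.93])
p95 = Metric("p95 latency (ms)", goal=250, history=[240, 265, 290], higher_is_better=False)
print(availability.summary())
print(p95.summary())
```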

Trunk and Branches Model

Sat, 05 Mar 2022 07:00:00 -0700

Early on in your company’s lifetime, you’ll form the seed of your infrastructure organization: a small team of four to eight engineers. Maybe you’ll call it the infrastructure team. It’s very easy to route infrastructure requests, because they all go to that one team.

Later on, things are easy as well. You have seventy engineers spread across eight to ten mutually exclusive and collectively exhaustive teams with names like Storage, Traffic, and Compute. You’ll pull up the organization’s service cookbook and get pointed directly to the right team for your specific problem.

Those are both stable organizational configurations, but the transition between them can be a difficult, unstable one to navigate, and that’s what I want to dig into here. I’ll start by surveying my experience helping to ramp Uber’s infrastructure organization, abstract that experience into a playbook, and end by discussing some arguments that folks raise against this approach.

Uber

When I joined Uber, the Infrastructure organization consisted of three teams (whose names were unhelpfully generic, so I’m renaming them a bit for clarity): developer productivity who worked on build and test (~4 engineers), storage engineering (~6 engineers) who worked on scaling real-time storage, and operations (~5 engineers) who did everything else to support the company’s ~200 engineers, ~2,000 employees, and ~400% YoY growth in both usage and engineering headcount.

The first two teams were focused on acute, critical projects: keeping the engineering team productive and sharding our data to ensure we didn’t exhaust the disk space on the largest-we-could-buy hardware supporting our primary database cluster. The third team, the one I joined as its engineering manager, was responsible for keeping everything else going while the first two teams addressed their urgent focus areas.

On operations, our immediate challenges were significant: our self-managed compute cluster ran out of capacity every Friday, leading to reduced availability (and at that point we were in a managed datacenter with limited capacity); our Kafka cluster was experiencing significant challenges with load; our Graphite cluster was frequently going down under load; the recently introduced move to a service-oriented architecture depended on our team doing one to two days of work for each additional service, with new service provisioning requests coming in daily; and we handled on-call for the entire company, with literally hundreds of alerts coming in most on-call shifts (it was not unusual for your phone’s battery to die during the 12-hour, follow-the-sun shift).

This was, objectively, a pretty difficult situation. That said, we started to work the problem:

We reworked our interviewing process to accelerate hiring. We knew if we hired behind the larger organization, we would fall even further behind as engineering headcount was a major input into the volume of incoming requests. We hired from 5 to 70, all external hires, over a two year period
We created a service cookbook so we could tag incoming requests to better understand where our time was going
We learned that service provisioning was our biggest source of time consumption, and it was a particularly consuming task because it required so many back and forth requests with the requesting team. We set up a request flow that required folks to supply all the necessary information along with their initial request. The volume was still overwhelming, so we hired an earlier career engineer whose initial project was to handle all incoming provisioning requests. This reduced interruptions for the wider team so that they could better focus on building an automated solution, but it also served as a backstop for service provisioning: if that engineer fell behind the incoming request load, we just went slower. As the team continued to grow, we spun out a services engineering team who fully automated the provisioning flow. About 15 months after I started, no humans were involved in service provisioning requests, which had now been migrated out of our initial data center into three new data centers
Three specific teams were placing significant and bespoke demands on the team. When we supported one team’s requests, they were always followed by even more requests. When we prioritized one team, the other two would be increasingly upset that we hadn’t prioritized them. When we prioritized any of these teams, the long tail of teams in the organization would be upset instead. To address this we spun up an embedded SRE function, where each of these high demand teams got two SREs that exclusively supported their requests, but they had to prioritize tasks to those SREs themselves. This became a deliberate bottleneck on the amount of one-off support we provided to those teams, creating space for us to innovate on more scalable solutions
Graphite, our metrics aggregator, was becoming overloaded with too many incoming metrics. There were simply too many incoming metrics from too many machines. We started by guarding Graphite behind a small pool of servers running a C reimplementation of statsd, which aggregated thousands of servers’ worth of metrics to four or five servers’ worth. We moved from TCP to UDP metric submission, and simply dropped the metrics we couldn’t process in a timely fashion. This allowed a baseline of stability, admittedly without much accuracy, while we worked to scale up the broader backend system. Eventually we lost confidence in Graphite’s scalability and spun off a team to build M3, which solved the operational metrics problem for Uber
In our configuration, Kafka was only generally reliable at shipping logs, rather than providing the at-least-once guarantee we required for some categories of logs. We did significant work stabilizing our Kafka cluster, and eventually spun out Kafka maintenance to a new team within the Data organization. That team invested heavily into Kafka, and our infrastructure became robust and reliable
We initially routed internal requests through an instance running HAProxy on every server. As the number of servers grew, these distributed instances performing health checks became a DDoS of their own. We reduced health checks, which bought us a few weeks of time. We added a health check cache running in Nginx on every host to intercept incoming requests. Eventually these solutions simply ran out of runway, and we spun off a team that built a tiered health checking infrastructure that checked each host O(1) times rather than O(servers*avg-number-services-per-host). That tiered health checking solution solved service routing scalability for our needs
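To see why the health checking item above ran out of runway, it helps to work through the arithmetic. The sketch below is a back-of-the-envelope model rather than a description of the actual system: it assumes every server’s local proxy checks every service on every host once per interval in the flat design, and that a fixed number of checkers probe each host in the tiered design. The function names and fleet sizes are hypothetical.

```python
def flat_checks_per_host(servers: int, avg_services_per_host: int) -> int:
    """Checks one host receives per interval when every server's proxy
    checks each of its services: O(servers * avg-services-per-host)."""
    return servers * avg_services_per_host


def tiered_checks_per_host(checkers_per_host: int = 1) -> int:
    """Checks one host receives per interval when a fixed number of
    checkers probe it and share the result: O(1)."""
    return checkers_per_host


# Made-up fleet sizes for illustration only.
for servers in (100, 1_000, 10_000):
    flat = flat_checks_per_host(servers, avg_services_per_host=5)
    tiered = tiered_checks_per_host()
    print(f"{servers:>6} servers: flat={flat:>6} checks/host, tiered={tiered} check/host")
```

Under these assumptions the flat design’s per-host load grows linearly with fleet size, so the fleet-wide total grows quadratically, which is why reducing check frequency or adding a cache only bought a few weeks at a time.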

That was a lot of work, which happened over the roughly two years that I worked at Uber, and we certainly did a bunch of other stuff as well: we also migrated out of our first data center, spun up (and down) two data centers in China, supported the deprecation of the original monolith, and so on.

The core organizational pattern was identifying the biggest emergency or largest source of incoming work, finding a way to provide a bounded level of quality of service, and focusing as much energy as possible on innovation cycles that solved the underlying problem. If the underlying problem was too large to solve in a few weeks, then once we had the headcount, we would spin out a new team with the solitary focus on solving that problem.

This wasn’t glamorous, these were two very difficult years, but it does illustrate how that core pattern of exchanging short-term low quality of service to provide long-term high quality of service can overcome remarkably challenging circumstances.

Rules of Scaling Infrastructure Organizations

Exchanging quality of service for investment bandwidth is a key tradeoff within an infrastructure organization, but it’s hardly the only one. Operating an infrastructure organization is maintaining a dynamic balance across many forces. You need to balance tech debt against morale. You need to balance iterating on the usability of your capabilities against delivering them before being crushed by an exponentially scaling problem tomorrow. You also need to balance your budget.

Working through those challenges, I’ve come to appreciate there are two fundamental rules (with two corollaries) to successfully operating this sort of organization:

Rule One: You must maintain service quality high enough that your leadership team doesn’t throw you out

Rule Two: You must maintain a sizable investment budget to prevent exponential problems from sinking your organization

Building on the two rules are these two corollaries:

Corollary One: If morale is too low, service quality and investment budget will both collapse (as folks leave with the essential context)

Corollary Two: If your budget is too high, it’ll get compressed (which makes everything else much harder)

If you can solve for all four of those, it’s a relatively easy job.

Trunk and Branches Model

The solution I’ve found effective for addressing the infrastructure organization rules is an approach I call the Trunk and Branches Model. You start with a “trunk team” that is effectively your original infrastructure team. The trunk is responsible for absolutely everything that other teams expect from infrastructure, and might be called something like “Infra Eng,” “Platform Eng,” or “Core Infra.”

As the team grows, you identify a particularly valuable narrow subset of the work. Valuable here means one of three things:

It’s an exponential problem that will overrun your entire organization if you don’t solve it soon; for example, test or build instability accelerating as you hire more engineers
It’s a recurring fire that is undermining your company with users; for example, database instability causing site outages
It’s an internal workflow that’s starving your team’s ability to make investments; for example, a clunky process for manually spinning up new services in a company accelerating service adoption

You then create a narrowly focused “branch team” that wholly takes responsibility for that subset of work. This might be a Storage team that is responsible for all real-time data storage and retrieval. This might be a Services team that is responsible for all service provisioning. This team is responsible for solving both the immediate and long-term problems associated with their area of focus. Providing operational support within their vertical ensures they are tightly connected to their users’ real problems. Sufficient team staffing to support investment allows them to solve problems through platforms and automation rather than linearly scaling the team’s staffing.

Each time the trunk team grows beyond six to eight engineers, split off another branch team to focus on whatever your biggest problem or opportunity happens to be. Keep doing this for a few years of rapid growth, and your initial infrastructure team will have grown into an infrastructure organization.

Now that we’ve summarized the Trunk and Branches model, it’s worth addressing how it handles the challenges highlighted in the Rules of Scaling Infrastructure Organizations section above.

The first challenge is maintaining sufficiently high service quality at each point of growth such that you maintain the confidence of your peers and leadership. This model ensures there is always a clear responsible team for incoming asks, and facilitates spinning out the highest burden asks into branch teams with enough staffing to solve the underlying need with sublinear staffing
The second challenge is maintaining a sizable investment budget to prevent unchecked growth of exponential problems. This model spins off branch teams to consolidate investments on the most valuable problems.
The third challenge is maintaining sufficiently high team morale to retain your team. Branch team morale is driven by the focus and staffing to solve high impact problems. Trunk team morale is driven by folks who enjoy fighting fires and one-off solutions like bonuses, increased PTO and so on. (These solutions are temporary because the trunk team disappears as the organization grows sufficiently large.)
The final challenge is giving you the flexibility to maintain a reasonable budget. Headcount budget is maintained by restricting the number of branch teams. Infrastructure budget is maintained by spinning out an infrastructure efficiency team if operating costs begin to grow too quickly.

This isn’t easy, and it requires making bets on the right branches, but in my experience it does consistently work as long as your company views infrastructure as an essential contributor to its success rather than a cost-center to minimize.

Operating the Trunk and Branches Model

Now that we’ve dug into the model and how it solves the underlying dynamic balance, there are a few operational aspects worth expanding upon:

The combination of trunk and branches must be mutually exclusive, collectively exhaustive. Many infrastructure organizations think they can simply “unown” critical work, but this doesn’t work. You’re better off having the trunk team explicitly own the area with a reduced service commitment than to have no official owner
Maintaining morale within the trunk team is an ongoing priority that requires active attention. The trunk team will eventually disappear as you build out branches, so you can do things that don’t work in the long run. Give team-specific bonuses for folks who stay on the trunk team for six months. Provide additional time off for the trunk team. Spend more time with them personally and celebrate them publicly
It’s ok to have significant intensity for a given team at a given point. I’ve consistently found that teams rise to meet temporary adversity. Where teams, and morale, suffer is prolonged exposure to adversity for a given group. This model shifts adversity by spinning out branch teams (to take adversity off the trunk team) and staffing the branch teams (to invest their way out of adverse conditions). If you pick and choose components from the model without ensuring that adversity rotates, then it won’t work out very well
Only add branches when the team sizing math works. The trunk team must never shrink below six to eight engineers. The new branch team should have at least three engineers. All existing branch teams should have at least five engineers. If you can’t properly staff a new branch, then it’s better to move work across teams (e.g. expand scope of an existing branch) than to create a new one. Each branch needs to both operate existing infrastructure and invest into a replacement, which depends on a decent level of staffing, otherwise you’re not actually resourcing them properly to dig out, and this isn’t going to work
If you urgently need more branch teams than you can staff according to the above rules, then you have a headcount planning problem which you should address directly rather than by attempting to spin out understaffed teams
Inspect new branches to ensure they’re investing into a scalable solution rather than manually working through the problem. Each branch needs to scale their solution with a sub-linear investment of headcount. Watch carefully to ensure that’s happening
You cannot replace the trunk team with a rotating on-call. This will sort-of work early on, but eventually the number and complexity of the systems to maintain will be too high. You’ll end up having shadow on-call rotations (“Call Laura, she’s the only one who knows how PostgreSQL really works.”), prolonged incidents due to lack of context (“I thought we could just restart that!”), and confusion about who is responsible for paying down the most urgent problems. This will cause you to under-deliver on service quality, violating the first rule of infrastructure organizations (“you must maintain sufficiently high service quality”)
You cannot replace the trunk team with a team staffed with a rotating membership. This works a bit better than only having a rotating on-call, but it struggles for all the same reasons
If you’re concerned you’ll need an unreasonable number of branch teams, then explore if you’re underutilizing vendors. This is your best tool for managing headcount growth to meet headcount budget expectations
Trunk team is usually one team, but in some cases you may find it’s easiest with two teams: a centralized trunk team and an embedded trunk team that supports your heaviest consumers of capacity. In this case the embedded model is about providing higher perceived quality of service while reducing support and forcing the requesting team to self-prioritize their asks

There are certainly more operational details worth considering, but if you start with these you’ll be on a good path.
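The team sizing math from the list above can be expressed as a small check. This is a sketch under stated assumptions: it takes the low end of “six to eight” as the trunk minimum, and it assumes the new branch is staffed entirely out of the trunk team, which the text does not require. The function and parameter names are hypothetical.

```python
def can_spin_out_branch(
    trunk_size: int,
    new_branch_size: int,
    existing_branch_sizes: list[int],
    min_trunk: int = 6,            # low end of "six to eight engineers"
    min_new_branch: int = 3,
    min_existing_branch: int = 5,
) -> bool:
    """Return True only if every team stays properly staffed after the split."""
    trunk_after_split = trunk_size - new_branch_size  # assumes branch is staffed from trunk
    return (
        trunk_after_split >= min_trunk
        and new_branch_size >= min_new_branch
        and all(size >= min_existing_branch for size in existing_branch_sizes)
    )


print(can_spin_out_branch(trunk_size=10, new_branch_size=3, existing_branch_sizes=[5, 6]))
print(can_spin_out_branch(trunk_size=8, new_branch_size=3, existing_branch_sizes=[5]))
print(can_spin_out_branch(trunk_size=10, new_branch_size=3, existing_branch_sizes=[4]))
```

When the check fails, the guidance above still applies: expand the scope of an existing branch, or treat it as a headcount planning problem, rather than spinning out an understaffed team.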

Even Good Solutions Have Flaws

Having deployed the Trunk and Branches model at both Uber and Stripe, I’ve run into a number of concerns from folks who believe it doesn’t work or that it’s an unreasonably painful way to operate. In this section, I want to address some of the most frequent concerns. I wholly agree with these identified problems–it’s a deeply imperfect model–but proposed alternatives usually superficially address the fundamental tradeoffs: all approaches have flaws, but good approaches work.

The most common concerns are:

“Working in the trunk team is too difficult to retain engineers.” I touched on this above, but this is a real challenge that requires leadership focus. Some folks love the lightly controlled chaos on a trunk team, but others hate it. For the latter, you may need to rotate them out of the team after six to twelve month stints. You may need to offer a bonus stipend to folks on the trunk team. You may need to offer increased time off. No matter what else you do, you’ll need to spend time communicating how valuable their work is directly to the trunk team and consistently in each of your wider communications to the organization. This is hard, but it’s doable with attention and creativity
“It’s inequitable to concentrate the burden on the trunk team.” I’m deeply sympathetic that it’s uncomfortable to ask the trunk team to absorb the long-tail of obligations while allowing the new branch teams to focus. This does feel unfair. However, your obligation as an infrastructure leader is to guide the organization out of the unbalanced mode of operation. Preserving an unstable operating mode to maximize short-term equality is a short-sighted path that prefers “everyone is permanently in a difficult working scenario” over “everyone is permanently in a good working scenario” to avoid a fixed-length period of interim complexity. I just cannot understand that mentality! Commit to the transition and then work to ameliorate the interim period’s challenges
“Innovation teams shouldn’t be burdened with operational concerns.” This concern is generally raised by folks who want to be on an innovation team that only does investment work. They view operational work as second-class work that would distract truly innovative engineers like themselves from the most rewarding, impactful work. My experience is that innovation teams who aren’t exposed to the operational concerns of real systems tend to build the wrong thing. Exposing branch teams to a concentrated set of operational concerns within their scope exposes them to their customer and their customer’s real problems. This significantly derisks execution and takes some burden off the trunk team. I understand how folks land on this perspective, but I continue to view it as a self-serving perspective rather than one that contributes to company, organization, or team success
“Just hire Site Reliability Engineers to solve this.” In modern companies, SRE is a software engineering role with specialized expertise in some aspect of running complex systems (reliability, scalability, etc). Following that definition, SREs can be a critical part of both trunk and branch teams. However, I find that folks who raise this concern tend to view SREs as operational capacity to offload manual work off “higher value” infrastructure engineers that can automate workloads. In some cases adding manual capacity to your team is a valuable strategy, but introducing a new role is a burdensome solution to what ought to be a temporary problem if you’re maintaining an appropriate investment budget
“This only works in a very fast growing organization.” One of the gifts of rapid growth is that it’s very easy to identify problems because they get so bad, so quickly. Slower growing companies go awry more gently, which can be harder to diagnose. This model does make a general assumption about headcount growth–that it goes up–and although it technically fits an organization without headcount growth (you spin off a fixed number of branch teams), it’s not particularly interesting, and you’ll need to introduce some mechanism for reprioritizing branch teams (and potentially for reconstituting their membership)
“This isn’t ambitious enough for an organization with slow growing technical challenges.” I generally agree with this critique, although with sufficiently slow growing technical problems, there’s little incentive for moving beyond the initial infrastructure team. Trunk and Branches doesn’t have much of anything to say about that scenario

Despite all those concerns, and having deployed the trunk and branches model twice, I still think it’s the best available option to operate with when you find yourself scaling a small infrastructure team into an infrastructure organization.

Utsav Shah

Sun, 27 Feb 2022 15:00:00 -0700

Interview recorded in late December, 2021. Learn more about Utsav on Twitter, LinkedIn, and his newsletter/podcast.

Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do

I work at Vanta. We’re a continuous security monitoring and compliance automation platform. The vision of the company is to move the software industry away from “point in time” verification and towards continuous verification of security. For example, when you buy software from a vendor, you generally send them a security questionnaire and/or you ask them for their SOC 2 audit findings, but that’s a point in time, often outdated, representation of their security. A continuous monitoring system that checks on your security posture is a far better way to manage security.

My role is the Tech Lead of the newly formed Platform team. We have a bunch of product engineering teams and then this slightly different team that’s in charge of the non-product engineering work. I think of us as working on aspects like reliability and security that are directly impacting business, but also areas like developer tooling that help the velocity of the EPD organization. Stuff that needs to happen on an ongoing basis, but doesn’t really fit into the charter of a single product engineering team.

What was the original motivation for creating Vanta’s platform team?

When we were a slightly smaller company we had a certain amount of bandwidth for foundational engineering work that was split up across the entire engineering team. For example, 30% of engineering time was spent on foundation tasks, like making sure we’re upgrading third party dependencies, right sizing queues, following up on security tickets, and so on. That worked well when we had ten to twenty engineers, but didn’t work as well as we grew in headcount.

As the teams grew bigger, engineers started to lose focus across too many different things for a two week sprint. One engineer might be trying to ship a feature for a deadline with a product manager, tuning a MongoDB index, and working on tasks that slipped from the last sprint. There’s also the fact that some engineers naturally gravitate towards more platform-type work, and want to do it full-time. A dedicated team can form roadmaps for longer-term projects that might be harder to do in a decentralized setting.

Eventually we decided to go from the fixed percentage of bandwidth for all engineering to form a dedicated team focused on Platform.

Platform and infrastructure engineering cover a lot of space, and every company thinks about them a bit differently. For example, I once had a peer join who immediately told me I needed to hire SREs to take over his team’s on-call because the product engineers didn’t want to do it anymore. That was how things worked at his previous company, but it was pretty misaligned with the path we were pursuing. How do you figure out the right boundaries for your team?

We try very specifically to not be the team that’s in charge of everything that’s performance related, reliability related, etc. Instead, we’re trying to be the team that holds a high quality bar for our engineering practices and execution. That means sometimes we’re going to do some direct work, and sometimes that’s partnering with other teams on it.

To your question’s example, I think one thing that has been conflated with the DevOps movement, and a headache of mine, is that product engineers should be aware of everything to do with their infrastructure. It’s nice for people to specialize. For folks who are interested in solving memory leaks or database indexing issues, it’s good to have those people thinking about the problem holistically, seeing patterns across services, and being genuinely interested in that. For others it’s better to give them tooling that makes the problem easy for them to solve. Product engineers could spend their entire day trying to fix a database indexing issue that would take a specialist a few minutes. The right setup really depends on the scale of your company and the scale of your team.

The opposite problem exists as well, with infrastructure teams taking up a lot of product engineering time to work on migrations, package upgrades, or whatever. Just like product engineering teams shouldn’t dump problems on infrastructure engineering teams, infrastructure teams shouldn’t require too much from product teams. If an infrastructure migration is going to take 20% of product bandwidth, the product team should be able to say “no” or at least, “not right now.” It’s a hard balance because the lack of consistency or standardization in your infrastructure is not something you want to be stuck with indefinitely.

Coming from a larger company like Dropbox with a very mature infrastructure organization, was there anything surprising about getting started on Vanta’s infrastructure team?

Dropbox was a really interesting case. It’ll be helpful before answering if I talk about my background at Dropbox a bit. I was the TL of the Developer Effectiveness team for some time. We were responsible for version control, code review, and continuous integration (CI) systems, and all the surrounding infrastructure.

That was a big focus of my career, and what I used to think about, for a really long time. I was in that role for a couple of years and then moved to the Application Services team which was responsible for the Dropbox monolith. I started working on that team right after a stalled service oriented architecture (SOA) migration. At that point it was clear that the monolith was not going away anytime soon, and that we needed to make it effective to work in it.

At Dropbox, I had to think about making a thousand engineers productive, which is very different from the work I’m doing at Vanta with a smaller team. With a thousand engineers, you have all these different pockets: product engineers, mobile engineers, server engineers, desktop engineers, infrastructure engineers, and so on. We wanted to support all of them, but we had a limited amount of headcount and budget. We’d try to understand what everyone’s priorities were and pick from there. Mostly, there were more specialized teams, like Client Platform, which would focus on only client developers, and we could work with their needs, rather than talk to these different sets of users directly.

I learned a few things when I was thinking about developer productivity every day for a few years. You often hear grumbling about tech debt from engineers, but it’s useful to understand the nature of tech debt in order to prioritize and tackle it effectively. One thing to think about is that it’s exponentially easier to fix tech debt closer to when it’s introduced. Thinking about flaky tests as an example, when you introduce your first flaky test into a codebase it’s easy to see what change caused it and how to fix it. But it’s much harder to sort that flaky test out a year later when there are dozens of other flaky tests and no one is actively working on that code. Every single time someone does a merge after introducing that flaky test, it can result in a failed build that creates development friction, and the problem compounds over time in a negative direction.

My experiences have led me to this notion that there are some things in terms of technical quality that are “high interest” tech debt and others that are “low interest” tech debt. You want to focus on the high interest end of things and fix them even if they don’t immediately cause problems.

One of the biggest examples of high interest technical debt is circular dependencies. It’s hard to fix them down the road and you end up creating a bigger and bigger tangle if you don’t prevent it from being merged. At Vanta, we had a few circular dependencies, and could get them fixed in a few weeks. Now we don’t have any. On the other hand, at Dropbox we had three or four projects over the course of five years trying to remove all circular dependencies in our monolith and it got resolved only after a lot of effort and pain.

The challenge with these is that the impact of poor technical quality is informed by experience, and not easily quantifiable. Circular dependencies are obviously a huge problem once you’ve dealt with them, but not so obvious early on. This is different from things like reliability, which are much easier to graph on a dashboard. That’s why it’s crucial that you have engineers who have experience and care about technical quality on your teams, so that they can understand the impact of decision-making that leads to technical debt, and they can correct for it.
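One way to keep circular dependencies from being merged is to run a cycle check over the module dependency graph in CI. The following is a minimal sketch of that idea, not a description of the tooling used at Vanta or Dropbox. The module names and the `find_cycle` helper are hypothetical, and it assumes the dependency graph has already been extracted into a plain dictionary.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of module names, or None."""
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {node: UNVISITED for node in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = ON_PATH
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == ON_PATH:
                # dep is already on the current path, so this edge closes a cycle.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(deps):
        if state[node] == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical module graph: each key imports the modules in its list.
modules = {
    "billing": ["users"],
    "users": ["auth"],
    "auth": ["billing"],  # closes the cycle billing -> users -> auth -> billing
    "reports": ["users"],
}

cycle = find_cycle(modules)
if cycle:
    print("Circular dependency: " + " -> ".join(cycle))
```

A CI job could fail the build whenever `find_cycle` returns a cycle, which keeps the fix close to the change that introduced it.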

Going back to Dropbox for a second, you mentioned the challenge of supporting 1,000 plus engineers. That’s really hard, there are so many different projects you could work on to improve security, reliability, developer productivity, and so on. How did you figure out what to work on?

Yeah, that’s a hard one. The developer effectiveness team’s goal was to be a force multiplier for the rest of engineering. What can the five or ten of us do to make the other 990 people in engineering more productive? We’d start prioritizing by developing an instinct around what good or bad development loops look like. For example, a build that takes 180 minutes is obviously bad compared to any experience that developers have outside the company. How much work would it be to get to 15 minutes? 90 minutes?

We also asked people what their biggest problems were using a good SaaS survey tool. This helped us divide information by cohorts so we can say something like “engineers who’ve been here for one year but had five years of previous experience are really frustrated by XYZ.” Conversely, we’d often see folks who’ve been at Dropbox for a while no longer noticed a problem because they got used to it and figured out some set of workarounds. It’s still a problem, they just don’t notice it anymore. Then you talk to people directly to understand the core of their problems.
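The cohort breakdown described above can be sketched in a few lines. This is an illustration only: the response fields, the two-year and five-year thresholds, and the 1 to 5 satisfaction scale are all assumptions, not details of the survey tool used at Dropbox.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical responses:
# (years at company, years of prior experience, topic, satisfaction score 1-5)
responses = [
    (1, 5, "build times", 2),
    (1, 6, "build times", 1),
    (6, 2, "build times", 4),
    (7, 1, "build times", 4),
]


def cohort(years_at_company, years_prior):
    """Bucket a respondent by tenure and prior experience (assumed thresholds)."""
    tenure = "new" if years_at_company < 2 else "tenured"
    background = "experienced" if years_prior >= 5 else "early-career"
    return f"{tenure}/{background}"


def mean_score_by_cohort(rows):
    """Average the satisfaction score for each (cohort, topic) pair."""
    scores = defaultdict(list)
    for years_here, years_prior, topic, score in rows:
        scores[(cohort(years_here, years_prior), topic)].append(score)
    return {key: round(mean(values), 2) for key, values in scores.items()}


for (group, topic), score in sorted(mean_score_by_cohort(responses).items()):
    print(f"{group:22} {topic:12} {score}")
```

A large gap between cohorts on the same topic is a prompt to go talk to people directly, since long-tenured engineers may simply have stopped noticing the problem.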

When we looked at all the possible areas to work, we’d look for projects where the solutions solved things for multiple areas and got us compounding leverage. For example, faster builds would improve developer workflows, and also reduce our overall spend which would make the finance team happy, and we wouldn’t have to think about budget for a while.

Alternatively, a workstream that would improve our developer experience in the short-term, but also unlock longer-term benefits and ideas to make even bigger bets. For example, it was clear to me that a monorepo to share our server and client code was a good idea, but it was infeasible to merge these repositories, given how much slower git felt on the larger repository. So working on speeding up git would not only make server developers more efficient, we’d also be able to approach the monorepo conversation again.

One thing that frustrated me was actually the scale of the organization. Scale sounds like a lot of fun to work with, but in actuality, many smaller scale approaches, like outsourcing some concerns to a SaaS tool, would not be feasible. For example, I tried to set up an evaluation with GitHub to migrate our version control from self-hosted systems to GitHub Enterprise, but their solutions engineers told us that we wouldn’t have a good experience migrating, since our repositories were too big. I was willing to set up some kind of repository size reduction efforts, since GitHub was a popular choice internally, but the recommended size at the time was simply infeasible: we would have to reduce our repositories to 1/100th their size. At the same time, building our own GitHub was certainly a terrible idea, so we were stuck with our existing systems.

Touching on another thing you mentioned before, I love the topic of service migrations because there’s so little industry consensus on what the right path is. Pretty much every company over a certain size has an involved story of attempting to migrate away from their original monolith codebase, but many fail or succeed without a net reduction in problems as Kelsey Hightower captures in Monoliths are the future. What’s your experience been, and how do you decide which of these sorts of lessons to bring forward to a smaller company like Vanta when you’ve been working at a larger one like Dropbox?

You don’t want to be extremely opinionated when you get to a new company because you don’t understand the context of why certain decisions are made. This is just like product management, where you need to understand the company’s true problems by digging deeper. For example, at Dropbox, availability was extremely important because it’s a B2C-ish product with people using it at all times of day and across many countries. People depend on Dropbox to get their work done so it needs to be available at all times of the day.

For a B2B company like Vanta, availability is still important, but just not as much. Instead, other things are equally important for business continuity as they were at Dropbox, like security and data correctness. One way of framing the problem is understanding the metrics or SLAs that other parts of the business/CEO actually care about to avoid prioritizing the wrong pieces.

At Dropbox, we had a very complex push process to reduce the risk of a deployment causing downtime, but we don’t need to do the same thing at Vanta because we care about different things. Finally, some of those ideas behind the process still apply, like empowering teams to make their own decisions and not be blocked due to other teams. These underlying principles, like letting teams operate independently, are important.

Absolutely! This is part of why I love the monolith versus services discussion. Even over the past decade there have been distinct inflection points between the belief that monoliths are good, monoliths are terrible, and then monoliths are good again.

This might be a hot take, but when reading The Phoenix Project, I thought while the DevOps movement was good based on how things were when it became popular, some of the ideas haven’t aged well. Maybe it used to be very common for developers to push problems to IT or Infrastructure teams, which was a problem when the service was so badly implemented that it had to be restarted every four hours or whatever.

However, I think we’ve gone a bit too far with every product engineer needing to know the complexities and intricacies of how the Kubernetes scheduler works. Generally product engineers are focused on (and enjoy) shipping useful features for users rather than the underlying infrastructure, and we should enable that. We also generally don’t interview product engineers on e.g. Kubernetes scheduling, so we shouldn’t be surprised when they aren’t knowledgeable or care about those topics.

We do want product engineers to be aware of the implications of their code, but as much as possible we should abstract them from underlying details. I think monoliths are part of the solution there. One key idea of monoliths is that you don’t have to think about deployment strategy or release cycle, underlying compute requirements, capacity planning, auto-scaling, it just works. Someone else worries about that for you, and that person is thinking about it deeply.

If you move towards a services oriented architecture that owns its software top to bottom, then often someone on every eight-person team has to think about these problems deeply, which isn’t very efficient in a large engineering organization.

That’s a great point. Something that has harmed many teams’ adoption of DevOps practices is that many DevOps practices are described specifically in the context of smaller teams, say twenty or thirty developers, but are applied too literally by leaders at companies with much larger teams. There’s a lot of nuance to good practice. Even adopting good practices doesn’t necessarily work if you apply them without factoring in the context.

Yeah, there’s no cookie-cutter solution, which is what makes our field a bit challenging. The best way to learn from others’ experiences is often developer blogs. Reading Mike Bland’s blog on driving organizational change at Google was extremely informative for someone in my role.

Successful infrastructure engineering organizations think a lot about empowering developers. We talk about it enough that folks tend to have good “default ideas” about these topics: “of course, we empower our developers!” and so on. I’ve been trying to peer past those defaults a bit with the next question: What would happen if your entire team went on vacation for an entire month without their phones, computers or email?

I would like to believe that things would keep running for the short to medium term. The goal of the Platform team is to preserve and improve engineering quality. That means things like the quality of the product itself, the site’s stability, our security, and so on. So you shouldn’t have a major outage because we disappeared for a month, and you could still respond to a stability incident without us, maybe a bit slower than you normally would.

But then, there might be a new vulnerability or class of vulnerabilities that appears and requires cross-functional work to be efficiently resolved (e.g. Log4j). Maybe a product engineering team needs to build a new system that needs additional isolation because of its capabilities, and isn’t sure whether it’s safe to roll the service out or not.

Ideally, the company’s leadership team wouldn’t notice our absence in the short-term, but the engineering team would. The system wouldn’t move in the right direction and the developer experience might start feeling worse and worse due to accruing complexity. Eventually, things would fall apart due to poor quality. This might be due to a data breach, or immense amounts of tech debt that causes a vicious cycle of developers leaving for greener pastures.

One idea you mentioned there is the idea that platform or infrastructure teams do security work. How should those teams think about doing security work?

In some ways security is a similar challenge to developer tooling. It’s hard to measure the security of your systems effectively, just like it’s hard to measure the productivity of your engineers. In my opinion, platform teams should split their time fixing security issues and building infrastructure to reduce the incidence of security issues, with a greater percentage going towards the latter over time. They should use their experience to fix things and to learn from the broader industry to figure out themes that they can use to prevent issues from ever happening.

For example, as a security team, you could either spend all your time fixing vulnerabilities in container images and be frustrated that updates keep causing new issues, or learn about tools like Distroless and migrate teams to using such tools. Both kinds of work solve the same problem, but it’s clear to me that a platform team is thinking about that solution, because a product team - rightfully - is thinking about customer delight, not distroless containers.

One specific security question I’ve been thinking about a lot lately is supply-chain attacks. How have you thought about the balance between developer productivity (allowing folks to use new packages, upgrade packages quickly) versus security (not allowing untrusted packages and package versions) as they relate to supply-chain attacks?

My fundamental belief here is that there are some software ecosystems and communities that have a culture of using many, many third party dependencies. Node.js with left-pad is a classic example. Those ecosystems make it harder to write secure code. The alternative is that ecosystems with strong standard libraries require far fewer external dependencies, which makes it easier to rely on dependencies you have high trust in. For critical components, you should pick a language ecosystem or tooling ecosystem that aligns with your goals.

Of course, this isn’t feasible for someone who already has a production application, so then you have to think about how to find a layered solution for your needs. For example, isolating parts of your workload, reducing the scope of secrets that each service needs, preventing egress access from your app to the whole internet – there are ways to reduce risk in tricky situations.

The industry is finally catching up on tooling and products that help with continuous monitoring, which is something that Vanta helps with. Even AWS has come a long way with Amazon Inspector which catches some third party dependency vulnerability issues. These sorts of tools need to be integrated into your workflows.

Related to security work, I also want to ask about who should be responsible for compliance within engineering. Oftentimes you end up with a surprise compliance deadline, often to land some sort of enterprise customer, and this work gets routed to whoever can do it as opposed to being routed on the basis of long-term alignment, and as a result oftentimes infrastructure teams end up doing much of the compliance work. Where should compliance work happen?

I think that’s a great question, and something you need to think about before working on security/compliance. For example, when Dropbox went public all of a sudden we had these SOX audits show up, which were really tricky. They introduced a bunch of controls on any code that touched financial data, in particular ensuring those changes were all reviewed by the engineering team responsible for financial data. Compliance felt like a big, scary buzzword. Oh, you don’t want to be out of compliance, especially when you’re sitting in a room with the auditors and the compliance team.

That said, compliance is really about the minimum bar your company needs to meet, not the target you should be aiming for. It’s also a lot more nuanced than many engineers realize. Some pull requests weren’t reviewed pre-merge at Dropbox even after we went public, even though it seemed like a hard-and-fast control in several compliance frameworks. Compliance is always a conversation between the teams involved, not strict, specific rules. The goal is to minimize risk and to show you have repeatable processes to reduce risk, not a series of top down mandates.

It makes sense to me for the team working on engineering quality to do this sort of work, but it depends. It’s really helpful to have an enterprising Product Manager or two that is able to demystify the compliance process for engineers, since it can get confusing.

There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?

I think the goal of infrastructure is sort of a yin and yang between being a force multiplier for engineering and upholding a high engineering quality bar. I’ve personally found it not that hard to demonstrate and measure some aspects of infrastructure, like reliability. Talk to the sales team and figure out what commitments would make it easier to sell our software. How much would it enhance the sales process if we went from 99% to 99.9% uptime? Security work like ensuring we can remediate vulnerabilities in a certain number of days is also covered in contracts written for enterprise customers, and many auditors even require pen-test reports, so it gets covered in compliance requirements.
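To make the 99% versus 99.9% comparison concrete, the allowed downtime behind each commitment can be worked out directly. This is a small arithmetic sketch that assumes availability is measured over a 365-day year; real contracts often measure per month or per quarter and may exclude planned maintenance.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")

# 99%   allows about 87.60 hours per year (roughly 3.65 days)
# 99.9% allows about 8.76 hours per year
```

Moving from 99% to 99.9% cuts the downtime budget by a factor of ten, which is the kind of concrete number a sales team can weigh against what customers are asking for.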

What’s harder to measure is the idea of force multiplication of engineers. How do you quantify that? In some ways it’s like the product problem of measuring “customer delight.” We can certainly show that we’ve reduced deployment from twelve steps down to three steps, but no one outside of engineering will necessarily care. NPS scores seem silly when internal developers have vendor-lock-in to your internal tools, and there’s no alternative to compare against.

We also ran a survey across engineering to find the biggest problems slowing engineers down, and used that to prioritize that work. Even if we didn’t have a clear productivity metric, at least we could directly connect our work to a valuable problem.

It was funny, at Stripe one of the goals we set for developer productivity was “a given theme doesn’t stay in the top three concerns surfaced by developer survey for more than six months” which was a somewhat awkward attempt at acknowledging that dynamic.

Yeah, exactly. When we surveyed engineers, the number one problem was always documentation. It was the clear number one for the three or four years that I was looking at those surveys. How do you solve the problem of documentation at scale?

It’s not just identifying a great tool, we already had three tools for documentation that folks weren’t using. You need to find a way to embed the culture, just like creating a culture of unit testing. Ultimately, it felt like a situation where you needed to pick and choose your battles, and this wasn’t one we picked, because it felt like boiling the ocean.

When I talk to infrastructure leaders, there’s often a strong orientation around structure and process, e.g. how do we pick the twenty valuable projects to prioritize this year, and how will we do it again next year? Conversely, I’ve sometimes wondered if there’s often one specific project that would be more valuable than all the process and all the somewhat-valuable projects that get done. Do you have any examples of exceptionally high impact infrastructure projects?

Yeah, that’s a good question. One issue is that it’s harder and harder to get those projects as a company matures. At some point there are few low hanging fruit left. Each large project always had enough complexities to untangle that the effort calculation went up.

I didn’t work on this myself, but I’ve heard from anecdotes that Dropbox moving to their own in-house database system was transformational. It shifted from a paradigm of infrastructure engineers running every database migration and being blocked on changes, to enabling product engineers to run their own migrations. This was a step change improvement in developer productivity.

Another interesting project that I had no part in, but heard a lot about, was adopting pre-commit testing before merging into the main branch. Before that, changes got merged in before tests were run, and the build would break all the time. Pre-commit testing on its own had limited ROI since changes in one repository could break tests in another repository and there were enough changes like that that the build would break very often. Eventually we got down to three major repositories and the merge queues started working well. It’s interesting that over time, running all tests pre-merge became a fool’s errand - does it really make sense to test every desktop client change with the 10+ operating systems that Dropbox supports? - and we had to work on smartly reducing that set and instead set up automatic reverts of commits that broke the build. The goalpost of a good developer experience kept changing as the size of the team grew, and that’s what made the work so interesting.

Relatedly, I’m a really big advocate of merging repositories, which is easy when a company is small but very hard once the company gets larger. I worked on but didn’t finish that project at Dropbox, someone else took it over, and a big part of what they did to make it work was getting Git to be fast for large repositories. Merging repositories is one of the largest impact projects I’ve seen.

Git performance is a funny topic for sure, and is a good example of how slow technology reputation problems are to resolve. Sort of like people who insist on rotating passwords every six months for compliance, there are people out there who insist Cassandra is terrible because the early versions of Cassandra were pretty rough.

Yeah, MongoDB had a very similar problem. It has changed completely over time but the image sticks. It’s still not perfect, but it’s probably fine for your startup if you’re stuck with it. I wouldn’t choose it as my first option, but that’s mainly due to the query language and lack of reasonable joins, not other factors.

Yeah, that’s funny. At Stripe, we used MongoDB for essentially everything, and MongoDB is a very capable system that is very tunable to your specific tradeoffs. However, it often felt like half the incoming engineers immediately wanted to replace MongoDB based on its decade-old reputation.

Moving on to the last question, what are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?

In the first or second week of my job at Dropbox, my boss, who ended up being the VP of infrastructure, gave me the book The Effective Engineer by Edmond Lau. It’s an incredible book on how to improve your personal effectiveness as an engineer, and how you should think about creating leverage and iteration speed. It helped me realize how to be a better individual contributor and also gave me a way to think about helping other engineers be more effective.

I think infrastructure is easy to think about in terms of standard business metrics like reliability, security, and so on. But your challenge is really not just improving those metrics, but improving those metrics in a way that doesn’t reduce the productivity of all your coworkers. That means that developing a sense of empathy and a sense of how other engineers can be effective is important to your success.

Some other books I’ve found helpful are Designing Data-Intensive Applications and A Philosophy of Software Design.

That said, the best advice I’ve gotten was from a senior engineer at Dropbox who I used to work with. He worked on Vitess and other systems that operated at very large scale, and I expected his advice to involve a lot of clever tricks to improve scalability. But his biggest advice was that complexity was the real killer at scale, and that complexity begets complexity over time. Just focusing on simplicity to keep systems maintainable and scalable over time is what you need, and that’s probably what’s helped me the most in my career. Remove stuff whenever possible, keep things consistent, and prevent spending new innovation tokens unless you really need to.

Developer Productivity Survey

Tue, 15 Feb 2022 07:00:00 -0700

Example survey, Example analysis

While you should rely on your organizational metrics to measure developer productivity, quantitative measurement will sometimes miss important context. For example, you might be proud of how the backend developers are having a great time with their CI/CD, only to realize that the iOS engineers hate their release process that isn’t instrumented in any of your dashboards. A Developer Productivity Survey is an effective tool to bring qualitative feedback into your planning process and reduce your risk of meeting your metrics while missing your goal.

Most organizations run these surveys twice a year, with a focus on answering two questions:

Where should you prioritize your efforts next?
What unexpected problems or opportunities were identified?

For modestly sized companies (e.g. fewer than thousands of engineers), surveys are often not the most effective way to measure your organization’s productivity, but they are a phenomenal way to better understand the developers you support.

Example analysis of Developer Productivity Survey

There is a wide range of topics you can choose to cover, and you’ll want to balance focusing on areas you know are important with creating space to learn things you don’t already know. Some of the topics that these surveys often explore are: codebase quality, migration impact, code review experience, testing experience, CI/CD experience, and documentation.

You’ll have to figure out the right questions for your current organization, but as you design the survey remember the golden rule of survey writing: only collect information whose value you can show to the folks sharing their time to answer it.

Other Perspectives On Developer Productivity Surveys

Inspiration for Questions to Ask

Before You Start

Developer Productivity Surveys are a useful tool for determining what to focus on next. However, keep in mind that each time you send a survey, you’re entering into an implicit contract with the folks that you send it to: you will do something to address their concerns.

No one expects you to address all their concerns, but you should not send a survey if your schedule is already committed with other, unmovable work. If you do, you’ll find that you get fewer and fewer valuable responses, and that your survey becomes less valuable over time.

How to Use

Explore broadly for topics to consider focusing on for this survey. Remember that you’re trying to identify the areas to improve and check if your previous work has improved areas as intended. You are not trying to measure productivity; you have metrics for that. For inspiration, look at others’ surveys (ex: one, two, three), or recent findings on developer productivity like State of the Octoverse or Accelerate State of DevOps Reports
Review your previous surveys, if any, and identify any questions where it would be valuable to have an ongoing dataset. You will need to make tradeoffs between the value of ongoing datasets (created by asking the same question in multiple surveys over time) and the desire to tweak a question
Determine the segments you’ll want to capture to understand the data. For example, web, mobile and backend engineers almost always have different toolchains and release processes. If you treat them as a uniform group, your analysis is likely to lead you down a confusing path. Some segments to consider are: primary toolchain, time at company, time in industry before joining this company, current role (engineer, engineering manager, etc), and current level (engineer, sr engineer, etc)
Combine the questions you’ve identified into your survey. Use the tool of your choice; Google Forms and Airtable are both good options. Review this example survey for suggestions. (You’ll need to create your own as Google Forms are not easily cloned.)
Review the proposed form with three or four engineers on your team. What is confusing? What questions were they unable to answer? Revise your questions based on their feedback
Announce your survey across communication mediums (e.g. Slack, email and your Engineering All Hands), and over the duration of your survey (e.g. when it launches, a week later, and a few days before it closes)
Once the survey closes, analyze the responses. Take inspiration from this example analysis. Don’t spend a single second worrying about whether you agree with the feedback; instead, focus on identifying what’s new. Prefer looking at segmented data, particularly segmented by primary toolchain, rather than the entire corpus. Segmented data gives a much clearer view into the reality underlying the data
Within your team, select 2-3 areas that you’re going to prioritize improving before the next survey runs
Communicate the selected investment areas, how you’ll measure improvements, and your target improvement to the folks who you asked to take the survey. Showing impact is the best way to incentivize future participation
Now you’re done! Sort of, anyway. You need to actually make those improvements you’ve committed to, and you’ll be running another survey soon!
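To make the segmentation step concrete, here is a minimal sketch of tallying the biggest blocker per primary toolchain. The field names (toolchain, biggest_blocker) and the sample responses are hypothetical; they are not taken from the example survey or analysis.

```python
from collections import Counter, defaultdict

# Hypothetical survey responses; field names are assumptions for illustration.
responses = [
    {"toolchain": "backend", "biggest_blocker": "code review"},
    {"toolchain": "backend", "biggest_blocker": "code review"},
    {"toolchain": "backend", "biggest_blocker": "documentation"},
    {"toolchain": "ios", "biggest_blocker": "release process"},
    {"toolchain": "ios", "biggest_blocker": "release process"},
    {"toolchain": "web", "biggest_blocker": "testing"},
]


def blocker_share_by_segment(rows, segment_key="toolchain", answer_key="biggest_blocker"):
    """Return {segment: {answer: share of that segment's responses}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[segment_key]][row[answer_key]] += 1
    return {
        segment: {answer: n / sum(tally.values()) for answer, n in tally.items()}
        for segment, tally in counts.items()
    }


shares = blocker_share_by_segment(responses)
for segment, answers in sorted(shares.items()):
    top_answer, top_share = max(answers.items(), key=lambda item: item[1])
    print(f"{segment}: {top_answer} ({top_share:.0%})")
```

In this made-up data, the release process is only a third of the responses across the entire corpus, but it is every single iOS response. That is the sort of signal that disappears when you treat all engineers as a uniform group.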

Tips

Many organizations set goals related to Developer Productivity Survey results. This is a great practice when you connect it to the specific areas you’re going to invest in during your current quarter or half, e.g. “0% of Java developers rank build times as their biggest blocker (down from 40% in Q1)”. On the other hand, it’s generally an antipattern to maintain goals on sentiment, e.g. “100% of developers feel highly productive”. The latter goal is too broad to usefully measure the quality of your current work: even if you hit your target, it’s likely due to confounding factors
Metrics are usually more effective than surveys once you’ve fully instrumented your various processes and pipelines. However, there are many cases where your process and tooling aren’t going to be instrumented anytime soon, in which case surveys are certainly better measures than having no measures at all
It’s easy to get carried away asking too many questions in a given survey. As a rule, you should start small and slowly expand your survey over time until you see it impacting response rates. If response rates dip, pull back on size. There will be pressure to use this survey to solve every problem, but don’t fall into that trap
Timing your survey is important. Try to run it at least a month before your planning process so that you have time to run the survey and analyze the results before you submit plans
Some companies attempt to also include employee engagement in this survey, but I’d recommend using a different tool dedicated for employee engagement, probably Culture Amp, so that you can see how behavior varies across teams and organizations.

Tech Spec

Fri, 11 Feb 2022 07:00:00 -0700

Fork this template on Google Docs

Healthy engineering organizations make a lot of technical decisions. Many of those decisions impact multiple teams (Frontend, Backend) and functions (Engineering, Product, Customer Success, Finance). It’s normal to either feel like you’re moving too slow (“too many stakeholders in every decision”) or that your reckless pace creates frequent rework as issues are discovered late (“this problem would have been obvious if you’d just talked to Security first”).

Successful organizations make an explicit tradeoff between quick and comprehensive technology decisions, and the Tech Spec is a key tool for documenting and facilitating that tradeoff, along with reviewing your Tech Specs in some sort of Tech Spec Review.

Example tech spec on InfraEng.dev's hosting

As you think through using Tech Specs, remember that while all engineering organizations have a Tech Spec template, the template itself is specific to each organization. Earlier stage companies may find this template too heavy, whereas larger companies may find that it ignores many key topics. As you look at the proposed template, consider if some other alternative formats suggested in Other Tech Spec Formats might be a better starting point, and always remember that the best template is the one that matches your organization’s particular tradeoff between decision velocity and decision quality.

Related Meetings

Tech Spec Review

Other Tech Spec Formats

Other Approaches to Tech Specs

How to Use

Fork this template on Google Docs
Determine appropriate authors for the topic. If you’re not sure who might fit, ask around on your team for their suggestions
Draft a quick set of answers, time boxing to a few hours
Shop the draft around to stakeholders with the explicit goal of uncovering controversy and concern: what makes folks most uncomfortable when they read the draft?
Integrate the draft feedback and polish the draft into a completed spec
Run it through your Tech Spec Review
Based on the feedback in Tech Spec Review, make a decision on whether to adopt, refine, or throw out the proposal
Now you’re done!

Tips

At many companies, the Tech Spec format gets overloaded with too many concerns. A project once blew up cloud costs, and now every project has to detail their projections for future cloud costs. Being successful depends on maintaining a thoughtful balance between navigating and ignoring bureaucracy, and what specifically makes sense for your company will vary a bit. Sometimes you’re better off ignoring some sections.

On the other hand, if you’re leading the process of designing a Tech Spec format, you should push back on folks who try to add too many concerns to the template. It’s far more important that folks are documenting things at all than that they document things comprehensively. Too many requirements means folks will try to avoid writing Tech Specs!
There are many different Tech Spec formats; play around with them to find one that works well for you, whether it’s Stitch Fix’s, Amazon’s, Range’s, or something else entirely

Practices & Process Checklist

Fri, 28 Jan 2022 07:00:00 -0700

Fork this checklist on Google Docs

Spanning from your first on-call rotation to reviewing how information propagates from your executive team down to each engineer, there are an infinite number of practices and processes that you can implement in an organization. When you jump into a new organization or come up for air after your latest product launch, it’s helpful to have a checklist to think through how well your existing practices are working and which practices you might want to introduce next.

This checklist isn’t the set of required practices for your organization: every organization is unique. As you work through this checklist, think about your current challenges, consider how your existing processes could be improved to address them, and whether there’s a missing practice that might help.

Related Tools

Investments Checklist

How to Use

Fork this checklist on Google Docs
Budget an hour to work through the checklist, preferably with someone else
Check off the existing processes you already have.
Highlight in red the existing processes that aren’t going well
Highlight in blue the processes you don’t have that seem particularly valuable
Write a proposal that suggests a couple processes to improve or introduce
Discuss the proposal with other organizational leaders in Engineering, Recruiting, and People organizations, identifying who should own the initiative, how you’ll evaluate success, and timing for the change
Now you’re done!

Tips

While it can be tempting, you will always regret introducing new process and practice without checking in with your manager first. Ask first!
It’s best to work through this checklist in partnership with other folks in your organization. Ideally you’d also have someone from your Recruiting and People (aka Human Resources) team involved as some of these practices are most easily adopted there. This might work well for a session at an offsite
Process has a cost. Not having process has a cost, too. As you consider introducing new practices, think about the cost of maintaining them, not just the initial cost

Decision Log

Thu, 27 Jan 2022 07:00:00 -0700

Fork this template on Google Docs

Something about the close-knit social chemistry of a small team gives them a shared brain. Of course you know that last week Michelle decided all new frontend work would happen in TypeScript. Ambient awareness is less and less effective as an alignment tool as an organization grows, and becomes quite unreliable as an organization grows past ~twenty folks.

One tool that folks use to scale alignment around key decisions is the “decision log.”

Decision logs collect open and finalized decisions in a single, versioned document where anyone in the organization can quickly check if a decision has been made on an important area. They’re also a phenomenal place to collect links to the documents and forums where the decisions were made; when a decision is revisited in the future, the rationale behind how a decision was made is often much more important than the decision itself.
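If it helps to picture what one entry holds, here is a rough sketch of a decision log as a data structure. Every field name here is hypothetical (the template is a document, not code), and the date and link are placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Decision:
    """One row in a decision log. All field names are hypothetical."""
    title: str
    status: str  # "open" or "finalized"
    owner: str
    decided_on: Optional[date] = None
    # Links to the documents and forums where the decision was made.
    rationale_links: List[str] = field(default_factory=list)


def find_decisions(log, keyword):
    """Return decisions whose title mentions the keyword (case-insensitive)."""
    return [d for d in log if keyword.lower() in d.title.lower()]


log = [
    Decision(
        title="All new frontend work happens in TypeScript",
        status="finalized",
        owner="Michelle",
        decided_on=date(2022, 1, 20),  # placeholder date
        rationale_links=["<link to the discussion document>"],
    ),
    Decision(title="Default database for new services", status="open", owner="unassigned"),
]

for decision in find_decisions(log, "frontend"):
    print(decision.status, "-", decision.title)
```

The important part is that open decisions sit next to finalized ones, and that each entry carries links back to its rationale.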

How to Use

Fork this template in Google Docs
Think about important decisions that have been made over the past year and use them to preseed the closed decisions
Ask around the organization about the most important decisions that currently aren’t made: the sorts of implicit, unanswered decisions that make it difficult to align in your architecture review meetings
Agree with the team or leadership within an organization that folks will commit to tracking decisions in the decision log during a trial period of three months
Share with the team, and make the decision log easy to discover by linking it from related documents (RFC templates, Architecture Slack channels, etc)
Review adoption after three months and decide whether it’s a good practice for your organization to adopt or if it’s better to unwind and explore another solution

Tips

Decision logs are only helpful if folks know they exist and use them. Tie them into your processes: link to them from your #infra room in Slack, inform new hires about them in their onboarding document, link to them from the top of your RFC documents, and include important RFC decisions in your log as well
Start out with fewer, more general decision logs (e.g. one for all of Infrastructure) rather than numerous, specialized decision logs (e.g. one for Compute, one for Data Platform, …). This makes it less likely that they silently go dead
If you stop maintaining the decision log in a specific area, that’s ok! Just make sure to explicitly mark it as deprecated rather than leaving a stale resource

Organizational Design

Wed, 12 Jan 2022 07:00:00 -0700

Fork the org growth template and the org design template.

Having been involved in quite a few budget and headcount processes over the years, one thing that continues to surprise me is how often folks make major headcount requests without having done any organizational design of how those requested heads will compose into an organization.

The good news is that the high-level sort of organizational design required for headcount planning is abstract, low granularity, and it’ll likely only take you a couple hours to do a first pass. Add a few more hours to gather feedback, and you’ll have a reasonably good organizational design.

Related Tools

Hiring Ratios

How to Use

Fork the org growth template
Adjust B4 to reflect your organization’s current headcount
Adjust Growth/Quarter (Row 5) to reflect a reasonable quarterly growth number. This is going to be highly company specific. Either your headcount plan should provide some rough guidance or you can look at historical growth over the past year as a baseline assumption for next year’s growth. This only needs to be directionally accurate, and it’s better to be conservative than unrealistic
Tweak the values in Configuration to match with your organization’s beliefs, and to account for whatever roles your org does or does not have (technical program managers, product managers, etc). I’ve previously written up my rationale for a 1:8 ratio of managers to engineers, but the exact numbers here will depend on your organization. Again, these numbers just need to be directionally accurate, not perfect
You now know approximately how many teams and groups (e.g. teams of teams) you’ll have over the next year, and even the next several years if you extend the forecast
Next, fork the org design template
Start by linking your Org Growth projections into the Projections section. This will give readers context of your organization’s planned growth over the year without opening the sheet (you should absolutely link the sheet, but most readers will probably never open it)
Within Plan, start designing your organizational structure to match the size at the end of this year (or roughly twelve months out if you’re not doing this around January). The number of Directors will determine how many groups (e.g. teams of teams) you’ll need, and the number of Managers will determine the number of teams to distribute across those groups.

You should explicitly name each of those groups (e.g. Dev Productivity) and also connect each group to a subset of your organization’s goals or roadmap. You should further name the teams within each group to provide some flavor for how the group might be composed. It’s fine for team names to be a bit fuzzy as the relevant Directors will tweak the pieces a bit as they come into play, but the groups should be fairly firm
Extend Plan with a short proposal for how you’ll move from current state (e.g. 0 Directors) to future state (e.g. 3 Directors)
Tweak Peer Comparisons to highlight structures at comparably sized organizations with similar scope
Write your Summary section focusing on group structure roughly a year out
Read over your Org Design document. If following the model has introduced any particularly awkward elements, then go ahead and rewrite them with something that you find more natural
Now you’re done!
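If you’d like to sanity check the growth projection outside the spreadsheet, the arithmetic can be sketched roughly as below. This is not the template’s actual formula: it assumes flat growth per quarter, one team per manager at the 1:8 ratio mentioned above, and a made-up ratio of one director per four managers.

```python
import math


def project_org(current_headcount, growth_per_quarter, quarters=4,
                engineers_per_manager=8, managers_per_director=4):
    """Project headcount, teams, and groups per quarter.

    Assumes flat growth of `growth_per_quarter` people each quarter, one team
    per manager, and one group per director. For simplicity, all headcount is
    treated as engineers. The default ratios are illustrative, not values
    from the template.
    """
    rows = []
    headcount = current_headcount
    for quarter in range(1, quarters + 1):
        headcount += growth_per_quarter
        teams = math.ceil(headcount / engineers_per_manager)
        groups = math.ceil(teams / managers_per_director)
        rows.append({"quarter": quarter, "headcount": headcount,
                     "teams": teams, "groups": groups})
    return rows


# Hypothetical starting point: 40 people today, growing by 6 per quarter.
for row in project_org(current_headcount=40, growth_per_quarter=6):
    print(row)
```

As with the template, the output only needs to be directionally accurate: the point is the number of teams and groups you’ll be designing for, not the exact headcount.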

Tips

Your goal is understanding how you’ll structure your organization, as opposed to the precise number of hires on each team. Try to exclusively use the default team and group size parameters to focus you on the structure rather than precision

Hiring Ratio

Tue, 11 Jan 2022 07:00:00 -0700

Fork this template on Google Docs

It’s impossible to avoid headcount planning when running a large team within an engineering organization. On the other hand, many folks find it’s impossible to be usefully involved in headcount planning when the folks running the process aren’t closely involved with your work: Infrastructure? Oh, that’s going great: no crashes or breaches lately, we don’t need to invest here!

Hiring Ratios are a useful tool for folks leading support and enablement teams that often get little attention during the headcount process. Instead of relying on the folks running headcount planning, often the head of engineering, to understand your work and roadmap in detail, agree on a ratio of infrastructure engineers to product engineers.

Depending on the particular individual responsible for engineering headcount planning, you may find that writing an Organizational Design is a better fit instead. Both are useful; which will resonate best depends on them!

Related Tools

Organizational Design

How to Use

Fork this template on Google Docs
Start by filling in the Team section, in particular establishing the current ratio between your team and the internal organizations that they support. For example, Engineering is 100 folks and Infrastructure Engineering is 8 folks, so you have roughly a 1:12 ratio.
Fill out the Rationale section by finding measures of your team’s recurring workload and normalizing that work against the size of engineering. Your goal is to find measures of how Infrastructure supports the wider organization and provide reasonable evidence that the effort to provide that support scales linearly with the size of the wider engineering team.

Some options to consider are:
- # of ad hoc support tickets your team handles, per engineer, per month. You can also connect this to ticket latency, showing ticket latency increasing as there have been fewer infrastructure engineers relative to the wider engineering organization
- # of builds or deploys, per engineer, per month. You can connect this to both build/deploy success rate and build/deploy p50 or p95 time to successful completion
- # of product incidents, per engineer, per month. You can connect this to number of engineering hours spent mitigating, remediating and cleaning up after incidents, which typically scales faster than number of incidents itself
Continuing in the Rationale section, take some time to consider the big ticket projects your team has worked on or ought to be working on. If you’re having trouble finding more high-impact projects to work on (or if no one seems to want the nominally high-impact projects you are working on), then your team might be a bit too large for the organization it supports: propose tweaking the ratio up a bit (e.g. from 1:12 to 1:14). If you’ve been unable to take on any major projects over the past year, you’re probably understaffed: propose tweaking the ratio down a bit (e.g. from 1:12 to 1:10). If you’ve been able to effectively support the teams you’re working with and also finish one to two major investments per year, then you’re probably at a reasonable ratio: propose maintaining your current ratio.
Assume that folks reading this document have absolutely no clue where your other materials are that would help them understand your workload, and add links to your goals, dashboards, roadmaps, and work queues. (This may not be true, but links to context are at worst harmless and often very helpful for folks diving into a topic they don’t spend much time thinking about.)
Update Peer Comparisons with the ratios used by similar teams within industry peers. These provide context for your proposed ratio, particularly for folks who are unfamiliar with the sort of work your team does
Finish by writing a concise Summary section that quickly communicates the proposed ratio and the gist of the rationale
Now you’re done!
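The arithmetic behind the Team and Rationale sections is simple enough to sketch. The function names below are made up for illustration, and the growth and ticket numbers are hypothetical; only the 100 and 8 come from the example above.

```python
import math


def support_ratio(supported_headcount, team_headcount):
    """Supported engineers per member of the support team (the N in 1:N)."""
    return supported_headcount / team_headcount


def team_size_for(supported_headcount, ratio):
    """Team size needed to hold a 1:ratio ratio, rounded up to whole people."""
    return math.ceil(supported_headcount / ratio)


def per_engineer_per_month(event_count, engineers, months):
    """Normalize a workload measure (tickets, builds, incidents) by org size."""
    return event_count / engineers / months


ratio = support_ratio(supported_headcount=100, team_headcount=8)
print(f"Current ratio is roughly 1:{ratio:.1f}")

# Hypothetical plan: Engineering grows from 100 to 150 at a fixed 1:12 ratio.
print("Infrastructure headcount at 150 engineers:", team_size_for(150, 12))

# Hypothetical workload: 360 ad hoc tickets over 3 months from 100 engineers.
print("Tickets per engineer per month:", per_engineer_per_month(360, 100, 3))
```

If the normalized workload stays roughly flat as engineering grows, that is reasonable evidence that your support effort scales linearly with the size of the wider team, which is the argument the ratio depends on.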

Tips

Your goal is to reasonably represent your team’s headcount needs in a broader headcount planning process, not to represent it accurately. A single shared ratio is never going to be particularly precise
If you want to get a bit fancier, you can make the formula a bit more complex, pulling in features like # daily builds, # daily deploys, # peak users, # total users, but usually simpler works best. Very few headcount spreadsheets include those sorts of values, and it gets hard for other folks to reason about their own headcount once you introduce more features solely for yours. You can always, of course, have your own work sheet off to the side that explores these sorts of ratios

Recruiter Velocity Check

Mon, 10 Jan 2022 07:00:00 -0700

Fork this template on Google Sheets

At some point in your planning process, you’re going to get a headcount target. It’s tempting to immediately jump into allocating that headcount–we’re going to do so much this year–but it’s helpful to take an hour to model out recruiting capacity to understand whether your headcount target is realistic.

Once you’ve gone through the exercise, you’ll finish with a simple chart that shows your progress over the year towards that headcount target. If the ending headcount line crosses the target headcount, then you’re in a pretty solid place. Reality is a lot more complex than this model, but at least you’re generally in a plausible starting position.

If ending headcount remains far away from target headcount, then you know it’s time to sit down and work through the details with your recruiting partner and whoever is allocating headcount.

How to Use

To use this tool:

Fork this template on Google Sheets
Update the values in blue boxes in Column B to reflect your plan. Most important are those under Headcount which represent your headcount plan and Recruiters which represent the number of recruiters working on your roles
There are a handful of numbers that you may not know off hand, including Expected Attrition / Quarter and those under Recruiting Rate. The current values are sensible-enough defaults if you want to run a quick calculation, although you’ll certainly get better results with data from your organization
Now you’re done!
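For a feel of what the sheet is modeling, here is a simplified sketch of the same idea. It is not the template’s actual formula: it assumes each recruiter closes a fixed number of hires per quarter, attrition is a fixed fraction of headcount each quarter, and hiring stops once you reach the target. All of the input values are made up.

```python
def project_headcount(starting, target, recruiters, hires_per_recruiter_per_quarter,
                      attrition_rate_per_quarter, quarters=4):
    """Return the projected ending headcount for each quarter.

    A simplified model: every recruiter closes a fixed number of hires per
    quarter, hiring stops at the target, and a fixed fraction of the
    start-of-quarter headcount leaves each quarter.
    """
    ending = []
    headcount = float(starting)
    for _ in range(quarters):
        attrition = headcount * attrition_rate_per_quarter
        capacity = recruiters * hires_per_recruiter_per_quarter
        # Never hire past the target headcount.
        hires = min(capacity, max(0.0, target - (headcount - attrition)))
        headcount = headcount - attrition + hires
        ending.append(round(headcount, 1))
    return ending


# All inputs are made-up values for illustration.
projection = project_headcount(starting=20, target=30, recruiters=1,
                               hires_per_recruiter_per_quarter=3,
                               attrition_rate_per_quarter=0.03)
print(projection)
print("Reaches target:", projection[-1] >= 30)
```

With these inputs the team ends the year just short of thirty. Set the attrition rate to zero and the same plan looks on target, which is exactly the mistake described in the tips below.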

Tips

This model is absolutely simplified, and in practice there are many things other than hiring recruiters that you can do to increase hiring velocity. Don’t get trapped by the model
It rarely makes sense to forecast further out than four quarters. Too many things change over long time frames
The most common mistake folks make is ignoring attrition throughout the year. It’s very easy to look like you’re on target if you ignore attrition. Sometimes recruiters quit, too

Infrastructure Engineering Resources

Sat, 01 Jan 2022 07:00:00 -0700

Of the folks I chatted with, the most common way of learning about infrastructure engineering was working professionally with experienced peers. That is, indeed, among the most effective ways to learn about infrastructure, but it’s not always an accessible option, and certainly not the only way.

This is a collection of resources that I, or folks I’ve chatted to, found valuable. The majority of these resources are organized into alphabetically-ordered categories, but I wanted to start by recognizing a handful of foundational resources that I’d recommend starting with first:

Thinking in Systems: A Primer: Donella Meadows
Accelerate: Forsgren, Humble, and Kim
One of The Phoenix Project: Kim, Behr, Spafford or The Unicorn Project: Kim

Once you’ve read those, move to a section of particular interest and dive in.

Architecture

A Philosophy of Software Design: John Ousterhout
Software Design X-Rays: Fix Technical Debt with Behavioral Code Analysis: Adam Tornhill

Career

The Manager’s Path: Camille Fournier – a great career resource for engineers, even if you’re not considering management
The Effective Engineer: Edmond Lau, Bret Taylor
Staff Engineer: Leadership beyond the management track: Will Larson, Tanya Reilly
The Engineer/Manager Pendulum: Charity Majors

Design Docs, Tech Specs, RFCs, and so on

Developer Productivity

Accelerate’s definition of developer productivity
The SPACE of Developer Productivity
DORA Research Program – DevOps Research & Assessment reports, particularly the annual state of DevOps reports
Migrations: the sole scalable fix to tech debt
You can’t reason about big balls of mud
Managing technical quality in a codebase

Metrics & Measurement

Forecasting synthetic metrics

Papers

Papers We Love is a great community to find more!

Philosophy & Approach

Technical Decision Making by Cindy Sridharan
Effective Mental Models for Code and Systems by Cindy Sridharan
“I Wouldn’t Start from Here”. How to make a big technical change by Tanya Reilly
Computers can be understood by Nelson Elhage
Maintaining platform-product fit
Magnitudes of exploration

Planning

Reliability

separate out on-call? pagerduty manual jelli manual

Roles

Strategy

Technical writing

Docs for Developers: Bhatti, Corleissen, Lambourne, Nunez, Waterhouse

Tools

Service cookbooks

Uncategorized

These are valuable resources that don’t quite fit into one of the above categories.

Headcount Planning

Sun, 21 Feb 2021 07:00:00 -0700

TODO:

Find better vocabulary to distinguish between “leadership team” in your org (that you manage) and “leadership team” that you’re a member of or report to

I once walked into an annual headcount planning session to learn that the other engineering managers in the room had already decided together how they would reallocate the senior members from the infrastructure organization that I supported to the teams that they ran. This was, they assured me, optimal for their roadmaps.

While that was a particularly contentious meeting, headcount planning is hard. It’s an attempt to rationalize priorities across many different teams, each of which works on different sorts of problems. Even when everyone involved has a shared goal of supporting your business, it’s a difficult problem. It can be particularly difficult for infrastructure engineering organizations which often think about their outcomes in unquantified ways: how should you prioritize reducing the risk of a security breach against driving an additional $10 million of revenue?

Meshing infrastructure engineering priorities with a headcount planning process is difficult, but it’s a common challenge for folks working in and leading infrastructure organizations, and there’s a toolkit for navigating the headcount process.

Tools used in this section:

Goals, Plans, and then Headcount

Whenever possible, you should work on headcount after you’ve set organizational goals and translated those goals into a loose plan. Once your headcount plan is finalized, then you’ll need to refine those initial plans and goals with the headcount plan.

Sometimes this ideal sequencing isn’t possible, and that’s ok: there are times to do mediocre work because the alternative is doing abysmal work. However, it’s quite difficult to run an accurate headcount process absent goals and a plan, and you’ll be better off relaxing precision in headcount planning when your goals and planning are ambiguous.

Phases of Headcount Planning

The tidy looking “headcount” box in the planning process hides three distinct phases:

headcount plan is the company planning and budgeting process (“how many people will we hire this year to hit our company goals and which functions will we hire them in?”)
headcount allocation is assigning your team’s headcount envelope across roles to be hired (“how do I prioritize these twenty headcount across the infrastructure engineering organization?”)
recruiter allocation is the mapping of recruiters to both the headcount plan and headcount allocation (“which recruiters will hire these twenty engineering roles?”)

Your headcount planning process almost certainly won’t acknowledge the existence of all of these phases, but it’s important to recognize that they all occur, even if your process pretends otherwise. For example, your headcount planning process may operate as if it’s a top-down process without bottom-up feedback, but you can be certain that many of your peers are privately providing bottom-up feedback and advocating for adjustments.

Even if your process acknowledges these three steps, these processes are generally run by different organizations (e.g. executive team, finance team, recruiting team), and you’ll often find them surprisingly disconnected. If you treat them as a unified process, it’s easy to fall into gaps between the sub-processes (“what do you mean that we’re allocated fifty engineering headcount but won’t have a single recruiter working on engineering hiring until Q3?!”).

Drafting Phase

The headcount process begins with the drafting phase, where your goal is to understand the baseline headcount proposal. This starts with your organizational leader (e.g. VP Engineering) or finance partner (e.g. someone in FP&A) giving you a headcount target to hire against.

Preparing Your Manager for Headcount Planning

A frequent reaction to receiving a headcount plan is to ask, “How was this even established without input from my team?” It can feel quite counterintuitive that your first step in the headcount planning process is to receive a headcount plan. This is, however, generally what happens. To avoid surprises, find a way to remain continuously aligned with your manager on headcount planning.

The two most effective approaches that I’ve encountered are:

Writing an Organizational Design to align with your organizational leadership on why, when and how the team should grow over time
Establishing a Hiring Ratio between your team, stakeholder teams, and other growth levers (paying users, etc)

If you’ve aligned with your manager on one of those approaches, then the initial headcount plan you receive should roughly fit with your existing plan. If you haven’t aligned with your manager, then the plan will be whatever your manager makes up, typically driven by inertia (“we grew this team by 30% last year, which was good, so let’s do it again”) or company priorities (“we’re trying to accelerate enterprise sales, so we’re focusing hiring to directly support those initiatives”).

Headcount Plan

This headcount plan will be along the lines of:

You ended 2021 with twenty engineers
The budget supports growing to thirty engineers by end of 2022
There is a set of assumptions about when those hires will happen, for example three hires per quarter in Q1-Q3 and two hires in Q4
The next planned checkpoint to revise the headcount plan is in early Q3

There may be more details, but they’ll be rough assumptions that you shouldn’t take too seriously. For example, there may be assumptions about how these roles are allocated across teams or roles within your team, but the actual details are going to be up to you to figure out.

Your goal at this point is to understand the headcount proposal. What is the specific proposal? What are the constraints? Where is there flexibility? It is a mistake to raise objections against the initial headcount plan before you understand the constraints that shaped it.

Headcount Allocation

Now that you have the headcount plan, your goal is to translate it into a headcount allocation. The headcount plan is a high-level organizational view of hiring, whereas the headcount allocation is a concrete allocation of that headcount into the specific teams and initiatives that make up your team’s plan and obligations.

You want to come out of headcount allocation with a single document that states:

Your budget for the next year
How many additional hires that budget supports
Which teams those hires will be assigned to
The relative priority for filling those roles
What folks should do if they disagree with the plan

You should spend a fair amount of time reviewing this plan in detail with your leadership team and recruiting partners. Your headcount allocation is only useful to the extent that your team is committed to following it.

The best way to determine your headcount allocation is to apply your Organizational Design or Hiring Ratio against the provisional headcount plan. If you said you’ll hire one developer productivity engineer for every ten product engineers and one compute engineer for every twenty engineers, then use those ratios to inform the allocation. If you’ve gotten to this point without either of those documents, take two hours to draft up a proposal and then another hour to discuss it with your leadership team.
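As a rough illustration of applying a Hiring Ratio, the sketch below uses the two example ratios from the paragraph above. The function name, the rounding-down rule, the choice to count developer productivity engineers toward the compute ratio, and the input of sixty product engineers are all assumptions made for illustration.

```python
def allocate_by_ratio(product_engineers: int) -> dict:
    """Apply the example hiring ratios to a product engineering headcount.

    Assumptions (not from the original text): ratios round down, and the
    compute ratio counts product plus developer productivity engineers.
    """
    # One developer productivity engineer for every ten product engineers.
    dev_prod = product_engineers // 10
    # One compute engineer for every twenty engineers.
    compute = (product_engineers + dev_prod) // 20
    return {
        "product": product_engineers,
        "developer_productivity": dev_prod,
        "compute": compute,
    }


# Hypothetical input: sixty product engineers in the provisional plan.
allocation = allocate_by_ratio(60)
print(allocation)
```

The output is only a starting point for discussion with your leadership team, since a purely mechanical allocation runs into the challenges described next.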

Even with a great organizational design or hiring ratio, headcount allocation is not purely a rote process. An overly mechanical approach will run into a few common challenges:

Inertia-driven planning where you staff based on previous staffing decisions rather than priority
Overstaffing bad teams that are struggling to close new candidates, retain their existing team, or deliver results more generally. It’s easy to prioritize staffing up these teams, but the impact will be muted until you address the underlying issues
Starving great teams that are exceeding their goals and consequently don’t “need” more staffing
Starving new initiatives that don’t fit cleanly into your existing organization

If you run through a headcount allocation process with truly no contention, then either your team has a very high level of trust or you’re dodging all the hard questions that you’d benefit most from addressing.

Recruiter Allocation

Once you’ve completed the headcount allocation, the next step is understanding how the recruiting team is planning to support your hiring efforts. You’re particularly trying to understand if there’s a significant gap between the headcount plan and recruiting support for that plan.

A typical approach is:

Identify where alignment with recruiting happens for infrastructure hiring. Find the level of your organization where hiring prioritization is decided with the recruiters you work with (“should we hire for the compute team, the SRE team, or the team for product XYZ first?”). That is typically the head of engineering at smaller companies (fewer than ~200 engineers) and sub-leaders beyond that
Estimate recruiting hiring velocity using the Recruiter Velocity Check. At the level you just identified, find the quarterly hiring rate for trained recruiters over the past year. For example, if a recruiting manager with two recruiters is aligned with the head of infrastructure engineering, how many folks did each recruiter hire within infrastructure engineering in each of the last four quarters? A typical number is four to six hires per quarter per ramped recruiter
Estimate recruiting ramp-up time. Ask a recruiting manager how long new recruiters are taking to ramp up. Many hiring plans assume recruiters hire at full capacity from day one, which is a poor assumption. A frequently cited number is three months of ramp time
Estimate additional hiring to replace attrition. The biggest myth in recruiter allocation is that you only need to hire from your current headcount to the new target, say ten hires to grow the team from twenty to thirty. In reality, you need to make those ten hires and also backfill any attrition that occurs over the year. Attrition varies widely across organizations, but 10% is a reasonable assumption if you have trouble calculating your historical attrition. For this estimate, it doesn’t matter whether the attrition is regretted or non-regretted
Check recruiter allocation against the headcount plan. You now have the data you need to cross-check your team’s recruiter allocation against its hiring plan. Will you have enough recruiting support to accomplish the hiring plan?
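The cross-check in the last step can be sketched as simple arithmetic. This is a minimal sketch rather than a real planning model: the function names, the choice to estimate attrition against starting headcount, and the assumption that a ramping recruiter makes no hires for one quarter are all illustrative. The figures (twenty to thirty engineers, 10% attrition, four hires per quarter per ramped recruiter, three months of ramp) come from the examples above.

```python
def hires_needed(current: int, target: int, attrition_rate: float = 0.10) -> int:
    """Growth hires plus backfills for expected attrition over the year."""
    growth = target - current
    # Assumption: attrition is estimated against starting headcount.
    backfills = round(current * attrition_rate)
    return growth + backfills


def recruiter_capacity(ramped: int, new: int, hires_per_quarter: int = 4,
                       ramp_quarters: int = 1) -> int:
    """Annual hires the recruiting team can plausibly support.

    Assumption: a new recruiter makes no hires while ramping (one quarter,
    i.e. about three months) and hires at the full rate afterwards.
    """
    ramped_hires = ramped * hires_per_quarter * 4
    new_hires = new * hires_per_quarter * (4 - ramp_quarters)
    return ramped_hires + new_hires


needed = hires_needed(current=20, target=30)    # 10 growth + 2 backfills
capacity = recruiter_capacity(ramped=0, new=1)  # one recruiter starting in Q1
gap = needed - capacity
print(f"needed={needed}, capacity={capacity}, gap={gap}")
```

A positive gap means the plan asks for more hires than recruiting can plausibly deliver. A gap near zero, as in this hypothetical case, leaves no slack for a slow quarter.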

At this point you should understand whether there’s a significant gap between the headcount plan and the recruiter allocation plan. If you believe there is a significant gap, then spend time talking it through with the recruiting team along with whoever is responsible for that recruiting team’s hiring priorities. Your analysis will be helpful in their own efforts to adjust the hiring plan and optimize the hiring process.

Finalizing Phase

Once you’ve completed the drafting phase, the fundamental question to answer is whether things are good enough that you can accept them as is, or whether you need to advocate for changes in the headcount plan.

Some of the smartest folks I’ve worked with have poured a tremendous amount of energy into headcount planning without accomplishing much because they pursue a degree of correctness that headcount planning simply doesn’t support. You should accept the plan as good enough if any of these apply:

You can solve inaccuracies within your headcount envelope. Headcount planning is a contract between team leaders and the planning process to work within a specific financial plan. The headcount plan doesn’t care if you shift headcount between two teams in your organization as long as the cost impact is relatively neutral
Things are generally right, even if they’re specifically wrong. Some folks get caught up on headcount plans that have specific errors (“this should be twelve but says eleven”) even though they’re close enough. In practice, headcount plans change all the time and most small errors don’t matter in the long run
Hiring bandwidth is the real constraint. If you have more headcount than you can hire based on a reasonable recruiting model, then don’t spend a single additional second worrying about the headcount plan. If you’ve already gone deep into optimizing your hiring process then you may want to propose changes to recruiter allocation, but my experience is that you’re almost always better off digging into and debugging your hiring process than advocating for more recruiters

Sometimes, however, you’re going to run into a headcount plan or recruiter allocation that makes your plans difficult, in which case there are a few patterns for negotiating against the initial proposals.

Headcount Planning

If you believe that the headcount plan is so misaligned with your needs that it’s effectively unworkable, then you have a short window to advocate for changes.

Your first instinct may be to write a massive document explaining why your work is important, but hold up a moment. Instead, find someone who has been effective at getting their work staffed, and go talk to them. How did they advocate for their team? What materials did they provide to their manager to show hiring’s impact on their goals and plans? You may discover an undocumented shadow headcount process that you never knew existed.

After chatting with folks who’ve been effective at getting their headcount asks approved, you’ll usually find one of these things to be true:

They actually aligned with their manager on an Organizational Design or Hiring Ratio before the headcount process started
Leadership views a subset of their goals as critical (e.g. product deliverable, SOC2 compliance, security, reliability)
A revenue driving team has called out this team as essential to their goals (e.g. we can’t close more enterprise b2b deals unless my organization is staffed to complete project X)
Company leadership has a strong relationship with that team’s leadership, for whatever reason

In the short term, your only real option is to do a better job explaining how your team’s plans will impact key or revenue-driving initiatives. Longer term, this is a heads up that you could be doing a better job of connecting your team’s goals to things your company values.

Recruiter Allocation

Once your headcount plan is finalized, regroup with the recruiting team you partner with and discuss any needed changes. Even if your model suggests you need more recruiters, at this point you’ve made your case and should focus on how you can partner most effectively to take good advantage of the support you will have.

Headcount Allocation

At this point, you have the final headcount plan and it’s time to refresh your headcount allocation to reflect any changes that have been made. This should be a lightweight rerun of the process you used for the initial headcount allocation, followed by communicating the updated plan to your team.

What’s next?

At this point, you’re done with the headcount process, and you can move on to finalizing your goals and plans based on the final numbers.

Sometimes folks will get frustrated with where the headcount allocation ends up, which is natural given the number of priorities being balanced against each other. What I’ve learned over time is that these things get revised sooner rather than later: if someone is upset, then work with them on putting together the data for a better rationale when you start the next headcount process.