Efficiency: Managing Infrastructure Costs
In my early career roles, I worked at companies that never worried about their infrastructure costs at all. The costs were simply too low, and growing too slowly, for the Finance team to pay much attention to them. This “ignore it until it’s too large to ignore” approach served me well.
Until it didn’t.
Working at Uber, I was caught off guard when a new Director joined and, overnight, infrastructure costs were recategorized from insignificant to requiring urgent, detailed review every month. Adding the instrumentation and accountability for these costs after the fact was a difficult retrofit. Although I was surprised that time, I’ve come to appreciate that all successful companies go through the transition from ignoring infrastructure costs to setting goals on them, and an early focus during my time at Stripe was ensuring we were ready ahead of that shift.
Your job as an infrastructure leader is diagnosing the right mode of operation for your company’s infrastructure costs today, understanding when you’re likely to switch modes, and ensuring you’ve done the prework to make the transition relatively painless.
We’ll explore this topic by digging into:
- three distinct operating modes for infrastructure costs: early-stage, growth, and late-stage
- concrete tools and tactics such as managing infrastructure costs with cloud-specific reductions, including costs in your Business Review Template, and using a Contract Negotiation Checklist
- whether you should spin up a dedicated team working in this area
When you finish reading this, you won’t have your entire efficiency plan worked out, but you will have the high-level pieces, know where you need to dig in, and have a clear approach to communicate to anyone who has been pushing you for a documented approach around infrastructure costs.
Should you prioritize infrastructure costs?
Before diving into the mechanics of managing infrastructure costs, the first question to answer is whether it’s a valuable use of organizational time to make your current infrastructure spend more efficient. How you think about this will vary a bit depending on whether your company is early-stage, prioritizing growth, or focused on profitability in late-stage.
Early-Stage
Generally speaking, very early-stage companies shouldn’t spend much time thinking about infrastructure costs. You should instead be focused on finding product-market fit for your first product.
Here are two checks you can run to determine if it’s worth reducing your infrastructure costs:
- If reducing your infrastructure costs to $0 still wouldn’t increase your runway by at least two months, then it’s not worth focusing on
- If you’re spending less than $2,000/month per employee on infrastructure costs, then it’s probably not a significant priority because your headcount spend will be so much higher
If you’re not violating either of those checks, then keep on ignoring infrastructure spend. If you are exceeding one, and infrastructure costs are a significant part of your overall burn, then invest a sprint into reducing spend, and then resume ignoring it once these checks resume passing.
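The two checks above can be sketched as a small calculation. This is a minimal illustration rather than a financial model: the `BurnSnapshot` type, its field names, and the example figures are all hypothetical, and it assumes runway is simply cash on hand divided by monthly burn.

```python
from dataclasses import dataclass


@dataclass
class BurnSnapshot:
    """Hypothetical inputs; all names and figures here are illustrative."""
    cash_on_hand: float        # dollars in the bank
    monthly_burn: float        # total monthly burn, infrastructure included
    monthly_infra_cost: float  # monthly infrastructure spend
    employees: int


def runway_months_gained(s: BurnSnapshot) -> float:
    """Extra months of runway if infrastructure spend dropped to $0."""
    if s.monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    current_runway = s.cash_on_hand / s.monthly_burn
    remaining_burn = s.monthly_burn - s.monthly_infra_cost
    if remaining_burn <= 0:
        return float("inf")  # infrastructure is the entire burn
    return s.cash_on_hand / remaining_burn - current_runway


def worth_a_cost_sprint(s: BurnSnapshot) -> bool:
    """True when either check suggests spending a sprint on costs."""
    if s.employees <= 0:
        raise ValueError("employees must be positive")
    gains_two_months = runway_months_gained(s) >= 2
    high_per_employee = s.monthly_infra_cost / s.employees >= 2000
    return gains_two_months or high_per_employee


# Example: $3M in the bank, $250k/month burn, $10k/month infrastructure, 20 people.
example = BurnSnapshot(3_000_000, 250_000, 10_000, 20)
print(runway_months_gained(example), worth_a_cost_sprint(example))
```

In the example, eliminating infrastructure spend entirely buys half a month of runway and the per-employee spend is $500/month, so neither check fires and the costs can keep being ignored.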
The one notable exception is if you’re building a low-margin product or product where cost efficiency is a pillar of your long-term strategy. For example, if you’re operating a metrics collection and dashboarding product like Datadog, then efficiency probably is worth considering earlier than usual.
Growth
When you’re prioritizing growth, the primary focus of the engineering organization in a technology company is creating, operating and advancing the products that support the business. Managing costs is important, but even immaculate cost management won’t make your company a success if enough energy isn’t being invested in your product.
The fundamental question to ask is whether infrastructure’s share of cost of goods sold (COGS) is increasing as a percentage of revenue. (The simplest way to think about COGS is as all your non-headcount costs, although a slightly better definition would be all costs to operate your software.)
Start answering this question by plotting revenue and infrastructure costs on a chart to get a sense of how these two numbers are moving. Although logarithmic scales often generate more confusion than they’re worth, in this case it’s usually the only way to see both lines closely enough to understand their slopes within a single chart. You particularly want to understand if either line has experienced an inflection over the past few quarters. If costs have started accelerating without corresponding acceleration of revenue, that’s worth digging into.
Once you’ve looked at the two lines independently to understand their movement, simplify your first chart into a chart showing infrastructure costs as a percentage of revenue. This chart hides some detail but is easier to parse for folks further away from the details. As long as the ratio is going down and your company is focused on growth, then this data should be sufficient to justify your current level of investment into efficiency: if growth is key, and infrastructure costs are not getting in the way, why should you slow down growth to reduce them?
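As a rough sketch of the numbers behind those two charts, the snippet below computes infrastructure cost as a percentage of revenue and the period-over-period growth of each series (the numeric counterpart of comparing slopes on a logarithmic scale). The function names and the quarterly figures are made up for illustration; in practice the inputs would come from your Finance team’s data.

```python
def cost_ratio_series(revenue, infra_cost):
    """Infrastructure cost as a percentage of revenue, period by period."""
    if len(revenue) != len(infra_cost):
        raise ValueError("revenue and infra_cost must cover the same periods")
    return [100.0 * cost / rev for rev, cost in zip(revenue, infra_cost)]


def growth_rates(series):
    """Period-over-period growth as fractions (0.10 means +10%)."""
    return [(later - earlier) / earlier
            for earlier, later in zip(series, series[1:])]


def ratio_is_falling(ratios):
    """True when the ratio never rises from one period to the next."""
    return all(later <= earlier for earlier, later in zip(ratios, ratios[1:]))


# Hypothetical quarterly figures, in dollars.
revenue = [1_000_000, 1_300_000, 1_700_000, 2_200_000]
infra_cost = [120_000, 150_000, 185_000, 230_000]

ratios = cost_ratio_series(revenue, infra_cost)
print([round(r, 1) for r in ratios])
print("ratio falling:", ratio_is_falling(ratios))
print("revenue growth:", [round(g, 2) for g in growth_rates(revenue)])
print("cost growth:", [round(g, 2) for g in growth_rates(infra_cost)])
```

If cost growth starts exceeding revenue growth for a few periods in a row, that is the inflection worth digging into, even if the ratio still looks acceptable in absolute terms.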
Late-Stage
Even the best business lines stop growing at some point. Facebook is one of the most valuable businesses in the world, but even they at some point ran out of new users to attract to their platform. Once growth slows, a business naturally starts focusing more on costs, including infrastructure spend.
In those scenarios, the easiest approach is to work with the business to align on two numbers:
- Dollars spent on infrastructure overhead per engineer: this includes things like development environments, testing tools, and so on. Determine your starting point by bucketing vendors and non-production infrastructure costs into a chart and plotting them over time divided by headcount. Pick a reasonable point on that line as your target. Refine it by reaching out to industry peers to get a sense of how this number compares to theirs (be sure to pick industry peers in companies that are currently focused on profitability, otherwise their answers won’t be very helpful to you)
- Infrastructure dollars spent per N product operations served: this anchors on the cost of operating the product. It will vary a bit depending on your product or business, but it might be “$1.00 in infrastructure costs to power every 10,000 searches”, “$2.50 for every 10,000 payments processed”, or “$3.00 for every 10,000 trips scheduled”
In both, the key thing is moving away from anchoring on a percentage of revenue and instead setting a target against the fundamental operations that you support. Thinking of costs as a percentage of revenue works well when you’re growing, but is too abstract and hides too many details once you’re focused on reducing costs.
If you find yourself exceeding those targets, then it’s time to dive into reducing them.
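Both numbers are simple ratios, and a minimal sketch of computing them and comparing against a target might look like the following. The function names, the cost figures, and the targets are hypothetical; deciding which costs count as non-production overhead versus production spend is the real work and is assumed to have been done already.

```python
def overhead_per_engineer(non_production_cost: float, engineers: int) -> float:
    """Dollars of infrastructure overhead per engineer for one period."""
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    return non_production_cost / engineers


def cost_per_n_operations(production_cost: float, operations: int,
                          n: int = 10_000) -> float:
    """Production infrastructure dollars per n product operations served."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    return production_cost / operations * n


def exceeds_target(actual: float, target: float) -> bool:
    """True when the actual unit cost is above the agreed target."""
    return actual > target


# Hypothetical month: $45k of production spend serving 300M searches,
# against a target of $1.00 per 10,000 searches.
per_10k_searches = cost_per_n_operations(45_000, 300_000_000)
print(round(per_10k_searches, 2), exceeds_target(per_10k_searches, 1.00))

# Hypothetical month: $90k of non-production spend across 150 engineers,
# against a target of $700 per engineer.
per_engineer = overhead_per_engineer(90_000, 150)
print(round(per_engineer, 2), exceeds_target(per_engineer, 700))
```

Tracking these per period, rather than as a one-off, is what makes it possible to tell whether you are drifting toward or away from the target.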
Tools for Managing Infrastructure Costs
What I’ll introduce here is the fairly common playbook for managing infrastructure costs. As you work through these approaches, your goal is to do as few of them as possible while meeting your efficiency goals. I’ve prefixed a few particularly high return-on-investment tools with a ⭐; if you’re debating where to start, consider starting with them.
- ⭐ Use your cloud vendor’s cost optimization tools. Every cloud vendor has a program along the lines of AWS Savings Plans or AWS Reserved Instances. These plans allow you to trade usage or spend commitments for reduced pricing. If you aren’t already using these, you can usually reduce your infrastructure costs by 20-40% in a few weeks of work
- ⭐ Standardize your vendor negotiation process. Beyond a core cloud vendor, many companies have five or six additional large vendor contracts for things like observability, security, or developer productivity. Introducing a structured process for negotiating and renegotiating, like using a Contract Negotiation Checklist, will significantly improve your pricing (as well as visibility into costs)
- ⭐ Run periodic deep dives on cost. Until you have a dedicated team actively looking at your infrastructure costs, you can usually identify significant cost reductions by periodically taking a week to dig into your biggest infrastructure costs and prioritizing low-hanging fruit. These will usually be accidents, like storing unused data, development environments not getting retired, etc. The key thing is scoping the opportunity to work that the infrastructure team can take on themselves
- Update your Tech Spec template to include a section that estimates costs. Many engineers will be unfamiliar with that process, so make sure the template links to examples of how a few representative services estimated their costs. A great example will be onerously detailed, including links to the specific queries and tools to estimate their costs. A template that requires cost estimation without guiding folks through that process will inevitably trend towards make-work rather than a useful discussion
- Find the executive sponsor who really cares about infrastructure costs and is willing to push inefficient users to spend less. This is usually your CTO or your CFO. Without an executive sponsor willing to prioritize this efficiency work, you’ll find progress further down this list difficult. If you can’t find a sponsor, that’s usually a good sign that you’re already doing enough to prevent infrastructure costs from becoming a top priority
- Find product cost optimizations. There will be significant opportunities to reduce costs by changing how your product works, e.g. improving your data model, changing storage technologies, moving workloads between streaming and batch. However, product changes have a much wider set of stakeholders, which makes these sorts of improvements harder to prioritize. Generally, only try to pursue these if there is a massive opportunity
- Pursue a cloud vendor contract discount. Negotiating your cloud contract to include a discount is very doable after you reach a certain level of spend, or are a sufficiently strategic partner, but before you reach that level of spend it’s quite difficult to get a meaningful discount. Is it worth spending three weeks and making multi-year financial commitments to get a six percent discount on your cloud spend? Maybe! It depends on your priorities and your confidence in future spending estimates, but it certainly isn’t worth it to everyone. Conversely, at a certain spend level–think, tens of millions of USD per year–your discount can get much higher without requiring any product-level changes
- Set coarse goals on infrastructure costs. Partner with your company’s Finance team to coarsely attribute costs across teams, then set and monitor goals against those costs. Fine-grained goals and cost attribution require a deeper investment into tooling, but most companies can split costs across their production environment, development environments, and data engineering. Once you’ve done that split, you can set a goal for each bucket and assign it to the appropriate team (respectively, something along the lines of infrastructure, developer productivity, and data engineering). This will provide some visibility and pressure on costs without requiring much attribution prework
- Review costs in Business Reviews. Once you’ve set those coarse goals around infrastructure spend, ensure your company’s Business Review Template includes a section on each team’s costs. If you run Business Review Meetings, then make sure someone is showing up to ask questions about costs for teams whose costs are missing their goal or otherwise accelerating
- Expand metadata to facilitate fine-grained goals on infrastructure costs. Implement an approach to Ownership Metadata such that you can assign all usage and storage costs against a specific team. Once you have that ownership metadata maintained, you can go further by generating proactive nudges to teams on following best practices, prioritizing high costs, and helping them identify accelerating spend early
If doing all of these sounds overwhelming, it should! Few companies do all of these, and those that do either operate in a business that is unusually margin sensitive or are spending many millions a year on their infrastructure costs.
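To make the goal-setting and ownership metadata items above a little more concrete, here is a minimal sketch of attributing billing line items to owners and flagging owners who are over their goal. The line-item shape (a dict with `cost` and `tags`), the `team` tag key, and the goal figures are all assumptions for illustration, not the export format of any particular cloud vendor.

```python
from collections import defaultdict

UNOWNED = "unattributed"


def attribute_costs(line_items, owner_key="team"):
    """Sum costs per owner; items without an owner tag land in UNOWNED."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(owner_key) or UNOWNED
        totals[owner] += item["cost"]
    return dict(totals)


def owners_over_goal(totals, goals):
    """Owners whose attributed spend exceeds their goal, sorted by name."""
    return sorted(owner for owner, spent in totals.items()
                  if owner in goals and spent > goals[owner])


# Hypothetical monthly line items and coarse goals, in dollars.
line_items = [
    {"cost": 42_000.0, "tags": {"team": "infrastructure"}},
    {"cost": 9_500.0, "tags": {"team": "developer-productivity"}},
    {"cost": 31_000.0, "tags": {"team": "data-engineering"}},
    {"cost": 4_000.0, "tags": {}},  # no ownership metadata yet
]
goals = {
    "infrastructure": 40_000.0,
    "developer-productivity": 12_000.0,
    "data-engineering": 35_000.0,
}

totals = attribute_costs(line_items)
print(totals)
print("over goal:", owners_over_goal(totals, goals))
```

The size of the unattributed bucket is a useful signal in its own right: if it is large, fine-grained goals will not be credible until the ownership metadata improves.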
Should You Have a Dedicated Efficiency Team?
Generally, the way I think through spinning out any given area into a dedicated team is described in the Trunk and Branches Model, and that applies to efficiency as well. That said, let me add a few caveats to that general approach as it applies here.
Much like managing technical quality, efficiency is an area where you can make significant progress with one-off initiatives. Improving how you use AWS Reserved Instances or renegotiating your vendor contracts can reduce your spend by 30-40% in a week or two. Product-level improvements to your architecture can reduce your spend even more, although they’ll probably take a bit longer.
Because you can make significant progress through one-off initiatives, the default is to wait until late into a company’s growth to spin out a dedicated team, and in most cases that’s the right decision.
The three factors to consider as you think through whether postponing a dedicated team is the best solution for you are:
- Is infrastructure efficiency a fundamental strategic pillar for your business?
- Are your infrastructure costs today, in absolute terms, more than 10x the cost of a team working to reduce them?
- For the past year, have you had pressure to reduce costs but been unable to prioritize the work because other critical work continues to displace efficiency efforts?
If you answer yes to any of those, then you may want to spin out a team earlier than the Trunk and Branches Model suggests. As you start sourcing candidates, it’ll become apparent that this is a bit of a custom role, suited to folks who specifically enjoy working on the problem. Recruiting one or two folks with significant preexisting experience will save you years!