Infrastructure Engineering
April 3, 2022

Efficiency: Managing Infrastructure Costs

In my early career roles, I worked at companies that never worried about their infrastructure costs at all. Those costs were simply too low and growing too slowly for the Finance team to pay much attention to them. This “ignore it until it’s too large to ignore” approach served me well.

Until it didn’t.

Working at Uber, I was caught off guard when a new Director joined and, overnight, infrastructure costs were recategorized from insignificant to requiring urgent, detailed review every month. Adding the instrumentation and accountability for these costs after the fact was difficult. Although I was surprised that time, I’ve come to appreciate that all successful companies go through the transition from ignoring infrastructure costs to setting goals on them, and an early focus during my time at Stripe was ensuring we were ready ahead of that shift.

April 3, 2022

Security

Intro:


Topics:

  • Measuring security risk / setting appropriate goals
  • Compliance & Audits: not the goal, but a component of identifying and reaching your goal
  • Threat modeling
  • Forecasting security risk
  • What are the most important projects to start with?
  • When should you hire a CISO or CSO, and where should they report?

Resources:

  • How we secure Monzo’s banking platform
  • Killing “Chicken Little”: Measure and eliminate risk through forecasting
  • Lessons learned in risk measurement

to be read

April 3, 2022

Productivity / Experience

sources:

  • https://getdx.com/podcast/developer-experience-github

April 3, 2022

Reliability

Intro:

In some ways, this is the topic I’m least excited to write about. For some reason it’s become an extremely opinionated, derisive area of discussion, and not to its benefit: there’s a lot to learn here, and much of the disagreement is a distraction from making clear improvements.


Topics:

  • On-call rotation
  • Observability
  • How reliability process drives learning
    • https://lethain.com/modeling-reliability/
  • Incident Response – coordinate response
    • communication, coordination, mitigation
    • when should you have multiple roles?
    • after you’ve mitigated, move to remediation
  • Incident Reports – drive learning, support aggregation
  • Incident Report Clustering: understand where the recurring areas to invest are
  • Remediations
    • I find that many incident processes generate bad directions despite being run by well-meaning, smart people
      • https://lethain.com/how-to-safely-think-in-systems/
    • motion isn’t the same as progress
    • the incident response process isn’t the outcome, the outcome is the outcome
    • if you’re spending more energy feeding the process than fixing systemic weaknesses, your process is eating you
    • what technical and system changes would significantly prevent incidents or reduce their impact?
    • process changes are easy to think about and leaned on too heavily. Only make process changes when they genuinely make sense and are genuinely better than the alternative. So many incident processes iterate meaninglessly on small process improvements instead of focusing energy on the highest-impact outcomes
  • Some things to avoid

Resources:

© Will Larson 2025