<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Infrastructure Engineering</title><link>https://infraeng.dev/</link><description>Recent content on Infrastructure Engineering</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><copyright>Will Larson</copyright><lastBuildDate>Mon, 16 Jan 2023 07:00:00 -0700</lastBuildDate><atom:link href="https://infraeng.dev/feeds.xml" rel="self" type="application/rss+xml"/><item><title>Categories</title><link>https://infraeng.dev/categories/</link><pubDate>Mon, 16 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/categories/</guid><description/></item><item><title>Book</title><link>https://infraeng.dev/categories/book/</link><pubDate>Mon, 16 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/categories/book/</guid><description/></item><item><title/><link>https://infraeng.dev/posts/</link><pubDate>Mon, 16 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/posts/</guid><description>&lt;p&gt;&lt;em&gt;Suggestions? Take a look at &amp;lsquo;Want to help?&amp;rsquo; section on &lt;a href="https://infraeng.dev/about"&gt;About&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This is the in-progress version of &lt;em&gt;Infrastructure Engineering&lt;/em&gt;.&lt;/p&gt;</description></item><item><title/><link>https://infraeng.dev/</link><pubDate>Mon, 16 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/</guid><description>&lt;p&gt;Hey folks! I&amp;rsquo;m &lt;a href="https://lethain.com/about/"&gt;Will Larson&lt;/a&gt;, sometimes known by &lt;a href="https://twitter.com/lethain"&gt;Lethain&lt;/a&gt;,
and this is &lt;em&gt;&lt;a href="https://infraeng.dev/about/"&gt;Infrastructure Engineering&lt;/a&gt;&lt;/em&gt;.
Infrastructure software engineering impacts the professional lives of every software engineer deeply,
and subtly shapes the products and platforms our companies build,
but relatively little is written about running an effective infrastructure engineering organization.&lt;/p&gt;
&lt;p&gt;Hopefully these interviews and guides will do a bit to help with that!&lt;/p&gt;</description></item><item><title>Tech Spec Review</title><link>https://infraeng.dev/tech-spec-review/</link><pubDate>Mon, 16 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/tech-spec-review/</guid><description>&lt;p&gt;As the organization starts to write more
&lt;a href="https://infraeng.dev/tech-spec/"&gt;Technical Specifications&lt;/a&gt;, you&amp;rsquo;ll eventually want a forum to discuss the key decisions.
At most companies, that meeting is the &lt;em&gt;Tech Spec Review&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;Tech Spec Review&lt;/em&gt; is a forum to review feedback on new &lt;em&gt;Tech Specs&lt;/em&gt;,
resolve open points of discussion, and flag new context to be considered
before finalizing the design. Secondarily, it&amp;rsquo;s a valuable forum for
keeping the wider organization aware of new and upcoming technology changes.&lt;/p&gt;
&lt;div class="callout ba b--light-gray br4 bg-lightest-blue ph4 pv2"&gt;
&lt;p&gt;&lt;strong&gt;Related tools&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://infraeng.dev/tech-spec/"&gt;Tech Spec&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Related meetings&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://infraeng.dev/incident-review/"&gt;Incident Review&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Other approaches to Tech Spec Reviews&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lethain.com/scaling-consistency/"&gt;Scaling technical consistency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://multithreaded.stitchfix.com/blog/2020/12/07/remote-decision-making/"&gt;Technical Decision-Making and Alignment in a Remote Culture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="goals"&gt;Goals&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Drive consistent technical decision making.&lt;/strong&gt;
Much of the value from your technology strategy comes from its consistent application,
and this meeting should support consistency.
The review is a particularly valuable source of problems to inform your technology strategy&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Role model good technical decision-making and discussion.&lt;/strong&gt;
Your organization will learn what good technical decision-making looks like from this meeting.
Proactively coach folks on both their good (&amp;ldquo;keep doing that!&amp;rdquo;) and their ineffective (&amp;ldquo;in the last meeting, &amp;hellip;&amp;rdquo;) feedback&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prevent teams from pursuing local maxima in ways that are misaligned with the company.&lt;/strong&gt;
For example, a given project might benefit from introducing a new database, but the cost to the company
to support business continuity, privacy auditing, and so on might outweigh the project&amp;rsquo;s benefits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid the Tech Spec Review anti-patterns.&lt;/strong&gt;
Don&amp;rsquo;t be a domineering review, bottlenecked review, status-oriented review, or an inert review.
Because it is a key forum for resolving technical disagreement, there are &lt;a href="https://lethain.com/scaling-consistency/"&gt;many ways for Tech Spec Reviews to fail&lt;/a&gt;.
Avoiding these anti-patterns requires ongoing, proactive attention from the &lt;em&gt;Tech Spec Review&lt;/em&gt;&amp;rsquo;s sponsoring leader&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="agenda-scheduling-and-scaling"&gt;Agenda, Scheduling, and Scaling&lt;/h2&gt;
&lt;p&gt;The default approach here is to run these reviews on a weekly cadence,
sending out &lt;em&gt;Tech Specs&lt;/em&gt; for discussion two days ahead of the meeting,
requiring all attendees to read the specs before the meeting,
and canceling meetings ahead of time when there are no specs to review.&lt;/p&gt;
&lt;p&gt;That said, most organizations end up with a fairly custom approach to this meeting.
When your organization is small, you can likely do on-demand reviews for each Tech Spec.
This allows the team to get comfortable reviewing and being reviewed without the risk of
&amp;ldquo;running over time&amp;rdquo; and preventing another spec from getting discussed.&lt;/p&gt;
&lt;p&gt;As your organization grows, it will become hard to schedule all stakeholders into
on-demand meetings, and you&amp;rsquo;ll typically move to a standing meeting. Each standing meeting
should discuss one to three reviews, depending on the size of open decisions. You can experiment a bit
with format here: you might be able to review five specs in five minutes if it&amp;rsquo;s just a matter of approving
each one unless someone has additional concerns to flag.&lt;/p&gt;
&lt;p&gt;There are many ways to scale this meeting.
Some organizations rely on asynchronous review for most specifications, and only bring &amp;ldquo;controversial&amp;rdquo; specs
to the synchronous review.
Some organizations hold multiple &lt;em&gt;Tech Spec Reviews&lt;/em&gt;, sharded by area: one for Product Engineering, one for Infrastructure Engineering,
and so on.
Ultimately, I recommend actively experimenting with your approach based on the specific issues you&amp;rsquo;re running into with the meeting.
There are general solutions, but each company uses this meeting in a somewhat different way, so adopting the standard solution may
not work well for your needs.&lt;/p&gt;
&lt;h2 id="roles--attendance"&gt;Roles &amp;amp; Attendance&lt;/h2&gt;
&lt;p&gt;There are five key roles in the Tech Spec Review:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Facilitator who coordinates the agenda and the conversation.
This is generally either a Staff Engineer, a Technical Program Manager, or a partnership of the two&lt;/li&gt;
&lt;li&gt;Presenter who has written the &lt;a href="https://infraeng.dev/tech-spec/"&gt;Tech Spec&lt;/a&gt; being discussed&lt;/li&gt;
&lt;li&gt;Notetaker who ensures notes from the discussion are captured&lt;/li&gt;
&lt;li&gt;Attendees who share context, ask questions, and participate in the discussion.
Some companies restrict attendance because too many folks attend and want to &amp;ldquo;demonstrate value&amp;rdquo; by
asking questions, or unconstructively inject their personal preferences rather than prioritizing the organization&amp;rsquo;s perspective.
Generally, I think it&amp;rsquo;s better to allow open attendance and give direct, firm feedback to those who attend unconstructively.
If folks feel like they &lt;em&gt;must&lt;/em&gt; attend to avoid bad decisions impacting their team, then you should probably consider creating
more visibility into Tech Specs outside of this meeting, via either chat or email&lt;/li&gt;
&lt;li&gt;Sponsor who provides organizational weight to the meeting through their participation.
This is generally either the head of engineering, a Staff Engineer &lt;a href="https://staffeng.com/guides/staff-archetypes"&gt;serving as the head of engineering&amp;rsquo;s right hand&lt;/a&gt;,
or a manager reporting directly to the head of engineering&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="is-it-working"&gt;Is it working?&lt;/h2&gt;
&lt;p&gt;Some questions to ask when considering if your current Tech Spec Review
is working:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do you have Tech Specs coming in for review?
If not, is it because the review isn&amp;rsquo;t useful?
Is the review too intimidating?
Are folks not sure how to submit new specs?&lt;/li&gt;
&lt;li&gt;Are too many specs coming in for review, such that feedback is slowing down execution?
Is there a set of category-wide decisions you could make that would reduce the need for certain kinds of Tech Specs
(e.g. auto-approve specs that use the common storage and compute tiers)?&lt;/li&gt;
&lt;li&gt;Are reviews generally getting to the right decisions?
Are the right concerns being raised, but getting rejected because the presenters don&amp;rsquo;t engage with feedback?
Conversely, is it because the review lacks the necessary authority to succeed in your company?&lt;/li&gt;
&lt;li&gt;Are discussions generally on topic? Do some participants routinely derail discussion?
How could you prevent that pattern from recurring?&lt;/li&gt;
&lt;li&gt;Do attendees enjoy attending?&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Meetings</title><link>https://infraeng.dev/posts/meetings/</link><pubDate>Mon, 16 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/posts/meetings/</guid><description/></item><item><title>Incident Review</title><link>https://infraeng.dev/incident-review/</link><pubDate>Thu, 05 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/incident-review/</guid><description>&lt;p&gt;I&amp;rsquo;ve never heard of a company that has a business, that doesn&amp;rsquo;t also occasionally have things go wrong.
Something going wrong might turn into a support ticket, an angry email, or an alert popping up on an on-call
engineer&amp;rsquo;s phone.
If there is user or business impact, and an engineer might need to respond, then it becomes an incident.&lt;/p&gt;
&lt;p&gt;After the incident, the folks involved in mitigation write an &lt;em&gt;&lt;a href="incident-review-template"&gt;Incident Review Template&lt;/a&gt;&lt;/em&gt;,
and that document is discussed in this meeting, the &lt;em&gt;Incident Review&lt;/em&gt;.&lt;/p&gt;
&lt;div class="callout ba b--light-gray br4 bg-lightest-blue ph4 pv2"&gt;
&lt;p&gt;&lt;strong&gt;Related Tools&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://infraeng.dev/incident-review-template/"&gt;Incident Review Template&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Other Approaches to Incident Review&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://response.pagerduty.com/after/post_mortem_process/"&gt;PagerDuty&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="goals"&gt;Goals&lt;/h2&gt;
&lt;p&gt;Incident Reviews are a cultural carrier meeting for most engineering organizations.
They are a rare meeting where you will see a wide mix of teams and seniority-levels arguing about something that the business cares
about deeply: customer and employee impact.
A well-run Incident Review helps new employees quickly understand how your culture works when things really matter.&lt;/p&gt;
&lt;p&gt;An effective &lt;em&gt;Incident Review&lt;/em&gt; facilitates these goals:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Foster and socialize learning about what caused an incident&lt;/strong&gt;:
incidents have a certain inherent rhythm, and the only way to change it is to ensure others are aware.
The most valuable thing this meeting does is create awareness of what has &lt;em&gt;actually&lt;/em&gt; happened in a given incident,
which is the precursor to preventing a repeat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Surface missing context across teams and functions&lt;/strong&gt;:
customer success might mention an impact to users,
an infrastructure engineering team might mention that the incident had a wider impact than initially recognized,
a product engineering team might explain the business cost of delayed message processing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inform investments on work that will best contribute to increased reliability&lt;/strong&gt;:
broaden an ongoing investment project to support a new edge case,
cancel a previous mitigation effort based on improved understanding of the underlying issue,
recognize that similar issues are repeating without being successfully addressed&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="anti-goals"&gt;Anti-goals&lt;/h2&gt;
&lt;p&gt;Because of this cultural significance, Incident Reviews also have a predictable tendency to become ideological arenas,
and to attract participants with ideological goals about the &lt;em&gt;right way&lt;/em&gt; to foster reliability, run reviews, etc.
Your goal as the senior leader who owns this meeting is to prevent it from becoming an open ideological discussion forum,
and to instead focus it on the specific agenda at hand.&lt;/p&gt;
&lt;p&gt;Several patterns to be wary of:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ensuring adherence to documented process&lt;/strong&gt;: some review meetings become focused on driving adherence to the specified
incident response or review process. That is valuable work, but ineffective to conduct in a large, learning-oriented forum.
Instead, drive adherence before the meeting&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pedantic or status-oriented&lt;/strong&gt;:
a surprising number of incident discussions end up orienting
around policing correct nomenclature rather than encouraging learning and growth.
Effective reviews are progress-oriented, with practitioners who explain important context when additive,
but don&amp;rsquo;t orient around policing correctness&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public performance of a one-person play&lt;/strong&gt;: effective learning meetings don&amp;rsquo;t spend much time reading materials or reports
out loud. The entire time should be devoted to discussion, perhaps with a short initial window for attendees to read the report.
Learning is a group activity, whereas readouts are a solitary performance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public performance of a two-person play&lt;/strong&gt;: some meetings adopt a consistent chorus across sessions.
A certain set of questions, e.g. &amp;ldquo;How did you first become aware of this issue?&amp;rdquo;, will be asked and answered at
each session, consuming much of the time.
That feels useful, but it implicitly silences the wider group, who are not able to contribute their context and encourage group learning&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Finally, like any important, large meeting, there may sometimes be individuals who are more focused on their personal ideological goals than
the meeting&amp;rsquo;s goals, and it&amp;rsquo;s your responsibility to either anchor them on the meeting&amp;rsquo;s goals or get them out of the meeting
so work can be done.&lt;/p&gt;
&lt;h2 id="agenda-scheduling-and-scaling"&gt;Agenda, Scheduling, and Scaling&lt;/h2&gt;
&lt;p&gt;The agenda for every incident review is discussion of one to two individual incidents
or a cluster of related incidents. The agenda should be decided one to two days ahead of
the review, and shared out with attendees to allow them to prepare.
Because most learning occurs in discussion, I recommend against trying to include more than two
incidents (or one batch of related incidents) in a given session.&lt;/p&gt;
&lt;p&gt;Run these on a weekly cadence, canceling ahead of time when there are no
incidents to review.&lt;/p&gt;
&lt;p&gt;If you start to have a backlog of incidents to review, then you have three options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Batch related incidents&lt;/strong&gt; if you have a cluster of incidents with shared contributing causes.
For example, you might have a streak of incidents related to database instability caused by unindexed queries,
which would benefit from one curated, joint discussion rather than treating each as an independent incident&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extend review time for one week&lt;/strong&gt; to have more incident review bandwidth. This works best when you have
a short-term spike in incidents. Generally speaking, it is an organizational smell to permanently extend
incident review beyond an hour a week for a large audience, as it&amp;rsquo;s an expensive investment of time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stop discussing lower severity incidents in the review.&lt;/strong&gt;
For example, only discuss incidents with &amp;ldquo;significant&amp;rdquo; customer or internal impact,
coupled with a simple definition of what incidents would fall beneath the line&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="roles--attendance"&gt;Roles &amp;amp; Attendance&lt;/h2&gt;
&lt;p&gt;There are five key roles in an &lt;em&gt;Incident Review&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Facilitator who coordinates the agenda and the conversation&lt;/li&gt;
&lt;li&gt;Presenter who filled in the &lt;em&gt;Incident Review Template&lt;/em&gt; for a given incident&lt;/li&gt;
&lt;li&gt;Notetaker who ensures notes from the discussion are captured&lt;/li&gt;
&lt;li&gt;Attendees who share context, ask questions, and learn from the discussion&lt;/li&gt;
&lt;li&gt;Sponsor who provides organizational weight to the meeting through their participation.
This is generally either the head of engineering or the head of infrastructure.
It is reasonable for the Sponsor to occasionally miss, but I believe it&amp;rsquo;s essential for
them to attend the majority of incident reviews&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;em&gt;Incident Review&lt;/em&gt;&amp;rsquo;s goals, particularly around learning and surfacing missing context,
encourage a wide audience of attendees. I recommend allowing anyone to participate so long as
they read&amp;ndash;and abide by&amp;ndash;the meeting&amp;rsquo;s goals and anti-goals. Ensuring folks act in accordance with
the meeting&amp;rsquo;s goals is a joint responsibility of the Facilitator and the Sponsor.&lt;/p&gt;
&lt;h2 id="is-it-working"&gt;Is it working?&lt;/h2&gt;
&lt;p&gt;Some questions to ask yourself if you&amp;rsquo;re unsure if your meeting is useful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Are they getting scheduled? If not, and that&amp;rsquo;s because you&amp;rsquo;re truly not having incidents, great!
Conversely, if it&amp;rsquo;s because folks are not filling in the template, then dig into why not.
Often these templates get overloaded with many questions to please many stakeholders,
and consequently become difficult to use&lt;/li&gt;
&lt;li&gt;Are key personnel attending? Particularly the sorts of folks who have important context to bring into the discussion.
If the meeting is working, this should be an exceptionally high-leverage opportunity to grow the organization&lt;/li&gt;
&lt;li&gt;Are the discussions resulting in a modified reliability strategy or roadmap?
If these discussions are driving learning, then they should alter the shape of your roadmap&lt;/li&gt;
&lt;li&gt;Do you enjoy attending?&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Tools</title><link>https://infraeng.dev/posts/tools/</link><pubDate>Thu, 05 Jan 2023 07:00:00 -0700</pubDate><guid>https://infraeng.dev/posts/tools/</guid><description/></item><item><title>Matthew Clarke</title><link>https://infraeng.dev/matthew-clarke/</link><pubDate>Wed, 11 May 2022 08:15:00 -0700</pubDate><guid>https://infraeng.dev/matthew-clarke/</guid><description>&lt;p&gt;&lt;em&gt;Interview in May, 2022. Learn more about Matthew on &lt;a href="https://matthewclarke.io/"&gt;his blog&lt;/a&gt;, &lt;a href="https://twitter.com/MatthewClarke47"&gt;twitter&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/in/matthewclarke47/"&gt;linkedin&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I work at Spotify as a Senior Backend Infrastructure Engineer. My team builds and maintains the tools that enable Spotify engineers to deploy safely and quickly whenever they need to.&lt;/p&gt;
&lt;p&gt;We work a lot with Kubernetes, which Spotify uses to deploy and manage most of its websites and backend services. Spotify runs some of the largest multi-tenant Google Kubernetes Engine (GKE) workloads in the world, so this is a large responsibility.&lt;/p&gt;
&lt;p&gt;My team builds tools on top of Kubernetes to simplify and create a great developer experience. These tools involve developing and maintaining our deployment tools, aggregating error messages from different Kubernetes resources and displaying them through Backstage (our internal developer portal), supporting developers on Slack with questions they have or problems they’re running into with Kubernetes, and working on our Kubernetes plugin for open source &lt;a href="http://backstage.io"&gt;Backstage&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How did you start doing infrastructure engineering work? How have the companies you joined, your location, or your education impacted your path?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I actually started as a software engineer focused on e-commerce; while there are a lot of interesting problems to solve in this space, I found the lack of direct interaction with end-users frustrating. People don’t really care how “well” their online payment is accepted as long as it goes through, so you don&amp;rsquo;t get valuable feedback often.&lt;/p&gt;
&lt;p&gt;My role at the Financial Times was my first real taste of infrastructure engineering. It was a DevOps microservice role focused on identity and e-commerce. My team was responsible for provisioning cloud resources, writing applications, deploying them and monitoring them. There I learned a lot about AWS, Kubernetes, and Cassandra. We used lots of different languages so that we could experiment with what worked for us, including Python, Java, Scala, Node, Go and Elixir, but we mainly settled on Java and Go.&lt;/p&gt;
&lt;p&gt;However, throughout all of my roles, I found I gravitated towards building developer tools, whether that was integrating two different build platforms at Cybersource/Visa, adopting Kubernetes at the Financial Times or changing to my current team at Spotify. One of the great things about infrastructure engineering is that you are sitting beside your users every day; they’re your colleagues, and you get to help make their life easier and get instant feedback about what they like and don’t like.&lt;/p&gt;
&lt;p&gt;I have always wanted to have a big impact at the companies that I have worked at, and there is no better way to have an impact than to help increase the productivity of all the other developers at the company. This is also why I love to contribute to open source. By contributing to open source, you can make an impact not just at your company but throughout the whole industry.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What dashboards and metrics do you &lt;em&gt;personally&lt;/em&gt; use to stay aware of your software and team’s work?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I use Backstage a lot to keep track of the current state of my team’s services. Backstage provides integrations with monitoring, deployment, CI and tech docs all in one place.&lt;/p&gt;
&lt;p&gt;Other than that, I keep track of the various deployment features we provide, such as test environments and automated canary analysis, to get a good idea of what features users find useful.&lt;/p&gt;
&lt;p&gt;Recently we have been making the effort to try to quantify and visualize deployment toil, so that we can see if we are moving things in the right direction with our platform offerings.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What would happen over the next month if your infra org were all pulled away onto a secret project and couldn’t do their day-to-day efforts? Where would things slow down?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I think most things would continue along but probably not very efficiently, trending downwards.&lt;/p&gt;
&lt;p&gt;We help developers at Spotify every day by answering their infrastructure questions, helping them get their services set up or debugging production issues, so there would be a lot of unanswered Slack conversations! We are also continually scaling our systems out behind the scenes to continue to support an ever-growing number of users and artists.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Infrastructure engineering organizations have a lot of priorities. A few years ago &lt;a href="https://lethain.com/infrastructure-between-cost-center-and-before-ego-trip/"&gt;I tried to define an overarching set of infrastructure priorities&lt;/a&gt; and came up with: security, reliability, usability, leverage, cost and latency. Of course, folks immediately started arguing I’d defined the scope too narrowly. How do you figure out what to prioritize working on?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is a very interesting question, we get a lot of feature requests and feedback, but it is impossible to do everything. We try to focus our time on work that will have a wide impact, usually defining this on how much “toil” we can prevent. Toil for us is users making infrastructure changes or tweaks that should be automated or happen behind the scenes without their interaction. An example of this would be our effort to &lt;a href="https://engineering.atspotify.com/2020/06/tech-migrations-the-spotify-way/"&gt;automate migrations&lt;/a&gt;, make it clear to users the goal, and provide the tools to perform a migration with as small overhead as possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related to priorities, one topic that I’ve had come up a few times recently is the idea of “Shadow IT”, where other organizations bootstrap an infrastructure project without your knowledge, and then ask you to take over running it once it becomes a burden. How do you deal with other teams asking infrastructure to take over their projects once they’re no longer fun (or often when the original implementer leaves the company)?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Something my team has been struggling with recently is the sheer number of systems and tools we own. Some of these might have been transferred to us like you mention above, but we give the benefit of the doubt and assume it was the best decision the implementor could have made given the information they had at the time.&lt;/p&gt;
&lt;p&gt;Still, you can’t support a limitless number of systems and tools. Therefore the questions my team asks are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Do we &lt;strong&gt;&lt;em&gt;really&lt;/em&gt;&lt;/strong&gt; need this tool? / What value is it bringing?&lt;/li&gt;
&lt;li&gt;Are we the best people to be supporting this tool? / Instead of supporting this tool could we be doing something more important?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If we can’t justify the tool existing then it is a good candidate for deprecation. If it is valuable but we aren’t the best people to support it or could be working on something more important then perhaps we need to find a new owner, either another team internally or a managed version of the tool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What&amp;rsquo;s the single most impactful project you’ve heard of an infra engineering org doing? Why? Was it obviously impactful beforehand?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I would say Backstage fits the mold for this. Before Backstage, different infrastructure teams at Spotify would create their own user interfaces, this work was not very efficient. Engineers had folders full of bookmarks and infrastructure engineers would toil away solving problems that other teams already had solutions for.&lt;/p&gt;
&lt;p&gt;When Backstage came along the benefit was clear: developers had one portal for all their infrastructure needs, they could search Backstage for docs, datasets, teams, services and runbooks. Infrastructure engineers could embed their interfaces in Backstage and benefit from the large library of utilities and React components the Backstage maintainers had created for common use.&lt;/p&gt;
&lt;p&gt;This lightened the load for all the engineers at the company and ultimately improved developer productivity, which is the ultimate goal of an infrastructure engineer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Your current work focuses heavily on Kubernetes. This is a technology that has an outsized impact on the technology industry, and over the last six years has grown from something perceived as a toy into something widely used at scale. Where do you see the future of Kubernetes going?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kubernetes is an open-source success story. It is great to see the industry rally around it as a project, including building incredible tools on top of it. Initially, it seemed like the only benefit was container orchestration. However, now we can see the additional benefits of extensibility, which has pushed Kubernetes beyond just containers.&lt;/p&gt;
&lt;p&gt;In the future, I’m excited to see where the community goes with handling multiple clusters and whether some patterns emerge there. I also think there will be an emerging trend of workload clusters vs infrastructure-as-code clusters; some Kubernetes clusters will be used to manage your infrastructure through tools like Crossplane, and others will be where your services run.&lt;/p&gt;
&lt;p&gt;I also hope we continue to see Kubernetes tools evolve to address the needs of service owners who have services running inside multi-tenant clusters and not just the administrators of the clusters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ok, excluding Kubernetes, but are there other technologies or tools that you see advancing the field in a similar way? What about technologies or tools, other than Kubernetes, that you believe will meaningfully advance the field over the next decade?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;While I am a contributor, I do think Backstage has the potential to change how developers interact with their infrastructure and allows them to better focus on their code. Backstage has grown from an internal tool at Spotify to an open source CNCF Incubating project with hundreds of adopters and contributors, dozens of tool integrations and several commercial ventures using it as the basis of their products. The ability for a developer to have a single view of the entire software ecosystem at their company, including monitoring, docs, CI/CD and runtime, has been incredibly valuable at Spotify, and I think other organizations are discovering this too.&lt;/p&gt;
&lt;p&gt;I am also very excited about &lt;a href="https://ebpf.io/"&gt;eBPF&lt;/a&gt;; quite a few different tools are emerging that could enable language-agnostic service-mesh-like features in a microservice environment built on top of it. I like the idea of a service mesh that doesn&amp;rsquo;t require a sidecar proxy, which has latency and cost overheads. However, I think it still has a pretty steep hill to climb to rival some of the proxy-based service meshes out there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I learned a lot from Sarah Wells when we were at the Financial Times; we embarked on a Kubernetes migration fairly ahead of the curve; &lt;a href="https://www.youtube.com/watch?v=H06qrNmGqyE&amp;amp;ab_channel=CNCF%5BCloudNativeComputingFoundation%5D"&gt;Sarah gave a great talk on our migration&lt;/a&gt; (which is probably why it has been on the Kubernetes homepage for four years now!).&lt;/p&gt;
&lt;p&gt;I love to read; some of my recent highlights have been: &lt;em&gt;&lt;a href="https://www.amazon.com/Network-Programming-Go-Adam-Woodbeck/dp/1718500882/"&gt;Network Programming with Go&lt;/a&gt;&lt;/em&gt; by Adam Woodbeck, &lt;em&gt;&lt;a href="https://www.amazon.com/Effective-Python-Specific-Software-Development/dp/0134853989/"&gt;Effective Python&lt;/a&gt;&lt;/em&gt; by Brett Slatkin, &lt;em&gt;&lt;a href="https://www.amazon.com/Philosophy-Software-Design-2nd/dp/173210221X/"&gt;A Philosophy of Software Design&lt;/a&gt;&lt;/em&gt; by John Ousterhout and, of course, &lt;em&gt;&lt;a href="https://staffeng.com/book"&gt;Staff Engineer&lt;/a&gt;&lt;/em&gt; by Will Larson.&lt;/p&gt;
&lt;p&gt;I follow quite a few blogs, but the most valuable personally has been &lt;a href="http://lwkd.info/"&gt;Last Week in Kubernetes Development&lt;/a&gt;. It can be tough to follow the current development of the Kubernetes codebase as it is such a moving target; this blog summarizes the interesting PRs, merges, deprecations and news, which makes that task a bit easier.&lt;/p&gt;</description></item><item><title>Manager</title><link>https://infraeng.dev/categories/manager/</link><pubDate>Wed, 11 May 2022 08:15:00 -0700</pubDate><guid>https://infraeng.dev/categories/manager/</guid><description/></item><item><title>Interview</title><link>https://infraeng.dev/categories/interview/</link><pubDate>Wed, 11 May 2022 08:15:00 -0700</pubDate><guid>https://infraeng.dev/categories/interview/</guid><description/></item><item><title/><link>https://infraeng.dev/interviews/</link><pubDate>Wed, 11 May 2022 08:15:00 -0700</pubDate><guid>https://infraeng.dev/interviews/</guid><description>&lt;p&gt;&lt;em&gt;Suggestions? Take a look at &amp;lsquo;Want to help?&amp;rsquo; section on &lt;a href="https://infraeng.dev/about"&gt;About&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Folks who have shared their infrastructure engineering wisdom:&lt;/p&gt;</description></item><item><title>Mahdi Yusuf</title><link>https://infraeng.dev/mahdi-yusuf/</link><pubDate>Tue, 03 May 2022 14:00:00 -0700</pubDate><guid>https://infraeng.dev/mahdi-yusuf/</guid><description>&lt;p&gt;&lt;em&gt;Written interview in May, 2022. Learn more about Mahdi on &lt;a href="https://mahdiyusuf.com/"&gt;his website&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/myusuf3/"&gt;linkedin&lt;/a&gt;, and his &lt;a href="https://podcast.staffeng.com/1687069/8585293-mahdi-yusuf-1password"&gt;StaffEng podcast interview&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I am currently a Senior Staff Engineer at 1Password, leading the Server Architecture team. We are involved in our systems&amp;rsquo; overall design while pushing for the modernization of legacy systems.&lt;/p&gt;
&lt;p&gt;The work encompasses everything from our overall system reliability to a few core components like queues, workers, and data stores. We also spend a decent chunk of time maintaining foundational libraries and service scaffolds that are used throughout the company.&lt;/p&gt;
&lt;p&gt;Generally, this includes most of the non-product engineering work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What dashboards, metrics, and forums do you &lt;em&gt;personally&lt;/em&gt; use to stay aware of your organization? Is there a different answer that you would be more proud of? What’s preventing you from that answer being the current answer?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Currently, we are a Datadog shop for our dashboards and metrics. We now use collaborative Datadog notebooks when discussing/investigating new initiatives. We also use Kibana for logging and Bugsnag for error tracking.&lt;/p&gt;
&lt;p&gt;I would like to see something that could cut across all three of those places to get a real sense of what is happening across our entire system, without having to jump from platform to platform. One tool to rule them all. The more data sources you can synthesize, the better your understanding of your system can be.&lt;/p&gt;
&lt;p&gt;I have been an avid user of Grafana, which delivers on the premise above. It integrates metrics, logs, and traces all in one clean interface.&lt;/p&gt;
&lt;p&gt;There were various considerations around sticking with Datadog. In addition to the cost of moving, there was the question of who would keep this running. I am happy to see there is a managed Grafana being offered by Amazon now. So we may revisit this when we have more time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What would happen over the next month if 1Password’s infrastructure org were all pulled away onto a secret project and couldn’t do their day to day efforts? Would the company still run?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Depending on when you ask that, it can vary. But, honestly, as much as I would like to say things would grind to a halt, they wouldn&amp;rsquo;t. It&amp;rsquo;s a constant effort, but we are always trying to make sure no team is in a position to bring things to a complete halt.&lt;/p&gt;
&lt;p&gt;If the infrastructure organization were utterly gone, progress on tasks that have payoff farther in the future would lag behind the rest of the organization&amp;rsquo;s efforts, eventually impacting the broader organization.&lt;/p&gt;
&lt;p&gt;The way I like to look at this work is as necessary investments we need to make today for the future progress of the entire engineering organization. So it&amp;rsquo;s a constant trade-off with many factors that come into play.&lt;/p&gt;
&lt;p&gt;I have never seen this effectively work without dedicated teams focused on issues in the production systems. A new product feature usually trumps fixing something that isn&amp;rsquo;t a problem&amp;hellip;yet.
How long and at what speed the company would still run are probably more pertinent questions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Infrastructure engineering organizations have a lot of priorities. A few years ago &lt;a href="https://lethain.com/infrastructure-between-cost-center-and-before-ego-trip/"&gt;I tried to define an overarching set of infrastructure priorities&lt;/a&gt; and came up with: security, reliability, usability, leverage, cost and latency. Of course, folks immediately started arguing I’d defined the scope too narrowly. How do you figure out what to work on?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is something I have been thinking about lately. If I was pressed to really get into my gut and define these prioritizations, it would be tricky but let me try here. Frameworks are great general guidelines when you don&amp;rsquo;t have context. Still, most of these decisions depend on the organization&amp;rsquo;s willingness to make said priorities happen and stick with them to see them through to completion.&lt;/p&gt;
&lt;p&gt;That being given, I primarily focus on desired outcomes and slowly put problems behind us. Some of these classes of issues come back in various forms (see: scaling and migrations).&lt;/p&gt;
&lt;p&gt;Also, knowing you can&amp;rsquo;t solve them all quickly, let&amp;rsquo;s get to the actual job of prioritization.&lt;/p&gt;
&lt;p&gt;The first thing you need to identify is the severity of these problems. There are classes of problems that you can live with and others that will only get worse if they aren&amp;rsquo;t given the attention they need. The problems in the latter group usually aren&amp;rsquo;t a problem today, but left alone they can be limiting in some way in the future.&lt;/p&gt;
&lt;p&gt;Keeping the organization as agile as possible is essential in this regard. I might be conservative, but I always pay off the compounding debt first. Software systems change, but teams always build on top of what is there today.&lt;/p&gt;
&lt;p&gt;If a problem has more or less the same impact on the organization six months from now as it does today, it goes down my list of importance. However, if it gets worse as time goes on, it moves higher on my list of importance. This is when compounding is working against you.&lt;/p&gt;
&lt;p&gt;Now let&amp;rsquo;s talk about when compounding is working with you. Say something makes each of my engineers lose an hour a week, just one hour. If I eliminate that, I have saved the company 200 hours a week and reduced toil in the process. These classes of problems aren&amp;rsquo;t the ones that usually get worse with time; these are typically focused on developer velocity and usability.&lt;/p&gt;
&lt;p&gt;So there you go, another framework.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Related to priorities, one topic that I’ve had come up a few times recently is the idea of “Shadow IT”, where other organizations bootstrap an infrastructure project without your knowledge, and then ask you to take over running it once it becomes a burden. How do you deal with other teams asking infrastructure to take over their projects once they’re no longer fun (or often when the original implementer leaves the company)?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You can always say no.
Use this one sparingly; you will often end up working with that team in the future, and the road to fame and riches is long.&lt;/p&gt;
&lt;p&gt;Frequently these systems are necessary but not in active development. It usually isn&amp;rsquo;t that bad if you can have some time for hand-off and transition it slowly. Documentation here can be worth its weight in gold. Knowing where the bodies are buried is helpful when things eventually go wrong.&lt;/p&gt;
&lt;p&gt;There are always teams that get overburdened with these services with no owners. The burden is much like peanut butter: it&amp;rsquo;s better when spread around. There is always a team that is the best fit for said service.&lt;/p&gt;
&lt;p&gt;Like Spike Lee said, &amp;ldquo;Do the Right Thing.&amp;rdquo; If the team is overburdened, you can always assign more headcount to the team.&lt;/p&gt;
&lt;p&gt;I will say that leaving these services without clear ownership is a poison pill for your organization. People will shirk responsibility, and zero effort will be put towards these services, sometimes out of mere spite. It is better to assign the service to a team that won&amp;rsquo;t prioritize it than to give it to no one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At times I have run into a belief that infrastructure necessarily conflicts with productivity: e.g. we have to reduce productivity to increase reliability. Have you seen a tension between infrastructure and product engineering productivity? Are there ways to reduce that tension?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Absolutely! Measuring it is something you should try to do. For example, can you measure how long it takes merge requests to get through review? How long are RFCs in the review state? How many regressions are we seeing after deploying a new piece of infrastructure?&lt;/p&gt;
&lt;p&gt;These things can worsen if infrastructure engineering is too prescriptive without understanding the underlying product work. Embedding infrastructure engineers into product teams can help here. But, again, it&amp;rsquo;s mostly balancing priorities/perspectives and communicating clearly.&lt;/p&gt;
&lt;p&gt;The benefits of embedding are twofold. Infrastructure engineers get first-hand experience of what is slowing down product engineers, which they can take back to the team to improve things, and product engineers get some visibility into how these processes improve reliability in production.&lt;/p&gt;
&lt;p&gt;I believe in supporting product engineers to deal with (read: empower to resolve) most of the issues their code causes in production and support them if they need help. But unfortunately, overzealous product engineers create debt faster than they develop products.&lt;/p&gt;
&lt;p&gt;Most of this tension usually comes from not getting feedback in the correct stages if you cannot embed engineers into product teams. Writing design specifications can be fantastic and lets most of the discussion occur before the rubber meets the road.&lt;/p&gt;
&lt;p&gt;Ideas are quickly redrawn, maybe even code; architecture and infrastructure, not so effortlessly. Where would you want to give constructive feedback?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is true, but data shall set you free. So it&amp;rsquo;s essential to capture why you are doing something and what you think it will improve. Then follow up with either data or people you impacted with that change.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s all about outcomes. If you can&amp;rsquo;t track those with either data or teams you impacted positively through your efforts, you should probably rethink them. If you are doing this effectively it shouldn’t be too hard to articulate. You are often left to synthesize where engineering has underinvested and figure out if anyone cares.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s important to understand that as software systems grow and more people start working on them, they become more complex. Unfortunately, you can look at these like a thousand cuts over time, so they are easy to miss and overlook.&lt;/p&gt;
&lt;p&gt;Making sure you don&amp;rsquo;t succumb to these changes is essential. But unfortunately, I am sure most infrastructure engineers have been in the position where something they wanted to work on was minimized and deprioritized, only to have things quickly change when things go splat.
Understanding risks and tying those to straightforward trade-offs is vital to communicating with leadership.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I have found Twitter in general to be a great resource throughout my career. I have met tons of people and learned so much. I often recommend &lt;a href="https://leaddev.com"&gt;LeadDev&lt;/a&gt; to new leads because they have outstanding resources. I am also a big fan of &lt;a href="https://twitter.com/neal4d"&gt;Neal Ford&lt;/a&gt;’s works around software architecture. I am also working on something new here called &lt;a href="https://architecturenotes.co"&gt;architecturenotes.co&lt;/a&gt; where we break down system design with the people that built them. I think this audience would get a kick out of it.&lt;/p&gt;</description></item><item><title>Shawn Wang / swyx</title><link>https://infraeng.dev/swyx/</link><pubDate>Mon, 11 Apr 2022 08:00:00 -0700</pubDate><guid>https://infraeng.dev/swyx/</guid><description>&lt;p&gt;&lt;em&gt;Interview occurred in February, 2022. Read more from Shawn on his &lt;a href="https://www.swyx.io/"&gt;blog&lt;/a&gt;, &lt;a href="https://twitter.com/swyx"&gt;twitter&lt;/a&gt;, and his book, &lt;a href="https://learninpublic.org/"&gt;The Coding Career Handbook&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I’m currently Head of Developer Experience at Temporal.io, an open source workflow engine for long running, durable processes powering companies as small as 2-person Y Combinator startups, to enterprises as large as Stripe, Snap, Datadog, Netflix, DoorDash, etc. We are generally responsible for improving the experience of &amp;ldquo;front line&amp;rdquo; individual contributor developers, covering their end to end journey from first contact (DevRel) to learning (Docs) to API Design (SDKs) to ecosystem (Community).&lt;/p&gt;
&lt;p&gt;The basic insight is that companies ship their org charts (Conway’s law), but developers don’t care what team shipped which when they go through your product, so it makes sense to have someone whose job it is to coordinate and build out developer-facing efforts cohesively.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In &lt;a href="https://www.swyx.io/developer-exception/"&gt;Developer Exception Engineering&lt;/a&gt;, you wrote a bit about the slipperiness of defining “developer experience,” and how it often varies significantly across companies. How would you explain developer experience to someone unfamiliar with the role?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Developer Experience (DX) is a buzzword at this point, so naturally everyone is co-opting it to represent their particular view of the world, making it extra confusing for anyone that just wants a straight answer. But I’ll give you my best shot here.&lt;/p&gt;
&lt;p&gt;At the highest level, the basic dichotomy to be aware of is “internal” vs “external” DX. Most people who come up within one of these two branches may be completely unaware of the other, which contributes to the confusion when people discuss “DX”.&lt;/p&gt;
&lt;p&gt;Internal DX teams focus on developer productivity within a company (sometimes called “dev infra”). The math is simple - if you have 50 engineers, and you think it’s possible to improve their productivity by &amp;gt;1% a quarter, then you would be silly not to invest in 1-2 engineers who don’t work on product, but just focus on making everyone else more productive. Scale this up to a 1,000-engineer company and you now have a whole Internal DX org to play with. They can span a wide range of deliverables, from build/test automation to dev environment to code quality. The clearest mental model for identifying Internal DX opportunities I’ve come across is Netflix’s &lt;a href="https://soundcloud.com/front-end-happy-hour/episode-122-productivity-engineering-ballmer-peak"&gt;Productivity Engineering team&lt;/a&gt;, which is responsible for three major components - from new hire to productive local dev (their Bootcamp and &lt;a href="https://thenewstack.io/netflix-builds-pipeline-polyglot-programming/"&gt;bootstrapping tool, NEWT&lt;/a&gt;), from local dev to production (their &amp;ldquo;&lt;a href="https://www.infoq.com/news/2018/08/better-devex-at-netflix/"&gt;build-bake-deploy paved road&lt;/a&gt;&amp;rdquo;), and then from production back to dev (their observability tools like &lt;a href="https://netflixtechblog.com/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17"&gt;Atlas&lt;/a&gt;). The other popular taxonomy to work with is &lt;a href="https://www.usehaystack.io/blog/the-accelerate-book-the-four-key-devops-metrics-why-they-matter"&gt;the four Accelerate metrics&lt;/a&gt;. Both of these approaches essentially divvy up the software development lifecycle into meaningful chunks, which can then be independently and tangibly improved by internal DX teams.&lt;/p&gt;
&lt;p&gt;External DX teams focus on improving developer adoption/mindshare/productivity at &lt;em&gt;other&lt;/em&gt; companies. Where almost any software company can have Internal DX, it only really makes sense to have External DX if you make something &lt;em&gt;for developers&lt;/em&gt;. This means it is a natural fit for devtools companies, but you might be surprised at what companies invest in this. Are Spotify, Notion and Slack devtools companies? No… but they all offer APIs for developers! So they all have DevRel teams. The distinction between DevRel (also known as Dev Advocacy) and DX is another common question. On one hand, traditional DevRel is very heavy on content creation (blogging and speaking, basically, but also demos and workshops), whereas DX has more of a mandate to write (non-core) code and docs to solve problems. I first transitioned from DevRel to DX at Netlify, where eventually it formally covered &lt;a href="https://www.netlify.com/blog/2021/01/06/developer-experience-at-netlify/"&gt;Advocacy, Integrations, and Documentation&lt;/a&gt;. The exact coverage will naturally differ based on the product - for example, Netlify is a closed source SaaS platform, so Advocacy plays a bigger role, whereas Temporal is an open source client-server system, where equal love needs to be given to Community and Docs.&lt;/p&gt;
&lt;p&gt;A quick aside for those who often hear DevRel vs DX conflated: DX is supposed to be the superset, but frankly, the lion’s share of DX is still DevRel, for both economic and historical reasons. Economic, because most developers know how to build product, but are terrible at building distribution, so a DX team often contributes the most value by reaching developers despite all the other things on its plate. Historical, because the DevRel to DX transition is a once-in-a-lifetime career upgrade for Dev Advocates to have more impact, just like the Sysadmin to DevOps transition. It all makes sense once you consider that Dev Advocates speak the most to users, but usually have the least power to make fundamental changes to solve their pain, particularly those I term “&lt;em&gt;Developer Exceptions&lt;/em&gt;” in that blogpost. Blogposts and talks have a half-life far shorter than docs and tooling/product improvements.&lt;/p&gt;
&lt;p&gt;Once you’ve marinated in the various aspects of DX enough, the distinctions start to re-blur once you consider that Internal DX just serves internal customers (and needs to invest in docs and advocacy too), and External DX serves external ones (and needs to tangibly improve productivity too). Both roles require a great deal of &lt;strong&gt;empathy&lt;/strong&gt; with developer problems, and an expansive mental catalog of ways to solve them. Yet the ultimate relevance of either to the outside world matters only to the extent of a typical build-vs-buy decision. Don’t get too hung up on precise definitions in an inherently fuzzy and still-moving field.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When we first discussed this interview, you asked if your experience would be interesting to folks focused on infrastructure engineering. I’ve increasingly come to believe that Developer Experience is a core competency for all folks developing infrastructure software or working on infrastructure projects like &lt;a href="https://lethain.com/migrations/"&gt;large-scale migrations&lt;/a&gt;. Should infrastructure teams consider Developer Experience as a core engineering competency? Any ideas why they often don’t?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It’s funny, even though I do DX at a company that &lt;em&gt;serves&lt;/em&gt; Infra engineers now (and Temporal enables Infra engineers to offer a dramatically better developer experience to product teams by providing “reliability on rails”), I had never ever viewed it as something Infra engineers themselves should regard as core. For sure, the Derisk-Enable-Finish cycle in that article on Migrations leans on many of the same skills as DX teams - advocacy, docs, tooling. But I’m loath to recommend that it should be “core” in all contexts, because (as we discussed earlier) DX is so broad and hard to define, and I’m always skeptical of people hawking their pet topic as mandatory. A bloated definition of “core” defeats the purpose of defining a “core”.&lt;/p&gt;
&lt;p&gt;What I will say is that I think most Infra Engineers could do with more developer &lt;strong&gt;empathy&lt;/strong&gt;, which in most situations simply means putting themselves in the shoes of people with less context and knowledge than them and proactively helping them out by any means necessary. If you do it right, then yes, the developer experience of your users will be better because you took the effort, but it should be done not for altruistic “let’s make them happy” reasons, but rather, selfish ones: your efforts will be more successful if they feel more successful.&lt;/p&gt;
&lt;p&gt;Why don’t more infra teams invest in Developer Experience? Honestly, probably because there’s no cultural expectation for them to. It&amp;rsquo;s common for infrastructure teams to get consumed by the loudest issues surrounding them like incidents and infrastructure costs such that they end up much more focused on their obligations to computers than their obligations to other engineers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are the top three tools or techniques that you use in Developer Experience that infrastructure engineering teams should consider adopting?&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Journey Mapping&lt;/strong&gt;: exhaustively enumerate every concept, system, or API capability your user should know (thereby letting people know what they don’t know). Pick 2 main axes of concern and map them out in 2D space - clustering related concepts together. Draw a small core of “must know” concepts where everyone should start (letting people know what they don&amp;rsquo;t need to know). Identify and highlight FAQs. Then let them find their way based on their needs. This contrasts with a “one size fits all” linear path. (see &lt;a href="https://twitter.com/swyx/status/1455699258531729411"&gt;example&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pitch Sizing&lt;/strong&gt;: Be prepared to explain/define your system in one sentence (pique interest), one paragraph (by desired requirements or by pitching the problem), a 10 minute presentation, or a 30 minute demo. Logical/technical arguments are best supplemented with &lt;a href="https://cxl.com/blog/cialdinis-principles-persuasion/"&gt;Cialdini persuasion principles&lt;/a&gt;. Practice this when you don’t yet need it (e.g. at internal demo day/lunch) because you will be called upon to do it at the most unexpected times for the most high leverage reasons.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Two phase commit&lt;/strong&gt;: Knowledge is transferred as both &lt;a href="https://www.swyx.io/particle-wave-duality"&gt;discrete particles and continuous waves&lt;/a&gt;. Concretely, some of your users will want a monolithic organized reference, and others will just want diffs. One example rule that implements a “two phase commit” of knowledge - Every feature update should be communicated via a changelog and a doc/wiki update (and, for more impactful updates, a tweet, slack message, blogpost, talk&amp;hellip;).&lt;/li&gt;
&lt;li&gt;(Bonus) &lt;strong&gt;Events&lt;/strong&gt;: Learning to throw events that people look forward to and enjoy participating in is a huge multiplier on existing DX efforts. (see &lt;a href="https://www.swyx.io/community-heat"&gt;Community Annealing&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Infrastructure engineering organizations have a lot of priorities. A few years ago &lt;a href="https://lethain.com/infrastructure-between-cost-center-and-before-ego-trip/"&gt;I tried to define an overarching set of infrastructure priorities&lt;/a&gt; and came up with: security, reliability, usability, leverage, cost and latency. I imagine this is at least equally true for DX teams, how do you figure out what to work on given the wide range you &lt;em&gt;could&lt;/em&gt; prioritize?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I think about DX work in terms of concentric circles radiating out from the core product, matching the maturity of the product:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When the product is still being shaped, there is no better time to give feedback on API design.&lt;/li&gt;
&lt;li&gt;As the product approaches fully baked, I shift my attention to Docs.&lt;/li&gt;
&lt;li&gt;After shipping the product with a complete set of docs, I shift to Content (Advocacy) to get users and to spell out and elaborate whatever doesn’t tonally or structurally fit in docs.&lt;/li&gt;
&lt;li&gt;Users come for the content, and stay for the Community, so I start investing in getting to know them, helping them in their adoption, and helping them find, build for, and hire each other.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On and on pushing outward when we can, but looping back inward whenever a new feature or product is launched or a new problem is found.&lt;/p&gt;
&lt;p&gt;All of these efforts should be coordinated with the same &lt;a href="https://twitter.com/swyx/status/1455699258531729411"&gt;&amp;ldquo;map&amp;rdquo;&lt;/a&gt; I described above - shared terminology, shared understanding of core concepts, and a shared reality of neighborhoods and landmarks. However, they are not equal in all contexts, because inner circles tend to have higher long term impact (the best docs are the docs I don&amp;rsquo;t have to read because the product teaches me as I go, the best blogposts are the blogposts I don&amp;rsquo;t have to look for because the documentation was good enough, etc.), but outer circles have more reach.&lt;/p&gt;
&lt;p&gt;What I&amp;rsquo;ve described is from my experience in my sweet spot at early stage, Series A-C devtool startups, where each program is usually a singleton, but there are advanced versions of this at the larger companies too:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Every SDK can always have more languages and devtooling.&lt;/li&gt;
&lt;li&gt;Every conference can be replicated across the major continents.&lt;/li&gt;
&lt;li&gt;Every docs effort eventually morphs into a “University” or a certification/education program.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At AWS scale, we also layered the DX circles with language, geographical, and business vertical dimensions. If you wanted a Chinese speaking Telcos specialist in Australia, we had someone for that.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Folks working on infrastructure engineering often have a specific dashboard they look at every morning to get a sense of how the software, system, and organization is operating. Do you have a similar dashboard? What’s on it?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We use a mix of internal BI tools for lagging indicators (active clusters, SDK version adoption) and &lt;a href="https://www.commonroom.io/"&gt;Common Room&lt;/a&gt; for leading indicators (open source and community activity). As long as everything is trending up on a trailing 2-3 month basis I’m not too concerned about checking it every day. Considering that it takes &amp;gt;10 touches for the average person to go from first contact to seriously interested, the natural frequency of consideration cycles makes for extremely long feedback loops.&lt;/p&gt;
&lt;p&gt;This is further confounded by the extremely non-ergodic nature of the open source enterprise customer, where one large customer can be worth 5 orders of magnitude more than the median, and take anywhere from a month to two years to convert to a customer.&lt;/p&gt;
&lt;p&gt;Most DX metrics are better regarded as a health check that things aren’t broken, rather than proof positive that things are actually working well. If you need more specifics, I&amp;rsquo;ve received very positive feedback on my piece on &lt;a href="https://www.swyx.io/measuring-devrel"&gt;Measuring Developer Relations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OK, I’m going to start turning the conversation towards Temporal for a bit. In every infrastructure team I’ve worked on there’s a team focused on supporting services that offer an API, but it’s often only much later that there’s any support for workflows (by which I mean scheduled, periodic or event-driven tasks) outside of something like &lt;a href="https://airflow.apache.org/"&gt;Airflow&lt;/a&gt; for batch processing. How did Temporal decide to focus on a workflow engine?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There was no decision so much as it was a lifelong obsession borne out of decades of distributed systems experience at scale, and solving the same problems over and over again. My basic insight is that everyone converges on &lt;a href="https://www.swyx.io/why-temporal/#30-second-pitch"&gt;the same requirements&lt;/a&gt; for reliability, observability, and scalability in their systems, but the tools we have are too low level, so everyone handrolls (poorly) their own distributed system out of these tools. Eventually, large-enough companies build their own workflow engines to slow the wheel-reinvention.&lt;/p&gt;
&lt;p&gt;Our cofounders had been working on various iterations of messaging services and workflow engines for the prior ~20 years, at AWS, Google, Microsoft, and finally Uber. They created Temporal’s precursor at Uber, which became a full-time job as the number of applications using it ballooned to 300 in 3 years. This work was open sourced and similar growth was seen at HashiCorp, Coinbase, Airbnb, DoorDash, etc. Finally, demand for a hosted solution was so strong, and the Uber-specific tech debt was so high, that they forked the project to start Temporal. So at every step of the journey the market demand drove the next phase of adoption, rather than any one decision.&lt;/p&gt;
&lt;p&gt;Temporal is at once a 2 year old startup and a 20 year old team in this sense; and having that much big tech and open source validation gave us a lot of conviction that the industry is hugely underappreciating the use cases of a workflow engine beyond simple scheduled jobs. There&amp;rsquo;s a bunch of hypey hyperbole thrown around: &amp;ldquo;distributed system in a box&amp;rdquo;, &amp;ldquo;reliability on rails&amp;rdquo;, &amp;ldquo;distributed application state platform&amp;rdquo;, &amp;ldquo;a new computing primitive&amp;rdquo;, &amp;ldquo;service mesh for long running operations&amp;rdquo; - all of which are true depending on your point of view.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In listening to &lt;a href="https://temporal.io/stripe"&gt;Stripe’s talk on Temporal&lt;/a&gt; and &lt;a href="https://temporal.io/netflix"&gt;Netflix’s talk on Temporal&lt;/a&gt;, both mention writing their own SDK wrapper on Temporal’s SDK. Is it a good or bad sign when your users routinely wrap your SDK?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It’s easy to map out the pros and cons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It’s good in that it validates that we solve a hard enough problem that people wrap us rather than build us… for now. And it gets us users that closed source SaaS and inextensible “No Code” platforms would not.&lt;/li&gt;
&lt;li&gt;It’s bad in that it means our users have a built in facade that makes it easier for them to move off us in the future&lt;/li&gt;
&lt;li&gt;It’s good in that both Stripe and Netflix talk about their wrappers solving company specific problems and providing good defaults for their intended users, that we can later absorb into core once validated enough in “userland”&lt;/li&gt;
&lt;li&gt;It’s bad in that we don’t do some things for them out of the box… yet.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ultimately I view “being wrapped” as a natural, net positive outcome of any valuable enough devtool. The best thinking I’ve come across on this is Kevin Kwok’s view of &lt;a href="https://kwokchain.com/2021/02/05/atomic-concepts/"&gt;platforms vs their ecosystems&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://infraeng.dev/interviews/kevin-kwok-platforms.png" alt="Kevin Kwok&amp;rsquo;s diagram on platform ecosystems"&gt;&lt;/p&gt;
&lt;p&gt;Use cases that are high impact and generally useful should be solved by us, whereas use cases that are lower impact and very specific should be solved by wrappers. We would look to our growing ecosystem to help solve high impact, highly niche use cases, and investing in an open source community directly contributes towards this long term advantage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Since Stripe moved to generating SDKs from &lt;a href="https://github.com/stripe/openapi"&gt;their OpenAPI spec&lt;/a&gt;, I’ve started to suspect that SDKs are better interfaces to expose to users than APIs themselves. Are SDKs better user-facing interfaces than APIs?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is very close to my heart! The simple response is that yes, if you can afford to, offering an SDK (or a CLI, by the way) generally provides a better developer experience than just the raw API. The basic argument is that if you &lt;em&gt;don’t&lt;/em&gt; provide an SDK, the developer will eventually have to build one for themselves for anything of sufficient complexity. There are a number of problems that can only be solved at the SDK level, including providing more specific types or type inference, inline documentation/autocomplete, and mocking out the API for testing.&lt;/p&gt;
&lt;p&gt;However a poorly implemented SDK can also introduce an extra layer of potential bugs and performance issues, constrain advanced users, cause uncertainty about exposed classes and data structures, and add complexity to versioning/upgrades. In scenarios like these, being able to “drop down” a layer to the underlying API is crucial and the platform should not actively obstruct that.&lt;/p&gt;
&lt;p&gt;One should also distinguish between “Fat” and “Thin” SDKs. “Thin” SDKs are simple, 1:1 language wrappers over APIs, the kind that can be generated from OpenAPI. “Fat” SDKs do more, often managing state (e.g. AWS’s AppSync SDK creates a local replica of your DynamoDB backed database, and handles offline sync and merge conflicts), or allowing plugins, or as Temporal’s SDK does, offering a deterministic sandbox which can replay events through your code for failure recovery and durable async functions.&lt;/p&gt;
&lt;p&gt;In short, the opportunities for “Fat” SDKs to improve developer experience well beyond simple RESTful APIs are greater, at the cost of more engineering (and docs, and support…) to maintain them. Tradeoffs everywhere!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In &lt;a href="https://www.swyx.io/self-provisioning-runtime/"&gt;The Self Provisioning Runtime&lt;/a&gt;, you modify Alfred North Whitehead’s quote saying, “Developer Experience advances by extending the number of important problems our code handles without thinking of them.” That quote gets at the long-term promise of cloud providers, which are slowly making important problems invisible for many users, e.g. my experience is that general awareness of networking is significantly lower than it was a decade ago, which I attribute largely to cloud adoption. In some ways I see Temporal as competing with cloud providers&amp;rsquo; own workflow engines. How do you think about competing with the cloud?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I’m excited by it. We certainly have to take it seriously, because Temporal is MIT-licensed, and there is nothing stopping Amazon or Azure from hosting us as a service tomorrow. But at this point &lt;em&gt;dozens&lt;/em&gt; of open source companies have faced that threat and survived - by relicensing, and by serving their customers better than the big clouds can. On one hand, this is intimidating, because Amazon theoretically has infinitely more resources to crush us. On the other - I’ve worked at Amazon and seen how hard it is to push through the absolute mountain of conflicting priorities and legacy tech to get anything done compared to tiny startup teams with a fraction of our funding.&lt;/p&gt;
&lt;p&gt;This is why topics like developer experience are so important - there are so many more dimensions to building a successful developer infra business than just the commodity operation of software - but I am actually most excited about outcompeting the big clouds by better product strategy and better network effects as those are sustainable and compoundable wins.&lt;/p&gt;
&lt;p&gt;I can’t be too specific here but consider how Snowflake made an independent case for itself by being the “Data Cloud”, how Cloudflare is doing the same for the &lt;a href="https://www.swyx.io/cloudflare-go/"&gt;decentralized cloud&lt;/a&gt;, and how Stripe did so by being the payment rails for ecommerce. All are justifiable market positions that the big clouds will not/can not tackle given their current strategies. Temporal happens to occupy a very nice space:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;managing lightweight application state, not egress-heavy data&lt;/li&gt;
&lt;li&gt;having a small well defined contract with every mission-critical microservice in your company and others’, and&lt;/li&gt;
&lt;li&gt;being generally agnostic as to whether &lt;a href="https://www.swyx.io/api-economy/"&gt;humans or machines&lt;/a&gt; complete tasks in a given workflow.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I think every startup that competes with big clouds (read: every ambitious Infra startup) will need to carve out a space on which they are the undisputed independent source of truth, at least until &lt;a href="https://www.swyx.io/temporal-centicorn/"&gt;the $100 billion valuation stage&lt;/a&gt; when the metagame changes once more.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What&amp;rsquo;s the single most impactful project you’ve heard of an infrastructure engineering team working on? Why? Was it obviously impactful beforehand? Stripe’s &lt;a href="https://sorbet.org/"&gt;Sorbet&lt;/a&gt; is an example of a discrete project that I found surprisingly impactful.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Probably the sharding system that became Vitess at YouTube. YouTube is quite simply the biggest social video platform on Earth today, but it faced &lt;a href="https://www.alexanderjarvis.com/the-confidential-youtube-investment-memo-by-sequoia-you-were-never-meant-to-see/"&gt;a horde of well funded competitors&lt;/a&gt; in 2005-2010. YouTube was experiencing &lt;a href="https://about.sourcegraph.com/podcast/sugu-sougoumarane/"&gt;2 outages a day&lt;/a&gt; due to the extreme load, and could have gone the way of Friendster if those performance issues continued. &lt;strong&gt;No Vitess, no YouTube.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Vitess made MySQL scalable for YouTube, then its open sourcing helped Hubspot, Slack, Pinterest, GitHub, Square and more. If database infra counts, then I’d be hard pressed to think of a more impactful project.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I keep a list of resources for &lt;a href="https://codingcareer.circle.so/c/devtools/"&gt;DevTools&lt;/a&gt; and &lt;a href="https://codingcareer.circle.so/c/dev-communities/"&gt;Dev Rel/Dev Community&lt;/a&gt; in my own space because the list of resources is long for such a young field. Special shoutouts to &lt;a href="https://about.sourcegraph.com/blog/ex-googler-guide-dev-tools/"&gt;Beyang’s Guide to Devtools&lt;/a&gt; and &lt;a href="https://devrelresourc.es/"&gt;Mary Thengvall’s Devrel Resources&lt;/a&gt;, and to Kelsey Hightower for getting me started &lt;a href="https://www.swyx.io/learn-in-public/"&gt;Learning in Public&lt;/a&gt; and going down the Developer Advocate career path. Scott Hanselman is also a huge mentor to me, being &lt;a href="https://www.learninpublic.org/"&gt;an early reviewer of my book&lt;/a&gt; and with his inclusivity and ability to make anything in the Microsoft ecosystem interesting, and ability to cross over into newer platforms like Tiktok!&lt;/p&gt;</description></item><item><title>Developer-Experience</title><link>https://infraeng.dev/categories/developer-experience/</link><pubDate>Mon, 11 Apr 2022 08:00:00 -0700</pubDate><guid>https://infraeng.dev/categories/developer-experience/</guid><description/></item><item><title>Smruti Patel</title><link>https://infraeng.dev/smruti-patel/</link><pubDate>Sun, 10 Apr 2022 07:00:00 -0700</pubDate><guid>https://infraeng.dev/smruti-patel/</guid><description>&lt;p&gt;&lt;em&gt;Written interview in early February, 2022. Learn more about Smruti on &lt;a href="https://www.linkedin.com/in/smrutirp/"&gt;linkedin&lt;/a&gt;&lt;/em&gt;, &lt;em&gt;&lt;a href="https://twitter.com/smrutirp"&gt;twitter&lt;/a&gt; and on &lt;a href="https://leaddev.com/community/smruti-patel"&gt;LeadDev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I currently lead the Data Platform group at Stripe – we operate the centralized data lake, and the bigdata, async &amp;amp; stream processing infrastructure for Stripe’s mission-critical business, while ensuring security, reliability and efficiency. Essentially, supporting Stripe’s core money movement &amp;amp; storage, financial reporting &amp;amp; analytics products for our merchants, and empowering ML infra to build credit, fraud &amp;amp; risk intelligence.&lt;/p&gt;
&lt;p&gt;Prior to my current role, I also led the LEAP organization, which stands for Latency, Efficiency, Access &amp;amp; Attribution and Performance - my vision here was to take those small steps needed to unlock the giant leaps for both our engineering organization internally, and our users using Stripe. To enable that, we developed cross-functional strategies and tools for optimizing our cloud spend, and lowering Stripe&amp;rsquo;s end-to-end latency through performance tuning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What dashboards or metrics do you &lt;em&gt;personally&lt;/em&gt; use to stay aware of your organization’s work? How often do you check these?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I am one of those personality types, who is facts-oriented, analytical, and leverages data to draw patterns and drive decisions. So yes, metrics are my jam!&lt;/p&gt;
&lt;p&gt;I’ve been leading teams for over a decade now. In this period, I’ve learnt that engineers don’t lack motivation. They are here to do their best work. Intrinsic motivation talks about a healthy balance of autonomy, competence and purpose. Let&amp;rsquo;s assume you have solved the hiring problem. You’ve built an inclusive team of highly skilled engineers with the right domain expertise. We’ll also assume that your management and leadership practices lean toward a healthy culture. A culture which provides the right blend of growth mindset, radical candor and psychological safety for individuals to thrive. So competence and autonomy are more or less solved, but how do we as leaders, then address purpose, the northstar, the why?&lt;/p&gt;
&lt;p&gt;That’s where it’s important to think about opportunity cost! We have finite resources, and doing X implies not doing Y. For any software-driven company, our engineering talent, and their productivity, efficiency and impact, are our highest leverage. Hitting the right product-market fit can be extremely time sensitive. The opportunity cost, therefore, of going down a potentially wrong path, can be significantly high.&lt;/p&gt;
&lt;p&gt;And so, you need a high fidelity OODA loop to observe, orient, decide, act and &lt;em&gt;react&lt;/em&gt; to feedback! And that’s where I leverage metrics heavily to measure and &lt;a href="https://leaddev.com/productivity-eng-velocity/debugging-engineering-velocity-and-leading-high-performing-teams"&gt;debug engineering velocity&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Precision - What are you shipping and why?&lt;/li&gt;
&lt;li&gt;Speed - How frequently are you able to ship?&lt;/li&gt;
&lt;li&gt;Quality - What is the failure rate or quality of your software?&lt;/li&gt;
&lt;li&gt;Impact - How well does it achieve business goals?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For LEAP: our impact metrics were around measuring overall cloud spend as a function of business volume, or the tail latency (the p99.9) of the most important ChargePath API.&lt;/p&gt;
&lt;p&gt;For Data Platform: some aspects are easier to measure than others. So here, we have 3 categories - starting from the outer loop of Stripe users, to the inner loop of our direct engineering cohorts and the bridge between the 2 - our executive leadership:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Non-functional requirements to measure strong guarantees of security, reliability, and performance of our systems.&lt;/li&gt;
&lt;li&gt;Functional requirements to democratize access to data to enable rich insights for various cohorts that work with data - data scientists, data engineers, ML engineers or Product engineers. This is generally the hardest to measure!&lt;/li&gt;
&lt;li&gt;The efficacy of operating Stripe’s business through data efficiency, compliance, and rigorous financial accounting.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I personally look at most of these system &amp;amp; business metrics weekly, to determine overall health of our systems and the broader investment within the organization.&lt;/p&gt;
&lt;p&gt;In addition to these, I also look at team health metrics (monthly &amp;amp; quarterly) - like employee engagement, hiring ratios, attrition or transfers, and uplevel readiness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Several of the areas you’ve worked on, especially efficiency (e.g. infrastructure spend) and performance (e.g. CPU utilization and user-facing latency) are areas of distributed accountability. A system’s efficiency is heavily dependent on the individual parts within the system. How do you set goals for areas of distributed accountability? What have you found effective for reducing the challenges of diffused accountability?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I love this question, and especially your reference to &lt;em&gt;&lt;a href="https://www.amazon.com/dp/B005VSRFEA/"&gt;Thinking in Systems&lt;/a&gt;&lt;/em&gt;, a book which blew my mind a decade ago. Here’s how I’ve come to approach these problems.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Frame the problem. The why&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For both Efficiency &amp;amp; user-facing latency, the first thing I did was own the whole problem, from the farm to the table. This was a key, fundamental step because it helped provide a unified direction and sense of purpose for the narrative. I established myself as the accountable and authoritative subject matter expert in framing the problem for the company, through trust and verifiable, clean data.&lt;/p&gt;
&lt;p&gt;I had learned from past experience that accountability without authority was a kitchen sink at best and a dull knife at worst. So I secured executive sponsorship to back this key impactful initiative for Stripe, aligning on the outcomes through charter metrics (eg: overall cloud spend as a function of the business and p99.9 latency of the most popular ChargePath API), and setting expectations on the relative agency of a centralized team in driving those outcomes.&lt;/p&gt;
&lt;p&gt;While this was necessary, it was far from sufficient. And that brings us to identifying the key elements of this system - the movers &amp;amp; the shakers.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Identify the elements. The what / who&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In order to determine whom to hold accountable, we had to invest a few quarters in doing the gardening - creating a &lt;a href="https://en.wikipedia.org/wiki/MECE_principle"&gt;M.E.C.E.&lt;/a&gt; attribution of our total cloud spend, down to the last dollar, to &lt;em&gt;a team&lt;/em&gt;. This required navigating the notion of organizational hierarchies, supporting reorg workflows, re-attributing and backfilling to support error handling. This can feel toilsome, and be the valley of slow death, but here’s where I’d recommend persisting through, because it will pay dividends when done right, and well.&lt;/p&gt;
&lt;p&gt;Once we had attributed every dollar or every time slice, we then used Pareto’s 80-20 rule to focus on the top 5-10 product or platform teams, which provided the highest leverage.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Identify the interconnectedness and the flows. The how&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Drucker said, culture eats strategy for breakfast. And a key aspect to changing behavior, especially when accountability is diffused, is to motivate culture. And culture is nothing but the behaviors the system incentivizes or disincentivizes.&lt;/p&gt;
&lt;p&gt;We saw that the Hadoop platform team allocated its resources to teams through statically assigned queues, which led to local fragmentation and lower overall system utilization when those jobs weren’t running. We needed the platform team to implement elasticity and the job runners to release resources – but both needed to be made aware, and then incentivized to prioritize this work.&lt;/p&gt;
&lt;p&gt;And here’s where I’ve found it immensely useful to implement and operationalize the &lt;a href="https://www.bi.team/wp-content/uploads/2015/07/BIT-Publication-EAST_FA_WEB.pdf"&gt;E.A.S.T. framework for behavioral economics&lt;/a&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;E&lt;/strong&gt;asy to self-serve costs through attribution, rich cost observability tooling and automated customized Nudges providing insights and recommendations on ways to meet their goals&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A&lt;/strong&gt;ttractive to incentivize and reward Efficiency efforts by tracking wins, providing badges or company-equivalent means of public recognition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S&lt;/strong&gt;ocial by driving ownership and accountability through cohort analysis, leaderboards, public Ops Reviews, and&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T&lt;/strong&gt;imely by introducing LEAP in Eng101 onboarding classes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Lastly, there are systems where carrots work better, and others where the stick does. Depending on the urgency of the problem, some levers to drive the latter are setting explicit budgets (eg: cloud spend or headcount, spend budgets for org size of 25+), ensuring that teams have the right company level prioritization for related work, enforcing capacity governance processes or ring-fencing engineering bandwidth to drive centralized optimization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Are there any processes or forums (like a “quarterly business review” or whatnot) that you’ve found valuable for &lt;a href="https://lethain.com/inspection/"&gt;inspecting execution&lt;/a&gt; within your team or across the many teams that share some responsibility for performance and efficiency?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In addition to diffused accountability, the other biggest challenge with inspecting execution for areas of performance and efficiency is realized impact.&lt;/p&gt;
&lt;p&gt;Let’s say, a data team decides to build a resource request portal, to automate away the static allocation and under-utilization of its compute resources. They ship this feature and move on to solve other problems. However, a few months in, they don’t see any change in the overall spend on the Hadoop infrastructure.&lt;/p&gt;
&lt;p&gt;Outcomes like this are especially common in the performance and efficiency space, as the evaluation of the problem is based on several hypotheses, and there’s underlying complexity in the causal chain of dependencies. In the above case, we saw that resources were under-utilized and waste was high. We concluded that most waste came from queue fragmentation in statically assigned compute resources; dynamic allocation would thus reduce fragmentation, hence cost savings! If we’d looked deeper at the data, we might have identified that the issue wasn’t so much in the fragmentation, but in the release of unused resources: a similar but different problem, begging for a different engineering solution.&lt;/p&gt;
&lt;p&gt;Given this complexity in diagnosis, I’ve found it extremely useful to establish a contract with relevant teams (or my own): anchor around invariants that need to be true at the end of a certain timeline, or around quantifiable, verifiable metrics. Eg: No product engineering team will miss their p99.9 latency service level agreement for over 48 hours; beyond that, they open an incident and follow due protocol. Or, team X will spend no more than 2% over their monthly allocated spend budget; any variances beyond this will need explicit approval from executive leaders.&lt;/p&gt;
&lt;p&gt;Whether the teams decide to solve problem X or Y, or engineer a solution Foo or Bar, then, is secondary. We shake on the outcomes and invariants, and this &lt;a href="https://leaddev.com/culture-engagement-motivation/fostering-autonomy-and-trust-lead-high-performing-teams"&gt;fosters both agency &amp;amp; autonomy&lt;/a&gt; for the teams to drive results, and also creates owned accountability.&lt;/p&gt;
&lt;p&gt;Speaking of accountability, I am a firm believer in ‘Trust &lt;em&gt;and verify&lt;/em&gt;’. It is crucial then, to create the right ~real-time alerting and feedback loops, to catch early drifts - and I’ve found the weekly Ops reviews to be at the right cadence for these. This is where we want to leverage the exec sponsor for this program, who’ll recognize the right behaviors we want to see amplified, or facilitate deep dives into the incorrect outcomes to dampen their spread.&lt;/p&gt;
&lt;p&gt;Lastly, QBRs are a great way to formally view trends in resource management, and related impact. This is also a great time to strategize and prioritize future investment, in line with the organization’s broader goals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On that same theme, one particular challenge I’ve encountered is the perception that infrastructure efficiency is less important than developer productivity. To the extent that is true, some would argue that it’s illogical to prioritize things like performance and efficiency. How have you dealt with this tension between efficiency (or performance) and developer productivity?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For me, the joy of engineering lies in the solving of constraints, similar to those linear programming problems in Math. Given a system, and some non-functional requirements (eg: availability, reliability, security), how do we seek equilibrium in the system? How do we make the right tradeoffs to sustain that?&lt;/p&gt;
&lt;p&gt;At the macro level, it goes back to the opportunity cost for the business. When does it make sense for the business to invest in efficiency or performance? When a company is in its growth stage, its engineering talent is the highest asset and finding the right product-market fit is its highest priority. At that time, and at that scale, developer productivity is higher leverage than efficiency.&lt;/p&gt;
&lt;p&gt;But as the business matures, and its organization and the engineering systems evolve, the balance shifts. 4YPs and discounted cash flows also start expecting to yield economies of scale, especially given the compounding nature of money. The CFO is likely to assess marginal revenue per net new employee, or overall margin for the business. And for most SaaS companies running infrastructure on the Cloud, their OPEX is the second largest spend.&lt;/p&gt;
&lt;p&gt;At Stripe, I intimately witnessed our burgeoning cloud costs, and thanks to &lt;em&gt;your&lt;/em&gt; foresight in investing early, we were largely successful in bending the curve along multiple dimensions of our overall spend. In order to justify and equip engineering teams with the agency to drive their investment, we laid down a generally applicable decision-making framework to translate engineering time to cost savings. For example: Invest 1 dev-week of effort for $10K/month savings. For our own centralized Efficiency team, we placed a high premium on opportunities worth pursuing, eg: 5X cost savings per IC. These help address some of the tension between investment in dev prod efforts vs those catering to Efficiency.&lt;/p&gt;
&lt;p&gt;However, at the micro level, depending on the problem you’re solving, you could either improve both systems efficiency and developer productivity, or face situations where “going faster” necessitates spending more. Eg: Take CI costs: if we were to improve and fine-tune our selection set of which tests to run, we’d reduce the dev time spent on running tests and reduce CPU hours, thereby being more efficient. But take build times: let’s say throwing 15% more instances at generating builds reduces average build time from 25 mins to 15 mins. Is it worth it? Yes. But at what point is it not? How about when going from 15 mins down to 12 mins?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In &lt;em&gt;Staff Engineer&lt;/em&gt;’s &lt;a href="https://staffeng.com/guides/manage-technical-quality"&gt;Manage Technical Quality&lt;/a&gt;, I argued that folks should focus on pursuing quality through improving hot spots, best practices, and so on. The least recommended solution was &lt;a href="https://lethain.com/programs-owning-the-unownable/"&gt;running an organization program&lt;/a&gt; that requires coordination across the entire engineering organization. This is a point of view that I developed in part during our time working together based on how hard it is to coordinate moving an organization. Do you think I came to the wrong conclusion in recommending folks avoid running organizational programs as much as possible? Any advice to make running organizational programs effective?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I especially love that article on &lt;em&gt;Managing Technical Quality&lt;/em&gt; and I wholeheartedly agree with your assessment!&lt;/p&gt;
&lt;p&gt;I started my career as a Quality Engineer around 2 decades ago, testing key features like distributed resource scheduling and linked clones for VMware’s control plane management solution to manage VMs. It was prior to the DevOps movement, and most enterprise companies ran these through centralized teams. There were several downsides to that model, stemming primarily from misaligned incentives, which arose due to lack of end-to-end ownership in shipping a high quality product to users. The developers were responsible for checking in code, and the QE for identifying defects and performance bottlenecks. This adversarial engagement created tension, as opposed to a joint commitment to delivering value. There was also a downward spiral of brain drain, due to the system perpetuating implicit second-class citizenship, in its hiring, compensation and talent management frameworks.&lt;/p&gt;
&lt;p&gt;Fast forward to recent times, the core tenets of DevSecOps place a high value on end-to-end ownership of engineering – from code deployment to managing maintenance and operations. Systems which embrace this model heavily benefit from your recommendation in the article - which is to focus on the hot spots, drive practices, find leverage points and so on.&lt;/p&gt;
&lt;p&gt;As cliche as it is, it all comes down to people! People are at the heart of every engineering problem, and its solution - be it more engineering, practice, process or program. I am of the firm opinion that people want to do the right thing, but they are optimizing for the constraints they are given. The most expedient way to then drive change is to provide awareness of the problem, align incentives, and give them the time and space to prioritize the fixes. For example, if a business leader is pushing their org to release product features at a breakneck pace, it will lead to technical debt or low code quality.&lt;/p&gt;
&lt;p&gt;Also, running a program has extremely high overhead: sustainable metrics, weekly executive sponsorship and commitment, ongoing program evaluation. A program, its related scoring or goals evaluation, and associated leaderboards also create a sense of foreboding (it is akin to being called into the Principal’s office) and shift the balance from the program owners being medics and dependable consults to cops who must be dealt with.&lt;/p&gt;
&lt;p&gt;But there are times when a technical program is indeed the right solution – factors here range from the scale of the engineering organization (eg: tracking cloud spend for a group of 1000+ vs 200), to bootstrapping baseline shifts in your overall posture (eg: driving least privilege access to all data) or requiring immediate change to uplevel the entire organization simultaneously (eg: compliance needs like GDPR, India data locality).&lt;/p&gt;
&lt;p&gt;I’ve had fair success leading such programs, focusing on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Early (and often) alignment with key stakeholders on defining the goals and soliciting their buy-in.&lt;/li&gt;
&lt;li&gt;Fostering trust and autonomy: trust in the data you leverage to guide ongoing decisions, trust in your intention to meet the mutually beneficial goal, and trust in being an equal, supportive partner throughout the journey. Trust &lt;em&gt;and verify&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Effective communication and tight collaboration: create feedback loops to ensure information flows at the right cadence, at the right zoom factor for the right audience.&lt;/li&gt;
&lt;li&gt;Giving credit liberally; publicly recognizing the good citizens, or the early adopters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;What are some of the most impactful projects or tools that your teams have rolled out to improve performance or efficiency without requiring mass-coordination across many teams?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Efficiency, Performance and to that extent, even Reliability and Security are horizontal programs. For each of these, I’ve found it valuable to establish the right balance of tooling, education &amp;amp; practices to drive organizational behavior and simultaneously land direct improvements by solving real engineering problems. Anchoring on either end of the spectrum disproportionately impacts the end outcome. For example, if you index heavily on laying down patterns and practices for the org to adopt, but don’t build critical infrastructure or land impact by fixing existing systems, it erodes trust and credibility. If you are making point fixes, and landing impact a system at a time, you’re likely not evolving fast enough in a rapidly scaling company.&lt;/p&gt;
&lt;p&gt;Keeping that balance in mind, and similar to macro-economic cyclicality, I developed our Efficiency strategy around 3 dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pay Less (optimize procurement),&lt;/li&gt;
&lt;li&gt;Use Less (optimize utilization) and&lt;/li&gt;
&lt;li&gt;Need Less (optimize performance).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Early on, optimizing procurement was the single biggest lever in reducing our cloud spend. Automating Reserved Instances &amp;amp; Savings Plans purchasing, implementing storage tiering for hot/warm/cold data accesses and centrally leading vendor discount negotiation (in collaboration with F&amp;amp;S) significantly dropped the spend/business volume bps.&lt;/p&gt;
&lt;p&gt;We then focused on the second bucket, use less, by improving utilization. This involved auditing unused or unclaimed resources, automating brownouts for those unaccessed resources, and then releasing them to prevent future spend.&lt;/p&gt;
&lt;p&gt;Similarly, on the latency side, we rolled out an incident-free Ruby GC optimization without needing to coordinate with Product teams. This change dropped the tail from 4.6 seconds down to 2.9 seconds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I’ve often considered Efficiency to be an “obvious spot” to partner with a &lt;a href="https://tpmstories.medium.com/"&gt;Technical Program Manager&lt;/a&gt; (TPM), because it’s such a cross-organizational effort and there’s no finish line: the work just keeps going further. Do you agree? How would you approach involving TPMs in areas like efficiency and performance?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are 3 key pillars to navigate when running an effective Efficiency &amp;amp; Performance program - the engineering, the organization and the Finance &amp;amp; Strategy.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Engineering comprises the centralized team which drives the execution of the strategy, and related projects serving the end outcomes.&lt;/li&gt;
&lt;li&gt;Organization involves the product and other infrastructure teams within engineering, their organizational leaders and the executives leading the business.&lt;/li&gt;
&lt;li&gt;Finance &amp;amp; Strategy leads the overall capital allocation at the macro business level, often reporting into the CFO.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A solid TPM can serve as &lt;em&gt;the&lt;/em&gt; glue and the singular force operationalizing the strategy and seamlessly bridging all 3 pillars:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identifying technical inefficiencies in product &amp;amp; infra engineering and creating the Efficiency portfolio of opportunities. This could involve:
&lt;ol type="a"&gt;
&lt;li&gt;Tracking big scale up, scale down and swings in cloud spend, and enforcing capacity governance processes.&lt;/li&gt;
&lt;li&gt;Tracking platform rate cards (eg: avg cost per vcpu for a Hadoop job) and quantity of resources consumed (eg: #vcore-hours for team X).&lt;/li&gt;
&lt;li&gt;Creating effective feedback loops to bridge the utilization with consumption and budgets.&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;Enablement &amp;amp; education to motivate change bottom up and shift left the culture of efficiency &amp;amp; performance - Facilitate prioritization conversations across various stakeholders and leadership to unlock resourcing for highest leverage work items. Partner with the centralized Efficiency team, Education and other Infrastructure teams to develop best patterns and practices for developing systems and services efficiently.&lt;/li&gt;
&lt;li&gt;Operationalize budget tracking and drive high forecast fidelity by organizing monthly spend budget reviews for:
&lt;ol type="a"&gt;
&lt;li&gt;Identifying the right set of teams and tracking org-wise budgets vs actuals.&lt;/li&gt;
&lt;li&gt;Evaluating engineering plans for new investments.&lt;/li&gt;
&lt;li&gt;Accounting for budget variances due to delayed execution (eg: Team X budgeted $Y for the month of March for a new feature launch, but came in lower because they encountered issues), or overspend (and identifying critical remediations).&lt;/li&gt;
&lt;li&gt;Enabling identification of potential cost saving opportunities.&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Lastly, a TPM is a core partner to the Engineering Manager and F&amp;amp;S, in identifying and unifying KPIs to tune the OODA loop (observe, orient, decide &amp;amp; act), to make macro or micro refinements to the overall strategy.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I am very glad you brought this up! With infrastructure, when something’s going wrong, there’s nowhere to hide. But the key challenge is when nothing is going wrong, how do you know it’s actually going right? So, when I think of infrastructure problems, I think of ‘great power comes with great responsibility’. And here’s why.&lt;/p&gt;
&lt;p&gt;The beauty of Infrastructure is in its whittling down of essential complexity, through simplified abstractions which bring joy to its users. And through that, key leverage for the business.&lt;/p&gt;
&lt;p&gt;Most mid-sized companies looking to scale start investing in infrastructure engineering teams, with typical hiring ratios being 7-8 Product engineers to every infra IC. This makes it imperative that every infra-eng-week of effort be dedicated to high leverage work. Infra problems also take more rigor to solve, and get right, to avoid thrashing the rest of the engineering organization. Imagine building out a cloud compute abstraction, which changed every quarter, and fanned itself out to 20+ product engineering teams doing daily deploys. It’d be a nightmare!&lt;/p&gt;
&lt;p&gt;This combination of complexity, rigor and the expectation of high ROI, makes Infrastructure engineering a very high stakes endeavor :)! Teams which romanticize or idealize the tech, over its customers or business value tend to languish – either due to missing the mark on realized impact, ceased investment due to lost credibility or due to internal employee burnout. And so articulating value, within and without, at every stage of software development is extremely crucial to leading a high performing, value delivering Infra team.&lt;/p&gt;
&lt;p&gt;The 3 tenets I’ve found useful are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Know your customer.&lt;/li&gt;
&lt;li&gt;Bring in a product mindset, whether it involves doing an initial market study (eg: evaluating build vs buy options), customer analysis &amp;amp; segmentation (eg: focus on data scientists over business analysts), or even developing a go-to-market strategy (eg: white-glove migration workshops to facilitate Data Locality needs).&lt;/li&gt;
&lt;li&gt;Measure what matters, not what’s easy to measure (and do the early work to identify what this is!)&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;At the planning stage, drive precision through alignment and prioritization: are you focusing on the right problem? And for whom? Here, you need to be grounded in the &lt;em&gt;why&lt;/em&gt;, before the &lt;em&gt;what&lt;/em&gt;. Do the 5 whys exercise, especially if embarking on multi-half infrastructure investments (eg: migrating a monolith to a SOA). And to seek alignment and early feedback, I’ve found Amazon’s PRFAQ practice quite useful for building trust and credibility with your stakeholders and executive leadership.&lt;/li&gt;
&lt;li&gt;At execution, drive focus, speed &amp;amp; quality, leveraging the SPACE metrics whenever applicable. Be extremely paranoid about scoping the problem just so, go deep before you go broad, and aim for vertical slivers of delivering impact vs all-or-nothing. I recently led the data security strategy for Stripe, and the biggest win we had was in the underlying approach. We identified a data access metric, and pivoted from securing one data system at a time, to incrementally driving value and moving the needle. Depending on the culture of your organization, communicate early, and often, through shipped emails, company all-hands or demos. This is a great avenue to seek feedback with your beta users, confirming the validity of your approach.&lt;/li&gt;
&lt;li&gt;Finally, ensure that you’re maximizing overall impact: Are folks using what you delivered? Are you actually seeing movement toward your northstar metric? This is when we hone the “what” and validate the “why”. Quite recently, we shipped some work expecting to see change, and moved onto solving other problems. Looking back, we realized that we had needed to build adjacencies to the shipped work to actually capitalize on all the effort. Sometimes, you need to evaluate what an additional 5–10% looks like to realize the most impact; this could be a marketing strategy, a small UX improvement, a small optimization (e.g. making load times much faster), or in many cases, better documentation. &lt;em&gt;Take that time. Bring it home.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On March 15, 2013, 1,200 Japanese workers converted the Shibuya Station train line from above ground to underground in just 3 hours, before the first morning train the next day! I have worked on Infrastructure for nearly 2 decades now, and when I think of Infrastructure, I think of this. It is this behind-the-scenes symphony of dedicated, resilient and talented people, working together to keep the masses moving with zero friction or downtime: THIS gives me joy. And pride.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Asking the same question again but from a different perspective: how does working on something like efficiency or performance impact someone’s career, particularly in terms of getting promoted?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There’s the finite game of uplevels and promotions, and the &lt;a href="https://simonsinek.com/product/the-infinite-game"&gt;infinite one&lt;/a&gt; of constant learning, development and evolution.&lt;/p&gt;
&lt;p&gt;For the former, when we think of individual career impact, there are 3 systems at play – individual career aspirations, the engineering ladder and expectations for different levels/roles, and the business need/team opportunity.&lt;/p&gt;
&lt;p&gt;Efficiency/Performance-shaped problems tend to be both broad and deep (eg: Spark tuning for big data computation or improving your Kafka publish tail latency). Navigating such problems calls for highly motivated, proactive problem-solvers who can move with urgency and focus while balancing critical thinking, who are comfortable with data-driven diagnosis, hypotheses and analyses, and who can work cross-organizationally, collaborating across different teams, systems and organizational dynamics. Let’s assume that individuals working on efficiency or performance-shaped problems are inherently motivated and excited about solving such problems.&lt;/p&gt;
&lt;p&gt;That brings us to ensuring that there is indeed a strong business need to be solving these problems. A company, in its initial phases, might not want to invest in efficiency and performance, and rightly so as we discussed earlier. If we’ve secured the need and the buy-in, it comes down to demonstrated value - results, results, results! And once you’ve secured results, identify the narrative for the impact driven - what’s the before/after story? What got better? What gets worse, if left unsolved? Understand and evaluate your organization’s leveling rubric, to assess if the complexity or realized impact are in line with the system’s expectation from someone at that level and role. Eg: At Stripe, we’ve intentionally introduced the “Fixer” archetype for Staff engineers, to create room for, and acknowledge the value of associated impact to the business.&lt;/p&gt;
&lt;p&gt;Some things to keep in mind:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Going back to your earlier bit about diffused accountability, we also need to ensure that the individuals working on these problems are well equipped to navigate this aspect of the system (especially, because ICs tend to avoid situations involving potential conflict).&lt;/li&gt;
&lt;li&gt;If in the discovery phase of evaluating which problems to solve, identify a rubric for effort and impact (eg: 1 eng-quarter for $X million in annualized savings), and stack rank those opportunities to avoid missing the forest for the trees.&lt;/li&gt;
&lt;li&gt;Balance driving value with incoming interrupts, when driving change through the rest of the org – ICs want to code, and solve problems, so leverage partners like the EM and TPM to help ICs get focus time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Lastly, speaking of the long game, and my own experience: I&amp;rsquo;ve developed some key strengths through my journey from quality engineering to leading efficiency/performance programs - the ability to seamlessly operate &amp;amp; diagnose varied distributed systems, strong business communication skills, and the ability to drive influence without authority across cross-functional organizations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You and yours - infraeng is a great resource to hear from other practitioners, operators and builders!&lt;/p&gt;
&lt;p&gt;I love understanding how complex systems at scale fail - for these, I’ve found it extremely valuable to study the &lt;a href="https://aws.amazon.com/premiumsupport/technology/pes/"&gt;AWS Post-Event summaries&lt;/a&gt;! I also routinely look forward to shared content from &lt;a href="https://copyconstruct.medium.com/"&gt;Cindy Sridharan&lt;/a&gt;, &lt;a href="https://www.lizthegrey.com/"&gt;Liz Fong-Jones&lt;/a&gt; and &lt;a href="https://www.lastweekinaws.com/blog/"&gt;Corey Quinn&lt;/a&gt;. I found &lt;a href="https://dataintensive.net/"&gt;Designing Data-Intensive Applications&lt;/a&gt; to be an excellent resource for the basics around dealing with Data.&lt;/p&gt;
&lt;p&gt;In terms of engineering management, I’ve learnt a lot from playbooks, frameworks and resources shared by &lt;a href="https://larahogan.me/"&gt;Lara Hogan&lt;/a&gt;, &lt;a href="https://kimmalonescott.com/"&gt;Kim Scott&lt;/a&gt;, and &lt;a href="https://brenebrown.com/hubs/dare-to-lead/"&gt;Brene Brown&lt;/a&gt;. My favorite book on leadership is &lt;a href="https://www.amazon.com/Too-Had-Dream-Verghese-Kurien/dp/8174364072/ref=sr_1_1?dchild=1&amp;amp;keywords=I+too+had+a+dream+Kurien&amp;amp;qid=1596688010&amp;amp;sr=8-1"&gt;I Too Had a Dream&lt;/a&gt; by Dr. Kurien.&lt;/p&gt;
&lt;p&gt;Engineering Leadership at the higher levels starts getting fuzzier in time and space, and has less structured content. Here, I find myself drawing a lot on Macro &amp;amp; Micro Economics, and leaning on more holistic modeling of the world around us (&lt;a href="https://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557"&gt;Thinking in Systems: A Primer&lt;/a&gt;), and our own interpretation of and response to it (&lt;a href="https://www.amazon.com/Mindset-Psychology-Carol-S-Dweck/dp/0345472322/ref=sr_1_1?crid=MBC5P00HSH9D&amp;amp;dchild=1&amp;amp;keywords=mindset+carol+s.+dweck&amp;amp;qid=1596688204&amp;amp;sprefix=MInds%2Caps%2C223&amp;amp;sr=8-1"&gt;Mindset: The New Psychology of Success&lt;/a&gt;, &lt;a href="https://www.amazon.com/Switch-Change-Things-When-Hard/dp/0385528752/ref=sr_1_1?dchild=1&amp;amp;keywords=switching+is+hard&amp;amp;qid=1596683143&amp;amp;sr=8-1"&gt;Switch: How to Change Things When Change Is Hard&lt;/a&gt;, &lt;a href="https://www.amazon.com/Unlocking-Leadership-Mindtraps-Thrive-Complexity/dp/1503609014/ref=sr_1_5?crid=3FUD3KLN5GHVS&amp;amp;keywords=Executive+Mind+Traps&amp;amp;qid=1646028463&amp;amp;sprefix=executive+mind+traps%2Caps%2C123&amp;amp;sr=8-5"&gt;Unlocking Leadership Mindtraps: How to Thrive in Complexity&lt;/a&gt;).&lt;/p&gt;</description></item><item><title>Contract Negotiation Checklist</title><link>https://infraeng.dev/contract-negotiation-checklist/</link><pubDate>Mon, 04 Apr 2022 07:00:00 -0700</pubDate><guid>https://infraeng.dev/contract-negotiation-checklist/</guid><description>&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.google.com/document/d/1Y6-8JowG3swI0ABdyGZpfxKOtOwa-JfWHtexarrhYvY/edit#"&gt;Fork this template on Google Docs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Negotiating contracts is an important part of managing costs,
but it&amp;rsquo;s also something that you only do infrequently. Particularly in an earlier stage company,
you might only negotiate one large contract a year. It&amp;rsquo;s quite hard to get better at something
that you do so infrequently, but using a checklist is one way to be consistent in your approach,
and to ensure learnings from one negotiation carry over into the next.&lt;/p&gt;
&lt;div class="ba b--light-gray"&gt;
&lt;p&gt;&lt;img src="https://infraeng.dev/tools/contract-negotiation-checklist.png" alt="Contract negotiation checklist"&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p class="tc"&gt;
&lt;em&gt;&lt;a href="https://docs.google.com/document/d/1Y6-8JowG3swI0ABdyGZpfxKOtOwa-JfWHtexarrhYvY/edit#"&gt;Contract Negotiation Checklist&lt;/a&gt;&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s no perfect checklist: you should customize the checklist for your company
based on your process, preferences and the experience level of those who are involved.
If this feels too heavy, then by all means remove some steps.&lt;/p&gt;
&lt;div class="callout ba b--light-gray br4 bg-lightest-blue ph4 pv2"&gt;
&lt;p&gt;&lt;strong&gt;More Readings On Vendor Negotiations&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://getdivvy.com/blog/how-to-negotiate-with-vendors/"&gt;How to negotiate with vendors effectively&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ironcladapp.com/journal/contract-management/negotiating-contracts-with-vendors/"&gt;Negotiating Contracts With Vendors: What to Look For&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lethain.com/renegotiate-first-vendor-contract/"&gt;Renegotiating Your First Vendor Contract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lethain.com/build-vs-buy/"&gt;Build vs Buy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="how-to-use"&gt;How to Use&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.google.com/document/d/1Y6-8JowG3swI0ABdyGZpfxKOtOwa-JfWHtexarrhYvY/edit#"&gt;Fork this template on Google Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow the template&amp;rsquo;s checklist&lt;/li&gt;
&lt;li&gt;Link your template into an internal repository of all negotiations so folks can find it the next time you&amp;rsquo;re negotiating this or related contracts&lt;/li&gt;
&lt;li&gt;Now you&amp;rsquo;re done!&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="tips"&gt;Tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Negotiating contracts is a learned skill, and you&amp;rsquo;re probably not going to be good at it for the first few.
That&amp;rsquo;s ok! Find someone else with more experience to partner with on your first few, even if it&amp;rsquo;s just asking
a more experienced friend at another company to brainstorm with you through the process&lt;/li&gt;
&lt;li&gt;Whenever possible, negotiate knowing how much other companies are paying the vendor you&amp;rsquo;re talking to.
This creates a clear price ceiling to negotiate towards&lt;/li&gt;
&lt;li&gt;If the vendor hasn&amp;rsquo;t hit their quota and is approaching the end of their financial year or quarter,
you can almost always make significant progress on terms if you&amp;rsquo;re willing to move quickly&lt;/li&gt;
&lt;li&gt;Everyone should know their role in each negotiation. Sometimes your role is being the difficult, inflexible person!
Sometimes your role is being the person who thinks they&amp;rsquo;re the final decider but is in fact mistaken when your
manager &amp;ldquo;takes over&amp;rdquo; later &amp;ldquo;much to your chagrin&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Not all contracts are equal. Sometimes the absolute number of dollars isn&amp;rsquo;t high enough to follow a time consuming process.
On the other hand, sometimes the numbers are genuinely massive and are worth pulling in even your most senior leadership
to get the best possible deal&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Efficiency: Managing Infrastructure Costs</title><link>https://infraeng.dev/efficiency/</link><pubDate>Sun, 03 Apr 2022 07:00:00 -0700</pubDate><guid>https://infraeng.dev/efficiency/</guid><description>&lt;p&gt;In my early career roles, I worked at companies that never worried about their infrastructure costs at all.
Those costs were simply too low and growing too slowly for the Finance team to pay much attention to them.
This &amp;ldquo;ignore it until it&amp;rsquo;s too large to ignore&amp;rdquo; approach served me well.&lt;/p&gt;
&lt;p&gt;Until it didn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Working at Uber, I was caught off guard when a new Director joined and overnight
infrastructure costs were recategorized from insignificant to requiring urgent, detailed review every month.
Adding the instrumentation and accountability for these costs retroactively was a difficult retrofit.
Although I was surprised that time, I&amp;rsquo;ve come to appreciate that all successful
companies go through the transition from ignoring to setting goals on infrastructure costs,
and an early focus during my time at Stripe was ensuring we were ready ahead of that shift.&lt;/p&gt;
&lt;p&gt;Your job as an infrastructure leader is diagnosing the right mode of operation for your company&amp;rsquo;s infrastructure costs today,
understanding when you&amp;rsquo;re likely to switch modes, and ensuring you&amp;rsquo;ve done the prework to make the
transition relatively painless.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ll explore this topic by digging into:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;three distinct operating modes for infrastructure costs: early-stage, growth, and late-stage&lt;/li&gt;
&lt;li&gt;concrete tools and tactics such as managing infrastructure costs with cloud-specific reductions,
including costs in your &lt;a href="https://infraeng.dev/business-review-template/"&gt;Business Review Template&lt;/a&gt;,
and using a &lt;a href="https://infraeng.dev/contract-negotiation-checklist/"&gt;Contract Negotiation Checklist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;whether you should spin up a dedicated team working in this area&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When you finish reading this, you won&amp;rsquo;t have your entire efficiency plan worked out,
but you will have the high-level pieces, know where you need to dig in, and have a clear
approach to communicate to anyone who has been pushing you for a documented approach around infrastructure costs.&lt;/p&gt;
&lt;div class="callout ba b--light-gray br4 bg-lightest-blue ph4 pv2"&gt;
&lt;p&gt;&lt;strong&gt;Related Interviews&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://infraeng.dev/smruti-patel/"&gt;Smruti Patel: Head of Engineering for L.E.A.P. at Stripe&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="should-you-prioritize-infrastructure-costs"&gt;Should you prioritize infrastructure costs?&lt;/h2&gt;
&lt;p&gt;Before diving into the mechanics of managing infrastructure costs,
the first question to answer is whether it&amp;rsquo;s a valuable use of organizational time to make your current infrastructure spend more efficient.
How you think about this will vary a bit depending on whether your company is early-stage, prioritizing growth, or focused on profitability
in late-stage.&lt;/p&gt;
&lt;h3 id="early-stage"&gt;Early-Stage&lt;/h3&gt;
&lt;p&gt;Generally speaking, very early-stage companies shouldn&amp;rsquo;t spend much time thinking about infrastructure costs.
You should instead be focused on finding product-market fit for your first product.&lt;/p&gt;
&lt;p&gt;Here are two checks you can run to determine if it&amp;rsquo;s worth reducing your infrastructure costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If reducing your infrastructure costs to $0 still wouldn&amp;rsquo;t increase your runway by at least two months,
then it&amp;rsquo;s not worth focusing on&lt;/li&gt;
&lt;li&gt;If you&amp;rsquo;re spending less than $2,000/month per employee on infrastructure costs, then it&amp;rsquo;s probably not a significant priority
because your headcount spend will be so much higher&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If neither check says infrastructure costs are worth your attention, then keep on ignoring infrastructure spend.
If you are exceeding one, and infrastructure costs are a significant part of your overall burn, then invest a sprint into reducing spend,
and then go back to ignoring it once both checks pass again.&lt;/p&gt;
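&lt;p&gt;As a rough illustration, the two checks above can be written as a few lines of Python. The function and parameter names are hypothetical, the thresholds (two months of runway, $2,000/month per employee) come from the text, and the runway math assumes a constant monthly burn that already includes infrastructure spend, which is a simplification.&lt;/p&gt;

```python
def runway_months_gained(cash: float, monthly_burn: float, monthly_infra: float) -> float:
    """Extra months of runway if infrastructure spend dropped to $0.

    Assumes a constant monthly burn that already includes monthly_infra.
    """
    if monthly_infra >= monthly_burn:
        # Infrastructure is the entire burn, so removing it removes the runway limit.
        return float("inf")
    return cash / (monthly_burn - monthly_infra) - cash / monthly_burn


def early_stage_checks(cash: float, monthly_burn: float,
                       monthly_infra: float, employees: int) -> dict:
    """Report which of the two early-stage checks flag infrastructure costs."""
    gained = runway_months_gained(cash, monthly_burn, monthly_infra)
    per_employee = monthly_infra / employees
    return {
        # Check 1: would zeroing infra costs buy at least two months of runway?
        "runway_check_flagged": gained >= 2.0,
        # Check 2: is infra spend at or above $2,000/month per employee?
        "per_employee_check_flagged": per_employee >= 2000.0,
    }
```

&lt;p&gt;If neither entry is flagged, the guidance above is to keep ignoring infrastructure spend; if one is flagged and infrastructure is a meaningful share of burn, spend a sprint on it.&lt;/p&gt;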
&lt;p&gt;The one notable exception is if you&amp;rsquo;re building a low-margin product or product where cost efficiency is a pillar
of your long-term strategy. For example, if you&amp;rsquo;re operating a metrics collection and dashboarding product like
Datadog, then efficiency probably &lt;em&gt;is&lt;/em&gt; worth considering earlier than usual.&lt;/p&gt;
&lt;h3 id="growth"&gt;Growth&lt;/h3&gt;
&lt;p&gt;When you’re prioritizing growth, the primary focus of the engineering organization in a technology company is creating, operating and advancing the products that support the business.
Managing costs is important, but even immaculate cost management won’t make your company a success if enough energy isn’t being invested in your product.&lt;/p&gt;
&lt;p&gt;The fundamental question to ask is whether
infrastructure&amp;rsquo;s share of &lt;a href="https://en.wikipedia.org/wiki/Cost_of_goods_sold"&gt;cost of goods sold (COGS)&lt;/a&gt; is increasing as a percentage of revenue.
(The simplest way to think about COGS is all your non-headcount costs, although a slightly better definition would be all costs
to operate your software.)&lt;/p&gt;
&lt;p&gt;&lt;img src="https://infraeng.dev/efficiency/efficiency-growth-abs.png" alt="Chart showing revenue increasing faster than infrastructure costs over time."&gt;&lt;/p&gt;
&lt;p&gt;Start answering this question by plotting revenue and infrastructure costs on a chart to get a sense of how these two numbers are moving.
Although logarithmic scales often generate more confusion than they&amp;rsquo;re worth, in this case it&amp;rsquo;s usually
the only way to see both lines closely enough to understand their slopes within a single chart.
You particularly want to understand if either line has experienced an inflection over the past few quarters.
If costs have started accelerating without corresponding acceleration of revenue, that’s worth digging into.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://infraeng.dev/efficiency/efficiency-growth-ratio.png" alt="Chart showing infrastructure costs as a percentage of revenue decreasing over time."&gt;&lt;/p&gt;
&lt;p&gt;Once you’ve looked at the two lines independently to understand their movement, simplify your first chart into a chart showing infrastructure costs as a percentage of revenue.
This chart hides some detail but is easier to parse for folks further away from the details.
As long as the ratio is going down and your company is focused on growth, then this data should be sufficient to justify your current level of investment into efficiency:
if growth is key, and infrastructure costs are not getting in the way, why should you slow down growth to reduce them?&lt;/p&gt;
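&lt;p&gt;A minimal sketch of the ratio behind that second chart, assuming you already have revenue and infrastructure cost per period (for example, per quarter) as two equal-length lists. The function names are hypothetical, and treating &amp;ldquo;going down&amp;rdquo; as &amp;ldquo;never rises from one period to the next&amp;rdquo; is one possible reading; you may prefer a looser trend test.&lt;/p&gt;

```python
def infra_cost_percent_of_revenue(revenue: list[float], infra_cost: list[float]) -> list[float]:
    """Infrastructure cost as a percentage of revenue, one value per period."""
    if len(revenue) != len(infra_cost):
        raise ValueError("revenue and infra_cost must cover the same periods")
    return [100.0 * cost / rev for rev, cost in zip(revenue, infra_cost)]


def ratio_is_declining(ratios: list[float]) -> bool:
    """True if the ratio never rises from one period to the next."""
    return all(prev >= cur for prev, cur in zip(ratios, ratios[1:]))
```

&lt;p&gt;Plotting the output of the first function over time gives the percentage-of-revenue chart; the second is just a quick way to summarize its direction.&lt;/p&gt;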
&lt;h3 id="late-stage"&gt;Late-Stage&lt;/h3&gt;
&lt;p&gt;Even the best business lines stop growing at some point. Facebook is one of the most valuable businesses
in the world, but even they at some point ran out of new users to attract to their platform.
Once growth slows, a business naturally starts focusing more on costs, including infrastructure spend.&lt;/p&gt;
&lt;p&gt;In those scenarios, the easiest approach is to work with the business to align on
two numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Dollars spent on infrastructure overhead per engineer&lt;/em&gt;: this includes things like development environments, testing tools, and so on.
Determine your starting point by bucketing vendors and non-production infrastructure costs into a chart and plotting them over time
divided by headcount. Pick a reasonable point on that line as your target. Refine it by reaching out to industry peers to get a sense
of how this number compares to theirs (be sure to pick industry peers in companies that are currently focused on profitability, otherwise
their answers won&amp;rsquo;t be very helpful to you)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Infrastructure dollars spent per N product operations served&lt;/em&gt;: anchoring on cost of operating the product.
This will vary a bit depending on your product or business, but it might be &amp;ldquo;$1.00 in infrastructure costs to power every 10,000 searches&amp;rdquo;,
&amp;ldquo;$2.50 for every 10,000 payments processed&amp;rdquo;, or &amp;ldquo;$3.00 for every 10,000 trips scheduled&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In both, the key thing is moving away from anchoring on a percentage of revenue and instead
setting a target against the fundamental operations that you support.
Thinking of costs as a percentage of revenue works well when you&amp;rsquo;re growing, but is too abstract
and hides too many details once you&amp;rsquo;re focused on reducing costs.&lt;/p&gt;
&lt;p&gt;If you find yourself exceeding those targets, then it&amp;rsquo;s time to dive into reducing them.&lt;/p&gt;
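&lt;p&gt;The two late-stage numbers are simple ratios, sketched below under the assumption that you can already attribute dollars to &amp;ldquo;overhead&amp;rdquo; versus &amp;ldquo;serving the product&amp;rdquo;. All names are hypothetical, and any target you compare against is something you would agree on with the business, not something this code derives.&lt;/p&gt;

```python
def overhead_dollars_per_engineer(overhead_dollars: float, engineers: int) -> float:
    """Non-production infrastructure and vendor spend divided by engineering headcount."""
    return overhead_dollars / engineers


def dollars_per_n_operations(infra_dollars: float, operations: int, n: int = 10_000) -> float:
    """Infrastructure dollars spent per n product operations (searches, payments, trips)."""
    return infra_dollars * n / operations


def exceeds_target(actual: float, target: float) -> bool:
    """True when the measured number is above the agreed target."""
    return actual > target
```

&lt;p&gt;For instance, with hypothetical figures of $25,000 spent serving 100,000,000 payments, &lt;code&gt;dollars_per_n_operations(25_000, 100_000_000)&lt;/code&gt; works out to $2.50 per 10,000 payments, matching the shape of the examples above.&lt;/p&gt;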
&lt;h2 id="tools-for-managing-infrastructure-costs"&gt;Tools for Managing Infrastructure Costs&lt;/h2&gt;
&lt;p&gt;What I&amp;rsquo;ll introduce here is the fairly common playbook for managing infrastructure costs.
As you work through these approaches, your goal is to do &lt;em&gt;as few of them as possible&lt;/em&gt; while
meeting your efficiency goals. I&amp;rsquo;ve prefixed a few particularly high return-on-investment
tools with a ⭐; if you&amp;rsquo;re debating where to start, consider starting with them.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;Use your cloud vendor&amp;rsquo;s cost optimization tools&lt;/strong&gt;. Every cloud vendor has a program along the lines of &lt;a href="https://aws.amazon.com/savingsplans/"&gt;AWS Savings Plans&lt;/a&gt;
or &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;AWS Reserved Instances&lt;/a&gt;. These plans allow you to trade
usage or spend commitments for reduced pricing. If you aren&amp;rsquo;t already using these, you can usually
reduce your infrastructure costs by 20-40% in a few weeks of work&lt;/li&gt;
&lt;li&gt;⭐ &lt;strong&gt;Standardize your vendor negotiation process&lt;/strong&gt;. Beyond a core cloud vendor, many companies have five or six additional large vendor contracts for things like
observability, security, or developer productivity. Introducing a structured process for negotiating
and renegotiating, like using a &lt;a href="https://infraeng.dev/contract-negotiation-checklist"&gt;Contract Negotiation Checklist&lt;/a&gt;,
will significantly improve your pricing (as well as visibility into costs)&lt;/li&gt;
&lt;li&gt;⭐ &lt;strong&gt;Run periodic deep dives on cost&lt;/strong&gt;. Until you have a dedicated team actively looking at your infrastructure costs, you can usually identify
significant cost reductions by periodically taking a week to dig into your biggest infrastructure costs and prioritizing low-hanging fruit.
These will usually be accidents, like storing unused data, development environments not getting retired, etc.
The key thing is scoping the opportunity to work that the infrastructure team can take on themselves&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Update your &lt;a href="https://infraeng.dev/tech-spec/"&gt;Tech Spec&lt;/a&gt; template&lt;/strong&gt; to include a section that estimates costs.
Many engineers will be unfamiliar with that process, so make sure the template links to examples of how a few representative
services estimated their costs. A great example will be onerously detailed, including links to the specific queries
and tools to estimate their costs. A template that requires cost estimation without guiding folks through that process will
inevitably trend towards make-work rather than a useful discussion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Find the executive sponsor&lt;/strong&gt; who really cares about infrastructure costs and is willing to push inefficient users to spend less.
This is usually your CTO or your CFO. Without an executive sponsor willing to prioritize this efficiency work,
you&amp;rsquo;ll find progress further down this list difficult. If you can&amp;rsquo;t find a sponsor, that&amp;rsquo;s usually a good sign
that you&amp;rsquo;re already doing enough to prevent infrastructure costs from becoming a top priority&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Find product cost optimizations.&lt;/strong&gt; There will be significant opportunities to reduce costs by changing how your product works,
e.g. improving your data model, changing storage technologies, moving workloads between streaming and batch.
However, product changes have a much wider set of stakeholders, which makes these sorts of improvements harder
to prioritize. Generally, only try to pursue these if there is a massive opportunity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pursue a cloud vendor contract discount.&lt;/strong&gt; Negotiating your cloud contract to include a discount is very doable after you reach a certain level of spend,
or are a sufficiently strategic partner, but before you reach that level of spend it&amp;rsquo;s quite difficult to
get a meaningful discount. Is it worth spending three weeks and making multi-year financial commitments to get a six percent discount
on your cloud spend? Maybe! It depends on your priorities and your confidence in future spending estimates, but it certainly isn&amp;rsquo;t worth it to everyone.
Conversely, at a certain spend level&amp;ndash;think, tens of millions of USD per year&amp;ndash;your discount can get much higher
without requiring any product-level changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set coarse goals on infrastructure costs.&lt;/strong&gt; Partner with your company&amp;rsquo;s Finance team to coarsely attribute costs across teams, then set and monitor goals against those costs.
Fine-grained goals and cost attribution require a deeper investment into tooling, but most companies can split costs across
their production environment, development environments, and data engineering. Once you&amp;rsquo;ve done that split, you can set a goal and assign
that goal to appropriate teams (respectively, something along the lines of infrastructure, developer productivity, and data engineering).
This will provide some visibility and pressure on costs without requiring much attribution prework&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review costs in Business Reviews.&lt;/strong&gt; Once you&amp;rsquo;ve set those coarse goals around infrastructure spend, ensure your company&amp;rsquo;s
&lt;a href="https://infraeng.dev/business-review-template/"&gt;Business Review Template&lt;/a&gt; includes a section on their costs.
If you run &lt;a href="https://infraeng.dev/business-review-meeting/"&gt;Business Review Meetings&lt;/a&gt;, then make sure someone is showing up
to ask questions about costs for teams whose costs are missing their goal or otherwise accelerating&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expand metadata to facilitate fine-grained goals on infrastructure costs.&lt;/strong&gt; Implement an approach to &lt;a href="https://infraeng.dev/ownership-metadata/"&gt;Ownership Metadata&lt;/a&gt; such that you can assign all usage and storage costs
against a specific team. Once you have that ownership metadata maintained, you can go further by generating proactive nudges
to teams on following best practices, prioritizing high costs, and helping them identify accelerating spend early&lt;/li&gt;
&lt;/ul&gt;
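&lt;p&gt;To make the coarse-goals step concrete, here is a minimal sketch, assuming your billing export can be reduced to line items that each carry a cost and a coarse bucket label. The field names &lt;code&gt;bucket&lt;/code&gt; and &lt;code&gt;cost&lt;/code&gt;, the &lt;code&gt;BUCKET_OWNERS&lt;/code&gt; mapping, and both functions are hypothetical rather than any vendor&amp;rsquo;s API.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical mapping from coarse cost bucket to the team that owns its goal.
BUCKET_OWNERS = {
    "production": "infrastructure",
    "development": "developer productivity",
    "data": "data engineering",
}


def attribute_costs(line_items):
    """Sum line items into coarse buckets; unlabeled items go to 'unattributed'."""
    totals = defaultdict(float)
    for item in line_items:
        bucket = item.get("bucket")
        if bucket not in BUCKET_OWNERS:
            bucket = "unattributed"
        totals[bucket] += item["cost"]
    return dict(totals)


def goals_missed(totals, goals):
    """Return {owning team: (spend, goal)} for buckets whose spend exceeds the goal."""
    return {
        BUCKET_OWNERS[bucket]: (spend, goals[bucket])
        for bucket, spend in totals.items()
        if bucket in goals and spend > goals[bucket]
    }
```

&lt;p&gt;Keeping an explicit &amp;ldquo;unattributed&amp;rdquo; bucket shows how much spend the coarse split fails to cover, which is one signal for whether investing in finer-grained attribution is worthwhile.&lt;/p&gt;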
&lt;p&gt;If doing all of these sounds overwhelming, it should!
Few companies do all of these, and those that do either operate
in a business that is unusually margin sensitive or are spending many millions a year on
their infrastructure costs.&lt;/p&gt;
&lt;h2 id="should-you-have-a-dedicated-efficiency-team"&gt;Should You Have a Dedicated Efficiency Team?&lt;/h2&gt;
&lt;p&gt;Generally, the way I think through spinning out any given area into a dedicated team
is described in &lt;a href="https://infraeng.dev/trunk-and-branches/"&gt;Trunk and Branches Model&lt;/a&gt;, and that
applies to efficiency as well.
That said, let me add a few caveats to that general approach as it applies here.&lt;/p&gt;
&lt;p&gt;Much like &lt;a href="https://staffeng.com/guides/manage-technical-quality"&gt;managing technical quality&lt;/a&gt;,
efficiency is an area where you can make significant progress with one-off initiatives.
Improving how you use AWS Reserved Instances or renegotiating your vendor contracts can
reduce your spend by 30-40% in a week or two. Product-level improvements to your architecture
can reduce your spend even more, although they&amp;rsquo;ll probably take a bit longer.&lt;/p&gt;
&lt;p&gt;Because you can make significant progress through one-off initiatives, the default is to wait
until late into a company&amp;rsquo;s growth to spin out a dedicated team, and in most cases that&amp;rsquo;s the
right decision.&lt;/p&gt;
&lt;p&gt;The three factors to consider as you think through whether postponing a dedicated team is the
best solution for you are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Is infrastructure efficiency a fundamental strategic pillar for your business?&lt;/li&gt;
&lt;li&gt;Are your infrastructure costs today, in absolute terms, at least 10x the cost of a team working to reduce them?&lt;/li&gt;
&lt;li&gt;For the past year have you had pressure to reduce costs but an inability to prioritize the work because
other critical work continues to displace efficiency efforts?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you answer yes to any of those, then you may want to spin out a team earlier than the &lt;em&gt;Trunk and Branches Model&lt;/em&gt; suggests.
As you start sourcing candidates, it&amp;rsquo;ll become apparent that this is a bit of a custom role with folks who specifically enjoy
working on the problem. Recruiting one or two folks with significant preexisting experience will save you years!&lt;/p&gt;</description></item><item><title>Strategies</title><link>https://infraeng.dev/posts/strategies/</link><pubDate>Sun, 03 Apr 2022 07:00:00 -0700</pubDate><guid>https://infraeng.dev/posts/strategies/</guid><description/></item><item><title>Business Review Template</title><link>https://infraeng.dev/business-review-template/</link><pubDate>Wed, 30 Mar 2022 07:00:00 -0700</pubDate><guid>https://infraeng.dev/business-review-template/</guid><description>&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.google.com/document/d/12kqcGYQzkHpY884viKGsh3zeBioYYlMeFJQYx-vFibE/edit"&gt;Fork this template on Google Docs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As your company gets larger and more complex, it&amp;rsquo;s easy to become embroiled
in supporting incoming asks from other teams. That&amp;rsquo;s important work, but it&amp;rsquo;s
also important that your team is operating effectively and prioritizing &lt;em&gt;your&lt;/em&gt; goals
in addition to the goals of other teams making requests.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re getting mixed signals on whether your team is doing the right work,
the &lt;strong&gt;Business Review Template&lt;/strong&gt; can help cut through the confusion.
This written document facilitates an operational review of your team,
and even more importantly creates an opportunity for you, your team, and your stakeholders
to discuss if you&amp;rsquo;re focused on the right work.&lt;/p&gt;
&lt;div class="ba b--light-gray"&gt;
&lt;p&gt;&lt;img src="https://infraeng.dev/tools/business-review-template.png" alt="Business Review Template"&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p class="tc"&gt;
&lt;em&gt;&lt;a href="https://docs.google.com/document/d/12kqcGYQzkHpY884viKGsh3zeBioYYlMeFJQYx-vFibE/edit"&gt;Example using the Business Review Template&lt;/a&gt;&lt;/em&gt;
&lt;/p&gt;
&lt;p&gt;Most companies wind up using a variation of this template by the time they
reach a thousand employees, with some starting much earlier. Even if there&amp;rsquo;s
no structured business review process, it&amp;rsquo;s helpful to start writing them
periodically for the area you&amp;rsquo;re responsible for: think of them as your
area&amp;rsquo;s performance review.&lt;/p&gt;
&lt;div class="callout ba b--light-gray br4 bg-lightest-blue ph4 pv2"&gt;
&lt;p&gt;&lt;strong&gt;Related Meetings&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://infraeng.dev/business-review-meeting/"&gt;Business Review Meeting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Other Approaches to Business Reviews&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Kool-Aid Factory&amp;rsquo;s &lt;a href="https://koolaidfactory.com/zines/shipping-great-work/"&gt;The Shipping Great Work Issue&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;h2 id="how-to-use"&gt;How to Use&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.google.com/document/d/12kqcGYQzkHpY884viKGsh3zeBioYYlMeFJQYx-vFibE/edit"&gt;Fork this template on Google Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Find examples of previous business reviews at your company, and if possible ask the authors what was and wasn&amp;rsquo;t
well received in their most recent review&lt;/li&gt;
&lt;li&gt;Fill in the template for your team&amp;rsquo;s area&lt;/li&gt;
&lt;li&gt;Iterate on your draft with feedback from your team and manager&lt;/li&gt;
&lt;li&gt;Identify the key groups you want feedback from, and create copies for each of those groups.
Transparency is important, but transparency too early often mutes the direct feedback that helps you succeed.
Give these groups a week or so to provide feedback, including running a &lt;a href="https://infraeng.dev/business-review-meeting/"&gt;Business Review Meeting&lt;/a&gt;
if that&amp;rsquo;s something your company finds valuable&lt;/li&gt;
&lt;li&gt;Widely publish a clean, readable copy into wherever business reviews are collected,
and let anyone who hasn&amp;rsquo;t gotten a chance to see it so far know where to find it and how to share feedback on it&lt;/li&gt;
&lt;li&gt;Now you&amp;rsquo;re done! (At least until the next one.)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="tips"&gt;Tips&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Writing an effective business review depends first and foremost on understanding the audience you&amp;rsquo;re writing for,
and what that audience cares about. If you&amp;rsquo;re not sure about the answer to either of those, ask!&lt;/li&gt;
&lt;li&gt;Many companies and many teams try to use their business review to solve too many different problems.
Your business review should focus on answering only two questions: how well is your area of the business operating?
What do you need to do for it to operate better?&lt;/li&gt;
&lt;li&gt;Good business reviews are focused on what the reviewers need from the review.
Bad business reviews are comprehensive, capturing everything that someone on the team wants reviewers to know&lt;/li&gt;
&lt;li&gt;Every metric you include in a business review should be a &lt;a href="https://lethain.com/goals-and-baselines/"&gt;well-formed metric&lt;/a&gt;
that includes the current value, the goal, and the trend over time&lt;/li&gt;
&lt;li&gt;Avoid delegating the writing of your business review to multiple different folks.
Short documents with disjoint authors are hard reads&lt;/li&gt;
&lt;li&gt;&lt;a href="https://networkcapital.substack.com/p/the-amazon-way-of-writing"&gt;The Amazon Way of Writing&lt;/a&gt; is a helpful
set of rules for writing these sorts of documents&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>