Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.
I currently lead the Data Platform group at Stripe – we operate the centralized data lake, and the big data, async, and stream processing infrastructure for Stripe’s mission-critical business, while ensuring security, reliability and efficiency. Essentially, we support Stripe’s core money movement & storage, the financial reporting & analytics products for our merchants, and we empower ML infra to build credit, fraud & risk intelligence.
Prior to my current role, I led the LEAP organization, which stands for Latency, Efficiency, Access & Attribution, and Performance. My vision here was to take the small steps needed to unlock giant leaps, both for our engineering organization internally and for our users on Stripe. To enable that, we developed cross-functional strategies and tools for optimizing our cloud spend and lowering Stripe’s end-to-end latency through performance tuning.
What dashboards or metrics do you personally use to stay aware of your organization’s work? How often do you check these?
I am one of those personality types who is facts-oriented and analytical, and leverages data to draw patterns and drive decisions. So yes, metrics are my jam!
I’ve been leading teams for over a decade now. In this period, I’ve learnt that engineers don’t lack motivation. They are here to do their best work. Intrinsic motivation rests on a healthy balance of autonomy, competence and purpose. Let’s assume you have solved the hiring problem. You’ve built an inclusive team of highly skilled engineers with the right domain expertise. We’ll also assume that your management and leadership practices lean toward a healthy culture - one which provides the right blend of growth mindset, radical candor and psychological safety for individuals to thrive. So competence and autonomy are more or less solved, but how do we as leaders then address purpose, the northstar, the why?
That’s where it’s important to think about opportunity cost! We have finite resources, and doing X implies not doing Y. For any software-driven company, our engineering talent - its productivity, efficiency and impact - is our highest leverage. Hitting the right product-market fit can be extremely time sensitive. The opportunity cost of going down a potentially wrong path, therefore, can be significantly high.
And so, you need a high fidelity OODA loop to observe, orient, decide, act and react to feedback! And that’s where I leverage metrics heavily to measure and debug engineering velocity:
Precision - What are you shipping and why?
Speed - How frequently are you able to ship?
Quality - What is the failure rate or quality of your software?
Impact - How well does it achieve business goals?
For LEAP: our impact metrics were around measuring overall cloud spend as a function of business volume, or the tail latency - the p99.9 - of the most important ChargePath API.
For Data Platform: some aspects are easier to measure than others. So here, we have 3 categories, starting from the outer loop of Stripe users, to the inner loop of our direct engineering cohorts, and the bridge between the two - our executive leadership:
Non-functional requirements to measure strong guarantees of security, reliability, and performance of our systems.
Functional requirements to democratize access to data to enable rich insights for various cohorts that work with data - data scientists, data engineers, ML engineers or Product engineers. This is generally the hardest to measure!
The efficacy of operating Stripe’s business through data efficiency, compliance, and rigorous financial accounting.
I personally look at most of these system & business metrics weekly, to determine overall health of our systems and the broader investment within the organization.
In addition to these, I also look at team health metrics (monthly & quarterly) - like employee engagement, hiring ratios, attrition or transfers, #uplevel readiness.
Several of the areas you’ve worked on, especially efficiency (e.g. infrastructure spend) and performance (e.g. CPU utilization and user-facing latency) are areas of distributed accountability. A system’s efficiency is heavily dependent on the individual parts within the system. How do you set goals for areas of distributed accountability? What have you found effective for reducing the challenges of diffused accountability?
I love this question, and especially your reference to Thinking in Systems, a book which blew my mind a decade ago. Here’s how I’ve come to approach these problems.
Frame the problem. The why
For both efficiency & user-facing latency, the first thing I did was own the whole problem, from farm to table. The reason this was a key, fundamental step is that it provided a unified direction and sense of purpose for the narrative. I established myself as the accountable and authoritative subject matter expert in framing the problem for the company, through trust and verifiable, clean data.
I had learned from past experience that accountability without authority was a kitchen sink at best and a dull knife at its worst. So I secured executive sponsorship to back this key impactful initiative for Stripe, aligning on the outcomes through charter metrics (eg: overall cloud spend as a function of the business and p99.9 latency of the most popular ChargePath API), and setting expectations on the relative agency of a centralized team in driving those outcomes.
While this was necessary, it was far from sufficient. And that brings us to identifying the key elements of this system - the movers & the shakers.
Identify the elements. The what / who
In order to determine whom to hold accountable, we had to invest a few quarters in doing the gardening - creating a M.E.C.E. (mutually exclusive, collectively exhaustive) attribution of our total cloud spend, down to the last dollar, to a single team. This required navigating the notion of organizational hierarchies, supporting reorg workflows, and re-attributing and backfilling to support error handling. This can feel toilsome, and be the valley of slow death – but here’s where I’d recommend persisting, because it will pay dividends when done right, and well.
Once we had attributed every dollar and every time slice, we then used Pareto’s 80-20 rule to focus on the top 5-10 product or platform teams, which provided the highest leverage.
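The attribution-then-Pareto step above can be sketched as a simple roll-up: map every line item to exactly one team (the M.E.C.E. property), then keep the smallest set of teams covering roughly 80% of total spend. Team names and dollar figures below are hypothetical:

```python
from collections import defaultdict

def top_teams_by_spend(line_items, coverage=0.80):
    """line_items: iterable of (team, dollars), where each dollar is
    attributed to exactly one team. Returns the smallest list of teams
    (highest spend first) covering `coverage` of total spend."""
    spend = defaultdict(float)
    for team, dollars in line_items:
        spend[team] += dollars
    total = sum(spend.values())
    picked, running = [], 0.0
    for team, dollars in sorted(spend.items(), key=lambda kv: -kv[1]):
        picked.append(team)
        running += dollars
        if running >= coverage * total:
            break
    return picked

items = [("hadoop", 500_000), ("ml-infra", 300_000),
         ("payments", 120_000), ("tooling", 50_000), ("docs", 30_000)]
focus = top_teams_by_spend(items)  # highest-leverage teams first
```

With these made-up numbers, two of five teams cover 80% of spend - which is exactly why the toilsome attribution work pays off: it tells you where not to spend your persuasion budget.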
Identify the interconnectedness and the flows. The how
Drucker said, “culture eats strategy for breakfast.” And a key aspect of changing behavior, especially when accountability is diffused, is to motivate through culture. And culture is nothing but the behaviors the system incentivizes or disincentivizes.
We saw that the Hadoop platform team allocated its resources to teams through statically assigned queues, which led to local fragmentation and dropped overall system utilization when those jobs weren’t running. We needed the platform team to implement elasticity and the job runners to release resources – but both needed to be made aware, and then incentivized to prioritize this work. So we made it:
Easy to self-serve costs through attribution, rich cost observability tooling and automated customized Nudges providing insights and recommendations on ways to meet their goals
Attractive to incentivize and reward Efficiency efforts by tracking wins, providing badges or company-equivalent means of public recognition
Social by driving ownership and accountability through cohort analysis, leaderboards, public Ops Reviews, and
Timely by introducing LEAP in Eng101 onboarding classes.
Lastly, there are systems where the carrots work better, or the stick. Depending on the urgency of the problem, some levers to drive the latter are setting explicit budgets (eg: cloud spend or headcount, spend budgets for org size of 25+), ensuring that teams have the right company level prioritization for related work, enforcing capacity governance processes or ring-fencing engineering bandwidth to drive centralized optimization.
Are there any processes or forums (like a “quarterly business review” or whatnot) that you’ve found valuable for inspecting execution within your team or across the many teams that share some responsibility for performance and efficiency?
In addition to diffused accountability, the other big challenge with inspecting execution in areas of performance and efficiency is realized impact.
Let’s say a data team decides to build a resource request portal, to automate away the static allocation and under-utilization of its compute resources. They ship this feature and move on to solve other problems. However, a few months in, they don’t see any change in the overall spend on the Hadoop infrastructure.
Such situations are especially common in the performance and efficiency space, as the evaluation of the problem is based on several hypotheses, and there’s underlying complexity in the causal chain of dependencies. In the above case, we see that resources are under-utilized and waste is high. We concluded that most waste occurs from queue fragmentation in statically assigned compute resources; dynamic allocation would thus reduce fragmentation, hence cost savings! If we’d looked deeper at the data, we might have identified that the issue wasn’t so much in the fragmentation, but in the release of unused resources – a similar but different problem, begging for a different engineering solution.
Given this complexity in diagnosis, I’ve found it extremely useful to establish a contract with relevant teams (or my own) – anchor around invariants that need to be true at the end of a certain timeline, or around quantifiable, verifiable metrics. Eg: no product engineering team will miss their p99.9 latency service level agreement for over 48 hours, and beyond that, will open an incident to follow due protocol. Or: team X will spend no more than 2% over their monthly allocated spend budget; any variances beyond this will need explicit approval from executive leaders.
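Invariants like that 2% budget variance lend themselves to mechanical verification, which is what makes the contract enforceable rather than aspirational. A hypothetical sketch of such a check; team names, budgets and the tolerance are all illustrative:

```python
def budget_breaches(actuals, budgets, tolerance=0.02):
    """Return teams whose monthly spend exceeds budget by more than
    `tolerance` (e.g. 2%) - these need explicit executive approval."""
    breaches = {}
    for team, spent in actuals.items():
        budget = budgets[team]
        overage = (spent - budget) / budget
        if overage > tolerance:
            breaches[team] = round(overage, 4)
    return breaches

budgets = {"hadoop": 100_000, "ml-infra": 80_000}
actuals = {"hadoop": 103_500, "ml-infra": 80_500}
# hadoop is 3.5% over (a breach); ml-infra is 0.625% over (within tolerance)
flagged = budget_breaches(actuals, budgets)
```

A check like this, run on every billing cycle, is the kind of real-time feedback loop that catches early drift before a weekly Ops review even convenes.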
Whether the teams decide to solve problem X or Y, or engineer solution Foo or Bar, is then secondary. We shake on the outcomes and invariants - and this fosters both agency & autonomy for the teams to drive results, and also creates owned accountability.
Speaking of accountability, I am a firm believer in ‘trust, and verify’. It is crucial, then, to create the right near-real-time alerting and feedback loops to catch early drifts - and I’ve found weekly Ops reviews to be the right cadence for these. This is where we want to leverage the exec sponsor for the program, who’ll recognize the right behaviors we want to see amplified, or facilitate deep dives into the incorrect outcomes to dampen their spread.
Lastly, QBRs are a great way to formally view trends in resource management, and related impact. This is also a great time to strategize and prioritize future investment, in line with the organization’s broader goals.
On that same theme, one particular challenge I’ve encountered is the perception that infrastructure efficiency is less important than developer productivity. To the extent that is true, some would argue that it’s illogical to prioritize things like performance and efficiency. How have you dealt with this tension between efficiency (or performance) and developer productivity?
For me, the joy of engineering lies in solving constraints, similar to those linear programming problems in math. Given a system and some non-functional requirements (eg: availability, reliability, security), how do we seek equilibrium in the system? How do we make the right tradeoffs to sustain that?
At the macro level, it goes back to the opportunity cost for the business. When does it make sense for the business to invest in efficiency or performance? When a company is in its growth stage, its engineering talent is its highest asset and finding the right product-market fit is its highest priority. At that time, and at that scale, developer productivity is higher leverage than efficiency.
But as the business matures, and its organization and the engineering systems evolve, the balance shifts. 4YPs and discounted cash flows also start expecting to yield economies of scale- especially given the compounded nature of money. The CFO is likely to assess marginal revenue per net new employee, or overall margin for the business. And for most SaaS companies running infrastructure on the Cloud, their OPEX is the second largest spend.
At Stripe, I intimately witnessed our burgeoning cloud costs, and thanks to your foresight in investing early, we were largely successful in bending the curve along multiple dimensions of our overall spend. In order to justify and equip engineering teams with the agency to drive their investment, we laid down a generally applicable decision-making framework to translate engineering time to cost savings. For example: invest 1 dev-week of effort for $10K/month in savings. For our own centralized Efficiency team, we placed a high premium on opportunities worth pursuing: eg: 5X cost savings per IC. These help address some of the tension between investment in dev prod efforts vs those catering to efficiency.
However, at the micro level, depending on the problem you’re solving, you could either improve both system efficiency and developer productivity, or face situations where “going faster” necessitates spending more. Eg: take CI costs: if we were to improve and fine-tune our selection of which tests to run, we’d reduce the dev time spent running tests and reduce CPU hours, thereby being more efficient. But take build times: let’s say throwing 15% more instances at generating builds reduces average build time from 25 mins to 15 mins. Is it worth it? Yes. But at what point is it not - how about when going from 15 mins down to 12 mins?
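That build-time question is ultimately a break-even calculation: the dollar value of dev time recovered versus the marginal instance cost. A toy model, with every figure made up for illustration:

```python
def buildtime_roi(minutes_saved, builds_per_day, devs_waiting,
                  dev_cost_per_hour, extra_instance_cost_per_day):
    """Daily dollar value of dev time recovered, minus the marginal
    instance cost. Positive => the speedup pays for itself."""
    dev_hours_saved = minutes_saved / 60 * builds_per_day * devs_waiting
    return dev_hours_saved * dev_cost_per_hour - extra_instance_cost_per_day

# Hypothetical: 40 builds/day, 1 dev blocked per build, $100/hr loaded
# cost, $600/day of extra instances to buy the speedup.
gain_25_to_15 = buildtime_roi(10, 40, 1, 100, 600)  # 25 -> 15 mins
gain_15_to_12 = buildtime_roi(3, 40, 1, 100, 600)   # 15 -> 12 mins
```

Under these made-up numbers, 25→15 mins pays for itself while 15→12 mins does not: the marginal minutes saved shrink while the marginal instances cost the same, which is where the tradeoff flips.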
In Staff Engineer’s Manage Technical Quality, I argued that folks should focus on pursuing quality through improving hot spots, best practices, and so on. The least recommended solution was running an organization program that requires coordination across the entire engineering organization. This is a point of view that I developed in part during our time working together based on how hard it is to coordinate moving an organization. Do you think I came to the wrong conclusion in recommending folks avoid running organizational programs as much as possible? Any advice to make running organizational programs effective?
I especially love that article on Managing Technical Quality and I wholeheartedly agree on your assessment!
I started my career as a Quality Engineer around 2 decades ago, testing key features like distributed resource scheduling and linked clones for VMware’s control plane management solution to manage VMs. It was prior to the DevOps movement, and most enterprise companies ran these through centralized teams. There were several downsides to that model, stemming primarily from misaligned incentives, which arose due to lack of end-to-end ownership in shipping a high quality product to users. The developers were responsible for checking in code, and the QE for identifying defects and performance bottlenecks. This adversarial engagement created tension, as opposed to a joint commitment to delivering value. There was also a downward spiral of brain drain, due to the system perpetuating implicit second-class citizenship, in its hiring, compensation and talent management frameworks.
Fast forward to recent times, the core tenets of DevSecOps place a high value on end-to-end ownership of engineering – from code deployment to managing maintenance and operations. Systems which embrace this model heavily benefit from your recommendation in the article - which is to focus on the hot spots, drive practices, find leverage points and so on.
As cliche as it is, it all comes down to people! People are at the heart of every engineering problem, and its solution - be it more engineering, practice, process or program. I am of the firm opinion that people want to do the right thing, but they are optimizing for the constraints they are given. The most expedient way to drive change, then, is to provide awareness of the problem, align incentives, and give them the time and space to prioritize the fixes. For example, if a business leader is pushing their org to release product features at a breakneck pace, it will lead to technical debt and low code quality.
Also, running a program has extremely high overhead – sustainable metrics, weekly executive sponsorship and commitment, ongoing program evaluation. A program, its related scoring or goals evaluation, and associated leaderboards also create a sense of foreboding - akin to being called into the Principal’s office – and shift the balance from the program owners being medics and dependable consultants to cops who must be dealt with.
But there are times when a technical program is indeed the right solution – factors here range from the scale of the engineering organization (eg: tracking cloud spend for a group of 1000+ vs 200), to bootstrapping baseline shifts in your overall posture (eg: driving least privilege access to all data) or requiring immediate change to uplevel the entire organization simultaneously (eg: compliance needs like GDPR, India data locality).
I’ve had fair success leading such programs, focusing on:
Early (and often) alignment with key stakeholders on defining the goals and soliciting their buy-in.
Fostering trust and autonomy: trust in the data you leverage to guide ongoing decisions, trust in your intention to meet the mutually beneficial goal, and trust in being an equal, supportive partner throughout the journey. Trust, and verify.
Effective communication and tight collaboration: create feedback loops to ensure information flows at the right cadence, at the right zoom factor for the right audience.
Giving credit liberally; publicly recognizing the good citizens, or the early adopters.
What are some of the most impactful projects or tools that your teams have rolled out to improve performance or efficiency that were impactful without requiring mass-coordination across many teams?
Efficiency, Performance, and to that extent even Reliability and Security, are horizontal programs. For each of these, I’ve found it valuable to establish the right balance of tooling, education & practices to drive organizational behavior, and to simultaneously land direct improvements by solving real engineering problems. Anchoring on either end of that spectrum disproportionately impacts the end outcome. For example, if you index heavily on laying down patterns and practices for the org to adopt, but don’t build critical infrastructure or land impact by fixing existing systems, it erodes trust and credibility. If you are making point fixes, and landing impact one system at a time, you’re likely not evolving fast enough for a rapidly scaling company.
Keeping that balance in mind, and similar to macro-economic cyclicality, I developed our Efficiency strategy around 3 dimensions:
Pay Less (optimize procurement),
Use Less (optimize utilization), and
Need Less (optimize performance).
Early on, optimizing procurement was the single biggest lever in reducing our cloud spend. Automating Reserved Instance & Savings Plan purchasing, implementing storage tiering for hot/warm/cold data access, and centrally leading vendor discount negotiation (in collaboration with F&S) significantly dropped the spend/business-volume bps.
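The procurement lever is, at its core, arithmetic on commitment discounts: cover the steady-state portion of the fleet with reserved or committed pricing. A sketch with illustrative rates (not actual cloud-provider pricing):

```python
def commitment_savings(on_demand_hourly, hours_per_year,
                       committed_fraction, discount):
    """Annual savings from covering `committed_fraction` of steady-state
    usage with reserved/committed pricing at `discount` off on-demand."""
    on_demand_cost = on_demand_hourly * hours_per_year
    return on_demand_cost * committed_fraction * discount

# Hypothetical: a $40/hr steady-state fleet, 70% covered by
# commitments at ~35% off on-demand rates.
annual = commitment_savings(40, 8760, 0.70, 0.35)
```

The automation part matters because the inputs drift: the steady-state fraction changes as workloads move, so the purchasing has to be re-run continuously rather than negotiated once.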
We then focused on the second bucket - Use Less - improving utilization. This involved auditing unused or unclaimed resources, automating brownouts of those unaccessed resources, and then releasing them to prevent future spend.
Similarly, on the latency side, we rolled out an incident-free Ruby GC optimization without needing to coordinate with Product teams. This change dropped tail latency from 4.6 seconds down to 2.9 seconds.
I’ve often considered Efficiency to be an “obvious spot” to partner with a Technical Program Manager (TPM), because it’s such a cross-organizational effort and there’s no finish line: the work just keeps going further. Do you agree? How would you approach involving TPMs in areas like efficiency and performance?
There are 3 key pillars to navigate when running an effective Efficiency & Performance program - the engineering, the organization and the Finance & Strategy.
Engineering comprises the centralized team which drives the execution of the strategy, and related projects serving the end outcomes.
Organization involves the product and other infrastructure teams within engineering, their organizational leaders and the executives leading the business.
Finance & Strategy leads the overall capital allocation at the macro business level, often reporting into the CFO.
A solid TPM can serve as the glue and the singular force operationalizing the strategy and seamlessly bridging all 3 pillars:
Identifying technical inefficiencies in product & infra engineering and creating the Efficiency portfolio of opportunities. This could involve:
a. Tracking big scale-ups, scale-downs and swings in cloud spend, and enforcing capacity governance processes.
b. Tracking platform rate cards (eg: avg cost per vcpu for a Hadoop job) and the quantity of resources consumed (eg: #vcore-hours for team X).
c. Creating effective feedback loops to bridge utilization with consumption and budgets.
Enablement & education to motivate change bottom-up and shift left the culture of efficiency & performance: facilitating prioritization conversations across stakeholders and leadership to unlock resourcing for the highest leverage work items, and partnering with the centralized Efficiency team, Education, and other Infrastructure teams to develop best patterns and practices for building systems and services efficiently.
Operationalizing budget tracking and driving high forecast fidelity by organizing monthly spend budget reviews for:
a. Identifying the right set of teams and tracking org-wise budgets vs actuals.
b. Evaluating engineering plans for new investments.
c. Accounting for budget variances due to delayed execution (eg: team X budgeted $Y for the month of March for a new feature launch, but came in lower because they encountered issues), or overspend (and identifying critical remediations).
d. Enabling identification of potential cost saving opportunities.
Lastly, a TPM is a core partner to the Engineering Manager and F&S, in identifying and unifying KPIs to tune the OODA loop (observe, orient, decide & act), to make macro or micro refinements to the overall strategy.
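The rate cards mentioned above reduce to attributed platform cost divided by consumed units, which then lets the TPM bill consumption back to teams and reconcile it against budgets. A hypothetical sketch, with all figures made up:

```python
def rate_card(platform_monthly_cost, total_vcore_hours):
    """Average cost per vcore-hour for a shared platform (eg: Hadoop)."""
    return platform_monthly_cost / total_vcore_hours

def team_bill(team_vcore_hours, rate):
    """Charge a team back for its consumption at the platform rate."""
    return team_vcore_hours * rate

# Hypothetical: a $420K/month platform serving 6M vcore-hours.
rate = rate_card(420_000, 6_000_000)  # $0.07 per vcore-hour
bill = team_bill(250_000, rate)       # team X's share for the month
```

Publishing the rate card alongside each team's bill is what closes the feedback loop: a team can see whether its spend moved because its consumption changed or because the platform's unit economics did.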
There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?
I am very glad you brought this up! With infrastructure, when something’s going wrong, there’s nowhere to hide. But the key challenge is when nothing is going wrong, how do you know it’s actually going right? So, when I think of infrastructure problems, I think of ‘great power comes with great responsibility’. And here’s why.
The beauty of Infrastructure is in its whittling down of essential complexity, through simplified abstractions which bring joy to its users. And through that, key leverage for the business.
Most mid-sized companies looking to scale, start investing in infrastructure engineering teams; with typical hiring ratios being 7-8 Product engineers to every infra IC. This makes it imperative that every infra-eng-week of effort be dedicated to high leverage work. Infra problems also take more rigor to solve, and get right, to avoid thrashing the rest of the engineering organization. Imagine building out a cloud compute abstraction, which changed every quarter, and fanned itself out to 20+ product engineering teams doing daily deploys. It’d be a nightmare!
This combination of complexity, rigor and the expectation of high ROI makes Infrastructure engineering a very high stakes endeavor :)! Teams which romanticize or idealize the tech over its customers or business value tend to languish – whether due to missing the mark on realized impact, ceased investment from lost credibility, or internal employee burnout. And so articulating value, within and without, at every stage of software development is extremely crucial to leading a high performing, value delivering Infra team.
The 3 tenets I’ve found useful are:
1. Know your customer.
2. Bring in a product mindset, whether it involves doing initial market study (eg: evaluating build vs buy options), customer analysis & segmentation (eg: focus on data scientists over business analysts), or even developing a go-to-market strategy (eg: white-glove migration workshops to facilitate Data Locality needs).
3. Measure what matters, not what’s easy to measure (and do the early work to identify what this is!)
At the planning stage, drive precision through alignment and prioritization — are you focusing on the right problem? And for whom? Here, you need to be grounded in the why, before the what. Do the 5 whys exercise, especially if embarking on multi-half infrastructure investments (eg: migrating a monolith to a SOA). And to seek alignment and early feedback, I’ve found the PRFAQs practice from Amazon, quite useful to build trust and credibility with your stakeholders and executive leadership.
At execution, drive focus, speed & quality, leveraging the SPACE metrics whenever applicable. Be extremely paranoid about scoping the problem just so, go deep before you go broad, and aim for vertical slivers of delivering impact vs all-or-nothing. I recently led the data security strategy for Stripe, and the biggest win we had was in the underlying approach. We identified a data access metric, and pivoted from securing one data system at a time, to incrementally driving value and moving the needle. Depending on the culture of your organization, communicate early, and often, through shipped emails, company all-hands or demos. This is a great avenue to seek feedback with your beta users, confirming the validity of your approach.
Finally, ensure that you’re maximizing overall impact: are folks using what you delivered? Are you actually seeing movement toward your northstar metric? This is when we hone the “what” and validate the “why”. Quite recently, we shipped some work expecting to see change, and moved on to solving other problems. Looking back retrospectively, we realized that we had needed to build adjacencies to the shipped work to actually capitalize on all the effort. Sometimes, you need to evaluate what an additional 5–10% looks like to realize the most impact; this could be a marketing strategy, a small UX improvement, a small optimization (e.g. making load times much faster), or in many cases, better documentation. Take that time. Bring it home.
On March 15, 2013, 1,200 Japanese workers converted the Shibuya Station train line from above ground to underground in just 3 hours, before the first morning train the next day! I have worked on Infrastructure for nearly 2 decades now – when I think of Infrastructure, I think of this. It is this behind-the-scenes symphony of dedicated, resilient and talented people, working together to keep the masses moving with zero friction or downtime - THIS gives me joy. And pride.
Asking the same question again but from a different perspective: how does working on something like efficiency or performance impact someone’s career, particularly in terms of getting promoted?
There’s the finite game of uplevels and promotions, and the infinite one of constant learning, development and evolution.
For the former, when we think of individual career impact, there are 3 systems at play – individual career aspirations, the engineering ladder and expectations for different levels/roles, and the business need/team opportunity.
Efficiency/performance-shaped problems tend to be both broad and deep (eg: Spark tuning for big data computation, or improving your Kafka publish tail latency). Navigating such problems calls for some key traits and competencies: being a highly motivated, proactive problem-solver who can move with urgency and focus while balancing critical thinking; comfort with data-driven diagnosis, hypotheses and analyses; and the ability to work cross-organizationally, collaborating across different teams, systems and organizational dynamics. Let’s assume that individuals working on efficiency or performance-shaped problems are inherently motivated and excited about solving them.
That brings us to ensuring that there is indeed a strong business need to solve these problems. A company in its initial phases might not want to invest in efficiency and performance, and rightly so, as we discussed earlier. If we’ve secured the need and the buy-in, it comes down to demonstrated value - results, results, results! And once you’ve secured results, identify the narrative for the impact driven: what’s the before/after story? What got better? What gets worse if left unsolved? Understand and evaluate your organization’s leveling rubric, to assess whether the complexity or realized impact is in line with the system’s expectations for someone at that level and role. Eg: at Stripe, we’ve intentionally introduced the “Fixer” archetype for Staff engineers, to create room for, and acknowledge the value of, the associated impact to the business.
Some things to keep in mind:
Going back to your earlier bit about diffused accountability, we also need to ensure that the individuals working on these problems are well equipped to navigate this aspect of the system (especially because ICs tend to avoid situations involving potential conflict).
While in the discovery phase of evaluating which problems to solve, identify a rubric for effort and impact (eg: 1 eng-quarter for $X million in annualized savings), and stack-rank those opportunities to avoid missing the forest for the trees.
Balance driving value with incoming interrupts when driving change through the rest of the org – ICs want to code and solve problems, so leverage partners like the EM and TPM to help ICs get focus time.
Lastly, speaking of the long game, and my own experience: through my journey from quality engineering to leading efficiency/performance programs, I’ve developed some key strengths - the ability to seamlessly operate and diagnose varied distributed systems, strong business communication skills, and the ability to drive influence without authority across cross-functional organizations.
What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?
You and yours - infraeng is a great resource to hear from other practitioners, operators and builders!