You’re Measuring Software Engineering Productivity Wrong

    By Matt Watson · CEO of Full Scale, 4x Founder, Author of Product Driven

    Almost every popular way of measuring software engineering productivity is BS. Story points, lines of code, velocity, commits per day, time to merge: they all measure how busy your team looks, not whether any of the work actually mattered to a customer.

    And here’s the thing every engineering leader learns the hard way: whatever you measure, you get more of. So ask what your customers actually want. Do they want more story points, or more lines of code? Of course not. They want the product to solve their problem. Start measuring something that’s actually valuable to them.

    I know all this because I ran engineering for years on exactly those vanity numbers. We hit our targets every sprint, we shipped constantly, and the dashboards stayed green. I told myself that meant the team was productive. Meanwhile the product wasn’t getting meaningfully better for the people paying us, and I couldn’t see it, because I was watching the wrong scoreboard.

    It took me a long time, and eventually a team scattered across the planet, to figure out what to watch instead. You can hit every metric on the board and still ship nothing your customers needed.

    This is a post about why the usual numbers lie, and the small set that don’t.

    Most software engineering productivity metrics measure activity, not value

    I’ve tried just about every developer productivity metric there is. Lines of code, commits, pull request counts, story points, velocity, defect rate, time to merge. Some of them are interesting. Most of them are easy to game, and the ones that aren’t easy to game still don’t apply to every developer or every kind of work.

    Here’s the rundown of the usual suspects and how each one falls apart:

    • Lines of code. Measures typing, not thinking. A strong engineer often solves a problem by writing less code, or deleting some.
    • Commits and pull requests. Counts how often someone saves their work. Easy to inflate by chopping one change into ten.
    • Story points and velocity. These are the team’s own guesses about its own work. Reward higher velocity and teams quietly inflate their estimates. You’ll see the number go up while nothing ships faster.
    • Hours and seat time. Measures attendance. The person typing all day might be the one creating the mess everyone else cleans up later.

    The real danger is in what these numbers quietly train your team to do. Measure lines of code and you’ll get more lines of code. You won’t get better software, just a bigger pile of technical debt to maintain.

    Story points were never meant to be a report card

    Story points are a planning tool. They help a team talk about how big a piece of work feels before they start it. That’s useful. The trouble starts when a number invented to help a team plan gets promoted into a number used to judge the team. The estimate stops being honest the moment someone’s performance depends on it.

    This is also why “10x developer” arguments tend to fall apart. On her episode of the Product Driven podcast, Laura Tacho, CTO of DX, pointed out that the famous “10x” gap actually traces back to a study showing roughly 11x variance between whole organizations, not between individuals. Inside a single team, the spread is much smaller. Most of the difference comes from the environment people work in, not some superhuman coder. If you’re grading individuals, you’re measuring the wrong unit.

    The whole industry already learned this the hard way

    If this sounds like a fringe opinion, it isn’t. The serious people in this field figured it out, and there was a very public fight about it.

    In 2023, McKinsey published a framework for measuring developer productivity. The response from engineering leaders was brutal. Gergely Orosz and Kent Beck wrote a detailed takedown arguing the approach measured activity and risked pushing teams toward exactly the behaviors that make software worse. The argument wasn’t “you can never measure anything.” It was that measuring the wrong layer, and then handing managers a tidy individual score, does real damage.

    That’s the part worth holding onto. Good measurement frameworks exist. DORA, which came out of the Accelerate research by Nicole Forsgren and her team, and the SPACE framework are both genuinely useful. They go wrong when a leader takes a system-level signal and uses it as a stick to rank one engineer against another. The framework didn’t fail. The way it got used did.

    Running an offshore team forced me to measure the right way

    Here’s where I got my own education, and it wasn’t by choice.

    Over the years I’ve hired developers in Russia, Uruguay, Colombia, and the Philippines. When your team is on the other side of the planet, asleep while you’re awake, a lot of the comfortable old habits stop working. You can’t walk the floor and see who looks busy. You can’t reward the person who stays late, because you’re not there to see it. The packed standup where everyone sounds productive doesn’t happen on your clock.

    So what’s left? The only thing you can actually see is what shipped and whether it worked.

    When you can’t watch people work, you measure what they ship instead, and that turns out to be the right way to manage everyone. Offshore didn’t give me a worse view of productivity. It stripped away the theater and left me with the real thing. The rest of the industry, with its DORA dashboards and its arguments about McKinsey, is slowly arriving at what working across time zones forced me to learn early.

    It’s the honest answer to the question I get most at Full Scale from leaders weighing offshore software development or staff augmentation: “How will I know they’re being productive if I can’t see them?” You’ll know the same way you should already know with your in-house team. By what they deliver, not by how busy they look.

    Done isn’t when the code ships. It’s when the customer sees value

    That line is the operating definition I wish someone had handed me twenty years ago. For a long time I didn’t run things that way. At Stackify I had a solid roadmap and we released constantly, and the product still didn’t land the way I expected. The reason was simple: we weren’t stopping to ask whether each thing we built mattered to the customer. We measured the shipping. We forgot to measure the point of shipping.

    I wrote about this at length in my book, Product Driven. One of the quietest ways a team loses the plot is through what it chooses to measure. The more you measure output, the more your team optimizes for motion, and motion feels great right up until you notice nothing is improving for the people who pay you.

    What to measure instead: a north star, not a scoreboard

    So if lines of code and velocity are out, what goes in? I think about it as three layers, worst to best.

    | What you’re tempted to measure | What it actually tells you | Measure this instead |
    | --- | --- | --- |
    | Lines of code, commits | How much was typed | Did working software ship |
    | Story points, velocity | How the team estimates itself | Whether delivery is getting faster and safer |
    | Hours, seat time | Who was present | Whether the customer’s number moved |
    | Tickets closed | How much got processed | Whether the right problems got solved |

    The bottom two layers, activity and delivery health, are means to an end. The top layer, the customer outcome, is the end. Most teams obsess over the first and never define the third.

    Pick a north star metric that’s the customer’s success, not yours

    A north star metric is the one number that captures whether you’re creating real value for your customer, and it pulls product, engineering, support, and leadership in the same direction. The best companies already run on them. Spotify built its whole business around time spent listening rather than raw signups. Airbnb cares how many nights actually get booked. And the number LinkedIn chases is whether people engage with content, not how many accounts exist.

    The one I lived was VinSolutions. We sold software to car dealerships, and we didn’t grade our engineering team on features shipped or story points burned. We measured two things: how fast a dealership responded to its sales leads, and how many cars it sold. Every feature we built was judged against whether it moved those numbers. That was the whole job.

    The most productive engineering team isn’t the one that ships the most. It’s the one whose work moves the number your customer actually cares about.

    There’s a simple test from the book for finding yours. Ask the team: what does success look like for the user, and can we measure it? If you can’t answer that, no dashboard is going to save you.

    DORA metrics are a floor, not the finish line

    I don’t want to throw out delivery metrics. DORA’s four keys (deployment frequency, lead time for changes, change failure rate, and time to restore service) are a real upgrade over counting commits. They measure the health of your delivery system, and a team that deploys often and recovers fast is usually a team in good shape.

    But be honest about what they are. Deployment frequency is still output. It tells you the machine is running smoothly, not that the machine is building the right thing. Use DORA as guardrails and as a way to spot when something’s going sideways. Just don’t mistake a fast, reliable pipeline for a team that’s actually moving your customer’s number. And never turn the four keys into individual scorecards, because that’s the exact move that turns a good framework toxic.
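
    To make the four keys concrete, here is a minimal sketch of how they could be computed for a whole team over a reporting window. The Deployment record and its field names are hypothetical, not something DORA or any particular tool prescribes, and using medians is just one reasonable choice. Note there is deliberately no author field: this is a system-level summary, not a per-person score.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; adapt to whatever your deploy tooling exports.
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deployment], window_days: int) -> dict:
    """Team-level DORA four keys over a reporting window."""
    if not deploys:
        raise ValueError("no deployments in window")
    failures = [d for d in deploys if d.failed]
    # Only failures that have already been resolved contribute a restore time.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_week": len(deploys) / (window_days / 7),
        "median_lead_time": median(d.deployed_at - d.commit_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

    Read the result as a guardrail. If change failure rate climbs while deployment frequency holds steady, something in the system is going sideways, and that is a prompt for a conversation, not a grade.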

    A quality KPI I actually use: the deployment pullback ratio

    Here’s a homegrown metric I like, because it ties speed back to whether the work held up. I call it the deployment pullback ratio: how much work you shipped in a release versus how much time you then spent fixing what that release broke.

    The story that stuck this in my head came from an engineering leader at AMC Theatres, which is based in my hometown of Kansas City. He told me, “Every Monday we do a production release. On Tuesday we do two hotfixes, on Wednesday one, and if we’re lucky, none on Thursday.” That’s a normal week for a lot of teams. And it reframes the whole productivity question, because a release that needs three days of hotfixes wasn’t productive no matter what your velocity chart said. Tracking the pullback, alongside the defect escape rate, tells you something a line-of-code count never will: whether the work you shipped was actually done.
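
    One way to put numbers on this, as a rough sketch. The description above doesn’t pin down units, so this assumes both sides of the pullback ratio are tracked in engineer-hours, and it assumes defect escape rate means the share of known defects that were found in production. The Release record and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Release:
    # Hypothetical fields; this sketch assumes effort is tracked in engineer-hours.
    name: str
    build_hours: float        # effort that went into the release
    hotfix_hours: float       # effort spent afterwards fixing what it broke
    defects_pre_release: int  # defects caught before release (QA, review)
    defects_in_prod: int      # defects that escaped to production


def pullback_ratio(r: Release) -> float:
    """Rework hours per hour of shipped work; 0.0 means the release held up."""
    if r.build_hours <= 0:
        raise ValueError("build_hours must be positive")
    return r.hotfix_hours / r.build_hours


def defect_escape_rate(r: Release) -> float:
    """Share of all known defects that were found in production."""
    total = r.defects_pre_release + r.defects_in_prod
    return r.defects_in_prod / total if total else 0.0


monday = Release("monday-release", build_hours=200, hotfix_hours=50,
                 defects_pre_release=9, defects_in_prod=3)
print(f"{monday.name}: pullback={pullback_ratio(monday):.2f}, "
      f"escape rate={defect_escape_rate(monday):.2f}")
```

    In this made-up example, a quarter of the release’s effort came back as hotfix work. Tracked release over release at the team level, the trend matters more than any single value.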

    How to roll this out without breaking trust

    Switching to outcome metrics is the right move, but it comes with real tradeoffs, and pretending otherwise will burn your team.

    Outcome metrics are harder to game, which is the point. They’re also slower to read and harder to attribute to one person’s work, which is the cost. A customer number can move for reasons that have nothing to do with engineering, so you can’t run a performance review off it the way managers wish they could.

    The bigger risk is using any of this as a weapon. On his Product Driven episode, Gleb Braverman, who founded HackerPulse after talking to hundreds of engineering leaders, made the point that most problems on engineering teams are people problems, not technical ones, and that metrics like DORA should inform how you lead, not dictate it. That matches what Laura Tacho’s research at DX shows too: developers lose around a fifth of their week to organizational friction, the dumb stuff the rest of the company creates. Fix the system before you grade the person.

    A short version of how to do this well:

    • Measure delivery health and outcomes at the team and system level, never as an individual leaderboard.
    • Define one north star tied to a customer result, and make sure everyone can name it.
    • Use DORA and pullback as guardrails to catch problems early, not as a report card.
    • Coach individuals through conversations, the way you always should, not through a metric.
    • Spend your energy removing friction, because that’s where most of the lost productivity actually hides.

    If you want to go deeper on the day-to-day side of this, we’ve written separately about how to increase developer productivity and the tools that help you track delivery metrics without adding overhead.

    Frequently asked questions

    What is the best way to measure software engineering productivity?

    Measure outcomes and delivery health at the team level rather than individual activity. The strongest signal is whether working software shipped, held up in production, and moved a customer or business number. Frameworks like DORA and SPACE give you good system-level signals, but pair them with the actual result the work was meant to produce, because that customer outcome is the only thing that proves the work mattered.

    Why are lines of code a bad productivity metric?

    Lines of code measure typing, not value. A great engineer often solves a problem by writing less code or deleting some, so rewarding volume punishes exactly the right instinct. On top of that, any metric you reward gets gamed, so measuring lines of code mostly gets you more lines of code and the extra technical debt that comes with them.

    Are DORA metrics enough to measure developer productivity?

    DORA’s four keys (deployment frequency, lead time, change failure rate, and time to restore) are excellent measures of delivery health, but they describe your system, not a person, and they don’t tell you whether the right thing got built. Use them as guardrails alongside a customer-outcome metric, and never turn them into individual scorecards, because that pushes teams to optimize the dashboard instead of the product.

    How do you measure the productivity of a remote or offshore engineering team?

    You measure it the same way you should measure any team, except distance forces your hand. Since you can’t watch people work, you judge what they ship and whether it worked. Define “done” as the customer seeing value, track delivery health and defects, and look at whether the outcome moved, rather than counting hours or watching who looks busy.

    The point was never to keep your engineers busy

    Busy is easy to fake, and most of our favorite metrics are very good at faking it. The point was never to keep your engineers busy. It was to ship the things your customers needed, and that’s the only productivity worth measuring.

    If you’re trying to build an engineering team that’s measured on outcomes instead of activity, whether that’s your in-house group or a distributed team across time zones, let’s talk about what that looks like.
