Software Architecture

We Tripled the Test Suite. Then Everything Else Had to Change.

I set out to raise test coverage across a large monorepo. Five weeks later I'd also rewritten the test pipeline, the coverage gate, the deploy workflows, and half the backend dependency graph. Some of it was forced. Some was learned. None of it was on the ticket.

I started the quarter with one goal: raise test coverage on a backend monorepo that had grown faster than its tests.

I ended the quarter having rewritten the test pipeline, added a coverage gate, redesigned the deploy workflows, and bumped most of the backend’s dependency stack. None of that was on the original ticket. All of it turned out to be necessary.

Looking back, the thing that stuck with me isn’t any individual change. It’s that each one came out of the one before — some forced by it, some learned from it. I didn’t choose to do five projects. I chose to do one, and the other four followed from it in ways I didn’t plan for.

How it started

The monorepo had grown organically for years. Many modules were well-tested. Many were tested ceremonially. Some weren’t tested at all. Coverage wasn’t catastrophic — it was uneven, and the uneven parts were growing. A few utilities, a couple of shared libraries, most of the response-generator code, and a long tail of lambda functions had token smoke tests or none.

Adding a coverage gate to that situation would have been dead on arrival. Whole-tree coverage thresholds punish contributors for historical gaps they didn’t create. Somebody touches one method in a module that’s been at 20% since 2019 and the gate blocks their PR until they bring the whole module up. They didn’t sign up to fix that module. They signed up to fix a bug. That’s how you teach a team to resent CI.

So the plan was: raise the baseline first, then gate. Get the floor high enough that a delta-style gate — one that only measures the lines touched in this PR — won’t flag false positives on modules that are already fine.

With AI-assisted test generation, that was tractable at a scale it hadn’t been before. Hundreds of thousands of lines of new tests, landed across several large, module-sized batches, in a few weeks. By the time the baseline push was done, the suite had roughly tripled — well over twenty thousand tests across more than a hundred subprojects. Huge number, straightforward goal. That’s where I thought the project ended.

It wasn’t.

What broke first: the test pipeline

The morning after the third big test-writing PR merged, CI started to feel wrong. Not broken. Noticeably slower, and slow in a way that compounded.

The old backend test pipeline ran three self-hosted runners per PR. One runner warmed caches and did a bare compile. A second runner pulled the same cache, did the actual test run, and pushed results. A third ran a narrow dependency-injection sanity check in parallel. Each runner paid its own JVM startup, its own zinc warm-up, its own cache pull. On the old suite, this was fine. On a suite three times the size, the cross-runner cache handoff plus the in-process serial test execution inside SBT turned every PR into a fifteen-minute wait.

The fix came in two pieces. First, collapse the pipeline: one runner, one SBT session, compile and test in sequence sharing the same JVM and the same zinc state. No intermediate cache push/pull. Second, turn on SBT’s forked-test parallelism so the runner’s cores actually get used. I wrote about the SBT settings themselves in “when sbt test is secretly single-threaded” — that’s the narrow technical version of this part.
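For reference, the forked-parallelism piece looks roughly like this in build.sbt. This is a sketch, not the repo’s exact settings — the exact ceiling depends on the runner, and the divisor here is just the cores/2 rule of thumb:

```scala
// build.sbt — illustrative sketch, not the repo's exact configuration.

// Fork the test JVM and allow forked suites to run in parallel
Test / fork := true
Test / testForkedParallel := true

// Cap total concurrent work at half the runner's cores, so forked
// JVMs plus the build itself don't oversubscribe the machine
Global / concurrentRestrictions := Seq(
  Tags.limitAll(java.lang.Runtime.getRuntime.availableProcessors / 2)
)
```

Without the restriction, SBT’s default concurrency plus a forked JVM per suite can swamp a single runner — the ceiling is what makes the packing honest.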

Both pieces together: the local suite went about 2.1× faster, the pipeline dropped a runner, and the compile/test handoff tax went to zero. One fewer self-hosted instance per PR, multiplied by how often PRs merge, turned out to matter in the pool.

Somewhere in doing this, I internalized something I didn’t know I was internalizing: one machine, sized honestly, with a concurrency ceiling that matches its cores, will comfortably do the work of a dozen undersized machines. That sentence is obvious written down. It wasn’t obvious before I’d spent a week staring at forked-JVM tuning. A week later, I’d apply it somewhere I hadn’t planned to.

Second-order: the baseline can’t regress

Getting the baseline up was the hard part. Keeping it up was the subtle part. Once you’ve spent weeks writing tests, the last thing you want is for the next PR to quietly land zero tests against fifty new lines of a controller.

The gate had to be per-PR, on the diff, not the whole tree. A different threshold for new files than for modified ones — you can ask more of something being written from scratch than of something being touched in passing. Exemptions for config, migrations, barrel files, type-only declarations, generated code, styles, docs. Without exemptions the gate complains about a package.json bump or a CSS rename, and contributors learn to route around it.

The first threshold I shipped was 70% on new files, 50% on modified. By the end of the next day I’d relaxed it to 50/35. A small UI change can touch many lines of JSX the author isn’t meaningfully rewriting, and the original thresholds were blocking PRs in ways nobody agreed with. A gate that blocks real work doesn’t stay a gate — people find ways around it, the team loses trust in CI, and you end up rebuilding credibility the gate never had the chance to earn in the first place.

I also shipped it as advisory first, not required. Report the delta as a sticky PR comment, show the status check, let everyone see the numbers for a couple of weeks, then flip branch protection. A gate that’s been quietly correct for ten PRs is easier to make required than one that arrives required on day one.
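The per-file decision is small enough to sketch. Everything here — the type, the field names, the exemption list — is illustrative rather than the real gate’s code, but the thresholds are the 50/35 pair that survived:

```scala
// Hypothetical sketch of the per-file delta-gate decision.
object DeltaGate {
  // Coverage of only the lines this PR added or modified in one file.
  final case class FileDelta(path: String, isNew: Boolean, covered: Int, touched: Int)

  // Illustrative exemption patterns: config, styles, docs, type-only files
  private val exempt = List(".json", ".yml", ".css", ".md", ".d.ts")

  // Ask more of files written from scratch than of files touched in passing
  def threshold(d: FileDelta): Double = if (d.isNew) 0.50 else 0.35

  def passes(d: FileDelta): Boolean =
    exempt.exists(suffix => d.path.endsWith(suffix)) ||
      d.touched == 0 ||
      d.covered.toDouble / d.touched >= threshold(d)
}
```

The exemptions do as much work as the thresholds: without them, the gate fires on changes nobody thinks of as “code,” and that is exactly how trust erodes.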

Third-order: I couldn’t un-see it

Once you’ve spent a few days packing forked JVMs onto a single runner — cores/2, Tags.limitAll, xargs -P 4, the whole mental model of “what can I actually fit on one machine if I’m careful about it” — you can’t go back to thinking about work the way you did before.

The test-pipeline changes were supposed to be the end of the parallelism thread. Instead, the day I finished tuning them I opened the deploy workflows, and the shape of what was in front of me stopped looking normal.

The old backend deploy pipeline was a matrix fan-out with roughly a hundred self-hosted runners per deploy — one runner per module to build a single fat-jar, then one runner per lambda function to call the AWS update API. Every runner cold-started its own JVM. Every runner re-pulled the cache. Every runner compiled the shared base modules independently. Nothing about this was obviously broken. It worked. Deploys completed. Nobody was complaining.

But it was the exact opposite shape of the one I’d just spent two weeks proving out on tests. One well-sized machine, tasks packed with a concurrency ceiling, sharing a warm JVM and a warm zinc state, comfortably doing the work of many undersized machines. The deploys were running the many undersized machines because nobody had stopped to ask whether they had to.

The rewrite applies the same model. One SBT session runs all a/assembly b/assembly … instead of fifty-plus runners each doing a single assembly. Shared-base modules compile once and every downstream assembly reuses them. Docker builds and pushes happen on a single runner with xargs -P 4 — I swept from -P 1 through -P 8 and 4 was the clear sweet spot; above that, ECR egress and disk IOPS dominated and throughput fell monotonically. Same principles. Same ceilings. Same concurrency primitives I’d just spent two weeks tuning on tests.
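The single-session shape is mostly an aliasing exercise. A sketch, with hypothetical subproject names standing in for the real modules:

```scala
// build.sbt — subproject names are hypothetical.

// One sbt invocation, one JVM, one warm zinc state: the `all` command
// runs the listed tasks in parallel inside the session, and shared base
// modules compile once for every downstream assembly that needs them.
addCommandAlias("assembleAll", "all core/assembly api/assembly jobs/assembly")

// The same honest ceiling as the test pipeline: cap CPU-tagged tasks
// so parallel assemblies don't oversubscribe the single runner
Global / concurrentRestrictions += Tags.limit(Tags.CPU, 4)
```

The Docker side is the shell equivalent of the same idea — xargs -P 4 is just a concurrency ceiling by another name.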

Headline: peak concurrent runners per deploy dropped by about twenty times. Runner-hours per full deploy dropped by about thirty times.

The framing that matters here is that this rewrite wasn’t driven by a problem. It was driven by having learned something. The deploy pipeline wasn’t on fire. It would have stayed the way it was indefinitely if nothing else had nudged me. What changed wasn’t the system — it was that I now had a model that made the old shape look wrong. That’s a different kind of engineering trigger than “fix what’s broken,” and I think a more honest one.

Fourth-order: the stack itself had to catch up

The parallelism work wanted newer SBT plugins. The deploy rewrite wanted a newer assembly plugin. The coverage instrumentation wanted a newer coverage library, which wanted a newer test framework, which wanted a newer mocking library.

At some point I stopped trying to dodge the upgrade and went through the whole stack. A milestone release of a web framework replaced with its stable version. A mocking library four majors behind bumped to current. A long-EOL database-driver coordinate migrated to the actively-maintained one. A logback version carrying an unpatched CVE replaced. Dozens of test files codemodded over to the new test-framework base traits.

The newer assembly plugin exposed a latent bug where the JVM’s locale defaults on the build container couldn’t handle non-ASCII class names in a merge-report path. The old plugin silently tolerated it. The new one crashed. The fix was setting LANG=C.UTF-8 and a couple of JVM flags at the container level. Caught before production, not after — because without the upgrade the bug was just sitting there waiting.
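The actual fix lived at the container level via LANG=C.UTF-8. The JVM-side equivalent, expressed as a build.sbt fragment, might look like the pair below — these are my assumption of the usual encoding flags, not quoted from the incident:

```scala
// build.sbt sketch — the container-level fix (LANG=C.UTF-8) expressed
// as the equivalent JVM flags; assumed, not quoted from the incident.
ThisBuild / javaOptions ++= Seq(
  "-Dfile.encoding=UTF-8",
  "-Dsun.jnu.encoding=UTF-8" // governs how the JVM decodes file and path names
)
```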

I didn’t want to do this work. It was the boring, unglamorous kind of maintenance that never makes a good PR title. But every single one of these upgrades was a precondition for something earlier in this post. Forked-test parallelism at scale. The parallel-assembly deploy path. The coverage aggregation across a hundred-plus modules. Each of those wanted a modern plugin ecosystem, and a modern plugin ecosystem wanted a modern everything-else.

The pattern

Read as a list, this looks like five projects:

  1. Raise test coverage.
  2. Fix the test pipeline.
  3. Add a delta coverage gate.
  4. Rewrite the deploy workflows.
  5. Modernize the backend stack.

That’s not what happened. What happened is that I did one project — (1) — and the rest followed, but not all for the same reason.

Test volume forced the parallelism. Without it, the new suite was a wall-time regression nobody would accept. The parallelism made the gate credible, because a slow suite plus a strict gate equals a team that stops running tests. Those two steps were forced in the strict sense: skip either, and the one before it reverts itself.

The deploy rewrite was different. Nothing was forcing it. Deploys worked. Nobody was asking me to touch them. What changed was that the parallelism work had handed me a mental model — one machine, honestly sized, well-packed — and once I had that model I couldn’t look at the old deploy shape without seeing a waste I wasn’t willing to leave in place. That’s not a forced problem. It’s a learned lens. I think the best engineering work splits roughly evenly between the two, and I think the learned-lens kind tends to be underrated because it’s harder to justify in a status update.

The stack bump was the tail. Some of it was mandatory (the new assembly plugin, the coverage instrumentation, the test framework’s newer base traits). Some of it was just true of where I’d put myself — once you’re deep in SBT plugin internals, the unpatched logback CVE and the EOL database driver coordinate become very hard to leave alone.

Any one of these in isolation would have been self-defeating. More tests on the old pipeline: you’d have reverted the tests. A new gate on a low baseline: you’d have reverted the gate. A deploy rewrite on stale plugins: you’d have eaten the UTF-8 bug in production.

I think this is the real shape of a lot of infrastructure work. The official ticket describes the first step. The steps after it don’t exist in the ticket system yet, because nobody knew they existed until the first one brought them into view — some as forced next-moves, some as things you now can’t un-see. The engineering judgment isn’t in whether to do them. It’s in noticing, mid-project, that you’ve left the original scope, and deciding whether that’s scope creep or the work itself.

For this one, I’m sure it was the work. The original ticket asked “can we raise coverage?” The real answer turned out to be longer: you can raise coverage, and here’s what that forces, and here’s what it teaches you to see next, and here’s the maintenance tail that comes along for the ride. That answer doesn’t fit on a ticket. It fits in a blog post.

The best engineering rarely fits the ticket it opened against. When you finish three layers deeper than you planned, that’s usually a sign you understood the problem correctly — not a sign that you lost the plot.
