Software Architecture

We Tripled the Test Suite. Then Everything Else Had to Change.

I set out to raise test coverage across a large monorepo. Five weeks later I'd also rewritten the test pipeline, the coverage gate, the deploy workflows, and half the backend dependency graph. Some of it was forced. Some was learned. None of it was on the ticket.

I started the quarter with one goal: raise test coverage on a backend monorepo that had grown faster than its tests.

I ended the quarter having rewritten the test pipeline, added a coverage gate, redesigned the deploy workflows, and bumped most of the backend’s dependency stack. None of that was on the original ticket. All of it turned out to be necessary.

Looking back, the thing that stuck with me isn’t any individual change. It’s that each one came out of the one before — some forced by it, some learned from it. I didn’t choose to do five projects. I chose to do one, and the other four followed from it in ways I didn’t plan for.

How it started

The monorepo had grown organically for years. Many modules were well-tested. Many were tested ceremonially. Some weren’t tested at all. Coverage wasn’t catastrophic — it was uneven, and the uneven parts were growing. A few utilities, a couple of shared libraries, most of the response-generator code, and a long tail of lambda functions had token smoke tests or none.

Adding a coverage gate to that situation would have been dead on arrival. Whole-tree coverage thresholds punish contributors for historical gaps they didn’t create. Somebody touches one method in a module that’s been at 20% since 2019 and the gate blocks their PR until they bring the whole module up. They didn’t sign up to fix that module. They signed up to fix a bug. That’s how you teach a team to resent CI.

So the plan was: raise the baseline first, then gate. Get the floor high enough that a delta-style gate — one that only measures the lines touched in this PR — won’t flag false positives on modules that are already fine.

With AI-assisted test generation, that was tractable at a scale it hadn’t been before. Hundreds of thousands of lines of new tests, landed across several large, module-sized batches, in a few weeks. By the time the baseline push was done, the suite had roughly tripled — well over twenty thousand tests across more than a hundred subprojects. Huge number, straightforward goal. That’s where I thought the project ended.

It wasn’t.

What broke first: the test pipeline

The morning after the third big test-writing PR merged, CI started to feel wrong. Not broken. Noticeably slower, and slow in a way that compounded.

The old backend test pipeline ran three self-hosted runners per PR. One runner warmed caches and did a bare compile. A second runner pulled the same cache, did the actual test run, and pushed results. A third ran a narrow dependency-injection sanity check in parallel. Each runner paid its own JVM startup, its own zinc warm-up, its own cache pull. On the old suite, this was fine. On a suite three times the size, the cross-runner cache handoff plus the in-process serial test execution inside SBT turned every PR into a fifteen-minute wait.

The fix came in two pieces. First, collapse the pipeline: one runner, one SBT session, compile and test in sequence sharing the same JVM and the same zinc state. No intermediate cache push/pull. Second, turn on SBT’s forked-test parallelism so the runner’s cores actually get used. I wrote about the SBT settings themselves in “when sbt test is secretly single-threaded” — that’s the narrow technical version of this part.
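For reference, the forked-parallelism piece looks roughly like this in build.sbt. This is a sketch, not the repo’s exact settings — the exact ceiling depends on the runner, and the divisor here is just the cores/2 rule of thumb:

```scala
// build.sbt — illustrative sketch, not the repo's exact configuration.

// Fork the test JVM and allow forked suites to run in parallel
Test / fork := true
Test / testForkedParallel := true

// Cap total concurrent work at half the runner's cores, so forked
// JVMs plus the build itself don't oversubscribe the machine
Global / concurrentRestrictions := Seq(
  Tags.limitAll(java.lang.Runtime.getRuntime.availableProcessors / 2)
)
```

Without the restriction, SBT’s default concurrency plus a forked JVM per suite can swamp a single runner — the ceiling is what makes the packing honest.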

Both pieces together: the local suite went about 2.1× faster, the pipeline dropped a runner, and the compile/test handoff tax went to zero. One fewer self-hosted instance per PR, multiplied by how often PRs merge, turned out to matter in the pool.

Somewhere in doing this, I internalized something I didn’t know I was internalizing: one machine, sized honestly, with a concurrency ceiling that matches its cores, will comfortably do the work of a dozen undersized machines. That sentence is obvious written down. It wasn’t obvious before I’d spent a week staring at forked-JVM tuning. A week later, I’d apply it somewhere I hadn’t planned to.

Second-order: the baseline can’t regress

Getting the baseline up was the hard part. Keeping it up was the subtle part. Once you’ve spent weeks writing tests, the last thing you want is for the next PR to quietly land zero tests against fifty new lines of a controller.

The gate had to be per-PR, on the diff, not the whole tree. A different threshold for new files than for modified ones — you can ask more of something being written from scratch than of something being touched in passing. Exemptions for config, migrations, barrel files, type-only declarations, generated code, styles, docs. Without exemptions the gate complains about a package.json bump or a CSS rename, and contributors learn to route around it.

The first threshold I shipped was 70% on new files, 50% on modified. By the end of the next day I’d relaxed it to 50/35. A small UI change can touch many lines of JSX the author isn’t meaningfully rewriting, and the original thresholds were blocking PRs in ways nobody agreed with. A gate that blocks real work doesn’t stay a gate — people find ways around it, the team loses trust in CI, and you end up rebuilding credibility the gate never had the chance to earn in the first place.

I also shipped it as advisory first, not required. Report the delta as a sticky PR comment, show the status check, let everyone see the numbers for a couple of weeks, then flip branch protection. A gate that’s been quietly correct for ten PRs is easier to make required than one that arrives required on day one.
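The per-file decision is small enough to sketch. Everything here — the type, the field names, the exemption list — is illustrative rather than the real gate’s code, but the thresholds are the 50/35 pair that survived:

```scala
// Hypothetical sketch of the per-file delta-gate decision.
object DeltaGate {
  // Coverage of only the lines this PR added or modified in one file.
  final case class FileDelta(path: String, isNew: Boolean, covered: Int, touched: Int)

  // Illustrative exemption patterns: config, styles, docs, type-only files
  private val exempt = List(".json", ".yml", ".css", ".md", ".d.ts")

  // Ask more of files written from scratch than of files touched in passing
  def threshold(d: FileDelta): Double = if (d.isNew) 0.50 else 0.35

  def passes(d: FileDelta): Boolean =
    exempt.exists(suffix => d.path.endsWith(suffix)) ||
      d.touched == 0 ||
      d.covered.toDouble / d.touched >= threshold(d)
}
```

The exemptions do as much work as the thresholds: without them, the gate fires on changes nobody thinks of as “code,” and that is exactly how trust erodes.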

Third-order: I couldn’t un-see it

Once you’ve spent a few days packing forked JVMs onto a single runner — cores/2, Tags.limitAll, xargs -P 4, the whole mental model of “what can I actually fit on one machine if I’m careful about it” — you can’t go back to thinking about work the way you did before.

The test-pipeline changes were supposed to be the end of the parallelism thread. Instead, the day I finished tuning them I opened the deploy workflows, and the shape of what was in front of me stopped looking normal.

The old backend deploy pipeline was a matrix fan-out with roughly a hundred self-hosted runners per deploy — one runner per module to build a single fat-jar, then one runner per lambda function to call the AWS update API. Every runner cold-started its own JVM. Every runner re-pulled the cache. Every runner compiled the shared base modules independently. Nothing about this was obviously broken. It worked. Deploys completed. Nobody was complaining.

But it was the exact opposite shape of the one I’d just spent two weeks proving out on tests. One well-sized machine, tasks packed with a concurrency ceiling, sharing a warm JVM and a warm zinc state, comfortably doing the work of many undersized machines. The deploys were running the many undersized machines because nobody had stopped to ask whether they had to.

The rewrite applies the same model. One SBT session runs all a/assembly b/assembly … instead of fifty-plus runners each doing a single assembly. Shared-base modules compile once and every downstream assembly reuses them. Docker builds and pushes happen on a single runner with xargs -P 4 — I swept from -P 1 through -P 8 and 4 was the clear sweet spot; above that, ECR egress and disk IOPS dominated and throughput fell monotonically. Same principles. Same ceilings. Same concurrency primitives I’d just spent two weeks tuning on tests.
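The single-session shape is mostly an aliasing exercise. A sketch, with hypothetical subproject names standing in for the real modules:

```scala
// build.sbt — subproject names are hypothetical.

// One sbt invocation, one JVM, one warm zinc state: the `all` command
// runs the listed tasks in parallel inside the session, and shared base
// modules compile once for every downstream assembly that needs them.
addCommandAlias("assembleAll", "all core/assembly api/assembly jobs/assembly")

// The same honest ceiling as the test pipeline: cap CPU-tagged tasks
// so parallel assemblies don't oversubscribe the single runner
Global / concurrentRestrictions += Tags.limit(Tags.CPU, 4)
```

The Docker side is the shell equivalent of the same idea — xargs -P 4 is just a concurrency ceiling by another name.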

Headline: peak concurrent runners per deploy dropped by about twenty times. Runner-hours per full deploy dropped by about thirty times.

The framing that matters here is that this rewrite wasn’t driven by a problem. It was driven by having learned something. The deploy pipeline wasn’t on fire. It would have stayed the way it was indefinitely if nothing else had nudged me. What changed wasn’t the system — it was that I now had a model that made the old shape look wrong. That’s a different kind of engineering trigger than “fix what’s broken,” and I think a more honest one.

Fourth-order: the stack itself had to catch up

The parallelism work wanted newer SBT plugins. The deploy rewrite wanted a newer assembly plugin. The coverage instrumentation wanted a newer coverage library, which wanted a newer test framework, which wanted a newer mocking library.

At some point I stopped trying to dodge the upgrade and went through the whole stack. A milestone release of a web framework replaced with its stable version. A mocking library four majors behind bumped to current. A long-EOL database-driver coordinate migrated to the actively-maintained one. A logback version carrying an unpatched CVE replaced. Dozens of test files codemodded over to the new test-framework base traits.

The newer assembly plugin exposed a latent bug where the JVM’s locale defaults on the build container couldn’t handle non-ASCII class names in a merge-report path. The old plugin silently tolerated it. The new one crashed. The fix was setting LANG=C.UTF-8 and a couple of JVM flags at the container level. Caught before production, not after — because without the upgrade the bug was just sitting there waiting.
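The actual fix lived at the container level via LANG=C.UTF-8. The JVM-side equivalent, expressed as a build.sbt fragment, might look like the pair below — these are my assumption of the usual encoding flags, not quoted from the incident:

```scala
// build.sbt sketch — the container-level fix (LANG=C.UTF-8) expressed
// as the equivalent JVM flags; assumed, not quoted from the incident.
ThisBuild / javaOptions ++= Seq(
  "-Dfile.encoding=UTF-8",
  "-Dsun.jnu.encoding=UTF-8" // governs how the JVM decodes file and path names
)
```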

I didn’t want to do this work. It was the boring, unglamorous kind of maintenance that never makes a good PR title. But every single one of these upgrades was a precondition for something earlier in this post. Forked-test parallelism at scale. The parallel-assembly deploy path. The coverage aggregation across a hundred-plus modules. Each of those wanted a modern plugin ecosystem, and a modern plugin ecosystem wanted a modern everything-else.

The pattern

Read as a list, this looks like five projects:

  1. Raise test coverage.
  2. Fix the test pipeline.
  3. Add a delta coverage gate.
  4. Rewrite the deploy workflows.
  5. Modernize the backend stack.

That’s not what happened. What happened is that I did one project — (1) — and the rest followed, but not all for the same reason.

Test volume forced the parallelism. Without it, the new suite was a wall-time regression nobody would accept. The parallelism made the gate credible, because a slow suite plus a strict gate equals a team that stops running tests. Those two steps were forced in the strict sense: skip either, and the one before it reverts itself.

The deploy rewrite was different. Nothing was forcing it. Deploys worked. Nobody was asking me to touch them. What changed was that the parallelism work had handed me a mental model — one machine, honestly sized, well-packed — and once I had that model I couldn’t look at the old deploy shape without seeing a waste I wasn’t willing to leave in place. That’s not a forced problem. It’s a learned lens. I think the best engineering work splits roughly evenly between the two, and I think the learned-lens kind tends to be underrated because it’s harder to justify in a status update.

The stack bump was the tail. Some of it was mandatory (the new assembly plugin, the coverage instrumentation, the test framework’s newer base traits). Some of it was just true of where I’d put myself — once you’re deep in SBT plugin internals, the unpatched logback CVE and the EOL database driver coordinate become very hard to leave alone.

Any one of these in isolation would have been self-defeating. More tests on the old pipeline: you’d have reverted the tests. A new gate on a low baseline: you’d have reverted the gate. A deploy rewrite on stale plugins: you’d have eaten the UTF-8 bug in production.

I think this is the real shape of a lot of infrastructure work. The official ticket describes the first step. The steps after it don’t exist in the ticket system yet, because nobody knew they existed until the first one brought them into view — some as forced next-moves, some as things you now can’t un-see. The engineering judgment isn’t in whether to do them. It’s in noticing, mid-project, that you’ve left the original scope, and deciding whether that’s scope creep or the work itself.

For this one, I’m sure it was the work. The original ticket asked “can we raise coverage?” The real answer turned out to be longer: you can raise coverage, and here’s what that forces, and here’s what it teaches you to see next, and here’s the maintenance tail that comes along for the ride. That answer doesn’t fit on a ticket. It fits in a blog post.

The best engineering rarely fits the ticket it opened against. When you finish three layers deeper than you planned, that’s usually a sign you understood the problem correctly — not a sign that you lost the plot.
