48 Builds and a Crash from the Grave

Some bugs are hard because they’re complex. Some are hard because they’re intermittent. The worst ones are both.

This one took three months. Forty-eight builds. An iOS crash that appeared under conditions that resisted every attempt to reproduce it reliably.

If you read the first post, you know the setup: a third-party Flutter-backed ticketing SDK embedded inside an Expo app via an Expo native module. Three languages, three build systems, a glorified iframe.

The integration worked. Our QA signed off. We sent it to the customer for testing. Then they came back: “the app crashes when I close it”.

The shape of the problem

iOS only. Android was clean. That asymmetry was the first useful signal: whatever was crashing was rooted in iOS-specific behaviour, either in our module or in the SDK’s internal Flutter engine.

The crash was intermittent, but not in a predictable way. It didn’t happen every Nth run, and it manifested differently every time: sometimes it required staying on a particular screen for a while, sometimes it only happened when you closed the app immediately after opening the SDK, sometimes only when navigating back and forth between screens.

Rarely reproducible on the simulator. Sometimes reproducible in dev mode on a real device. The most “reliable” environment was always the TestFlight production build, but even there it was finicky. There were stretches where only a single device on the customer’s side could trigger it.

No warning, no error. Just an iOS crash modal, moments after the app was swiped up from the app switcher. It looked like the app was already gone, but it threw the crash from the grave.

From the outside, every report looked the same: “the app crashed when I closed it”. Same symptom, same sentence, every time. But internally, these were different issues. Different triggers, different timing, different points in the lifecycle. The crash reports all looked alike to QA and the customer. Under the hood, each one was a different conversation.

The black box problem (that you can’t even log)

The standard approach to intermittent crashes is: add logging, narrow the reproduction case, and confirm the fix. None of that worked here.

The crash was happening during app shutdown. iOS gives your app a limited window to clean up and free resources. If it doesn’t finish in time, iOS kills it and generates a crash report. That’s exactly what was happening: the Flutter engine teardown was taking too long, and iOS was pulling the plug.
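The constraint can be sketched in a few lines. This is not the app’s actual delegate, just the shape of the problem; cleanUpNativeModules is a hypothetical helper standing in for whatever teardown the module and engine perform:

```swift
import UIKit

class AppDelegate: UIResponder, UIApplicationDelegate {
    func applicationWillTerminate(_ application: UIApplication) {
        // Everything here shares one fixed budget (roughly five seconds).
        // A slow Flutter engine teardown, plus logging, plus module cleanup,
        // can overrun it.
        cleanUpNativeModules() // hypothetical helper: release bridges, engines, observers

        // If control hasn't returned to iOS by the deadline, the watchdog
        // terminates the process and generates a crash report — the
        // "crash from the grave".
    }

    private func cleanUpNativeModules() {
        // ...engine shutdown, observer removal, resource release...
    }
}
```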

Here’s the catch: adding logging consumes time during that shutdown window. The very act of trying to observe the problem made it worse. More logging meant more work during cleanup, which meant less time for the engine to shut down gracefully, which meant more crashes. A textbook Heisenbug: the bug that changes behaviour when you try to observe it. The debugging tool and the bug were competing for the same resource.

So logging was out. Breakpoints were out (can’t attach a debugger to a TestFlight build on a customer’s device). Stepping through the SDK’s internal code was out (black box).

All I had were crash reports and stack traces. The stack traces pointed squarely at the Flutter engine: a stack overflow in its recursive cleanup code. The engine was trying to tear itself down, and the cleanup was calling itself until the stack blew up. Sometimes it crashed from the overflow directly; sometimes it just consumed enough of the shutdown window that iOS killed the process first. Either way, same result from the user’s perspective: crash from the grave.
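The failure shape is easy to reproduce in miniature. This is a toy sketch, not the engine’s actual code: a teardown that notifies an observer, whose reaction triggers teardown again, recurses until the stack overflows.

```swift
// Toy illustration of unbounded recursive cleanup — not the Flutter engine's code.
final class Engine {
    var onStateChange: (() -> Void)?

    func shutDown() {
        // Teardown notifies an observer about the state change...
        onStateChange?()
    }
}

let engine = Engine()
// ...and the observer responds to "shutting down" by calling shutDown again.
engine.onStateChange = { engine.shutDown() }

// engine.shutDown() // would recurse without bound and overflow the stack
```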

The validation cycle from hell

Each hypothesis required a build, testing on my own device first, then uploading to TestFlight if it looked good. Sometimes it didn’t even get that far: I’d message QA “ok it works, please test” and then five minutes later “hold your horses, I just got the crash on my device”. Back to the code.

When a build did survive my device, the crash didn’t play fair across others. There were builds that worked on mine and on QA’s devices, but crashed on one specific customer device. There was no consistency about which device, which sequence, or which timing would trigger it. Back to the board. Again.

Forty-eight builds over approximately three months. Not full-time: the crash wasn’t severe enough to drop everything. But it was always there, and every time we thought we’d fixed it, another device or another sequence proved us wrong.

Connecting the dots across three lifecycles

This wasn’t a problem you could solve by knowing one framework well. The crash lived in the intersection of three lifecycle models: iOS (UIKit view controller lifecycle), Expo (React Native’s bridge and module lifecycle), and the SDK’s internal Flutter engine lifecycle. Each one had its own rules about when things get created, when they’re active, and when they’re torn down. And each layer added its own slight delay before the shutdown sequence reached the next one. Those delays added up, and iOS gives you roughly 5 seconds in applicationWillTerminate to finish cleaning up before it kills the process. Every millisecond of lifecycle overhead between those three layers was one less millisecond for the Flutter engine to shut down cleanly.

The skill that mattered wasn’t deep expertise in any one of them. It was being able to reason across all three simultaneously. Understanding that an iOS view controller dismissal triggers a specific sequence, that the Expo module lifecycle doesn’t necessarily align with that sequence, and that the Flutter engine inside the SDK has its own intermediate states that neither iOS nor Expo knows about. Connecting those dots from different angles, different systems, different platforms, is what eventually surfaced the root cause.

Managing the problem, not just the code

Around build 20 or so, I had an honest conversation with the customer.

The app hadn’t been released to the public yet, so we didn’t know how many users it would actually affect. Based on my observations and assumptions about user behaviour, I believed it was a corner case: it didn’t happen every time, and the app was otherwise working well. But we couldn’t be sure. There was a real question on the table: was this worth continuing to chase before launch?

I explained the situation clearly: this is an intermittent crash in a very specific flow. It might affect a small percentage of sessions. We can keep investigating, but the feedback loop is slow and there’s no guaranteed timeline. The alternative is to accept the risk and move on to other work.

The customer decided to push on. They wanted it fixed properly.

That decision mattered. It gave me the space to keep iterating without the pressure of justifying every build. And honestly, I still had ideas. Each failed attempt narrowed the problem. I wasn’t guessing anymore; I was eliminating. The crash was solvable. It was just expensive to prove.

Working with AI (and giving it memory)

Claude Code was a useful partner in this loop, but not in the way you might expect. It didn’t find the bug. What it did was accelerate the reasoning cycle.

Early on I realised that context was the bottleneck. Each debugging session started cold: what have we tried? What did we learn? What’s left? So I built a workflow around a memory file.

The prompt included instructions to: (1) read the memory file before doing anything, (2) read the crash logs, (3) analyse and plan a fix, (4) check the memory file to verify this hasn’t been attempted before, (5) if it’s genuinely new, add the approach to the memory file, then proceed with the fix.
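The memory file itself was nothing elaborate. A hypothetical sketch of the format (file name, build numbers, and entries invented for illustration):

```markdown
# crash-memory.md — read this before proposing anything

## Attempted (do not repeat)
- Deallocate the SDK view controller earlier in dismissal — crash still
  reproduced on a customer device.
- Delay module cleanup to let the engine settle — clean on my device,
  crashed on QA's when the app was closed immediately.

## Ruled out
- Memory pressure: crash reports show plenty of headroom.

## Open hypotheses
- Teardown ordering between our module and the engine is non-deterministic.
```

Each session appended its attempt and outcome, so the next one started from the accumulated state instead of from zero.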

This went on for at least 18 iterations, probably more. And even with those instructions and a persistent log of previous attempts, Claude sometimes proposed the same fix again. That’s the reality of working with AI on long-running problems: the memory helps, but it’s not perfect. The memory file itself eats into the context window, leaving fewer tokens for each subsequent attempt. And you still need to be the one who catches the loop.

What it did well was compress the reasoning cycle and reach for techniques I wouldn’t have tried. At one point it suggested method swizzling in Objective-C to intercept UIKit’s cleanup calls. I’d never have gone there on my own. Even that didn’t help, but having a partner willing to reach outside my usual toolkit was genuinely valuable. Combined with the memory file carrying context forward between sessions, the AI could review what had been tried, what changed, what didn’t, and propose the next hypothesis from a higher starting point. For a problem where 90% of the time is spent thinking and 10% typing, that mattered.
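For the curious, swizzling to observe UIKit’s cleanup looks roughly like this (a sketch of the technique, not the exact code Claude proposed — here exchanging viewDidDisappear to timestamp dismissals):

```swift
import UIKit
import ObjectiveC

extension UIViewController {
    // Call once at startup to exchange the two implementations.
    static func swizzleViewDidDisappear() {
        guard
            let original = class_getInstanceMethod(
                UIViewController.self,
                #selector(UIViewController.viewDidDisappear(_:))),
            let swizzled = class_getInstanceMethod(
                UIViewController.self,
                #selector(UIViewController.swizzled_viewDidDisappear(_:)))
        else { return }
        method_exchangeImplementations(original, swizzled)
    }

    @objc private func swizzled_viewDidDisappear(_ animated: Bool) {
        // Observe the dismissal before UIKit's own cleanup continues.
        NSLog("viewDidDisappear: %@", String(describing: type(of: self)))
        // Because the implementations were exchanged, this call invokes
        // the original viewDidDisappear, not itself.
        swizzled_viewDidDisappear(animated)
    }
}
```

The same exchange works for any lifecycle selector you want to intercept, at the cost of the extra work happening inside the very shutdown window the bug was starving.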

Breaking open the black box

We’d been in contact with the SDK provider throughout, sharing crash reports and stack traces as they came in. But for most of the process, the engine was still a black box. We could report what was happening; we couldn’t control it.

Eventually I pushed for direct access to the Flutter engine lifecycle. The argument was simple: the crash was in their engine’s shutdown, but the trigger was on our side. We needed a way to start the shutdown process earlier, before iOS started its own cleanup countdown. They agreed, and exposed the engine lifecycle to us.

That was the breakthrough. Not a fix we applied on our side, but a collaboration that opened up the debugging surface. Once we could control the engine shutdown sequence explicitly, the race condition became visible and solvable.

The root cause was what we’d suspected for weeks: when the user navigated away from the ticketing screen, our module and the SDK’s engine were both cleaning up, but the order wasn’t deterministic. Most of the time, the timing worked out. Occasionally the engine was still in an intermediate state when iOS inspected the view hierarchy, found it inconsistent, and killed the process.

With the engine lifecycle exposed, we could ensure a deterministic shutdown: our module waits for the engine to reach a fully stopped state before proceeding with its own cleanup. Not “we sent the signal”. Actually stopped.
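In sketch form, the ordering guarantee looks like this. TicketingSDK and shutdownEngine(completion:) are hypothetical names standing in for whatever hook the provider exposed; the point is that our cleanup blocks on the engine reporting a fully stopped state, with a timeout that keeps us inside the shutdown window:

```swift
import Foundation

func tearDownModule(completion: @escaping () -> Void) {
    let engineStopped = DispatchSemaphore(value: 0)
    let deadline = DispatchTime.now() + .seconds(3) // stay inside iOS's budget

    // Hypothetical SDK hook: the completion fires only once the engine
    // has actually reached its stopped state — not when the signal was sent.
    TicketingSDK.shared.shutdownEngine {
        engineStopped.signal()
    }

    DispatchQueue.global().async {
        // Block until the engine is down, or the budget runs out.
        _ = engineStopped.wait(timeout: deadline)
        DispatchQueue.main.async {
            // Only now release our own view hierarchy and bridge resources.
            completion()
        }
    }
}
```

The timeout is a safety valve: if the engine hangs, we still proceed rather than letting the watchdog kill the process for us.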

The crash rate went from intermittent to zero.

What I took away from this

Three months to fix a crash that most users would never experience. A few things are true simultaneously:

Black boxes are expensive. The SDK being opaque constrained the debugging surface in a real way. The fix ultimately required the provider to open up their engine lifecycle. When you embed a third-party SDK, you’re not just embedding their code. You’re embedding their debugging surface, or the lack of it.

Push for what you need from the vendor. We were in contact throughout, but the breakthrough came when I pushed for direct access to the engine lifecycle. Sharing crash reports is collaboration; getting control of the shutdown sequence is what actually fixed it.

Cross-platform experience pays off in unexpected places. This wasn’t a Flutter bug, an Expo bug, or an iOS bug. It was a timing issue that only existed at the intersection of all three lifecycle models. Solving it required understanding all three well enough to reason about their interactions. Years of working across different platforms and frameworks is what made that possible.

Intermittent bugs need a different mindset. You can’t reproduce-and-fix. You have to hypothesise, ship, and wait. That requires patience and the discipline to change one thing at a time. There’s no shortcut.

AI with memory is more useful than AI without it. Giving Claude Code a persistent markdown log of attempted fixes turned a cold-start problem into a continuous investigation. The technique is simple and transferable to any long-running debugging effort.

The integration is stable. The crash is gone. It took 48 builds to get there.

Sometimes that’s what shipping looks like.
