A common piece of Rust perf folklore: “mark the rare branch cold, the compiler will lay out the hot one as fall-through.” I followed that advice on a streaming indicator’s warmup path. Every wrapping indicator regressed. Supertrend by 134%.
This is the post I wish existed last month.
The setup
quantedge-ta is a streaming technical-analysis library. One of the internal helpers shared across several indicators is EmaCore: a pub(crate) struct that mirrors the public indicator API with push for closed bars (backtests, advancing the stream) and replace for live in-progress bars (terminal repainting, intra-bar updates). Both have the same shape: a warmup phase that seeds an SMA over length ticks, then a steady-state mul_add forever after.
impl EmaCore {
    pub(crate) fn push(&mut self, price: Price) -> Option<Price> {
        if self.converged {
            // steady-state: one mul_add and we're done
            self.previous_value = self.value;
            self.value = self.alpha
                .mul_add(price - self.previous_value, self.previous_value);
        } else {
            // warmup: seed an SMA for `length` ticks, then flip `converged`
            // ...
        }
        self.value()
    }
}
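The warmup arm is elided above, so here is a minimal, self-contained sketch of the same two-phase shape. Everything in it is an assumption made for illustration: `EmaSketch` is a hypothetical stand-in (not the crate's `EmaCore`), `f64` replaces `Price`, `alpha = 2 / (length + 1)` is the conventional EMA smoothing factor, and returning `None` until the seed is ready is a guess based on the `Option<Price>` signature.

```rust
/// Hypothetical stand-in for the crate's `EmaCore`; `f64` replaces `Price`.
struct EmaSketch {
    length: usize,
    alpha: f64,
    seen: usize, // bars consumed during warmup
    sum: f64,    // running sum used for the SMA seed
    value: f64,
    converged: bool,
}

impl EmaSketch {
    fn new(length: usize) -> Self {
        assert!(length > 0, "length must be positive");
        Self {
            length,
            // Conventional smoothing factor; assumed, not taken from the crate.
            alpha: 2.0 / (length as f64 + 1.0),
            seen: 0,
            sum: 0.0,
            value: 0.0,
            converged: false,
        }
    }

    fn push(&mut self, price: f64) -> Option<f64> {
        if self.converged {
            // Steady state: value += alpha * (price - value), as one fused multiply-add.
            self.value = self.alpha.mul_add(price - self.value, self.value);
            return Some(self.value);
        }
        // Warmup: accumulate until `length` bars have arrived, then seed with their SMA.
        self.sum += price;
        self.seen += 1;
        if self.seen < self.length {
            return None;
        }
        self.value = self.sum / self.length as f64;
        self.converged = true;
        Some(self.value)
    }
}

fn main() {
    let mut ema = EmaSketch::new(3);
    for price in [1.0, 2.0, 3.0, 4.0] {
        println!("{:?}", ema.push(price));
    }
}
```

With `length = 3`, the first two pushes return `None`, the third returns the SMA seed, and every push after that takes the one-`mul_add` branch.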
EmaCore is wrapped by several indicators: EMA itself, MACD (×3 EmaCores), ATR, ADX (×4 EmaCores), Supertrend, and KC. ATR is itself wrapped by KC and Supertrend, so the cascade is genuinely deep. Every one of them wants to inline through EmaCore::push on its own hot path.
The crate ships with lto = "fat", codegen-units = 1, and EmaCore is pub(crate). In release mode, the inliner sees every body. That detail will matter shortly.
The tempting refactor
I’d just finished a perf audit. One ranked item said:
Split EmaCore::push hot/cold paths. Move warmup to a separate function, mark it #[cold] and #[inline(never)] so the steady-state body stays small and inliner-friendly.
The “obvious” implementation:
pub(crate) fn push(&mut self, price: Price) -> Option<Price> {
    if self.converged {
        return Some(self.steady_state(price));
    }
    self.push_warming(price)
}

#[cold]
#[inline(never)]
fn push_warming(&mut self, price: Price) -> Option<Price> {
    // ... seed SMA, flip converged ...
}
(steady_state here stands in for the inline mul_add the real code uses; not a helper that exists in the crate.)
#[cold] to mark the warmup as unlikely. #[inline(never)] to keep the steady-state body small and let the inliner pull it into wrappers. Two well-known hints, doing what they advertise.
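To make the shape of the refactor concrete, here is a self-contained sketch that puts both variants side by side on one hypothetical `Core` type (`f64` prices, conventional `alpha`, SMA seeding; all assumed, none taken from the crate). It only demonstrates that the split preserves behavior. It says nothing about speed, which depends entirely on what the inliner does with each variant.

```rust
/// Hypothetical core used only to compare the two code shapes.
struct Core {
    length: usize,
    alpha: f64,
    seen: usize,
    sum: f64,
    value: f64,
    converged: bool,
}

impl Core {
    fn new(length: usize) -> Self {
        let alpha = 2.0 / (length as f64 + 1.0);
        Self { length, alpha, seen: 0, sum: 0.0, value: 0.0, converged: false }
    }

    /// One warmup step, shared so both variants run identical logic.
    #[inline(always)]
    fn warm_step(&mut self, price: f64) -> Option<f64> {
        self.sum += price;
        self.seen += 1;
        if self.seen < self.length {
            return None;
        }
        self.value = self.sum / self.length as f64;
        self.converged = true;
        Some(self.value)
    }

    /// Variant 1: warmup stays in the same body as the steady state.
    fn push_monolithic(&mut self, price: f64) -> Option<f64> {
        if self.converged {
            self.value = self.alpha.mul_add(price - self.value, self.value);
            return Some(self.value);
        }
        self.warm_step(price)
    }

    /// Variant 2: warmup moved behind a cold, never-inlined call.
    fn push_split(&mut self, price: f64) -> Option<f64> {
        if self.converged {
            self.value = self.alpha.mul_add(price - self.value, self.value);
            return Some(self.value);
        }
        self.push_warming(price)
    }

    #[cold]
    #[inline(never)]
    fn push_warming(&mut self, price: f64) -> Option<f64> {
        self.warm_step(price)
    }
}

fn main() {
    let (mut a, mut b) = (Core::new(3), Core::new(3));
    for price in [10.0, 11.0, 12.0, 13.0, 12.5] {
        // Same inputs, same outputs: the attributes change codegen, not results.
        assert_eq!(a.push_monolithic(price), b.push_split(price));
    }
    println!("both variants agree");
}
```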
The criterion deltas (Apple M5 Max, --release, criterion defaults: 100 samples for stream/*, 200 for tick/*):
| Indicator | Before | After | Delta |
|---|---|---|---|
| `stream/ema20` | 854 ns | 1.32 us | **+55%** |
| `stream/ema200` | 856 ns | 1.32 us | **+55%** |
| `stream/macd12269` | 914 ns | 1.56 us | **+71%** |
| `stream/kc2010` | 1.00 us | 2.17 us | **+115%** |
| `stream/supertrend20` | 1.05 us | 2.44 us | **+134%** |
| `tick/ema20` | 2.77 ns | 4.47 ns | **+67%** |
| `tick/kc2010` | 6.49 ns | 10.83 ns | **+75%** |
Every wrapper regressed. The deeper the wrapping, the worse it got. Indicators that don’t touch EmaCore (stoch, stochrsi) stayed within the noise threshold, confirming the regression is EmaCore-specific. I dropped the commit before it landed.
To establish the noise floor I re-ran the unmodified baseline against itself. All deltas were within ±1%, several flagged “no change in performance detected.” So the +55% to +134% above is real signal, not measurement drift. I never publish performance claims I haven’t run the benchmarks for.
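For reading the Delta column, this small sketch shows the arithmetic I'd expect behind it: relative change in time per iteration, plus the equivalent slowdown factor. The function names are mine. The table shows rounded times, so recomputing from them will only approximately match the deltas criterion reported.

```rust
/// Relative change in time per iteration, in percent (positive = slower).
fn delta_pct(before_ns: f64, after_ns: f64) -> f64 {
    (after_ns - before_ns) / before_ns * 100.0
}

/// Slowdown factor: how many times longer the "after" build takes.
fn slowdown(before_ns: f64, after_ns: f64) -> f64 {
    after_ns / before_ns
}

fn main() {
    // stream/ema20 from the table: 854 ns before, 1.32 us after.
    println!("{:+.1}%", delta_pct(854.0, 1320.0));
    println!("{:.2}x", slowdown(854.0, 1320.0));
}
```

A delta of +100% is a 2× slowdown, so anything above +100% in the table means the indicator takes more than twice as long per iteration.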
Isolating the blame
#[cold] and #[inline(never)] are two attributes doing different things. The combination above told me they’re harmful together, but not which one was carrying the damage. So I ran a second experiment: I removed #[inline(never)] from both warmup helpers, push_cold and replace_cold (push_cold is the helper sketched as push_warming above), leaving #[cold] in place on its own. Same baseline, same hardware, same build config.
| Indicator | Baseline | `#[cold]` alone | Delta vs baseline |
|---|---|---|---|
| `stream/ema20` | 854 ns | 863 ns | **+1.3%** (noise) |
| `stream/ema200` | 856 ns | 856 ns | **+0.3%** (noise) |
| `stream/kc2010` | 1.00 us | 1.00 us | **−0.2%** (noise) |
| `stream/kc200100` | 1.00 us | 1.00 us | **−0.1%** (noise) |
| `stream/macd12269` | 914 ns | 1.57 us | **+71%** |
| `stream/macd12026090` | 917 ns | 1.55 us | **+69%** |
| `stream/supertrend20` | 1.05 us | 2.45 us | **+134%** |
| `stream/supertrend200` | 1.05 us | 2.45 us | **+135%** |
| `tick/ema20` | 2.77 ns | 4.43 ns | **+66%** |
| `tick/kc2010` | 6.49 ns | 10.84 ns | **+75%** |
This split the regressions into three groups.
Group A: returned to baseline. stream/ema* and stream/kc* (which were +55% and +115% with both attributes). With #[cold] alone they’re noise. So for these benchmarks, #[inline(never)] was the entire cause of the regression.
Group B: regressed identically with or without #[inline(never)]. MACD, Supertrend, ATR-chains, and most tick/* measurements. Same numbers within noise. #[cold] alone was sufficient to deinline these wrappers; the explicit #[inline(never)] was redundant damage.
Group C: untouched. stoch, stochrsi, anything not using EmaCore. Still flat. Negative control passes.
That second group is the interesting result. The headline number, +134% on Supertrend, holds with #[cold] alone, no #[inline(never)] anywhere in the crate. A single attribute, on a private helper, slowed a top-level indicator by more than 2×.
The puzzle (and an educated guess)
Before I get to the mechanism, the strangest part of this whole story:
The benchmark harness explicitly skips the warmup phase. Two helpers in the harness, split_at_warmup and max_convergence, feed a seed indicator with enough bars to converge before timing starts, so the timed loop only exercises the steady-state arm. push_cold and replace_cold are never called during the measured iterations. By the time criterion starts the stopwatch, every EmaCore in the harness has self.converged = true and stays that way.
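A std-only sketch of that harness shape, to show what "the cold body never runs during measurement" means in practice. This is not the real harness: `split_for_timing` is a hypothetical stand-in for the split step (the actual signatures of split_at_warmup and max_convergence are not shown in this post), `Ema` is a toy core, and `warmup_calls` is instrumentation added only to prove the warmup arm stays untouched once timing starts.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Toy EMA core with a counter on the warmup arm (instrumentation for this sketch only).
struct Ema {
    length: usize,
    alpha: f64,
    seen: usize,
    sum: f64,
    value: f64,
    converged: bool,
    warmup_calls: usize,
}

impl Ema {
    fn new(length: usize) -> Self {
        let alpha = 2.0 / (length as f64 + 1.0);
        Self { length, alpha, seen: 0, sum: 0.0, value: 0.0, converged: false, warmup_calls: 0 }
    }

    fn push(&mut self, price: f64) -> Option<f64> {
        if self.converged {
            self.value = self.alpha.mul_add(price - self.value, self.value);
            return Some(self.value);
        }
        self.warmup_calls += 1;
        self.sum += price;
        self.seen += 1;
        if self.seen == self.length {
            self.value = self.sum / self.length as f64;
            self.converged = true;
            return Some(self.value);
        }
        None
    }
}

/// Hypothetical helper: bars used to converge the indicator vs. bars that get timed.
fn split_for_timing(bars: &[f64], warmup_len: usize) -> (&[f64], &[f64]) {
    bars.split_at(warmup_len.min(bars.len()))
}

fn main() {
    let bars: Vec<f64> = (0..10_000).map(|i| 100.0 + (i % 17) as f64).collect();
    let mut ema = Ema::new(20);
    let (seed, timed) = split_for_timing(&bars, 20);

    // Untimed: drive the indicator through warmup.
    for &price in seed {
        ema.push(price);
    }
    assert!(ema.converged);
    let warmup_calls_before = ema.warmup_calls;

    // Timed: only the steady-state arm can run from here on.
    let start = Instant::now();
    for &price in timed {
        black_box(ema.push(black_box(price)));
    }
    let elapsed = start.elapsed();

    // The warmup arm was not entered once during the timed loop.
    assert_eq!(ema.warmup_calls, warmup_calls_before);
    println!("{} steady-state pushes in {:?}", timed.len(), elapsed);
}
```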
So the cold body is dead code at measurement time. Yet hinting it changes the steady-state numbers by 50-150%. The hot path is what runs; the cold path is what got annotated; only the hot path is measured; the steady-state numbers move anyway.
That’s the part of the result I’m most confident about (it’s directly measured) and least confident about explaining. Here’s my best guess.
The LLVM cold attribute does three things to the function it’s attached to:
- Branch probability metadata. MachineBlockPlacement lays out the calling block such that the hot path falls through and the cold call is the not-taken edge.
- Register allocation priority. The hot path gets the better register assignments.
- Inline cost penalty. Calls into cold-marked functions are charged extra by the inliner.
The first two only matter on the body that’s actually executed, and the bench never executes the cold body. The third one is where I’d put the money. #[cold] raises the inliner’s cost estimate for EmaCore::push itself, because push’s body contains a call into a cold-attributed helper. The inliner now sees a function whose call graph touches “expensive” territory and gets conservative about pulling push into its callers.
For MACD (three EmaCores composed in series) and Supertrend (EMA-of-ATR plus conditional update), the inliner’s budget is tighter, and the cost penalty pushes push out of the wrapper. For bare EMA and KC the budget has more slack, so the same penalty doesn’t change the decision and the wrapper stays inlined.
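To picture what "three EmaCores composed in series" means for the inliner, here is a hypothetical MACD-shaped wrapper. It assumes the textbook definition (MACD line = fast EMA minus slow EMA, signal = EMA of that line) and a toy `Core` with SMA seeding; the crate's real MACD is not shown in this post, so treat the structure as illustrative only.

```rust
/// Toy EMA core: SMA seed over `length` bars, then a fused multiply-add per bar.
struct Core {
    length: usize,
    alpha: f64,
    seen: usize,
    sum: f64,
    value: f64,
    converged: bool,
}

impl Core {
    fn new(length: usize) -> Self {
        let alpha = 2.0 / (length as f64 + 1.0);
        Self { length, alpha, seen: 0, sum: 0.0, value: 0.0, converged: false }
    }

    fn push(&mut self, price: f64) -> Option<f64> {
        if self.converged {
            self.value = self.alpha.mul_add(price - self.value, self.value);
            return Some(self.value);
        }
        self.sum += price;
        self.seen += 1;
        if self.seen < self.length {
            return None;
        }
        self.value = self.sum / self.length as f64;
        self.converged = true;
        Some(self.value)
    }
}

/// Hypothetical wrapper: three cores behind one `push`.
struct MacdSketch {
    fast: Core,
    slow: Core,
    signal: Core,
}

impl MacdSketch {
    fn new(fast: usize, slow: usize, signal: usize) -> Self {
        Self { fast: Core::new(fast), slow: Core::new(slow), signal: Core::new(signal) }
    }

    /// Returns `(macd_line, signal_line)` once all three cores have converged.
    fn push(&mut self, price: f64) -> Option<(f64, f64)> {
        // Three call sites for `Core::push`: each one is a separate inlining decision.
        let fast = self.fast.push(price);
        let slow = self.slow.push(price);
        let line = fast? - slow?;
        let signal = self.signal.push(line)?;
        Some((line, signal))
    }
}

fn main() {
    let mut macd = MacdSketch::new(2, 3, 2);
    for price in [1.0, 2.0, 3.0, 4.0] {
        println!("{:?}", macd.push(price));
    }
}
```

If even one of those three call sites stops being inlined, the wrapper's steady-state path gains a real function call per bar, which is the kind of change the hypothesis above would predict.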
The honest summary: I know #[cold] alone, on a never-executed branch, regressed the deep-wrapper indicators by ~70-135%. I have a plausible mechanism, but I haven’t confirmed it. Take the explanation as a hypothesis, not a verdict.
Rules of thumb I now actually believe
- #[cold] is for panics, error returns, once-per-process init. Not for streaming warmup, not for “the rare side of any frequent branch.” It’s a function-boundary annotation, not a local layout hint.
- Observation: #[cold] on a leaf helper changed steady-state codegen in wrappers that never invoked the cold body. Magnitude scaled with composition depth: noise on bare EMA and KC, +71% on MACD-of-EMAs, +134% on Supertrend-on-ATR-on-EMA. I have a hypothesis for the mechanism (see above), not a confirmed cause.
- In this experiment, #[inline(never)] was either the entire cause of a regression or pure redundant damage. Group A (bare EMA, KC) regressed only because of #[inline(never)]; with #[cold] alone they returned to baseline. Group B (MACD, Supertrend) regressed identically with or without it. Never neutral, never helpful in my measurements.
- Always bench wrappers, not just the bare function. Bare EMA showed +55% on the bad refactor, already enough to revert. The +134% on Supertrend was the full picture. Validating a perf change on the shallowest indicator and shipping it would have missed most of the regression.
- Don’t extrapolate perf results across composition depths. A change that’s noise on a leaf or shallow wrapper can be a 2× regression on a deeper one. Group A and Group B in my experiment exercise the exact same EmaCore::push body; the difference is what’s calling it.
- A bench that excludes warmup can still move when you change the warmup arm. The hint changes how the hot path is compiled, not just where the cold body lives. The cold body never runs and the hot body still gets slower.
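As an illustration of the first rule, this is the kind of place where #[cold] matches its intended use: an error-construction path that should essentially never run in a healthy program. The names (`InvalidLength`, `invalid_length`, `alpha_for`) are hypothetical and not from the crate, and whether the attribute helps at all in a given codebase still needs a benchmark.

```rust
/// Hypothetical error type for a rejected indicator length.
#[derive(Debug, PartialEq)]
struct InvalidLength(usize);

/// Error construction: rare by design, so a function-level `#[cold]` fits here.
#[cold]
fn invalid_length(length: usize) -> InvalidLength {
    InvalidLength(length)
}

/// Conventional EMA smoothing factor, rejecting a zero length.
fn alpha_for(length: usize) -> Result<f64, InvalidLength> {
    if length == 0 {
        return Err(invalid_length(length));
    }
    Ok(2.0 / (length as f64 + 1.0))
}

fn main() {
    println!("{:?}", alpha_for(3));
    println!("{:?}", alpha_for(0));
}
```

The difference from the warmup case is frequency and placement: this path sits at construction time, not inside a per-bar `push` that every wrapper wants to inline.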
The meta-lesson
The audit ranked the bad refactor “low effort, expected gain.” Three days and two measurement rounds later it was a +134% regression on Supertrend, with #[cold] carrying the damage on its own. The audit was wrong because it reasoned about codegen by intuition instead of measurement. The intuition (mark the rare branch cold → compiler optimises layout) was correct in isolation and disastrous in context.
The lesson beyond “don’t do this specific thing”: codegen hints have non-local effects, and the size of the effect depends on what’s wrapping you. #[cold] on a leaf helper isn’t a local layout hint, it’s an inline-cost message that travels up the call graph as far as the inliner’s cost estimator cares. Pick the hint that matches your blast radius, and never publish a perf claim without the bench output.
The fix (what actually replaced this refactor in quantedge-ta) is a separate story about Rust 1.95’s core::hint::cold_path(), function exit shapes, and what #[inline] actually does in a fat-LTO crate. Different post, different week.
