Skip to content

Benchmark results

Benchmark findings from Phase 5 onward, measured with scripts/bench_vs_scipy.py. In-process via ctypes.CDLL (no subprocess-per-trial cost). 20 trials per point, sizes 10 to 100k, f64 throughout. Single box, single thread.

KernelNautilus ns/elscipy ns/elSpeedupThroughput
erfinv (n=100k)2.214.76.5x450M/s
normal_inv_cdf (n=100k)2.718.56.8x370M/s
normal_cdf (n=100k)4.619.24.2x220M/s
erf (n=100k)3.511.33.2x285M/s
gamma_cdf(2,1) (n=100k)18.241.12.3x55M/s
student_t_cdf(df=5) (n=100k)64.4144.52.3x15.5M/s

Every scalar kernel is Nautilus-wins at large n after the P5 fixes.

This is the dominant performance story. A Chelis-level composition like normal_cdf(x) * exp(-x*x) compiles to a single C loop that loads each x once, computes the composition in registers, and stores one result. Numpy evaluates the same expression as a chain of ufunc calls with intermediate temporaries; at medium-to-large n the intermediates don't fit in L2 and bandwidth dominates.

Compound expressionNautilus ns/elnumpy ns/elSpeedupThroughput
normal_cdf(x)*exp(-x^2) (n=100k)6.720.43.0x150M/s
norm_pdf from primitives (n=100k)2.27.93.6x450M/s

Nautilus wins at every size tested on fused compounds, including n=100k where we had previously expected bandwidth to cap the win.

Adding -flto -fuse-linker-plugin -fvisibility=hidden -Wl,-Bsymbolic to the bench shared-object build improves erf by 3.8x (13.66 to 3.58 ns/el) and the compound normal_cdf(x)*exp(-x^2) by 2.6x (24.59 to 9.40 ns/el) without touching a line of Chelis source.

This is larger than any algorithmic change in Phase 5.

The Chelis C backend emits scalar helpers (normal_cdf, erf, exp) as extern symbols in separate translation units. Under -fPIC -shared defaults, every call goes through the PLT (procedure linkage table), blocking gcc's cross-TU inlining. The LTO flags fix this:

  • -flto enables link-time optimization across TUs
  • -fvisibility=hidden prevents symbol export
  • -Wl,-Bsymbolic resolves symbols at link time, not load time

These flags are a workaround applied in scripts/bench_vs_scipy.py only. They are not applied in the test harness or any consumer-facing build path. The proper fix is upstream: the Chelis compiler should emit scalar helpers as static inline when a reef package is linked into a single shared object.

Measured regression from dropping the flags:

  • v0.1.5: 6.7 to 18.9 ns/el (3x) on compound at n=100k
  • v0.1.6: 9.4 to 20.9 ns/el (2.2x) same kernel

gamma_cdf and student_t_cdf flipped from 40-100x slower to 2.3x faster than scipy at large n after replacing hardcoded 200-iteration series/continued-fraction loops with cephes-style convergence checks.

The key correctness detail: relative-with-floor termination.

The naive textbook check |term| < eps * |acc| fails when the accumulator starts near zero (e.g., gammainc_series(a, x) with a large and x small). The floor pins the threshold:

scale = max(|acc_next|, 1.0)
converged = |term_next| < eps * scale

This falls back to absolute tolerance until the accumulator grows past unity, then switches to relative. For continued-fraction helpers (gammaq_cf_rec, betacf_rec), no floor is needed: those measure convergence on the correction ratio |delta - 1|, which is order-unity during CF convergence by construction.

Correctness gate: 526/526 scipy-parity assertions preserved. A follow-up red-team round ran 142 additional probes at 10x tighter than golden tolerance across extreme parameter regimes, with zero regressions.

SolverNautilusscipyRatioCaveat
rk4_solve (100 steps)0.43 us1734 us4000xFixed-step RK4 vs adaptive RK45 + Python dispatch
brent root-find0.12 us5.6 us45xCompiled C vs Python brentq wrapper

These numbers measure Python dispatch overhead, not numerical-method quality. scipy's solve_ivp includes adaptive step control, event handling, dense output, and argument validation, none of which Nautilus's fixed-step RK4 does. A fair RK4-vs-RK4 comparison against a hand-written C loop would be much closer to parity.

The 4000x number is the real cost a Python user pays today, but it is not a claim about the numerical method itself.

Small-n everything (n <= 100). Dominated by the ~3 us ctypes dispatch floor, irreducible for any Python-to-C FFI path. Nautilus from Chelis at small n would not pay this cost; the tax is on the bench harness, not the runtime.

No remaining large-n losers after LTO and early-termination fixes.

  • LinAlg small-n (inv_2x2, solve_2x2, cholesky_2x2): need a real chelis_tensor* runtime shim. The current RUNTIME_STUBS return-NULL policy cannot support them.
  • Distribution sampling (normal_sample): same runtime gate.
  • Grad-through-rk4 neural-ODE demo: blocked on tensor runtime.
Terminal window
# Full sweep:
python scripts/bench_vs_scipy.py --sizes 10,100,1000,10000,100000
# Fused compound thesis only:
python scripts/bench_vs_scipy.py --tier 3
# Per-kernel sweeps:
python scripts/bench_vs_scipy.py --tier 2
# Correctness gate (526 scipy-parity assertions):
python scripts/bench_vs_scipy.py --tier 1