Nautilus Developer Benchmarking
Nautilus’s developer_guide/benchmarking/ page documents two complementary Rust benchmark frameworks - Criterion (statistical wall-clock harness with HTML reports and confidence intervals; for anything ≥100 ns or end-to-end scenarios) and iai (deterministic CPU-instruction-counting harness for ultra-fast functions and noise-free CI gating). Each crate under crates/<crate_name>/benches/ keeps <name>_criterion.rs and <name>_iai.rs files registered in Cargo.toml with harness = false. Run with cargo bench -p <crate>, optionally --bench <module>, or make cargo-ci-benches (CI). Profile hot paths with cargo flamegraph --bench <name> -p <crate> --profile bench (Linux uses perf, macOS uses sudo + DTrace). The custom [profile.bench] inherits from release-debugging so flamegraphs show readable symbols. Templates live at docs/dev_templates/{criterion,iai}_template.rs. Ground rules: heavy setup outside b.iter, wrap inputs in std::hint::black_box, group related cases with benchmark_group!, iai functions take no params. For Cortana MK3 this is the framework that decides “is the Python UW ingestor measured-too-slow?” - and the same toolchain backs the M2 “MK3 makes the same decisions as MK2, only faster” claim with numbers instead of vibes.
Why this page exists
nautilus-rust.md left one question open: how do we know when Python is
actually too slow? The answer is “measure with the framework’s own
benchmark suite.” This page makes the benchmark surface explicit - what
exists, where to put a new one, what units to think in, and what the
profile-then-optimize discipline looks like.
For Cortana MK3 this matters in two places:
- The UW Python ingestor. Per nautilus-rust.md, the single most likely Cortana hot path that might force a Rust drop-down is the UW WebSocket adapter under flow bursts. The decision rule is empirical: “Python is fine until measurement says otherwise.” This page is the measurement playbook.
- The M2 milestone claim. “MK3 makes the same decisions as MK2, only faster.” That’s a benchmark assertion. We need a methodology for measuring MK2’s decision latency, expressing it in Nautilus benchmark terms, and demonstrating MK3 is at least as fast end-to-end.
Tooling overview
The benchmarking page is explicit that Nautilus relies on two complementary frameworks because each measures a different thing:
| Framework | What it measures | When to prefer it |
|---|---|---|
| Criterion | Wall-clock run time with statistical confidence intervals; outlier detection; HTML reports. | End-to-end scenarios, anything slower than ~100 ns, visual comparisons across runs. |
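| iai | Exact CPU instruction counts. The page attributes these to hardware counters; the iai crate itself obtains them by running the benchmark under Valgrind’s Cachegrind. Deterministic and noise-free. | Ultra-fast functions, regression gating in CI, micro-benchmarks where wall-clock noise drowns the signal. |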
Most hot code paths warrant both:
- Criterion answers “how long does this take in real time on this machine?” - the question users care about.
- iai answers “did this PR change the work the CPU is doing?” - the question CI cares about, immune to system load and thermal throttling.
Critical caveat from the page: iai numbers are machine-specific. Use iai for delta detection within a single CI environment, never for cross-machine comparison. A laptop and a CI runner will produce different absolute counts even for the same code.
Directory layout
Each crate keeps its performance tests in a local benches/ folder:
crates/<crate_name>/
└── benches/
    ├── foo_criterion.rs   # Criterion group(s)
    └── foo_iai.rs         # iai micro benches
Cargo.toml must list every benchmark explicitly so cargo bench
discovers them:
[[bench]]
name = "foo_criterion"
path = "benches/foo_criterion.rs"
harness = false
[[bench]]
name = "foo_iai"
path = "benches/foo_iai.rs"
harness = false
harness = false is required - both Criterion and iai supply their own
main() via criterion_main! / iai::main!.
Writing Criterion benchmarks
Three rules from the page:
- Heavy setup OUTSIDE the timing loop (b.iter). Setup work contaminates the measurement.
- Wrap inputs/outputs in std::hint::black_box. Without this, the optimizer can hoist or eliminate the call entirely and the benchmark measures nothing.
- Group related cases with benchmark_group! (as the page writes it; in Criterion’s API grouping is done through the Criterion::benchmark_group method) and set throughput or sample_size when the defaults aren’t ideal.
The canonical shape:
use std::hint::black_box;
use criterion::{Criterion, criterion_group, criterion_main};
fn bench_my_algo(c: &mut Criterion) {
    let data = prepare_data(); // heavy setup, done once
    c.bench_function("my_algo", |b| {
        b.iter(|| my_algo(black_box(&data)));
    });
}
criterion_group!(benches, bench_my_algo);
criterion_main!(benches);
Criterion writes HTML reports to target/criterion/. Open
target/criterion/report/index.html to see distributions, outliers, and
historical comparisons (Criterion auto-saves prior runs and shows a
delta on every re-run).
Writing iai benchmarks
iai requires functions that take no parameters and return a value (which can be ignored). Keep them as small as possible - every instruction in the function body is counted, so unrelated work dilutes the signal.
use std::hint::black_box;
fn bench_add() -> i64 {
    let a = black_box(123);
    let b = black_box(456);
    a + b
}
iai::main!(bench_add);
The black_box calls force the optimizer to treat the inputs as
opaque - without them, the compiler could fold the constants at
compile time and iai would have almost nothing left to count.
Running benches locally
| Command | What it does |
|---|---|
| cargo bench -p nautilus-core | Run every bench in a single crate. |
| cargo bench -p nautilus-core --bench time | Run one benchmark module. |
| make cargo-ci-benches | Run the CI performance workflow’s crate set, one crate at a time (avoids the mixed-panic-strategy linker issue when multiple bench binaries link in one process). |
The make cargo-ci-benches target is what the upstream CI uses; reach
for it when you want to reproduce CI numbers locally.
Generating a flamegraph
cargo-flamegraph produces a sampled call-stack profile of a single
benchmark - answers the “where is the time actually going?” question
that Criterion/iai delta-numbers don’t.
Install once per machine:
cargo install flamegraph
Run a specific bench with the symbol-rich bench profile:
# example: the matching benchmark in nautilus-common
cargo flamegraph --bench matching -p nautilus-common --profile bench
Open the generated flamegraph.svg in a browser; zoom into hot
sub-trees by clicking.
Linux
perf must be available:
sudo apt install linux-tools-common linux-tools-$(uname -r)
If you see a perf_event_paranoid error, relax the kernel restriction
for the session:
sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'
A value of 1 is typically enough; reset it to its previous value afterward
(2 is the upstream kernel default; some distributions ship a stricter
setting), or make it permanent via /etc/sysctl.conf.
macOS
DTrace requires root, so cargo flamegraph must be invoked with sudo.
Caveat: sudo cargo flamegraph creates files in target/ owned by
root, which then poison subsequent non-root cargo commands. Either
remove the root-owned files manually or run sudo cargo clean.
sudo cargo flamegraph --bench matching -p nautilus-common --profile bench
Cortana’s primary dev machine is macOS - expect the sudo dance.
The [profile.bench] profile
Bench binaries are compiled with the workspace’s custom [profile.bench],
which inherits from release-debugging:
- Full optimization (release-grade codegen).
- Debug symbols preserved (so flamegraphs show real function names).
- Distinct from [profile.release], which uses panic = "abort" and strips symbols for the production wheel.
This separation is intentional: production binaries stay lean, but
benchmarks remain debuggable. Don’t override it without reading the
workspace Cargo.toml first.
Templates
Ready-to-copy starter files live under:
- docs/dev_templates/criterion_template.rs
- docs/dev_templates/iai_template.rs
Workflow when adding a new benchmark:
- Copy the template: cp docs/dev_templates/criterion_template.rs crates/<crate>/benches/<name>_criterion.rs
- Adjust imports, function names, and the criterion_group! macro.
- Add the [[bench]] stanza to crates/<crate>/Cargo.toml.
- Run cargo bench -p <crate> --bench <name>_criterion (the name registered in the [[bench]] stanza) to verify it is discovered and runs.
- Open target/criterion/report/index.html and confirm the histogram looks well-shaped (no bimodal distributions, no extreme outliers).
Profile-then-optimize discipline
The page doesn’t spell this out as a section, but the tooling choice implies it:
- Don’t optimize without a benchmark. A benchmark is the only honest proof that a change matters. Without one, “this is faster” is a guess.
- Don’t change benchmarks alongside the change you’re measuring. Land the benchmark first (it should pass at current speed), then land the optimization (it should pass faster). Otherwise you can’t tell which line moved the number.
- Flamegraph before refactoring. A flamegraph tells you which function actually dominates. Most “obvious” optimizations are wrong; the dominant cost is usually somewhere boring.
- iai gates regressions in CI; Criterion confirms wins locally. iai’s instruction-count diffs are stable enough to fail a PR. Wall-clock numbers are not - they vary with thermal state, neighboring processes, and the phase of the moon.
What’s already benchmarked upstream
The page references several existing benches by name (without an exhaustive list); the canonical examples to crib from:
- crates/nautilus-common/benches/matching* - the matching engine hot path. The Criterion variant measures end-to-end matching latency; the iai variant guards the inner instruction count.
- crates/nautilus-core/benches/time* - clock and timestamp primitives. Used as the example in the “Single benchmark module” command.
To find the full set: find crates -path '*/benches/*.rs' from a
checkout, or browse the upstream repo’s crates/*/benches/ directories.
Latency budget - what “good” looks like by component
The benchmarking page itself doesn’t publish absolute latency targets,
but the architecture page (nautilus-architecture.md) and the design
philosophy ranking (Reliability > Performance > Modularity > …) imply
the framework’s posture:
| Layer | Order of magnitude (typical, derived from upstream benches and architecture commentary) |
|---|---|
| In-process MessageBus dispatch (single topic, single subscriber) | Sub-microsecond |
| Cache read (cache.quote_tick(instrument_id)) | Single-digit microseconds |
| Tick parse + cache write + publish (data engine path) | Tens of microseconds |
| PyO3 boundary crossing (Python → Rust → Python) | Single-digit microseconds per hop |
| Strategy on_quote_tick Python callback (logic only, no I/O) | Hundreds of microseconds, dominated by Python interpreter overhead |
| Order submit → ExecutionClient send | Tens to hundreds of microseconds (in-process); network-dominated externally |
Cortana’s signal cadence is seconds, not microseconds. The decision loop runs on the human-actionable timescale - 1-second scoring ticks at the fastest. Even a Python callback that takes 1 ms is 3 orders of magnitude faster than the cadence it serves. This is exactly why the PyO3 cost is “dust” relative to the work the engine actually does.
The hot paths where microseconds matter - tick parsing, bar building, book maintenance, matching, accounting - are already Rust and are benchmarked upstream. Cortana doesn’t reimplement them.
Cortana MK3 implications
The UW Python ingestor - concrete benchmark plan
Per nautilus-rust.md: “the candidate Cortana hot path that might
warrant Rust” is the UW WebSocket adapter under flow bursts. The
decision rule is empirical: ship the Python ingestor, measure
under load, drop to Rust only if measurement says so.
Recommended target latency for the UW Python ingestor:
Median end-to-end “WebSocket frame received → CustomData published on MessageBus” latency ≤ 5 ms; p99 ≤ 20 ms; sustained throughput ≥ 200 alerts/second without queue growth.
Reasoning (concrete numbers, derived from Cortana data):
- Cortana’s signal cadence is 1 Hz (1-second scoring tick). Anything < 100 ms is invisible to the scoring engine. We could budget 50 ms comfortably, but tighter is better insurance.
- 5 ms median leaves 200× headroom against the 1 Hz tick. Empirically achievable for Python json.loads + dataclass construction + bus publish on a modern machine.
- 20 ms p99 absorbs GC pauses, OS scheduling jitter, and occasional large-payload alerts without hitting the cadence floor.
- 200 alerts/second sustained is ~10× the typical UW alert rate (observed empirically: UW emits 5-30 alerts/sec during normal flow, with bursts to ~100/sec on event-driven days). 200/sec gives 2× burst headroom. If we exceed this regularly, we’re in Rust-adapter territory.
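The headroom figures in this list can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the midpoint of the 5-30 alerts/sec range and the serial-processing budget are illustrative assumptions, not figures from the source.

```python
# Inputs are the figures quoted in the text.
TICK_PERIOD_MS = 1000.0            # 1 Hz scoring tick
MEDIAN_TARGET_MS = 5.0
P99_TARGET_MS = 20.0
THROUGHPUT_TARGET_PER_S = 200.0    # sustained alerts/second
TYPICAL_RATE_RANGE = (5.0, 30.0)   # alerts/second during normal flow
BURST_RATE_PER_S = 100.0           # alerts/second on event-driven days

median_headroom = TICK_PERIOD_MS / MEDIAN_TARGET_MS          # 200.0
p99_headroom = TICK_PERIOD_MS / P99_TARGET_MS                # 50.0
burst_headroom = THROUGHPUT_TARGET_PER_S / BURST_RATE_PER_S  # 2.0

# Assumption: take the midpoint of the typical range as "the typical rate".
typical_mid = sum(TYPICAL_RATE_RANGE) / 2                    # 17.5
typical_headroom = THROUGHPUT_TARGET_PER_S / typical_mid     # about 11.4

# Assumption: frames are handled one at a time. Sustaining the throughput
# target then requires an average service time at or below this budget.
serial_budget_ms = 1000.0 / THROUGHPUT_TARGET_PER_S          # 5.0

print(median_headroom, p99_headroom, burst_headroom,
      round(typical_headroom, 1), serial_budget_ms)
```

Under the serial-processing assumption the per-frame budget at 200 alerts/second is 5 ms, the same figure as the median latency target, so the throughput target is the tighter of the two constraints if the ingestor ends up single-threaded.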
How to measure it:
- Add a Criterion-style bench at tests/bench/uw_ingest_criterion.py (or its equivalent location once MK3 is structured) that replays a fixture of recorded UW WebSocket frames through the ingestor’s parse-and-publish path. Use pytest-benchmark for the Python side (Criterion is Rust-only); the discipline is the same.
- Capture three statistics: median, p99, and frames-per-second throughput.
- Run on every PR that touches the UW adapter. Fail CI if median exceeds 10 ms or p99 exceeds 50 ms (2× the median target and 2.5× the p99 target, to allow for CI noise).
- Trigger for Rust drop-down: if production-observed p99 exceeds 50 ms during real flow bursts, OR throughput saturates below 200/sec when UW emits at >150/sec, file a project page (projects/uw-rust-adapter.md) and Codex authors a Rust crate at crates/adapters/unusual_whales/. Not before.
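The measurement steps above might be sketched as follows, using only the standard library so the shape is visible without a test harness. Everything named here is hypothetical: parse_and_publish stands in for the real ingestor path, the frames are synthetic rather than recorded UW data, and the loop times per-frame processing only (no queueing), so treat it as an outline of the three statistics and the two decision rules, not as the ingestor benchmark itself.

```python
import json
import statistics
import time

# Thresholds quoted in the text (milliseconds, alerts/second).
CI_MEDIAN_LIMIT_MS = 10.0
CI_P99_LIMIT_MS = 50.0
PROD_P99_TRIGGER_MS = 50.0
THROUGHPUT_TARGET_PER_S = 200.0
BURST_RATE_PER_S = 150.0


def parse_and_publish(frame, sink):
    """Hypothetical stand-in for the ingestor's parse-and-publish path."""
    alert = json.loads(frame)  # parse the frame payload
    sink.append(alert)         # placeholder for the MessageBus publish


def replay(frames):
    """Replay frames one by one; return median, p99 and throughput."""
    sink = []
    latencies_ms = []
    started = time.perf_counter()
    for frame in frames:
        t0 = time.perf_counter_ns()
        parse_and_publish(frame, sink)
        latencies_ms.append((time.perf_counter_ns() - t0) / 1e6)
    elapsed_s = time.perf_counter() - started
    return {
        "median_ms": statistics.median(latencies_ms),
        # quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
        "p99_ms": statistics.quantiles(latencies_ms, n=100)[98],
        "frames_per_s": len(frames) / elapsed_s,
    }


def ci_gate(stats):
    """Return the list of violated CI thresholds; empty means pass."""
    failures = []
    if stats["median_ms"] > CI_MEDIAN_LIMIT_MS:
        failures.append("median above %.0f ms" % CI_MEDIAN_LIMIT_MS)
    if stats["p99_ms"] > CI_P99_LIMIT_MS:
        failures.append("p99 above %.0f ms" % CI_P99_LIMIT_MS)
    return failures


def rust_dropdown_triggered(p99_ms, throughput_per_s, offered_per_s):
    """Apply the production trigger rule from the text."""
    slow_tail = p99_ms > PROD_P99_TRIGGER_MS
    saturated = (offered_per_s > BURST_RATE_PER_S
                 and throughput_per_s < THROUGHPUT_TARGET_PER_S)
    return slow_tail or saturated


# Fixture is built once, outside the timed loop (synthetic frames).
frames = [json.dumps({"ticker": "SPY", "seq": i}) for i in range(1000)]
stats = replay(frames)
print(stats, ci_gate(stats))
```

With pytest-benchmark the timed callable would be handed to the benchmark fixture instead of the manual loop; the thresholds and the trigger rule stay the same.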
What NOT to do: preemptively rewrite UW in Rust. The Nautilus guidance is explicit: layer the Rust hot-path under a Python interface, and only do so when measurement justifies it.
The M2 “MK3 faster than MK2” claim
The MK3 roadmap’s M2 milestone asserts “MK3 makes the same decisions as MK2, only faster.” This is a benchmark claim and needs benchmark evidence. Methodology:
- Define “decision latency” identically for both systems. From the moment a triggering input lands (a UW flow alert, a quote tick, a bar close) to the moment an order is sent to IBKR. End-to-end, wall-clock, on the same hardware, against the same fixture.
- Fixture is the 2026-04-16 chop-day replay (per project_losses_april16_chop) - already an established hard-test case. Replay it through MK2 and through MK3; capture decision latency on every trade.
- Report median, p99, and the latency CDF. If MK3’s CDF stochastically dominates MK2’s (every percentile is at least as fast), the claim is supported. If it’s mixed, dig into the regression percentiles.
- Use Criterion for the Rust-side latencies that MK3 inherits from the framework (cache read, bus dispatch, order routing). These are not new code Cortana is writing - they’re upstream-benchmarked already, and we cite the upstream numbers rather than re-measuring.
- Use pytest-benchmark for the Python-side latencies (strategy callback execution, scoring engine, meta-labeling). These are Cortana’s code and need Cortana benchmarks.
Acceptance criterion for M2: MK3 decision latency p99 ≤ MK2’s p99 on the chop-day replay, AND MK3’s median ≤ MK2’s median. If either percentile regresses, M2 is not done.
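A minimal sketch of the comparison, assuming each replay yields a flat list of per-trade decision latencies in milliseconds. The function names and the synthetic samples are illustrative only; no real MK2 or MK3 measurements are shown.

```python
import statistics


def percentile_cuts(samples_ms):
    """Return the 1st..99th percentile cut points of a latency sample."""
    return statistics.quantiles(samples_ms, n=100, method="inclusive")


def regressed_percentiles(mk2_ms, mk3_ms):
    """Percentiles (1..99) at which MK3 is slower than MK2.

    An empty list means MK3 is at least as fast at every percentile,
    which is the per-percentile dominance check described in the text.
    """
    mk2 = percentile_cuts(mk2_ms)
    mk3 = percentile_cuts(mk3_ms)
    return [p for p, (old, new) in enumerate(zip(mk2, mk3), start=1)
            if new > old]


def m2_accepted(mk2_ms, mk3_ms):
    """Acceptance rule from the text: median and p99 must not regress."""
    mk2 = percentile_cuts(mk2_ms)
    mk3 = percentile_cuts(mk3_ms)
    return mk3[49] <= mk2[49] and mk3[98] <= mk2[98]


# Synthetic samples for illustration only.
mk2 = [10.0 + 0.1 * i for i in range(200)]
mk3_faster = [0.8 * x for x in mk2]                      # uniformly faster
mk3_mixed = [0.5 * x for x in mk2[:190]] + [2.0 * x for x in mk2[190:]]

print(m2_accepted(mk2, mk3_faster), regressed_percentiles(mk2, mk3_faster))
print(m2_accepted(mk2, mk3_mixed), regressed_percentiles(mk2, mk3_mixed))
```

The mixed sample has a better median but a worse tail, which is the case the acceptance rule is meant to reject: the list of regressed percentiles shows where to dig.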
What stays Python forever (and is therefore not benchmark-gated)
Per nautilus-rust.md, strategy logic, configuration, ML inference,
dashboard, and brain integration are Python-fast-enough by design. We
don’t add benchmarks for these; we add functional tests. The
benchmark surface is reserved for hot paths where measurement could
plausibly justify a Rust drop-down - and at the spike stage, that’s
exactly one path: UW ingest.
A note on pytest-benchmark vs Criterion
The benchmarking page covers the Rust benchmark suite. Cortana’s
Python benchmarks need a Python harness. The de facto standard is
pytest-benchmark, which:
- Integrates with pytest (same tests/ discovery).
- Reports min/max/mean/median/std-dev, like Criterion.
- Supports --benchmark-compare for cross-run comparison.
- Has a JSON output mode suitable for CI gating.
The discipline is identical to Criterion’s:
- Heavy setup outside the timed function (use pytest.fixture with scope="module").
- No print or I/O inside the timed call.
- Run the same fixture across PRs and compare deltas, not absolutes.
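The "compare deltas, not absolutes" rule could be sketched as a small check over two saved reports. The dictionary layout below (a top-level benchmarks list whose entries carry a name and a stats.median in seconds) is my understanding of what pytest-benchmark writes with --benchmark-json and should be verified against a real report; the benchmark names, the numbers, and the 10% tolerance are hypothetical.

```python
def medians_by_name(report):
    """Map benchmark name to median seconds (assumed report layout)."""
    return {b["name"]: b["stats"]["median"] for b in report["benchmarks"]}


def regressions(baseline, current, tolerance=0.10):
    """Return {name: relative_change} for benchmarks slower than tolerance.

    Only benchmarks present in both reports are compared, so the result
    is a delta between runs and never an absolute judgement.
    """
    base = medians_by_name(baseline)
    cur = medians_by_name(current)
    slower = {}
    for name in sorted(base.keys() & cur.keys()):
        change = cur[name] / base[name] - 1.0
        if change > tolerance:
            slower[name] = change
    return slower


# Synthetic reports for illustration; real ones would be read with json.load.
baseline = {"benchmarks": [
    {"name": "test_parse_alert", "stats": {"median": 0.0040}},
    {"name": "test_parse_burst", "stats": {"median": 0.0100}},
]}
current = {"benchmarks": [
    {"name": "test_parse_alert", "stats": {"median": 0.0041}},
    {"name": "test_parse_burst", "stats": {"median": 0.0130}},
]}

print(regressions(baseline, current))
```

Both reports should come from the same machine and the same fixture, for the same reason iai counts are only compared within one CI environment.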
If a Python benchmark consistently shows the bottleneck is in the Python layer (not the Rust framework), that’s the empirical signal that authorizes a Rust drop-down - and the workflow shifts to Criterion + iai under the new Rust crate.
See Also
- Nautilus Rust - the “do we need Rust?” question this page operationalizes.
- Nautilus Architecture - the runtime topology whose hot paths these benchmarks cover.
- Nautilus Message Bus - the LMAX-disruptor-lineage spine whose dispatch latency is the fundamental “how fast can this go?” floor.
- Nautilus Cache - the in-process state store whose read/write latencies are upstream-benchmarked.
- Nautilus Developer Guide - parent page for the full developer surface.
- 2026-05-09 Nautilus Spike Plan:
~/conductor/workspaces/cortanaroi-mk2/belo-horizonte/plans/2026-05-09-nautilus-spike.md
Timeline
- 2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 7 (developer guide).