Nautilus Developer Benchmarking

Nautilus’s developer_guide/benchmarking/ page documents two complementary Rust benchmark frameworks - Criterion (statistical wall-clock harness with HTML reports and confidence intervals; for anything ≥100 ns or end-to-end scenarios) and iai (deterministic CPU-instruction-counting harness for ultra-fast functions and noise-free CI gating). Each crate keeps <name>_criterion.rs and <name>_iai.rs files under crates/<crate_name>/benches/, registered in Cargo.toml with harness = false. Run with cargo bench -p <crate>, optionally --bench <module>, or make cargo-ci-benches (CI). Profile hot paths with cargo flamegraph --bench <name> -p <crate> --profile bench (Linux uses perf, macOS uses sudo + DTrace). The custom [profile.bench] inherits release-debugging so flamegraphs show readable symbols. Templates live at docs/dev_templates/{criterion,iai}_template.rs. Ground rules: heavy setup outside b.iter, wrap inputs in std::hint::black_box, group related cases with Criterion’s benchmark_group, iai functions take no parameters. For Cortana MK3 this is the framework that answers “is the Python UW ingestor measurably too slow?” - and the same toolchain backs the M2 “MK3 makes the same decisions as MK2, only faster” claim with numbers instead of vibes.

Why this page exists

nautilus-rust.md left one question open: how do we know when Python is actually too slow? The answer is “measure with the framework’s own benchmark suite.” This page makes the benchmark surface explicit - what exists, where to put a new one, what units to think in, and what the profile-then-optimize discipline looks like.

For Cortana MK3 this matters in two places:

The UW Python ingestor. Per nautilus-rust.md, the single most likely Cortana hot path that might force a Rust drop-down is the UW WebSocket adapter under flow bursts. The decision rule is empirical: “Python is fine until measurement says otherwise.” This page is the measurement playbook.
The M2 milestone claim. “MK3 makes the same decisions as MK2, only faster.” That’s a benchmark assertion. We need a methodology for measuring MK2’s decision latency, expressing it in Nautilus benchmark terms, and demonstrating MK3 is at least as fast end-to-end.

Tooling overview

The benchmarking page is explicit that Nautilus relies on two complementary frameworks because each measures a different thing:

Framework	What it measures	When to prefer it
Criterion	Wall-clock run time with statistical confidence intervals; outlier detection; HTML reports.	End-to-end scenarios, anything slower than ~100 ns, visual comparisons across runs.
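iai	CPU instruction counts (plus estimated cache accesses) collected by running the benchmark under Valgrind’s Cachegrind rather than by timing it; the upstream page describes these as hardware counters. Deterministic and largely noise-free, but Valgrind must be installed.	Ultra-fast functions, regression gating in CI, micro-benchmarks where wall-clock noise drowns the signal.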

Most hot code paths warrant both:

Criterion answers “how long does this take in real time on this machine?” - the question users care about.
iai answers “did this PR change the work the CPU is doing?” - the question CI cares about, immune to system load and thermal throttling.

Critical caveat from the page: iai numbers are machine-specific. Use iai for delta detection within a single CI environment, never for cross-machine comparison. A laptop and a CI runner will produce different absolute counts even for the same code.

Directory layout

Each crate keeps its performance tests in a local benches/ folder:

crates/<crate_name>/
└── benches/
    ├── foo_criterion.rs   # Criterion group(s)
    └── foo_iai.rs         # iai micro benches

Cargo.toml must list every benchmark explicitly so cargo bench discovers them:

[[bench]]
name = "foo_criterion"
path = "benches/foo_criterion.rs"
harness = false
 
[[bench]]
name = "foo_iai"
path = "benches/foo_iai.rs"
harness = false

harness = false is required - both Criterion and iai supply their own main() via criterion_main! / iai::main!.

Writing Criterion benchmarks

Three rules from the page:

Heavy setup OUTSIDE the timing loop (b.iter). Setup work contaminates the measurement.
Wrap inputs/outputs in std::hint::black_box. Without this, the optimizer can hoist or eliminate the call entirely and the benchmark measures nothing.
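Group related cases with a benchmark group (the page writes this as benchmark_group!; in Criterion it is the c.benchmark_group(...) method, not a macro) and set throughput or sample_size when the defaults aren’t ideal.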

The canonical shape:

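use std::hint::black_box;
use criterion::{Criterion, criterion_group, criterion_main};

// Hypothetical stand-ins so the template compiles; replace with the code under test.
fn prepare_data() -> Vec<u64> {
    (0..10_000).collect()
}

fn my_algo(data: &[u64]) -> u64 {
    data.iter().sum()
}

fn bench_my_algo(c: &mut Criterion) {
    let data = prepare_data();  // heavy setup, done once, outside the timed closure

    c.bench_function("my_algo", |b| {
        // black_box keeps the optimizer from specializing on the input
        b.iter(|| my_algo(black_box(&data)));
    });
}

// Registers the group and generates main(), which is why harness = false is needed
criterion_group!(benches, bench_my_algo);
criterion_main!(benches);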

Criterion writes HTML reports to target/criterion/. Open target/criterion/report/index.html to see distributions, outliers, and historical comparisons (Criterion auto-saves prior runs and shows a delta on every re-run).

Writing iai benchmarks

iai requires functions that take no parameters and return a value (which can be ignored). Keep them as small as possible - iai counts every instruction in the function, so any setup or bookkeeping inside it dilutes the signal from the code you actually care about.

use std::hint::black_box;
 
fn bench_add() -> i64 {
    let a = black_box(123);
    let b = black_box(456);
    a + b
}
 
iai::main!(bench_add);

The black_box calls force the optimizer to treat the inputs as opaque - without them, the compiler could fold a + b into a constant at compile time and iai would count almost nothing of the work you meant to measure.

Running benches locally

Command	What it does
`cargo bench -p nautilus-core`	Run every bench in a single crate.
`cargo bench -p nautilus-core --bench time`	Run one benchmark module.
`make cargo-ci-benches`	Run the CI performance workflow’s crate set, one crate at a time (avoids the mixed-panic-strategy linker issue when multiple bench binaries link in one process).

The make cargo-ci-benches target is what the upstream CI uses; reach for it when you want to reproduce CI numbers locally.

Generating a flamegraph

cargo-flamegraph produces a sampled call-stack profile of a single benchmark. It answers the “where is the time actually going?” question that Criterion and iai deltas don’t.

Install once per machine:

cargo install flamegraph

Run a specific bench with the symbol-rich bench profile:

# example: the matching benchmark in nautilus-common
cargo flamegraph --bench matching -p nautilus-common --profile bench

Open the generated flamegraph.svg in a browser; zoom into hot sub-trees by clicking.

Linux

perf must be available:

sudo apt install linux-tools-common linux-tools-$(uname -r)

If you see a perf_event_paranoid error, relax the kernel restriction for the session:

sudo sh -c 'echo 1 > /proc/sys/kernel/perf_event_paranoid'

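A value of 1 is typically enough. Note the previous value first (cat /proc/sys/kernel/perf_event_paranoid; the kernel default is 2, though some distributions ship a stricter setting) and restore it afterward, or make the change permanent via /etc/sysctl.conf.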

macOS

DTrace requires root, so cargo flamegraph must be invoked with sudo. Caveat: sudo cargo flamegraph creates files in target/ owned by root, which then poison subsequent non-root cargo commands. Either remove the root-owned files manually or run sudo cargo clean.

sudo cargo flamegraph --bench matching -p nautilus-common --profile bench

Cortana’s primary dev machine is macOS - expect the sudo dance.

The `[profile.bench]` profile

Bench binaries are compiled with the workspace’s custom [profile.bench], which inherits from release-debugging:

Full optimization (release-grade codegen).
Debug symbols preserved (so flamegraphs show real function names).
Distinct from [profile.release] which uses panic = "abort" and strips symbols for the production wheel.

This separation is intentional: production binaries stay lean, but benchmarks remain debuggable. Don’t override it without reading the workspace Cargo.toml first.

Templates

Ready-to-copy starter files live under:

docs/dev_templates/criterion_template.rs
docs/dev_templates/iai_template.rs

Workflow when adding a new benchmark:

cp docs/dev_templates/criterion_template.rs crates/<crate>/benches/<name>_criterion.rs
Adjust imports, function names, and the criterion_group! macro.
Add the [[bench]] stanza to crates/<crate>/Cargo.toml.
Run cargo bench -p <crate> --bench <name>_criterion (the name declared in the [[bench]] stanza) to verify that Cargo discovers and runs it.
Open target/criterion/report/index.html and confirm the histogram looks well-shaped (no bimodal distributions, no extreme outliers).

Profile-then-optimize discipline

The page doesn’t spell this out as a section, but the tooling choice implies it:

Don’t optimize without a benchmark. A benchmark is the only honest proof that a change matters. Without one, “this is faster” is a guess.
Don’t change benchmarks alongside the change you’re measuring. Land the benchmark first (it should pass at current speed), then land the optimization (it should pass faster). Otherwise you can’t tell which line moved the number.
Flamegraph before refactoring. A flamegraph tells you which function actually dominates. Most “obvious” optimizations are wrong; the dominant cost is usually somewhere boring.
iai gates regressions in CI; Criterion confirms wins locally. iai’s instruction-count diffs are stable enough to fail a PR. Wall-clock numbers are not - they vary with thermal state, neighboring processes, and the phase of the moon.
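As a rough illustration of the last rule, a CI gate on instruction counts can be sketched as below. This is a minimal sketch under stated assumptions, not the upstream CI logic: the counts are assumed to have already been extracted from iai’s output into plain dictionaries, the benchmark names are made up, and the 1% tolerance is an arbitrary placeholder.

```python
def instruction_delta(baseline: int, current: int) -> float:
    """Relative change in instruction count (0.02 means +2%)."""
    if baseline <= 0:
        raise ValueError("baseline count must be positive")
    return (current - baseline) / baseline


def regressions(
    baseline: dict[str, int],
    current: dict[str, int],
    tolerance: float = 0.01,  # hypothetical threshold, not an upstream value
) -> dict[str, float]:
    """Return benchmarks whose instruction count grew by more than `tolerance`."""
    flagged = {}
    for name, base_count in baseline.items():
        if name not in current:
            continue  # benchmark removed or renamed; handle separately
        delta = instruction_delta(base_count, current[name])
        if delta > tolerance:
            flagged[name] = delta
    return flagged


# Hypothetical counts for two benches, before and after a PR.
before = {"bench_add": 1_000, "bench_parse": 20_000}
after = {"bench_add": 1_005, "bench_parse": 21_000}
flagged = regressions(before, after)
```

Here bench_add grows by 0.5% and passes, while bench_parse grows by 5% and is flagged. Both dictionaries have to come from the same CI environment for the comparison to mean anything.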

What’s already benchmarked upstream

The page references several existing benches by name (without an exhaustive list); the canonical examples to crib from:

crates/nautilus-common/benches/matching* - the matching engine hot path. The Criterion variant measures end-to-end matching latency; the iai variant guards the inner instruction count.
crates/nautilus-core/benches/time* - clock and timestamp primitives. Used as the example in the “Single benchmark module” command.

To find the full set: find crates -path '*/benches/*.rs' from a checkout, or browse the upstream repo’s crates/*/benches/ directories.

Latency budget - what “good” looks like by component

The benchmarking page itself doesn’t publish absolute latency targets, but the architecture page (nautilus-architecture.md) and the design philosophy ranking (Reliability > Performance > Modularity > …) imply the framework’s posture:

Layer	Order of magnitude (typical, derived from upstream benches and architecture commentary)
In-process MessageBus dispatch (single topic, single subscriber)	Sub-microsecond
Cache read (`cache.quote_tick(instrument_id)`)	Single-digit microseconds
Tick parse + cache write + publish (data engine path)	Tens of microseconds
PyO3 boundary crossing (Python → Rust → Python)	Single-digit microseconds per hop
Strategy `on_quote_tick` Python callback (logic only, no I/O)	Hundreds of microseconds, dominated by Python interpreter overhead
Order submit → ExecutionClient send	Tens to hundreds of microseconds (in-process); network-dominated externally

Cortana’s signal cadence is seconds, not microseconds. The decision loop runs on the human-actionable timescale - 1-second scoring ticks at the fastest. Even a Python callback that takes 1 ms is 3 orders of magnitude faster than the cadence it serves. This is exactly why the PyO3 cost is “dust” relative to the work the engine actually does.

The hot paths where microseconds matter - tick parsing, bar building, book maintenance, matching, accounting - are already Rust and are benchmarked upstream. Cortana doesn’t reimplement them.

Cortana MK3 implications

The UW Python ingestor - concrete benchmark plan

Per nautilus-rust.md: “the candidate Cortana hot path that might warrant Rust” is the UW WebSocket adapter under flow bursts. The decision rule is empirical: ship the Python ingestor, measure under load, drop to Rust only if measurement says so.

Recommended target latency for the UW Python ingestor:

Median end-to-end “WebSocket frame received → CustomData published on MessageBus” latency ≤ 5 ms; p99 ≤ 20 ms; sustained throughput ≥ 200 alerts/second without queue growth.

Reasoning (concrete numbers, derived from Cortana data):

Cortana’s signal cadence is 1 Hz (1-second scoring tick). Anything < 100 ms is invisible to the scoring engine. We could budget 50 ms comfortably, but tighter is better insurance.
5 ms median leaves 200× headroom against the 1 Hz tick. Empirically achievable for Python json.loads + dataclass construction + bus publish on a modern machine.
20 ms p99 absorbs GC pauses, OS scheduling jitter, and occasional large-payload alerts without hitting the cadence floor.
200 alerts/second sustained is roughly 7-40× the typical UW alert rate, or ~10× a mid-range 20/sec (observed empirically: UW emits 5-30 alerts/sec during normal flow, with bursts to ~100/sec on event-driven days). 200/sec gives 2× headroom over those bursts. If we exceed this regularly, we’re in Rust-adapter territory.
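The headroom figures above are plain ratios, reproduced here so they can be re-derived if any target changes. A minimal sketch: the constants are the numbers quoted in the reasoning, and the variable names are only illustrative.

```python
# Figures quoted in the reasoning above.
TICK_PERIOD_MS = 1_000.0            # 1 Hz scoring tick
MEDIAN_TARGET_MS = 5.0
P99_TARGET_MS = 20.0
THROUGHPUT_TARGET_PER_SEC = 200.0
TYPICAL_RATE_PER_SEC = (5.0, 30.0)  # observed normal-flow range
BURST_RATE_PER_SEC = 100.0


def headroom(budget: float, cost: float) -> float:
    """How many times `cost` fits into `budget`."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return budget / cost


median_headroom = headroom(TICK_PERIOD_MS, MEDIAN_TARGET_MS)              # 200.0
p99_headroom = headroom(TICK_PERIOD_MS, P99_TARGET_MS)                    # 50.0
burst_headroom = headroom(THROUGHPUT_TARGET_PER_SEC, BURST_RATE_PER_SEC)  # 2.0
# Headroom over the low and high ends of the typical range.
typical_headroom = tuple(
    headroom(THROUGHPUT_TARGET_PER_SEC, rate) for rate in TYPICAL_RATE_PER_SEC
)
```

The typical-rate headroom spans about 6.7× (at 30/sec) to 40× (at 5/sec), which is where the order-of-magnitude figure for normal flow comes from.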

How to measure it:

Add a benchmark at tests/bench/uw_ingest_criterion.py (or its equivalent location once MK3 is structured) that replays a fixture of recorded UW WebSocket frames through the ingestor’s parse-and-publish path. Criterion is Rust-only, so the Python side uses pytest-benchmark as the harness; the discipline is the same.
Capture three statistics: median, p99, and frames-per-second throughput.
Run on every PR that touches the UW adapter. Fail CI if median exceeds 10 ms or p99 exceeds 50 ms (2× the median target and 2.5× the p99 target, to allow for CI noise).
Trigger for Rust drop-down: if production-observed p99 exceeds 50 ms during real flow bursts, OR throughput saturates below 200/sec when UW emits at >150/sec, file a project page (projects/uw-rust-adapter.md) and Codex authors a Rust crate at crates/adapters/unusual_whales/. Not before.
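The three statistics and the CI thresholds in the steps above can be sketched as follows. The summary statistics listed for pytest-benchmark elsewhere on this page do not include p99, so this sketch assumes the replay records raw per-frame latencies itself. The class and function names are hypothetical, and nearest-rank is just one reasonable percentile definition.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySummary:
    median_ms: float
    p99_ms: float
    frames_per_sec: float


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: list[float], wall_seconds: float) -> LatencySummary:
    """Reduce one replay run to the three statistics the plan asks for."""
    return LatencySummary(
        median_ms=statistics.median(latencies_ms),
        p99_ms=percentile(latencies_ms, 99),
        frames_per_sec=len(latencies_ms) / wall_seconds,
    )


def ci_gate(
    summary: LatencySummary,
    median_limit_ms: float = 10.0,  # CI limit from the plan
    p99_limit_ms: float = 50.0,     # CI limit from the plan
) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    if summary.median_ms > median_limit_ms:
        failures.append(f"median {summary.median_ms:.2f} ms > {median_limit_ms} ms")
    if summary.p99_ms > p99_limit_ms:
        failures.append(f"p99 {summary.p99_ms:.2f} ms > {p99_limit_ms} ms")
    return failures


# Made-up run: 100 frames replayed in 0.5 s, with two slow outliers.
run = summarize([1.0] * 98 + [30.0, 60.0], wall_seconds=0.5)
failures = ci_gate(run)
```

With 100 samples the nearest-rank p99 is the 99th smallest value, so the single 60 ms outlier does not trip the gate but a second one would.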

What NOT to do: preemptively rewrite UW in Rust. The Nautilus guidance is explicit: layer the Rust hot-path under a Python interface, and only do so when measurement justifies it.

The M2 “MK3 faster than MK2” claim

The MK3 roadmap’s M2 milestone asserts “MK3 makes the same decisions as MK2, only faster.” This is a benchmark claim and needs benchmark evidence. Methodology:

Define “decision latency” identically for both systems. From the moment a triggering input lands (a UW flow alert, a quote tick, a bar close) to the moment an order is sent to IBKR. End-to-end, wall-clock, on the same hardware, against the same fixture.
Fixture is the 2026-04-16 chop-day replay (per project_losses_april16_chop) - already an established hard-test case. Replay it through MK2 and through MK3; capture decision latency on every trade.
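Report median, p99, and the latency CDF. If MK3 is at least as fast as MK2 at every percentile (equivalently, MK3’s latency CDF lies on or above MK2’s at every latency value, so MK3’s latency is stochastically smaller), the claim is supported. If the comparison is mixed, dig into the percentiles where MK3 regresses.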
Use Criterion for the Rust-side latencies that MK3 inherits from the framework (cache read, bus dispatch, order routing). These are not new code Cortana is writing - they’re upstream-benchmarked already, and we cite the upstream numbers rather than re-measuring.
Use pytest-benchmark for the Python-side latencies (strategy callback execution, scoring engine, meta-labeling). These are Cortana’s code and need Cortana benchmarks.

Acceptance criterion for M2: MK3 decision latency p99 ≤ MK2’s p99 on the chop-day replay, AND MK3’s median ≤ MK2’s median. If either percentile regresses, M2 is not done.
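The percentile-by-percentile comparison and the acceptance criterion can be sketched as below. This is an illustrative sketch, assuming both replays yield a flat list of per-trade decision latencies in milliseconds; the function names and the 1-99 percentile grid are assumptions, not part of the roadmap.

```python
import math
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def regressing_percentiles(mk2_ms: list[float], mk3_ms: list[float]) -> list[int]:
    """Percentiles (1-99) at which MK3 is slower than MK2; empty means MK3 is never slower."""
    return [
        pct
        for pct in range(1, 100)
        if percentile(mk3_ms, pct) > percentile(mk2_ms, pct)
    ]


def m2_accepted(mk2_ms: list[float], mk3_ms: list[float]) -> bool:
    """Acceptance rule from the text: MK3 median <= MK2 median AND MK3 p99 <= MK2 p99."""
    return (
        statistics.median(mk3_ms) <= statistics.median(mk2_ms)
        and percentile(mk3_ms, 99) <= percentile(mk2_ms, 99)
    )


# Made-up latencies, purely to exercise the functions.
mk2 = [10.0, 20.0, 30.0, 40.0]
mk3_good = [5.0, 15.0, 25.0, 35.0]
mk3_tail = [5.0, 15.0, 25.0, 50.0]  # faster in the body, slower in the tail
```

mk3_good passes on both counts. mk3_tail has a better median but regresses from the 76th percentile upward, so it fails the p99 half of the acceptance rule, which is exactly the mixed case that calls for digging into the regressing percentiles.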

What stays Python forever (and is therefore not benchmark-gated)

Per nautilus-rust.md, strategy logic, configuration, ML inference, dashboard, and brain integration are Python-fast-enough by design. We don’t add benchmarks for these; we add functional tests. The benchmark surface is reserved for hot paths where measurement could plausibly justify a Rust drop-down - and at the spike stage, that’s exactly one path: UW ingest.

A note on `pytest-benchmark` vs Criterion

The benchmarking page covers the Rust benchmark suite. Cortana’s Python benchmarks need a Python harness. The de facto standard is pytest-benchmark, which:

Integrates with pytest (same tests/ discovery).
Reports min/max/mean/median/std-dev, like Criterion.
Supports --benchmark-compare for cross-run comparison.
Has a JSON output mode suitable for CI gating.

The discipline is identical to Criterion’s:

Heavy setup outside the timed function (use pytest.fixture with scope="module").
No print or I/O inside the timed call.
Run the same fixture across PRs and compare deltas, not absolutes.

If a Python benchmark consistently shows the bottleneck is in the Python layer (not the Rust framework), that’s the empirical signal that authorizes a Rust drop-down - and the workflow shifts to Criterion + iai under the new Rust crate.

Timeline

2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 7 (developer guide).
