Nautilus Trader - Testing (Developer Guide)

Tests in Nautilus are executable specifications, not “coverage”. The developer-guide testing page defines a mechanism ladder (unit → parametrized → property → integration → fuzz → spec acceptance → DST → formal) and a projection rule that maps module shape (pure/stateful, sync/async, I/O-bound) onto the rungs of the ladder that actually pay off. There are two physical Python suites: tests/ (legacy v1 Cython, run with make pytest) and python/tests/ (v2 PyO3-backed, run with make pytest-v2). Three foundational rules drive every Cortana MK3 test: (1) use pytest free functions plus fixtures, with no test classes in python/tests/; (2) treat TestInstrumentProvider / TestDataProvider from python/tests/providers.py as the canonical instrument and bar/tick factories and never roll your own; (3) never capture log output to assert behaviour, and never use pytest.raises(BaseException) against PyO3 panic paths. Async waits use await eventually(...) from nautilus_trader.test_kit.functions, not time.sleep. The BacktestEngine is the deterministic harness: wire actor + strategy + data, call engine.run(), then assert against engine.trader, engine.cache, and engine.portfolio. The MK3 milestones M1 (50+ tests) and M2 (1000+ tests) are only reachable if Cortana’s tests are Nautilus-shaped from the first commit. This page is the contract for that shape.

Source URL availability - Spike Step 0 result

URL	Status	Implication
`/developer_guide/testing/`	200 OK	Authoritative testing recipe
`/developer_guide/spec_data_testing/`	200 OK (parallel page)	Filed separately as `nautilus-dev-spec-data-testing.md`
`/developer_guide/spec_exec_testing/`	200 OK (parallel page)	Filed separately as `nautilus-dev-spec-exec-testing.md`

This page covers the general testing mechanism ladder and Python test conventions. The two spec pages cover live-venue acceptance matrices (DataTester / ExecTester) - separate filings because they’re each large enough to deserve their own concept page.

TL;DR - what every Cortana MK3 test must do

# tests/unit_tests/test_scoring_actor.py  (Cortana tree, outside the Nautilus repo)
from decimal import Decimal
 
import pytest
 
from nautilus_trader.backtest.engine import BacktestEngine, BacktestEngineConfig
from nautilus_trader.model.enums import OmsType, AccountType
from nautilus_trader.model.identifiers import Venue
from nautilus_trader.model.objects import Money, Currency
from nautilus_trader.test_kit.providers import TestInstrumentProvider
from nautilus_trader.test_kit.stubs.data import TestDataStubs
 
from cortana.nautilus.scoring_actor import ScoringActor, ScoringActorConfig
from cortana.nautilus.custom_data import UWFlowAlert, ScoreUpdate
 
 
@pytest.fixture
def spy() -> "Equity":
    return TestInstrumentProvider.equity(symbol="SPY", venue="ARCA")
 
 
@pytest.fixture
def engine(spy) -> BacktestEngine:
    engine = BacktestEngine(config=BacktestEngineConfig(trader_id="TESTER-001"))
    engine.add_venue(
        venue=Venue("ARCA"),
        oms_type=OmsType.NETTING,
        account_type=AccountType.MARGIN,
        starting_balances=[Money(1_000_000, Currency.from_str("USD"))],
    )
    engine.add_instrument(spy)
    yield engine
    engine.dispose()
 
 
def test_scoring_actor_publishes_score_on_uw_flow(engine, spy):
    actor = ScoringActor(
        ScoringActorConfig(
            instrument_id=spy.id,
            bar_type=TestDataStubs.bartype_spy_1min_last(),
            score_threshold=65,
        )
    )
    engine.add_actor(actor)
 
    # Inject one synthetic UW flow alert + one bar
    alert = UWFlowAlert(
        instrument_id=spy.id,
        side="CALL",
        strike=580.0,
        premium=12_000.0,
        aggressor="BUY",
        confidence=0.9,
        ts_event=0,
        ts_init=0,
    )
    engine.add_data([alert])
    engine.add_data([TestDataStubs.bar_5decimal()])
 
    engine.run()
 
    # Assert observable behaviour on the cache, not log output
    scores = engine.cache.custom_data(ScoreUpdate)
    assert len(scores) >= 1
    assert scores[-1].bias == "BULL"

What this demonstrates:

pytest free function, not a TestCase.
TestInstrumentProvider for the instrument; never hand-roll.
TestDataStubs for canonical Bars/QuoteTicks.
BacktestEngine as the harness - same code as live; just swap data.
yield fixture with engine.dispose() teardown.
Assert against engine.cache (observable state), not on log strings.

Every Cortana MK3 test that isn’t a pure-function unit test follows this shape. Deviations cost more than they save.

Test layout - where things live

Two physical suites

nautilus_trader/                    # repo root
├── tests/                          # v1 LEGACY (Cython package)
│   ├── unit_tests/
│   ├── integration_tests/
│   └── performance_tests/
└── python/                         # v2 PyO3 path (THE ONE Cortana uses)
    ├── tests/
    │   ├── unit_tests/
    │   │   ├── common/             # Actor, MessageBus, Cache
    │   │   ├── execution/          # OrderFactory, RiskEngine
    │   │   ├── model/              # Price, Quantity, InstrumentId
    │   │   └── ...
    │   ├── integration_tests/      # Multi-component
    │   ├── performance_tests/      # Benchmarks (run separately)
    │   ├── providers.py            # TestInstrumentProvider, TestDataProvider
    │   └── conftest.py             # Shared pytest fixtures
    └── .venv/                      # Suite-specific virtualenv

Run commands (verbatim from the testing page):

# v2 - what Cortana MK3 uses
make pytest-v2
# or directly (won't pick up isolation):
uv run --active --no-sync pytest python/tests/...
 
# v1 legacy - Cortana does NOT use; included for completeness
make pytest
 
# Performance (run in isolation; never alongside unit tests)
make test-performance

The make pytest-v2 target isolates certain test modules in separate pytest processes to avoid global Rust state conflicts. Do not invoke pytest directly on the whole v2 suite; per-file invocation is fine for local iteration but full-suite runs go through Make.

Cortana MK3 mirror

Cortana lives outside the Nautilus tree, but its test layout mirrors the Nautilus one, adding property_tests/ and replay_tests/ directories:

cortana-mk3/
├── cortana/
│   ├── nautilus/
│   │   ├── scoring_actor.py
│   │   ├── meta_gate_actor.py
│   │   ├── ema_decay_actor.py
│   │   ├── regime_detector_actor.py
│   │   ├── cortana_strategy.py
│   │   ├── custom_data.py           # @customdataclass definitions
│   │   └── risk_rules.py            # custom RiskEngine rules
│   └── ...
└── tests/
    ├── unit_tests/
    │   ├── test_scoring_actor.py
    │   ├── test_meta_gate_actor.py
    │   ├── test_ema_decay_actor.py
    │   ├── test_cortana_strategy.py
    │   ├── test_uw_data_client.py
    │   ├── test_custom_data.py
    │   └── test_risk_rules.py
    ├── integration_tests/
    │   ├── test_actor_strategy_pipeline.py
    │   ├── test_full_session_replay.py
    │   └── test_eod_market_exit.py
    ├── property_tests/
    │   └── test_score_invariants.py
    ├── replay_tests/
    │   ├── fixtures/
    │   │   ├── 2026-04-16-chop-day.parquet
    │   │   └── 2026-05-06-power-outage.parquet
    │   └── test_replay_fixtures.py
    ├── conftest.py
    └── fixtures/
        ├── synthetic_chains.py
        └── decisions_db_seed.py

Mapping rule: every cortana/nautilus/<X>.py has at least one tests/unit_tests/test_<X>.py. This drives M1’s 50+ test target almost mechanically - 7 actors + 1 strategy + 1 data client + 1 custom-data + 1 risk-rules module = 11 modules × ~5 unit tests each = 55 tests. M2’s 1000+ comes from parametrized expansions, property tests, and full-session replays (each replay fixture asserts dozens of invariants).
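
The mapping rule is mechanical enough to check in code. The sketch below is a hypothetical helper (the name `missing_unit_tests` and its arguments are assumptions, not part of Nautilus or Cortana) that lists modules with no matching `test_<X>.py`:

```python
from pathlib import Path


def missing_unit_tests(src_dir: Path, test_dir: Path) -> list[str]:
    """Return module names under src_dir that have no test_<name>.py in test_dir."""
    missing = []
    for module in sorted(src_dir.glob("*.py")):
        if module.name == "__init__.py":
            continue  # package marker, nothing to test
        expected = test_dir / f"test_{module.stem}.py"
        if not expected.exists():
            missing.append(module.stem)
    return missing
```

Pointed at cortana/nautilus/ and tests/unit_tests/, an empty result means the rule holds; a non-empty result names the modules that still need a test file.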

Mechanism ladder - when each layer applies

The developer guide is explicit: start at the lowest layer that proves what matters; climb only when the layer below stops detecting regressions or when the input space outgrows hand-picked cases.

Layer	Trigger	Cortana example
Unit	Single function, enumerable cases	`_conviction_bucket(120) == "HIGH"`
Parametrized	Same shape across discrete inputs	`meta_gate(prob)` over 0.1, 0.4, 0.55, 0.9
Property	Invariant must hold for whole class of inputs	EMA decay always in [0, ∞), monotonic decline
Integration	Multiple modules through real engine/runtime	Actor publishes → MessageBus → Strategy receives → Strategy submits order
Fuzz	Untrusted bytes cross parser/decoder	UW WebSocket JSON adversarial inputs
Spec acceptance	Behaviour depends on live venue contract	`DataTester` against IBKR paper; `ExecTester` against IBKR paper
DST	Correctness depends on task scheduling/ordering/timeouts	Multi-actor publish-order under wall-clock skew
Formal	Pure function + crisp invariants + bounded input space	(aspirational; no Kani/Prusti yet in workspace)

Projection rule - module shape determines which rungs pay off:

Module shape	Layers that apply
Pure function, crisp invariants	Unit, parametrized, property, fuzz, formal (bounded input space)
Pure function, no stated invariants	Unit, parametrized, fuzz
Stateful, synchronous	Unit, parametrized, property over transitions
Stateful, async	Unit, integration, deterministic simulation
I/O-bound, venue contract	Integration, spec acceptance, boundary fuzz

For Cortana MK3:

scoring_actor composite math - pure function once flow_pressure is passed in: unit + parametrized + property.
scoring_actor flow-decay timer - stateful sync: unit + property over transitions.
cortana_strategy.on_data - stateful async: unit + integration via BacktestEngine + (eventually) DST.
uw_data_client.parse_alert - pure parser: unit + parametrized + fuzz against malformed UW WS payloads.
uw_data_client WebSocket loop - I/O bound venue contract: integration via mock-Axum + spec acceptance once UW publishes a contract.

When NOT to add coverage (verbatim from the page)

Don’t add debug_assert! where no test reaches it. Release builds strip.
Prefer a proptest over hand-written edge-case tests when the invariant spans a class of inputs.
Don’t duplicate a live spec acceptance card as an integration test - link.
Don’t pad coverage with tests that assert language guarantees (Option::is_some after Some(...)).

Fixture conventions

1. `TestInstrumentProvider` - canonical instruments

Lives at python/tests/providers.py. Every standard instrument has a factory:

from datetime import date
 
from nautilus_trader.model.enums import OptionKind
from nautilus_trader.test_kit.providers import TestInstrumentProvider
 
# Equities
spy = TestInstrumentProvider.equity(symbol="SPY", venue="ARCA")
aapl = TestInstrumentProvider.equity(symbol="AAPL", venue="NASDAQ")
 
# Options (Cortana 0DTE case)
spy_call_580_today = TestInstrumentProvider.option_contract(
    symbol="SPY", strike=580.0, expiry=date.today(),
    option_kind=OptionKind.CALL, venue="OPRA",
)
 
# FX, futures, crypto perpetuals - all available
fx = TestInstrumentProvider.default_fx_ccy("EUR/USD")
btcusdt = TestInstrumentProvider.btcusdt_perp_binance()

Rule: never instantiate Equity / OptionContract directly in a test. Always go through TestInstrumentProvider. If the instrument isn’t in the provider, add it to the provider first (one PR), then use it.

2. `TestDataProvider` / `TestDataStubs` - canonical data

from nautilus_trader.test_kit.stubs.data import TestDataStubs
 
bar = TestDataStubs.bar_5decimal()              # canonical Bar
quote = TestDataStubs.quote_tick()              # canonical QuoteTick
trade = TestDataStubs.trade_tick()              # canonical TradeTick
deltas = TestDataStubs.order_book_deltas()      # canonical OrderBookDeltas

Stubs ship reasonable defaults (timestamps, prices, sizes) so tests stay short. Override fields you actually care about; let the rest default.

3. `conftest.py` - shared per-suite fixtures

python/tests/conftest.py provides:

clock - a TestClock with set_time(ns) and advance_time(ns).
cache - a fresh Cache() instance.
msgbus - a fresh MessageBus() instance.
logger - a no-op test logger.

Cortana’s tests/conftest.py extends with project-specific fixtures:

# tests/conftest.py
import pytest
from datetime import date
 
from nautilus_trader.model.enums import OptionKind
from nautilus_trader.test_kit.providers import TestInstrumentProvider
from nautilus_trader.test_kit.stubs.identifiers import TestIdStubs
 
 
@pytest.fixture
def spy():
    return TestInstrumentProvider.equity(symbol="SPY", venue="ARCA")
 
@pytest.fixture
def spy_call_580():
    # A fixed expiry date keeps the fixture deterministic across runs
    return TestInstrumentProvider.option_contract(
        symbol="SPY", strike=580.0, expiry=date(2026, 5, 7),
        option_kind=OptionKind.CALL, venue="OPRA",
    )
 
@pytest.fixture
def trader_id():
    return TestIdStubs.trader_id()
 
@pytest.fixture
def synthetic_uw_alert(spy):
    """Single canonical UW flow alert for unit tests."""
    from cortana.nautilus.custom_data import UWFlowAlert
    return UWFlowAlert(
        instrument_id=spy.id, side="CALL", strike=580.0,
        premium=12_000.0, aggressor="BUY", confidence=0.9,
        ts_event=0, ts_init=0,
    )

The fixture naming convention mirrors Nautilus’s: noun for the thing itself (spy, trader_id), descriptive prefix for variants (synthetic_uw_alert, bull_score_update).

4. Yield fixtures for engine teardown

@pytest.fixture
def engine():
    engine = BacktestEngine(config=BacktestEngineConfig(trader_id="TESTER-001"))
    yield engine
    engine.dispose()    # MUST run; otherwise Rust resources leak

yield fixtures are required wherever the resource has a dispose() / close() / disconnect() method. Without dispose(), Rust-side resources (open sockets, allocated arenas) survive the test and cause state bleed between cases.

`BacktestEngine` as a test harness

The most underused fact about Nautilus testing: the BacktestEngine is the test harness. Same engine code in backtest, sandbox, and live - there is no separate “test engine”. Wire your actors + strategies, feed data through engine.add_data(...), call engine.run(), assert against engine.cache / engine.portfolio / engine.trader.

from nautilus_trader.backtest.engine import BacktestEngine, BacktestEngineConfig
from nautilus_trader.model.enums import AccountType, OmsType
from nautilus_trader.model.identifiers import Venue
from nautilus_trader.model.objects import Money, Currency
 
 
def test_cortana_strategy_submits_bracket_on_score(spy, spy_call_580):
    engine = BacktestEngine(
        config=BacktestEngineConfig(trader_id="TESTER-001"),
    )
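    # Each instrument's venue must be registered before add_instrument()
    for venue_name in ("ARCA", "OPRA"):
        engine.add_venue(
            venue=Venue(venue_name),
            oms_type=OmsType.HEDGING,           # 0DTE options: hedging mode
            account_type=AccountType.MARGIN,
            starting_balances=[Money(50_000, Currency.from_str("USD"))],
        )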
    engine.add_instrument(spy)
    engine.add_instrument(spy_call_580)
 
    actor = ScoringActor(ScoringActorConfig(instrument_id=spy.id, ...))
    strategy = CortanaStrategy(CortanaConfig(underlying_id=spy.id, ...))
    engine.add_actor(actor)
    engine.add_strategy(strategy)
 
    # Feed a sequence: bar → flow alert → bar (to trigger score crossing)
    engine.add_data([bar1, flow_alert, bar2])
 
    engine.run()
 
    # Observable assertions
    orders = engine.cache.orders()
    assert len(orders) == 3       # entry + TP + SL bracket legs
    parent = next(o for o in orders if o.parent_order_id is None)   # entry leg has no parent
    assert parent.order_type.name == "MARKET"
    assert parent.instrument_id == spy_call_580.id

Why this is the right harness:

Deterministic. Same-input → same-output; replay-stable.
Real engine code. No mocks of MessageBus, Cache, RiskEngine.
Same tests cover live behaviour. What passes in backtest also covers the live path (modulo I/O latency and adapter quirks).
engine.cache.orders() / .positions() is the assertion surface. Don’t poke into private state.

Per-test engine vs session engine

For unit-shaped tests, use a per-test engine via a yield fixture. For integration tests that exercise several sequenced runs (e.g., “reset between sessions” semantics), use a session-scoped engine and call engine.reset() between cases. The page names reset as the supported between-run hook; it is what fires each Actor’s on_reset lifecycle hook.

Mock objects - TestClock, MessageBus, Cache, in-memory adapters

Hand-written stubs > mocking frameworks

Verbatim from the page: “Prefer hand-written stubs that return fixed values over mocking frameworks. Use MagicMock only when you need to assert call counts/arguments or simulate complex state changes. Avoid mocking the objects you’re actually testing.”

For Cortana that means:

✅ Hand-written FakeUWWebSocket that emits a deterministic alert sequence - perfect for test_uw_data_client.py.
✅ MagicMock on RiskEngine.size_position to assert it was called with the expected meta-prob - only for that one assertion.
❌ Don’t mock MessageBus. Use the real one - it’s already in-memory and deterministic in tests.
❌ Don’t mock Cache. Same reason.
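
A hand-written stub of this kind can be very small. The sketch below is one possible shape for FakeUWWebSocket, assuming the client only needs `send()` and async iteration over text frames; the subscribe message and the `collect_sides` read loop are invented for illustration and are not the real UW protocol or client:

```python
import asyncio
import json


class FakeUWWebSocket:
    """Hand-written stub: replays a scripted list of messages, then closes."""

    def __init__(self, script: list[dict]) -> None:
        self._frames = [json.dumps(message) for message in script]
        self.sent: list[str] = []  # frames the client wrote, kept for assertions

    async def send(self, frame: str) -> None:
        self.sent.append(frame)

    def __aiter__(self):
        return self._replay()

    async def _replay(self):
        for frame in self._frames:
            await asyncio.sleep(0)  # yield to the event loop like a real socket
            yield frame


async def collect_sides(ws) -> list[str]:
    """Stand-in for the client read loop: parse each frame, keep the side."""
    # Hypothetical subscribe frame; the real message shape is not specified here.
    await ws.send(json.dumps({"action": "subscribe", "channel": "flow_alerts"}))
    return [json.loads(frame)["side"] async for frame in ws]


script = [
    {"type": "flow_alert", "underlying": "SPY", "side": "CALL"},
    {"type": "flow_alert", "underlying": "SPY", "side": "PUT"},
]
ws = FakeUWWebSocket(script)
sides = asyncio.run(collect_sides(ws))
```

The stub returns fixed values in a fixed order, so the test needs no mocking framework and no call-count assertions; `ws.sent` is there only for the cases where the outgoing frame itself is the behaviour under test.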

`TestClock` - controllable wall-clock for handlers

from nautilus_trader.common.component import TestClock
 
def test_ema_decay_actor_decays_on_timer_tick():
    clock = TestClock()
    clock.set_time(0)
    # ... wire actor with this clock via msgbus/component init
    actor.start()
 
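    # advance_time() returns the handlers for timers that came due; call them to fire
    for handler in clock.advance_time(1_000_000_000):   # advance to t = 1 second
        handler.handle()
    # decay timer has fired; assert state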
    assert actor.flow_pressure == pytest.approx(initial * 0.977, rel=1e-3)

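TestClock exposes set_time(ns) and advance_time(to_time_ns, set_time=True). advance_time() takes an absolute target timestamp rather than a delta and returns the handlers for any timers that came due, which the test must then call itself. Never call time.time() or datetime.now() from production Actor/Strategy code - the test then can’t drive the clock.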
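
The reason for the rule is easiest to see with a toy clock. The sketch below does not use Nautilus; `ManualClock` and `FlowPressure` are hypothetical stand-ins that show how code reading time only from an injected clock becomes fully drivable from a test:

```python
class ManualClock:
    """Stand-in for a test clock: time moves only when the test says so."""

    def __init__(self, start_ns: int = 0) -> None:
        self._now_ns = start_ns

    def timestamp_ns(self) -> int:
        return self._now_ns

    def set_time(self, to_time_ns: int) -> None:
        self._now_ns = to_time_ns


class FlowPressure:
    """Exponentially decaying value that reads time only from the injected clock."""

    def __init__(self, clock, half_life_secs: float) -> None:
        self._clock = clock
        self._half_life_ns = half_life_secs * 1_000_000_000
        self._value = 0.0
        self._updated_ns = clock.timestamp_ns()

    def add(self, amount: float) -> None:
        # Decay the old value up to now, then add the new contribution.
        self._value = self.current() + amount
        self._updated_ns = self._clock.timestamp_ns()

    def current(self) -> float:
        elapsed_ns = self._clock.timestamp_ns() - self._updated_ns
        return self._value * 0.5 ** (elapsed_ns / self._half_life_ns)


clock = ManualClock()
pressure = FlowPressure(clock, half_life_secs=30.0)
pressure.add(100.0)
clock.set_time(30_000_000_000)  # exactly one half-life later
halved = pressure.current()
```

Had `FlowPressure` called time.time() internally, the only way to test one half-life of decay would be to wait thirty real seconds.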

Adapter mocking for venue-contract tests

Spec acceptance tests use the real venue. For below-the-spec tests (parsing, request signing, retry semantics), Nautilus uses mock Axum servers in Rust and aiohttp test servers in Python. For Cortana V1: fake_uw_ws_server.py running an aiohttp.web echo with a scripted message log is the right shape.

Test data factories

Three escalating mechanisms - pick the smallest that works:

1. Inline literals (for one-off cases)

def test_meta_gate_blocks_low_prob(spy):
    update = MetaProbUpdate(instrument_id=spy.id, prob=0.42, ts_event=0, ts_init=0)
    assert meta_gate(update) is False

2. Stub helpers (for repeated shapes)

def make_score(side="BULL", score=70, conviction="MED"):
    return ScoreUpdate(
        instrument_id=spy.id, composite_score=score, bias=side,
        conviction=conviction, flow_pressure=0.0, ts_event=0, ts_init=0,
    )
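
When a stub helper starts growing one keyword argument per test, an overrides-based factory may scale better. The sketch below uses a plain dataclass as a stand-in for ScoreUpdate (the `ScoreStub` class, its defaults, and `make_score_stub` are assumptions for illustration, not Cortana's real types):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScoreStub:
    """Stand-in for ScoreUpdate with the fields used in this page's examples."""

    instrument_id: str
    composite_score: int = 70
    bias: str = "BULL"
    conviction: str = "MED"
    flow_pressure: float = 0.0
    ts_event: int = 0
    ts_init: int = 0


_DEFAULT = ScoreStub(instrument_id="SPY.ARCA")


def make_score_stub(**overrides) -> ScoreStub:
    """Return the default stub with only the named fields changed."""
    return replace(_DEFAULT, **overrides)


low = make_score_stub(composite_score=50)
late = make_score_stub(ts_event=10_000_000_000, ts_init=10_000_000_000)
```

Because `replace` rejects unknown field names with a TypeError, a misspelled override fails loudly instead of silently producing the default.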

3. Parquet replay fixtures (for full-session tests)

For Cortana’s 1000-test target, this is the heavy hitter. A single 2026-04-16-chop-day.parquet fixture replayed through BacktestEngine asserts dozens of invariants:

@pytest.fixture(scope="session")
def chop_day_replay():   # must not request function-scoped fixtures: pytest raises ScopeMismatch
    """Full 2026-04-16 session - chop day; replay produces 3 trades, all SL."""
    return load_parquet_data_for_replay("tests/replay_tests/fixtures/2026-04-16-chop-day.parquet")
 
 
def test_chop_day_produces_only_three_trades(chop_day_replay, spy):
    engine = BacktestEngine(config=BacktestEngineConfig(trader_id="TESTER-001"))
    # ... wire venue, actor, strategy
    engine.add_data(chop_day_replay)
    engine.run()
 
    closed = [p for p in engine.cache.positions() if p.is_closed]
    assert len(closed) == 3
    assert all(p.realized_pnl.as_decimal() < 0 for p in closed)
 
 
def test_chop_day_no_position_held_through_eod(chop_day_replay, spy):
    engine = ...
    engine.run()
 
    open_positions = [p for p in engine.cache.positions() if p.is_open]
    assert open_positions == []   # market_exit() at 14:55 CT

Each replay fixture amortizes dozens of assertions. M2’s 1000-test target is roughly: 50 unit tests/module × 11 modules (550) + 5 replay fixtures × 80 invariants each (400) = 950, with property tests and parametrized expansions covering the remainder.

Parametrized tests

Use @pytest.mark.parametrize to cover discrete input matrices without duplicating bodies. The page is explicit: pytest-style free functions plus parametrize is the preferred shape.

@pytest.mark.parametrize(
    "side,aggressor,expected_sign",
    [
        ("CALL", "BUY", +1.0),
        ("CALL", "SELL", -1.0),
        ("PUT", "BUY", -1.0),
        ("PUT", "SELL", +1.0),
    ],
)
def test_uw_flow_directional_sign(side, aggressor, expected_sign, spy):
    actor = ScoringActor(ScoringActorConfig(instrument_id=spy.id, ...))
    actor.start()
    alert = UWFlowAlert(instrument_id=spy.id, side=side, aggressor=aggressor,
                        strike=580.0, premium=10_000.0, confidence=1.0,
                        ts_event=0, ts_init=0)
    actor._on_uw_flow(alert)
    assert (actor.flow_pressure > 0) == (expected_sign > 0)

Cortana naming convention: parametrize ID per case via ids=[...] when the tuple values aren’t self-documenting. Keep the matrix size <12; if you need more, you’ve outgrown parametrize and want a property test.
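
One way to keep IDs and values together is a named case table. The sketch below is dependency-free and assumes the sign rule can be factored into a pure function; `flow_sign` and the case names are hypothetical, while the four cases are the matrix shown above:

```python
def flow_sign(side: str, aggressor: str) -> float:
    """Bullish (+1.0) for bought calls and sold puts, bearish (-1.0) otherwise."""
    bullish = (side == "CALL") == (aggressor == "BUY")
    return 1.0 if bullish else -1.0


# Case name -> (side, aggressor, expected_sign); the names double as test IDs.
CASES = {
    "call-bought-is-bullish": ("CALL", "BUY", +1.0),
    "call-sold-is-bearish": ("CALL", "SELL", -1.0),
    "put-bought-is-bearish": ("PUT", "BUY", -1.0),
    "put-sold-is-bullish": ("PUT", "SELL", +1.0),
}

failures = [
    name
    for name, (side, aggressor, expected) in CASES.items()
    if flow_sign(side, aggressor) != expected
]
```

Under pytest the same table can feed the decorator directly, with `list(CASES.values())` as the argument values and `ids=list(CASES)`, so a failure report names the case rather than printing the raw tuple.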

Property-based tests

Property tests verify invariants over a class of inputs the mind cannot enumerate. The page recommends proptest in Rust; Python’s equivalent is hypothesis (already a Nautilus dev dep).

Cortana invariants worth property-testing:

from hypothesis import given, strategies as st
 
@given(
    initial=st.floats(min_value=0.0, max_value=1e9, allow_nan=False),
    seconds=st.integers(min_value=0, max_value=3600),
    half_life=st.floats(min_value=1.0, max_value=300.0, allow_nan=False),
)
def test_ema_decay_monotonic_non_negative(initial, seconds, half_life):
    """Decay is monotonic non-increasing and never negative."""
    decayed = initial * (0.5 ** (seconds / half_life))
    assert 0.0 <= decayed <= initial
    if seconds > 0 and initial > 0:
        next_step = initial * (0.5 ** ((seconds + 1) / half_life))
        assert next_step <= decayed

The page calls out three canonical property shapes:

Round-trip serialization: parse(to_string(value)) == value.
Inverse operations: (A + B) - B == A.
Transitivity: A < B ∧ B < C ⇒ A < C.

Cortana applications:

ScoreUpdate JSON round-trip (catalog write/read).
Quantity arithmetic in sizing path (size + delta − delta == size).
Conviction-bucket monotonicity (mag1 ≤ mag2 ⇒ bucket(mag1) ≤ bucket(mag2)).
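
The conviction-bucket monotonicity property can be sketched without any dependency. The block below swaps hypothesis for a seeded stdlib random loop purely to stay self-contained; the thresholds in `conviction_bucket` are invented, and only the ordering matters for the property:

```python
import random

BUCKETS = ["LOW", "MED", "HIGH"]  # ordered weakest to strongest


def conviction_bucket(magnitude: float) -> str:
    """Hypothetical thresholds; only the ordering matters for the property."""
    if magnitude >= 100:
        return "HIGH"
    if magnitude >= 50:
        return "MED"
    return "LOW"


def check_monotonic(trials: int = 10_000, seed: int = 0) -> list[tuple[float, float]]:
    """Return every sampled pair that violates mag1 <= mag2 => bucket1 <= bucket2."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    violations = []
    for _ in range(trials):
        low, high = sorted(rng.uniform(0.0, 200.0) for _ in range(2))
        if BUCKETS.index(conviction_bucket(low)) > BUCKETS.index(conviction_bucket(high)):
            violations.append((low, high))
    return violations


violations = check_monotonic()
```

With hypothesis the loop becomes a `@given` over two floats, and a failing pair is shrunk to a minimal counterexample, which a plain random loop does not do.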

Integration tests

Multi-component tests that exercise real engine routing:

def test_actor_publishes_score_strategy_submits_bracket(engine, spy, spy_call_580):
    """End-to-end: UW alert → ScoringActor → ScoreUpdate → CortanaStrategy → bracket."""
    engine.add_actor(ScoringActor(ScoringActorConfig(...)))
    engine.add_strategy(CortanaStrategy(CortanaConfig(...)))
 
    engine.add_data([uw_alert_strong_bull, bar_close_above_threshold])
    engine.run()
 
    orders = engine.cache.orders()
    bracket_parents = [o for o in orders if o.contingency_type.name == "OTO"]   # entry leg of a bracket
    assert len(bracket_parents) == 1

Integration tests are where Cortana’s most subtle bugs hide:

Actor publishes ScoreUpdate; Strategy doesn’t subscribe → silent loss.
Strategy submits bracket; RiskEngine vetoes → OrderDenied event the Strategy doesn’t handle → cooldown stays unset → next signal also blocked.
Bracket child references go stale on OrderEmulator release → SL modify path crashes.

Integration tests catch all three. Unit tests catch none of them.

Golden-file tests (replay determinism)

The Nautilus BacktestEngine is fully deterministic - same inputs produce the same outputs to the byte. This enables golden-file tests: record an expected output once, assert future runs match.

def test_chop_day_replay_matches_golden(chop_day_replay):
    engine = ...
    engine.add_data(chop_day_replay)
    engine.run()
 
    # Serialize cache positions to a stable JSON shape
    actual = serialize_positions_for_golden(engine.cache.positions())
    expected = read_golden("tests/replay_tests/golden/2026-04-16-chop-day.json")
    assert actual == expected

Two rules for golden files:

The serializer must be stable - sort by ts_event, drop volatile fields like client_order_id (which embeds a UUID).
Updating the golden is a deliberate commit - never auto-update; the diff is the review artifact.
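
A stable serializer along these lines might look like the sketch below. It works on plain dicts rather than Nautilus Position objects, and the function name, the `VOLATILE_FIELDS` set, and the sample records are illustrative assumptions:

```python
import json

VOLATILE_FIELDS = {"client_order_id"}  # assumption: extend per project


def serialize_for_golden(records: list[dict]) -> str:
    """Render records as canonical JSON: stable order, volatile fields removed."""
    cleaned = [
        {key: value for key, value in record.items() if key not in VOLATILE_FIELDS}
        for record in records
    ]
    cleaned.sort(key=lambda record: (record["ts_event"], record["instrument_id"]))
    return json.dumps(cleaned, sort_keys=True, indent=2)


# Two runs that differ only in record order, key order, and volatile IDs.
run_a = [
    {"ts_event": 2, "instrument_id": "SPY.ARCA", "pnl": "-12.50", "client_order_id": "a1"},
    {"ts_event": 1, "instrument_id": "SPY.ARCA", "pnl": "-8.00", "client_order_id": "a2"},
]
run_b = [
    {"client_order_id": "b7", "pnl": "-8.00", "instrument_id": "SPY.ARCA", "ts_event": 1},
    {"client_order_id": "b9", "pnl": "-12.50", "instrument_id": "SPY.ARCA", "ts_event": 2},
]
```

Both runs serialize to the same string, so the golden comparison fails only when a field that matters has changed. Indented output keeps the git diff of a golden update readable in review.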

For Cortana: golden-file tests replace the MK2 “did the replay produce the same 15 trades?” diff that’s currently a manual eyeball check.

Anti-patterns the page calls out (verbatim)

Don’t capture log output to assert behaviour. Log capture is fragile because loggers are global state, test execution order is non-deterministic, and assertions break when log wording changes. Verify observable behaviour (return values, state changes, side effects) that the log message reflects.
Don’t pytest.raises(BaseException) against PyO3 panic paths in python/tests/. Debug builds may pass; release wheels abort the interpreter. For abort-prone PyO3 / FFI methods, verify the Python signature/parameter names, or isolate the call in a subprocess.
Don’t add debug_assert! where no test reaches it. Release builds strip the check; unexercised assertions have zero signal.
Don’t sleep arbitrarily. Use await eventually(...) from nautilus_trader.test_kit.functions (Python) or wait_until_async(...) from nautilus_common::testing (Rust).
Don’t use test classes in python/tests/. Free functions + fixtures only. (Mixed allowed in tests/ legacy suite, but free functions still preferred.)
Don’t mock the object under test. Mock its collaborators (or better, stub them).
Don’t pad coverage with language-guarantee assertions. Option::is_some after Some(...) adds no signal.
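
For the abort-prone case, subprocess isolation can be sketched with the standard library alone. Here `os.abort()` stands in for a PyO3 call that aborts the interpreter, and the helper name is hypothetical:

```python
import subprocess
import sys


def exit_code_in_subprocess(snippet: str, timeout_secs: float = 30.0) -> int:
    """Run a Python snippet in a child interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        timeout=timeout_secs,
        check=False,  # a non-zero exit is the expected outcome here
    )
    return result.returncode


# os.abort() stands in for an FFI call that aborts instead of raising.
abort_code = exit_code_in_subprocess("import os; os.abort()")
ok_code = exit_code_in_subprocess("print('fine')")
```

The parent test process survives and can assert on the exit code, whereas the same call made in-process would take the whole pytest run down with it.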

DST readiness - when async tests promote to deterministic simulation

Deterministic simulation testing (madsim in Rust) requires the runtime to be free of ambient non-determinism. Before promoting an async module to DST:

Time/task/runtime/signal primitives route through nautilus_common::live::dst, not tokio directly.
Wall-clock reads go through nautilus_core::time, not SystemTime::now().
State maps with ordering-dependent iteration use IndexMap / IndexSet, not default hash collections.
Every tokio::select! on a control-plane path sets biased.
No escape hatches: no Instant::now(), SystemTime::now(), tokio::signal::ctrl_c, std::thread::spawn, or tokio::task::spawn_blocking outside the seam.
Replay-sensitive IDs (trade_id, venue_order_id) are pure functions of inputs.
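
The last item is the easiest to illustrate outside Rust. The sketch below shows one way an ID can be a pure function of its inputs; the field choice and the `T-` prefix are assumptions for illustration, not the scheme Nautilus uses:

```python
import hashlib


def derive_trade_id(venue_order_id: str, fill_seq: int, ts_event_ns: int) -> str:
    """Derive an ID from the inputs alone, so a replay reproduces it exactly."""
    material = f"{venue_order_id}|{fill_seq}|{ts_event_ns}".encode()
    return "T-" + hashlib.sha256(material).hexdigest()[:16]


first_run = derive_trade_id("V-001", 1, 1_700_000_000_000_000_000)
replay = derive_trade_id("V-001", 1, 1_700_000_000_000_000_000)
next_fill = derive_trade_id("V-001", 2, 1_700_000_000_000_000_000)
```

An ID drawn from uuid4() or a wall-clock read would differ on every replay and break any byte-for-byte comparison of outputs.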

For Cortana MK3: DST is deferred until M3+. The Python actors don’t sit in the Rust async hot path, so DST applies primarily to UW data client (Rust) once we write it. For Python integration tests, the BacktestEngine is already deterministic - that’s our DST equivalent until we have async Rust to verify.

See Nautilus DST for the full DST spec.

Async waits - `await eventually(...)`

For tests that touch live components (live integration suite), arbitrary sleeps are forbidden. Use the polling helpers:

from nautilus_trader.test_kit.functions import eventually
 
async def test_live_actor_receives_first_bar():
    actor.start()
    # ... trigger upstream feed
    await eventually(
        lambda: actor.received_bar_count > 0,
        timeout=5.0,   # seconds
    )

eventually polls the predicate at fast intervals (default 10ms) and returns as soon as it’s true, or raises with a useful diagnostic on timeout. CI flakiness from arbitrary sleeps is the #1 reason Nautilus’s CI mandates this helper.
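
The mechanism behind such a helper is a short polling loop. The sketch below is not the Nautilus implementation; `wait_until`, its parameters, and the 10 ms interval (taken from the description above) are assumptions that show the general idea:

```python
import asyncio
import time
from typing import Callable


async def wait_until(
    predicate: Callable[[], bool],
    timeout_secs: float = 2.0,
    interval_secs: float = 0.01,
) -> None:
    """Poll predicate until it is truthy; raise TimeoutError if it never is."""
    deadline = time.monotonic() + timeout_secs
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_secs}s")
        await asyncio.sleep(interval_secs)


async def demo() -> tuple[int, float]:
    received: list[str] = []
    loop = asyncio.get_running_loop()
    loop.call_later(0.05, received.append, "bar")  # upstream event arrives at ~50 ms
    started = time.monotonic()
    await wait_until(lambda: len(received) > 0, timeout_secs=5.0)
    return len(received), time.monotonic() - started


count, waited_secs = asyncio.run(demo())
```

The wait ends shortly after the event arrives instead of after a fixed sleep, so the test is both faster on a quick machine and more tolerant on a slow CI runner.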

Code coverage

Coverage is published to codecov via the coverage tool. The page is explicit:

“Aim for high coverage without sacrificing appropriate error handling or causing ‘test induced damage’ to the architecture.”

100% is not the target. Some branches are intentionally untestable without modifying production behaviour (defensive-final-condition checks for unexpected values, abstract-method NotImplementedError raises). Use # pragma: no cover for those - and only those. Concrete implementations stay fully covered.
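
The two legitimate uses look roughly like this. The sketch uses hypothetical names (`Rule`, `MetaProbRule`, `bias_sign`); the `# pragma: no cover` comment is the default exclusion marker recognised by coverage.py:

```python
from abc import ABC, abstractmethod


class Rule(ABC):
    @abstractmethod
    def allows(self, prob: float) -> bool:
        raise NotImplementedError  # pragma: no cover


class MetaProbRule(Rule):
    """Concrete implementation: every branch here should be exercised by tests."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def allows(self, prob: float) -> bool:
        return prob >= self.threshold


def bias_sign(bias: str) -> int:
    if bias == "BULL":
        return 1
    if bias == "BEAR":
        return -1
    # Defensive final branch for values the callers should never produce.
    raise ValueError(f"unexpected bias: {bias!r}")  # pragma: no cover
```

The pragma sits on the single unreachable line, not on the function or class, so any real branch added later still counts toward coverage.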

For Cortana M1: target 70%+ coverage. M2: 85%+. The replay-fixture strategy makes high coverage cheap; chasing 100% wastes time.

Cortana MK3 implications - the test shapes for spike Step 5+

This is where the abstract testing patterns become concrete artifacts for the spike Saturday + post-spike work.

`tests/unit_tests/test_cortana_strategy.py`

Canonical tests for the load-bearing Strategy:

def test_strategy_subscribes_to_score_update_on_start(engine, spy, ...):
    """on_start() registers the ScoreUpdate subscription."""
    strategy = CortanaStrategy(CortanaConfig(...))
    engine.add_strategy(strategy)
    engine.run_streaming(steps=1)   # advance 1 bar
    subs = engine.cache.subscribed_data_types()
    assert ScoreUpdate in subs
 
 
def test_strategy_skips_when_score_below_threshold(engine, spy, ...):
    """Below-threshold ScoreUpdate produces zero orders."""
    strategy = CortanaStrategy(CortanaConfig(score_threshold=65, ...))
    engine.add_strategy(strategy)
    engine.add_data([make_score(score=50)])   # below 65
    engine.run()
    assert len(engine.cache.orders()) == 0
 
 
def test_strategy_submits_bracket_on_repeated_hits_trigger(engine, spy, spy_call_580, ...):
    """A repeated_hits ScoreUpdate above threshold submits a 3-leg bracket."""
    strategy = CortanaStrategy(CortanaConfig(...))
    engine.add_strategy(strategy)
    engine.add_data([make_score(kind="repeated_hits", score=80, side="BULL")])
    engine.run()
    orders = engine.cache.orders()
    assert len(orders) == 3
    parent = next(o for o in orders if o.parent_order_id is None)   # entry leg has no parent
    children = [o for o in orders if o.parent_order_id == parent.client_order_id]
    assert len(children) == 2
    assert any(c.order_type.name == "MARKET_IF_TOUCHED" for c in children)   # TP
    assert any(c.order_type.name == "STOP_MARKET" for c in children)          # SL
 
 
def test_strategy_cooldown_prevents_back_to_back_entries(engine, spy, ...):
    """Two ScoreUpdates within cooldown_seconds produce only ONE bracket."""
    strategy = CortanaStrategy(CortanaConfig(cooldown_seconds=60, ...))
    engine.add_strategy(strategy)
    engine.add_data([
        make_score(score=80, ts_event=0),
        make_score(score=85, ts_event=10_000_000_000),   # 10s later
    ])
    engine.run()
    parents = [o for o in engine.cache.orders() if o.parent_order_id is None]   # entry legs only
    assert len(parents) == 1
 
 
def test_strategy_market_exit_at_eod_alert(engine, spy_call_580, ...):
    """At EOD time alert, market_exit() flattens; no positions remain open."""
    strategy = CortanaStrategy(CortanaConfig(eod_flatten_time_ct="14:55", ...))
    # ... wire entry, then advance clock past 14:55 CT
    engine.run()
    open_positions = [p for p in engine.cache.positions() if p.is_open]
    assert open_positions == []

`tests/unit_tests/test_scoring_actor.py`

def test_scoring_actor_publishes_on_uw_flow(engine, synthetic_uw_alert, spy):
    actor = ScoringActor(ScoringActorConfig(...))
    engine.add_actor(actor)
    engine.add_data([synthetic_uw_alert, TestDataStubs.bar_5decimal()])
    engine.run()
    scores = engine.cache.custom_data(ScoreUpdate)
    assert len(scores) >= 1
 
 
def test_scoring_actor_decay_reduces_flow_pressure_over_time(engine, ...):
    actor = ScoringActor(ScoringActorConfig(flow_decay_half_life_seconds=10.0))
    # inject alert at t=0, no bars after, advance clock 10s
    # assert flow_pressure halved within 1% tolerance
 
 
def test_scoring_actor_skips_publish_before_first_bar(engine, synthetic_uw_alert):
    """Before any bar arrives, the actor doesn't have last_bar_close → no publish."""
    actor = ScoringActor(ScoringActorConfig(...))
    engine.add_actor(actor)
    engine.add_data([synthetic_uw_alert])   # no bars
    engine.run()
    assert engine.cache.custom_data(ScoreUpdate) == []
 
 
@pytest.mark.parametrize("side,aggressor,expected_sign", [
    ("CALL", "BUY", +1.0),
    ("CALL", "SELL", -1.0),
    ("PUT",  "BUY", -1.0),
    ("PUT",  "SELL", +1.0),
])
def test_scoring_actor_directional_sign(side, aggressor, expected_sign, ...):
    ...

`tests/unit_tests/test_uw_data_client.py`

def test_uw_client_parses_canonical_alert():
    raw = {"type": "flow_alert", "underlying": "SPY", "side": "CALL",
           "strike": 580, "premium": 12000, "aggressor": "BUY", "confidence": 0.9}
    alert = parse_uw_message(raw)
    assert isinstance(alert, UWFlowAlert)
    assert alert.side == "CALL"
    assert alert.confidence == pytest.approx(0.9)
 
 
@pytest.mark.parametrize("malformed", [
    {},
    {"type": "flow_alert"},
    {"type": "flow_alert", "side": "INVALID"},
    {"type": "unknown_event"},
])
def test_uw_client_rejects_malformed(malformed):
    with pytest.raises((ValueError, KeyError)):
        parse_uw_message(malformed)
 
 
# Property test for the fuzz boundary
from hypothesis import given, strategies as st

@given(payload=st.dictionaries(st.text(), st.text() | st.integers() | st.floats()))
def test_uw_client_never_crashes_on_arbitrary_dict(payload):
    """Parser returns normally or raises ValueError/KeyError/TypeError; nothing else."""
    try:
        parse_uw_message(payload)
    except (ValueError, KeyError, TypeError):
        pass   # acceptable rejection paths
    # Any other exception class propagates and fails the test
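
The three tests above imply a parser contract: a canonical alert parses, and every malformed payload fails with `ValueError`, `KeyError` or `TypeError`. A minimal sketch of a parser that satisfies that contract follows. The field list is taken from the canonical payload in the first test; the validation order, the allowed-value sets and the dataclass layout are assumptions, not the actual `uw_data_client.py` implementation:

```python
from dataclasses import dataclass

# Assumed allowed values; the real client may accept more.
_VALID_SIDES = {"CALL", "PUT"}
_VALID_AGGRESSORS = {"BUY", "SELL"}


@dataclass(frozen=True)
class UWFlowAlert:
    underlying: str
    side: str
    strike: float
    premium: float
    aggressor: str
    confidence: float


def parse_uw_message(raw: dict) -> UWFlowAlert:
    """Parse one UW message; reject anything that is not a valid flow alert."""
    if raw.get("type") != "flow_alert":
        raise ValueError(f"unsupported message type: {raw.get('type')!r}")
    side = raw["side"]                    # KeyError when the field is missing
    if side not in _VALID_SIDES:
        raise ValueError(f"invalid side: {side!r}")
    aggressor = raw["aggressor"]          # KeyError when the field is missing
    if aggressor not in _VALID_AGGRESSORS:
        raise ValueError(f"invalid aggressor: {aggressor!r}")
    return UWFlowAlert(
        underlying=str(raw["underlying"]),
        side=side,
        strike=float(raw["strike"]),      # ValueError/TypeError on non-numeric input
        premium=float(raw["premium"]),
        aggressor=aggressor,
        confidence=float(raw["confidence"]),
    )
```

Checking the message type first means unknown event types fail fast with one clear error instead of a cascade of missing-key errors.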

Testing the dual-trigger pattern (event + timer)

The dual-trigger pattern (per nautilus-tutorial-delta-neutral-options.md) is what makes Cortana’s “5 entry triggers, 4 from events + 1 from timer” robust. Test shape:

def test_timer_trigger_fires_only_when_aging_setup_pending(engine, spy):
    """Timer wakes every 30s; only fires entry if aging setup is in cache."""
    strategy = CortanaStrategy(CortanaConfig(...))
    engine.add_actor(ScoringActor(ScoringActorConfig(...)))
    engine.add_strategy(strategy)
 
    # Score update with aging_setup=True, score above threshold, but no
    # discrete trigger had fired (mimics MK2's "slow build" scenario)
    engine.add_data([make_score(score=80, aging_setup=True)])
 
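    # Advance past the 30s timer interval. NOTE: run_streaming(seconds=...) is
    # shorthand; BacktestEngine exposes run(start=..., end=...), so a real test
    # bounds the run with an explicit end timestamp.
    engine.run_streaming(seconds=31)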
    # Exclude bracket children (SL/TP); the parent entry order has no parent_order_id
    parents = [o for o in engine.cache.orders() if not o.is_child_order]
    assert len(parents) == 1
 
 
def test_timer_does_not_fire_without_aging_setup(engine):
    """Timer wakes but no aging setup → no entry."""
    strategy = CortanaStrategy(CortanaConfig(...))
    engine.add_strategy(strategy)
    engine.add_data([make_score(score=80, aging_setup=False)])
    engine.run_streaming(seconds=31)
    assert engine.cache.orders() == []
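
The decision logic these two tests exercise can be sketched without Nautilus at all. The following is a hypothetical, framework-free model of the dual trigger, not the `CortanaStrategy` implementation. The class name, the `threshold` default and the `"event"`/`"timer"` labels are all assumptions made for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DualTriggerEntry:
    """Event path enters immediately; timer path enters only on a pending aging setup."""

    threshold: float = 75.0                       # hypothetical score threshold
    entries: List[str] = field(default_factory=list)
    _pending_score: Optional[float] = None

    def on_score(self, score: float, aging_setup: bool, discrete_trigger: bool = False) -> None:
        if score < self.threshold:
            self._pending_score = None            # below threshold: nothing pending
        elif discrete_trigger:
            self._pending_score = None
            self.entries.append("event")          # event path fires right away
        elif aging_setup:
            self._pending_score = score           # remember it for the next timer tick
        else:
            self._pending_score = None

    def on_timer(self) -> None:
        if self._pending_score is not None:
            self.entries.append("timer")          # timer path fires once per pending setup
            self._pending_score = None


slow_build = DualTriggerEntry()
slow_build.on_score(80, aging_setup=True)
slow_build.on_timer()
assert slow_build.entries == ["timer"]
```

Clearing the pending score after the timer fires is what keeps a second tick from opening a duplicate entry, which is the property the `len(parents) == 1` assertion guards.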

Asserting RiskEngine veto behaviour

When meta-prob is below the threshold, the RiskEngine rule denies the order before it reaches the venue, and the Strategy receives OrderDenied. This is a different event from OrderRejected, which signals a venue-side rejection. Test shape:

def test_risk_rule_denies_when_meta_prob_below_threshold(engine, spy, ...):
    """Low MetaProbUpdate → RiskEngine denies → OrderDenied → no fill."""
    engine.add_strategy(CortanaStrategy(CortanaConfig(...)))
    engine.add_actor(MetaGateActor(MetaGateActorConfig(meta_prob_threshold=0.55)))
 
    # Inject a low meta-prob into cache *before* the score update
    engine.add_data([
        MetaProbUpdate(instrument_id=spy.id, prob=0.30, ts_event=0, ts_init=0),
        make_score(score=80),
    ])
    engine.run()
    # Order created but denied - denied orders show up with status DENIED
    orders = engine.cache.orders()
    parents = [o for o in orders if not o.is_child_order]   # exclude bracket children
    assert len(parents) == 1
    assert parents[0].status.name == "DENIED"
    # And no fills
    fills = [e for o in orders for e in o.events if e.__class__.__name__ == "OrderFilled"]
    assert fills == []

Seeding `decisions.db` replay tests

Cortana MK2’s decisions.db is the regression-test goldmine. Two strategies for converting to Nautilus tests:

Strategy A - JSON snapshots → fixtures. For each interesting historical case (chop day, power-outage day, the GH #88 dead-code regression), export the relevant scoring_events rows + timestamps to JSON. Load into fixtures. Replay via BacktestEngine. Assert the decision matches.

@pytest.fixture(scope="session")
def decisions_chop_day_2026_04_16():
    return load_json("tests/fixtures/decisions_2026_04_16.json")
 
 
def test_chop_day_2026_04_16_produces_3_losses(engine, decisions_chop_day_2026_04_16):
    """The historical chop-day cluster - MK3 must produce same 3 trades."""
    # engine comes from the fixture (venue + instrument wiring elided)
    engine.add_data(synthesize_data_from_decisions(decisions_chop_day_2026_04_16))
    engine.run()
 
    closed = [p for p in engine.cache.positions() if p.is_closed]
    assert len(closed) == 3
    assert all(p.realized_pnl.as_decimal() < 0 for p in closed)
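
The export step in Strategy A can be sketched with the standard library alone. This is a minimal sketch, assuming `decisions.db` is a SQLite file with a `scoring_events` table whose `ts` column holds ISO-8601 text. The column name, the `export_scoring_events` function and the body of `load_json` are assumptions, since the source does not show the schema or the helper:

```python
import json
import sqlite3
from pathlib import Path


def export_scoring_events(db_path: str, out_path: str, day: str) -> int:
    """Dump one day's scoring_events rows (ordered by ts) to JSON; return the row count."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row            # rows become name-addressable
    try:
        rows = conn.execute(
            "SELECT * FROM scoring_events WHERE date(ts) = ? ORDER BY ts",
            (day,),                           # assumed: ts is ISO-8601 text
        ).fetchall()
    finally:
        conn.close()
    records = [dict(row) for row in rows]
    Path(out_path).write_text(json.dumps(records, indent=2))
    return len(records)


def load_json(path: str) -> list:
    """Fixture-side loader matching the load_json call in the fixture above."""
    return json.loads(Path(path).read_text())
```

Ordering by timestamp at export time keeps the snapshot deterministic, so the replay does not depend on SQLite's row order.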

Strategy B - Parquet replay. For full sessions, export the decisions.db rows, the synchronized UW flow alerts, and the SPY bars to a single Parquet file that loads directly via ParquetDataCatalog. This is the M2-target shape: one Parquet file per replay-worthy session.

The 2026-04-16-chop-day.parquet fixture is the first one to build - that day is already documented as a hard-negative training example (project_losses_april16_chop).

M1 50-test breakdown

Roughly:

Module	Tests	Notes
`scoring_actor.py`	6	publish, decay, sign, threshold, no-bar, no-flow
`meta_gate_actor.py`	4	publish, threshold transitions
`ema_decay_actor.py`	4	timer, half-life, monotonic, lower bound
`regime_detector_actor.py`	5	power_hour, chop, trend, transitions, hysteresis
`cortana_strategy.py`	8	each trigger branch + cooldown + EOD + reject
`uw_data_client.py`	6	parse + 4 malformed cases + property-test
`risk_rules.py`	5	meta_gate, contradiction, max_loss, sizing, fallback
`custom_data.py`	4	serialization round-trip per type
Integration	6	actor→strategy pipeline, EOD, reject path, OCO, OTO, golden
Replay	2	one chop-day, one happy-path day
Total	50	M1 done.

M2 1000-test path

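50 unit tests × 4 (parametrized expansions, error branches, edge cases) = 200
+ 100 property tests
+ 80 invariants × 5 replay fixtures = 400
+ 40 integration
+ 20 fuzz cases
+ 50 spec-acceptance against IBKR paper

Read with standard precedence, the itemized lines sum to 200 + 100 + 400 + 40 + 20 + 50 = 810, so ~1000 is a round target rather than an exact total.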

The shape is fractal: every module gets full ladder coverage. If a test file exceeds 30 tests, split by concern (test_scoring_actor.py → test_scoring_actor_decay.py + test_scoring_actor_sign.py).

What we DO NOT test

Nautilus’s own internals (RiskEngine routing, OrderEmulator state machine, MessageBus delivery). That’s covered by Nautilus’s suite.
Network paths to IBKR (use sandbox/paper for live spec acceptance, not unit tests).
Specific log strings (anti-pattern; verify observable behaviour).
Exact timing (use TestClock; never assert wall-clock latencies in unit tests).

Timeline

2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 7 (developer guide).
