Nautilus Backtesting

NautilusTrader exposes two backtest API levels: a low-level BacktestEngine for in-memory single-process runs (with manual setup, fast reset() for parameter sweeps, and full control over data ordering) and a high-level BacktestNode that orchestrates multiple engines from BacktestRunConfig objects (production path, supports streaming via ParquetDataCatalog, generators, and manual chunking). The simulated venue runs the same Kernel/MessageBus/Cache/OMS as live, with book_type (L1_MBP / L2_MBP / L3_MBO) selecting matching-engine fidelity. Fill realism is tunable via a FillModel family (no slip / one-tick / two-tier / three-tier / size-aware / volume-sensitive / probabilistic), with prob_fill_on_limit for queue position and prob_slippage for L1 slippage. Execution is deterministic on a single seed: data is sorted by ts_init, the matching loop drains cascading commands within the same timestamp, and a fixed random_seed pins the FillModel’s PRNG. Reproducibility caveat: cross-process reruns may differ in rare cases due to hash-ordering effects outside the fill model.

Core claim

Backtest and live differ only in (a) clock source (data-driven vs wall-clock) and (b) venue (simulated matching engine vs adapter to a real venue). Everything else - strategy code, message bus, OMS, accounting, position semantics - is bit-identical infrastructure. This makes Nautilus’s parity claim structural, not aspirational.

Two API levels

Low-level: BacktestEngine

Imperative Python script. You instantiate the engine, add a venue, add instruments, call add_data() for each instrument, then add_strategy() and run(). Use when:

  • Entire data stream fits in RAM.
  • You want raw CSV / binary / non-Parquet data.
  • You want to re-run the same dataset with different strategies/parameters.

Key methods on BacktestEngine:

  • add_venue(...) - declares a simulated venue with book_type, oms_type, account_type, starting_balances, fill model, latency model, etc.
  • add_instrument(instrument) - must be added before data for that instrument.
  • add_data(data, sort=True) - appends and sorts. Performance trap: sort=True re-sorts the entire accumulated stream on every call, so loading 10 instruments × 1M bars triggers ten increasing-size sorts. Pass sort=False repeatedly, then call sort_data() once before run().
  • add_data_iterator(data_name, generator) - automatic chunking. The engine pulls chunks lazily during run().
  • run() - single-shot run.
  • run(streaming=True) - manual chunking. After the loop, call end() to flush deferred timers and finalize.
  • clear_data() - wipe data between runs.
  • reset() - returns all stateful fields to their initial values, except data and instruments, which persist (controlled by CacheConfig.drop_instruments_on_reset=False in the default config). Strategies are removed; you must re-add them before the next run().
  • sort_data() - idempotent, safe to call multiple times.

Data integrity invariants the engine enforces:

  • All data must be sorted before run(). The engine validates this and raises RuntimeError if it sees unsorted data.
  • Data lists are copied internally, so external mutation after add_data() is safe.

High-level: BacktestNode

Declarative configuration objects. Recommended for production. A BacktestRunConfig bundles:

  • BacktestDataConfig[] - pointers into a ParquetDataCatalog (or arbitrary data sources via data_cls).
  • BacktestVenueConfig[] - venue spec (name, OMS, account type, balances, book_type, fill model, latency model, margin model, liquidity_consumption, queue_position, price_protection_points, trade_execution, bar_execution, bar_adaptive_high_low_ordering).
  • ImportableActorConfig[] - actors to instantiate.
  • ImportableStrategyConfig[] - strategies to instantiate.
  • ImportableExecAlgorithmConfig[] - execution algorithms.
  • ImportableControllerConfig (optional) - orchestration logic.
  • BacktestEngineConfig (optional) - engine-level config defaults.

from nautilus_trader.backtest.node import BacktestNode
from nautilus_trader.config import BacktestRunConfig
 
configs = [
    BacktestRunConfig(...),  # Run 1: MK2 baseline
    BacktestRunConfig(...),  # Run 2: MK3 candidate
]
node = BacktestNode(configs=configs)
results = node.run()

Each run gets a fresh engine - no reset() plumbing needed. This is why BacktestNode is the production path: it cleanly separates runs, which is exactly what M2 parallel MK2/MK3 backtests want.

When to use which API

The docs are explicit:

Use low-level when             | Use high-level when
-------------------------------+----------------------------------
Data fits in RAM               | Data exceeds RAM (streaming)
Prefer raw CSV/binary          | Want ParquetDataCatalog
Same dataset, swap components  | Many configs across many engines
Fine-grained control           | Production batch runs
Parameter sweep with reset()   | Single canonical run per config

For Cortana MK3, both paths are in scope:

  • Step 6 spike (Saturday): low-level. One day of data, swap strategies (MK2 vs MK3), call reset() between runs.
  • Step 0.5 spike (Saturday, just added): high-level via ParquetDataCatalog, fed by Databento $125 free credits.
  • M2 milestone: high-level. Multiple BacktestRunConfig per day, parallel MK2/MK3 lanes.

Repeated runs and reset semantics

The reset story matters because the spike Step 6 path runs multiple strategies against the same dataset.

BacktestEngine.reset() resets:

  • All trading state (orders, positions, account balances).
  • Strategy instances (you must re-add).
  • Engine counters and timestamps.

Persists across reset:

  • Data added via add_data() (must call clear_data() to remove).
  • Instruments (must match persisted data).
  • Venue configurations.

This is exactly the workflow the spike wants: load decisions.db once, run MK2-equivalent strategy, reset, run MK3 candidate, compare decisions.

Data and venue book_type

Backtest fidelity is governed by the venue’s book_type and what data you feed it:

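Data type        | L1_MBP            | L2_MBP            | L3_MBO
-----------------+-------------------+-------------------+------------------
QuoteTick        | Updates book      | Ignored           | Ignored
TradeTick        | Triggers matching | Triggers matching | Triggers matching
Bar              | Updates book      | Ignored           | Ignored
OrderBookDelta   | Ignored           | Updates book      | Updates book
OrderBookDepth10 | Ignored           | Updates book      | Updates book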

Critical caveat: Nautilus cannot synthesize higher-fidelity data from lower fidelity. If you pick L2_MBP and feed only quotes/bars, the matching engine will see an empty book and orders will never fill.

Strategies always receive all subscribed data via the data engine regardless of book_type. Only the matching engine cares.
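The book_type table can be encoded as a small lookup to catch the empty-book case before a run. This is a minimal sketch that assumes plain string names for data types and book types; BOOK_UPDATERS and check_book_feed are hypothetical helpers, not part of NautilusTrader.

```python
# Data types that update the simulated book for each book_type,
# transcribed from the table in this section (hypothetical helper).
BOOK_UPDATERS = {
    "L1_MBP": {"QuoteTick", "Bar"},
    "L2_MBP": {"OrderBookDelta", "OrderBookDepth10"},
    "L3_MBO": {"OrderBookDelta", "OrderBookDepth10"},
}


def check_book_feed(book_type: str, data_types: set[str]) -> bool:
    """Return True if at least one supplied data type updates the book."""
    return bool(BOOK_UPDATERS[book_type] & data_types)


# Quotes and bars alone leave an L2_MBP book empty.
l2_ok = check_book_feed("L2_MBP", {"QuoteTick", "Bar"})
```

A check like this could run in a DataLoader before any data reaches the engine, so a mismatched book_type fails fast instead of producing a silent zero-fill backtest.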

Main loop and command settling

For each data point the engine runs three phases:

  1. Exchange processes data. Simulated venue updates its book from the incoming market data and iterates the matching engine. Existing resting orders that now match get filled.
  2. Strategy receives data. Data engine dispatches the data point to actors and strategies via on_quote_tick/on_bar/etc. Strategies may submit, modify, or cancel orders during these callbacks.
  3. Settle venues. The engine drains all queued venue commands and then iterates matching engines to fill newly submitted orders. This loop repeats until no pending commands remain - so cascading orders (e.g. a hedge submitted from on_order_filled) settle within the same timestamp.

Timer events use the same settle mechanism but batch by timestamp: all callbacks at timestamp T execute first, then venues are settled for T before advancing to T+1.

When a LatencyModel is configured, commands enter the venue’s inflight queue with a future timestamp derived from the simulated latency. The settle loop considers inflight commands due at the current timestamp as pending, so zero-latency or same-tick latency configurations still settle correctly.
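The settle phase can be pictured as a drain-until-empty loop. The sketch below is a toy model of that idea only, not the engine's implementation; settle, handle, and the command names are made up for illustration.

```python
from collections import deque


def settle(pending: deque, handle) -> list:
    """Drain commands until none remain; handlers may enqueue follow-ups."""
    processed = []
    while pending:
        command = pending.popleft()
        processed.append(command)
        # A handler may return follow-up commands (e.g. a hedge after a fill).
        pending.extend(handle(command))
    return processed


def handle(command: str) -> list:
    # Hypothetical cascade: filling the entry order triggers one hedge order.
    return ["submit_hedge"] if command == "submit_entry" else []


# Both the entry and the cascaded hedge settle within the same timestamp.
settled = settle(deque(["submit_entry"]), handle)
```

The key property is that the loop condition is re-checked after every handler call, so commands created during settling are processed before the clock advances.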

Fill simulation models

The doc states the fill problem honestly: even with perfect historical data, you cannot fully simulate how orders may have interacted with other market participants in real time. So Nautilus offers a family of models, each making different trade-offs:

Model                      | Description                                    | Use case
---------------------------+------------------------------------------------+----------------------
FillModel (base)           | Probabilistic queue + slippage                 | Simple
BestPriceFillModel         | Always fills at best, unlimited liquidity      | Optimistic logic test
OneTickSlippageFillModel   | Forces exactly one tick of slippage            | Conservative slip test
TwoTierFillModel           | 10 contracts at best, remainder one tick worse | Basic depth
ThreeTierFillModel         | 50/30/20 contracts across three levels         | More realistic depth
ProbabilisticFillModel     | 50% best, 50% one-tick slippage                | Randomized quality
SizeAwareFillModel         | Different exec for ≤10 vs >10 contracts        | Size impact
LimitOrderPartialFillModel | Max 5 contracts per price touch                | Queue via partials
MarketHoursFillModel       | Wider spreads in low-liquidity periods         | Session-aware
VolumeSensitiveFillModel   | Liquidity from recent volume                   | Volume-adaptive
CompetitionAwareFillModel  | Only % of visible liquidity available          | Multi-participant

Base FillModel parameters:

  • prob_fill_on_limit (default 1.0) - probability a limit order fills when its price is touched but not crossed. Models queue position probabilistically. 0.0 = back of queue, 0.5 = middle, 1.0 = front.
  • prob_slippage (default 0.0) - probability of one tick of slippage per fill. Only applies on L1 data (quotes, trades, bars). Affects all takers.
  • random_seed - pins the model’s PRNG for reproducibility. Doc says same-process reruns “are expected to match”; cross-process reruns may differ in rare cases due to hash-ordering effects outside the fill model.
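To see why a fixed seed makes probabilistic fills repeatable within a process, here is a toy model using Python's standard random module. It is not the FillModel's actual PRNG or algorithm; simulate_limit_touches is a hypothetical function that draws one fill decision per price touch.

```python
import random


def simulate_limit_touches(prob_fill_on_limit: float, seed: int, touches: int) -> list[bool]:
    """Draw one fill/no-fill decision per price touch from a seeded PRNG."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() < prob_fill_on_limit for _ in range(touches)]


# Same seed and parameters give the same sequence of decisions.
run_a = simulate_limit_touches(0.5, seed=42, touches=20)
run_b = simulate_limit_touches(0.5, seed=42, touches=20)
```

The sketch also shows the two extremes from the parameter description: a probability of 1.0 fills on every touch and 0.0 never fills.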

Order book simulation models override get_orderbook_for_fill_simulation() to generate a synthetic book. When a custom model returns a book, liquidity_consumption tracking is not applied - the model owns its own liquidity simulation.

Slippage and spread by data type

  • L2/L3: slippage emerges naturally from book traversal. Market orders walk levels; prob_slippage is unused.
  • L1 (quotes / trades / bars): the simulated book has one level per side. prob_slippage is active. If an order’s residual quantity exceeds top-of-book liquidity, market and marketable-limit orders slip one tick to fill.
  • Bars: OHLC processes via four price points (Open / High / Low / Close), volume split 25% per point. Sequencing is fixed (O→H→L→C) by default; with bar_adaptive_high_low_ordering=True, the engine estimates the most likely H/L sequence based on whether Open is closer to High or Low. Doc cites “~75-85% accuracy” for this heuristic vs ~50% statistical accuracy with fixed ordering. Critical when both TP and SL fall inside the same bar - sequencing decides which fills.
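The bar sequencing rule can be sketched as follows. The exact heuristic is an assumption: this version visits whichever extreme is nearer to the open first, which is one plausible reading of "whether Open is closer to High or Low". bar_price_path and split_volume are hypothetical names.

```python
def bar_price_path(open_: float, high: float, low: float, close: float,
                   adaptive: bool = False) -> list[float]:
    """Return the four price points in the order a bar is assumed to trade."""
    # Assumption: with adaptive ordering, the extreme nearer the open prints first.
    if adaptive and abs(open_ - low) < abs(high - open_):
        return [open_, low, high, close]
    return [open_, high, low, close]  # fixed default ordering O -> H -> L -> C


def split_volume(volume: float) -> list[float]:
    """Assign 25% of bar volume to each of the four price points."""
    return [volume * 0.25] * 4


fixed_path = bar_price_path(100, 105, 99, 103)
adaptive_path = bar_price_path(100, 105, 99, 103, adaptive=True)
```

In this example the open sits one point above the low and five below the high, so the adaptive path visits the low first. A stop-loss at 99.5 and a take-profit at 104 inside the same bar would resolve differently under the two orderings.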

Stop order behavior with bar data

The matching engine distinguishes:

  1. Gap scenario (bar opens past trigger) - stop fills at the open price. Models real exchange gap behavior (no price guarantee).
  2. Move-through scenario (bar opens normally, then H/L moves through trigger) - stop fills at the trigger price. Assumes orderly movement, no gap slippage.

This caps modeled slippage during orderly moves while preserving gap realism. For tick-level precision, use quote/trade ticks instead of bars.
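The two stop scenarios reduce to a short decision rule. The sketch below covers only a BUY stop-market order against a single bar and is a simplified illustration, not the matching engine's code; buy_stop_fill_price is a hypothetical name.

```python
from typing import Optional


def buy_stop_fill_price(trigger: float, bar_open: float, bar_high: float) -> Optional[float]:
    """Fill price for a BUY stop-market order against one bar, or None if untouched."""
    if bar_open >= trigger:
        return bar_open   # gap: bar opened past the trigger, fill at the open
    if bar_high >= trigger:
        return trigger    # move-through: price traded up through the trigger
    return None           # trigger never reached in this bar


gap_fill = buy_stop_fill_price(trigger=101.0, bar_open=103.0, bar_high=104.0)
orderly_fill = buy_stop_fill_price(trigger=101.0, bar_open=100.0, bar_high=102.0)
```

A SELL stop mirrors this with the comparisons reversed and the bar low in place of the high.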

Order book immutability and liquidity_consumption

Historical book data is immutable. When your order fills, the book itself is not modified. Nautilus offers a per-level consumption tracker:

  • liquidity_consumption=False (default) - each iteration fills against full book liquidity independently. Simpler; assumes you’re a small participant.
  • liquidity_consumption=True - tracks (original_size, consumed) per price level. Resets when fresh data arrives at that level. Prevents the same displayed liquidity from generating multiple fills.

For passive limit orders on L1 data: with consumption tracking, the order fills only against displayed liquidity at each new quote, with remainder open. Without it, the engine assumes “market crossed your price → there must have been enough liquidity → fill it all.”
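A toy version of the per-level tracker makes the difference concrete. This is a sketch under the description above (track original size and consumed amount per price, reset on fresh data); LevelConsumption is a hypothetical class and does not mirror the library's internals.

```python
class LevelConsumption:
    """Toy per-price-level tracker of (original_size, consumed)."""

    def __init__(self) -> None:
        self._levels: dict[float, list[float]] = {}

    def on_book_update(self, price: float, size: float) -> None:
        # Fresh data at a level resets its consumption.
        self._levels[price] = [size, 0.0]

    def fill(self, price: float, wanted: float) -> float:
        """Consume up to `wanted` from what remains at `price`; return filled qty."""
        original, consumed = self._levels.get(price, [0.0, 0.0])
        filled = min(wanted, original - consumed)
        if filled > 0:
            self._levels[price][1] = consumed + filled
        return max(filled, 0.0)


tracker = LevelConsumption()
tracker.on_book_update(100.0, 10)
first = tracker.fill(100.0, 7)    # 7 of the 10 displayed
second = tracker.fill(100.0, 7)   # only 3 remain at this level
```

Without tracking, both fills would see the full 10 and return 7 each, so 14 would trade against 10 displayed.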

Trade-driven execution and queue position

trade_execution=True (default) lets trade ticks trigger matching. The matching core uses a “transient override” - temporarily adjusts its internal best bid/ask toward the trade price so resting passive orders can cross. The underlying book is never modified.

queue_position=True (alongside trade_execution=True) tracks queue position for limit orders. On placement, the engine snapshots same-side depth at the order’s price level. Trade ticks decrement quantity-ahead. Order becomes fill-eligible only when ahead reaches zero. Order modification resets queue (back of new level).
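The queue-position mechanics can be sketched with a small class. This is an illustration of the described behaviour only; QueuedLimitOrder is hypothetical, and the rule that only trades at exactly the order's price reduce the queue is a simplifying assumption.

```python
class QueuedLimitOrder:
    """Toy queue-position tracker for one resting limit order."""

    def __init__(self, price: float, depth_ahead: float) -> None:
        self.price = price
        self.ahead = depth_ahead  # snapshot of same-side size at placement

    def on_trade(self, trade_price: float, trade_size: float) -> None:
        # Assumption: only trades at our price consume the quantity ahead of us.
        if trade_price == self.price:
            self.ahead = max(self.ahead - trade_size, 0.0)

    def modify(self, new_price: float, depth_at_new_price: float) -> None:
        # A modification sends the order to the back of the new level.
        self.price = new_price
        self.ahead = depth_at_new_price

    @property
    def fill_eligible(self) -> bool:
        return self.ahead == 0


order = QueuedLimitOrder(price=100.0, depth_ahead=30)
order.on_trade(100.0, 20)
eligible_after_first_trade = order.fill_eligible
order.on_trade(100.0, 15)
eligible_after_second_trade = order.fill_eligible
```

Here 30 lots sit ahead at placement, so the first 20-lot trade leaves the order ineligible and the second trade clears the queue.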

Price protection

Models exchanges like Binance / CME that filter excessively aggressive fills. price_protection_points=N on BacktestVenueConfig defines a boundary computed at fill time:

  • BUY: protection_price = ask + (N × price_increment)
  • SELL: protection_price = bid - (N × price_increment)

Affects MARKET and STOP_MARKET orders. Set to 0 to disable.
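The two boundary formulas translate directly into code. The function below is a hypothetical helper that evaluates them with Decimal to avoid float rounding on tick sizes; it is not a NautilusTrader API.

```python
from decimal import Decimal


def protection_price(side: str, bid: Decimal, ask: Decimal,
                     points: int, price_increment: Decimal) -> Decimal:
    """Boundary price beyond which an aggressive order should not fill."""
    if side == "BUY":
        return ask + points * price_increment
    if side == "SELL":
        return bid - points * price_increment
    raise ValueError(f"unknown side: {side!r}")


# Illustrative inputs: 5 protection points on a 0.01 tick.
buy_limit = protection_price("BUY", Decimal("99.99"), Decimal("100.01"), 5, Decimal("0.01"))
sell_limit = protection_price("SELL", Decimal("99.99"), Decimal("100.01"), 5, Decimal("0.01"))
```

With these inputs a market buy would not fill above 100.06 and a market sell would not fill below 99.94.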

Fee and margin models

Fee model: MakerTakerFeeModel, FixedFeeModel, or custom subclass of FeeModel. Sign convention: positive fee rate = commission, negative = rebate.

Margin: configure on BacktestVenueConfig via MarginModelConfig. model_type="leveraged" (default) - margin reduced by leverage. model_type="standard" - fixed percentages (traditional brokers). Custom: fully-qualified class path "my_package.my_module:MyMarginModel".

Multi-instrument backtests

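A single BacktestEngine can host many instruments. The performance trap to remember: add_data() with sort=True (default) re-sorts the entire accumulated stream on every call. For N instruments with M bars each, the k-th call sorts k×M items, so naive loading costs roughly O(N² M log(NM)) in total, versus O(NM log(NM)) for a single sort at the end.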

Recommended pattern:

engine = BacktestEngine()
engine.add_venue(...)
engine.add_instrument(instr1)
engine.add_instrument(instr2)
engine.add_instrument(instr3)
engine.add_data(instr1_bars, sort=False)
engine.add_data(instr2_bars, sort=False)
engine.add_data(instr3_bars, sort=False)
engine.sort_data()  # one sort at the end
engine.add_strategy(strategy)
engine.run()

Or batch first, sort once:

all_bars = []
all_bars.extend(instr1_bars)
all_bars.extend(instr2_bars)
all_bars.extend(instr3_bars)
engine.add_data(all_bars, sort=True)
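The cost difference is easy to quantify by counting how many elements are handed to a sort in each loading style. This is back-of-envelope arithmetic under the assumption that every sort=True call re-sorts the full accumulated stream; elements_sorted is a hypothetical helper.

```python
def elements_sorted(n_instruments: int, bars_each: int, sort_every_call: bool) -> int:
    """Total elements handed to a sort across all add_data() calls."""
    if not sort_every_call:
        return n_instruments * bars_each  # one sort over the full stream
    # Call k re-sorts everything loaded so far: k * bars_each elements.
    return sum(k * bars_each for k in range(1, n_instruments + 1))


# The 10 instruments x 1M bars case mentioned earlier.
naive = elements_sorted(10, 1_000_000, sort_every_call=True)
once = elements_sorted(10, 1_000_000, sort_every_call=False)
```

For this case the naive path pushes 55 million elements through sorting against 10 million for the sort-once path, and the gap grows quadratically with the instrument count.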

Streaming for datasets larger than RAM

Two streaming patterns:

Automatic chunking - supply a generator that yields batches; the engine pulls lazily during run():

def data_generator():
    yield load_chunk_1()
    yield load_chunk_2()
    yield load_chunk_3()
 
engine.add_data_iterator(
    data_name="my_data_stream",
    generator=data_generator(),
)
engine.run()

Manual chunking - load and run each batch yourself. This is the pattern BacktestNode uses internally:

engine.add_strategy(strategy)
for batch in data_batches:
    engine.add_data(batch)
    engine.run(streaming=True)
    engine.clear_data()
engine.end()  # flushes deferred timers, stops engines, produces results

In streaming mode, timer advancement stops when data exhausts for each batch. Timers scheduled past the last data point are deferred until more data arrives or end() is called.
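Either streaming pattern needs a source of time-ordered batches. A minimal sketch of a lazy batcher follows; batched is a hypothetical helper, and it assumes the input iterable is already sorted by ts_init so that each batch continues where the previous one ended.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(records: Iterable, batch_size: int) -> Iterator[list]:
    """Lazily yield lists of at most `batch_size` records, preserving order."""
    iterator = iter(records)
    # islice pulls only the next batch, so the full stream is never in memory.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Integers stand in for data objects here.
batches = list(batched(range(7), 3))
```

A generator like this could supply the data_batches loop in the manual pattern, with the final short batch handled like any other.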

Distributed and parallel runs

The doc describes BacktestNode as orchestrating “multiple BacktestEngine instances.” Multiple BacktestRunConfig objects in a list run sequentially in a single node. The doc does not, in this page, go deeper into multi-process or multi-host parallelism - that is deferred to live-trading and (presumably) deployment guides. For M2, parallel MK2/MK3 likely means either (a) two BacktestRunConfig in one BacktestNode call, or (b) two separate Python processes each running one config. Both are clean.

Deterministic replay guarantees

The doc’s deterministic guarantees:

  1. Data is sorted monotonically by ts_init.
  2. The matching loop drains cascading commands within the same timestamp, so order of submission within a single tick does not produce nondeterministic outcomes - all commands at T settle before T+1.
  3. Timer events fire deterministically because the engine’s Clock is data-driven, not wall-clock-driven.
  4. random_seed on FillModel pins the probabilistic PRNG.

The doc explicitly hedges:

Reproducible results: A fixed random_seed pins the probabilistic fill model’s PRNG. Same-process reruns are expected to match; cross-process reruns may differ in rare cases due to hash-ordering effects outside the fill model.

This is honest about the boundary - Python dict iteration order, set ordering, etc. can leak across processes. For audit-grade reproducibility across machines, use Rust-only DST (see nautilus-concepts.md DST section).

How parity is maintained with live

This is the structural argument. Backtest and live share:

  1. Same Kernel - both modes instantiate the same kernel object.
  2. Same MessageBus - same topic naming, same dispatch, same subscribe semantics.
  3. Same Cache - cache-then-publish for data, identical query API.
  4. Same Strategy / Actor lifecycle - on_start, on_quote_tick, on_bar, on_event fire the same way.
  5. Same Clock interface - strategies cannot tell whether they are in simulated or wall-clock time. They never call time.time() directly.
  6. Same OMS / account / position / order semantics - just with a simulated venue instead of an adapter.
  7. Reconciliation only on the live side - backtest controls both sides of the trade ledger, so there is nothing to reconcile against.

What differs:

  • Clock source - data-driven in backtest, wall-clock in live.
  • Venue - SimulatedExchange in backtest, adapter to a real venue in live.
  • Fill realism - FillModel in backtest, real venue matching in live. This is the main honest gap: a fill model is a model, not the thing itself.

Failure mode the parity argument does not prevent: if your FillModel is too optimistic, your backtest P&L will be too good. The parity guarantee is about infrastructure, not market microstructure.

Cortana MK3 implications

Step 6 (Saturday): replay decisions.db

Path: decisions.db → DataLoader → in-memory list → BacktestEngine.

Concrete shape:

  1. Load today’s 15 scoring events from SQLite.
  2. Wrap each as a Nautilus Bar or custom Data subclass with ts_init set to the scoring event timestamp.
  3. engine.add_data(events, sort=True); engine.add_venue(...) with book_type=L1_MBP and a basic FillModel.
  4. engine.add_strategy(MK2EquivalentStrategy()); engine.run(); capture decisions.
  5. engine.reset(); engine.add_strategy(MK3CandidateStrategy()); engine.run(); capture decisions.
  6. Diff. Pass criterion: ≥60% decision parity with MK2.
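The diff in the last step can be as simple as a per-event comparison. The sketch below assumes each run produces a mapping from event ID to a decision label; decision_parity, the event IDs, and the decision values are all illustrative, not real Cortana output.

```python
def decision_parity(baseline: dict, candidate: dict) -> float:
    """Fraction of baseline events where both strategies made the same decision."""
    if not baseline:
        raise ValueError("baseline has no decisions to compare")
    matches = sum(1 for event_id, decision in baseline.items()
                  if candidate.get(event_id) == decision)
    return matches / len(baseline)


# Made-up decisions for five events, for illustration only.
mk2 = {"e1": "BUY", "e2": "HOLD", "e3": "SELL", "e4": "HOLD", "e5": "BUY"}
mk3 = {"e1": "BUY", "e2": "HOLD", "e3": "HOLD", "e4": "HOLD", "e5": "SELL"}
parity = decision_parity(mk2, mk3)
passed = parity >= 0.60  # the Step 6 pass criterion
```

Events missing from the candidate count as mismatches, which keeps the metric conservative if MK3 skips an event entirely.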

Open question: do we go directly through Bar or through a decisions custom Data type? Bar is more standard, but Cortana scoring events are richer than OHLC. A custom data type is probably the right shape - see nautilus-custom-data.md.

Step 0.5 (Saturday): Databento → ParquetDataCatalog

Path: Databento $125 credits → Parquet files → ParquetDataCatalog → BacktestNode → BacktestRunConfig.

This is the production path for M2. Worth proving it works on Saturday before it becomes a milestone-blocking surprise. The Databento adapter (per nautilus-data.md) supports bars_timestamp_on_close=True - set this so ts_init is at bar close, matching Nautilus’s expectation and avoiding look-ahead bias.

M2: parallel MK2 / MK3 backtests

Two BacktestRunConfig in one BacktestNode.run() call. Each config points to the same BacktestDataConfig (same Parquet catalog) but different strategy. Output: two sets of decisions per day; M2 success metric is daily decision diff <5%.

Carryover #7: ts_init nanosecond-tie ordering

Does the doc resolve it? Only partially.

What the doc says:

  • Data is sorted “into monotonic order based on ts_init.”
  • time_bars_build_delay (microseconds) addresses one specific edge case: tick data arriving exactly at bar-close timestamp may out-order the bar timer. This is the only nanosecond-tie hint.
  • The settle loop guarantees commands at timestamp T finish before T+1.

What the doc does not say:

  • How ties at the same ts_init resolve when multiple Data objects share a nanosecond. Python’s sorted() is stable, so insertion order wins, but the doc never makes this guarantee explicit. The Rust path may behave differently.
  • Whether ts_init ties between trades and quotes have a defined priority.

For SQLite → Parquet replay, this is exactly the question. If two scoring events share a nanosecond (timestamp truncation in SQLite is a real risk; SQLite stores REAL or INTEGER and millisecond precision is common), their relative order at BacktestEngine enqueue time decides which strategy callback fires first - which can flip a decision.

Mitigation for the spike: pre-sort by (ts_init, source_priority, event_id) at DataLoader time, ensuring stable ordering before handing to the engine. Treat this as a known unresolved upstream question - worth a follow-up issue (“Document Nautilus tie-breaking semantics for ts_init collisions”) and a RuntimeError in our DataLoader if we detect a real tie at nanosecond resolution.
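A sketch of that mitigation at DataLoader time is shown below. It assumes events are plain dicts with ts_init, source_priority, and event_id keys; presort_events and the sample events are hypothetical and only illustrate the ordering key and the tie check.

```python
from collections import Counter


def presort_events(events: list[dict], fail_on_tie: bool = True) -> list[dict]:
    """Order events by (ts_init, source_priority, event_id); optionally reject ties."""
    if fail_on_tie:
        counts = Counter(e["ts_init"] for e in events)
        tied = sorted(ts for ts, n in counts.items() if n > 1)
        if tied:
            raise RuntimeError(f"ts_init collisions at: {tied}")
    return sorted(events, key=lambda e: (e["ts_init"], e["source_priority"], e["event_id"]))


# Illustrative events: two share ts_init=1.
events = [
    {"ts_init": 2, "source_priority": 0, "event_id": 7},
    {"ts_init": 1, "source_priority": 1, "event_id": 9},
    {"ts_init": 1, "source_priority": 0, "event_id": 12},
]
ordered_ids = [e["event_id"] for e in presort_events(events, fail_on_tie=False)]
```

With fail_on_tie=False the composite key gives a deterministic order even for colliding timestamps; with the default, the loader refuses to guess and raises instead.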

Precision invariants

The fill pipeline enforces precision strictly. All prices and quantities must match instrument.price_precision / instrument.size_precision; mismatches raise RuntimeError immediately. Use instrument.make_price(raw) and instrument.make_qty(raw) to coerce raw values.

This applies to: QuoteTick (bid/ask price + size), TradeTick (price + size), Bar (OHLC + volume in base currency units), Order (quantity, price, trigger_price, activation_price), Order updates, Fills.

Bar volume must be in base currency units. Some data providers report quote-currency volume - convert before loading.
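A pre-load check can flag precision mismatches before they surface as a RuntimeError mid-run. The sketch below is a toy check on decimal strings using the standard decimal module; has_precision is a hypothetical helper and does not reproduce the engine's own validation.

```python
from decimal import Decimal


def has_precision(value: str, precision: int) -> bool:
    """True if the decimal string carries exactly `precision` decimal places."""
    exponent = Decimal(value).as_tuple().exponent
    return exponent == -precision


# For an instrument with price_precision=2 (illustrative values).
ok = has_precision("100.25", 2)
too_fine = has_precision("100.250", 2)
too_coarse = has_precision("100.2", 2)
```

Values are passed as strings so that trailing zeros survive; a float would lose the distinction between 100.2 and 100.20.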

Best practices summary

  • Production batch: BacktestNode with config objects.
  • Parameter sweep: BacktestEngine with reset().
  • Many instruments: add_data(..., sort=False) then one sort_data().
  • Big data: add_data_iterator() (auto chunks) or manual chunking + run(streaming=True) + end().
  • Bar data: bar_adaptive_high_low_ordering=True if TP/SL inside same bar matters.
  • Reproducibility: pin random_seed on FillModel; same-process only.
  • Realism: pick FillModel that matches your strategy’s market-impact assumptions. For SPY 0DTE Cortana, ThreeTierFillModel (50/30/20 across three ticks) or OneTickSlippageFillModel are the closest matches - Cortana trades single contracts but options 0DTE often slip one tick. Recommend ThreeTierFillModel for the Saturday spike: it’s pessimistic enough to surface MK3-vs-MK2 differences that aren’t artifacts of optimistic fills. Switch to OneTickSlippageFillModel if ThreeTierFillModel proves too conservative for IBKR SmartRouting reality.

Timeline

  • 2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 3.