Nautilus Backtesting
NautilusTrader exposes two backtest API levels: a low-level BacktestEngine for in-memory single-process runs (with manual setup, fast reset() for parameter sweeps, and full control over data ordering) and a high-level BacktestNode that orchestrates multiple engines from BacktestRunConfig objects (production path, supports streaming via ParquetDataCatalog, generators, and manual chunking). The simulated venue runs the same Kernel/MessageBus/Cache/OMS as live, with book_type (L1_MBP/L2_MBP/L3_MBO) selecting matching-engine fidelity. Fill realism is tunable via a FillModel family (no slip / one-tick / two-tier / three-tier / size-aware / volume-sensitive / probabilistic), with prob_fill_on_limit for queue position and prob_slippage for L1 slippage. Execution is deterministic on a single seed: data is sorted by ts_init, the matching loop drains cascading commands within the same timestamp, and a fixed random_seed pins the FillModel’s PRNG. Reproducibility caveat: cross-process reruns may differ in rare cases due to hash-ordering effects outside the fill model.
Core claim
Backtest and live differ only in (a) clock source (data-driven vs wall-clock) and (b) venue (simulated matching engine vs adapter to a real venue). Everything else - strategy code, message bus, OMS, accounting, position semantics - is bit-identical infrastructure. This makes Nautilus’s parity claim structural, not aspirational.
Two API levels
Low-level: BacktestEngine
Imperative Python script. You instantiate the engine, add a venue, add
instruments, call add_data() for each instrument, then add_strategy() and
run(). Use when:
- Entire data stream fits in RAM.
- You want raw CSV / binary / non-Parquet data.
- You want to re-run the same dataset with different strategies/parameters.
Key methods on BacktestEngine:
- add_venue(...) - declares a simulated venue with book_type, oms_type, account_type, starting_balances, fill model, latency model, etc.
- add_instrument(instrument) - must be added before data for that instrument.
- add_data(data, sort=True) - appends and sorts. Performance trap: sort=True re-sorts the entire accumulated stream on every call, so loading 10 instruments × 1M bars triggers ten increasing-size sorts. Pass sort=False repeatedly, then call sort_data() once before run().
- add_data_iterator(data_name, generator) - automatic chunking. The engine pulls chunks lazily during run().
- run() - single-shot run.
- run(streaming=True) - manual chunking. After the loop, call end() to flush deferred timers and finalize.
- clear_data() - wipe data between runs.
- reset() - return all stateful fields to their initial values, except data and instruments, which persist (controlled by CacheConfig.drop_instruments_on_reset=False in the default config). Strategies are removed; you must re-add them before the next run().
- sort_data() - idempotent, safe to call multiple times.
Data integrity invariants the engine enforces:
- All data must be sorted before run(). The engine validates this and raises RuntimeError if it sees unsorted data.
- Data lists are copied internally, so external mutation after add_data() is safe.
High-level: BacktestNode
Declarative configuration objects. Recommended for production. A
BacktestRunConfig bundles:
- BacktestDataConfig[] - pointers into a ParquetDataCatalog (or arbitrary data sources via data_cls).
- BacktestVenueConfig[] - venue spec (name, OMS, account type, balances, book_type, fill model, latency model, margin model, liquidity_consumption, queue_position, price_protection_points, trade_execution, bar_execution, bar_adaptive_high_low_ordering).
- ImportableActorConfig[] - actors to instantiate.
- ImportableStrategyConfig[] - strategies to instantiate.
- ImportableExecAlgorithmConfig[] - execution algorithms.
- ImportableControllerConfig (optional) - orchestration logic.
- BacktestEngineConfig (optional) - engine-level config defaults.

from nautilus_trader.backtest.node import BacktestNode
from nautilus_trader.config import BacktestRunConfig

# One config per run; the node builds a separate engine for each.
configs = [
    BacktestRunConfig(...),  # Run 1: MK2 baseline
    BacktestRunConfig(...),  # Run 2: MK3 candidate
]
node = BacktestNode(configs=configs)
results = node.run()

Each run gets a fresh engine - no reset() plumbing needed. This is
why BacktestNode is the production path: it cleanly separates runs,
which is exactly what M2 parallel MK2/MK3 backtests want.
When to use which API
The docs are explicit:
| Use Low-level when | Use High-level when |
|---|---|
| Data fits in RAM | Data exceeds RAM (streaming) |
| Prefer raw CSV/binary | Want ParquetDataCatalog |
| Same dataset, swap components | Many configs across many engines |
| Fine-grained control | Production batch runs |
| Parameter sweep with reset() | Single canonical run per config |
For Cortana MK3, both paths are in scope:
- Step 6 spike (Saturday): low-level. One day of data, swap strategies (MK2 vs MK3), call reset() between runs.
- Step 0.5 spike (Saturday, just added): high-level via ParquetDataCatalog, fed by Databento $125 free credits.
- M2 milestone: high-level. Multiple BacktestRunConfig per day, parallel MK2/MK3 lanes.
Repeated runs and reset semantics
The reset story matters because the spike Step 6 path runs multiple strategies against the same dataset.
BacktestEngine.reset() resets:
- All trading state (orders, positions, account balances).
- Strategy instances (you must re-add).
- Engine counters and timestamps.
Persists across reset:
- Data added via add_data() (must call clear_data() to remove).
- Instruments (must match persisted data).
- Venue configurations.
This is exactly the workflow the spike wants: load decisions.db once, run MK2-equivalent strategy, reset, run MK3 candidate, compare decisions.
Data and venue book_type
Backtest fidelity is governed by the venue’s book_type and what data
you feed it:
| Data type | L1_MBP | L2_MBP | L3_MBO |
|---|---|---|---|
| QuoteTick | Updates book | Ignored | Ignored |
| TradeTick | Triggers matching | Triggers matching | Triggers matching |
| Bar | Updates book | Ignored | Ignored |
| OrderBookDelta | Ignored | Updates book | Updates book |
| OrderBookDepth10 | Ignored | Updates book | Updates book |
Critical caveat: Nautilus cannot synthesize higher-fidelity data from
lower fidelity. If you pick L2_MBP and feed only quotes/bars, the
matching engine will see an empty book and orders will never fill.
Strategies always receive all subscribed data via the data engine
regardless of book_type. Only the matching engine cares.
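
A quick pre-flight check can catch the empty-book trap before a run. The sketch below simply transcribes the table above into a lookup; the helper name feed_can_build_book and the string keys are hypothetical and are not part of the NautilusTrader API.

```python
# Transcribed from the table above: data types that update the simulated
# book for each book_type. TradeTick triggers matching but is not listed
# as a book updater, so it is left out here.
BOOK_UPDATERS = {
    "L1_MBP": {"QuoteTick", "Bar"},
    "L2_MBP": {"OrderBookDelta", "OrderBookDepth10"},
    "L3_MBO": {"OrderBookDelta", "OrderBookDepth10"},
}


def feed_can_build_book(book_type: str, data_types: set) -> bool:
    """Return True if at least one supplied data type updates the book."""
    return bool(BOOK_UPDATERS[book_type] & data_types)


print(feed_can_build_book("L1_MBP", {"QuoteTick", "Bar"}))  # True
print(feed_can_build_book("L2_MBP", {"QuoteTick", "Bar"}))  # False: empty book, no fills
```

Running a check like this at load time turns a silent "orders never fill" backtest into an explicit configuration error.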
Main loop and command settling
For each data point the engine runs three phases:
- Exchange processes data. Simulated venue updates its book from the incoming market data and iterates the matching engine. Existing resting orders that now match get filled.
- Strategy receives data. Data engine dispatches the data point to actors and strategies via on_quote_tick / on_bar / etc. Strategies may submit, modify, or cancel orders during these callbacks.
- Settle venues. The engine drains all queued venue commands and then iterates matching engines to fill newly submitted orders. This loop repeats until no pending commands remain - so cascading orders (e.g. a hedge submitted from on_order_filled) settle within the same timestamp.
Timer events use the same settle mechanism but batch by timestamp: all callbacks at timestamp T execute first, then venues are settled for T before advancing to T+1.
When a LatencyModel is configured, commands enter the venue’s inflight
queue with a future timestamp derived from the simulated latency. The
settle loop considers inflight commands due at the current timestamp as
pending, so zero-latency or same-tick latency configurations still
settle correctly.
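
The settle phase can be pictured as a drain-until-empty loop. The following is a toy sketch of that idea only, assuming a single queue and string commands; settle and handle are hypothetical names and this is not how the engine is actually implemented.

```python
from collections import deque


def settle(pending: deque, handle) -> list:
    """Drain commands until none remain; handlers may enqueue follow-ups."""
    processed = []
    while pending:
        command = pending.popleft()
        processed.append(command)
        # A handler can return follow-up commands (e.g. a hedge on fill),
        # which are settled in the same pass rather than at the next tick.
        pending.extend(handle(command))
    return processed


def handle(command: str) -> list:
    # Hypothetical cascade: filling the entry order triggers a hedge order.
    return ["submit_hedge"] if command == "submit_entry" else []


print(settle(deque(["submit_entry"]), handle))
# ['submit_entry', 'submit_hedge']
```

The point of the loop shape is that the hedge is processed before the clock advances, which is what "settle within the same timestamp" means above.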
Fill simulation models
The doc states the fill problem honestly: even with perfect historical data, you cannot fully simulate how orders may have interacted with other market participants in real time. So Nautilus offers a family of models, each making different trade-offs:
| Model | Description | Use case |
|---|---|---|
| FillModel (base) | Probabilistic queue + slippage | Simple |
| BestPriceFillModel | Always fills at best, unlimited liquidity | Optimistic logic test |
| OneTickSlippageFillModel | Forces exactly one tick of slippage | Conservative slip test |
| TwoTierFillModel | 10 contracts at best, remainder one tick worse | Basic depth |
| ThreeTierFillModel | 50/30/20 contracts across three levels | More realistic depth |
| ProbabilisticFillModel | 50% best, 50% one-tick slippage | Randomized quality |
| SizeAwareFillModel | Different exec for ≤10 vs >10 contracts | Size impact |
| LimitOrderPartialFillModel | Max 5 contracts per price touch | Queue via partials |
| MarketHoursFillModel | Wider spreads in low-liquidity periods | Session-aware |
| VolumeSensitiveFillModel | Liquidity from recent volume | Volume-adaptive |
| CompetitionAwareFillModel | Only % of visible liquidity available | Multi-participant |
Base FillModel parameters:
- prob_fill_on_limit (default 1.0) - probability a limit order fills when its price is touched but not crossed. Models queue position probabilistically. 0.0 = back of queue, 0.5 = middle, 1.0 = front.
- prob_slippage (default 0.0) - probability of one tick of slippage per fill. Only applies on L1 data (quotes, trades, bars). Affects all takers.
- random_seed - pins the model’s PRNG for reproducibility. Doc says same-process reruns “are expected to match”; cross-process reruns may differ in rare cases due to hash-ordering effects outside the fill model.
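
To make the role of the seed concrete, here is a minimal stand-in that mirrors the three parameters with a private seeded PRNG. ToyFillModel and touch_outcomes are illustrative names, and the comparison logic is an assumption about how such probabilities are typically applied, not the library's code.

```python
import random


class ToyFillModel:
    """Illustrative stand-in, not the NautilusTrader FillModel class."""

    def __init__(self, prob_fill_on_limit=1.0, prob_slippage=0.0, random_seed=None):
        self.prob_fill_on_limit = prob_fill_on_limit
        self.prob_slippage = prob_slippage
        self._rng = random.Random(random_seed)  # private PRNG, pinned by the seed

    def is_limit_filled(self) -> bool:
        # Price touched but not crossed: fill with the configured probability.
        return self._rng.random() < self.prob_fill_on_limit

    def is_slipped(self) -> bool:
        # One tick of slippage with the configured probability.
        return self._rng.random() < self.prob_slippage


def touch_outcomes(seed: int, n: int = 10) -> list:
    model = ToyFillModel(prob_fill_on_limit=0.5, random_seed=seed)
    return [model.is_limit_filled() for _ in range(n)]


# Same seed, same sequence of fill decisions.
print(touch_outcomes(42) == touch_outcomes(42))  # True
```

Because each model owns its own random.Random instance, unrelated randomness elsewhere in the process cannot shift the fill sequence.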
Order book simulation models override
get_orderbook_for_fill_simulation() to generate a synthetic book.
When a custom model returns a book, liquidity_consumption tracking is
not applied - the model owns its own liquidity simulation.
Slippage and spread by data type
- L2/L3: slippage emerges naturally from book traversal. Market orders walk levels; prob_slippage is unused.
- L1 (quotes / trades / bars): the simulated book has one level per side. prob_slippage is active. If an order’s residual quantity exceeds top-of-book liquidity, market and marketable-limit orders slip one tick to fill.
- Bars: OHLC processes via four price points (Open / High / Low / Close), volume split 25% per point. Sequencing is fixed (O→H→L→C) by default; with bar_adaptive_high_low_ordering=True, the engine estimates the most likely H/L sequence based on whether Open is closer to High or Low. Doc cites “~75-85% accuracy” for this heuristic vs ~50% statistical accuracy with fixed ordering. Critical when both TP and SL fall inside the same bar - sequencing decides which fills.
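
The bar decomposition can be sketched as follows. This is a rough illustration under one stated assumption: when Open sits nearer the Low, the Low is visited first, and otherwise the fixed order is kept. The function name bar_price_points and the tie handling are hypothetical; check the engine's actual rule before relying on it.

```python
def bar_price_points(open_, high, low, close, volume, adaptive=False):
    """Return [(price, volume_share), ...] in the assumed processing order."""
    # Assumption: Open nearer the Low means the Low was probably hit first.
    # Ties and the non-adaptive case keep the fixed O -> H -> L -> C order.
    low_first = adaptive and (open_ - low) < (high - open_)
    middle = [low, high] if low_first else [high, low]
    share = volume / 4  # volume split 25% per price point
    return [(price, share) for price in (open_, *middle, close)]


print(bar_price_points(100.0, 104.0, 99.5, 103.0, 400, adaptive=True))
# [(100.0, 100.0), (99.5, 100.0), (104.0, 100.0), (103.0, 100.0)]
```

With a take-profit at 103.5 and a stop at 99.75 inside this bar, the fixed order would hit the take-profit first, while the adaptive order hits the stop first; that is the sequencing sensitivity described above.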
Stop order behavior with bar data
The matching engine distinguishes:
- Gap scenario (bar opens past trigger) - stop fills at the open price. Models real exchange gap behavior (no price guarantee).
- Move-through scenario (bar opens normally, then H/L moves through trigger) - stop fills at the trigger price. Assumes orderly movement, no gap slippage.
This caps modeled slippage during orderly moves while preserving gap realism. For tick-level precision, use quote/trade ticks instead of bars.
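
The two scenarios reduce to a small decision rule. The sketch below encodes that rule for a single bar; stop_fill_price is a hypothetical helper, and treating an open exactly at the trigger as a gap is an assumption (the resulting price is the same either way).

```python
from typing import Optional


def stop_fill_price(side: str, trigger: float, open_: float,
                    high: float, low: float) -> Optional[float]:
    """Fill price for a stop order against one bar, or None if not triggered."""
    if side == "BUY":
        if open_ >= trigger:   # gap: bar opens past the trigger
            return open_
        if high >= trigger:    # move-through: trigger reached inside the bar
            return trigger
    elif side == "SELL":
        if open_ <= trigger:   # gap down through the trigger
            return open_
        if low <= trigger:     # orderly move down to the trigger
            return trigger
    return None


print(stop_fill_price("SELL", trigger=100.0, open_=98.0, high=99.0, low=97.5))    # 98.0 (gap)
print(stop_fill_price("SELL", trigger=100.0, open_=101.0, high=101.5, low=99.0))  # 100.0 (move-through)
```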
Order book immutability and liquidity_consumption
Historical book data is immutable. When your order fills, the book itself is not modified. Nautilus offers a per-level consumption tracker:
- liquidity_consumption=False (default) - each iteration fills against full book liquidity independently. Simpler; assumes you’re a small participant.
- liquidity_consumption=True - tracks (original_size, consumed) per price level. Resets when fresh data arrives at that level. Prevents the same displayed liquidity from generating multiple fills.
For passive limit orders on L1 data: with consumption tracking, the order fills only against displayed liquidity at each new quote, with remainder open. Without it, the engine assumes “market crossed your price → there must have been enough liquidity → fill it all.”
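
A per-level tracker of this kind might look like the toy below. It only illustrates the (original_size, consumed) bookkeeping and the reset on fresh data; LevelConsumption and its method names are hypothetical and do not come from NautilusTrader.

```python
class LevelConsumption:
    """Track (original_size, consumed) per price level; a toy, not Nautilus code."""

    def __init__(self):
        self._levels = {}  # price -> [original_size, consumed]

    def on_book_update(self, price: float, size: float) -> None:
        # Fresh data at a level resets its consumption.
        self._levels[price] = [size, 0.0]

    def fill(self, price: float, wanted: float) -> float:
        original, consumed = self._levels.get(price, [0.0, 0.0])
        filled = max(min(wanted, original - consumed), 0.0)
        if filled > 0:
            self._levels[price][1] = consumed + filled
        return filled


tracker = LevelConsumption()
tracker.on_book_update(100.0, 10.0)
print(tracker.fill(100.0, 8.0))  # 8.0
print(tracker.fill(100.0, 8.0))  # 2.0: only the unconsumed remainder is left
tracker.on_book_update(100.0, 10.0)
print(tracker.fill(100.0, 8.0))  # 8.0 again after fresh data
```

Without the consumed counter, the second call would also return 8.0, which is the double-counting that liquidity_consumption=True is meant to prevent.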
Trade-driven execution and queue position
trade_execution=True (default) lets trade ticks trigger matching. The
matching core uses a “transient override” - temporarily adjusts its
internal best bid/ask toward the trade price so resting passive orders
can cross. The underlying book is never modified.
queue_position=True (alongside trade_execution=True) tracks queue
position for limit orders. On placement, the engine snapshots same-side
depth at the order’s price level. Trade ticks decrement
quantity-ahead. Order becomes fill-eligible only when ahead reaches
zero. Order modification resets queue (back of new level).
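
The queue-position rule can be sketched with a single counter. This toy assumes that only trades at the order's own price reduce the quantity ahead and ignores what happens to any excess trade size; QueuedLimitOrder is a hypothetical name, not an engine class.

```python
class QueuedLimitOrder:
    """Toy queue-position tracker for one resting limit order."""

    def __init__(self, price: float, depth_ahead: float):
        self.price = price
        self.ahead = depth_ahead  # same-side depth snapshotted at placement

    def on_trade(self, price: float, size: float) -> None:
        # Assumption: trades at our price consume the quantity queued ahead.
        if price == self.price:
            self.ahead = max(0.0, self.ahead - size)

    @property
    def fill_eligible(self) -> bool:
        return self.ahead == 0.0

    def modify(self, new_price: float, depth_at_new_level: float) -> None:
        # Modification sends the order to the back of the new level.
        self.price, self.ahead = new_price, depth_at_new_level


order = QueuedLimitOrder(price=100.0, depth_ahead=30.0)
order.on_trade(100.0, 20.0)
print(order.fill_eligible)  # False, 10 still ahead
order.on_trade(100.0, 15.0)
print(order.fill_eligible)  # True
```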
Price protection
Models exchanges like Binance / CME that filter excessively aggressive
fills. price_protection_points=N on BacktestVenueConfig defines a
boundary computed at fill time:
- BUY: protection_price = ask + (N × price_increment)
- SELL: protection_price = bid - (N × price_increment)
Affects MARKET and STOP_MARKET orders. Set to 0 to disable.
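
As a worked example of the two formulas, the helper below computes the boundary with Decimal arithmetic. The function name protection_price and the sample quote are illustrative only.

```python
from decimal import Decimal


def protection_price(side: str, bid: Decimal, ask: Decimal,
                     points: int, price_increment: Decimal) -> Decimal:
    """Boundary beyond which an aggressive fill would be filtered."""
    if side == "BUY":
        return ask + points * price_increment
    return bid - points * price_increment


bid, ask, tick = Decimal("99.99"), Decimal("100.01"), Decimal("0.01")
print(protection_price("BUY", bid, ask, 5, tick))   # 100.06
print(protection_price("SELL", bid, ask, 5, tick))  # 99.94
```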
Fee and margin models
Fee model: MakerTakerFeeModel, FixedFeeModel, or custom subclass of
FeeModel. Sign convention: positive fee rate = commission, negative =
rebate.
Margin: configure on BacktestVenueConfig via MarginModelConfig.
- model_type="leveraged" (default) - margin reduced by leverage.
- model_type="standard" - fixed percentages (traditional brokers).
- Custom: fully-qualified class path "my_package.my_module:MyMarginModel".
Multi-instrument backtests
A single BacktestEngine can host many instruments. The performance
trap to remember: add_data() with sort=True (default) re-sorts the
entire stream on every call. For N instruments with M bars each, naive
loading performs N sorts of growing size, up to roughly
O(N²M log(NM)) comparisons in total.
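
A back-of-the-envelope count shows the gap. The sketch below only tallies how many elements pass through a sort while loading; it assumes every add_data() call with sort=True touches the whole accumulated stream, as described above, and items_sorted is a hypothetical helper.

```python
def items_sorted(num_instruments: int, bars_each: int, sort_every_call: bool) -> int:
    """Total number of elements passed through a sort while loading."""
    if sort_every_call:
        # Call k re-sorts everything accumulated so far: k * bars_each items.
        return sum(k * bars_each for k in range(1, num_instruments + 1))
    return num_instruments * bars_each  # one sort over the full stream


print(items_sorted(10, 1_000_000, sort_every_call=True))   # 55000000
print(items_sorted(10, 1_000_000, sort_every_call=False))  # 10000000
```

For the 10 instruments × 1M bars case, deferring the sort cuts the sorted volume by 5.5×, and the ratio grows linearly with the number of instruments.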
Recommended pattern:

from nautilus_trader.backtest.engine import BacktestEngine

engine = BacktestEngine()
engine.add_venue(...)

# Instruments must be registered before their data.
engine.add_instrument(instr1)
engine.add_instrument(instr2)
engine.add_instrument(instr3)

# Defer sorting while loading.
engine.add_data(instr1_bars, sort=False)
engine.add_data(instr2_bars, sort=False)
engine.add_data(instr3_bars, sort=False)
engine.sort_data()  # one sort at the end

engine.add_strategy(strategy)
engine.run()

Or batch first, sort once:

all_bars = []
all_bars.extend(instr1_bars)
all_bars.extend(instr2_bars)
all_bars.extend(instr3_bars)
engine.add_data(all_bars, sort=True)  # single call, single sort

Streaming for datasets larger than RAM
Two streaming patterns:
Automatic chunking - supply a generator that yields batches; the
engine pulls lazily during run():

def data_generator():
    # Each yield hands the engine one batch of data.
    yield load_chunk_1()
    yield load_chunk_2()
    yield load_chunk_3()

engine.add_data_iterator(
    data_name="my_data_stream",
    generator=data_generator(),
)
engine.run()

Manual chunking - load and run each batch yourself. This is the
pattern BacktestNode uses internally:

engine.add_strategy(strategy)
for batch in data_batches:
    engine.add_data(batch)
    engine.run(streaming=True)
    engine.clear_data()  # free the batch before loading the next one
engine.end()  # flushes deferred timers, stops engines, produces results

In streaming mode, timer advancement stops when data exhausts for each
batch. Timers scheduled past the last data point are deferred until
more data arrives or end() is called.
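
Either streaming pattern needs something that produces batches. A generic batching generator could look like the sketch below; chunked is a hypothetical helper built on the standard library and assumes the source iterable is already in ts_init order.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(records: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield successive lists of at most chunk_size records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Only one chunk is materialized at a time, so memory stays bounded.
print([len(c) for c in chunked(range(10), 4)])  # [4, 4, 2]
```

A generator like this could serve as the generator argument for automatic chunking or as the data_batches source in the manual loop, provided chunk boundaries do not reorder the stream.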
Distributed and parallel runs
The doc describes BacktestNode as orchestrating “multiple
BacktestEngine instances.” Multiple BacktestRunConfig objects in a
list run sequentially in a single node. The doc does not, in this page,
go deeper into multi-process or multi-host parallelism - that is
deferred to live-trading and (presumably) deployment guides. For M2,
parallel MK2/MK3 likely means either (a) two BacktestRunConfig in one
BacktestNode call, or (b) two separate Python processes each running
one config. Both are clean.
Deterministic replay guarantees
The doc’s deterministic guarantees:
- Data is sorted monotonically by ts_init.
- The matching loop drains cascading commands within the same timestamp, so order of submission within a single tick does not produce nondeterministic outcomes - all commands at T settle before T+1.
- Timer events fire deterministically because the engine’s Clock is data-driven, not wall-clock-driven.
- random_seed on FillModel pins the probabilistic PRNG.
The doc explicitly hedges:
Reproducible results: A fixed random_seed pins the probabilistic fill model’s PRNG. Same-process reruns are expected to match; cross-process reruns may differ in rare cases due to hash-ordering effects outside the fill model.
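This is honest about the boundary - hash-based ordering (for example,
iterating a set of strings, which depends on the per-process hash seed)
can differ between processes, whereas dict iteration follows insertion
order. For audit-grade reproducibility across machines, use Rust-only
DST (see nautilus-concepts.md DST section).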
How parity is maintained with live
This is the structural argument. Backtest and live share:
- Same Kernel - both modes instantiate the same kernel object.
- Same MessageBus - same topic naming, same dispatch, same subscribe semantics.
- Same Cache - cache-then-publish for data, identical query API.
- Same Strategy / Actor lifecycle - on_start, on_quote_tick, on_bar, on_event fire the same way.
- Same Clock interface - strategies cannot tell whether they are in simulated or wall-clock time. They never call time.time() directly.
- Same OMS / account / position / order semantics - just with a simulated venue instead of an adapter.
- Reconciliation only on the live side - backtest controls both sides of the trade ledger, so there is nothing to reconcile against.
What differs:
- Clock source - data-driven in backtest, wall-clock in live.
- Venue - SimulatedExchange in backtest, adapter to a real venue in live.
- Fill realism - FillModel in backtest, real venue matching in live. This is the main honest gap: a fill model is a model, not the thing itself.
Failure mode the parity argument does not prevent: if your
FillModel is too optimistic, your backtest P&L will be too good. The
parity guarantee is about infrastructure, not market microstructure.
Cortana MK3 implications
Step 6 (Saturday): replay decisions.db
Path: decisions.db → DataLoader → in-memory list → BacktestEngine.
Concrete shape:
- Load 15 of today’s scoring events from SQLite.
- Wrap each as a Nautilus Bar or custom Data subclass with ts_init set to the scoring event timestamp.
- engine.add_venue(...) with book_type=L1_MBP and a basic FillModel; then engine.add_data(events, sort=True).
- engine.add_strategy(MK2EquivalentStrategy()); engine.run(); capture decisions.
- engine.reset(); engine.add_strategy(MK3CandidateStrategy()); engine.run(); capture decisions.
- Diff. Pass criterion: ≥60% decision parity with MK2.
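
The diff step could be as simple as a positional comparison. The sketch below assumes both runs emit exactly one decision per scoring event, in the same order; decision_parity and the sample decision labels are hypothetical.

```python
def decision_parity(baseline: list, candidate: list) -> float:
    """Fraction of events on which both runs made the same decision."""
    if len(baseline) != len(candidate):
        raise ValueError("runs must cover the same events")
    if not baseline:
        return 1.0
    matches = sum(a == b for a, b in zip(baseline, candidate))
    return matches / len(baseline)


mk2 = ["BUY", "HOLD", "SELL", "HOLD", "BUY"]
mk3 = ["BUY", "HOLD", "HOLD", "HOLD", "BUY"]
parity = decision_parity(mk2, mk3)
print(parity, parity >= 0.60)  # 0.8 True
```

If the two strategies can emit a different number of decisions, keying on event_id rather than position would be the safer comparison.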
Open question: do we go directly through Bar or through a decisions
custom Data type? Bar is more standard, but Cortana scoring events
are richer than OHLC. A custom data type is probably the right shape -
see nautilus-custom-data.md.
Step 0.5 (Saturday): Databento → ParquetDataCatalog
Path: Databento $125 credits → Parquet files → ParquetDataCatalog → BacktestNode → BacktestRunConfig.
This is the production path for M2. Worth proving it works on Saturday
before it becomes a milestone-blocking surprise. The Databento adapter
(per nautilus-data.md) supports
bars_timestamp_on_close=True - set this so ts_init is at bar close,
matching Nautilus’s expectation and avoiding look-ahead bias.
M2: parallel MK2 / MK3 backtests
Two BacktestRunConfig in one BacktestNode.run() call. Each config
points to the same BacktestDataConfig (same Parquet catalog) but uses a
different strategy. Output: two sets of decisions per day; the M2 success
metric is a daily decision diff of <5%.
Carryover #7: ts_init nanosecond-tie ordering
Does the doc resolve it? Only partially.
What the doc says:
- Data is sorted “into monotonic order based on ts_init.”
- time_bars_build_delay (microseconds) addresses one specific edge case: tick data arriving exactly at bar-close timestamp may out-order the bar timer. This is the only nanosecond-tie hint.
- The settle loop guarantees commands at timestamp T finish before T+1.
What the doc does not say:
- How ties at the same ts_init resolve when multiple Data objects share a nanosecond. Python’s sorted() is stable, so insertion order wins, but the doc never makes this guarantee explicit. The Rust path may behave differently.
- Whether ts_init ties between trades and quotes have a defined priority.
For SQLite → Parquet replay, this is exactly the question. If two
scoring events share a nanosecond (timestamp truncation in SQLite is a
real risk; SQLite stores REAL or INTEGER and millisecond precision is
common), their relative order at BacktestEngine enqueue time decides
which strategy callback fires first - which can flip a decision.
Mitigation for the spike: pre-sort by (ts_init, source_priority, event_id) at DataLoader time, ensuring stable ordering before handing
to the engine. Treat this as a known unresolved upstream question -
worth a follow-up issue (“Document Nautilus tie-breaking semantics for
ts_init collisions”) and a RuntimeError in our DataLoader if we
detect a real tie at nanosecond resolution.
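
The mitigation might be sketched as follows. ScoringEvent is a hypothetical shape for a decisions.db row, and presort and fail_on_tie are illustrative names; the only technique shown is a total-order sort key plus an optional collision check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:       # hypothetical shape of a decisions.db row
    ts_init: int          # nanoseconds since epoch
    source_priority: int  # lower sorts first
    event_id: int


def presort(events, fail_on_tie=False):
    """Sort deterministically and optionally reject ts_init collisions."""
    ordered = sorted(events, key=lambda e: (e.ts_init, e.source_priority, e.event_id))
    if fail_on_tie:
        for prev, cur in zip(ordered, ordered[1:]):
            if prev.ts_init == cur.ts_init:
                raise RuntimeError(
                    f"ts_init tie at {cur.ts_init}: "
                    f"events {prev.event_id} and {cur.event_id}"
                )
    return ordered


events = [ScoringEvent(1_000, 1, 7), ScoringEvent(1_000, 0, 9), ScoringEvent(500, 0, 3)]
print([e.event_id for e in presort(events)])  # [3, 9, 7]
```

Because the key is a total order, the result no longer depends on the row order SQLite happens to return, which is the property the replay needs.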
Precision invariants
Strict enforcement throughout the fill pipeline. All prices and
quantities must match instrument.price_precision /
instrument.size_precision. Mismatches raise RuntimeError
immediately. Use instrument.make_price(raw) and
instrument.make_qty(raw) to coerce raw values to the instrument's
precision.
This applies to: QuoteTick (bid/ask price + size), TradeTick (price + size), Bar (OHLC + volume in base currency units), Order (quantity, price, trigger_price, activation_price), Order updates, Fills.
Bar volume must be in base currency units. Some data providers
report quote-currency volume - convert before loading.
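
A loader-side guard can catch precision problems before the engine does. The sketch below is a stand-alone approximation using Decimal: it rejects values with more decimal places than allowed and pads shorter ones. check_precision is a hypothetical helper, and raising RuntimeError here merely mirrors the behavior described above.

```python
from decimal import Decimal


def check_precision(value: str, precision: int) -> Decimal:
    """Raise if value carries more decimal places than the instrument allows."""
    dec = Decimal(value)
    quantum = Decimal(1).scaleb(-precision)  # e.g. precision=2 -> 0.01
    quantized = dec.quantize(quantum)
    if dec != quantized:
        raise RuntimeError(f"{value} exceeds precision {precision}")
    return quantized


print(check_precision("101.25", 2))  # 101.25
print(check_precision("101.2", 2))   # 101.20
```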
Best practices summary
- Production batch: BacktestNode with config objects.
- Parameter sweep: BacktestEngine with reset().
- Many instruments: add_data(..., sort=False) then one sort_data().
- Big data: add_data_iterator() (auto chunks) or manual chunking + run(streaming=True) + end().
- Bar data: bar_adaptive_high_low_ordering=True if TP/SL inside same bar matters.
- Reproducibility: pin random_seed on FillModel; same-process only.
- Realism: pick a FillModel that matches your strategy’s market-impact assumptions. For SPY 0DTE Cortana, ThreeTierFillModel (50/30/20 across three ticks) or OneTickSlippageFillModel are the closest matches - Cortana trades single contracts but 0DTE options often slip one tick. Recommend ThreeTierFillModel for the Saturday spike: it’s pessimistic enough to surface MK3-vs-MK2 differences that aren’t artifacts of optimistic fills. Switch to OneTickSlippageFillModel if ThreeTierFillModel proves too conservative for IBKR SmartRouting reality.
See Also
- Nautilus Concepts (overview) - BacktestEngine vs BacktestNode summary at lines 426-456.
- Nautilus Architecture - Kernel, message bus, the structural backtest=live argument.
- Nautilus Data - Bar, QuoteTick, TradeTick types and ParquetDataCatalog.
- Nautilus Custom Data - extending Data for Cortana scoring events.
- Nautilus Execution - OMS, latency model, reconciliation.
- Nautilus Strategies - on_data callbacks, same shape backtest and live.
- Databento vs UW vs IBKR data feeds - why Databento is the M2 catalog source.
Timeline
- 2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 3.