Nautilus Databento Integration
NautilusTrader ships an out-of-the-box Databento adapter built on the official databento-rs crate (no separate databento Python package install needed - the Rust client links during the wheel build). It is data-only: DatabentoDataLoader (DBN file → Nautilus objects), DatabentoHistoricalClient (REST fetch of historical data), DatabentoLiveClient (raw-TCP live stream), DatabentoInstrumentProvider (definitions), and DatabentoDataClient (the LiveMarketDataClient wired into a TradingNode). Pair it with IBKR for execution; pricing-of-record stays on IBKR.

Fourteen schema families are supported (MBO/MBP-1/MBP-10/BBO/CBBO/CMBP-1/TBBO/TCBBO/TRADES/OHLCV/DEFINITION/IMBALANCE/STATISTICS/STATUS), twenty schemas once the BBO, CBBO, and OHLCV interval variants are counted. They decode to native Nautilus types (OrderBookDelta, QuoteTick, TradeTick, OrderBookDepth10, Bar, instrument variants, plus PyO3-only DatabentoImbalance / DatabentoStatistics).

The doc explicitly recommends decoding DBN to the ParquetDataCatalog once, then streaming via BacktestNode - at least an order of magnitude faster than re-decoding DBN per run. $125 free credits are advertised on the doc itself; the metadata.get_cost HTTP endpoint is the documented cost-estimation pre-flight.

For Cortana MK3, this is the page that powers spike Step 0.5 - pull SPY OPRA Trades + MBP-1 for one trading day, ingest into ~/cortana-data/catalog/, query, cross-check against decisions.db.
This page complements
Nautilus Adapters (Databento was named the closest
data-only reference adapter for the UW WebSocket adapter MK3 must build),
Nautilus Data Model (how ParquetDataCatalog and
custom-data types work), Nautilus Backtesting
(catalog → BacktestNode path), Nautilus Tutorials
(the “Data Catalog with Databento” how-to that the spike plan flagged as
404 on 2026-05-06), and the data-feed layering decision in
databento-vs-uw-vs-ibkr-data-feeds.md.
Core claim
The Databento adapter is the canonical historical-OPRA replay path for any Nautilus user who needs equity-options tape - and the closest reference in the codebase for Cortana’s UW adapter shape. Three operational primitives are independently usable:
- DatabentoDataLoader.from_dbn_file(...) - pure offline DBN-to-Nautilus decoder. No API key needed at decode time. Rust under the hood, exposed as Python.
- DatabentoHistoricalClient - HTTP REST fetcher; needs API key; pulls historical DBN data and definitions on demand.
- DatabentoLiveClient / DatabentoDataClient - raw-TCP live stream wired into a TradingNode via a factory.
The doc treats DBN-on-disk → catalog conversion as a first-class production pattern: pay decode cost once, query Parquet forever.
Adapter component map
Per the Overview section of the doc:
| Class | Role | Used directly? |
|---|---|---|
| DatabentoDataLoader | Loads DBN data from files; converts to Nautilus objects (offline). | Yes - most common entry point for Cortana MK3 spike. |
| DatabentoInstrumentProvider | Fetches latest or historical instrument definitions via Databento HTTP API. | Sometimes - usually wired into TradingNode and fed via config. |
| DatabentoHistoricalClient | Fetches historical market data via Databento HTTP API. | Sometimes - most users prefer DBN-on-disk via the CLI then DatabentoDataLoader. |
| DatabentoLiveClient | Subscribes to real-time data feeds via Databento’s raw TCP API. | Rare - wrapped by DatabentoDataClient. |
| DatabentoDataClient | LiveMarketDataClient impl for live trading nodes. | Indirectly - instantiated by DatabentoLiveDataClientFactory from config. |
Doc verbatim: “Most users configure a live trading node (covered below)
and do not work with these components directly.” For Cortana MK3 spike
Step 0.5 the direct path is the simplest: CLI-pull DBN files,
DatabentoDataLoader.from_dbn_file(...), write to catalog.
Supported schemas (the table that matters)
Reproduced verbatim from the doc - this is the load-bearing reference for “what schema do I ask Databento for”:
| Databento schema | Nautilus data type | Description |
|---|---|---|
| MBO | OrderBookDelta | Market by order (L3). |
| MBP_1 | (QuoteTick, TradeTick \| None) | Market by price (L1). |
| MBP_10 | OrderBookDepth10 | Market depth (L2). |
| BBO_1S | QuoteTick | 1-second best bid/offer. |
| BBO_1M | QuoteTick | 1-minute best bid/offer. |
| CMBP_1 | (QuoteTick, TradeTick \| None) | Consolidated MBP across venues. |
| CBBO_1S | QuoteTick | Consolidated 1-second BBO. |
| CBBO_1M | QuoteTick | Consolidated 1-minute BBO. |
| TCBBO | (QuoteTick, TradeTick) | Trade-sampled consolidated BBO. |
| TBBO | (QuoteTick, TradeTick) | Trade-sampled best bid/offer. |
| TRADES | TradeTick | Trade ticks. |
| OHLCV_1S | Bar | 1-second bars. |
| OHLCV_1M | Bar | 1-minute bars. |
| OHLCV_1H | Bar | 1-hour bars. |
| OHLCV_1D | Bar | Daily bars. |
| OHLCV_EOD | Bar | End-of-day bars. |
| DEFINITION | Instrument (various types) | Instrument definitions. |
| IMBALANCE | DatabentoImbalance | Auction imbalance data. |
| STATISTICS | DatabentoStatistics | Market statistics. |
| STATUS | InstrumentStatus | Market status updates. |
Schemas Cortana actually cares about
| Cortana use case | Schema | Why |
|---|---|---|
| Spike Step 0.5: SPY OPRA replay | TRADES (every option print) + MBP_1 (top-of-book) | Per spike plan; gets every print + L1 NBBO without paying for L2/L3. |
| Tighter MK3 backtest fidelity (post-spike) | MBP_10 | L2 depth-aware fill simulation; midweek work after spike. |
| Adversarial replay (queue position, exact reconstruction) | MBO | L3, expensive, only after MK3 fill model needs it. |
| 0DTE chain definition resolution | DEFINITION | Required before market data - the doc is explicit. Loads instrument metadata so price precision is correct. |
| Equity SPY underlying tape | TBBO (trade-sampled BBO) or MBP_1+TRADES | TBBO emits both QuoteTick and TradeTick per message - more efficient than two streams. |
| Auction imbalance (open/close cross signals) | IMBALANCE | PyO3-only type (DatabentoImbalance); not streamable through BacktestNode yet - query catalog directly. |
| Market status (halts, opens, closes) | STATUS | Lightweight; useful for Power Hour / open-cross gating. |
Schema-pair pitfalls (verbatim guidance)
- TBBO / TCBBO already include trades. Do not subscribe to a separate trade feed alongside. Doc: “Avoid subscribing to both TBBO/TCBBO and separate trade feeds for the same instrument. These schemas already include trades. Duplicating wastes cost and creates duplicate data.”
- MBP-1 with include_trades=True also emits both QuoteTick and TradeTick. Equivalent to TBBO for L1+trades use cases.
- MBO must be subscribed at node startup. Doc: “MBO subscriptions must be made at node startup for Databento to ensure proper replay from session start.” Subscriptions after start are logged as errors and ignored. Cortana MK3 strategies that subscribe MBO must do it in on_start(), never lazily.
- CMBP_1 / TCBBO have no native trade ID. The decoder derives a deterministic TradeId via FNV-1a hash of (instrument_id, ts_event, ts_recv, price, size, aggressor_side). Same venue event → same TradeId across replays (dedup-safe). Two logically distinct trades with identical fields collide - matches venue’s own inability to distinguish them.
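
To make the synthetic-TradeId behaviour concrete, here is a minimal sketch of 64-bit FNV-1a applied to the six fields listed above. The byte layout (UTF-8 instrument ID followed by little-endian packed integers), the hex formatting, and the name synthetic_trade_id are assumptions for illustration only; the adapter's actual Rust encoding is not described in the doc and may differ.

```python
import struct

# Standard 64-bit FNV-1a constants.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of `data` (XOR the byte, then multiply)."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK_64
    return h


def synthetic_trade_id(
    instrument_id: str,
    ts_event: int,
    ts_recv: int,
    price: int,
    size: int,
    aggressor_side: int,
) -> str:
    """Hypothetical helper: hash the six trade fields into a hex ID.

    The field packing here is an assumed layout, not the adapter's.
    """
    payload = instrument_id.encode("utf-8") + struct.pack(
        "<QQqQB", ts_event, ts_recv, price, size, aggressor_side
    )
    return f"{fnv1a_64(payload):016x}"


a = synthetic_trade_id("SPY.OPRA", 1_000, 1_050, 580_250_000_000, 10, 1)
b = synthetic_trade_id("SPY.OPRA", 1_000, 1_050, 580_250_000_000, 10, 1)
print(a == b)  # identical fields hash to the same ID, hence dedup-safe replays
```

The same property that makes replays dedup-safe also produces the collision noted above: two prints with identical fields are indistinguishable to any hash of those fields.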
Subscription-method-to-schema map (live)
For live TradingNode strategies, Nautilus’s standard subscribe API
maps to Databento schemas with sensible defaults:
| Nautilus method | Default schema | Available schemas | Emits |
|---|---|---|---|
| subscribe_quote_ticks() | mbp-1 | mbp-1, bbo-1s, bbo-1m, cmbp-1, cbbo-1s, cbbo-1m, tbbo, tcbbo | QuoteTick |
| subscribe_trade_ticks() | trades | trades, tbbo, tcbbo, mbp-1, cmbp-1 | TradeTick |
| subscribe_order_book_depth() | mbp-10 | mbp-10 | OrderBookDepth10 |
| subscribe_order_book_deltas() | mbo | mbo | OrderBookDeltas |
| subscribe_bars() | varies | ohlcv-1s, ohlcv-1m, ohlcv-1h, ohlcv-1d | Bar |
Schema overrides go through the params kwarg:
from nautilus_trader.adapters.databento import DATABENTO_CLIENT_ID
from nautilus_trader.model.identifiers import InstrumentId
# Default MBP-1 quotes (may include trades)
self.subscribe_quote_ticks(instrument_id, client_id=DATABENTO_CLIENT_ID)
# Override to TBBO (quotes + trades together, more efficient)
self.subscribe_quote_ticks(
instrument_id=instrument_id,
params={"schema": "tbbo"},
client_id=DATABENTO_CLIENT_ID,
)
# 1-second BBO snapshots only
self.subscribe_quote_ticks(
instrument_id=instrument_id,
params={"schema": "bbo-1s"},
client_id=DATABENTO_CLIENT_ID,
)

For order book depth (depth must be 10 for Databento):
self.subscribe_order_book_depth(
instrument_id=instrument_id,
depth=10, # Required value - selects MBP-10 schema
)

For MBO (must be in on_start()):
from nautilus_trader.model.enums import BookType
def on_start(self) -> None:
    self.subscribe_order_book_deltas(
        instrument_id=instrument_id,
        book_type=BookType.L3_MBO,  # Selects MBO schema
    )

For custom data types (DatabentoImbalance, DatabentoStatistics, InstrumentStatus):
from nautilus_trader.adapters.databento import (
DATABENTO_CLIENT_ID,
DatabentoImbalance,
DatabentoStatistics,
)
from nautilus_trader.model import DataType
self.subscribe_data(
data_type=DataType(DatabentoImbalance, metadata={"instrument_id": instrument_id}),
client_id=DATABENTO_CLIENT_ID,
)

Instrument IDs and symbology - the load-bearing answer for Cortana
This is the section the spike plan asked the doc to resolve. The
answer: Databento’s instrument_id (an integer assigned by source
venue or by Databento) is distinct from the Nautilus
InstrumentId string. The decoder maps:
- Databento raw_symbol → Nautilus symbol
- Databento ISO 10383 MIC (Market Identifier Code) from the DEFINITION message → Nautilus venue
- Together: "{symbol}.{venue}" - e.g., AAPL.XNAS
Critical for SPY: equity vs OPRA conventions
The doc covers GLBX (CME) explicitly but not OPRA explicitly. From the patterns the doc does establish:
- SPY equity lives at XNAS (NASDAQ) or XNYS (NYSE/ARCA primary listing - SPY’s true primary is ARCA, MIC ARCX, but Cortana code has been using SPY.ARCA per the existing nautilus-adapters.md sketch). The doc notes “Other venue MICs are in the venue field of responses from the metadata.list_publishers endpoint.”
- SPY options on OPRA publish under their own MIC. The exact MIC for OPRA-published SPY options is either OPRA (the consolidated tape) or one of the participating exchange MICs (XCBO for CBOE, XISE for ISE, etc.), depending on use_exchange_as_venue. Verify on Saturday by calling metadata.list_publishers against your $125 credit account and grepping for SPY - this is the canonical resolution per the doc.
- The adapter has a use_exchange_as_venue config flag (default True): “Use the exchange MIC for Nautilus venues (e.g., XCME). False retains the default GLBX mapping.” For OPRA, the analogous question is “do options come back per-exchange (XCBO/XISE/etc.) or consolidated under OPRA?” Default is per-exchange MIC; the spike plan should leave default and observe what instrument_id strings appear in the catalog.
CME Globex example (verbatim) - pattern to mirror for OPRA
For CME Globex MDP 3.0 (GLBX.MDP3), these exchanges group under the
GLBX venue. The instrument's exchange field determines the mapping:
CBCM, NYUM, XCBT, XCEC, XCME, XFXS, XNYM
The OPRA equivalent: the OPRA dataset publishes from many participating-exchange MICs, and the adapter normalizes them per use_exchange_as_venue. Documented dataset codes in the spike-relevant universe:
| Dataset code | Publisher / coverage | Cortana relevance |
|---|---|---|
| GLBX.MDP3 | CME Globex | Future use (ES options) |
| XNAS.ITCH | NASDAQ ITCH | SPY-on-NASDAQ underlying ticks if we ever route there |
| OPRA.PILLAR (assumed name; verify) | OPRA consolidated options tape | Spike Step 0.5 target. Pulls every SPY option print across all participating exchanges. |
| DBEQ.BASIC | Databento Equities Basic (consolidated NBBO) | SPY underlying NBBO - alternative to IBKR for backtest |
(Run databento datasets CLI on Saturday to print authoritative codes.)
Timestamps - the mapping Cortana needs for replay alignment
Databento timestamp fields:
| Field | Meaning |
|---|---|
| ts_event | Matching-engine-received timestamp (ns since epoch). |
| ts_in_delta | Matching-engine-sending delta (ns before ts_recv). |
| ts_recv | Capture-server-received timestamp (ns since epoch). |
| ts_out | Databento sending timestamp. |
The decoder maps Databento ts_recv → Nautilus ts_event. Doc:
“This timestamp is more reliable and monotonically increases per
instrument.”
Exceptions: DatabentoImbalance and DatabentoStatistics carry all
four timestamp fields verbatim because they’re adapter-specific PyO3
types.
Cross-check pattern for spike Step 0.5
To confirm a decisions.db row aligns with a Databento print:
- Pull the decisions.db.scoring_events.spy_price_at_score and its timestamp (millisecond resolution in current Cortana).
- Multiply by 1e6 → nanoseconds.
- catalog.query(TradeTick, identifiers=["SPY.ARCA"], start=ts_ns - 100_000_000, end=ts_ns + 100_000_000) to pull a 200ms window around it.
- Cross-check price magnitude (within typical bid-ask spread). If the IBKR-recorded price falls within the Databento NBBO at that instant, alignment passes.
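
The arithmetic in the steps above can be sketched in plain Python. The helper names replay_window_ns and price_within_nbbo, and the tolerance parameter, are hypothetical and not part of Cortana or Nautilus; the sketch assumes the decisions.db timestamp really is in milliseconds.

```python
NS_PER_MS = 1_000_000  # 1 ms = 1e6 ns


def replay_window_ns(ts_ms: int, half_width_ms: int = 100) -> tuple[int, int]:
    """Convert a millisecond timestamp into a (start, end) nanosecond window."""
    center_ns = ts_ms * NS_PER_MS
    half_ns = half_width_ms * NS_PER_MS
    return center_ns - half_ns, center_ns + half_ns


def price_within_nbbo(price: float, bid: float, ask: float, tolerance: float = 0.0) -> bool:
    """Alignment passes when the recorded price sits inside [bid, ask], padded by tolerance."""
    return bid - tolerance <= price <= ask + tolerance


start_ns, end_ns = replay_window_ns(1_000)  # 1 s after epoch, in ms
print(start_ns, end_ns)                     # 200 ms wide window
print(price_within_nbbo(580.25, bid=580.24, ask=580.26))
```

The resulting start_ns and end_ns are the values that would be passed as start= and end= to catalog.query.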
Tie-breaker note: Nautilus stable-sorts by ts_init, not ts_event
(see nautilus-data.md). Databento’s mapping
ts_recv → ts_event means catalogs are deterministic on a per-
instrument basis (since ts_recv monotonically increases per
instrument), but cross-instrument ordering at the same nanosecond
needs the same caveat as nautilus-backtesting.md Carryover #7.
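
One way to make same-nanosecond ordering reproducible is to pre-sort on a composite key before writing. The sketch below uses a stand-in Tick tuple; the sequence field and the choice of instrument_id as the final tie-breaker are assumptions, not something the adapter or the doc prescribes.

```python
from typing import NamedTuple


class Tick(NamedTuple):
    """Stand-in record; real ticks would be Nautilus objects."""
    ts_event: int       # nanoseconds since epoch
    sequence: int       # assumed venue/feed sequence number
    instrument_id: str


def presort(ticks: list[Tick]) -> list[Tick]:
    """Order by time first, then by sequence and instrument to break ties."""
    return sorted(ticks, key=lambda t: (t.ts_event, t.sequence, t.instrument_id))


ticks = [
    Tick(100, 2, "B.OPRA"),
    Tick(100, 1, "A.OPRA"),
    Tick(99, 5, "C.OPRA"),
]
ordered = presort(ticks)
print([t.instrument_id for t in ordered])
```

Because the key is total over the three fields, the output does not depend on the order in which ticks arrived, which is what a deterministic replay needs.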
Price precision
Databento raw prices are fixed-point integers scaled by 1e-9. The adapter derives precision from the instrument’s tick size in the DEFINITION message.
For live feeds: a per-instrument precision map is populated from
InstrumentDefMsg records as they arrive. Definitions must arrive
before market data for correct precision on instruments with
non-standard tick sizes (e.g., treasury futures with 1/256 ticks).
Without a prior definition, precision falls back to 2 (USD default) - fine for SPY-OPRA, where penny ticks dominate, but verify on the spike.
The Python adapter automatically subscribes to instrument definitions before market data - no extra config. The Rust client does not; you must subscribe DEFINITION first.
For historical/file-based loading: pass an explicit
price_precision= parameter to from_dbn_file to override.
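
As a rough illustration of the fixed-point scheme, the sketch below converts a raw 1e-9-scaled integer to a decimal price and derives a precision from a tick size. The function names are hypothetical and the adapter's own derivation logic may differ; this only shows the arithmetic.

```python
from decimal import Decimal

FIXED_SCALE = Decimal("1e-9")  # raw prices are integers in units of 1e-9


def precision_from_tick(tick_size: Decimal) -> int:
    """Number of decimal places needed to represent the tick size exactly."""
    exponent = tick_size.normalize().as_tuple().exponent
    return max(0, -exponent)


def decode_price(raw: int, precision: int = 2) -> Decimal:
    """Scale a raw fixed-point integer and round to the given precision."""
    value = Decimal(raw) * FIXED_SCALE
    return value.quantize(Decimal(1).scaleb(-precision))


print(decode_price(580_250_000_000))                 # penny-tick option price
print(precision_from_tick(Decimal("0.01")))          # penny tick
print(precision_from_tick(Decimal(1) / Decimal(256)))  # 1/256 treasury-style tick
```

The 1/256 case shows why the precision-2 fallback is unsafe for non-standard tick sizes: rounding to two places would discard most of the price.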
Catalog persistence - the Cortana MK3 ingest pattern
This is the section spike Step 0.5 lives on. The doc’s recommended flow:
from nautilus_trader.adapters.databento import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog
catalog = ParquetDataCatalog.from_env() # uses NAUTILUS_PATH
loader = DatabentoDataLoader()
# STEP 1: Load instrument definitions FIRST (required)
instruments = loader.from_dbn_file(
path="spy-opra-definition.dbn.zst",
as_legacy_cython=False, # PyO3 for performance
)
catalog.write_data(instruments)
# STEP 2: Load market data (trades, quotes, etc.)
instrument_id = InstrumentId.from_str("SPY.OPRA") # verify MIC on Saturday
trades = loader.from_dbn_file(
path="spy-opra-trades.dbn.zst",
instrument_id=instrument_id,
as_legacy_cython=False,
)
catalog.write_data(trades)
# Verify
print(catalog.instruments())  # empty list = step 1 missing

from_dbn_file parameters (verbatim)
| Param | Effect |
|---|---|
| path | DBN file path (.dbn or .dbn.zst). |
| instrument_id | Speeds up decoding by skipping symbology lookup. Optional but recommended. |
| price_precision | Overrides the default price precision (default: 2). |
| include_trades | For MBP-1 / CMBP-1: True emits both QuoteTick and TradeTick when trade data is present in the message. |
| as_legacy_cython | True (default) for legacy Cython types compatible with BacktestEngine. False required for IMBALANCE / STATISTICS (PyO3-only). False also faster for catalog writes. |
Doc warnings that bite
- Empty instruments list → DEFINITION files missing. The catalog rejects market data writes if no instruments are present (or the query returns garbage precision).
- Market data files do not contain instrument definitions. You must obtain DEFINITION schema files separately from Databento for your symbols and date ranges. Per the spike plan: pull SPY OPRA DEFINITION first, then TRADES + MBP-1.
- IMBALANCE / STATISTICS are PyO3-only. Streaming them through BacktestNode / BacktestEngine is not yet supported - query the catalog directly and process in your strategy/analysis code.
- TBBO / TCBBO double-counting: if include_trades=True you get quotes; include_trades=False you get trades. Two separate calls needed if you want both back as native types from one TCBBO file.
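
A small guard can turn the "empty instruments list" warning into a hard failure at ingest time. The sketch below is duck-typed: it only assumes the catalog object exposes the instruments() and write_data() methods used elsewhere on this page. The wrapper name and exception class are hypothetical.

```python
class MissingDefinitionsError(RuntimeError):
    """Raised when market data is written before any instrument definitions."""


def write_market_data(catalog, data) -> int:
    """Write `data` only if the catalog already holds instrument definitions.

    `catalog` is any object with instruments() and write_data(); in practice
    this would be a ParquetDataCatalog.
    """
    if not catalog.instruments():
        raise MissingDefinitionsError(
            "catalog has no instruments; load the DEFINITION file first"
        )
    catalog.write_data(data)
    return len(data)
```

Calling this wrapper for the TRADES and MBP-1 writes means a skipped DEFINITION step fails loudly instead of surfacing later as bad precision.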
Performance: DBN-on-disk vs catalog-on-disk
Doc verbatim:
Two options for backtesting with DBN data:
- Store data as DBN (.dbn.zst) files and decode to Nautilus objects every run.
- Convert DBN files to Nautilus objects once and write to the data catalog (Nautilus Parquet format).

The DBN decoder is optimized Rust, but writing to the catalog once gives the best backtest performance. DataFusion streams Nautilus Parquet data from disk at high throughput, at least an order of magnitude faster than decoding DBN per run.
For Cortana MK3 spike: do option 2. Convert once on Saturday,
backtest forever from ~/cortana-data/catalog/.
Live trading config (for the future, not the spike)
The TradingNodeConfig shape:
from nautilus_trader.adapters.databento import DATABENTO
from nautilus_trader.adapters.databento.factories import DatabentoLiveDataClientFactory
from nautilus_trader.live.node import TradingNode
from nautilus_trader.config import InstrumentProviderConfig, TradingNodeConfig
config = TradingNodeConfig(
...,
data_clients={
DATABENTO: {
"api_key": None, # falls back to DATABENTO_API_KEY env var
"http_gateway": None,
"live_gateway": None,
"instrument_provider": InstrumentProviderConfig(load_all=True),
"instrument_ids": None,
"parent_symbols": None,
},
},
...,
)
node = TradingNode(config=config)
node.add_data_client_factory(DATABENTO, DatabentoLiveDataClientFactory)
node.build()

Configuration parameters (verbatim)
| Option | Default | Description |
|---|---|---|
| api_key | None | Databento API secret. Falls back to DATABENTO_API_KEY env var when None. |
| http_gateway | None | Historical HTTP gateway override (testing). |
| live_gateway | None | Raw TCP real-time gateway override (testing). |
| use_exchange_as_venue | True | Use exchange MIC for Nautilus venue. False retains default GLBX mapping. |
| timeout_initial_load | 15.0 | Seconds to wait for instrument definitions per dataset before proceeding. |
| mbo_subscriptions_delay | 3.0 | Seconds to buffer before enabling MBO/L3 streams so initial snapshots replay in order. |
| bars_timestamp_on_close | True | Timestamp bars on close (ts_event/ts_init). False = open. |
| reconnect_timeout_mins | 10 | Minutes to attempt reconnection before giving up. None = retry indefinitely. |
| venue_dataset_map | None | Optional Nautilus venue → Databento dataset code mapping. |
| parent_symbols | None | {dataset: {parent symbols}} to preload definition trees, e.g. {"GLBX.MDP3": {"ES.FUT", "ES.OPT"}}. |
| instrument_ids | None | Nautilus InstrumentId values to preload definitions for at startup. |
Live-client architecture quirk
Per dataset, DatabentoDataClient opens two DatabentoLiveClient
instances:
- One for MBO (order book deltas) real-time feeds.
- One for all other real-time feeds.
This is why MBO must be subscribed at startup - the first client needs the full subscribe-set before opening the stream so it can replay session-start snapshots in order.
Connection stability (live operation)
Two reconnect modes:
- With timeout (default 10 min): exponential backoff capped at 60s. Pattern: 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s… (with up-to-1s jitter). Survives transient network issues + scheduled gateway restarts. Stops retrying overnight.
- Without timeout (reconnect_timeout_mins=None): exponential backoff capped at 10 min. Pattern: 1s, 2s, 4s, …, 256s, 512s, 600s, 600s… (with jitter). For unattended systems. Can mask config / auth issues.
Every reconnection auto-resubscribes to all active topics. Successful sessions >60s reset the timeout clock.
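
The two delay patterns above follow from a doubling schedule with a cap. The generator below is a sketch of that schedule only; the function name and parameters are hypothetical, and the adapter's real reconnect loop (timeout accounting, session reset) is not modelled.

```python
import random
from itertools import islice
from typing import Iterator


def backoff_delays(
    cap_secs: float,
    base_secs: float = 1.0,
    jitter_secs: float = 0.0,
) -> Iterator[float]:
    """Yield doubling delays capped at cap_secs, each plus up to jitter_secs of noise."""
    delay = base_secs
    while True:
        yield min(delay, cap_secs) + random.uniform(0.0, jitter_secs)
        delay = min(delay * 2, cap_secs)


# With-timeout mode: cap at 60 s (jitter disabled here so the pattern is visible).
print(list(islice(backoff_delays(cap_secs=60.0), 8)))

# Without-timeout mode: cap at 10 min = 600 s.
print(list(islice(backoff_delays(cap_secs=600.0), 12)))
```

Passing jitter_secs=1.0 would add the up-to-1s jitter described for the default mode.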
Sunday maintenance schedule
| Dataset | Maintenance Time (UTC) |
|---|---|
| CME Globex | 09 |
| All ICE venues | 09 |
| All other datasets (incl. OPRA) | 10 |
Default 10-min timeout covers a typical restart. Cortana operates weekdays only - Sunday maintenance is irrelevant for live MK3.
Cost model and the $125 credits
What the doc states:
- “Databento offers 125 USD in free data credits (historical only) for new sign-ups. With careful requests, this covers testing and evaluation.”
- “Check the /metadata.get_cost endpoint before requesting data.” This is the documented cost-estimation method.
What the doc does NOT cover (escalate to the databento Python SDK
docs / databento.com pricing page on Saturday):
- Exact per-symbol-day cost for OPRA TRADES + MBP-1.
- Whether 0DTE wildcards (e.g., SPY 250509* for all SPY options expiring Friday) bill per-symbol-resolved or per-MB-of-DBN.
- Live-feed billing (the databento-vs-uw-vs-ibkr-data-feeds.md page notes usage-based live is being deprecated 2025-03-31; live needs Standard or Plus).
Cost estimation pattern (Saturday morning)
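# CLI pre-flight (assumes `pip install databento` and DATABENTO_API_KEY env)
# NOTE: the 250509 symbols below expire 2025-05-09, before the 2026-05-06 window;
# substitute contracts that were listed on the query date.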
databento metadata.get-cost \
--dataset OPRA.PILLAR \
--symbols "SPY 250509C00580000,SPY 250509P00580000" \
--schema trades \
--start 2026-05-06T13:30:00 \
--end 2026-05-06T20:00:00

Or via Python SDK:
import databento as db
client = db.Historical()
cost = client.metadata.get_cost(
dataset="OPRA.PILLAR",
symbols=["SPY"], # parent symbol - resolves all SPY options
stype_in="parent", # critical: tells Databento "this is a parent symbol, expand to children"
schema="trades",
start="2026-05-06T13:30:00",
end="2026-05-06T20:00:00",
)
print(f"Estimated cost: ${cost:.2f}")

Abort threshold per spike plan: if estimated cost > $30, narrow the symbol filter (specific strikes, e.g., the 0DTE near-the-money chain only) or reduce the time window.
Cortana MK3 implications - Step 0.5 step-by-step playbook
This is the load-bearing section. Copy-pasteable Saturday-morning recipe.
Pre-spike setup (5 min, do Friday night)
# 1. Sign up at databento.com, claim $125 trial credits
# 2. Install the databento SDK (separate from the Nautilus adapter - useful for the CLI)
uv pip install databento
# 3. Set API key (visible at https://databento.com/portal/keys)
export DATABENTO_API_KEY="db-..." # paste key
echo 'export DATABENTO_API_KEY="db-..."' >> ~/.zshrc
# 4. Set NAUTILUS_PATH
mkdir -p ~/cortana-data
export NAUTILUS_PATH=~/cortana-data
echo 'export NAUTILUS_PATH=~/cortana-data' >> ~/.zshrc
# 5. Smoke test
python -c "import databento; print(databento.__version__)"
databento datasets  # prints all available dataset codes - grep for OPRA

Step 0.5.1 - Verify OPRA dataset code (5 min)
databento datasets | grep -i opra
# Expected output includes OPRA.PILLAR (or current canonical OPRA dataset code)
databento metadata.list_publishers --dataset OPRA.PILLAR
# Lists participating exchanges and their MICs - confirms Nautilus venue mapping

If OPRA.PILLAR is not the canonical name, use whatever Databento prints and substitute in the next steps.
Step 0.5.2 - Pre-flight cost estimate (5 min)
Pick a date already covered by decisions.db (e.g., 2026-05-06):
# /tmp/databento_cost_check.py
import databento as db
DATE = "2026-05-06"
client = db.Historical()
# Estimate cost for SPY parent (all SPY options, that day, Trades + MBP-1)
for schema in ("trades", "mbp-1"):
    cost = client.metadata.get_cost(
        dataset="OPRA.PILLAR",
        symbols=["SPY"],
        stype_in="parent",
        schema=schema,
        start=f"{DATE}T13:30:00",  # 8:30 CT = 13:30 UTC
        end=f"{DATE}T20:00:00",    # 3:00 PM CT = 20:00 UTC
    )
    print(f"{schema:10s}: ${cost:.2f}")
# Also estimate DEFINITION (small, mandatory)
cost_def = client.metadata.get_cost(
dataset="OPRA.PILLAR",
symbols=["SPY"],
stype_in="parent",
schema="definition",
start=f"{DATE}T00:00:00",
end=f"{DATE}T23:59:59",
)
print(f"definition: ${cost_def:.2f}")

Run it. If trades + mbp-1 + definition < $30, proceed. If > $30, narrow to specific 0DTE near-the-money strikes.
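
The go/no-go check on the three estimates can be written as a tiny pure function, so the threshold lives in one place. The name budget_decision and the example dollar figures are made up for illustration; real numbers come from metadata.get_cost.

```python
def budget_decision(
    estimates_usd: dict[str, float],
    limit_usd: float = 30.0,
) -> tuple[float, bool]:
    """Sum per-schema cost estimates and report whether the pull fits the budget."""
    total = round(sum(estimates_usd.values()), 2)
    return total, total <= limit_usd


# Placeholder figures, not real Databento quotes.
total, ok = budget_decision({"trades": 12.50, "mbp-1": 14.00, "definition": 0.25})
print(f"total=${total:.2f} proceed={ok}")
```

If the second element comes back False, narrow the symbol list or the time window and re-run the estimate before pulling anything.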
Step 0.5.3 - Pull SPY OPRA Trades + MBP-1 + DEFINITION (15 min)
# /tmp/databento_pull.py
import databento as db
from pathlib import Path
DATE = "2026-05-06"
OUT = Path("~/cortana-data/raw/databento").expanduser()
OUT.mkdir(parents=True, exist_ok=True)
client = db.Historical()
# 1. DEFINITION schema (small, fast, MUST be first)
client.timeseries.get_range(
dataset="OPRA.PILLAR",
symbols=["SPY"],
stype_in="parent",
schema="definition",
start=f"{DATE}T00:00:00",
end=f"{DATE}T23:59:59",
path=str(OUT / f"spy-opra-definition-{DATE}.dbn.zst"),
)
print("Definition pulled.")
# 2. TRADES (every option print)
client.timeseries.get_range(
dataset="OPRA.PILLAR",
symbols=["SPY"],
stype_in="parent",
schema="trades",
start=f"{DATE}T13:30:00",
end=f"{DATE}T20:00:00",
path=str(OUT / f"spy-opra-trades-{DATE}.dbn.zst"),
)
print("Trades pulled.")
# 3. MBP-1 (top-of-book per option contract)
client.timeseries.get_range(
dataset="OPRA.PILLAR",
symbols=["SPY"],
stype_in="parent",
schema="mbp-1",
start=f"{DATE}T13:30:00",
end=f"{DATE}T20:00:00",
path=str(OUT / f"spy-opra-mbp1-{DATE}.dbn.zst"),
)
print("MBP-1 pulled.")

Equivalent CLI form (one schema at a time):
databento timeseries.get-range \
--dataset OPRA.PILLAR \
--symbols SPY \
--stype-in parent \
--schema trades \
--start 2026-05-06T13:30:00 \
--end 2026-05-06T20:00:00 \
--output ~/cortana-data/raw/databento/spy-opra-trades-2026-05-06.dbn.zst

Step 0.5.4 - Ingest into ParquetDataCatalog (10 min)
# /tmp/databento_ingest.py
from pathlib import Path
from nautilus_trader.adapters.databento import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog
DATE = "2026-05-06"
RAW = Path("~/cortana-data/raw/databento").expanduser()
catalog = ParquetDataCatalog.from_env() # NAUTILUS_PATH
loader = DatabentoDataLoader()
# 1. DEFINITIONS first
instruments = loader.from_dbn_file(
path=str(RAW / f"spy-opra-definition-{DATE}.dbn.zst"),
as_legacy_cython=False,
)
catalog.write_data(instruments)
print(f"Wrote {len(instruments)} instrument definitions.")
# Verify
loaded = catalog.instruments()
print(f"Catalog now contains {len(loaded)} instruments. Sample:")
for inst in list(loaded)[:5]:
    print(f" {inst.id}")
# 2. TRADES (no instrument_id arg - let symbology resolve per-print)
trades = loader.from_dbn_file(
path=str(RAW / f"spy-opra-trades-{DATE}.dbn.zst"),
as_legacy_cython=False,
)
catalog.write_data(trades)
print(f"Wrote {len(trades):,} trade ticks.")
# 3. MBP-1 (gets quotes + maybe trades)
quotes = loader.from_dbn_file(
path=str(RAW / f"spy-opra-mbp1-{DATE}.dbn.zst"),
include_trades=True, # also emit TradeTicks where present
as_legacy_cython=False,
)
catalog.write_data(quotes)
print(f"Wrote {len(quotes):,} quote/trade ticks from MBP-1.")

Step 0.5.5 - Query catalog and cross-check decisions.db (10 min)
# /tmp/databento_crosscheck.py
import sqlite3, pandas as pd
from pathlib import Path
from nautilus_trader.persistence.catalog import ParquetDataCatalog
from nautilus_trader.model import TradeTick
from nautilus_trader.model.identifiers import InstrumentId
DATE = "2026-05-06"
catalog = ParquetDataCatalog.from_env()
# 1. Pick one decisions.db row near a known-good entry timestamp
DB_PATH = Path("~/conductor/workspaces/cortanaroi-mk2/belo-horizonte/data/decisions.db").expanduser()
conn = sqlite3.connect(DB_PATH)
df = pd.read_sql(
"SELECT ts_event, signal_id, spy_price_at_score, composite_score, bias "
"FROM scoring_events "
f"WHERE date(ts_event/1e9, 'unixepoch') = '{DATE}' "
"AND composite_score >= 65 "
"ORDER BY ts_event LIMIT 1",
conn,
)
print(df)
ts_ns = int(df.iloc[0]["ts_event"])
spy_price = float(df.iloc[0]["spy_price_at_score"])
print(f"Cortana saw SPY @ ${spy_price:.2f} at ts_event={ts_ns}")
# 2. Pull a 200ms window of SPY-OPRA trades around that timestamp
# First find the SPY underlying in the Databento equity catalog if loaded;
# else use the nearest ATM option print as a proxy
results = catalog.query(
data_cls=TradeTick,
start=ts_ns - 100_000_000, # ts - 100ms
end=ts_ns + 100_000_000, # ts + 100ms
)
print(f"Catalog returned {len(results)} ticks in 200ms window")
for tick in results[:10]:
    print(f" {tick.instrument_id} {tick.price} size={tick.size} ts_event={tick.ts_event}")
# 3. (Optional) load SPY underlying from a separate Databento equity pull (DBEQ.BASIC schema)
# so cross-check is apples-to-apples. For Step 0.5 spike, the OPRA print is enough -
# we're validating the *plumbing*, not building the prod pipeline.

Pass criterion: the catalog returns ticks; at least one of them falls within nanoseconds-to-milliseconds of ts_ns; and the cross-instrument MIC strings (SPY 250509C00580000.OPRA or similar) decode cleanly. Fail criterion: schema confusion, empty result, instrument_id mismatch, or out-of-order ticks.
Step 0.5.6 - Stub a no-op backtest streaming the catalog (10 min)
# /tmp/databento_backtest_stub.py
from nautilus_trader.backtest.node import BacktestNode
from nautilus_trader.config import (
BacktestRunConfig, BacktestDataConfig, BacktestVenueConfig,
BacktestEngineConfig, ImportableStrategyConfig,
)
from nautilus_trader.model import TradeTick
from nautilus_trader.model.identifiers import InstrumentId
DATE = "2026-05-06"
data_config = BacktestDataConfig(
catalog_path="/Users/codysmith/cortana-data/catalog",
data_cls=TradeTick,
instrument_id=None, # all SPY-OPRA contracts in catalog
start_time=f"{DATE}T13:30:00Z",
end_time=f"{DATE}T20:00:00Z",
)
venue_config = BacktestVenueConfig(
name="OPRA",
oms_type="NETTING",
account_type="CASH",
starting_balances=["100_000 USD"],
book_type="L1_MBP",
)
# Placeholder only: not wired into the run below (strategies=[]).
# A real tick counter would subclass Strategy and count in on_trade_tick.
class TickCounterStrategy:
    pass
run_config = BacktestRunConfig(
engine=BacktestEngineConfig(),
data=[data_config],
venues=[venue_config],
strategies=[], # empty for stub - just prove the catalog streams
)
node = BacktestNode(configs=[run_config])
results = node.run()
print(results)

Pass: BacktestNode.run() completes without error and prints non-zero ticks streamed. Fail: ImportError, schema validation, or mid-run crash.
Step 0.5.7 - Acceptance log entry (5 min)
If all six substeps pass in <60 min and credit burn was <$30, log to the spike plan / brain timeline:
2026-05-09 | Cody - Step 0.5 PASS. SPY OPRA Trades + MBP-1 for
2026-05-06 ingested in N min, $X spent of $125 credits.
Catalog at ~/cortana-data/catalog/. Cross-check against
decisions.db row #N: ts alignment confirmed within Yms.
Backtest stub streamed Z ticks. Databento path validated for
MK3 historical replay.
Step 0.5 fail-mode triage
| Failure | Likely cause | Mitigation |
|---|---|---|
| Cost estimate >$30 | Pulling whole SPY parent is expensive | Narrow to specific 0DTE strikes; use stype_in="raw_symbol" with explicit OSI symbols. |
| databento datasets doesn’t list OPRA | Free credits don’t cover OPRA | Switch to a smaller/cheaper dataset (DBEQ.BASIC for SPY equity) and validate the plumbing. Defer OPRA to paid tier. |
| from_dbn_file raises on definition | DBN version mismatch / file corruption | Re-pull; check file size > 0; check zstd integrity. |
| catalog.instruments() empty after write | DEFINITION write failed silently | Re-run with as_legacy_cython=False; check write_data return value; check ~/cortana-data/catalog/data/ filesystem. |
| Cross-check tick-window empty | MIC mismatch - Cortana’s SPY.ARCA vs Databento’s per-exchange MIC | Run metadata.list_publishers and use the MIC the adapter actually wrote. |
| Out-of-order events on replay | Nanosecond-tie ordering issue | Document as Carryover 7-extension; pre-sort (ts_event, raw_id) at DataLoader stage. |
| Backtest stub hangs | Strategies-list empty + no-op strat config issue | Plug in a minimal Strategy subclass that counts ticks; or use BacktestEngine directly. |
Cross-page Cortana question - answered
“Which Databento schema(s) does Cortana actually need for replay?” (Open thread on databento-vs-uw-vs-ibkr-data-feeds.md.)
Answer per this doc: TRADES + MBP-1 + DEFINITION is the
minimum viable triple for the spike. TRADES gives every print. MBP-1
gives top-of-book quotes (and optionally trades via
include_trades=True). DEFINITION is mandatory before any market
data load. Skip MBO/MBP-10/IMBALANCE/STATISTICS until post-spike when
adversarial-fidelity backtest needs them.
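The load-order rule for the minimum triple can be made explicit in a pull plan. This is a sketch only: `load_order` is a hypothetical helper, and the lowercase strings are the Databento schema names for DEFINITION, TRADES, and MBP-1.

```python
# Minimum viable triple for the spike, as Databento schema strings.
MINIMUM_SCHEMAS = ("definition", "trades", "mbp-1")


def load_order(schemas):
    """Return schemas with 'definition' first; raise if it is missing."""
    schemas = list(schemas)
    if "definition" not in schemas:
        raise ValueError("DEFINITION must be loaded before any market data")
    rest = [s for s in schemas if s != "definition"]
    return ["definition", *rest]


print(load_order(["trades", "mbp-1", "definition"]))
# ['definition', 'trades', 'mbp-1']
```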
Open questions for Step 0.5 (not resolved by this doc)
- Exact OPRA dataset code. The doc names GLBX.MDP3 for CME but does not name OPRA’s. Resolve via databento datasets CLI on Saturday.
- OPRA MIC after use_exchange_as_venue=True. The doc establishes the per-exchange-MIC pattern via the GLBX example but does not enumerate OPRA’s behavior. Resolve via metadata.list_publishers --dataset <opra> and observe what instrument_id strings end up in the catalog.
- 0DTE wildcard syntax. Doc shows parent_symbols={"GLBX.MDP3": {"ES.FUT", "ES.OPT"}} for futures parents but doesn’t show “SPY 0DTE only” filtering for options. Likely path: pull SPY parent, filter post-load by expiry == DATE. Or use raw OSI symbols (SPY 250509C00580000 etc.) with stype_in="raw_symbol".
- Per-MB vs per-symbol-day cost shape. Doc says “Check metadata.get_cost” but does not document the cost formula. Treat get_cost as authoritative; do not try to predict from first principles.
- 404 how-to URLs. The two nautilus-how-to.md Databento recipes were 404 on 2026-05-06. The integration page at nautilustrader.io/docs/latest/integrations/databento/ is the actual canonical reference (this page mirrors it). The how-tos should be considered pointers; this page is the source of truth.
- Live OPRA pricing tier. Doc covers reconnection mechanics but not the pricing-tier requirement for live OPRA. Per databento-vs-uw-vs-ibkr-data-feeds.md: live needs Standard ($1,399/mo). Defer.
- OPRA.PILLAR vs separate per-exchange OPRA dataset codes. Resolve via databento datasets.
- Tie-breaking at exact-nanosecond ts_event collisions. Per nautilus-backtesting.md Carryover #7, not unique to Databento - but the OPRA tape is high-volume and genuinely produces collisions. Pre-sort (ts_event, sequence_id, raw_id) at DataLoader stage if observed.
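For the raw-OSI-symbol route in the 0DTE question, the standard OCC/OSI layout is a root left-justified in 6 characters, then YYMMDD, then C or P, then strike times 1000 zero-padded to 8 digits. The sketch below builds symbols in that layout and shows the post-load expiry filter. `osi_symbol` and `zero_dte` are hypothetical helpers, and whether Databento’s raw_symbol expects the space-padded root is an assumption to verify against the DEFINITION records.

```python
from datetime import date
from decimal import Decimal


def osi_symbol(root: str, expiry: date, right: str, strike: str) -> str:
    """Build a 21-character OCC/OSI option symbol.

    Layout: root padded to 6 chars, YYMMDD, 'C' or 'P',
    then strike * 1000 zero-padded to 8 digits.
    """
    if right not in ("C", "P"):
        raise ValueError("right must be 'C' or 'P'")
    if not 1 <= len(root) <= 6:
        raise ValueError("root must be 1-6 characters")
    # Decimal avoids float rounding on fractional strikes such as 580.5.
    strike_milli = Decimal(strike) * 1000
    if strike_milli != strike_milli.to_integral_value():
        raise ValueError("strike has more than 3 decimal places")
    return f"{root:<6}{expiry:%y%m%d}{right}{int(strike_milli):08d}"


def zero_dte(contracts: list[dict], trade_date: date) -> list[dict]:
    """Post-load filter: keep contracts whose expiry equals the trade date."""
    return [c for c in contracts if c["expiry"] == trade_date]


print(repr(osi_symbol("SPY", date(2025, 5, 9), "C", "580")))
# 'SPY   250509C00580000'
```

Building the explicit strike list up front keeps the request narrow, which is also the cost mitigation suggested in the triage table.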
Anti-patterns to avoid
- Loading market data before DEFINITION. Catalog will reject or produce wrong-precision data. Always step 1 = DEFINITION, step 2 = market data.
- Subscribing MBO after on_start. “Subscriptions after start are logged as errors and ignored.” MBO must be in on_start().
- Subscribing TBBO/TCBBO and a separate trades feed for the same instrument. Doubles cost, creates duplicates. Pick one path.
- Re-decoding DBN per backtest run. Pay the decode cost once, then query Parquet. Order-of-magnitude perf difference per the doc.
- Using as_legacy_cython=True for IMBALANCE/STATISTICS. Raises ValueError. They are PyO3-only types.
- Skipping metadata.get_cost before a big pull. $125 burns fast with naive whole-day MBO pulls.
- Hardcoding venue MICs. Use the venue MIC the adapter actually writes; don’t assume SPY.ARCA or SPY.OPRA matches what’s in the catalog. Confirm with catalog.instruments().
- Treating decisions.db millisecond timestamps as nanoseconds. Multiply by 1e6 before comparing against Nautilus ts_event.
- Skipping the dual-live-client architecture caveat. Live MBO + other feeds open two TCP clients per dataset; budget firewall and rate-limit allowances for two, not one.
- Hardcoding OPRA.PILLAR before verifying with databento datasets. The actual code may differ; the spike has time to check.
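The millisecond-to-nanosecond conversion is worth doing with integer arithmetic. In Python, multiplying by the float literal 1e6 yields a float, and epoch-nanosecond values at current magnitudes exceed what a float can represent exactly. A minimal sketch follows; `ms_to_ns`, `within_window`, and the 50 ms default tolerance are illustrative assumptions, not values from the spike plan.

```python
NS_PER_MS = 1_000_000  # integer constant, not the float 1e6


def ms_to_ns(ts_ms: int) -> int:
    """Convert epoch milliseconds to epoch nanoseconds using integer math."""
    return ts_ms * NS_PER_MS


def within_window(ts_ms: int, ts_event_ns: int, tol_ms: int = 50) -> bool:
    """True if a ts_event (ns) lies within tol_ms of a decisions.db timestamp (ms)."""
    return abs(ts_event_ns - ms_to_ns(ts_ms)) <= tol_ms * NS_PER_MS


decision_ms = 1_746_538_200_123             # illustrative millisecond timestamp
tick_ns = ms_to_ns(decision_ms) + 10_000_000  # a tick 10 ms later
print(within_window(decision_ms, tick_ns))  # True
```

Keeping both sides as integers also means equality and ordering comparisons against ts_event behave exactly, with no rounding at the nanosecond level.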
When this concept applies
- Pulling historical OPRA / equity tape into a Nautilus ParquetDataCatalog for backtest replay (Cortana MK3 spike Step 0.5 and beyond).
- Building backtest fidelity tooling (MBO L3 reconstruction, MBP-10 L2 fills, NBBO replay).
- Cross-validating UW alerts against raw OPRA prints (post-spike adversarial replay framework).
- Evaluating a future move to Databento as a live feed (post-MK3, post-Standard-tier upgrade).
When it does not apply
- Cortana’s primary signal layer (UW flow alerts) - UW remains the signal vendor; Databento is replay/audit only.
- Pricing-of-record - IBKR is the only vendor whose price string lands on an order (feedback_ibkr_pricing_source.md).
- Live trading on $125 free credits - historical only.
See Also
- Nautilus Adapters - Databento was named the closest reference adapter for the UW WebSocket adapter; this page documents the shape concretely.
- Nautilus Data Model - ParquetDataCatalog, BacktestDataConfig, custom-data-type ingest patterns the Databento decoder feeds into.
- Nautilus Backtesting - BacktestNode / BacktestEngine consumption of the catalog Databento populates; Carryover #7 (ts_init tie-breaking).
- Nautilus Tutorials - “Data Catalog with Databento” tutorial entry; recommended Cortana learning path.
- Nautilus Instruments - Equity, OptionContract instrument types Databento DEFINITION decodes into (parallel agent - verify slug).
- Databento vs UW vs IBKR data feeds - the layering decision that drove Step 0.5; explicit “augment, don’t replace UW”.
- Nautilus Integrations (IBKR-focused) - IBKR pairing partner; Databento data + IBKR execution is the doc-blessed combo.
- Spike plan: ~/conductor/workspaces/cortanaroi-mk2/belo-horizonte/plans/2026-05-09-nautilus-spike.md Step 0.5.
- Source: https://nautilustrader.io/docs/latest/integrations/databento/
Timeline
2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 4.