Nautilus Databento Integration

NautilusTrader ships an out-of-the-box Databento adapter built on the official databento-rs crate (no separate databento Python package install is needed; the Rust client links during the wheel build). It is data-only, with five components: DatabentoDataLoader (DBN file → Nautilus objects), DatabentoHistoricalClient (REST fetch of historical data), DatabentoLiveClient (raw-TCP live stream), DatabentoInstrumentProvider (definitions), and DatabentoDataClient (the LiveMarketDataClient wired into a TradingNode). Pair it with IBKR for execution; pricing-of-record stays on IBKR.

Fourteen schema families are supported (MBO/MBP-1/MBP-10/BBO/CBBO/CMBP-1/TBBO/TCBBO/TRADES/OHLCV/DEFINITION/IMBALANCE/STATISTICS/STATUS; twenty concrete schemas once the BBO, CBBO, and OHLCV intervals are counted) and decode to native Nautilus types (OrderBookDelta, QuoteTick, TradeTick, OrderBookDepth10, Bar, instrument variants, plus PyO3-only DatabentoImbalance / DatabentoStatistics). The doc explicitly recommends decoding DBN to the ParquetDataCatalog once, then streaming via BacktestNode, which is at least an order of magnitude faster than re-decoding DBN per run.

The doc itself advertises $125 in free credits; the metadata.get_cost HTTP endpoint is the documented cost-estimation pre-flight. For Cortana MK3, this is the page that powers spike Step 0.5: pull SPY OPRA Trades + MBP-1 for one trading day, ingest into ~/cortana-data/catalog/, query, and cross-check against decisions.db.

This page complements Nautilus Adapters (Databento was named the closest data-only reference adapter for the UW WebSocket adapter MK3 must build), Nautilus Data Model (how ParquetDataCatalog and custom-data types work), Nautilus Backtesting (catalog → BacktestNode path), Nautilus Tutorials (the “Data Catalog with Databento” how-to that the spike plan flagged as 404 on 2026-05-06), and the data-feed layering decision in databento-vs-uw-vs-ibkr-data-feeds.md.

Core claim

The Databento adapter is the canonical historical-OPRA replay path for any Nautilus user who needs equity-options tape - and the closest reference in the codebase for Cortana’s UW adapter shape. Three operational primitives are independently usable:

  1. DatabentoDataLoader.from_dbn_file(...) - pure offline DBN-to-Nautilus decoder. No API key needed at decode time. Rust under the hood, exposed as Python.
  2. DatabentoHistoricalClient - HTTP REST fetcher; needs API key; pulls historical DBN data and definitions on demand.
  3. DatabentoLiveClient / DatabentoDataClient - raw-TCP live stream wired into a TradingNode via a factory.

The doc treats DBN-on-disk → catalog conversion as a first-class production pattern: pay decode cost once, query Parquet forever.

Adapter component map

Per the Overview section of the doc:

| Class | Role | Used directly? |
|---|---|---|
| DatabentoDataLoader | Loads DBN data from files; converts to Nautilus objects (offline). | Yes - most common entry point for Cortana MK3 spike. |
| DatabentoInstrumentProvider | Fetches latest or historical instrument definitions via Databento HTTP API. | Sometimes - usually wired into TradingNode and fed via config. |
| DatabentoHistoricalClient | Fetches historical market data via Databento HTTP API. | Sometimes - most users prefer DBN-on-disk via the CLI then DatabentoDataLoader. |
| DatabentoLiveClient | Subscribes to real-time data feeds via Databento’s raw TCP API. | Rare - wrapped by DatabentoDataClient. |
| DatabentoDataClient | LiveMarketDataClient impl for live trading nodes. | Indirectly - instantiated by DatabentoLiveDataClientFactory from config. |

Doc verbatim: “Most users configure a live trading node (covered below) and do not work with these components directly.” For Cortana MK3 spike Step 0.5 the direct path is the simplest: CLI-pull DBN files, DatabentoDataLoader.from_dbn_file(...), write to catalog.

Supported schemas (the table that matters)

Reproduced verbatim from the doc - this is the load-bearing reference for “what schema do I ask Databento for”:

| Databento schema | Nautilus data type | Description |
|---|---|---|
| MBO | OrderBookDelta | Market by order (L3). |
| MBP_1 | (QuoteTick, TradeTick or None) | Market by price (L1). |
| MBP_10 | OrderBookDepth10 | Market depth (L2). |
| BBO_1S | QuoteTick | 1-second best bid/offer. |
| BBO_1M | QuoteTick | 1-minute best bid/offer. |
| CMBP_1 | (QuoteTick, TradeTick or None) | Consolidated MBP across venues. |
| CBBO_1S | QuoteTick | Consolidated 1-second BBO. |
| CBBO_1M | QuoteTick | Consolidated 1-minute BBO. |
| TCBBO | (QuoteTick, TradeTick) | Trade-sampled consolidated BBO. |
| TBBO | (QuoteTick, TradeTick) | Trade-sampled best bid/offer. |
| TRADES | TradeTick | Trade ticks. |
| OHLCV_1S | Bar | 1-second bars. |
| OHLCV_1M | Bar | 1-minute bars. |
| OHLCV_1H | Bar | 1-hour bars. |
| OHLCV_1D | Bar | Daily bars. |
| OHLCV_EOD | Bar | End-of-day bars. |
| DEFINITION | Instrument (various types) | Instrument definitions. |
| IMBALANCE | DatabentoImbalance | Auction imbalance data. |
| STATISTICS | DatabentoStatistics | Market statistics. |
| STATUS | InstrumentStatus | Market status updates. |

Schemas Cortana actually cares about

| Cortana use case | Schema | Why |
|---|---|---|
| Spike Step 0.5: SPY OPRA replay | TRADES (every option print) + MBP_1 (top-of-book) | Per spike plan; gets every print + L1 NBBO without paying for L2/L3. |
| Tighter MK3 backtest fidelity (post-spike) | MBP_10 | L2 depth-aware fill simulation; midweek work after spike. |
| Adversarial replay (queue position, exact reconstruction) | MBO | L3, expensive, only after MK3 fill model needs it. |
| 0DTE chain definition resolution | DEFINITION | Required before market data - the doc is explicit. Loads instrument metadata so price precision is correct. |
| Equity SPY underlying tape | TBBO (trade-sampled BBO) or MBP_1+TRADES | TBBO emits both QuoteTick and TradeTick per message - more efficient than two streams. |
| Auction imbalance (open/close cross signals) | IMBALANCE | PyO3-only type (DatabentoImbalance); not streamable through BacktestNode yet - query catalog directly. |
| Market status (halts, opens, closes) | STATUS | Lightweight; useful for Power Hour / open-cross gating. |

Schema-pair pitfalls (verbatim guidance)

  • TBBO / TCBBO already include trades. Do not subscribe to a separate trade feed alongside. Doc: “Avoid subscribing to both TBBO/TCBBO and separate trade feeds for the same instrument. These schemas already include trades. Duplicating wastes cost and creates duplicate data.”
  • MBP-1 with include_trades=True also emits both QuoteTick and TradeTick. Equivalent to TBBO for L1+trades use cases.
  • MBO must be subscribed at node startup. Doc: “MBO subscriptions must be made at node startup for Databento to ensure proper replay from session start.” Subscriptions after start are logged as errors and ignored. Cortana MK3 strategies that subscribe MBO must do it in on_start(), never lazily.
  • CMBP_1 / TCBBO have no native trade ID. The decoder derives a deterministic TradeId via FNV-1a hash of (instrument_id, ts_event, ts_recv, price, size, aggressor_side). Same venue event → same TradeId across replays (dedup-safe). Two logically distinct trades with identical fields collide - matches venue’s own inability to distinguish them.
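The deterministic-ID idea in the last bullet can be illustrated with a minimal sketch. It assumes a 64-bit FNV-1a variant and a "|"-joined UTF-8 serialization of the six fields; neither the hash width nor the byte layout is stated above, so derive_trade_id and its output format are hypothetical and will not reproduce the adapter's actual TradeId values.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK_64
    return h


def derive_trade_id(instrument_id, ts_event, ts_recv, price, size, aggressor_side) -> str:
    """Hypothetical helper: hash the six fields named above into a hex ID.

    The field serialization here is an assumption made for illustration only.
    """
    fields = (instrument_id, ts_event, ts_recv, price, size, aggressor_side)
    key = "|".join(str(f) for f in fields).encode("utf-8")
    return f"{fnv1a_64(key):016x}"


# Same inputs always give the same ID, which is what makes replays dedup-safe.
first = derive_trade_id(12345, 1_000, 1_005, 580_250_000_000, 10, "BUYER")
again = derive_trade_id(12345, 1_000, 1_005, 580_250_000_000, 10, "BUYER")
other = derive_trade_id(12345, 1_000, 1_005, 580_250_000_000, 11, "BUYER")
print(first == again, first == other)
```

Two prints with identical field values necessarily hash to the same ID under any scheme of this shape, which is the collision case the bullet describes.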

Subscription-method-to-schema map (live)

For live TradingNode strategies, Nautilus’s standard subscribe API maps to Databento schemas with sensible defaults:

| Nautilus method | Default schema | Available schemas | Emits |
|---|---|---|---|
| subscribe_quote_ticks() | mbp-1 | mbp-1, bbo-1s, bbo-1m, cmbp-1, cbbo-1s, cbbo-1m, tbbo, tcbbo | QuoteTick |
| subscribe_trade_ticks() | trades | trades, tbbo, tcbbo, mbp-1, cmbp-1 | TradeTick |
| subscribe_order_book_depth() | mbp-10 | mbp-10 | OrderBookDepth10 |
| subscribe_order_book_deltas() | mbo | mbo | OrderBookDeltas |
| subscribe_bars() | varies | ohlcv-1s, ohlcv-1m, ohlcv-1h, ohlcv-1d | Bar |

Schema overrides go through the params kwarg:

from nautilus_trader.adapters.databento import DATABENTO_CLIENT_ID
from nautilus_trader.model.identifiers import InstrumentId
 
# Default MBP-1 quotes (may include trades)
self.subscribe_quote_ticks(instrument_id, client_id=DATABENTO_CLIENT_ID)
 
# Override to TBBO (quotes + trades together, more efficient)
self.subscribe_quote_ticks(
    instrument_id=instrument_id,
    params={"schema": "tbbo"},
    client_id=DATABENTO_CLIENT_ID,
)
 
# 1-second BBO snapshots only
self.subscribe_quote_ticks(
    instrument_id=instrument_id,
    params={"schema": "bbo-1s"},
    client_id=DATABENTO_CLIENT_ID,
)

For order book depth (depth must be 10 for Databento):

self.subscribe_order_book_depth(
    instrument_id=instrument_id,
    depth=10,  # Required value - selects MBP-10 schema
)

For MBO (must be in on_start()):

from nautilus_trader.model.enums import BookType
 
def on_start(self) -> None:
    self.subscribe_order_book_deltas(
        instrument_id=instrument_id,
        book_type=BookType.L3_MBO,  # Selects MBO schema
    )

For custom data types (DatabentoImbalance, DatabentoStatistics, InstrumentStatus):

from nautilus_trader.adapters.databento import (
    DATABENTO_CLIENT_ID,
    DatabentoImbalance,
    DatabentoStatistics,
)
from nautilus_trader.model import DataType
 
self.subscribe_data(
    data_type=DataType(DatabentoImbalance, metadata={"instrument_id": instrument_id}),
    client_id=DATABENTO_CLIENT_ID,
)

Instrument IDs and symbology - the load-bearing answer for Cortana

This is the section the spike plan asked the doc to resolve. The answer: Databento’s instrument_id (an integer assigned by source venue or by Databento) is distinct from the Nautilus InstrumentId string. The decoder maps:

  • Databento raw_symbol → Nautilus symbol
  • Databento ISO 10383 MIC (Market Identifier Code) from the DEFINITION message → Nautilus venue
  • Together: "{symbol}.{venue}" - e.g., AAPL.XNAS

Critical for SPY: equity vs OPRA conventions

The doc covers GLBX (CME) explicitly but not OPRA explicitly. From the patterns the doc does establish:

  • SPY equity lives at XNAS (NASDAQ) or XNYS (NYSE). SPY’s true primary listing is NYSE Arca (MIC ARCX), but Cortana code has been using SPY.ARCA per the existing nautilus-adapters.md sketch. The doc notes “Other venue MICs are in the venue field of responses from the metadata.list_publishers endpoint.”
  • SPY options on OPRA publish under their own MIC. The exact MIC for OPRA-published SPY options is OPRA (the consolidated tape) or one of the participating exchange MICs (XCBO for CBOE, XISE for ISE, etc.) depending on use_exchange_as_venue. Verify on Saturday by calling metadata.list_publishers against your $125 credit account and grepping for SPY - this is the canonical resolution per the doc.
  • The adapter has a use_exchange_as_venue config flag (default True): “Use the exchange MIC for Nautilus venues (e.g., XCME). False retains the default GLBX mapping.” For OPRA, the analogous question is “do options come back per-exchange (XCBO/XISE/etc.) or consolidated under OPRA?” Default is per-exchange MIC; the spike plan should leave default and observe what instrument_id strings appear in the catalog.

CME Globex example (verbatim) - pattern to mirror for OPRA

For CME Globex MDP 3.0 (GLBX.MDP3), these exchanges group under the
GLBX venue. The instrument's exchange field determines the mapping:
  CBCM, NYUM, XCBT, XCEC, XCME, XFXS, XNYM
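The mapping rule described so far (raw_symbol plus MIC, with use_exchange_as_venue deciding whether the Globex exchanges collapse to GLBX) can be sketched as a small pure function. This is an illustration of the rule as this page states it, not the adapter's code; nautilus_instrument_id is a hypothetical name and ESM4 is just an example symbol.

```python
# Exchanges the doc lists as grouping under the GLBX venue.
GLBX_EXCHANGES = frozenset({"CBCM", "NYUM", "XCBT", "XCEC", "XCME", "XFXS", "XNYM"})


def nautilus_instrument_id(
    raw_symbol: str,
    exchange_mic: str,
    use_exchange_as_venue: bool = True,
) -> str:
    """Build the "{symbol}.{venue}" string from a raw symbol and an exchange MIC."""
    venue = exchange_mic
    if not use_exchange_as_venue and exchange_mic in GLBX_EXCHANGES:
        # With the flag off, Globex exchanges keep the default GLBX venue.
        venue = "GLBX"
    return f"{raw_symbol}.{venue}"


print(nautilus_instrument_id("AAPL", "XNAS"))
print(nautilus_instrument_id("ESM4", "XCME", use_exchange_as_venue=True))
print(nautilus_instrument_id("ESM4", "XCME", use_exchange_as_venue=False))
```

How the flag behaves for OPRA publishers is exactly the open question for the spike, so the sketch deliberately leaves non-Globex MICs untouched.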

The OPRA equivalent: the OPRA dataset publishes from many participating-exchange MICs, the adapter normalizes them per use_exchange_as_venue. Documented dataset codes in the spike-relevant universe:

| Dataset code | Publisher / coverage | Cortana relevance |
|---|---|---|
| GLBX.MDP3 | CME Globex | Future use (ES options) |
| XNAS.ITCH | NASDAQ ITCH | SPY-on-NASDAQ underlying ticks if we ever route there |
| OPRA.PILLAR (assumed name; verify) | OPRA consolidated options tape | Spike Step 0.5 target. Pulls every SPY option print across all participating exchanges. |
| DBEQ.BASIC | Databento Equities Basic (consolidated NBBO) | SPY underlying NBBO - alternative to IBKR for backtest |

(Run databento datasets CLI on Saturday to print authoritative codes.)

Timestamps - the mapping Cortana needs for replay alignment

Databento timestamp fields:

| Field | Meaning |
|---|---|
| ts_event | Matching-engine-received timestamp (ns since epoch). |
| ts_in_delta | Matching-engine-sending delta (ns before ts_recv). |
| ts_recv | Capture-server-received timestamp (ns since epoch). |
| ts_out | Databento sending timestamp. |

The decoder maps Databento ts_recv → Nautilus ts_event. Doc: “This timestamp is more reliable and monotonically increases per instrument.”

Exceptions: DatabentoImbalance and DatabentoStatistics carry all four timestamp fields verbatim because they’re adapter-specific PyO3 types.

Cross-check pattern for spike Step 0.5

To confirm a decisions.db row aligns with a Databento print:

  1. Pull the decisions.db.scoring_events.spy_price_at_score and its timestamp (millisecond resolution in current Cortana).
  2. Multiply by 1e6 → nanoseconds.
  3. catalog.query(TradeTick, identifiers=["SPY.ARCA"], start=ts_ns - 100_000_000, end=ts_ns + 100_000_000) to pull a 200ms window around it.
  4. Cross-check price magnitude (within typical bid-ask spread). If the IBKR-recorded price falls within the Databento NBBO at that instant, alignment passes.
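The arithmetic in steps 2 to 4 can be sketched as three small helpers. The names (ms_to_ns, query_window, within_nbbo) are hypothetical, and the sketch assumes the decisions.db timestamp is an integer count of milliseconds since the Unix epoch, as step 1 describes.

```python
NS_PER_MS = 1_000_000


def ms_to_ns(ts_ms: int) -> int:
    """Convert a millisecond epoch timestamp to nanoseconds (step 2)."""
    return ts_ms * NS_PER_MS


def query_window(ts_ms: int, half_width_ms: int = 100) -> tuple[int, int]:
    """Return (start_ns, end_ns) for a window centred on ts_ms (step 3)."""
    center_ns = ms_to_ns(ts_ms)
    half_ns = half_width_ms * NS_PER_MS
    return center_ns - half_ns, center_ns + half_ns


def within_nbbo(price: float, bid: float, ask: float) -> bool:
    """Alignment passes if the recorded price sits inside the quoted market (step 4)."""
    return bid <= price <= ask


start_ns, end_ns = query_window(1_778_074_200_000)
print(end_ns - start_ns)  # width of the window in nanoseconds
```

The resulting start_ns and end_ns are the values to pass as start= and end= to catalog.query.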

Tie-breaker note: Nautilus stable-sorts by ts_init, not ts_event (see nautilus-data.md). Databento’s mapping ts_recv → ts_event means catalogs are deterministic on a per-instrument basis (since ts_recv monotonically increases per instrument), but cross-instrument ordering at the same nanosecond needs the same caveat as nautilus-backtesting.md Carryover #7.

Price precision

Databento raw prices are fixed-point integers scaled by 1e-9. The adapter derives precision from the instrument’s tick size in the DEFINITION message.
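As a rough illustration of the two ideas in this paragraph, the sketch below scales a raw fixed-point integer by 1e-9 and derives a display precision from a tick size. It uses Python's decimal module and hypothetical helper names; the adapter's actual derivation is done in Rust and may differ in detail.

```python
from decimal import Decimal


def precision_from_tick(tick_size: str) -> int:
    """Number of decimal places implied by a tick size such as "0.01"."""
    exponent = Decimal(tick_size).normalize().as_tuple().exponent
    return max(0, -exponent)


def decode_price(raw: int, precision: int) -> Decimal:
    """Scale a raw fixed-point price by 1e-9, then round to the given precision."""
    value = Decimal(raw).scaleb(-9)
    return value.quantize(Decimal(1).scaleb(-precision))


penny = precision_from_tick("0.01")            # penny tick
fine = precision_from_tick("0.00390625")       # 1/256 tick
print(penny, fine, decode_price(580_250_000_000, penny))
```

The second tick size shows why a fallback precision of 2 is only safe for penny-tick instruments.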

For live feeds: a per-instrument precision map is populated from InstrumentDefMsg records as they arrive. Definitions must arrive before market data for correct precision on instruments with non-standard tick sizes (e.g., treasury futures with 1/256 ticks). Without a prior definition, precision falls back to 2 (USD default) - fine for SPY-OPRA, where penny ticks dominate, but verify on the spike.

The Python adapter automatically subscribes to instrument definitions before market data - no extra config. The Rust client does not; you must subscribe DEFINITION first.

For historical/file-based loading: pass an explicit price_precision= parameter to from_dbn_file to override.

Catalog persistence - the Cortana MK3 ingest pattern

This is the section spike Step 0.5 lives on. The doc’s recommended flow:

from nautilus_trader.adapters.databento import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog
 
catalog = ParquetDataCatalog.from_env()  # uses NAUTILUS_PATH
loader = DatabentoDataLoader()
 
# STEP 1: Load instrument definitions FIRST (required)
instruments = loader.from_dbn_file(
    path="spy-opra-definition.dbn.zst",
    as_legacy_cython=False,  # PyO3 for performance
)
catalog.write_data(instruments)
 
# STEP 2: Load market data (trades, quotes, etc.)
instrument_id = InstrumentId.from_str("SPY.OPRA")  # verify MIC on Saturday
trades = loader.from_dbn_file(
    path="spy-opra-trades.dbn.zst",
    instrument_id=instrument_id,
    as_legacy_cython=False,
)
catalog.write_data(trades)
 
# Verify
print(catalog.instruments())  # empty list = step 1 missing

from_dbn_file parameters (verbatim)

| Param | Effect |
|---|---|
| path | DBN file path (.dbn or .dbn.zst). |
| instrument_id | Speeds up decoding by skipping symbology lookup. Optional but recommended. |
| price_precision | Overrides the default price precision (default: 2). |
| include_trades | For MBP-1 / CMBP-1: True emits both QuoteTick and TradeTick when trade data is present in the message. |
| as_legacy_cython | True (default) for legacy Cython types compatible with BacktestEngine. False required for IMBALANCE / STATISTICS (PyO3-only). False also faster for catalog writes. |

Doc warnings that bite

  • Empty instruments list → DEFINITION files missing. The catalog rejects market data writes if no instruments are present (or the query returns garbage precision).
  • Market data files do not contain instrument definitions. You must obtain DEFINITION schema files separately from Databento for your symbols and date ranges. Per the spike plan: pull SPY OPRA DEFINITION first, then TRADES + MBP-1.
  • IMBALANCE / STATISTICS are PyO3-only. Streaming them through BacktestNode / BacktestEngine is not yet supported - query the catalog directly and process in your strategy/analysis code.
  • TBBO / TCBBO double-counting: if include_trades=True you get quotes; include_trades=False you get trades. Two separate calls needed if you want both back as native types from one TCBBO file.

Performance: DBN-on-disk vs catalog-on-disk

Doc verbatim:

Two options for backtesting with DBN data:

  1. Store data as DBN (.dbn.zst) files and decode to Nautilus objects every run.
  2. Convert DBN files to Nautilus objects once and write to the data catalog (Nautilus Parquet format). The DBN decoder is optimized Rust, but writing to the catalog once gives the best backtest performance. DataFusion streams Nautilus Parquet data from disk at high throughput, at least an order of magnitude faster than decoding DBN per run.

For Cortana MK3 spike: do option 2. Convert once on Saturday, backtest forever from ~/cortana-data/catalog/.

Live trading config (for the future, not the spike)

The TradingNodeConfig shape:

from nautilus_trader.adapters.databento import DATABENTO
from nautilus_trader.adapters.databento.factories import DatabentoLiveDataClientFactory
from nautilus_trader.live.node import TradingNode
from nautilus_trader.config import InstrumentProviderConfig, TradingNodeConfig
 
config = TradingNodeConfig(
    ...,
    data_clients={
        DATABENTO: {
            "api_key": None,  # falls back to DATABENTO_API_KEY env var
            "http_gateway": None,
            "live_gateway": None,
            "instrument_provider": InstrumentProviderConfig(load_all=True),
            "instrument_ids": None,
            "parent_symbols": None,
        },
    },
    ...,
)
 
node = TradingNode(config=config)
node.add_data_client_factory(DATABENTO, DatabentoLiveDataClientFactory)
node.build()

Configuration parameters (verbatim)

| Option | Default | Description |
|---|---|---|
| api_key | None | Databento API secret. Falls back to DATABENTO_API_KEY env var when None. |
| http_gateway | None | Historical HTTP gateway override (testing). |
| live_gateway | None | Raw TCP real-time gateway override (testing). |
| use_exchange_as_venue | True | Use exchange MIC for Nautilus venue. False retains default GLBX mapping. |
| timeout_initial_load | 15.0 | Seconds to wait for instrument definitions per dataset before proceeding. |
| mbo_subscriptions_delay | 3.0 | Seconds to buffer before enabling MBO/L3 streams so initial snapshots replay in order. |
| bars_timestamp_on_close | True | Timestamp bars on close (ts_event/ts_init). False = open. |
| reconnect_timeout_mins | 10 | Minutes to attempt reconnection before giving up. None = retry indefinitely. |
| venue_dataset_map | None | Optional Nautilus venue → Databento dataset code mapping. |
| parent_symbols | None | {dataset: {parent symbols}} to preload definition trees, e.g. {"GLBX.MDP3": {"ES.FUT", "ES.OPT"}}. |
| instrument_ids | None | Nautilus InstrumentId values to preload definitions for at startup. |

Live-client architecture quirk

Per dataset, DatabentoDataClient opens two DatabentoLiveClient instances:

  1. One for MBO (order book deltas) real-time feeds.
  2. One for all other real-time feeds.

This is why MBO must be subscribed at startup - the first client needs the full subscribe-set before opening the stream so it can replay session-start snapshots in order.

Connection stability (live operation)

Two reconnect modes:

  • With timeout (default 10 min): exponential backoff capped at 60s. Pattern: 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s… (with up-to-1s jitter). Survives transient network issues + scheduled gateway restarts. Stops retrying overnight.
  • Without timeout (reconnect_timeout_mins=None): exponential backoff capped at 10 min. Pattern: 1s, 2s, 4s, …, 256s, 512s, 600s, 600s… (with jitter). For unattended systems. Can mask config / auth issues.

Every reconnection auto-resubscribes to all active topics. Successful sessions >60s reset the timeout clock.
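The two schedules above differ only in their cap, which a short sketch makes concrete. backoff_delays is a hypothetical helper, and adding the jitter on top of the capped delay is an assumption; the text above only says jitter is up to 1s.

```python
import random
from typing import Optional


def backoff_delays(
    cap_s: float,
    attempts: int,
    jitter_s: float = 0.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Exponential backoff starting at 1s, doubling each attempt, capped at cap_s."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        base = min(2.0 ** attempt, cap_s)
        # Assumption: jitter is added after capping.
        delays.append(base + rng.uniform(0.0, jitter_s))
    return delays


# Jitter disabled so the base pattern is visible.
print(backoff_delays(cap_s=60.0, attempts=8))    # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
print(backoff_delays(cap_s=600.0, attempts=12))  # ..., 256.0, 512.0, 600.0, 600.0
```

With the 60s cap, the first seven attempts already consume about two minutes of the default 10-minute budget.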

Sunday maintenance schedule

| Dataset | Maintenance Time (UTC) |
|---|---|
| CME Globex | 09 |
| All ICE venues | 09 |
| All other datasets (incl. OPRA) | 10 |

Default 10-min timeout covers a typical restart. Cortana operates weekdays only - Sunday maintenance is irrelevant for live MK3.

Cost model and the $125 credits

What the doc states:

  • “Databento offers 125 USD in free data credits (historical only) for new sign-ups. With careful requests, this covers testing and evaluation.”
  • “Check the /metadata.get_cost endpoint before requesting data.” This is the documented cost-estimation method.

What the doc does NOT cover (escalate to the databento Python SDK docs / databento.com pricing page on Saturday):

  • Exact per-symbol-day cost for OPRA TRADES + MBP-1.
  • Whether 0DTE wildcards (e.g., SPY 250509* for all SPY options expiring Friday) bill per-symbol-resolved or per-MB-of-DBN.
  • Live-feed billing (the databento-vs-uw-vs-ibkr-data-feeds.md page notes usage-based live is being deprecated 2025-03-31; live needs Standard or Plus).

Cost estimation pattern (Saturday morning)

# CLI pre-flight (assumes `pip install databento` and DATABENTO_API_KEY env)
databento metadata.get-cost \
    --dataset OPRA.PILLAR \
    --symbols "SPY 260506C00580000,SPY 260506P00580000" \
    --schema trades \
    --start 2026-05-06T13:30:00 \
    --end 2026-05-06T20:00:00

Or via Python SDK:

import databento as db
client = db.Historical()
cost = client.metadata.get_cost(
    dataset="OPRA.PILLAR",
    symbols=["SPY.OPT"],      # parent symbol in ROOT.TYPE form (like "ES.OPT") - resolves all SPY options
    stype_in="parent",        # critical: tells Databento "this is a parent symbol, expand to children"
    schema="trades",
    start="2026-05-06T13:30:00",
    end="2026-05-06T20:00:00",
)
print(f"Estimated cost: ${cost:.2f}")

Abort threshold per spike plan: if estimated cost > $30, narrow the symbol filter (specific strikes, e.g., the 0DTE near-the-money chain only) or reduce the time window.
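The abort rule can be expressed as a simple gate over the per-schema estimates. This is a minimal sketch: over_budget is a hypothetical helper, and the dollar figures in the usage lines are placeholders, not real Databento quotes.

```python
from typing import Mapping

SPIKE_BUDGET_USD = 30.0  # abort threshold from the spike plan


def total_cost(estimates: Mapping[str, float]) -> float:
    """Sum the per-schema get_cost estimates."""
    return sum(estimates.values())


def over_budget(estimates: Mapping[str, float], budget_usd: float = SPIKE_BUDGET_USD) -> bool:
    """True when the combined estimate exceeds the budget and the pull should be narrowed."""
    return total_cost(estimates) > budget_usd


# Placeholder numbers for illustration only.
print(over_budget({"trades": 10.0, "mbp-1": 25.0, "definition": 0.5}))
print(over_budget({"trades": 5.0, "mbp-1": 20.0, "definition": 0.5}))
```

Checking the combined total rather than each schema separately matters because TRADES, MBP-1, and DEFINITION are all pulled in the same session against the same credit balance.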

Cortana MK3 implications - Step 0.5 step-by-step playbook

This is the load-bearing section. Copy-pasteable Saturday-morning recipe.

Pre-spike setup (5 min, do Friday night)

# 1. Sign up at databento.com, claim $125 trial credits
# 2. Install the databento SDK (separate from the Nautilus adapter - useful for the CLI)
uv pip install databento
 
# 3. Set API key (visible at https://databento.com/portal/keys)
export DATABENTO_API_KEY="db-..."  # paste key
echo 'export DATABENTO_API_KEY="db-..."' >> ~/.zshrc
 
# 4. Set NAUTILUS_PATH
mkdir -p ~/cortana-data
export NAUTILUS_PATH=~/cortana-data
echo 'export NAUTILUS_PATH=~/cortana-data' >> ~/.zshrc
 
# 5. Smoke test
python -c "import databento; print(databento.__version__)"
databento datasets   # prints all available dataset codes - grep for OPRA

Step 0.5.1 - Verify OPRA dataset code (5 min)

databento datasets | grep -i opra
# Expected output includes OPRA.PILLAR (or current canonical OPRA dataset code)
 
databento metadata.list_publishers --dataset OPRA.PILLAR
# Lists participating exchanges and their MICs - confirms Nautilus venue mapping

If OPRA.PILLAR is not the canonical name, use whatever Databento prints and substitute in the next steps.

Step 0.5.2 - Pre-flight cost estimate (5 min)

Pick a date already covered by decisions.db (e.g., 2026-05-06):

# /tmp/databento_cost_check.py
import databento as db
 
DATE = "2026-05-06"
client = db.Historical()
 
# Estimate cost for SPY parent (all SPY options, that day, Trades + MBP-1)
for schema in ("trades", "mbp-1"):
    cost = client.metadata.get_cost(
        dataset="OPRA.PILLAR",
        symbols=["SPY.OPT"],  # parent symbology uses ROOT.TYPE form, like "ES.OPT"
        stype_in="parent",
        schema=schema,
        start=f"{DATE}T13:30:00",   # 8:30 CT = 13:30 UTC
        end=f"{DATE}T20:00:00",     # 3:00 PM CT = 20:00 UTC
    )
    print(f"{schema:10s}: ${cost:.2f}")
 
# Also estimate DEFINITION (small, mandatory)
cost_def = client.metadata.get_cost(
    dataset="OPRA.PILLAR",
    symbols=["SPY.OPT"],  # parent symbology uses ROOT.TYPE form, like "ES.OPT"
    stype_in="parent",
    schema="definition",
    start=f"{DATE}T00:00:00",
    end=f"{DATE}T23:59:59",
)
print(f"definition: ${cost_def:.2f}")

Run it. If trades + mbp-1 + definition exceeds $30, narrow to specific 0DTE near-the-money strikes.

Step 0.5.3 - Pull SPY OPRA Trades + MBP-1 + DEFINITION (15 min)

# /tmp/databento_pull.py
import databento as db
from pathlib import Path
 
DATE = "2026-05-06"
OUT = Path("~/cortana-data/raw/databento").expanduser()
OUT.mkdir(parents=True, exist_ok=True)
 
client = db.Historical()
 
# 1. DEFINITION schema (small, fast, MUST be first)
client.timeseries.get_range(
    dataset="OPRA.PILLAR",
    symbols=["SPY.OPT"],  # parent symbology uses ROOT.TYPE form, like "ES.OPT"
    stype_in="parent",
    schema="definition",
    start=f"{DATE}T00:00:00",
    end=f"{DATE}T23:59:59",
    path=str(OUT / f"spy-opra-definition-{DATE}.dbn.zst"),
)
print("Definition pulled.")
 
# 2. TRADES (every option print)
client.timeseries.get_range(
    dataset="OPRA.PILLAR",
    symbols=["SPY.OPT"],  # parent symbology uses ROOT.TYPE form, like "ES.OPT"
    stype_in="parent",
    schema="trades",
    start=f"{DATE}T13:30:00",
    end=f"{DATE}T20:00:00",
    path=str(OUT / f"spy-opra-trades-{DATE}.dbn.zst"),
)
print("Trades pulled.")
 
# 3. MBP-1 (top-of-book per option contract)
client.timeseries.get_range(
    dataset="OPRA.PILLAR",
    symbols=["SPY.OPT"],  # parent symbology uses ROOT.TYPE form, like "ES.OPT"
    stype_in="parent",
    schema="mbp-1",
    start=f"{DATE}T13:30:00",
    end=f"{DATE}T20:00:00",
    path=str(OUT / f"spy-opra-mbp1-{DATE}.dbn.zst"),
)
print("MBP-1 pulled.")

Equivalent CLI form (one schema at a time):

databento timeseries.get-range \
    --dataset OPRA.PILLAR \
    --symbols SPY.OPT \
    --stype-in parent \
    --schema trades \
    --start 2026-05-06T13:30:00 \
    --end 2026-05-06T20:00:00 \
    --output ~/cortana-data/raw/databento/spy-opra-trades-2026-05-06.dbn.zst

Step 0.5.4 - Ingest into ParquetDataCatalog (10 min)

# /tmp/databento_ingest.py
from pathlib import Path
from nautilus_trader.adapters.databento import DatabentoDataLoader
from nautilus_trader.model.identifiers import InstrumentId
from nautilus_trader.persistence.catalog import ParquetDataCatalog
 
DATE = "2026-05-06"
RAW = Path("~/cortana-data/raw/databento").expanduser()
 
catalog = ParquetDataCatalog.from_env()  # NAUTILUS_PATH
loader = DatabentoDataLoader()
 
# 1. DEFINITIONS first
instruments = loader.from_dbn_file(
    path=str(RAW / f"spy-opra-definition-{DATE}.dbn.zst"),
    as_legacy_cython=False,
)
catalog.write_data(instruments)
print(f"Wrote {len(instruments)} instrument definitions.")
 
# Verify
loaded = catalog.instruments()
print(f"Catalog now contains {len(loaded)} instruments. Sample:")
for inst in list(loaded)[:5]:
    print(f"  {inst.id}")
 
# 2. TRADES (no instrument_id arg - let symbology resolve per-print)
trades = loader.from_dbn_file(
    path=str(RAW / f"spy-opra-trades-{DATE}.dbn.zst"),
    as_legacy_cython=False,
)
catalog.write_data(trades)
print(f"Wrote {len(trades):,} trade ticks.")
 
# 3. MBP-1 (gets quotes + maybe trades)
quotes = loader.from_dbn_file(
    path=str(RAW / f"spy-opra-mbp1-{DATE}.dbn.zst"),
    include_trades=True,  # also emit TradeTicks where present
    as_legacy_cython=False,
)
catalog.write_data(quotes)
print(f"Wrote {len(quotes):,} quote/trade ticks from MBP-1.")

Step 0.5.5 - Query catalog and cross-check decisions.db (10 min)

# /tmp/databento_crosscheck.py
import sqlite3, pandas as pd
from pathlib import Path
from nautilus_trader.persistence.catalog import ParquetDataCatalog
from nautilus_trader.model import TradeTick
from nautilus_trader.model.identifiers import InstrumentId
 
DATE = "2026-05-06"
catalog = ParquetDataCatalog.from_env()
 
# 1. Pick one decisions.db row near a known-good entry timestamp
DB_PATH = Path("~/conductor/workspaces/cortanaroi-mk2/belo-horizonte/data/decisions.db").expanduser()
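conn = sqlite3.connect(DB_PATH)
# Assumes scoring_events.ts_event is stored as nanoseconds since epoch;
# if it is milliseconds, multiply by 1_000_000 before querying the catalog.
df = pd.read_sql(
    "SELECT ts_event, signal_id, spy_price_at_score, composite_score, bias "
    "FROM scoring_events "
    "WHERE date(ts_event/1e9, 'unixepoch') = ? "
    "AND composite_score >= 65 "
    "ORDER BY ts_event LIMIT 1",
    conn,
    params=(DATE,),  # bind the date instead of interpolating it into the SQL
)
conn.close()
if df.empty:
    raise SystemExit(f"No scoring_events rows with composite_score >= 65 on {DATE}")
print(df)
ts_ns = int(df.iloc[0]["ts_event"])
spy_price = float(df.iloc[0]["spy_price_at_score"])
print(f"Cortana saw SPY @ ${spy_price:.2f} at ts_event={ts_ns}")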
 
# 2. Pull a 200ms window of SPY-OPRA trades around that timestamp
# First find the SPY underlying in the Databento equity catalog if loaded;
# else use the nearest ATM option print as a proxy
results = catalog.query(
    data_cls=TradeTick,
    start=ts_ns - 100_000_000,  # ts - 100ms
    end=ts_ns + 100_000_000,    # ts + 100ms
)
print(f"Catalog returned {len(results)} ticks in 200ms window")
for tick in results[:10]:
    print(f"  {tick.instrument_id}  {tick.price}  size={tick.size}  ts_event={tick.ts_event}")
 
# 3. (Optional) load SPY underlying from a separate Databento equity pull (DBEQ.BASIC schema)
#    so cross-check is apples-to-apples. For Step 0.5 spike, the OPRA print is enough -
#    we're validating the *plumbing*, not building the prod pipeline.

Pass criterion: the catalog returns ticks; at least one of them falls within nanoseconds-to-milliseconds of ts_ns; and the per-contract instrument ID strings with their MIC suffix (SPY 250509C00580000.OPRA or similar) decode cleanly. Fail criterion: schema confusion, empty result, instrument_id mismatch, or out-of-order ticks.

Step 0.5.6 - Stub a no-op backtest streaming the catalog (10 min)

# /tmp/databento_backtest_stub.py
from nautilus_trader.backtest.node import BacktestNode
from nautilus_trader.config import (
    BacktestRunConfig, BacktestDataConfig, BacktestVenueConfig,
    BacktestEngineConfig, ImportableStrategyConfig,
)
from nautilus_trader.model import TradeTick
from nautilus_trader.model.identifiers import InstrumentId
 
DATE = "2026-05-06"
 
data_config = BacktestDataConfig(
    catalog_path="/Users/codysmith/cortana-data/catalog",
    data_cls=TradeTick,
    instrument_id=None,   # all SPY-OPRA contracts in catalog
    start_time=f"{DATE}T13:30:00Z",
    end_time=f"{DATE}T20:00:00Z",
)
 
venue_config = BacktestVenueConfig(
    name="OPRA",
    oms_type="NETTING",
    account_type="CASH",
    starting_balances=["100_000 USD"],
    book_type="L1_MBP",
)
 
# No strategies for the stub - the goal is only to prove the catalog streams.
# Strategies are configured on BacktestEngineConfig, not on BacktestRunConfig.
run_config = BacktestRunConfig(
    engine=BacktestEngineConfig(strategies=[]),
    data=[data_config],
    venues=[venue_config],
)
 
node = BacktestNode(configs=[run_config])
results = node.run()
print(results)

Pass: BacktestNode.run() completes without error and prints non-zero ticks streamed. Fail: ImportError, schema validation, or mid-run crash.

Step 0.5.7 - Acceptance log entry (5 min)

If all six substeps pass in <60 min and credit burn was <$30, log to the spike plan / brain timeline:

2026-05-09 | Cody - Step 0.5 PASS. SPY OPRA Trades + MBP-1 for
2026-05-06 ingested in N min, $X spent of $125 credits.
Catalog at ~/cortana-data/catalog/. Cross-check against
decisions.db row #N: ts alignment confirmed within Yms.
Backtest stub streamed Z ticks. Databento path validated for
MK3 historical replay.

Step 0.5 fail-mode triage

| Failure | Likely cause | Mitigation |
|---|---|---|
| Cost estimate >$30 | Pulling whole SPY parent is expensive | Narrow to specific 0DTE strikes; use stype_in="raw_symbol" with explicit OSI symbols. |
| databento datasets doesn’t list OPRA | Free credits don’t cover OPRA | Switch to a smaller/cheaper dataset (DBEQ.BASIC for SPY equity) and validate the plumbing. Defer OPRA to paid tier. |
| from_dbn_file raises on definition | DBN version mismatch / file corruption | Re-pull; check file size > 0; check zstd integrity. |
| catalog.instruments() empty after write | DEFINITION write failed silently | Re-run with as_legacy_cython=False; check write_data return value; check ~/cortana-data/catalog/data/ filesystem. |
| Cross-check tick-window empty | MIC mismatch - Cortana’s SPY.ARCA vs Databento’s per-exchange MIC | Run metadata.list_publishers and use the MIC the adapter actually wrote. |
| Out-of-order events on replay | Nanosecond-tie ordering issue | Document as Carryover 7-extension; pre-sort (ts_event, raw_id) at DataLoader stage. |
| Backtest stub hangs | Strategies-list empty + no-op strat config issue | Plug in a minimal Strategy subclass that counts ticks; or use BacktestEngine directly. |
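For the "file size > 0" and zstd checks in the triage table, a quick pre-ingest sanity check might look like the sketch below. check_dbn_zst is a hypothetical helper; it only inspects the 4-byte Zstandard frame magic at the start of the file, so it catches empty or mislabelled downloads but does not prove the whole stream decompresses.

```python
from pathlib import Path

# Zstandard frame magic number 0xFD2FB528, stored little-endian on disk.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def check_dbn_zst(path: Path) -> str:
    """Return "ok", "missing", "empty", or "not-zstd" for a .dbn.zst download."""
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    with path.open("rb") as fh:
        if fh.read(4) != ZSTD_MAGIC:
            return "not-zstd"
    return "ok"


raw_dir = Path("~/cortana-data/raw/databento").expanduser()
for name in ("definition", "trades", "mbp1"):
    candidate = raw_dir / f"spy-opra-{name}-2026-05-06.dbn.zst"
    print(candidate.name, check_dbn_zst(candidate))
```

Running it before Step 0.5.4 separates download problems from decoder problems when from_dbn_file raises.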

Cross-page Cortana question - answered

“Which Databento schema(s) does Cortana actually need for replay?” (Open thread on databento-vs-uw-vs-ibkr-data-feeds.md.)

Answer per this doc: TRADES + MBP-1 + DEFINITION is the minimum viable triple for the spike. TRADES gives every print. MBP-1 gives top-of-book quotes (and optionally trades via include_trades=True). DEFINITION is mandatory before any market data load. Skip MBO/MBP-10/IMBALANCE/STATISTICS until post-spike when adversarial-fidelity backtest needs them.
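
The load order implied above (DEFINITION first, then market data) can be enforced with a small ordering helper before anything touches the catalog. A sketch only: the file names and the schema-in-filename convention are assumptions made for illustration, not something the adapter requires.

```python
from pathlib import Path

# Lower rank loads first; DEFINITION must precede all market data.
SCHEMA_RANK = {"definition": 0, "trades": 1, "mbp-1": 1}


def schema_of(path: Path) -> str:
    # Assumed (hypothetical) naming convention: <symbol>.<schema>.dbn.zst
    return path.name.split(".")[1]


def load_order(paths):
    unknown = [p for p in paths if schema_of(p) not in SCHEMA_RANK]
    if unknown:
        raise ValueError(f"unexpected schema in: {unknown}")
    if not any(schema_of(p) == "definition" for p in paths):
        raise ValueError("no DEFINITION file; refusing to load market data")
    # sorted() is stable, so market-data files keep their input order.
    return sorted(paths, key=lambda p: SCHEMA_RANK[schema_of(p)])


files = [
    Path("spy.trades.dbn.zst"),
    Path("spy.definition.dbn.zst"),
    Path("spy.mbp-1.dbn.zst"),
]
ordered = load_order(files)
print([p.name for p in ordered])
```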

Open questions for Step 0.5 (not resolved by this doc)

  1. Exact OPRA dataset code. The doc names GLBX.MDP3 for CME but does not name OPRA’s. Resolve via databento datasets CLI on Saturday.
  2. OPRA MIC after use_exchange_as_venue=True. The doc establishes the per-exchange-MIC pattern via the GLBX example but does not enumerate OPRA’s behavior. Resolve via metadata.list_publishers --dataset <opra> and observe what instrument_id strings end up in the catalog.
  3. 0DTE wildcard syntax. Doc shows parent_symbols={"GLBX.MDP3": {"ES.FUT", "ES.OPT"}} for futures parents but doesn’t show “SPY 0DTE only” filtering for options. Likely path: pull SPY parent, filter post-load by expiry == DATE. Or use raw OSI symbols (SPY 250509C00580000 etc.) with stype_in="raw_symbol".
  4. Per-MB vs per-symbol-day cost shape. Doc says “Check metadata.get_cost” but does not document the cost formula. Treat get_cost as authoritative; do not try to predict from first principles.
  5. 404 how-to URLs. The two nautilus-how-to.md Databento recipes were 404 on 2026-05-06. The integration page at nautilustrader.io/docs/latest/integrations/databento/ is the actual canonical reference (this page mirrors it). The how-tos should be considered pointers; this page is the source of truth.
  6. Live OPRA pricing tier. Doc covers reconnection mechanics but not the pricing-tier requirement for live OPRA. Per databento-vs-uw-vs-ibkr-data-feeds.md: live needs Standard ($1,399/mo). Defer.
  7. OPRA.PILLAR vs separate per-exchange OPRA dataset codes. Resolve via databento datasets.
  8. Tie-breaking at exact-nanosecond ts_event collisions. Per nautilus-backtesting.md Carryover #7, not unique to Databento - but the OPRA tape is high-volume and genuinely produces collisions. Pre-sort (ts_event, sequence_id, raw_id) at DataLoader stage if observed.
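
For open question 3, the "pull the SPY parent, filter post-load by expiry" path can be sketched as an OSI symbol parser plus a filter. This assumes the standard OSI layout (root, then YYMMDD, then C/P, then an 8-digit strike equal to price x 1000); whether Databento's raw_symbol strings for OPRA use exactly this padding is not confirmed by this page. OsiOption, parse_osi and zero_dte are hypothetical helper names.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class OsiOption:
    root: str
    expiry: date
    right: str     # "C" or "P"
    strike: float


def parse_osi(symbol: str) -> OsiOption:
    # The last 15 characters are fixed-width: YYMMDD + C/P + 8-digit strike.
    # Parsing from the right tolerates any amount of root padding.
    tail = symbol[-15:]
    root = symbol[:-15].strip()
    expiry = datetime.strptime(tail[:6], "%y%m%d").date()
    right = tail[6]
    if right not in ("C", "P"):
        raise ValueError(f"not an OSI symbol: {symbol!r}")
    strike = int(tail[7:]) / 1000
    return OsiOption(root, expiry, right, strike)


def zero_dte(symbols, session: date):
    # Keep only contracts expiring on the session date.
    return [s for s in symbols if parse_osi(s).expiry == session]


symbols = ["SPY 250509C00580000", "SPY 250516P00575000"]
print(zero_dte(symbols, date(2025, 5, 9)))
```

The surviving symbols could then be passed as explicit raw symbols with stype_in="raw_symbol", which is the narrower (and presumably cheaper) request shape; confirm the cost with metadata.get_cost either way.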

Anti-patterns to avoid

  • Loading market data before DEFINITION. Catalog will reject or produce wrong-precision data. Always step 1 = DEFINITION, step 2 = market data.
  • Subscribing MBO after on_start. “Subscriptions after start are logged as errors and ignored.” MBO must be in on_start().
  • Subscribing TBBO/TCBBO and a separate trades feed for the same instrument. Doubles cost, creates duplicates. Pick one path.
  • Re-decoding DBN per backtest run. Pay the decode cost once, then query Parquet. Order-of-magnitude perf difference per the doc.
  • Using as_legacy_cython=True for IMBALANCE/STATISTICS. Raises ValueError. They are PyO3-only types.
  • Skipping metadata.get_cost before a big pull. $125 burns fast with naive whole-day MBO pulls.
  • Hardcoding venue MICs. Use the venue MIC the adapter actually writes; don’t assume SPY.ARCA or SPY.OPRA matches what’s in the catalog. Confirm with catalog.instruments().
  • Treating decisions.db millisecond timestamps as nanoseconds. Multiply by 1e6 before comparing against Nautilus ts_event.
  • Skipping the dual-live-client architecture caveat. Live MBO + other feeds open two TCP clients per dataset; budget firewall rules and rate limits for two connections, not one.
  • Hardcoding OPRA.PILLAR before verifying with databento datasets. The actual code may differ; the spike has time to check.
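
The millisecond-to-nanosecond conversion from the decisions.db anti-pattern above is worth doing with integer arithmetic. A minimal sketch, assuming decisions.db stores epoch milliseconds as integers; the function names and the tolerance parameter are illustrative, not part of Nautilus or Cortana.

```python
NS_PER_MS = 1_000_000


def ms_to_ns(ts_ms: int) -> int:
    # Integer multiply: float math (ts_ms * 1e6) can lose precision
    # at epoch-nanosecond magnitudes, which exceed 2**53.
    return int(ts_ms) * NS_PER_MS


def within_tolerance(decision_ts_ms: int, ts_event_ns: int, tol_ms: int) -> bool:
    # Compare in nanoseconds, the unit Nautilus uses for ts_event.
    return abs(ms_to_ns(decision_ts_ms) - ts_event_ns) <= tol_ms * NS_PER_MS


decision_ms = 1_746_540_000_123
tick_ns = 1_746_540_000_123_400_000  # 0.4 ms after the decision row
print(ms_to_ns(decision_ms), within_tolerance(decision_ms, tick_ns, tol_ms=1))
```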

When this concept applies

  • Pulling historical OPRA / equity tape into a Nautilus ParquetDataCatalog for backtest replay (Cortana MK3 spike Step 0.5 and beyond).
  • Building backtest fidelity tooling (MBO L3 reconstruction, MBP-10 L2 fills, NBBO replay).
  • Cross-validating UW alerts against raw OPRA prints (post-spike adversarial replay framework).
  • Evaluating a future move to Databento as a live feed (post-MK3, post-Standard-tier upgrade).

When it does not apply

  • Cortana’s primary signal layer (UW flow alerts) - UW remains the signal vendor; Databento is replay/audit only.
  • Pricing-of-record - IBKR is the only vendor whose price string lands on an order (feedback_ibkr_pricing_source.md).
  • Live trading on $125 free credits - historical only.

See Also


Timeline

2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 4.