Nautilus Dev Test Datasets
Nautilus’s developer guide formalizes how external datasets get curated, stored, and consumed as test fixtures. Three categories: small data (<1 MB) checked directly into tests/test_data/<source>/ with a metadata.json sidecar; large data (>1 MB) hosted in the R2 bucket with a SHA-256 checksum tracked in tests/test_data/large/checksums.json and downloaded on first use via ensure_test_data_exists(); and user-fetched data for vendor-restricted feeds where the repo only stores a manifest + fetch instructions + transform code and each user downloads with their own credentials. Storage format is Nautilus Parquet (ZSTD level 3, 1M-row row groups) - never raw vendor formats for new datasets. Naming: <source>_<instrument>_<date>_<datatype>.parquet. Curation has two paths: scripts/curate-dataset.sh for simple single-file downloads, or a Rust curation function under crates/testkit/src/<source>/ gated by #[cfg(test)] / #[ignore] for binary-format pipelines (ITCH, Tardis CSV.gz). The test runner uses a large-data-tests nextest group with max-threads = 1 so binaries sharing target paths don’t race on download. For Cortana MK3 this page is the canonical layout reference for tests/data/uw_flow_alerts/, tests/data/decisions_db/, tests/data/databento_opra/, and the “do not commit raw OPRA bytes - synthesize fixtures” rule.
This page complements nautilus-dev-testing.md (testing mechanism
ladder + spec-acceptance pattern; parallel agent), nautilus-databento.md
(the OPRA pull pipeline that produces the raw DBN files this page
explains how to fixture), nautilus-data.md (where the loaded fixtures
flow at runtime - DataEngine → Cache → MessageBus), and
nautilus-custom-data.md (the @customdataclass types whose fixtures
get parquet-encoded under this layout).
Core claim
Test datasets in Nautilus are versioned, hashed, license-tagged
artifacts - not ad-hoc files dropped into a fixtures folder. Every
curated dataset must answer four questions in metadata.json: what is
it, where did it come from, can it be redistributed, and is its
integrity verifiable. The directory layout under tests/test_data/
encodes the answer to “can it be redistributed” (small → checked in;
large → R2 mirror; user-fetched → manifest only). Cortana MK3 should
mirror this exact layout for its own fixtures so that pytest-driven
backtests, replay tests, and adapter-spec acceptance tests have a single
discoverable convention.
Where datasets live (verbatim layout)
tests/test_data/
├── <source>/  # checked-in small data (<1 MB)
│   ├── <instrument>_<date>_<datatype>.parquet
│   ├── metadata.json
│   └── LICENSE.txt
├── large/  # R2-hosted large data (>1 MB)
│   ├── checksums.json  # SHA-256 per file
│   └── <files downloaded on first use>
├── <source>/<slug>/  # user-fetched (vendor-restricted)
│   ├── metadata.json  # provenance, license
│   ├── manifest.json  # fetch + transform instructions
│   └── README.md  # human-readable operator notes
├── local/  # GITIGNORED local cache
│   └── <source>/<slug>/
│       ├── vendor/  # raw vendor downloads (kept here)
│       └── <generated parquet>
└── tardis/, itch/, ...  # legacy (CSV/CSV.gz, predates policy)
The tests/test_data/local/ directory is gitignored. Tests skip
cleanly when local data is absent - they never fail CI for missing
user-fetched fixtures.
Required metadata (the contract)
Every curated dataset that ships an artifact must include
metadata.json with these fields:
| Field | Required | Description |
|---|---|---|
| file | yes¹ | Filename of the dataset. |
| sha256 | yes¹ | SHA-256 hash of the file. |
| size_bytes | yes¹ | File size in bytes. |
| original_url | yes | Download URL of original source data. |
| licence | yes | License terms and any redistribution constraints. |
| added_at | yes | ISO 8601 timestamp when dataset was curated. |
| instrument | recommended | Instrument symbol(s) covered. |
| date | recommended | Trading date(s) covered. |
| format | recommended | e.g. "Nautilus OrderBookDelta Parquet". |
| original_file | recommended | Vendor filename pre-transform. |
| parser | recommended | Parser + version (e.g. "itchy 0.3.4"). |
¹ Optional for user-fetched datasets without a single committed
artifact - target_files in manifest.json becomes authoritative
instead.
User-fetched datasets add:
| Field | Required | Description |
|---|---|---|
| distribution | yes | Must be "user-fetch". |
| fetch_method | yes | API, web portal, CLI, etc. |
| fetch_reference | yes | URL or doc ref for download flow. |
| auth | recommended | Required credentials/entitlements. |
| transform_version | yes | Version of local transform pipeline. |
| redistribution | yes | Short note on redistribution limits. |
| public_mirror | yes | Must be false for restricted datasets. |
The fields match what scripts/curate-dataset.sh emits - use the
script when possible so metadata stays consistent.
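The contract above can be checked mechanically. The following is a minimal sketch, not Nautilus tooling: the field names come from the two tables, while the function names are hypothetical. It also simplifies footnote ¹ by treating the artifact fields (file, sha256, size_bytes) as optional for every user-fetch dataset, which is an assumption.

```python
import json
from pathlib import Path

# Field names are taken from the tables above; the validator is illustrative.
ARTIFACT_FIELDS = ("file", "sha256", "size_bytes")
ALWAYS_REQUIRED = ("original_url", "licence", "added_at")
USER_FETCH_REQUIRED = (
    "distribution",
    "fetch_method",
    "fetch_reference",
    "transform_version",
    "redistribution",
    "public_mirror",
)


def missing_metadata_fields(meta: dict) -> list[str]:
    """Return the required fields that are absent from a metadata.json payload."""
    required = list(ALWAYS_REQUIRED)
    if meta.get("distribution") == "user-fetch":
        # Assumption: artifact fields are skipped for all user-fetch datasets,
        # because manifest.json's target_files is authoritative there.
        required += USER_FETCH_REQUIRED
    else:
        required += ARTIFACT_FIELDS
    return [name for name in required if name not in meta]


def check_metadata_file(path: Path) -> list[str]:
    """Load a metadata.json file and report its missing required fields."""
    return missing_metadata_fields(json.loads(path.read_text()))
```

An empty list means the sidecar satisfies the required columns; the recommended fields are deliberately not enforced.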
Storage format
New datasets are stored as Nautilus Parquet - not raw vendor formats. Three reasons cited verbatim:
- Consistent data types across all test datasets.
- No vendor-format parsing at test time.
- Clear derivative-work status for licensing.
Compression: ZSTD level 3. Row groups: 1M rows. Both choices are conventions, not enforced - but new code matches them.
User-fetched datasets also end up as Nautilus Parquet after the
local transform step. Raw vendor files stay outside the repo and
outside the public R2 bucket - they live in
tests/test_data/local/<source>/<slug>/vendor/.
Naming convention
<source>_<instrument>_<date>_<datatype>.parquet
Verbatim examples:
- itch_AAPL_2019-01-30_deltas.parquet
- tardis_BTCUSDT_2020-09-01_depth10.parquet
Fields are kebab-cased within their slot; instrument keeps the venue
symbol shape (BTC-PERPETUAL, BTCUSDT).
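The naming convention is regular enough to parse and build programmatically. This is a minimal sketch under two assumptions that the source does not state outright: no slot contains an underscore, and the date slot is a single ISO date (YYYY-MM-DD). The helper names are hypothetical.

```python
import re
from typing import NamedTuple


class DatasetName(NamedTuple):
    source: str
    instrument: str
    date: str
    datatype: str


# Assumption: slots never contain "_" and the date is one ISO calendar date.
_NAME = re.compile(
    r"^(?P<source>[^_]+)_(?P<instrument>[^_]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<datatype>[^_]+)\.parquet$"
)


def parse_dataset_name(filename: str) -> DatasetName:
    """Split a dataset filename into its four slots, or raise ValueError."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not a <source>_<instrument>_<date>_<datatype>.parquet name: {filename}")
    return DatasetName(**match.groupdict())


def build_dataset_name(source: str, instrument: str, date: str, datatype: str) -> str:
    """Join the four slots and validate the result by parsing it back."""
    name = f"{source}_{instrument}_{date}_{datatype}.parquet"
    parse_dataset_name(name)
    return name
```

Hyphens and dots inside a slot pass through unchanged, so instruments such as BTC-PERPETUAL keep their venue symbol shape.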
How to add a new dataset
Path 1 - Simple single-download files
Use scripts/curate-dataset.sh:
scripts/curate-dataset.sh <slug> <filename> <download-url> <licence>
This creates a versioned directory v1/<slug>/ containing the file,
LICENSE.txt, and metadata.json populated with the required fields.
Path 2 - Complex pipelines (parse + transform)
For datasets needing format conversion (e.g. binary ITCH → Parquet):
- Write a curation function under crates/testkit/src/<source>/, gated behind #[cfg(test)] or marked #[ignore].
- The function should: download → parse → filter → convert to NautilusTrader types → write Parquet.
- Output .parquet + metadata.json to a local directory.
- Upload the Parquet to R2 manually.
- Add the SHA-256 to tests/test_data/large/checksums.json.
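The final step above (recording the SHA-256) could be scripted rather than done by hand. This is a minimal sketch that assumes checksums.json is a flat filename-to-hex-digest map; the real file layout may differ, and both function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB artifacts do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(data_file: Path, checksums_file: Path) -> str:
    """Insert or update the digest for data_file; assumes a flat {name: hex} map."""
    checksums = json.loads(checksums_file.read_text()) if checksums_file.exists() else {}
    digest = sha256_of(data_file)
    checksums[data_file.name] = digest
    checksums_file.write_text(json.dumps(checksums, indent=2, sort_keys=True) + "\n")
    return digest


def verify_checksum(data_file: Path, checksums_file: Path) -> bool:
    """True when the recorded digest matches the file currently on disk."""
    checksums = json.loads(checksums_file.read_text())
    return checksums.get(data_file.name) == sha256_of(data_file)
```

Sorting keys on write keeps diffs of the checksum file small when a single dataset is regenerated.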
Verbatim ITCH example (the canonical Path 2 fixture):
# Download source (~4.4 GB, keep a local copy)
wget -O ~/Downloads/01302019.NASDAQ_ITCH50.gz \
"https://emi.nasdaq.com/ITCH/Nasdaq%20ITCH/01302019.NASDAQ_ITCH50.gz"
# Curation test expects source at /tmp
ln -sf ~/Downloads/01302019.NASDAQ_ITCH50.gz /tmp/01302019.NASDAQ_ITCH50.gz
# Regenerate parquet
cargo test -p nautilus-testkit --lib test_curate_aapl_itch -- --ignored --nocapture
# Output: /tmp/itch_AAPL.XNAS_2019-01-30_deltas.parquet
Path 3 - User-fetched (restricted redistribution)
For datasets the project cannot redistribute (paid vendor licenses, per-account entitlements, unclear derivative-work rights):
- Commit a manifest.json and metadata.json. Do not commit the real vendor data or derived Parquet output.
- Provide a local fetch command that uses the user’s own vendor credentials.
- Convert the vendor data locally into Nautilus Parquet.
- Store outputs in a local cache path (tests/test_data/local/...) ignored by git.
- Make tests opt-in. They must skip cleanly when the dataset is missing.
Default distribution priority
Verbatim from the doc:
The default distribution order for new datasets is:
- Checked in small data.
- Public R2 large data.
- User-fetched data.
Choose user-fetched only when the first two options are not acceptable under the vendor’s terms.
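The priority order reduces to a small decision function. This is a sketch, not project code: only the "user-fetch" string appears in the source, the other two labels are hypothetical, and treating exactly 1 MB as large (with 1 MB taken as 1,000,000 bytes) is an assumption because the source only gives "<1 MB" and ">1 MB".

```python
from enum import Enum

# Assumption: decimal megabyte; the source only says "1 MB".
ONE_MB = 1_000_000


class Distribution(Enum):
    CHECKED_IN = "checked-in"   # hypothetical label
    R2_LARGE = "r2-large"       # hypothetical label
    USER_FETCH = "user-fetch"   # value required by metadata.json


def choose_distribution(size_bytes: int, redistributable: bool) -> Distribution:
    """Pick the first acceptable option in the default priority order."""
    if not redistributable:
        # Vendor terms rule out both the repo and the public R2 bucket.
        return Distribution.USER_FETCH
    if size_bytes < ONE_MB:
        return Distribution.CHECKED_IN
    return Distribution.R2_LARGE
```

Redistribution rights are checked first because size is irrelevant once the license forbids sharing.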
Recommended manifest.json fields (user-fetched)
| Field | Description |
|---|---|
| slug | Stable dataset identifier. |
| vendor | Vendor or venue name. |
| source_type | api, portal-download, purchased-archive, etc. |
| source_filters | Symbols, event IDs, market IDs, date ranges, file names. |
| target_files | Output Nautilus Parquet files expected after conversion. |
| cache_dir | Local output location relative to tests/test_data/local/. |
| fetch_command | Suggested command or script entry point. |
| transform_command | Suggested local conversion command. |
| env | Required environment variables. |
| notes | Short operational notes. |
metadata.json is authoritative for provenance, licensing, and
redistribution rules. manifest.json is authoritative for fetch
inputs, commands, cache locations, and output files. Don’t mix the
two.
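The "don't mix the two" rule can be enforced with a key-ownership check. This is a minimal sketch in which the key sets are copied from the tables on this page and the function name is hypothetical; keys that appear in neither table are ignored.

```python
# Key ownership as listed in the metadata.json and manifest.json tables.
METADATA_KEYS = frozenset({
    "file", "sha256", "size_bytes", "original_url", "licence", "added_at",
    "instrument", "date", "format", "original_file", "parser",
    "distribution", "fetch_method", "fetch_reference", "auth",
    "transform_version", "redistribution", "public_mirror",
})
MANIFEST_KEYS = frozenset({
    "slug", "vendor", "source_type", "source_filters", "target_files",
    "cache_dir", "fetch_command", "transform_command", "env", "notes",
})


def misplaced_keys(metadata: dict, manifest: dict) -> dict[str, list[str]]:
    """Report keys that sit in the file the other table owns."""
    return {
        "metadata.json": sorted(k for k in metadata if k in MANIFEST_KEYS),
        "manifest.json": sorted(k for k in manifest if k in METADATA_KEYS),
    }
```

For example, a fetch_command placed in metadata.json would be reported under the "metadata.json" entry, and a sha256 placed in manifest.json under "manifest.json".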
Test runner serialization
Tests that download large shared files race when nextest runs binaries
in parallel. The fix is in .config/nextest.toml:
[[profile.default.overrides]]
filter = 'binary(grid_mm_itch) | binary(orderbook_integration) | binary(your_new_binary)'
test-group = 'large-data-tests'
[test-groups]
large-data-tests = { max-threads = 1 }
When adding a new test binary that downloads from R2, add it to the
large-data-tests group filter - otherwise concurrent processes
will race on the same download path.
Regenerating datasets after schema change
# 1. Re-run the curation test
cargo test -p nautilus-tardis test_curate_deribit_deltas -- --ignored --nocapture
# 2. Update the SHA-256
sha256sum /tmp/<output_file>.parquet
# 3. Update tests/test_data/large/checksums.json (busts CI cache)
# 4. Update the corresponding metadata.json (sha256, size_bytes)
# 5. Upload the new Parquet to R2
# 6. Commit checksums.json + metadata.json
Pytest skip pattern (user-fetched fixtures)
import pytest
# filepath is a pathlib.Path pointing at the expected local dataset
if not filepath.exists():
    pytest.skip(f"User-fetched test data not found: {filepath}")
For Rust integration tests that need manual prep, prefer #[ignore]
when the test is not expected to run in default CI.
Licensing and redistribution rules - what NOT to do
Verbatim “do not” list:
- Do not upload restricted vendor datasets to the public R2 bucket.
- Do not commit real vendor-derived Parquet files when redistribution rights are unclear.
- Do not make default CI depend on vendor credentials or paid historical-data access.
Internal-only mirrors are allowed if the license permits internal sharing - but treat them as a separate operational path, not part of the public test-data standard.
Tutorial test data - NAUTILUS_DATA_DIR
Several Nautilus tutorials load user-provided market data. The
NAUTILUS_DATA_DIR env var overrides the base path. The test suite
sets it to tests/test_data/local/ so tutorials run against small
local samples. Examples:
tests/test_data/local/
├── Binance/
│   ├── BTCUSDT_T_DEPTH_2022-11-01_depth_snap.csv
│   └── BTCUSDT_T_DEPTH_2022-11-01_depth_update.csv
├── Bybit/
│   └── 2024-12-01_XRPUSDT_ob500.data.zip
└── HISTDATA/
    └── DAT_ASCII_EURUSD_T_202001.csv.gz
The pattern Cortana mirrors: ship a small subset (e.g. first 10,000 rows of a depth file) for local smoke tests; full files live in gitignored cache.
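Both halves of this pattern (the environment override and the small head-of-file sample) fit in a few lines. This is a minimal sketch with hypothetical helper names. It assumes that an unset NAUTILUS_DATA_DIR falls back to tests/test_data/local, which the source does not state, and that the input is uncompressed, line-oriented text with no embedded newlines.

```python
import os
from itertools import islice
from pathlib import Path


def tutorial_data_dir(default: str = "tests/test_data/local") -> Path:
    """Resolve the base data path; the fallback default is an assumption."""
    return Path(os.environ.get("NAUTILUS_DATA_DIR", default))


def write_head_sample(src: Path, dst: Path, max_rows: int = 10_000, header: bool = True) -> int:
    """Copy the header (if any) plus the first max_rows data lines to dst.

    Returns the number of data rows written.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with src.open("r", newline="") as fin, dst.open("w", newline="") as fout:
        if header:
            fout.write(fin.readline())
        for line in islice(fin, max_rows):
            fout.write(line)
            written += 1
    return written
```

The full source file stays in the gitignored cache; only the sample written to dst is a candidate for check-in.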
Legacy datasets
Datasets that predate this policy and use raw CSV/CSV.gz formats
without metadata.json remain valid for existing tests. New
datasets do not get the legacy exemption - they go through the
Parquet + metadata pipeline.
Cortana MK3 implications - fixture layout
The Nautilus convention maps cleanly to Cortana’s four fixture classes. Mirror the directory shape exactly so any future Nautilus-side tooling (data-test-spec runner, catalog auditor) recognizes our fixtures.
Cortana fixture layout (proposed)
tests/data/  # Cortana MK3 fixtures root
├── uw_flow_alerts/  # checked-in small (<1 MB)
│   ├── 2026-05-06-cluster.parquet  # 50-100 alerts, one trading day
│   ├── 2026-05-06-quiet.parquet  # contrast fixture (<10 alerts)
│   ├── synthetic-edge-cases.parquet  # hand-crafted boundary alerts
│   └── metadata.json
├── decisions_db/  # checked-in small
│   ├── 2026-05-06-row-replay.sqlite  # ~15 trades, full schema
│   ├── 2026-04-16-chop-cluster.sqlite  # known loss-day replay target
│   └── metadata.json
├── ibkr_chains/  # checked-in small
│   ├── 2026-05-06-spy-chain.json  # one snapshot, ATM ±10 strikes
│   ├── 2026-05-06-spy-greeks.parquet  # derived greeks
│   └── metadata.json
├── databento_opra/  # USER-FETCHED (vendor-restricted)
│   ├── 2026-05-06-spy/  # one slug per (date, instrument)
│   │   ├── manifest.json  # databento CLI invocation
│   │   ├── metadata.json  # license, redistribution=false
│   │   └── README.md  # "run scripts/fetch_databento.py"
│   └── synthesized/  # checked-in tiny synthetic OPRA
│       ├── spy-near-the-money-100rows.parquet  # for unit tests, hand-crafted
│       └── metadata.json
├── large/  # R2 mirror (Cortana-private bucket)
│   └── checksums.json
└── local/  # GITIGNORED - local cache
    └── databento_opra/
        ├── vendor/  # raw .dbn.zst from Databento
        └── 2026-05-06-spy/  # converted Nautilus Parquet
The first three fixtures (immediate)
- tests/data/uw_flow_alerts/2026-05-06-cluster.parquet
  Contains the day’s UWFlowAlert @customdataclass instances parquet-encoded via the auto-generated schema. Small - one trading day’s flow alerts is a few hundred KB. Checked in. Used by the UWFlowAlert replay test in the spike’s Step 6.
  Generation:
      # scripts/build_fixtures/uw_flow_alerts.py
      from cortana.adapters.uw.replay import alerts_from_decisions_db
      from nautilus_trader.persistence.catalog import ParquetDataCatalog
      alerts = alerts_from_decisions_db("2026-05-06")
      ParquetDataCatalog("tests/data/uw_flow_alerts").write_data(alerts)
- tests/data/decisions_db/2026-05-06.sqlite
  A trimmed copy of the production decisions.db containing exactly the day’s scoring_events rows (78-column schema). Checked in. Sub-MB. Used by the row-replay backtest test that streams scoring_events into the BacktestNode as synthesized ScoreUpdate events.
  Generation:
      sqlite3 ~/conductor/.../data/decisions.db \
        ".dump scoring_events" | \
        sqlite3 tests/data/decisions_db/2026-05-06.sqlite
- tests/data/ibkr_chains/2026-05-06-spy-chain.json
  One option-chain snapshot near a representative entry timestamp (e.g. the first composite-score ≥ 65 row from decisions.db). Hand-reduced to ATM ±10 strikes. JSON for human readability. Used by strike-selection unit tests.
  Generation:
      # scripts/build_fixtures/ibkr_chain.py
      from cortana.adapters.ibkr.snapshot import chain_at_ts
      chain_at_ts("SPY", "2026-05-06T13:30:15Z").write_json(
          "tests/data/ibkr_chains/2026-05-06-spy-chain.json"
      )
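The sqlite3 .dump scoring_events pipeline for the decisions_db fixture copies the whole table, whereas the fixture is described as holding exactly one day's rows. A day-filtered copy could be sketched as below. This is an alternative technique, not the command above: the function name is hypothetical, and it assumes the table has a text column (called ts here, a made-up name) holding ISO 8601 timestamps.

```python
import sqlite3


def copy_day_rows(
    src_path: str,
    dst_path: str,
    day: str,
    table: str = "scoring_events",
    ts_column: str = "ts",  # hypothetical column name; adjust to the real schema
) -> int:
    """Copy rows whose timestamp starts with `day` (YYYY-MM-DD) into a new database.

    `table` and `ts_column` are interpolated into SQL, so pass trusted values only.
    Returns the number of rows copied.
    """
    src = sqlite3.connect(src_path)
    try:
        schema = src.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if schema is None:
            raise ValueError(f"table not found: {table}")
        rows = src.execute(
            f'SELECT * FROM "{table}" WHERE substr("{ts_column}", 1, 10) = ?',
            (day,),
        ).fetchall()
    finally:
        src.close()

    dst = sqlite3.connect(dst_path)
    try:
        dst.execute(schema[0])  # recreate the table with the original DDL
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            dst.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
        dst.commit()
    finally:
        dst.close()
    return len(rows)
```

The destination file is expected not to contain the table yet; rerunning against an existing fixture would fail on the CREATE TABLE statement rather than silently appending.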
Should we commit raw OPRA bytes? - answer: NO
Reasoning per this doc + Cortana constraints:
- OPRA tape is large (one day SPY OPRA TRADES = hundreds of MB to multi-GB). Fails the <1 MB checked-in threshold by orders of magnitude.
- Databento license: per-account credit-based access. Treating raw DBN as redistributable to anyone who clones the Cortana repo is a license violation per the “user-fetched data” rules in this doc (distribution: "user-fetch", public_mirror: false).
- The R2 large-data path is for datasets the project (Nautilus, in their case; Cortana, in ours) has redistribution rights to. We do not have those rights for OPRA.
The right pattern (verbatim from this doc, applied to Cortana):
- Commit tests/data/databento_opra/2026-05-06-spy/manifest.json with the databento timeseries.get-range command and the expected output file paths.
- Commit metadata.json with distribution: "user-fetch", public_mirror: false, the license note, and the original URL pattern.
- Commit README.md instructing operators to run scripts/fetch_databento.py after exporting DATABENTO_API_KEY.
- The script writes raw .dbn.zst to tests/data/local/databento_opra/2026-05-06-spy/vendor/ and the converted Parquet to tests/data/local/databento_opra/2026-05-06-spy/.
- Pytest fixtures load from the local cache and skip when absent - never fail CI on missing OPRA.
- Additionally commit a tiny synthesized OPRA-shape fixture under tests/data/databento_opra/synthesized/ (~100 rows, hand-crafted) so unit tests covering the OPRA loader path run in CI without any real OPRA data.
How to load these in pytest fixtures
# tests/conftest.py
import sqlite3
from pathlib import Path

import pytest
from nautilus_trader.persistence.catalog import ParquetDataCatalog

# UWFlowAlert is the project's @customdataclass type; import it from wherever
# Cortana defines it (the module path is not specified on this page).

FIXTURES = Path(__file__).parent / "data"


@pytest.fixture
def uw_flow_alerts_2026_05_06():
    path = FIXTURES / "uw_flow_alerts" / "2026-05-06-cluster.parquet"
    if not path.exists():
        pytest.skip(f"Fixture missing: {path}")
    catalog = ParquetDataCatalog(str(FIXTURES / "uw_flow_alerts"))
    return list(catalog.query(data_cls=UWFlowAlert))


@pytest.fixture
def decisions_db_2026_05_06():
    path = FIXTURES / "decisions_db" / "2026-05-06.sqlite"
    if not path.exists():
        pytest.skip(f"Fixture missing: {path}")
    return sqlite3.connect(path)


@pytest.fixture
def databento_opra_local_2026_05_06_spy():
    path = FIXTURES / "local" / "databento_opra" / "2026-05-06-spy"
    if not path.exists():
        pytest.skip(
            "Local Databento OPRA cache absent - run "
            "scripts/fetch_databento.py with DATABENTO_API_KEY set"
        )
    return ParquetDataCatalog(str(path))

The skip-when-missing pattern is mandatory for any user-fetched fixture - it preserves the “default CI does not need vendor credentials” rule from this doc.
Naming for Cortana fixtures
Apply the Nautilus convention <source>_<instrument>_<date>_<datatype>.parquet
to Cortana fixtures:
- uw_SPY_2026-05-06_flow-alerts.parquet
- databento_SPY_2026-05-06_opra-trades.parquet
- ibkr_SPY_2026-05-06_chain-snapshot.json
- cortana_SPY_2026-05-06_decisions.sqlite (treat cortana as the source)
Adopt this naming during the fixture-build step of the spike - easier than renaming later.
When this concept applies
- Every Cortana MK3 test fixture committed after 2026-05-09.
- Designing the Saturday spike’s Step 6 row-replay fixture set.
- Migrating existing tests/fixtures/*.json chain snapshots from the pre-Nautilus structure to the new layout.
- Onboarding any future contributor - they look here for “where do fixtures live, what’s checked in, what’s user-fetched.”
When it does not apply
- Production runtime data (decisions.db, broker state, brain entries) - those follow Cortana’s three-tier state model (project_data_loss_april22.md), not this test-fixtures policy.
- One-off ad-hoc spike scratch files (/tmp/databento_*.py) - those are throwaway, not fixtures.
- Memory snapshots (~/.claude/...) - they live outside repo entirely.
Anti-patterns to avoid
- Committing raw OPRA .dbn.zst to the repo or to a public bucket. License violation; size violation; verbatim “do not” item.
- Skipping metadata.json. Every artifact gets one. No exceptions for “obvious” datasets.
- Mixing fetch instructions into metadata.json. Fetch goes in manifest.json. License/provenance goes in metadata.json.
- Hard-failing pytest on missing user-fetched fixtures. Use pytest.skip(...). CI must pass without vendor credentials.
- Committing raw vendor formats (CSV/CSV.gz) for new datasets. New datasets are Nautilus Parquet only; CSV is legacy-only.
- Forgetting to update checksums.json after regenerating a large dataset. CI cache won’t bust; downstream tests get stale data.
- Adding a new R2-downloading test binary without joining the large-data-tests nextest group. Concurrent downloads race.
- Using tests/test_data/local/ for anything you’d want CI to see. It’s gitignored. CI never sees it. By design.
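Some of the anti-patterns above can be caught by a small audit script before commit. This is a minimal sketch: the function name and the suffix list are assumptions, and it is meant for new-policy fixture roots (such as Cortana's tests/data/), since legacy CSV directories would be flagged too.

```python
from pathlib import Path

# Assumed list of raw vendor suffixes; extend to match the vendors in use.
RAW_VENDOR_SUFFIXES = (".dbn", ".dbn.zst", ".csv", ".csv.gz")


def audit_fixture_root(root: Path, local_dir: str = "local") -> list[str]:
    """Return human-readable problems found under a fixtures root.

    Flags raw vendor files outside the gitignored local cache and Parquet
    files that have no metadata.json in the same directory.
    """
    problems = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if rel.parts[0] == local_dir or not path.is_file():
            continue  # the local cache is gitignored; directories are skipped
        if path.name.endswith(RAW_VENDOR_SUFFIXES):
            problems.append(f"raw vendor file outside {local_dir}/: {rel.as_posix()}")
        if path.suffix == ".parquet" and not (path.parent / "metadata.json").exists():
            problems.append(f"missing metadata.json next to: {rel.as_posix()}")
    return problems
```

An empty result does not prove license compliance; it only covers the two checks that can be made from file names alone.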
Open questions for Cortana fixturing (deferred)
- Does Cortana need its own private R2 (or S3) bucket for tests/data/large/? Probably yes once decisions.db history exceeds a few MB, but defer until first concrete fixture exceeds 1 MB.
- Do we mirror Nautilus’s scripts/curate-dataset.sh or write a Cortana-specific scripts/curate_fixture.py? Probably the latter - we have additional steps (UW alert dedup, decisions-db trimming).
- Where does the synthesized OPRA fixture come from? Likely hand-crafted via a one-off scripts/build_fixtures/synth_opra.py that emits ~100 trades with realistic timestamps and prices - committed alongside the fixture.
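For the last question, a seeded generator is one way such a script might start. This is only a sketch: the row keys are hypothetical and do not follow the Databento or Nautilus schema, the price walk is a placeholder rather than a market model, and encoding the rows to Parquet is left out.

```python
import random
from datetime import datetime, timedelta, timezone


def synth_opra_trades(
    n: int = 100,
    seed: int = 7,
    start: datetime = datetime(2026, 5, 6, 13, 30, tzinfo=timezone.utc),
    underlying: str = "SPY",
    base_price: float = 2.50,
) -> list[dict]:
    """Generate n synthetic trade rows with increasing timestamps.

    A fixed seed keeps the fixture reproducible across rebuilds.
    """
    rng = random.Random(seed)
    ts = start
    price = base_price
    rows = []
    for _ in range(n):
        ts += timedelta(milliseconds=rng.randint(1, 5_000))
        # One-cent random walk, floored at 0.01 so prices stay positive.
        price = max(0.01, round(price + rng.choice((-0.01, 0.0, 0.01)), 2))
        rows.append({
            "ts_event": ts.isoformat(),   # hypothetical column name
            "underlying": underlying,
            "price": price,
            "size": rng.randint(1, 50),
            "synthetic": True,            # marks the row as not real OPRA data
        })
    return rows
```

Keeping an explicit synthetic marker in every row makes it harder to mistake the fixture for vendor-derived data during a license review.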
See Also
- Nautilus Dev Testing - testing mechanism ladder and spec-acceptance pattern (parallel agent - verify slug).
- Nautilus Databento - the OPRA pull pipeline whose output this page rules on (commit vs cache vs synthesize).
- Nautilus Data Model - DataEngine / Cache / MessageBus / ParquetDataCatalog; the runtime side of the fixtures this page describes.
- Nautilus Custom Data - the @customdataclass types (UWFlowAlert, ScoreUpdate, MetaProb) whose instances become parquet fixtures under this layout.
- Nautilus Developer Guide - the contributor contract this page extends with the test-data subpolicy.
- Cortana data-loss postmortem - three-tier state model (production data is NOT test fixtures).
- Spike plan: ~/conductor/workspaces/cortanaroi-mk2/belo-horizonte/plans/2026-05-09-nautilus-spike.md Step 0.5 (Databento $125 credits) + Step 6 (run against today’s session data).
- Source: https://nautilustrader.io/docs/latest/developer_guide/test_datasets/
Timeline
2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 7 (developer guide).