Nautilus Dev Test Datasets

Nautilus’s developer guide formalizes how external datasets get curated, stored, and consumed as test fixtures. Three categories: small data (<1 MB) checked directly into tests/test_data/<source>/ with a metadata.json sidecar; large data (>1 MB) hosted in the R2 bucket with a SHA-256 checksum tracked in tests/test_data/large/checksums.json and downloaded on first use via ensure_test_data_exists(); and user-fetched data for vendor-restricted feeds where the repo only stores a manifest + fetch instructions + transform code and each user downloads with their own credentials. Storage format is Nautilus Parquet (ZSTD level 3, 1M row groups) - never raw vendor formats for new datasets. Naming: <source>_<instrument>_<date>_<datatype>.parquet. Curation has two paths: scripts/curate-dataset.sh for simple single-file downloads, or a Rust curation function under crates/testkit/src/<source>/ gated by #[cfg(test)] / #[ignore] for binary-format pipelines (ITCH, Tardis CSV.gz). The test runner uses a large-data-tests nextest group with max-threads = 1 so binaries sharing target paths don’t race on download. For Cortana MK3 this page is the canonical layout reference for tests/data/uw_flow_alerts/, tests/data/decisions_db/, tests/data/databento_opra/, and the “do not commit raw OPRA bytes - synthesize fixtures” rule.

This page complements nautilus-dev-testing.md (testing mechanism ladder + spec-acceptance pattern; parallel agent), nautilus-databento.md (the OPRA pull pipeline that produces the raw DBN files this page explains how to fixture), nautilus-data.md (where the loaded fixtures flow at runtime - DataEngine → Cache → MessageBus), and nautilus-custom-data.md (the @customdataclass types whose fixtures get parquet-encoded under this layout).

Core claim

Test datasets in Nautilus are versioned, hashed, license-tagged artifacts - not ad-hoc files dropped into a fixtures folder. Every curated dataset must answer four questions in metadata.json: what is it, where did it come from, can it be redistributed, and is its integrity verifiable. The directory layout under tests/test_data/ encodes the answer to “can it be redistributed” (small → checked in; large → R2 mirror; user-fetched → manifest only). Cortana MK3 should mirror this exact layout for its own fixtures so that pytest-driven backtests, replay tests, and adapter-spec acceptance tests have a single discoverable convention.

Where datasets live (verbatim layout)

tests/test_data/
├── <source>/                          # checked-in small data (<1 MB)
│   ├── <instrument>_<date>_<datatype>.parquet
│   ├── metadata.json
│   └── LICENSE.txt
├── large/                             # R2-hosted large data (>1 MB)
│   ├── checksums.json                 # SHA-256 per file
│   └── <files downloaded on first use>
├── <source>/<slug>/                   # user-fetched (vendor-restricted)
│   ├── metadata.json                  # provenance, license
│   ├── manifest.json                  # fetch + transform instructions
│   └── README.md                      # human-readable operator notes
├── local/                             # GITIGNORED local cache
│   └── <source>/<slug>/
│       ├── vendor/                    # raw vendor downloads (kept here)
│       └── <generated parquet>
└── tardis/, itch/, ...                # legacy (CSV/CSV.gz, predates policy)

The tests/test_data/local/ directory is gitignored. Tests skip cleanly when local data is absent - they never fail CI for missing user-fetched fixtures.

Required metadata (the contract)

Every curated dataset that ships an artifact must include metadata.json with these fields:

Field            Required       Description
file             yes¹           Filename of the dataset.
sha256           yes¹           SHA-256 hash of the file.
size_bytes       yes¹           File size in bytes.
original_url     yes            Download URL of original source data.
licence          yes            License terms and any redistribution constraints.
added_at         yes            ISO 8601 timestamp when dataset was curated.
instrument       recommended    Instrument symbol(s) covered.
date             recommended    Trading date(s) covered.
format           recommended    e.g. "Nautilus OrderBookDelta Parquet".
original_file    recommended    Vendor filename pre-transform.
parser           recommended    Parser + version (e.g. "itchy 0.3.4").

¹ Optional for user-fetched datasets without a single committed artifact - target_files in manifest.json becomes authoritative instead.
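The required fields above can be derived mechanically from the file itself. The following is a minimal Python sketch, not the logic of scripts/curate-dataset.sh: the function names build_metadata and write_metadata are hypothetical, and it fills only the six required fields.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_metadata(path: Path, original_url: str, licence: str) -> dict:
    """Return the required metadata.json fields for one dataset file."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash in 1 MiB chunks so large Parquet files are never fully loaded.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "file": path.name,
        "sha256": digest.hexdigest(),
        "size_bytes": path.stat().st_size,
        "original_url": original_url,
        "licence": licence,
        "added_at": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
    }


def write_metadata(path: Path, original_url: str, licence: str) -> Path:
    """Write metadata.json next to the dataset and return the sidecar path."""
    sidecar = path.parent / "metadata.json"
    payload = build_metadata(path, original_url, licence)
    sidecar.write_text(json.dumps(payload, indent=2) + "\n")
    return sidecar
```

The recommended fields (instrument, date, format, original_file, parser) would be added by hand or by the curation pipeline, since they cannot be inferred from the bytes alone.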

User-fetched datasets add:

Field                Required       Description
distribution         yes            Must be "user-fetch".
fetch_method         yes            API, web portal, CLI, etc.
fetch_reference      yes            URL or doc ref for download flow.
auth                 recommended    Required credentials/entitlements.
transform_version    yes            Version of local transform pipeline.
redistribution       yes            Short note on redistribution limits.
public_mirror        yes            Must be false for restricted datasets.

The fields match what scripts/curate-dataset.sh emits - use the script when possible so metadata stays consistent.
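When the script cannot be used, a small check can catch missing fields before a dataset is committed. This is a hedged sketch only: missing_fields is a hypothetical helper, and it simplifies the footnote by treating every user-fetch dataset as exempt from file, sha256, and size_bytes, whereas the rule above only exempts those without a single committed artifact.

```python
BASE_REQUIRED = ("file", "sha256", "size_bytes", "original_url", "licence", "added_at")
# Footnoted fields: optional when target_files in manifest.json is authoritative.
ARTIFACT_FIELDS = ("file", "sha256", "size_bytes")
USER_FETCH_REQUIRED = (
    "distribution",
    "fetch_method",
    "fetch_reference",
    "transform_version",
    "redistribution",
    "public_mirror",
)


def missing_fields(meta: dict) -> list[str]:
    """Return the problems found in a metadata.json payload (empty list = OK)."""
    is_user_fetch = meta.get("distribution") == "user-fetch"
    required = list(BASE_REQUIRED)
    if is_user_fetch:
        required = [f for f in required if f not in ARTIFACT_FIELDS]
        required += USER_FETCH_REQUIRED
    problems = [f for f in required if f not in meta]
    # Restricted datasets must never be flagged as publicly mirrored.
    if is_user_fetch and "public_mirror" in meta and meta["public_mirror"] is not False:
        problems.append("public_mirror must be false")
    return problems
```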

Storage format

New datasets are stored as Nautilus Parquet - not raw vendor formats. Three reasons cited verbatim:

  1. Consistent data types across all test datasets.
  2. No vendor-format parsing at test time.
  3. Clear derivative-work status for licensing.

Compression: ZSTD level 3. Row groups: 1M rows. Both choices are conventions, not enforced - but new code matches them.

User-fetched datasets also end up as Nautilus Parquet after the local transform step. Raw vendor files stay outside the repo and outside the public R2 bucket - they live in tests/test_data/local/<source>/<slug>/vendor/.

Naming convention

<source>_<instrument>_<date>_<datatype>.parquet

Verbatim examples:

  • itch_AAPL_2019-01-30_deltas.parquet
  • tardis_BTCUSDT_2020-09-01_depth10.parquet

Fields are kebab-cased within their slot; instrument keeps the venue symbol shape (BTC-PERPETUAL, BTCUSDT).
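Because each slot is kebab-cased and underscores only separate slots, a filename can be split back into its parts. A minimal sketch, assuming no slot ever contains an underscore and the date is always YYYY-MM-DD; parse_dataset_name is a hypothetical helper, not part of Nautilus.

```python
import re
from datetime import date

# <source>_<instrument>_<date>_<datatype>.parquet
_NAME = re.compile(
    r"^(?P<source>[^_]+)_(?P<instrument>[^_]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<datatype>[^_.]+)\.parquet$"
)


def parse_dataset_name(filename: str) -> dict:
    """Split a dataset filename into source, instrument, date, and datatype."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"Name does not follow the convention: {filename}")
    parts = match.groupdict()
    date.fromisoformat(parts["date"])  # raises ValueError for impossible dates
    return parts
```

For example, parse_dataset_name("tardis_BTCUSDT_2020-09-01_depth10.parquet") yields source "tardis", instrument "BTCUSDT", date "2020-09-01", and datatype "depth10".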

How to add a new dataset

Path 1 - Simple single-download files

Use scripts/curate-dataset.sh:

scripts/curate-dataset.sh <slug> <filename> <download-url> <licence>

This creates a versioned directory v1/<slug>/ containing the file, LICENSE.txt, and metadata.json populated with the required fields.

Path 2 - Complex pipelines (parse + transform)

For datasets needing format conversion (e.g. binary ITCH → Parquet):

  1. Write a curation function under crates/testkit/src/<source>/, gated behind #[cfg(test)] or marked #[ignore].
  2. The function should: download → parse → filter → convert to NautilusTrader types → write Parquet.
  3. Output .parquet + metadata.json to a local directory.
  4. Upload the Parquet to R2 manually.
  5. Add the SHA-256 to tests/test_data/large/checksums.json.

Verbatim ITCH example (the canonical Path 2 fixture):

# Download source (~4.4 GB, keep a local copy)
wget -O ~/Downloads/01302019.NASDAQ_ITCH50.gz \
  "https://emi.nasdaq.com/ITCH/Nasdaq%20ITCH/01302019.NASDAQ_ITCH50.gz"
 
# Curation test expects source at /tmp
ln -sf ~/Downloads/01302019.NASDAQ_ITCH50.gz /tmp/01302019.NASDAQ_ITCH50.gz
 
# Regenerate parquet
cargo test -p nautilus-testkit --lib test_curate_aapl_itch -- --ignored --nocapture
# Output: /tmp/itch_AAPL.XNAS_2019-01-30_deltas.parquet

Path 3 - User-fetched (restricted redistribution)

For datasets the project cannot redistribute (paid vendor licenses, per-account entitlements, unclear derivative-work rights):

  1. Commit a manifest.json and metadata.json. Do not commit the real vendor data or derived Parquet output.
  2. Provide a local fetch command that uses the user’s own vendor credentials.
  3. Convert the vendor data locally into Nautilus Parquet.
  4. Store outputs in a local cache path (tests/test_data/local/...) ignored by git.
  5. Make tests opt-in. They must skip cleanly when the dataset is missing.

Default distribution priority

Verbatim from the doc:

The default distribution order for new datasets is:

  1. Checked in small data.
  2. Public R2 large data.
  3. User-fetched data.

Choose user-fetched only when the first two options are not acceptable under the vendor’s terms.
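The priority order reduces to a two-input decision. The sketch below is illustrative: choose_distribution and the labels "checked-in" and "r2-large" are hypothetical names, the 1 MB threshold is read as decimal megabytes, and a file of exactly 1 MB (which the <1 MB / >1 MB wording leaves undefined) is treated as large.

```python
SMALL_DATA_LIMIT_BYTES = 1_000_000  # assumption: "1 MB" means decimal megabytes


def choose_distribution(size_bytes: int, redistributable: bool) -> str:
    """Pick the storage tier for a new dataset, following the default order."""
    if not redistributable:
        # Vendor terms rule out both checked-in and public R2 storage.
        return "user-fetch"
    if size_bytes < SMALL_DATA_LIMIT_BYTES:
        return "checked-in"
    return "r2-large"
```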

Manifest fields (manifest.json, user-fetched datasets):

Field                Description
slug                 Stable dataset identifier.
vendor               Vendor or venue name.
source_type          api, portal-download, purchased-archive, etc.
source_filters       Symbols, event IDs, market IDs, date ranges, file names.
target_files         Output Nautilus Parquet files expected after conversion.
cache_dir            Local output location relative to tests/test_data/local/.
fetch_command        Suggested command or script entry point.
transform_command    Suggested local conversion command.
env                  Required environment variables.
notes                Short operational notes.

metadata.json is authoritative for provenance, licensing, and redistribution rules. manifest.json is authoritative for fetch inputs, commands, cache locations, and output files. Don’t mix the two.
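The split between the two files can be checked automatically. In this sketch the field names come from the tables on this page, but everything else is an assumption: misplaced_keys is a hypothetical helper, the example values (vendor, scripts, environment variable) are invented for illustration, and the value types for source_filters and env are guesses.

```python
METADATA_FIELDS = {
    "file", "sha256", "size_bytes", "original_url", "licence", "added_at",
    "instrument", "date", "format", "original_file", "parser",
    "distribution", "fetch_method", "fetch_reference", "auth",
    "transform_version", "redistribution", "public_mirror",
}
MANIFEST_FIELDS = {
    "slug", "vendor", "source_type", "source_filters", "target_files",
    "cache_dir", "fetch_command", "transform_command", "env", "notes",
}


def misplaced_keys(metadata: dict, manifest: dict) -> dict:
    """Report keys that ended up in the wrong file."""
    return {
        "metadata.json": sorted(MANIFEST_FIELDS.intersection(metadata)),
        "manifest.json": sorted(METADATA_FIELDS.intersection(manifest)),
    }


# Illustrative manifest; every value here is a placeholder.
example_manifest = {
    "slug": "example-slug",
    "vendor": "ExampleVendor",
    "source_type": "api",
    "source_filters": {"symbols": ["BTCUSDT"], "dates": ["2020-09-01"]},
    "target_files": ["example_BTCUSDT_2020-09-01_deltas.parquet"],
    "cache_dir": "example/example-slug",
    "fetch_command": "python scripts/fetch_example.py",  # hypothetical script
    "transform_command": "python scripts/transform_example.py",  # hypothetical script
    "env": ["EXAMPLE_API_KEY"],
    "notes": "Requires a paid historical-data entitlement.",
}
```

Note that fetch_method and fetch_reference belong to metadata.json per the table above, even though the commands themselves live in manifest.json.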

Test runner serialization

Tests that download large shared files race when nextest runs binaries in parallel. The fix is in .config/nextest.toml:

[[profile.default.overrides]]
filter = 'binary(grid_mm_itch) | binary(orderbook_integration) | binary(your_new_binary)'
test-group = 'large-data-tests'
 
[test-groups]
large-data-tests = { max-threads = 1 }

When adding a new test binary that downloads from R2, add it to the large-data-tests group filter - otherwise concurrent processes will race on the same download path.

Regenerating datasets after schema change

# 1. Re-run the curation test
cargo test -p nautilus-tardis test_curate_deribit_deltas -- --ignored --nocapture
 
# 2. Update the SHA-256
sha256sum /tmp/<output_file>.parquet
 
# 3. Update tests/test_data/large/checksums.json (busts CI cache)
# 4. Update the corresponding metadata.json (sha256, size_bytes)
# 5. Upload the new Parquet to R2
# 6. Commit checksums.json + metadata.json
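
Steps 2 to 4 can be scripted so the two JSON files never drift apart. This is a sketch under a loud assumption: it treats checksums.json as a flat mapping of filename to hex digest, which may not match the real file's schema, so check the existing file before using anything like it. refresh_dataset_hashes is a hypothetical helper.

```python
import hashlib
import json
from pathlib import Path


def refresh_dataset_hashes(parquet: Path, checksums_path: Path, metadata_path: Path) -> str:
    """Recompute the SHA-256 and write it to checksums.json and metadata.json."""
    digest = hashlib.sha256()
    with parquet.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    hex_digest = digest.hexdigest()

    # Assumed schema: {"<filename>": "<sha256 hex>"}.
    checksums = json.loads(checksums_path.read_text())
    checksums[parquet.name] = hex_digest
    checksums_path.write_text(json.dumps(checksums, indent=2, sort_keys=True) + "\n")

    metadata = json.loads(metadata_path.read_text())
    metadata["sha256"] = hex_digest
    metadata["size_bytes"] = parquet.stat().st_size
    metadata_path.write_text(json.dumps(metadata, indent=2) + "\n")
    return hex_digest
```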

Pytest skip pattern (user-fetched fixtures)

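from pathlib import Path
import pytest

def test_user_fetched_dataset():
    filepath = Path("tests/test_data/local/<source>/<slug>/<file>.parquet")  # placeholder path
    if not filepath.exists():
        pytest.skip(f"User-fetched test data not found: {filepath}")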

For Rust integration tests that need manual prep, prefer #[ignore] when the test is not expected to run in default CI.

Licensing and redistribution rules - what NOT to do

Verbatim “do not” list:

  • Do not upload restricted vendor datasets to the public R2 bucket.
  • Do not commit real vendor-derived Parquet files when redistribution rights are unclear.
  • Do not make default CI depend on vendor credentials or paid historical-data access.

Internal-only mirrors are allowed if the license permits internal sharing - but treat them as a separate operational path, not part of the public test-data standard.

Tutorial test data - NAUTILUS_DATA_DIR

Several Nautilus tutorials load user-provided market data. The NAUTILUS_DATA_DIR env var overrides the base path. The test suite sets it to tests/test_data/local/ so tutorials run against small local samples. Examples:

tests/test_data/local/
├── Binance/
│   ├── BTCUSDT_T_DEPTH_2022-11-01_depth_snap.csv
│   └── BTCUSDT_T_DEPTH_2022-11-01_depth_update.csv
├── Bybit/
│   └── 2024-12-01_XRPUSDT_ob500.data.zip
└── HISTDATA/
    └── DAT_ASCII_EURUSD_T_202001.csv.gz

The pattern Cortana mirrors: ship a small subset (e.g. first 10,000 rows of a depth file) for local smoke tests; full files live in gitignored cache.
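Cutting such a subset from a gzip-compressed CSV takes only the standard library. A minimal sketch, assuming the file has a single header line and no quoted fields that span lines; write_head_sample is a hypothetical helper, and headerless vendor files would need the header handling removed.

```python
import gzip
from itertools import islice
from pathlib import Path


def write_head_sample(src: Path, dst: Path, max_rows: int = 10_000) -> int:
    """Copy the header plus the first max_rows data lines; return rows written."""
    written = 0
    with gzip.open(src, "rt", newline="") as fin, gzip.open(dst, "wt", newline="") as fout:
        fout.write(fin.readline())  # header line (assumed present)
        for line in islice(fin, max_rows):
            fout.write(line)
            written += 1
    return written
```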

Legacy datasets

Datasets that predate this policy and use raw CSV/CSV.gz formats without metadata.json remain valid for existing tests. New datasets do not get the legacy exemption - they go through the Parquet + metadata pipeline.

Cortana MK3 implications - fixture layout

The Nautilus convention maps cleanly to Cortana’s four fixture classes. Mirror the directory shape exactly so any future Nautilus-side tooling (data-test-spec runner, catalog auditor) recognizes our fixtures.

Cortana fixture layout (proposed)

tests/data/                                    # Cortana MK3 fixtures root
├── uw_flow_alerts/                            # checked-in small (<1 MB)
│   ├── 2026-05-06-cluster.parquet             # 50-100 alerts, one trading day
│   ├── 2026-05-06-quiet.parquet               # contrast fixture (<10 alerts)
│   ├── synthetic-edge-cases.parquet           # hand-crafted boundary alerts
│   └── metadata.json
├── decisions_db/                              # checked-in small
│   ├── 2026-05-06.sqlite                      # ~15 trades, full schema
│   ├── 2026-04-16-chop-cluster.sqlite         # known loss-day replay target
│   └── metadata.json
├── ibkr_chains/                               # checked-in small
│   ├── 2026-05-06-spy-chain.json              # one snapshot, ATM ±10 strikes
│   ├── 2026-05-06-spy-greeks.parquet          # derived greeks
│   └── metadata.json
├── databento_opra/                            # USER-FETCHED (vendor-restricted)
│   ├── 2026-05-06-spy/                        # one slug per (date, instrument)
│   │   ├── manifest.json                      # databento CLI invocation
│   │   ├── metadata.json                      # license, redistribution=false
│   │   └── README.md                          # "run scripts/fetch_databento.py"
│   └── synthesized/                           # checked-in tiny synthetic OPRA
│       ├── spy-near-the-money-100rows.parquet # for unit tests, hand-crafted
│       └── metadata.json
├── large/                                     # R2 mirror (Cortana-private bucket)
│   └── checksums.json
└── local/                                     # GITIGNORED - local cache
    └── databento_opra/
        ├── vendor/                            # raw .dbn.zst from Databento
        └── 2026-05-06-spy/                    # converted Nautilus Parquet

The first three fixtures (immediate)

  1. tests/data/uw_flow_alerts/2026-05-06-cluster.parquet: contains the day’s UWFlowAlert @customdataclass instances, parquet-encoded via the auto-generated schema. Small: one trading day’s flow alerts is a few hundred KB. Checked in. Used by the UWFlowAlert replay test in the spike’s Step 6.

    Generation:

    # scripts/build_fixtures/uw_flow_alerts.py
    from cortana.adapters.uw.replay import alerts_from_decisions_db
    from nautilus_trader.persistence.catalog import ParquetDataCatalog
    alerts = alerts_from_decisions_db("2026-05-06")
    ParquetDataCatalog("tests/data/uw_flow_alerts").write_data(alerts)
  2. tests/data/decisions_db/2026-05-06.sqlite: a trimmed copy of the production decisions.db containing exactly the day’s scoring_events rows (78-column schema). Checked in. Sub-MB. Used by the row-replay backtest test that streams scoring_events into the BacktestNode as synthesized ScoreUpdate events.

    Generation:

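    # Note: .dump copies the entire scoring_events table; delete rows outside
    # 2026-05-06 afterwards to get the single-day fixture.
    sqlite3 ~/conductor/.../data/decisions.db \
      ".dump scoring_events" | \
      sqlite3 tests/data/decisions_db/2026-05-06.sqlite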
  3. tests/data/ibkr_chains/2026-05-06-spy-chain.json: one option-chain snapshot near a representative entry timestamp (e.g. the first composite-score≥65 row from decisions.db). Hand-reduced to ATM ±10 strikes. JSON for human readability. Used by strike-selection unit tests.

    Generation:

    # scripts/build_fixtures/ibkr_chain.py
    from cortana.adapters.ibkr.snapshot import chain_at_ts
    chain_at_ts("SPY", "2026-05-06T13:30:15Z").write_json(
        "tests/data/ibkr_chains/2026-05-06-spy-chain.json"
    )

Should we commit raw OPRA bytes? - answer: NO

Reasoning per this doc + Cortana constraints:

  • OPRA tape is large (one day SPY OPRA TRADES = hundreds of MB to multi-GB). Fails the <1 MB checked-in threshold by orders of magnitude.
  • Databento license: per-account, credit-based access. Committing raw DBN would hand it to anyone who clones the Cortana repo, which is a license violation and the exact case the “user-fetched data” rules in this doc cover (distribution: "user-fetch", public_mirror: false).
  • The R2 large-data path is for datasets the project (Nautilus, in their case; Cortana, in ours) has redistribution rights to. We do not have those rights for OPRA.

The right pattern (verbatim from this doc, applied to Cortana):

  1. Commit tests/data/databento_opra/2026-05-06-spy/manifest.json with the Databento timeseries.get_range request and the expected output file paths.
  2. Commit metadata.json with distribution: "user-fetch", public_mirror: false, the license note, and the original URL pattern.
  3. Commit README.md instructing operators to run scripts/fetch_databento.py after exporting DATABENTO_API_KEY.
  4. The script writes raw .dbn.zst to tests/data/local/databento_opra/2026-05-06-spy/vendor/ and the converted Parquet to tests/data/local/databento_opra/2026-05-06-spy/.
  5. Pytest fixtures load from the local cache and skip when absent - never fail CI on missing OPRA.
  6. Additionally commit a tiny synthesized OPRA-shape fixture under tests/data/databento_opra/synthesized/ (~100 rows, hand-crafted) so unit tests covering the OPRA loader path run in CI without any real OPRA data.
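
The synthesized fixture in the last step could be generated rather than typed by hand. The sketch below only produces deterministic trade-shaped rows as plain dicts; the field names, the placeholder symbol, and the price walk are all assumptions, and turning the rows into Nautilus Parquet is left to the real fixture script.

```python
import random
from datetime import datetime, timezone

NANOS_PER_MS = 1_000_000


def synth_opra_trades(n: int = 100, seed: int = 7) -> list[dict]:
    """Generate n synthetic option-trade rows with increasing timestamps."""
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    # 2026-05-06 13:30 UTC is the 09:30 New York open (daylight time).
    open_utc = datetime(2026, 5, 6, 13, 30, tzinfo=timezone.utc)
    ts_event = int(open_utc.timestamp()) * 1_000_000_000  # UNIX nanoseconds
    price = 2.50
    rows = []
    for _ in range(n):
        ts_event += rng.randint(1, 2_000) * NANOS_PER_MS
        # One-cent random walk, floored at the minimum tick.
        price = max(0.01, round(price + rng.choice((-0.01, 0.0, 0.01)), 2))
        rows.append(
            {
                "ts_event": ts_event,
                "symbol": "SPY-SYNTH-CALL",  # placeholder, not a real OPRA symbol
                "price": price,
                "size": rng.randint(1, 25),
            }
        )
    return rows
```

Seeding the generator means the committed fixture can be rebuilt byte-for-byte from the script, which keeps its metadata.json hash meaningful.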

How to load these in pytest fixtures

# tests/conftest.py
import sqlite3
from pathlib import Path

import pytest
from nautilus_trader.persistence.catalog import ParquetDataCatalog

# Assumed module path; adjust to wherever UWFlowAlert is actually defined.
from cortana.adapters.uw.data import UWFlowAlert

FIXTURES = Path(__file__).parent / "data"


@pytest.fixture
def uw_flow_alerts_2026_05_06():
    path = FIXTURES / "uw_flow_alerts" / "2026-05-06-cluster.parquet"
    if not path.exists():
        pytest.skip(f"Fixture missing: {path}")
    catalog = ParquetDataCatalog(str(FIXTURES / "uw_flow_alerts"))
    return list(catalog.query(data_cls=UWFlowAlert))


@pytest.fixture
def decisions_db_2026_05_06():
    path = FIXTURES / "decisions_db" / "2026-05-06.sqlite"
    if not path.exists():
        pytest.skip(f"Fixture missing: {path}")
    conn = sqlite3.connect(path)
    yield conn
    conn.close()  # release the file handle once the test finishes


@pytest.fixture
def databento_opra_local_2026_05_06_spy():
    path = FIXTURES / "local" / "databento_opra" / "2026-05-06-spy"
    if not path.exists():
        pytest.skip(
            "Local Databento OPRA cache absent - run "
            "scripts/fetch_databento.py with DATABENTO_API_KEY set"
        )
    return ParquetDataCatalog(str(path))

The skip-when-missing pattern is mandatory for any user-fetched fixture - it preserves the “default CI does not need vendor credentials” rule from this doc.

Naming for Cortana fixtures

Apply the Nautilus convention <source>_<instrument>_<date>_<datatype>.parquet to Cortana fixtures:

  • uw_SPY_2026-05-06_flow-alerts.parquet
  • databento_SPY_2026-05-06_opra-trades.parquet
  • ibkr_SPY_2026-05-06_chain-snapshot.json
  • cortana_SPY_2026-05-06_decisions.sqlite (treat cortana as the source)

Adopt this naming during the fixture-build step of the spike - easier than renaming later.

When this concept applies

  • Every Cortana MK3 test fixture committed after 2026-05-09.
  • Designing the Saturday spike’s Step 6 row-replay fixture set.
  • Migrating existing tests/fixtures/*.json chain snapshots from the pre-Nautilus structure to the new layout.
  • Onboarding any future contributor - they look here for “where do fixtures live, what’s checked in, what’s user-fetched.”

When it does not apply

  • Production runtime data (decisions.db, broker state, brain entries) - those follow Cortana’s three-tier state model (project_data_loss_april22.md), not this test-fixtures policy.
  • One-off ad-hoc spike scratch files (/tmp/databento_*.py) - those are throwaway, not fixtures.
  • Memory snapshots (~/.claude/...) - they live outside repo entirely.

Anti-patterns to avoid

  • Committing raw OPRA .dbn.zst to the repo or to a public bucket. License violation; size violation; verbatim “do not” item.
  • Skipping metadata.json. Every artifact gets one. No exceptions for “obvious” datasets.
  • Mixing fetch instructions into metadata.json. Fetch goes in manifest.json. License/provenance goes in metadata.json.
  • Hard-failing pytest on missing user-fetched fixtures. Use pytest.skip(...). CI must pass without vendor credentials.
  • Committing raw vendor formats (CSV/CSV.gz) for new datasets. New datasets are Nautilus Parquet only; CSV is legacy-only.
  • Forgetting to update checksums.json after regenerating a large dataset. CI cache won’t bust; downstream tests get stale data.
  • Adding a new R2-downloading test binary without joining the large-data-tests nextest group. Concurrent downloads race.
  • Using tests/test_data/local/ for anything you’d want CI to see. It’s gitignored. CI never sees it. By design.

Open questions for Cortana fixturing (deferred)

  1. Does Cortana need its own private R2 (or S3) bucket for tests/data/large/? Probably yes once decisions.db history exceeds a few MB, but defer until first concrete fixture exceeds 1 MB.
  2. Do we mirror Nautilus’s scripts/curate-dataset.sh or write a Cortana-specific scripts/curate_fixture.py? Probably the latter - we have additional steps (UW alert dedup, decisions-db trimming).
  3. Where does the synthesized OPRA fixture come from? Likely hand-crafted via a one-off scripts/build_fixtures/synth_opra.py that emits ~100 trades with realistic timestamps and prices - committed alongside the fixture.


Timeline

2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 7 (developer guide).