0DTE ML Best-in-Class Comparison

Synthesis of academic and practitioner literature on ML training for 0DTE options, compared against the Cortana MK2 stack. Punchline: at our sample size (n≈100), the only approaches with empirical lift are meta-labeling (López de Prado / Hudson & Thames) and TabPFN (a small-n tabular transformer). Switching to TabNet, SAINT, or RL is premature. An 80% win rate is a 6-12 month horizon that requires microstructure capture, triple-barrier labels, a meta-label veto, and n≥500; anything faster is wishful.

Core claim

The path from 57% to 80% win rate is not “better model on the same data.” It’s clean labels (triple-barrier, vol-scaled) → meta-label veto on top of existing primary signal → microstructure capture for execution context. Architecture change without these three is noise.
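
A minimal sketch of the meta-label veto, assuming the secondary classifier already emits a win probability per trade. The `Signal` class, the 0.55 threshold, and the toy trades are hypothetical and only illustrate the mechanics: the primary keeps control of direction, the secondary only decides take-or-skip.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    side: int     # +1 long / -1 short, set by the primary signal
    p_win: float  # meta-classifier's estimated probability the trade wins
    won: bool     # realized outcome, used here only for the comparison

def apply_veto(signals, threshold):
    """Keep trades whose meta-probability clears the threshold; never change side."""
    return [s for s in signals if s.p_win >= threshold]

def win_rate(signals):
    return sum(s.won for s in signals) / len(signals) if signals else float("nan")

# Hypothetical toy data, not real trades.
signals = [
    Signal(+1, 0.72, True), Signal(-1, 0.41, False),
    Signal(+1, 0.66, True), Signal(-1, 0.38, False),
    Signal(+1, 0.58, False), Signal(-1, 0.81, True),
    Signal(+1, 0.45, True), Signal(-1, 0.50, True),
]
taken = apply_veto(signals, threshold=0.55)  # threshold is an assumption

print(f"all trades: n={len(signals)} win_rate={win_rate(signals):.3f}")
print(f"after veto: n={len(taken)} win_rate={win_rate(taken):.3f}")
```

On this toy set the veto raises win rate but also skips two winners, which is why the veto threshold should be judged on expectancy of the taken trades rather than on win rate alone.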

What the literature says

Sample-size reality

  • Tree ensembles (XGBoost) need ~10k samples to reach within ±2pp of full-data AUC for binary classification (medRxiv 2024.05.03).
  • Below n=200, penalized linear models (ridge/lasso) outperform trees. van Smeden 2019 dismantles the 10-events-per-variable rule, but the underlying point holds: with 38 features and roughly 100 training samples, no tree ensemble produces a stable feature ranking.
  • TabPFN (NeurIPS Table Representation Learning workshop): pretrained transformer for tabular data that predicts in a single forward pass, with no gradient training on our data. Designed explicitly for small tables (n<1000). Worth a 1-day spike.
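
To make the sample-size problem concrete, the arithmetic below works out events per variable for our numbers. It assumes the 57% win rate quoted in this note also holds inside the ~100-trade training set; the variable names are illustrative.

```python
n_trades = 100      # approximate training size quoted above
win_rate = 0.57     # current win rate (assumed to hold in the training set)
n_features = 38

wins = round(n_trades * win_rate)
losses = n_trades - wins
events = min(wins, losses)      # minority-class count
epv = events / n_features       # events per variable

print(f"wins={wins} losses={losses} events_per_variable={epv:.2f}")
```

That is roughly one minority-class event per feature, which is the regime where regularized linear models or a pretrained prior are more defensible than a freshly fit tree ensemble.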

Labels

  • Binary outcome_pnl >= 0 (our current target) is the worst label López de Prado names in Advances in Financial Machine Learning.
  • Triple-barrier method (TP/SL/time-out, label = first-hit) with volatility-scaled barriers (IV rank or 30m realized vol) is the practitioner standard. mlfinlab implements it.
  • Meta-labeling: the primary signal decides direction, a secondary classifier decides take-or-skip. Hudson & Thames published examples: 20% → 77% and 37% → 56% accuracy lifts. Highest-EV move at n=100 because the secondary’s hypothesis space is much narrower than the primary’s.
  • Path-aware multi-class (big_win / managed_win / scratch / loss / big_loss) is good for telemetry; triple-barrier is the label of record the meta-classifier consumes.
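
A minimal sketch of first-hit triple-barrier labeling with volatility-scaled barriers. This is a hand-rolled illustration, not mlfinlab's API; the function name, the 2:1 barrier multipliers, the bar-close price series, and the rule that time-outs become meta-label 0 are all assumptions.

```python
def triple_barrier_label(prices, entry_idx, side, vol,
                         tp_mult=2.0, sl_mult=1.0, max_bars=30):
    """Return +1 if take-profit is hit first, -1 if stop-loss, 0 on time-out.

    prices    : bar-close prices (one per bar)
    entry_idx : index of the entry bar
    side      : +1 long / -1 short, from the primary signal
    vol       : volatility estimate as a fractional return (e.g. 30m realized vol)
    """
    entry = prices[entry_idx]
    upper = tp_mult * vol    # profit barrier, in return space
    lower = -sl_mult * vol   # stop barrier, in return space
    last = min(entry_idx + max_bars, len(prices) - 1)
    for i in range(entry_idx + 1, last + 1):
        ret = side * (prices[i] / entry - 1.0)  # signed return of the position
        if ret >= upper:
            return 1
        if ret <= lower:
            return -1
    return 0  # vertical (time) barrier reached first

def meta_label(barrier_label):
    """Binary target for the secondary classifier: 1 = take, 0 = skip (assumed mapping)."""
    return 1 if barrier_label == 1 else 0

# Hypothetical price path: same path, different vol regime, different label.
path = [100.0, 100.05, 100.3, 98.5]
calm = triple_barrier_label(path, 0, side=+1, vol=0.001)  # tight barriers
wild = triple_barrier_label(path, 0, side=+1, vol=0.01)   # wide barriers
print(calm, wild, meta_label(calm), meta_label(wild))
```

The same price path produces a win under tight barriers and a loss under wide ones, which is the point of scaling barriers by volatility instead of using a fixed PnL cutoff.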

Features that matter for 0DTE

  1. Microstructure: NBBO spread, spread widening over the last 10s, quote churn, top-of-book depth ratio, signed-volume imbalance, sweep velocity. LOB literature (Tandfonline 2025, arXiv 2506.05764) is unambiguous: features matter more than model depth.
  2. Charm + vanna as continuous magnitudes, not 1-bit directions. Mechanically grounded in dealer hedging, survives regime drift more durably than empirical correlations (SqueezeMetrics GEX paper, Volland white paper, GEXBoard).
  3. Cross-asset divergences: QQQ/IWM/XLF vs SPY at 1m, ES-SPY basis, VVIX, MOVE, VIX9D/VIX, VIX term roll.
  4. Execution context: signal→submit→ack→fill ms latencies, slippage_vs_mid, market_data_age_ms. Without these we cannot distinguish “wrong signal” from “right signal, bad fill.”
  5. Trader state: trade_number_today, loss_streak, pnl_day_to_date.
  6. Macro context: FOMC/CPI/auction/OPEX dummies + minutes-to-event.
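
A sketch of how a few of the microstructure and execution-context features above could be derived from raw quote snapshots. The `Quote` layout and the exact feature definitions (for example, depth ratio as bid size over total top-of-book size) are assumptions for illustration, not the stack's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Quote:
    ts_ms: int     # quote timestamp in milliseconds
    bid: float
    ask: float
    bid_size: int
    ask_size: int

def spread(q):
    return q.ask - q.bid

def mid(q):
    return (q.bid + q.ask) / 2.0

def microstructure_features(quotes, now_ms, lookback_ms=10_000):
    """quotes must be sorted by ts_ms ascending; the last one is the current NBBO."""
    cur = quotes[-1]
    window = [q for q in quotes if now_ms - lookback_ms <= q.ts_ms <= now_ms] or [cur]
    ref = window[0]  # oldest quote inside the lookback window
    total_size = cur.bid_size + cur.ask_size
    return {
        "nbbo_spread": spread(cur),
        "spread_widening_10s": spread(cur) - spread(ref),
        "quote_churn": len(window) - 1,  # quote updates seen in the window
        "depth_ratio": cur.bid_size / total_size if total_size else 0.5,
        "market_data_age_ms": now_ms - cur.ts_ms,
    }

def slippage_vs_mid(fill_price, quote_at_submit, side):
    """Positive = paid worse than mid; side is +1 for buys, -1 for sells."""
    return side * (fill_price - mid(quote_at_submit))

# Hypothetical quotes for one option contract.
quotes = [
    Quote(0, 1.25, 1.50, 40, 60),
    Quote(4_000, 1.25, 1.50, 30, 50),
    Quote(9_500, 1.25, 1.75, 10, 30),
]
features = microstructure_features(quotes, now_ms=10_000)
slip = slippage_vs_mid(1.625, quotes[-1], side=+1)
print(features, slip)
```

Logging these at signal time, alongside the latency fields, is what lets a losing trade be attributed to the signal or to the fill.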

Evaluation

  • AUC is the wrong objective for trading. What matters: Brier score, decile lift, conditional expectancy above the veto threshold, and Sharpe with the model filter applied.
  • Use Combinatorial Purged Cross-Validation (López de Prado) with an embargo for honest out-of-sample estimates. skfolio implements it.
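
The sketch below illustrates two of the evaluation pieces: Brier score with conditional expectancy above a threshold, and a simplified combinatorial purged split generator. It is a hand-rolled stand-in, not skfolio's API. It assumes every label spans a fixed `horizon` of bars after its index, which is a simplification of purging on actual label end times; all names and toy numbers are hypothetical.

```python
from itertools import combinations
from math import comb

def brier_score(probs, outcomes):
    """Mean squared error between predicted win probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expectancy_above(probs, pnls, threshold):
    """Average PnL of the trades the model would have let through."""
    kept = [pnl for p, pnl in zip(probs, pnls) if p >= threshold]
    return sum(kept) / len(kept) if kept else float("nan")

def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, horizon=0, embargo=0):
    """Yield (train_idx, test_idx) for every combination of test groups.

    Training samples within `horizon` bars of a test block are purged on both
    sides; a further `embargo` bars are dropped after each test block.
    """
    bounds = [(g * n_samples // n_groups, (g + 1) * n_samples // n_groups)
              for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), n_test_groups):
        test, blocked = set(), set()
        for g in test_groups:
            start, stop = bounds[g]  # half-open block [start, stop)
            test.update(range(start, stop))
            blocked.update(range(max(0, start - horizon),
                                 min(n_samples, stop + horizon + embargo)))
        train = [i for i in range(n_samples) if i not in blocked]
        yield train, sorted(test)

# Hypothetical toy predictions and PnLs.
probs = [0.75, 0.25, 0.5, 1.0]
outcomes = [1, 0, 1, 1]
pnls = [40.0, -30.0, -10.0, 20.0]
brier = brier_score(probs, outcomes)
filtered = expectancy_above(probs, pnls, threshold=0.6)

splits = list(cpcv_splits(12, n_groups=6, n_test_groups=2, horizon=1, embargo=1))
n_paths = comb(6, 2) * 2 // 6  # backtest paths: (k / N) * C(N, k)
print(brier, filtered, len(splits), n_paths)
```

With 6 groups and 2 test groups this yields 15 train/test splits that recombine into 5 backtest paths, so a model is judged on a distribution of out-of-sample outcomes rather than one split.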

When this concept applies

Any conversation about “should we switch to ” or “how do we get to 80% via ML.” The answer is almost always “fix the label, capture microstructure, then meta-label” - not “more model.”

When it breaks

  • If we ever get to n≥5000 trades with full microstructure capture, the argument for tabular DL or small transformers becomes plausible.
  • If the strategy moves off 0DTE (e.g. weeklies, swings), some architectures that are not viable here become viable.

Timeline

2026-05-04 | derived - Research agent (Claude general-purpose) fired against this question after the user asked for an exhaustive comparison vs best-in-class. Estimated meta-labeling could plausibly move our 57% → ~69% on existing data; 80% requires 6-12 months and the full microstructure stack. Filed so the next “should we try TabNet?” thread starts from the literature, not from vibes.