0DTE ML Best-in-Class Comparison

Synthesis of the academic and practitioner literature on ML training for 0DTE options, compared against the Cortana MK2 stack. Punchline: at our sample size (n≈100), the only approaches with empirical lift are meta-labeling (López de Prado / Hudson & Thames) and TabPFN (small-n tabular transformer). Switching to TabNet/SAINT/RL is premature. An 80% win rate is a 6-12 month goal that requires microstructure capture + triple-barrier labels + meta-label veto + n≥500; anything faster is wishful.

Core claim

The path from 57% to 80% win rate is not “better model on the same data.” It’s clean labels (triple-barrier, vol-scaled) → meta-label veto on top of existing primary signal → microstructure capture for execution context. Architecture change without these three is noise.

What the literature says

Sample-size reality

Tree ensembles (XGBoost) need ~10k samples to reach within ±2pp of full-data AUC for binary classification (medRxiv 2024.05.03).
Below n=200, penalized linear models (ridge/lasso) outperform trees; van Smeden 2019 dismantles the 10-events-per-variable rule, but the underlying point holds: with 38 features and ~100 training rows, no tree ensemble produces a stable feature ranking.
TabPFN (table-representation-learning, NeurIPS): pretrained transformer for tabular data that predicts in a single forward pass, with no gradient training on the target dataset. Designed explicitly for n<1000. Worth a 1-day spike.
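
As a reference point for the penalized-linear claim above, here is a minimal sketch of an L2-penalized (ridge) logistic regression in plain Python. It is not the MK2 training code: the data is synthetic (100 rows, 38 standardized features, signal only in the first), and the function names, learning rate, and penalty values are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_ridge_logistic(X, y, l2=1.0, lr=0.1, epochs=300):
    """Batch gradient descent on mean log-loss + (l2 / 2n) * ||w||^2.

    The intercept is not penalized. Features are assumed standardized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        for j in range(d):
            w[j] -= lr * (grad_w[j] + l2 * w[j]) / n
        b -= lr * grad_b / n
    return w, b

def l2_norm(w):
    return math.sqrt(sum(v * v for v in w))

# Synthetic stand-in for the trade table; every value here is made up.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(38)] for _ in range(100)]
y = [1 if row[0] + rng.gauss(0, 1) > 0 else 0 for row in X]

w_weak, _ = fit_ridge_logistic(X, y, l2=0.01)    # almost unpenalized
w_strong, _ = fit_ridge_logistic(X, y, l2=50.0)  # heavy shrinkage
print(l2_norm(w_weak), l2_norm(w_strong))
```

The heavier penalty shrinks the 37 noise coefficients toward zero, which is the mechanism that keeps a linear model stable when features are nearly as numerous as rows. In practice the penalty strength would be chosen by cross-validation rather than fixed by hand.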

Labels

Binary outcome_pnl >= 0 (our current target) is the worst label López de Prado names in Advances in Financial Machine Learning.
Triple-barrier method (TP/SL/time-out, label = first-hit) with volatility-scaled barriers (IV rank or 30m realized vol) is the practitioner standard. mlfinlab implements it.
Meta-labeling: the primary signal decides direction, a secondary classifier decides take-or-skip. H&T published examples: 20% → 77% and 37% → 56% accuracy lifts. Highest-EV move at n=100 because the secondary’s hypothesis space is much narrower than the primary’s.
Path-aware multi-class (big_win / managed_win / scratch / loss / big_loss) is good for telemetry; triple-barrier is the label of record the meta-classifier consumes.
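
The labeling scheme described above can be sketched as follows. This is a simplified illustration, not mlfinlab's implementation: the function names, the barrier multipliers, and the choice to treat a time-out as "skip" in the meta-label are assumptions. The price path could be the underlying or the option mid; `vol` stands in for whichever volatility scale is chosen (IV rank or 30m realized vol), expressed in return units.

```python
from dataclasses import dataclass

@dataclass
class BarrierResult:
    label: int       # +1 take-profit first, -1 stop-loss first, 0 time-out
    exit_index: int  # index into the price path where the trade closes

def triple_barrier_label(prices, side, vol, tp_mult=2.0, sl_mult=1.0, max_bars=30):
    """Label one trade by whichever barrier is touched first.

    prices: path whose first element is the entry price.
    side: +1 for long, -1 for short (direction from the primary signal).
    vol: volatility estimate in return units, used to scale both barriers.
    """
    entry = prices[0]
    upper = tp_mult * vol    # take-profit barrier, as a signed return
    lower = -sl_mult * vol   # stop-loss barrier, as a signed return
    last = min(max_bars, len(prices) - 1)
    for i in range(1, last + 1):
        ret = side * (prices[i] / entry - 1.0)
        if ret >= upper:
            return BarrierResult(1, i)
        if ret <= lower:
            return BarrierResult(-1, i)
    return BarrierResult(0, last)  # vertical (time) barrier

def meta_label(barrier_label):
    # Target for the secondary classifier: 1 = take the trade, 0 = skip.
    # Assumption: time-outs count as "skip"; labeling them by the sign of
    # the return at the time barrier is an equally defensible choice.
    return 1 if barrier_label == 1 else 0

long_win = triple_barrier_label([100, 100.1, 100.3, 100.5], side=+1, vol=0.002)
short_loss = triple_barrier_label([100, 100.1, 100.3], side=-1, vol=0.002)
timeout = triple_barrier_label([100, 100.05, 100.0, 100.1], side=+1, vol=0.002, max_bars=3)
print(long_win, short_loss, timeout)
```

The secondary classifier is then trained on the same feature rows with `meta_label(...)` as its target, and its predicted probability acts as the veto on the primary signal.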

Features that matter for 0DTE

Microstructure: NBBO spread, spread widening 10s, quote churn, top-of-book depth ratio, signed-volume imbalance, sweep velocity. LOB literature (Tandfonline 2025, arXiv 2506.05764) is unambiguous: features matter more than model depth.
Charm + vanna as continuous magnitudes, not 1-bit directions. Mechanically grounded in dealer hedging, survives regime drift more durably than empirical correlations (SqueezeMetrics GEX paper, Volland white paper, GEXBoard).
Cross-asset divergences: QQQ/IWM/XLF vs SPY at 1m, ES-SPY basis, VVIX, MOVE, VIX9D/VIX, VIX term roll.
Execution context: signal→submit→ack→fill ms latencies, slippage_vs_mid, market_data_age_ms. Without these we cannot distinguish “wrong signal” from “right signal, bad fill.”
Trader state: trade_number_today, loss_streak, pnl_day_to_date.
Macro context: FOMC/CPI/auction/OPEX dummies + minutes-to-event.
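
A few of the microstructure features listed above can be computed from raw quotes and trades roughly as follows. This is a sketch under assumptions: the `Quote` and `Trade` record shapes are hypothetical, quotes are assumed sorted by timestamp, and trade direction is inferred with a simple quote rule (trade price versus prevailing mid), which is one common convention among several.

```python
from dataclasses import dataclass

@dataclass
class Quote:
    ts: float      # seconds
    bid: float
    ask: float
    bid_size: int
    ask_size: int

@dataclass
class Trade:
    ts: float
    price: float
    size: int

def spread(q):
    return q.ask - q.bid

def depth_ratio(q):
    # Share of top-of-book size resting on the bid; 0.5 means balanced.
    total = q.bid_size + q.ask_size
    return q.bid_size / total if total else 0.5

def spread_widening(quotes, now, window=10.0):
    """Latest spread minus the spread prevailing `window` seconds earlier."""
    current = [q for q in quotes if q.ts <= now]
    past = [q for q in quotes if q.ts <= now - window]
    if not current or not past:
        return None
    return spread(current[-1]) - spread(past[-1])

def signed_volume_imbalance(trades, quotes, now, window=10.0):
    """(buy - sell) / (buy + sell) volume over the trailing window."""
    buy = sell = 0
    for t in trades:
        if not (now - window < t.ts <= now):
            continue
        prevailing = [q for q in quotes if q.ts <= t.ts]
        if not prevailing:
            continue
        mid = (prevailing[-1].bid + prevailing[-1].ask) / 2
        if t.price > mid:
            buy += t.size
        elif t.price < mid:
            sell += t.size  # trades exactly at mid are left unclassified
    total = buy + sell
    return (buy - sell) / total if total else 0.0

quotes = [Quote(0, 1.00, 1.10, 10, 30), Quote(12, 1.00, 1.20, 30, 10)]
trades = [Trade(5, 1.09, 2), Trade(13, 1.19, 5), Trade(14, 1.02, 1)]
widening = spread_widening(quotes, now=15)
imbalance = signed_volume_imbalance(trades, quotes, now=15)
print(widening, depth_ratio(quotes[-1]), imbalance)
```

The list scans are quadratic and only suitable for illustration; a live capture path would keep rolling windows instead.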

Evaluation

AUC is the wrong objective for trading. Brier + decile lift + conditional expectancy above veto threshold + Sharpe under model filter are what matter.
Combinatorial Purged Cross-Validation (López de Prado) with an embargo gives honest out-of-sample estimates. skfolio implements it.
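
The evaluation pieces above can be sketched in a few lines. This is an illustration, not skfolio's API: the function names are hypothetical, and the purge and embargo are simplified to a fixed number of samples on either side of each test block, whereas the full method purges by overlap of label time spans. Decile lift and Sharpe are left out for brevity.

```python
from itertools import combinations

def brier_score(probs, outcomes):
    # Mean squared error between predicted probability and the 0/1 outcome.
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def conditional_expectancy(probs, pnls, threshold):
    """Mean PnL of the trades the filter would take (p >= threshold), and their count."""
    taken = [pnl for p, pnl in zip(probs, pnls) if p >= threshold]
    return (sum(taken) / len(taken) if taken else 0.0), len(taken)

def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, purge=1, embargo=1):
    """Yield (train_idx, test_idx) for every combination of test groups.

    Samples are assumed time-ordered. `purge` drops training samples just
    before each test block; `embargo` drops those just after it.
    """
    bounds = [round(g * n_samples / n_groups) for g in range(n_groups + 1)]
    for test_groups in combinations(range(n_groups), n_test_groups):
        test = [i for g in test_groups for i in range(bounds[g], bounds[g + 1])]
        excluded = set(test)
        for g in test_groups:
            start, stop = bounds[g], bounds[g + 1]
            excluded.update(range(max(0, start - purge), start))
            excluded.update(range(stop, min(n_samples, stop + embargo)))
        train = [i for i in range(n_samples) if i not in excluded]
        yield train, test

splits = list(cpcv_splits(12, n_groups=6, n_test_groups=2, purge=1, embargo=1))
first_train, first_test = splits[0]
expectancy, n_taken = conditional_expectancy([0.7, 0.4, 0.9], [10, -5, -2], threshold=0.6)
print(len(splits), first_train, first_test, expectancy, n_taken)
```

With 6 groups and 2 test groups per split there are 15 train/test combinations, so each metric comes back as a distribution across paths rather than a single number.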

When this concept applies

Any conversation about “should we switch to ” or “how do we get to 80% via ML.” The answer is almost always “fix the label, capture microstructure, then meta-label” - not “more model.”

When it breaks

If we ever get to n≥5000 trades with full microstructure capture, the argument for tabular DL or small transformers becomes plausible.
If the strategy moves off 0DTE (e.g. weeklies, swings), some architectures become viable that aren’t here.

Timeline

2026-05-04 | derived - Research agent (Claude general-purpose) fired against this question after the user asked for an exhaustive comparison vs best-in-class. Estimated meta-labeling could plausibly move our 57% → ~69% on existing data; 80% requires 6-12 months and the full microstructure stack. Filed so the next “should we try TabNet?” thread starts from the literature, not from vibes.
