2026-05-04 Adversarial ML Data Review
Codex (gpt-5.4, read-only) tore apart Cortana MK2’s data-capture-for-ML approach against the 80% win-rate mandate. Verdict: ML is not the path to 80% on the current timeline; the data loop is “lying by omission”; highest-EV move is to demote ML to research mode, fix label freshness, rebuild on path-aware targets and microstructure features, and pause production-facing ML until n≥500 current labeled trades exist.
The load-bearing finding
scoring_events.outcome_resolved has been broken since 2026-04-17.
- 138 ENTERED events. All 138 have paper_trade_id attached. Only 36 have outcome_resolved=1. Span of resolved: 2026-04-07 → 2026-04-17.
- 102 ENTERED events are attached to closed trades (with outcomes in paper_trades.db) but flagged unresolved in scoring_events.
- 9 nightly retrains from 2026-04-22 → 2026-05-01. Every one trained on the same 36-row snapshot. AUC=0.50 every time. Feature gain=0.0 on every column.
- Effect: the model has not seen any data from the last 17 calendar days (roughly 11 trading days). “Data is the truth” is false right now; the training set is stale, partial, and regime-misaligned.
The backfill wired in at app.py:414 and app.py:2501 is the broken component.
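The stale-label count above can be re-checked with a query along these lines. This is a minimal sketch, not the production backfill: the column names `decision` and `outcomes.trade_id` are assumptions, as is the idea that `scoring_events` and the outcomes table are reachable from one SQLite connection (for example via `ATTACH DATABASE`). The toy rows exist only to make the sketch runnable.

```python
import sqlite3


def count_stale_labels(conn: sqlite3.Connection) -> dict:
    """Count ENTERED events by resolution state (hypothetical schema).

    "stale" means: flagged unresolved in scoring_events even though an
    outcome row already exists for the attached paper trade.
    """
    row = conn.execute(
        """
        SELECT
            COUNT(*) AS entered,
            SUM(CASE WHEN se.outcome_resolved = 1 THEN 1 ELSE 0 END) AS resolved,
            SUM(CASE WHEN se.outcome_resolved = 0 AND o.trade_id IS NOT NULL
                     THEN 1 ELSE 0 END) AS stale
        FROM scoring_events AS se
        LEFT JOIN outcomes AS o ON o.trade_id = se.paper_trade_id
        WHERE se.decision = 'ENTERED'
        """
    ).fetchone()
    # SUM returns NULL when no rows match, so fall back to 0.
    return {"entered": row[0], "resolved": row[1] or 0, "stale": row[2] or 0}


# Toy in-memory data, purely illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scoring_events (
        id INTEGER PRIMARY KEY,
        decision TEXT,
        paper_trade_id INTEGER,
        outcome_resolved INTEGER
    );
    CREATE TABLE outcomes (trade_id INTEGER PRIMARY KEY, pnl_pct REAL);
    INSERT INTO scoring_events (decision, paper_trade_id, outcome_resolved) VALUES
        ('ENTERED', 1, 1),
        ('ENTERED', 2, 0),
        ('ENTERED', 3, 0),
        ('SKIPPED', NULL, 0);
    INSERT INTO outcomes VALUES (1, 2.5), (2, -1.0);
    """
)
print(count_stale_labels(conn))
```

Run against the real databases, a healthy backfill would show `stale` close to 0; the review found 102.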
Operational facts (from DB queries 2026-05-04)
| Metric | Value |
|---|---|
| Total scoring_events | 176,259 |
| Trades closed | 120 (out of 126; 6 cancelled) |
| Outcomes recorded | 106 |
| Wins / Losses / BE | 60 / 44 / 2 → 57% win rate |
| Net P&L over outcomes | −$28,912 (since 2026-04-07) |
| ENTERED resolved for ML | 36 |
| Trainable rows (after path filter) | 26 train / 10 holdout |
| Class balance in trainable slice | 15W / 21L (41.7% wins) |
| Path snapshot coverage | 85 of 126 trades; 0 before 2026-04-22 |
| Forward-return labels (5m/15m/30m) | 10 / 8 / 6 of 106 outcomes (post-bug-fix) |
Top findings
1. Path-aware label, not outcome_pnl >= 0
xgboost_model.py:64 collapses all wins together regardless of partial
exit ladders. Replace with 4-class (big_win, managed_win, scratch,
loss) plus a separate classifier for MFE-before-MAE-breach. Use
partial_exits, outcomes.mfe_pct, outcomes.mae_pct. Empirical:
trades with partial exits averaged +3.61% pnl_pct vs −1.91% for
non-partial trades.
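A minimal sketch of what the two targets could look like. The thresholds (`BIG_WIN_PCT`, `SCRATCH_BAND_PCT`), the `TradeOutcome` container, and the rule that any win with a partial exit counts as managed_win are all assumptions for illustration; the review names the four classes but does not define their boundaries.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical thresholds; the review does not specify them.
BIG_WIN_PCT = 5.0
SCRATCH_BAND_PCT = 0.5


@dataclass
class TradeOutcome:
    pnl_pct: float
    partial_exit_count: int


def path_aware_label(t: TradeOutcome) -> str:
    """Map a closed trade to one of the four proposed classes."""
    if abs(t.pnl_pct) <= SCRATCH_BAND_PCT:
        return "scratch"
    if t.pnl_pct < 0:
        return "loss"
    # Assumption: a win taken through the partial-exit ladder is "managed",
    # and only an unladdered win above the threshold counts as "big".
    if t.partial_exit_count == 0 and t.pnl_pct >= BIG_WIN_PCT:
        return "big_win"
    return "managed_win"


def mfe_before_mae_breach(
    path_pnl_pct: Sequence[float], target_pct: float, stop_pct: float
) -> bool:
    """True if the path reaches +target_pct before touching -stop_pct.

    path_pnl_pct is the trade's unrealized pnl_pct in time order, e.g.
    as reconstructed from trade_path_snapshots.
    """
    for pnl in path_pnl_pct:
        if pnl <= -stop_pct:
            return False
        if pnl >= target_pct:
            return True
    return False  # neither level was touched


print(path_aware_label(TradeOutcome(pnl_pct=3.0, partial_exit_count=2)))
print(mfe_before_mae_breach([0.5, 1.2, 3.1, -4.0], target_pct=3.0, stop_pct=2.0))
```

The second function depends on path data, so it can only be computed for trades that have snapshots.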
2. The training projection drags ~80 columns; featurizer uses 38
decision_logger.py:830 (TRAINING_DATA_BASE_QUERY) is bloated. Trim
to what features.py:25 actually consumes. Quarantine SHAP / model-
output / regime columns from the training projection.
3. Microstructure features are missing (the “early and right” gap)
Five new tables to add at decision and fill time:
- entry_microstructure: nbbo_spread_pct, spread_widening_10s, quote_churn_rate, bid_ask_imbalance, top_of_book_size_ratio, mid_move_1s, mid_move_5s
- signal_execution_context: signal_ts, submit_ts, ack_ts, fill_ts, signal_to_fill_ms, fill_slippage_vs_mid, market_data_age_ms
- cross_asset_state: qqq/iwm/xlf vs spy divergence_1m, es_spy_basis, vx_term_roll, vix9d_vix_ratio
- flow_state per event: puts_per_min, calls_per_min, net_premium_zscore_5m, sweep_velocity, lit_vs_dark_ratio, same_strike_repeat_rate
- trader_state: trade_number_today, open_positions_count, last_trade_bias, pnl_day_to_date, loss_streak
4. Sample volume vs feature dimensionality
Defensible minimum for 38 predictors: 380-760 minority-class examples → 910-1,825 total labeled trades. We have 26 train. Below n=200, do not fit a 38-feature booster for production decisions. Below n=500, assume unstable ranking unless feature set is halved.
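The arithmetic behind that range can be reproduced as follows. Reading 380-760 as 10 to 20 minority-class examples per predictor is an inference from the numbers, not something the review states; the minority share uses the 15W / 21L balance from the table. Exact integer math gives 912-1,824, which matches the rounded 910-1,825 above.

```python
def required_trades(
    n_features: int, events_per_feature: int, n_minority: int, n_total: int
) -> tuple[int, int]:
    """Return (minority examples needed, total labeled trades needed).

    Integer ceiling division avoids floating-point rounding surprises.
    """
    minority_needed = n_features * events_per_feature
    total_needed = (minority_needed * n_total + n_minority - 1) // n_minority
    return minority_needed, total_needed


# 38 predictors, 15 wins out of 36 resolved rows (41.7% minority share).
low = required_trades(38, 10, n_minority=15, n_total=36)
high = required_trades(38, 20, n_minority=15, n_total=36)
print(low, high)
```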
5. The promotion gate has the wrong objective
config.json:159 retrain_validation_log_only=true blocks nothing.
Even when active, the gate accepts cand_auc >= cur_auc - tolerance, so
a worse model can pass by design. Right gate for this use case: better downside filtering on the worst decile; improved conditional expectancy above the live veto threshold; non-degraded calibration in the top probability bucket; a holdout of the 50-100 chronologically latest trades. AUC and Brier are secondary.
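To make the contrast concrete, here is a sketch of the tolerance-based AUC check next to one possible expectancy-based check. It assumes the model outputs a win probability and that trades scoring below the veto threshold are skipped; `expectancy_gate`, `min_kept`, and the toy numbers are hypothetical and do not come from the codebase. Only one of the four criteria is covered.

```python
from typing import Optional, Sequence


def auc_tolerance_gate(cand_auc: float, cur_auc: float, tolerance: float) -> bool:
    """The gate as described: a slightly worse candidate still passes."""
    return cand_auc >= cur_auc - tolerance


def conditional_expectancy(
    probs: Sequence[float], pnl_pct: Sequence[float], veto_threshold: float
) -> Optional[float]:
    """Mean pnl_pct of the trades the model would NOT veto."""
    kept = [pnl for p, pnl in zip(probs, pnl_pct) if p >= veto_threshold]
    return sum(kept) / len(kept) if kept else None


def expectancy_gate(
    cand_probs: Sequence[float],
    cur_probs: Sequence[float],
    pnl_pct: Sequence[float],
    veto_threshold: float,
    min_kept: int = 3,
) -> bool:
    """Pass only if the candidate improves expectancy on the trades it keeps."""
    n_kept = sum(1 for p in cand_probs if p >= veto_threshold)
    if n_kept < min_kept:
        return False  # candidate vetoes nearly everything; nothing to judge
    cand = conditional_expectancy(cand_probs, pnl_pct, veto_threshold)
    cur = conditional_expectancy(cur_probs, pnl_pct, veto_threshold)
    if cur is None:
        return cand > 0.0
    return cand > cur


# Toy holdout, chronologically ordered; values are illustrative only.
pnl = [4.0, -3.0, 2.0, -5.0, 1.0, -1.0]
cur_probs = [0.7, 0.6, 0.4, 0.3, 0.8, 0.2]
cand_probs = [0.8, 0.3, 0.7, 0.2, 0.6, 0.4]

print(auc_tolerance_gate(cand_auc=0.58, cur_auc=0.60, tolerance=0.03))
print(expectancy_gate(cand_probs, cur_probs, pnl, veto_threshold=0.5))
```

A full gate would add the worst-decile and top-bucket calibration checks and run on the chronologically latest holdout.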
6. Path snapshot survivorship bias
trade_path_snapshots covers 85 of 126 trades. Missing subset is not
random: no-path trades averaged 9.41 min duration vs 6.58 with paths;
12 of 22 no-path outcomes were ≤5 min. Path-aware labels would
inherit this bias from day one until coverage is universal.
7. Time to 80% win rate via ML
At 5-15 trades/day, reaching 500 current labeled trades takes 33-100 trading days; the safer 900+ range takes 60-180 days. Hitting 80% by filtering alone means cutting losses 44 → 15 with zero win attrition, a 66% loss reduction: 60 / (60 + 15) = 80%. Codex estimate: never with current ML/data path; plausibly after 2-6 months IF label freshness, path-aware targets, and ML-as-veto-only on top 20-30% danger setups all land.
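The figures in this finding follow from simple arithmetic, sketched below using only numbers from the table (60 wins, 44 losses) and the stated trade rates. The helper names are illustrative. `Fraction` keeps the 80% target exact so the floor does not get tripped by floating-point error.

```python
import math
from fractions import Fraction


def days_to_target(target_trades: int, trades_per_day: int) -> float:
    """Trading days needed to collect target_trades at a steady rate."""
    return target_trades / trades_per_day


def max_losses_for_win_rate(wins: int, target_rate: Fraction) -> int:
    """Largest loss count with wins / (wins + losses) >= target_rate."""
    return math.floor(wins * (1 - target_rate) / target_rate)


wins, losses = 60, 44
allowed = max_losses_for_win_rate(wins, Fraction(4, 5))
loss_cut_pct = 100 * (losses - allowed) / losses

print(days_to_target(500, 15), days_to_target(500, 5))  # fast vs slow pace to 500
print(days_to_target(900, 15), days_to_target(900, 5))  # fast vs slow pace to 900
print(allowed, round(loss_cut_pct))
```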
Top 3 next moves (Codex’s prioritization, validated)
- Fix label freshness. backfill_outcomes() is the chokepoint. Until the 102 unresolved ENTERED rows drop to ~0, retraining is garbage-in-garbage-out. (New task.)
- Replace binary target with path-aware target. Touch xgboost_model.py:64 and the backfill that writes labels.
- Pause production-facing ML. Cut feature set to ~10-12. Compute which conditions actually reduce loss tails (classical stats). Don’t promote another XGBoost checkpoint until n≥500 current labeled trades and a gate based on conditional expectancy.
Alternative learning approaches assessed
- Synthetic data / simulation: fine for pretraining and stress tests, useless for 0DTE microstructure edge (sims get fill quality and dealer reactions wrong - exactly what matters).
- Transfer learning from public 0DTE datasets: weak fit unless the feed has the same entry timing, contract selection, and execution assumptions; otherwise you import someone else’s bias.
- RL replay buffers: premature theater at n=36.
- Bayesian online updating on hand-built filters: better fit than nightly XGBoost for this scale.
- Classical stats + bigger sample (decline ML): highest-EV right now. Shrink to a dozen features, post-trade stratification, no ML for entry selection until labels are abundant and current.
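As an illustration of the Bayesian online updating option, a Beta-Binomial tracker per hand-built filter might look like the sketch below. The `BetaFilter` class and the uniform Beta(1, 1) prior are assumptions for illustration; the review does not prescribe a specific model. The demo feeds in the 60 wins and 44 losses from the table and leaves out the 2 break-evens.

```python
from dataclasses import dataclass


@dataclass
class BetaFilter:
    """Running Beta posterior over one filter's win probability."""

    alpha: float = 1.0  # prior pseudo-wins (uniform prior assumed)
    beta: float = 1.0   # prior pseudo-losses

    def update(self, won: bool) -> None:
        """Fold in one resolved trade as soon as its outcome lands."""
        if won:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean win probability."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        """Posterior variance; shrinks as evidence accumulates."""
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total * total * (total + 1.0))


# Replay the recorded outcomes: 60 wins and 44 losses.
f = BetaFilter()
for outcome in [True] * 60 + [False] * 44:
    f.update(outcome)

print(f"posterior mean={f.mean:.4f}, sd={f.variance ** 0.5:.4f}")
```

Unlike a nightly retrain, each update uses only the newest resolved trade, so the estimate stays current as long as labels keep flowing.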
Timeline
2026-05-04 | observed - Codex adversarial review fired against data-capture pipeline after Cody asked “has the ML engine learned anything?” Discovery: outcome_resolved is broken; 102 of 138 ENTERED events are unresolved despite trade outcomes existing. Retrains have been running on a frozen 36-row snapshot from Apr 7-17. Filed this review so the path forward (label freshness → path-aware target → microstructure features → veto-only ML) doesn’t get lost in the nightly noise.