2026-05-04 Adversarial ML Data Review

Codex (gpt-5.4, read-only) tore apart Cortana MK2’s data-capture-for-ML approach against the 80% win-rate mandate. Verdict: ML is not the path to 80% on the current timeline; the data loop is “lying by omission”; highest-EV move is to demote ML to research mode, fix label freshness, rebuild on path-aware targets and microstructure features, and pause production-facing ML until n≥500 current labeled trades exist.

The load-bearing finding

scoring_events.outcome_resolved has been broken since 2026-04-17.

138 ENTERED events. 138 have paper_trade_id attached. Only 36 have outcome_resolved=1. Span of resolved: 2026-04-07 → 2026-04-17.
102 ENTERED events are attached to closed trades (with outcomes in paper_trades.db) but flagged unresolved in scoring_events.
9 nightly retrains from 2026-04-22 → 2026-05-01. Every one trained on the same 36-row snapshot. AUC=0.50 every time. Feature gain=0.0 on every column.
Effect: the model has not seen any data from the last 17 calendar days (roughly 11 trading sessions between 2026-04-17 and 2026-05-04). “Data is the truth” is false right now; the training set is stale, partial, and regime-misaligned.

The backfill wired in at app.py:414 and app.py:2501 is the broken component.
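A minimal sketch of the reconciliation check behind those counts. The `scoring_events.outcome_resolved` and `paper_trade_id` names come from this review; the `decision` and `status` columns, the `id` keys, and having both tables reachable from one connection are assumptions for illustration (if the trades live in a separate paper_trades.db file, SQLite's `ATTACH DATABASE` would be one way to join across them).

```python
import sqlite3

def find_stale_labels(conn: sqlite3.Connection) -> list[tuple[int, int]]:
    """Return (event_id, paper_trade_id) for ENTERED events whose trade is
    closed but whose outcome_resolved flag was never set."""
    query = """
        SELECT e.id, e.paper_trade_id
        FROM scoring_events AS e
        JOIN paper_trades AS t ON t.id = e.paper_trade_id
        WHERE e.decision = 'ENTERED'          -- assumed column name
          AND t.status = 'closed'             -- assumed column name
          AND COALESCE(e.outcome_resolved, 0) = 0
        ORDER BY e.id
    """
    return conn.execute(query).fetchall()

def demo() -> list[tuple[int, int]]:
    """Build a tiny in-memory fixture with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE paper_trades (id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE scoring_events (
            id INTEGER PRIMARY KEY, decision TEXT,
            paper_trade_id INTEGER, outcome_resolved INTEGER);
        INSERT INTO paper_trades VALUES (1, 'closed'), (2, 'closed'), (3, 'open');
        INSERT INTO scoring_events VALUES
            (10, 'ENTERED', 1, 1),    -- resolved correctly
            (11, 'ENTERED', 2, 0),    -- closed trade, stale label
            (12, 'ENTERED', 3, 0),    -- trade still open, not stale
            (13, 'SKIPPED', NULL, 0); -- never entered
    """)
    return find_stale_labels(conn)
```

Run nightly before the retrain, a non-empty result from a check like this would be a reason to skip training rather than refit on a frozen snapshot.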

Operational facts (from DB queries 2026-05-04)

Metric	Value
Total scoring_events	176,259
Trades closed	120 (out of 126; 6 cancelled)
Outcomes recorded	106
Wins / Losses / BE	60 / 44 / 2 → 57% win rate
Net P&L over outcomes	−$28,912 (since 2026-04-07)
ENTERED resolved for ML	36
Trainable rows (after path filter)	26 train / 10 holdout
Class balance in trainable slice	15W / 21L (41.7% wins)
Path snapshot coverage	85 of 126 trades; 0 before 2026-04-22
Forward-return labels (5m/15m/30m)	10 / 8 / 6 of 106 outcomes (post-bug-fix)

Top findings

1. Path-aware label, not `outcome_pnl >= 0`

xgboost_model.py:64 collapses all wins together regardless of partial exit ladders. Replace with 4-class (big_win, managed_win, scratch, loss) plus a separate classifier for MFE-before-MAE-breach. Use partial_exits, outcomes.mfe_pct, outcomes.mae_pct. Empirical: trades with partial exits averaged +3.61% pnl_pct vs −1.91% for non-partial trades.
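A rough sketch of what the two targets could look like. The four class names come from this review; the thresholds, the rule separating big_win from managed_win, and the `TradeOutcome` container are hypothetical placeholders that would need to be set from the actual P&L distribution.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class TradeOutcome:
    pnl_pct: float      # realized P&L in percent
    partial_exits: int  # number of partial exits taken on the trade

# Hypothetical thresholds, not values from the review.
BIG_WIN_PCT = 5.0
SCRATCH_BAND_PCT = 0.5

def path_aware_label(t: TradeOutcome) -> str:
    """Map a closed trade to one of the four proposed classes."""
    if abs(t.pnl_pct) <= SCRATCH_BAND_PCT:
        return "scratch"
    if t.pnl_pct < 0:
        return "loss"
    # Assumption: a win that needed partial exits, or stayed below the
    # big-win threshold, counts as a managed win.
    if t.partial_exits == 0 and t.pnl_pct >= BIG_WIN_PCT:
        return "big_win"
    return "managed_win"

def mfe_before_mae_breach(path_pct: Sequence[float],
                          target_pct: float, stop_pct: float) -> bool:
    """Binary target for the separate classifier: did the path reach the
    favorable target before breaching the adverse stop?
    path_pct is the chronological unrealized P&L (percent) of the trade."""
    for r in path_pct:
        if r <= -stop_pct:
            return False
        if r >= target_pct:
            return True
    return False  # neither level was touched
```

The second function depends on trade_path_snapshots being present for the trade, so it inherits whatever coverage gaps that table has.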

2. The training projection drags ~80 columns; featurizer uses 38

decision_logger.py:830 (TRAINING_DATA_BASE_QUERY) is bloated. Trim it to what features.py:25 actually consumes. Quarantine SHAP, model-output, and regime columns from the training projection.
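One way to audit the projection, sketched under the assumption that the quarantined columns can be recognized by name prefix. The prefixes and column names below are made up for illustration; the real lists would come from TRAINING_DATA_BASE_QUERY and features.py.

```python
def split_projection(
    projected: list[str],
    consumed: set[str],
    quarantine_prefixes: tuple[str, ...] = ("shap_", "model_", "regime_"),
) -> tuple[list[str], list[str], list[str]]:
    """Partition projected columns into (keep, quarantined, dropped).

    Quarantine wins over consumption: a SHAP, model-output, or regime column
    is pulled out even if the featurizer currently reads it.
    """
    quarantined = [c for c in projected if c.startswith(quarantine_prefixes)]
    keep = [c for c in projected
            if c in consumed and not c.startswith(quarantine_prefixes)]
    dropped = [c for c in projected
               if c not in consumed and not c.startswith(quarantine_prefixes)]
    return keep, quarantined, dropped
```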

3. Microstructure features are missing (the “early and right” gap)

Five new tables to add at decision and fill time:

entry_microstructure: nbbo_spread_pct, spread_widening_10s, quote_churn_rate, bid_ask_imbalance, top_of_book_size_ratio, mid_move_1s, mid_move_5s
signal_execution_context: signal_ts, submit_ts, ack_ts, fill_ts, signal_to_fill_ms, fill_slippage_vs_mid, market_data_age_ms
cross_asset_state: qqq/iwm/xlf vs spy divergence_1m, es_spy_basis, vx_term_roll, vix9d_vix_ratio
flow_state per event: puts_per_min, calls_per_min, net_premium_zscore_5m, sweep_velocity, lit_vs_dark_ratio, same_strike_repeat_rate
trader_state: trade_number_today, open_positions_count, last_trade_bias, pnl_day_to_date, loss_streak
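To make a few of these fields concrete, here is a sketch of how three of them might be computed from a single top-of-book quote and two timestamps. The field names come from the lists above; the formulas are common conventions assumed here, not definitions from the review, and the `Quote` type is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Quote:
    bid: float
    ask: float
    bid_size: int
    ask_size: int

def nbbo_spread_pct(q: Quote) -> float:
    """Quoted spread as a percentage of the mid price."""
    mid = (q.bid + q.ask) / 2
    return (q.ask - q.bid) / mid * 100

def bid_ask_imbalance(q: Quote) -> float:
    """Size imbalance in [-1, 1]; positive means more size on the bid."""
    total = q.bid_size + q.ask_size
    return (q.bid_size - q.ask_size) / total if total else 0.0

def signal_to_fill_ms(signal_ts: datetime, fill_ts: datetime) -> int:
    """Whole milliseconds elapsed between signal and fill."""
    return (fill_ts - signal_ts) // timedelta(milliseconds=1)
```

Capturing these at both decision time and fill time, as the list suggests, is what would let slippage and staleness be separated from signal quality later.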

4. Sample volume vs feature dimensionality

Defensible minimum for 38 predictors: 380-760 minority-class examples → 910-1,825 total labeled trades. We have 26 train. Below n=200, do not fit a 38-feature booster for production decisions. Below n=500, assume unstable ranking unless feature set is halved.
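The arithmetic behind those ranges can be reproduced with a short helper. Reading 380-760 as 10-20 minority-class events per predictor is an inference from the numbers (380 / 38 = 10, 760 / 38 = 20), and the minority share is taken from the 15W / 21L slice above; the review's 910-1,825 appears to be the same result, rounded.

```python
def required_samples(n_features: int, events_per_feature: int,
                     minority_count: int, slice_count: int) -> tuple[int, int]:
    """Return (minority examples needed, total labeled trades needed).

    The total is minority_needed divided by the minority share
    (minority_count / slice_count), rounded up with integer arithmetic.
    """
    minority_needed = n_features * events_per_feature
    total_needed = -((-minority_needed * slice_count) // minority_count)
    return minority_needed, total_needed

low = required_samples(38, 10, minority_count=15, slice_count=36)
high = required_samples(38, 20, minority_count=15, slice_count=36)
```

Halving the feature set halves both bounds, which is why the n=500 threshold is paired with a reduced feature count.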

5. The promotion gate has the wrong objective

config.json:159 sets retrain_validation_log_only=true, so the gate blocks nothing. Even when active, the gate accepts cand_auc >= cur_auc - tolerance, so a worse model can pass by design. Right gate for this use case: better downside filtering on the worst decile; improved conditional expectancy above the live veto threshold; non-degraded calibration in the top probability bucket; holdout of ≥50-100 chronologically latest trades. AUC and Brier are secondary.
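A sketch of a gate built on two of those criteria: conditional expectancy above the veto threshold and downside concentration in the worst-scored decile. Function names, the strict/non-strict comparisons, and the default holdout size are illustrative assumptions; the calibration check on the top probability bucket is left out for brevity.

```python
from statistics import mean
from typing import Sequence

def conditional_expectancy(probs: Sequence[float], pnl: Sequence[float],
                           threshold: float) -> float:
    """Mean P&L of the trades the model would let through (score >= threshold)."""
    kept = [r for p, r in zip(probs, pnl) if p >= threshold]
    return mean(kept) if kept else 0.0

def worst_decile_pnl(probs: Sequence[float], pnl: Sequence[float]) -> float:
    """Mean P&L of the lowest-scored 10% of trades. A useful veto model
    concentrates losers here, so a lower value is better."""
    ranked = sorted(zip(probs, pnl))
    k = max(1, len(ranked) // 10)
    return mean(r for _, r in ranked[:k])

def passes_gate(cand_probs: Sequence[float], cur_probs: Sequence[float],
                pnl: Sequence[float], threshold: float,
                min_holdout: int = 50) -> bool:
    """Promote only if the candidate beats the current model on expectancy
    and is no worse at pushing losers into the bottom decile.
    pnl is assumed to be the chronologically latest holdout trades."""
    if len(pnl) < min_holdout:
        return False
    better_expectancy = (conditional_expectancy(cand_probs, pnl, threshold)
                         > conditional_expectancy(cur_probs, pnl, threshold))
    better_downside = (worst_decile_pnl(cand_probs, pnl)
                       <= worst_decile_pnl(cur_probs, pnl))
    return better_expectancy and better_downside
```

Unlike the AUC-with-tolerance rule, a gate of this shape cannot pass a candidate that is worse on the quantity the veto is supposed to improve.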

6. Path snapshot survivorship bias

trade_path_snapshots covers 85 of 126 trades. Missing subset is not random: no-path trades averaged 9.41 min duration vs 6.58 with paths; 12 of 22 no-path outcomes were ≤5 min. Path-aware labels would inherit this bias from day one until coverage is universal.

7. Time to 80% win rate via ML

At 5-15 trades/day, reaching 500 current labeled trades takes 33-100 trading days; the safer 900+ range takes 60-180 trading days. Hitting 80% by filtering alone means cutting losses 44 → 15 (a 66% loss reduction) with zero win attrition. Codex estimate: never with current ML/data path; plausibly after 2-6 months IF label freshness, path-aware targets, and ML-as-veto-only on top 20-30% danger setups all land.
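The figures in this item follow from simple arithmetic, reproduced here as a check. Breakeven trades are ignored, matching the 60-win / 44-loss framing above; the helper names are illustrative.

```python
def trading_days_to_target(target_n: int, per_day_low: int,
                           per_day_high: int) -> tuple[float, float]:
    """Return (fastest, slowest) trading days needed to collect target_n labels."""
    return target_n / per_day_high, target_n / per_day_low

def max_losses_for_win_rate(wins: int, target_pct: int) -> int:
    """Largest loss count L such that wins / (wins + L) >= target_pct / 100."""
    return wins * (100 - target_pct) // target_pct

fast_500, slow_500 = trading_days_to_target(500, 5, 15)  # about 33.3 and 100.0
fast_900, slow_900 = trading_days_to_target(900, 5, 15)  # 60.0 and 180.0
allowed = max_losses_for_win_rate(60, 80)                # 15
loss_cut_pct = (44 - allowed) / 44 * 100                 # about 65.9
```

Any win attrition from the filter tightens the bound further, since every lost win lowers the number of losses that can be tolerated at 80%.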

Top 3 next moves (Codex’s prioritization, validated)

Fix label freshness. backfill_outcomes() is the chokepoint. Until the 102 unresolved ENTERED rows drop to ~0, retraining is garbage-in-garbage-out. (New task.)
Replace binary target with path-aware target. Touch xgboost_model.py:64 and the backfill that writes labels.
Pause production-facing ML. Cut feature set to ~10-12. Compute which conditions actually reduce loss tails (classical stats). Don’t promote another XGBoost checkpoint until n≥500 current labeled trades and a gate based on conditional expectancy.

Alternative learning approaches assessed

Synthetic data / simulation: fine for pretraining and stress tests, useless for 0DTE microstructure edge (sims get fill quality and dealer reactions wrong - exactly what matters).
Transfer learning from public 0DTE datasets: weak fit unless the feed has the same entry timing, contract selection, and execution assumptions; otherwise you import someone else’s bias.
RL replay buffers: premature theater at n=36.
Bayesian online updating on hand-built filters: better fit than nightly XGBoost for this scale.
Classical stats + bigger sample (decline ML): highest-EV right now. Shrink to a dozen features, post-trade stratification, no ML for entry selection until labels are abundant and current.
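For the Bayesian online updating option, a Beta-Bernoulli tracker per hand-built filter is about the smallest workable form. This is a generic sketch, not something described in the review: the uniform Beta(1, 1) prior and the class name are assumptions.

```python
from dataclasses import dataclass

@dataclass
class BetaFilter:
    """Running posterior over the win rate of trades passing one filter."""
    alpha: float = 1.0  # prior pseudo-wins (uniform prior, an assumption)
    beta: float = 1.0   # prior pseudo-losses

    def update(self, won: bool) -> None:
        """Fold in one resolved trade as soon as its outcome is known."""
        if won:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        """Posterior mean win rate."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def n_observed(self) -> int:
        """Trades seen so far, excluding the two prior pseudo-counts."""
        return int(self.alpha + self.beta - 2)
```

Each resolved trade updates the estimate immediately, so there is no nightly retrain to go stale, and at small n the prior keeps the estimate from swinging on a handful of outcomes.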

Timeline

2026-05-04 | observed - Codex adversarial review fired against data-capture pipeline after Cody asked “has the ML engine learned anything?” Discovery: outcome_resolved is broken; 102 of 138 ENTERED events are unresolved despite trade outcomes existing. Retrains have been running on a frozen 36-row snapshot from Apr 7-17. Filed this review so the path forward (label freshness → path-aware target → microstructure features → veto-only ML) doesn’t get lost in the nightly noise.
