META gate retrain catastrophe
What happened
2026-05-20 close: nightly_ml_pipeline.py at 16:00 CT retrained the secondary model on the day’s outcomes. The training corpus added 2026-05-20’s 3 CALL losses (#442 −4,525, #443 −4,800, #444 partial −1,045 before bypass save) + 1 winner (#441 +2,025). 4 new rows into a 406-row corpus.
2026-05-21 morning: the new model output take_prob = 0.06-0.19 on
EVERY BULLISH 80+ signal. META gate threshold 0.55. 47 signals
blocked in the first 30 minutes, including the clean SPY 740
directional rip at 09:03. Zero entries fired.
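For illustration, the gate behaviour described above can be sketched as a single threshold comparison. This is a minimal sketch, not the engine's code: the function name `meta_gate_decision` is hypothetical, and the use of `>=` rather than `>` at the threshold is an assumption. Only the 0.55 threshold, the 0.06-0.19 probability band, and the SKIPPED_META_GATE label come from these notes.

```python
META_GATE_THRESHOLD = 0.55  # threshold quoted in these notes


def meta_gate_decision(take_prob: float,
                       threshold: float = META_GATE_THRESHOLD) -> str:
    """Hypothetical helper: take the signal only if take_prob clears the gate.

    Assumption: the comparison is inclusive (>=); the real engine may differ.
    """
    return "TAKE" if take_prob >= threshold else "SKIPPED_META_GATE"


# Every probability in the observed 0.06-0.19 band sits far below 0.55,
# so each signal in that band is skipped regardless of its signal score.
decisions = [meta_gate_decision(p) for p in (0.06, 0.15, 0.19)]
```

With outputs confined to that band, no signal can pass, which matches the zero entries seen that morning.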
Why the model collapsed
Pre-retrain (2026-05-20 model): META let 4 signals through that day,
with no SKIPPED_META_GATE rows in decisions.db for the day. It approved
all 3 losers.
Post-retrain (2026-05-21 morning model): 47 skips with consistent
0.15-0.19 prob. Blanket pessimism. Engine log: 0DTE meta-model: take_prob=0.06 conf=TRAINED (n=406).
3 bad CALL trades in a 406-row corpus shouldn’t flip a healthy model. Hypothesis: the model’s calibration is brittle around small class-imbalance shifts. Adding 3 losers vs 1 winner skewed the predicted-positive class enough that the calibrated output collapsed near zero for all CALL features.
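A quick worked calculation shows why 4 rows should barely move the overall base rate. This is a sketch under loud assumptions: the prior win count is not stated in these notes, so the 50% prior win rate below is purely hypothetical, and the 402-row prior corpus assumes the 406 figure already includes the 4 new rows. The helper name `win_rate_shift` is also hypothetical.

```python
def win_rate_shift(prior_rows: int, prior_wins: int,
                   new_wins: int, new_losses: int) -> tuple[float, float]:
    """Return the corpus win rate before and after appending a daily batch."""
    before = prior_wins / prior_rows
    after = (prior_wins + new_wins) / (prior_rows + new_wins + new_losses)
    return before, after


# Hypothetical prior: 402 rows at a 50% win rate (201 wins).
# Batch from 2026-05-20: 1 winner and 3 losers.
before, after = win_rate_shift(prior_rows=402, prior_wins=201,
                               new_wins=1, new_losses=3)
shift = before - after  # about a quarter of a percentage point
```

Under these assumptions the base rate moves by roughly 0.25 percentage points, far too little to explain outputs dropping to 0.06-0.19 on its own. That supports the hypothesis that the collapse came from the calibration step rather than from the class balance itself.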
Two distinct bugs from this episode
- The model overcorrected. Calibration is too sensitive to small daily batches. Needs regime conditioning + per-side calibration so 3 BULL CALL losses don’t poison BULL PUT prediction too.
- Auto-promotion was unsafe. The new model went live without ANY held-out validation. If we’d held out 30% of training data and only promoted when held-out WR ≥ 0.55 and held-out net > 0, the 2026-05-20 retrain would have FAILED validation and the live model would have stayed on the 2026-05-19 weights.
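The promotion check described in the second bullet could look roughly like the following. This is a sketch, not the shipped implementation: `passes_holdout` is a hypothetical name, and it assumes that held-out WR and net are measured only over the held-out trades the candidate would actually take (those with `take_prob` at or above the 0.55 gate). The WR ≥ 0.55 and net > 0 criteria are taken from the bullet above.

```python
def passes_holdout(take_probs: list[float], pnls: list[float],
                   gate: float = 0.55, min_wr: float = 0.55) -> bool:
    """Hypothetical promotion check on held-out rows.

    take_probs: candidate model's probability for each held-out trade.
    pnls: realized P&L in dollars for the same trades, in the same order.
    """
    taken = [pnl for prob, pnl in zip(take_probs, pnls) if prob >= gate]
    if not taken:
        # Assumption: a candidate that approves nothing cannot pass.
        return False
    win_rate = sum(pnl > 0 for pnl in taken) / len(taken)
    net = sum(taken)
    return win_rate >= min_wr and net > 0


# Made-up held-out outcomes for illustration only.
pnls = [100.0, -50.0, 200.0, -300.0]
healthy = passes_holdout([0.70, 0.60, 0.80, 0.20], pnls)    # takes 3, wins 2
pessimist = passes_holdout([0.06, 0.15, 0.19, 0.10], pnls)  # takes nothing
```

Under this reading, a blanket-pessimist candidate fails because it approves no held-out trades, so it never reaches the win-rate or net checks.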
What we shipped
- META_GATE_ENABLED=false (env, immediate mitigation 2026-05-21 09:50)
- META_PROB_SIZING_ENABLED=false (env, sizing also depended on meta_prob and was zeroing out position sizes, so it required a separate flag)
- 8a253fe fix(meta): stage retrains behind explicit promotion. This is the durable fix: SECONDARY_RETRAIN_TRAIN_ENABLED=true (default, keep training candidates), SECONDARY_RETRAIN_PROMOTE_ENABLED=false (default, NEVER auto-promote). Held-out validation (WR ≥ 0.55, net ≥ $0) gates promotion. Per-side models (secondary_model_call, secondary_model_put). Manual review/promote/rollback CLI scripts.
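As an illustration of how the two retrain flags might interact, here is a minimal sketch. It is not the code from commit 8a253fe: the helper names `env_flag` and `retrain_plan`, the accepted truthy strings, and the returned labels are all assumptions. Only the flag names and their defaults come from the list above.

```python
import os
from typing import Mapping, Optional


def env_flag(name: str, default: bool,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag; the accepted truthy strings are an assumption."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def retrain_plan(validation_passed: bool,
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical nightly decision: train candidates, promote only on opt-in."""
    train = env_flag("SECONDARY_RETRAIN_TRAIN_ENABLED", True, env)
    promote = env_flag("SECONDARY_RETRAIN_PROMOTE_ENABLED", False, env)
    if not train:
        return "skip"
    if promote and validation_passed:
        return "train_and_promote"
    return "train_candidate_only"


# With defaults, even a candidate that passes validation is only staged.
default_plan = retrain_plan(validation_passed=True, env={})
```

Keeping training and promotion behind separate flags means a candidate is still produced and can be reviewed every night, while the live weights change only after an explicit opt-in and a passing validation.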
The durable principle
A trading-engine ML model must not silently change behavior overnight. Train candidates automatically; promote them manually. Held-out validation must pass before promotion is even an option. A single bad-day batch can never flip a passing model.
Costs and constraint
The user missed the morning session because the model said “no” to every signal during a clean directional move. Estimated opportunity cost on the 09:03 BULLISH 85 signal alone: ~$… ($735→$740 on 100 contracts of ATM call).
References
- Plan: plans/2026-05-21-codex-handoff-P0-meta-retrain-opt-in-only.md
- Ship: commit 8a253fe
- Live disable: ~/.config/cortanaroi/app.env (2026-05-21)