META gate retrain catastrophe

What happened

2026-05-20 close: nightly_ml_pipeline.py at 16:00 CT retrained the secondary model on the day’s outcomes. The training corpus added 2026-05-20’s 3 CALL losses (#442 −4,525, #443 −4,800, #444 partial −1,045 before bypass save) + 1 winner (#441 +2,025). 4 new rows into a 406-row corpus.

2026-05-21 morning: the new model output take_prob = 0.06-0.19 on EVERY BULLISH 80+ signal. META gate threshold 0.55. 47 signals blocked in the first 30 minutes, including the clean SPY 740 directional rip at 09:03. Zero entries fired.

Why the model collapsed

Pre-retrain (2026-05-20 model): META let 4 signals through that day, with no SKIPPED_META_GATE rows in decisions.db. It approved all 3 losers.

Post-retrain (2026-05-21 morning model): 47 skips with take_prob between 0.06 and 0.19, mostly clustered at 0.15-0.19. Blanket pessimism. Engine log: 0DTE meta-model: take_prob=0.06 conf=TRAINED (n=406).

3 bad CALL trades in a 406-row corpus shouldn’t flip a healthy model. Hypothesis: the model’s calibration is brittle around small class-imbalance shifts. Adding 3 losers vs 1 winner skewed the predicted-positive class enough that the calibrated output collapsed near zero for all CALL features.
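
To put a number on "shouldn't flip a healthy model", the sketch below bounds how far the corpus win rate can move when the day's 1 winner and 3 losers are appended. It assumes n=406 is the post-retrain corpus size (so the prior corpus had 402 rows); the prior win count is not recorded in this note, so every possible value is scanned. The names are illustrative, not taken from nightly_ml_pipeline.py.

```python
# Assumption: n=406 is the corpus size AFTER the 4 new rows were added.
PRIOR_ROWS = 402
NEW_WINS, NEW_LOSSES = 1, 3


def base_rate_shift(prior_wins: int) -> float:
    """Change in corpus win rate after appending the day's 4 rows."""
    old_rate = prior_wins / PRIOR_ROWS
    new_rate = (prior_wins + NEW_WINS) / (PRIOR_ROWS + NEW_WINS + NEW_LOSSES)
    return new_rate - old_rate


# The prior win count is unknown here, so try every value from 0 to 402.
shifts = [base_rate_shift(w) for w in range(PRIOR_ROWS + 1)]
worst_drop = min(shifts)  # happens when every prior row was a win
best_gain = max(shifts)   # happens when every prior row was a loss

print(f"largest possible drop in win rate: {worst_drop:.4%}")
print(f"largest possible gain in win rate: {best_gain:.4%}")
```

Under that assumption the base rate cannot move by more than 3/406, well under one percentage point. A take_prob collapse from passing 0.55 to 0.06-0.19 is therefore not explained by the base-rate shift alone, which is consistent with the brittle-calibration hypothesis.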

Two distinct bugs from this episode

  1. The model overcorrected. Calibration is too sensitive to small daily batches. Needs regime conditioning + per-side calibration so 3 BULL CALL losses don’t poison BULL PUT prediction too.
  2. Auto-promotion was unsafe. The new model went live without ANY held-out validation. If we’d held out 30% of training data and only promoted when held-out WR ≥ 0.55 and held-out net > 0, the 2026-05-20 retrain would have FAILED validation and the live model would have stayed on the 2026-05-19 weights.
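
A minimal sketch of the held-out check described in bug 2, under stated assumptions: "held-out WR" is read as the win rate of the held-out trades the candidate would actually take at the 0.55 gate threshold, and a candidate that takes nothing fails outright. Trade, split_holdout, validate_candidate, and predict_take_prob are hypothetical names, not the pipeline's real API.

```python
import random
from dataclasses import dataclass

HOLDOUT_FRACTION = 0.30  # "held out 30% of training data"
MIN_HOLDOUT_WR = 0.55    # held-out win rate floor from this note
GATE_THRESHOLD = 0.55    # META gate threshold from this note


@dataclass
class Trade:
    features: dict
    pnl: float  # realized P&L in dollars


def split_holdout(rows, fraction=HOLDOUT_FRACTION, seed=0):
    """Shuffle deterministically and return (train, holdout)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[cut:], shuffled[:cut]


def validate_candidate(predict_take_prob, holdout):
    """Return (passed, stats) for a candidate model on held-out trades."""
    taken = [t for t in holdout if predict_take_prob(t.features) >= GATE_THRESHOLD]
    if not taken:
        # Assumption: a model that blocks everything cannot pass validation.
        return False, {"taken": 0, "win_rate": None, "net": 0.0}
    wins = sum(1 for t in taken if t.pnl > 0)
    win_rate = wins / len(taken)
    net = sum(t.pnl for t in taken)
    # This note states "net > 0" here; the shipped gate is described as net >= $0.
    passed = win_rate >= MIN_HOLDOUT_WR and net > 0
    return passed, {"taken": len(taken), "win_rate": win_rate, "net": net}
```

With this reading, a candidate that outputs take_prob below 0.55 on every held-out row takes zero trades and is rejected, so the previous weights stay live.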

What we shipped

  • META_GATE_ENABLED=false (env, immediate mitigation 2026-05-21 09:50)
  • META_PROB_SIZING_ENABLED=false (env, sizing also depended on meta_prob and was zeroing out position sizes - required separate flag)
  • 8a253fe fix(meta): stage retrains behind explicit promotion - the durable fix: SECONDARY_RETRAIN_TRAIN_ENABLED=true (default, keep training candidates), SECONDARY_RETRAIN_PROMOTE_ENABLED=false (default, NEVER auto-promote). Held-out validation (WR ≥ 0.55, net ≥ $0) gates promotion. Per-side models (secondary_model_call, secondary_model_put). Manual review/promote/rollback CLI scripts.
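
The staging logic can be sketched roughly as follows. The two flag names and their defaults come from the bullet above; everything else (env_flag, plan_retrain, the action strings, and the idea that setting the promote flag to true allows promotion of a validated candidate) is an assumption for illustration, not the contents of commit 8a253fe.

```python
import os
from typing import Dict, Mapping, Optional

SIDES = ("call", "put")  # per-side models: secondary_model_call, secondary_model_put


def env_flag(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Parse a boolean env var, falling back to the default when unset."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def plan_retrain(validation_passed: Mapping[str, bool],
                 env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Decide, per side, what the nightly job does with a candidate model."""
    train = env_flag("SECONDARY_RETRAIN_TRAIN_ENABLED", True, env)
    promote = env_flag("SECONDARY_RETRAIN_PROMOTE_ENABLED", False, env)
    plan = {}
    for side in SIDES:
        model = f"secondary_model_{side}"
        if not train:
            plan[model] = "keep_live"          # no candidate trained at all
        elif not validation_passed.get(side, False):
            plan[model] = "reject_candidate"   # failed held-out validation
        elif promote:
            plan[model] = "promote_candidate"  # only with the explicit opt-in
        else:
            plan[model] = "stage_for_review"   # default: wait for manual promote
    return plan
```

With default flags, a validated candidate is only ever staged, so the live weights change when someone runs the manual promote step and not before.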

The durable principle

A trading-engine ML model must not silently change behavior overnight. Train candidates automatically; promote them manually. Held-out validation must pass before promotion is even an option. A single bad-day batch must never be able to flip a passing model.

Costs and constraint

The user missed the morning session because the model said “no” to every signal during a clean directional move. Estimated opportunity cost on the 09:03 BULLISH 85 signal alone: the ~$735→$740 move on 100 contracts of an ATM call.

References

  • Plan: plans/2026-05-21-codex-handoff-P0-meta-retrain-opt-in-only.md
  • Ship: commit 8a253fe
  • Live disable: ~/.config/cortanaroi/app.env 2026-05-21