2026-05-05 Overnight ML Pipeline Rebuild

12-hour push from 2026-05-04 evening to 2026-05-05 6:30 AM. Goal: stop the bleeding on the ML data layer (silent label rot since Apr 17), then pivot from “more XGBoost” to meta-labeling per the literature. Result: 18 commits, full test suite green for the first time in days, secondary classifier live in shadow mode with charm/vanna magnitude features empirically validated by mutual-information selection.

What broke (verified, not assumed)

Six bugs that had been silently corrupting the ML data loop:

  1. outcome_resolved backfill rolled back every run since 2026-04-17. decision_logger.py:798 ran DETACH DATABASE while a write transaction was open; SQLite under WAL throws “database pt is locked”; the exception handler rolled the whole batch back; the caller swallowed the error as a non-fatal warning. 9 nightly retrains on a frozen 36-row snapshot. Model AUC=0.50 every time because there was nothing new to fit.

    • Fix: explicit conn.commit() before DETACH (commit 00d0537).
    • Result: 36 → 130 ENTERED resolved (3.6x lift).
  2. outcome_hit_tp substring check missed TAKE_PROFIT. The check "TP" in str(exit_reason).upper() doesn’t match “TAKE_PROFIT” (the characters T-A-K-E-_-P-R-O-F-I-T contain no “TP” pair). 22 winning trades were flagged as losses in the dashboard/calibration tables.

    • Fix: er.startswith("TP") or "TAKE_PROFIT" in er (commit 9ab9b00).
    • Result: hit_tp=1 went from 5 → 27 (5.4x correction).
  3. forward_returns --apply path rolled back even when called. The script ran the UPDATE in both --apply and --no-apply paths, then incremented the counter unconditionally regardless of cursor.rowcount. Dry-run rolled the writes back but the “Updated 106 trades” log line still printed.

    • Fix: rowcount validation, --apply guard around the actual UPDATE (commit 538cb7a).
  4. triple_barrier_label read underlying_price (NULL in production); should have read option_mid. The hand-rolled Codex impl used underlying SPY price, but production trade_path_snapshots only populates option_mid (option premium). Backfill produced 0 labels on first run.

    • Fix: COALESCE(option_mid, underlying_price) + treat early exits as scratch (commit 85e058e).
    • Result: 0 → 84 production labels (55 SL-first, 29 TP-first).
  5. ENTRY_PARTIAL_THEN_REJECTED skipped confirm_fill. When an entry partially filled then got rejected, the engine adopted the partial as a smaller position via pos.contracts_remaining = filled_qty directly - skipping total_contracts, entry_count, avg_cost, entry_order_filled_qty, sl_price. P&L computation downstream was malformed; sl_price stayed at 0.

    • Fix: route through confirm_fill like _cancel_entry does (commit 895cec8).
    • Test that had been quietly failing all session is now green.
  6. 14 historical CLOSED trades had no outcomes row - most from the Apr 7-10 era plus 4 persistent later-era cases. Most likely cause: force_close scripts and broker reconcile paths that flipped status=‘CLOSED’ without going through close_trade().

    • Heal script: scripts/heal_missing_outcomes.py reconstructs from partial_exits or last path snapshot; flags exit_reason='HEALED_*'.
    • Result: 5 healed (4 from partials, 1 from snapshot), 9 lost (early-era pre-path-tracker). 125 → 130 trainable (5 new resolved scoring_events from the heal).
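
The substring pitfall in item 2 is easy to reproduce in isolation. The sketch below is a minimal illustration, not the code from commit 9ab9b00: the helper name hit_take_profit is hypothetical, and only the boolean expression comes from the fix described above.

```python
def hit_take_profit(exit_reason) -> bool:
    """Return True when an exit reason denotes a take-profit exit.

    Hypothetical helper; the expression mirrors the fix described in item 2.
    """
    er = str(exit_reason).upper()
    # "TAKE_PROFIT" has no adjacent "T" + "P", so the old check
    # `"TP" in er` silently missed it.
    return er.startswith("TP") or "TAKE_PROFIT" in er


# The old check fails on the long-form reason; the new helper accepts it.
old_check = "TP" in "TAKE_PROFIT"
new_check = hit_take_profit("TAKE_PROFIT")
```

Using startswith for the short form also avoids matching unrelated reasons that merely contain the letters "TP" somewhere in the middle.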

What we built

Triple-barrier vol-scaled label (Task #57)

  • src/cortana/engines/triple_barrier_label.py - TP/SL/30-min timeout with vol-scaled barriers from 30-min realized vol of pre-entry underlying. mlfinpy parity test (3/3 green) confirms our impl matches López de Prado’s canonical algorithm.
  • outcomes.triple_barrier_label column populated nightly.
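
As a rough illustration of the first-touch rule behind the label, here is a minimal sketch. It is not the code in triple_barrier_label.py: the function name, the tp_mult/sl_mult multipliers, and the +1/-1/0 encoding are assumptions, and vol is assumed to be precomputed (for example from the 30-min pre-entry window). The timeout is represented simply by the end of the supplied price path.

```python
from typing import Sequence


def first_touch_label(
    prices: Sequence[float],
    entry: float,
    vol: float,
    tp_mult: float = 2.0,  # assumed multiplier, not from the source
    sl_mult: float = 1.0,  # assumed multiplier, not from the source
) -> int:
    """Label a path: +1 if TP is touched first, -1 if SL first, 0 on timeout."""
    upper = entry * (1.0 + tp_mult * vol)  # vol-scaled take-profit barrier
    lower = entry * (1.0 - sl_mult * vol)  # vol-scaled stop-loss barrier
    for price in prices:
        if price >= upper:
            return 1
        if price <= lower:
            return -1
    # Neither barrier was touched before the path (time window) ended.
    return 0
```

With entry=100 and vol=0.01 the barriers sit near 102 and 99, so a path that prints 102.5 before touching 99 is labeled +1.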

Meta-labeling secondary classifier (Task #56)

  • src/cortana/engines/secondary_model.py - L1-penalized logistic regression with MIN_SAMPLES_TO_TRAIN=50, mutual-information feature selection down to 8 features.
  • Wired in shadow mode: emits meta_win_prob to scoring_events for every decision. Does NOT gate live trades (Task #56 phase 2 after AUC validation).
  • load_meta_labeling_training_data() JOINs decisions.db ↔ paper_trades.db with the COMMIT-before-DETACH pattern.
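
The COMMIT-before-DETACH pattern can be sketched with the standard sqlite3 module. This is an illustration under assumptions, not the code in decision_logger.py or load_meta_labeling_training_data(): the function name, the decisions/trades tables, and their columns are hypothetical. Only the ordering (commit or roll back first, then DETACH) reflects the fix described above.

```python
import sqlite3


def backfill_resolved(conn: sqlite3.Connection, trades_path: str) -> int:
    """Mark decisions resolved using closed trades from an attached database.

    Hypothetical schema: decisions(trade_id, outcome_resolved) in the main
    database and trades(id, status) in the attached one.
    """
    conn.execute("ATTACH DATABASE ? AS pt", (trades_path,))
    try:
        cur = conn.execute(
            "UPDATE decisions SET outcome_resolved = 1 "
            "WHERE trade_id IN (SELECT id FROM pt.trades WHERE status = 'CLOSED')"
        )
        updated = cur.rowcount
        # End the write transaction before detaching; detaching while it
        # is still open is the failure mode described in this log.
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        # No transaction is open at this point on either path.
        conn.execute("DETACH DATABASE pt")
    return updated
```

Re-raising after the rollback keeps a failed batch visible to the caller instead of letting it be downgraded to a warning.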

Charm + vanna magnitudes (Task #58)

  • Continuous magnitudes added to FEATURIZE_EVENT_COLUMNS:
    • dp_charm_magnitude = total_charm * spy_spot * 100 (per-min hedge $)
    • dp_vanna_magnitude = total_vanna * spy_spot * vix (per-VIX-pt hedge $)
  • Mutual-information selection picked both in the live model’s top-8 on 2026-05-05 06:30 retrain - empirical confirmation of the literature recommendation that magnitudes carry signal that 1-bit directions don’t.
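
The two formulas above translate directly into code. A minimal sketch follows; the function names and the direction_bit helper (included only to contrast with a 1-bit feature) are hypothetical, while the arithmetic is taken from the bullets above.

```python
def dp_charm_magnitude(total_charm: float, spy_spot: float) -> float:
    """Per-minute hedge dollars: total_charm * spy_spot * 100."""
    return total_charm * spy_spot * 100


def dp_vanna_magnitude(total_vanna: float, spy_spot: float, vix: float) -> float:
    """Per-VIX-point hedge dollars: total_vanna * spy_spot * vix."""
    return total_vanna * spy_spot * vix


def direction_bit(value: float) -> int:
    """Hypothetical 1-bit direction feature, shown only for contrast."""
    return 1 if value > 0 else 0


# Two very different exposures collapse to the same direction bit,
# while the continuous magnitude keeps them apart.
small = dp_charm_magnitude(0.5, 500.0)
large = dp_charm_magnitude(2.0, 500.0)
```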

Broker-direct reconciler (Task #42)

  • scripts/broker_truth_reconcile.py - pulls IBKR live socket (reqExecutionsAsync 24h, ib.positions, ib.openOrders, ib.accountSummary) + Flex Query (all-time, gated on env vars). Compares against paper_trades.db. Emits structured findings; --apply only fixes per-trade qty/price/pnl drift (refuses to touch positions table).
  • Scheduled daily at 15:30 CT via com.cortanaroi.broker-reconcile.plist.

Nightly ML pipeline orchestrator

  • scripts/nightly_ml_pipeline.py - chains forward_returns + triple_barrier + outcomes_backfill + secondary retrain. Each step independent; orchestrator returns non-zero only if ALL fail.
  • Replaces single-step com.cortanaroi.ml-backfill.plist invocation. Runs 16:00 CT weekdays.
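
The exit-code rule above (steps are independent, non-zero only if ALL fail) can be sketched as follows. This is not the code in nightly_ml_pipeline.py: run_pipeline and the (name, callable) step shape are assumptions made for illustration.

```python
import sys
from typing import Callable, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_pipeline(steps: Sequence[Step]) -> int:
    """Run every step regardless of earlier failures.

    Returns 1 only when every step failed, otherwise 0.
    """
    failures = 0
    for name, step in steps:
        try:
            step()
            print(f"[ok] {name}")
        except Exception as exc:  # one failing step must not stop the rest
            failures += 1
            print(f"[failed] {name}: {exc}", file=sys.stderr)
    return 1 if steps and failures == len(steps) else 0
```

One consequence of this rule is that a partial failure still exits zero, so the per-step log lines are the only place a single broken step shows up.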

Empirical findings worth remembering

  1. XGBoost on n=100 is anti-predictive. The TabPFN spike (commit 98162f1) measured XGBoost AUC = 0.430 on a 25-row time-series holdout, with Brier barely above the 0.25 random floor. Conditional expectancy at p≥0.55 = 0.682 vs. base rate = 0.680 (model adds nothing). The literature was right about small-n XGBoost.

  2. Meta-labeling beat XGBoost in the published literature - H&T’s verified examples show 17-57pp accuracy lifts (Bollinger MR 20→77% validation, SMA crossover 37→56%). At our n=100, this is the highest-EV move per the comparison.

  3. 80% win rate via ML alone is a 6-12 month horizon. Codex’s adversarial review computed that cutting losses 44 → 15 with zero win attrition is a 66% loss reduction - an extreme bar. A filter alone won’t get there. Plausible path: meta-labeling veto on top 20-30% danger setups + microstructure capture (#51, 3-5 weeks).
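
The metrics quoted in finding 1 are simple to compute by hand. The sketch below assumes that "conditional expectancy" means the win rate among rows whose predicted probability clears the threshold; that reading, and all function names, are assumptions rather than the spike's actual code.

```python
from typing import Optional, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def base_rate(outcomes: Sequence[int]) -> float:
    """Unconditional win rate."""
    return sum(outcomes) / len(outcomes)


def conditional_win_rate(
    probs: Sequence[float], outcomes: Sequence[int], threshold: float = 0.55
) -> Optional[float]:
    """Win rate among rows with p >= threshold; None if no row qualifies."""
    picked = [y for p, y in zip(probs, outcomes) if p >= threshold]
    return sum(picked) / len(picked) if picked else None
```

A model that always predicts 0.5 scores a Brier of exactly 0.25, which is the random floor referenced above. A model adds value only when its conditional win rate clearly exceeds the base rate.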

Open Threads

  • Task #62: User must manually accept the TabPFN license at https://ux.priorlabs.ai to enable the spike’s TabPFN side. Once TABPFN_TOKEN is set, re-run scripts.tabpfn_spike for the A/B.
  • Task #51: Microstructure feature tables - the only path to 80% per the literature. 5 new tables (entry_microstructure, signal_execution_context, cross_asset_state, flow_state, trader_state). 3-5 weeks scope.
  • Task #46: Broker-truth-first writes - architectural shift so DB is a projection of broker_executions, not authored independently. Needs design doc.
  • Task #36: Dashboard meta_win_prob cards - fired tonight as a Codex tab; status pending verification.
  • Task #53: Promotion gate based on conditional expectancy, not AUC. Blocked on having more shadow-mode meta_win_prob data.


Timeline

2026-05-05 06:30 CDT | observed - Pre-market boot verify run after the all-night work. Schema migration applied to production decisions.db (added dp_charm_magnitude + dp_vanna_magnitude). Secondary retrained on 101 rows. Mutual-info selected both new magnitude features in top-8 - first empirical evidence that the magnitude pivot is the right call. Engine loads cleanly.

2026-05-05 04:00-06:30 CDT | Push session 2. Codex #58 charm/vanna magnitudes shipped. position_manager bug diagnosed and fixed (was labeled “known pre-existing” but was a real production bug since de2898f). Heal script for missing outcomes ran (5 healed). Daily broker reconcile + nightly ML orchestrator wired into launchd.

2026-05-04 evening | Push session 1. outcome_resolved root cause found and fixed; 36 → 125 ENTERED resolved. Triple-barrier label shipped. TabPFN spike ran (XGBoost AUC=0.430 baseline). Meta-labeling secondary classifier landed in shadow mode.