2026-05-06 power outage exposed full state-divergence cascade

TL;DR: Power loss mid-session orphaned an open paper position for ~3 hours. The TP fired at the broker during the outage. When power restored, every layer of state (engine memory, position_state table, trades table, dashboard render) had a different view of reality. The only thing that agreed with the broker was the broker. Recovery required killing the engine, hand-crafted SQL, and disabling launchd to stop the engine’s in-memory cache from rewriting corrected DB rows. For commercialization, broker-truth-first is non-negotiable - the entire local cache hierarchy has to defer to broker state, not the other way around.

Timeline

  • 10:55:54 CT - Trade #162 (signal #238) entered: BULL CALL 727 × 39 @ 5.50.
  • ~11:47 CT - Last successful IBKR updatePortfolio log entry. SPY 727C still position=39, marketPrice ~1,754.
  • ~11:48-14:42 CT - Power outage. Mac off for ~3 hours. During this window SPY rallied far enough for the broker-side TP on the 727C to fire, realizing +1,857.70 on the trade (broker realizedPNL read 3,742.44 once the engine reconnected).
  • 14:42 CT - Power restored. Mac boots.
  • 14:44 CT - launchd KeepAlive re-kicks the engine (com.cortanaroi.mk2). Engine starts, dashboard process resumes.
  • 14:45 CT - Engine attempts to connect to IB Gateway on port 4002: ConnectionRefusedError(61). The IB Gateway job (com.cortanaroi.ibgateway) had exited cleanly during the boot sequence (likely a silent failure inside IBC), so nothing was listening on the port. After 30s of retries, the engine raises ConnectionError and exits at app.py:2832 → create_broker → client.connect.
  • 14:46-14:47 CT - launchd re-fires the engine multiple times. Each attempt fails identically.
  • 14:48 CT - Manual recovery: launchctl kickstart -k gui/$(id -u)/com.cortanaroi.ibgateway succeeds. Port 4002 opens. Re-kick engine: connects. Engine subscribes to equity feeds, IBKR streaming wired.
  • 14:49 CT - Engine receives updatePortfolio for 727C: position=0.0, realizedPNL=3742.44. Broker truth: position is closed. But local DB (trades and position_state) still says trade #162 is OPEN with 39 contracts.
  • 14:53 CT - Run broker_truth_reconcile.py. Audit JSON correctly identifies P0 open_positions_drift for trade #162. Suggested fix: “Refuse auto-fix; inspect position_state and broker manually.” (Reconciler is read-only by design.)
  • 14:54 CT - Manual SQL: UPDATE trades SET status='CLOSED' WHERE id=162; INSERT INTO outcomes (trade_id=162, exit_price=5.50, exit_reason='TAKE_PROFIT', pnl_dollars=1857.70, manual_override_flag=1, ...). Verified: 0 open trades.
  • ~14:55 CT - Dashboard still shows position OPEN with fake +$9,321 unrealized P&L. Engine’s in-memory position_state cache rewrites the DB row back to is_open=1, contracts_remaining=39. The engine actively undid the manual reconcile.
  • 14:57 CT - Engine killed via kill <pid>. SQL update redone. launchctl disable to prevent KeepAlive re-fire.
  • 14:58 CT - DB clean. Dashboard refresh shows 0 open positions. Reality matches.
  • 15:30 CT - Market closes. Engine briefly auto-restarted (the job had been re-enabled with launchctl enable by the time the engine restarted) and ran a clean EOD: 15 trades, 9 wins, P&L -$2,059.00.

Failure stack (every layer broke its contract)

Layer 1 - Physical (no UPS). OS hard-killed by power loss. No clean shutdown. Mitigation needed: UPS with NUT or similar trigger to gracefully stop the engine + Gateway before battery cutoff.

Layer 2 - IB Gateway plist. com.cortanaroi.ibgateway plist was loaded (visible in launchctl list) but the IBC launcher exited cleanly after a single attempt and launchd considered the job successful (exit code 0). No KeepAlive on this plist. Mitigation needed: KeepAlive set to true plus a stricter healthcheck that proves port 4002 is responsive, not just “the IBC launcher exited.”
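
A minimal sketch of the "prove port 4002 is responsive" half of that healthcheck. The function name port_is_open is hypothetical, and a successful TCP connect only shows that something is listening; a fuller check would also complete an API handshake.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A wrapper script could call port_is_open("127.0.0.1", 4002) after launching IBC and exit non-zero when it returns False, so launchd sees a failed job rather than a clean exit.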

Layer 3 - Engine connect retry. app.py:2832 calls create_broker which does a single 30s connect attempt then raises ConnectionError. No backoff, no retry loop. Engine crashes; launchd kicks; engine crashes again. Loop continues until something breaks the cycle. Mitigation needed: retry with exponential backoff (e.g., 5s, 15s, 60s, 300s) and graceful “wait for broker” state instead of immediate crash.
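
The backoff idea can be sketched as follows. This is an illustration, not the engine's code: connect_with_backoff and its parameters are hypothetical, the connect callable stands in for whatever create_broker does, and the delay schedule is the example schedule from the mitigation above.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    delays: Iterable[float] = (5, 15, 60, 300),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try connect() once, then once more after each delay, before giving up."""
    last_error = None
    for delay in [0, *delays]:
        if delay:
            sleep(delay)  # the engine stays alive in a "wait for broker" state
        try:
            return connect()
        except ConnectionError as exc:  # ConnectionRefusedError is a subclass
            last_error = exc
    raise ConnectionError("broker unreachable after all retries") from last_error
```

With the default schedule the engine keeps trying for a little over six minutes before it gives up, instead of crashing after the first refused connection and leaving launchd to loop. The sleep parameter is injectable so the schedule can be tested without waiting.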

Layer 4 - State reconciliation on engine startup. Engine, on connecting, did receive updatePortfolio events including the truth position=0.0 for trade #162. But the in-memory PositionManager retained its prior state from the local DB (which said is_open=1, contracts=39) and rewrote the DB to match memory rather than reconciling memory to broker. The reconciliation flow is one-way wrong: cache → DB, not broker → cache. Mitigation needed: mandatory reconcile_positions_from_broker() on engine startup; block any new signals until reconciliation completes; broker-as-source-of-truth invariant per GH #46 (in_progress).
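
A minimal sketch of the broker → cache direction, assuming positions can be keyed by a contract string and that broker quantities arrive as floats (as in the position=0.0 seen above). PositionCache and can_accept_signals are hypothetical names; only reconcile_positions_from_broker comes from the mitigation text, and its real signature is not known here.

```python
from dataclasses import dataclass, field


@dataclass
class PositionCache:
    # contract key -> contracts the engine believes it holds
    positions: dict[str, int] = field(default_factory=dict)
    reconciled: bool = False  # signals stay blocked until this is True

    def reconcile_positions_from_broker(
        self, broker_positions: dict[str, float]
    ) -> dict[str, tuple[int, int]]:
        """Overwrite cached quantities with broker quantities; return the changes."""
        changes = {}
        for key in set(self.positions) | set(broker_positions):
            cached = self.positions.get(key, 0)
            actual = int(broker_positions.get(key, 0))
            if cached != actual:
                changes[key] = (cached, actual)
            if actual:
                self.positions[key] = actual
            else:
                self.positions.pop(key, None)  # broker says flat: drop it
        self.reconciled = True
        return changes

    def can_accept_signals(self) -> bool:
        return self.reconciled
```

In today's case a cache seeded from the DB with 39 contracts of the 727C would be reconciled against a broker quantity of 0.0, yielding a change record of (39, 0) and an empty cache, and only then would signals be unblocked. The sketch treats a contract missing from the broker snapshot as flat, which is only safe if the snapshot is known to be complete.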

Layer 5 - Two parallel state stores. trades table and position_state table both track open positions. They diverge independently. Today’s manual reconcile required updating both; the existing broker_truth_reconcile.py only flags position_state divergence and refuses to auto-fix. Mitigation needed: unify the two stores OR enforce synchronized writes via a single repository pattern.
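
The synchronized-write option might look like the sketch below, assuming a SQLite-style store. The column names status, is_open, and contracts_remaining appear in this note; the trade_id column on position_state and the close_trade function are assumptions made for illustration.

```python
import sqlite3


def close_trade(conn: sqlite3.Connection, trade_id: int) -> None:
    """Mark a trade closed in both stores in one transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "UPDATE trades SET status = 'CLOSED' WHERE id = ?",
            (trade_id,),
        )
        conn.execute(
            "UPDATE position_state SET is_open = 0, contracts_remaining = 0 "
            "WHERE trade_id = ?",  # trade_id column is an assumed schema detail
            (trade_id,),
        )
```

If every writer goes through one function like this, the two tables cannot disagree about whether a trade is open, because neither update is visible without the other.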

Layer 6 - Dashboard “Cut” button. The dashboard rendered a Cut button for a position that didn’t exist at the broker. Clicking it would have sent a SELL 39 SPY 727C order, which IBKR would accept as a new short position, not a close. That could have been a multi-thousand-dollar blowout in the last 9 minutes before market close. Mitigation needed: dashboard Cut button must verify broker truth before sending the order; refuse to send if broker says position=0.
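
A sketch of that guard, assuming the dashboard can fetch the broker's current quantity immediately before building the order. build_cut_order, NoBrokerPosition, and the order dict shape are hypothetical and do not reflect the dashboard's actual order path.

```python
class NoBrokerPosition(Exception):
    """Raised when the broker holds nothing to sell."""


def build_cut_order(symbol: str, requested_qty: int, broker_qty: float) -> dict:
    """Return a SELL order capped at the broker's position, or refuse."""
    if broker_qty <= 0:
        # Selling here would open a new short rather than close a long.
        raise NoBrokerPosition(f"broker reports no long position in {symbol}")
    qty = min(requested_qty, int(broker_qty))
    return {"action": "SELL", "symbol": symbol, "quantity": qty}
```

Capping the quantity at the broker's position also covers the partial case, where the broker holds fewer contracts than the dashboard believes.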

Layer 7 - Manual SQL was the recovery path. A human had to read audit JSON, hand-craft UPDATE and INSERT statements, and disable a launchd job to stop the engine from undoing the fix. None of this scales for commercialization. Mitigation needed: broker_truth_reconcile.py --apply --confirm-position-divergence mode that’s safe enough to run automatically when broker shows position=0 but DB shows is_open=1 with contracts > 0.
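
One way to keep such a mode narrow is to separate "safe to auto-close" from "needs a human" before touching the DB. The sketch below is an assumption-laden illustration: DbPosition, plan_fixes, and the contract-keyed broker snapshot are hypothetical, and the only auto-fix case is the one named above (broker position 0, DB is_open=1 with contracts > 0).

```python
from typing import NamedTuple


class DbPosition(NamedTuple):
    trade_id: int
    contract: str
    is_open: int
    contracts_remaining: int


def plan_fixes(
    db_rows: list[DbPosition], broker_positions: dict[str, float]
) -> tuple[list[int], list[int]]:
    """Split divergent open rows into (auto_close, manual_review) trade ids."""
    auto_close, manual_review = [], []
    for row in db_rows:
        if not row.is_open:
            continue
        broker_qty = broker_positions.get(row.contract)
        if broker_qty is None:
            # Broker did not report this contract: unknown is not zero.
            manual_review.append(row.trade_id)
        elif broker_qty == 0 and row.contracts_remaining > 0:
            auto_close.append(row.trade_id)
        elif broker_qty != row.contracts_remaining:
            manual_review.append(row.trade_id)  # partial or oversized mismatch
        # Otherwise the quantities agree and there is nothing to fix.
    return auto_close, manual_review
```

For today's incident, a DB row for trade #162 with 39 contracts against a broker quantity of 0.0 would land in auto_close; anything ambiguous stays with the operator.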

Layer 8 - Three IDs for one trade. Telegram says “Signal #238” and “Position #301.” DB trades.id = 162. Dashboard label = #301. User couldn’t easily verify which trade was being discussed. Triple-naming is a UX problem for any non-developer operator. Mitigation needed: unified canonical “Trade #” referent across Telegram, dashboard, and DB.

Layer 9 - No alert on broker disconnect with open position. Engine raised ConnectionError and exited silently. No Telegram alert. No page. Operator only noticed because the dashboard SPY ticker was frozen and they happened to glance at it. Mitigation needed: auto-page on broker_disconnected AND open_position_count > 0.
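
The paging condition can be sketched as a small state holder that fires once per disconnect episode rather than on every loop iteration. DisconnectPager and the notify callable are hypothetical; notify stands in for whatever sends the Telegram message or page.

```python
from typing import Callable


class DisconnectPager:
    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify
        self._paged = False

    def update(self, broker_connected: bool, open_position_count: int) -> bool:
        """Page once when disconnected with open positions; return True if paged."""
        at_risk = (not broker_connected) and open_position_count > 0
        if at_risk and not self._paged:
            self._notify(
                f"Broker disconnected with {open_position_count} open position(s)"
            )
            self._paged = True
            return True
        if not at_risk:
            self._paged = False  # re-arm for the next disconnect
        return False
```

Because the engine exited on ConnectionError today, a check like this would need to run somewhere that outlives the engine process, or be called before the engine exits.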

What worked

  • Mac auto-rebooted on power restore.
  • launchd KeepAlive for the engine job DID try to restart the engine repeatedly.
  • IBKR’s broker-side TP order fired correctly during the outage while everything client-side was dark. This is the right behavior - broker-side stops/TPs are the safety net.
  • broker_truth_reconcile.py correctly identified the divergence (read-only mode, did not auto-corrupt).
  • The +$1,857.70 the broker realized was real money (or paper-real money). Trade #162 closed as a winner instead of adding to the day’s loss purely because of broker reliability.

Numbers

  • 15 trades total, 9 wins, 6 losses, 60% win rate. Net -$2,059.
  • Trade #162 contributed +$1,857.70 (broker realized) which the manual reconcile captured.
  • Today’s training data: 15 labeled outcomes, all with meta_prob (n=122 trained model context), all with realized exit prices.

What this changes for tonight’s bundle

The 6-fix bundle (#53 meta gate, #54 UW 404, #55 BEAR conv, #57 EMA decay, #58 meta-prob sizing, #59 UW WS timestamp) addresses signal-side and feature-side issues. None of it addresses today’s outage cascade. That’s GH #46 (broker-truth-first writes, in_progress) plus the 9 layers above, captured as task #91 (long-horizon contingency plan).

Dependencies for live capital

This incident is the cleanest empirical case for why paper trading toward live capital cannot proceed without the contingency stack. Order of dependency:

  1. UPS + clean shutdown (physical)
  2. IB Gateway KeepAlive hardening + healthcheck
  3. Engine retry-with-backoff (no crash on transient broker disconnect)
  4. Mandatory broker-state reconcile on startup (GH #46)
  5. Auto-page on broker-disconnect-with-open-position
  6. Dashboard Cut button validates broker truth (P0 invariant)
  7. Reconciler --apply --confirm-position-divergence safe auto-fix mode
  8. Unified state store (or synchronized writes)
  9. Cellular failover for internet
  10. Cloud-hosted engine (eliminate single-machine dependence)

The first 6 are weekend-to-2-weeks of work. The last 4 are pre-live prerequisites that need real infrastructure. None can be skipped.

See also

  • 2026-05-06-morning-whipsaw-cluster.md (same-session, signal-side failure)
  • Memory feedback_no_kill_with_open_positions.md (reinforced today: don’t kill engine while open position; but ALSO don’t let the engine kill the truth)
  • Memory project_pm_ibkr_exit_invariant.md (PM exit intent → SELL at IBKR → close. Today’s variant: BROKER exit intent → broker fires → ENGINE STILL THINKS POSITION OPEN.)
  • GH #46 Broker-truth-first writes (in_progress)
  • Task #91 Contingency plan (long-horizon)