Mocked tests pass while live behavior is broken

The pattern that bit today

2026-05-21 shipped 7+ dashboard/engine fixes. Each was unit-tested with mocked IB clients. Each test suite was green (pytest tests/test_dashboard* → 38-51 passed depending on commit). Every single one was treated as “ship-quality” based on green tests.

Every single one had latent live-fire failures:

  • e2e37e6 stream IBKR ticks into open positions - mocked subscribe works; live ib.reqMktData(contract, '', False, False) yields bid=None, ask=None, last=None at the dashboard (the Ticker fields never leave their initial NaN state), so the dashboard falls back to snapshot_fallback mode. Yellow badge instead of green.
  • 06e2df6 sanitize tick stream NaN payloads - JSON sanitization works (no more bare NaN literals breaking browser parse); tick stream itself still broken (NaN under the hood, sanitized to null).
  • 4f4ff83 clear stale position cards - DOM clearing works in isolation; under real broker open/close cycles, the WS path could still repaint stale state under some race conditions.
  • a243dfb prevent stale broker_execution trade_ids - repair script worked on historical data; live capture path had separate races Codex’s mocks didn’t exercise.
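
The 06e2df6 case can be illustrated with a minimal sketch. The sanitize helper and the raw_tick payload below are hypothetical stand-ins, not the project's actual code; the point is only that Python's json module emits bare NaN by default, and that sanitizing it to null makes the payload parseable without making the quote real.

```python
import json
import math


def sanitize(value):
    """Recursively replace NaN/Infinity floats with None so the payload is valid JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


# Hypothetical tick payload: the subscription never delivered a quote.
raw_tick = {"symbol": "XYZ", "bid": float("nan"), "ask": float("nan")}

# json.dumps allows NaN by default and writes it as a bare NaN literal,
# which a browser's JSON.parse rejects.
unsafe_json = json.dumps(raw_tick)

# After sanitizing, the payload parses, but bid/ask are still missing.
safe_json = json.dumps(sanitize(raw_tick))
```

A green test on safe_json only shows that the browser can parse the message. It says nothing about whether a tick ever arrived.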

The user’s testing strategy filled the gap

Without a harness, the user validated each fix by opening a small real paper trade and watching the dashboard. They took 6+ “test” trades during 2026-05-21 just to see what the fixes did under real conditions. Each test cost real commissions + bid-ask slippage. Day P&L went from +$5,303 → +$9,397, mostly because test-trade outcomes were intermixed with strategy outcomes.

The user explicitly said: “How do I know if the code changes help or hurt otherwise? Do we need some sort of testing harness to validate this?”

YES.

Why mocks didn’t catch the bug

ib_async lets you mock the IB class. Mocks return whatever you program them to return. Today’s tests programmed the mock reqMktData to return a Ticker with bid/ask values set. But the real-broker behavior is:

  1. reqMktData returns immediately with a Ticker whose fields are nan (initial state).
  2. Tick callbacks fire asynchronously via pendingTickersEvent or ticker.updateEvent.
  3. Bid/ask only populate after the first tick arrives - which for illiquid contracts may take seconds, or never.
  4. If the contract isn’t qualified first (conId == 0), the subscription is silently broken - no callbacks ever fire.

Mocks skip steps 1-4 entirely and pretend bid/ask are populated immediately. Tests pass. Production breaks.
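
For contrast, here is a rough sketch of a test double that at least models steps 1-4. Everything in it (RealisticFakeIB, FakeTicker, FakeContract, the 0.05s delay, the quote values) is hypothetical and stdlib-only; it is not ib_async and not the project's mock, and it does not replace the live harness. It only shows what the green tests were not exercising.

```python
import asyncio
import math
from dataclasses import dataclass


@dataclass
class FakeContract:
    symbol: str
    conId: int = 0  # 0 means "never qualified" (step 4)


@dataclass
class FakeTicker:
    contract: FakeContract
    bid: float = math.nan  # step 1: fields start out as nan
    ask: float = math.nan


class RealisticFakeIB:
    """Hypothetical stand-in for the IB client that models steps 1-4."""

    def __init__(self, first_tick_delay: float = 0.05):
        self.first_tick_delay = first_tick_delay

    def reqMktData(self, contract, genericTickList="", snapshot=False,
                   regulatorySnapshot=False) -> FakeTicker:
        ticker = FakeTicker(contract)  # returned immediately, still nan
        if contract.conId != 0:
            # Steps 2-3: the first quote arrives later, asynchronously.
            loop = asyncio.get_running_loop()
            loop.call_later(self.first_tick_delay, self._deliver, ticker)
        # Step 4: an unqualified contract gets no callback at all.
        return ticker

    @staticmethod
    def _deliver(ticker: FakeTicker) -> None:
        ticker.bid, ticker.ask = 1.00, 1.05  # made-up quote


async def demo():
    ib = RealisticFakeIB()
    qualified = ib.reqMktData(FakeContract("XYZ", conId=12345), "", False, False)
    unqualified = ib.reqMktData(FakeContract("XYZ"), "", False, False)
    nan_right_after_call = math.isnan(qualified.bid)
    await asyncio.sleep(0.2)  # give the delayed tick time to land
    return nan_right_after_call, qualified.bid, math.isnan(unqualified.bid)


result = asyncio.run(demo())
```

Code that reads bid/ask straight off the returned Ticker passes against a pre-populated mock and fails against this double, which is the same failure the live gateway produced.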

The durable answer

scripts/dashboard_integration_test.py (plan: 2026-05-21-codex-handoff-P0-dashboard-live-integration-harness.md):

  • Runs against the real paper IB Gateway, not a mock
  • 4 phases: health → tick subscription → 1-contract round-trip → cleanup
  • Phase 2 alone catches today’s tick-stream bug WITHOUT placing an order - it just calls ib.reqMktData() and asserts bid/ask return real values within 5s.
  • Phase 3 round-trips a 1-contract paper trade ($5-10 cost) to validate end-to-end (open → SSE event → tick stream → cut → cleared).
  • Launchd plist fires at 08:05 CT daily, 5 min before engine kick. Telegram message: ✅ MK2 integration test PASSED or 🚨 FAILED - DO NOT TRUST DASHBOARD.
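
A minimal sketch of what the Phase 2 check might look like follows. The helper names (has_live_quote, wait_for_quote, run_phase2), the client id, and the host/port defaults are assumptions for illustration, not the contents of scripts/dashboard_integration_test.py; 4002 is used because it is the usual IB Gateway paper port. The ib_async calls (connect, qualifyContracts, reqMktData, sleep, cancelMktData, disconnect) are imported lazily so the polling helpers can be exercised without a gateway.

```python
import math
import time


def has_live_quote(ticker) -> bool:
    """True once both bid and ask are real numbers (not None, not nan)."""
    for value in (getattr(ticker, "bid", None), getattr(ticker, "ask", None)):
        if value is None or math.isnan(value):
            return False
    return True


def wait_for_quote(ticker, pump, timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Poll until the ticker has a live quote or the timeout expires.

    `pump` is called with the poll interval so the caller's event loop can
    process incoming ticks (with ib_async this would be ib.sleep).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if has_live_quote(ticker):
            return True
        pump(poll)
    return has_live_quote(ticker)


def run_phase2(contract, host="127.0.0.1", port=4002, client_id=99) -> bool:
    """Hypothetical Phase 2: subscribe on the paper gateway, place no order."""
    from ib_async import IB  # lazy import: only needed for the live run

    ib = IB()
    ib.connect(host, port, clientId=client_id)
    try:
        ib.qualifyContracts(contract)  # fills in conId in place
        if not contract.conId:
            return False  # unqualified: the subscription would never tick
        ticker = ib.reqMktData(contract, "", False, False)
        ok = wait_for_quote(ticker, ib.sleep, timeout=5.0)
        ib.cancelMktData(contract)
        return ok
    finally:
        ib.disconnect()
```

Passing the pump function in, rather than calling ib.sleep directly, keeps the timeout logic testable on its own while the live run still lets the ib_async event loop deliver ticks between polls.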

The principle

Mocked unit tests prove the code does what the test author thought it should. Real-broker integration tests prove the code does what IBKR actually does. The gap between those two is where production bugs live. Build the harness BEFORE you ship the fix.

What this changes going forward

Every dashboard / engine / broker-integration fix from 2026-05-22 forward should:

  1. Ship with mocked unit tests for fast iteration.
  2. Run through the live harness before being declared “ship-quality.”
  3. Treat harness PASS, not Codex’s self-confidence, as the trust gate.

This is the only way the project gets to high confidence on a non-trivial live trading stack without paying for validation in real-money paper losses every time something ships.

References

  • Plan: plans/2026-05-21-codex-handoff-P0-dashboard-live-integration-harness.md (commit e4faefb)
  • Day P&L cost of mock-only testing: ~$50-200 in commissions + spread across 6+ test trades. Bigger cost: hours of frustration and lost confidence in the system.