Mocked tests pass while live behavior is broken
The pattern that bit today
2026-05-21 shipped 7+ dashboard/engine fixes. Each was unit-tested
with mocked IB clients. Each test suite was green
(pytest tests/test_dashboard* → 38-51 passed depending on commit).
Every single one was treated as “ship-quality” based on green tests.
Every single one had latent live-fire failures:
- e2e37e6 stream IBKR ticks into open positions - mocked subscribe works; live ib.reqMktData(contract, '', False, False) returns bid=None, ask=None, last=None and dashboard falls back to snapshot_fallback mode. Yellow badge instead of green.
- 06e2df6 sanitize tick stream NaN payloads - JSON sanitization works (no more bare NaN literals breaking browser parse); tick stream itself still broken (NaN under the hood, sanitized to null).
- 4f4ff83 clear stale position cards - DOM clearing works in isolation; under real broker open/close cycles, WS path could still repaint stale state in some race conditions.
- a243dfb prevent stale broker_execution trade_ids - repair script worked on historical data; live capture path had separate races Codex’s mocks didn’t exercise.
The user’s testing strategy filled the gap
Without a harness, the user validated each fix by opening a small real paper trade and watching the dashboard. They took 6+ “test” trades during 2026-05-21 just to see what fixes did under real conditions. Each test cost real commissions + bid-ask slippage. Day P&L went from +$5,303 → +$9,397 mostly because of test-trade outcomes intermixed with strategy outcomes.
The user explicitly said: “How do I know if the code changes help or hurt otherwise? Do we need some sort of testing harness to validate this?”
YES.
Why mocks didn’t catch the bug
ib_async lets you mock the IB class. Mocks return whatever you
program them to return. Today’s tests programmed the mock reqMktData
to return a Ticker with bid/ask values set. But the real-broker
behavior is:
- reqMktData returns immediately with a Ticker whose fields are nan (initial state).
- Tick callbacks fire asynchronously via pendingTickersEvent or ticker.updateEvent.
- Bid/ask only populate after the first tick arrives - which for illiquid contracts may take seconds, or never.
- If the contract isn’t qualified first (conId == 0), the subscription is silently broken - no callbacks ever fire.
Mocks skip all four of these steps and pretend bid/ask are populated immediately. Tests pass. Production breaks.
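To make the gap concrete, here is a minimal sketch contrasting an optimistic mock with a more faithful one. It uses only the standard library, so it runs without a gateway. FakeTicker, OptimisticFakeIB, DelayedFakeIB and read_quote_immediately are hypothetical names invented for illustration; they are not part of ib_async or of this project's test suite.

```python
import asyncio
import math
from dataclasses import dataclass


@dataclass
class FakeTicker:
    """Hypothetical stand-in for a ticker: quote fields start as NaN."""
    bid: float = math.nan
    ask: float = math.nan


class OptimisticFakeIB:
    """The kind of mock that hides the bug: quotes are populated on return."""

    def reqMktData(self, contract, *args):
        return FakeTicker(bid=1.00, ask=1.05)


class DelayedFakeIB:
    """Closer to the live behavior described above: NaN first, ticks later."""

    def __init__(self, delay: float = 0.05):
        self.delay = delay

    def reqMktData(self, contract, *args):
        ticker = FakeTicker()  # returned immediately, still NaN
        # Schedule the "first tick"; requires a running event loop.
        asyncio.get_running_loop().call_later(self.delay, self._fill, ticker)
        return ticker

    @staticmethod
    def _fill(ticker):
        ticker.bid, ticker.ask = 1.00, 1.05


def read_quote_immediately(ib, contract):
    """Hypothetical code under test: reads bid/ask right after subscribing."""
    ticker = ib.reqMktData(contract, "", False, False)
    return ticker.bid, ticker.ask


async def demo():
    optimistic = read_quote_immediately(OptimisticFakeIB(), "FAKE")
    delayed = read_quote_immediately(DelayedFakeIB(), "FAKE")
    return optimistic, delayed
```

Running asyncio.run(demo()) gives real-looking prices from the optimistic fake and NaN for both fields from the delayed fake. The code under test is identical in both cases; only the double's timing differs. A delayed fake still cannot reproduce illiquid contracts or unqualified-contract failures, so it narrows the gap without closing it.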
The durable answer
scripts/dashboard_integration_test.py (plan: 2026-05-21-codex-handoff-P0-dashboard-live-integration-harness.md):
- Runs against the real paper IB Gateway, not a mock
- 4 phases: health → tick subscription → 1-contract round-trip → cleanup
- Phase 2 alone catches today’s tick-stream bug WITHOUT placing an order - it just calls ib.reqMktData() and asserts bid/ask return real values within 5s.
- Phase 3 round-trips a 1-contract paper trade ($5-10 cost) to validate end-to-end (open → SSE event → tick stream → cut → cleared).
- Launchd plist fires at 08:05 CT daily, 5 min before engine kick. Telegram message: ✅ MK2 integration test PASSED or 🚨 FAILED - DO NOT TRUST DASHBOARD.
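The Phase 2 check could look roughly like the sketch below. This is not the contents of scripts/dashboard_integration_test.py; the function names, the polling approach, and the rule that only finite positive numbers count as a real price are assumptions made for illustration. The only broker call used is reqMktData, with the same positional arguments quoted earlier, and the ticker is read purely through its bid and ask attributes.

```python
import asyncio
import math
import time


def is_real_price(value) -> bool:
    """True only for a finite, positive number (rejects None and NaN).

    Assumption: non-positive values are treated as "no quote".
    """
    return (
        isinstance(value, (int, float))
        and math.isfinite(value)
        and value > 0
    )


async def wait_for_quote(ticker, timeout: float = 5.0, poll: float = 0.05) -> bool:
    """Poll ticker.bid / ticker.ask until both are real or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_real_price(ticker.bid) and is_real_price(ticker.ask):
            return True
        await asyncio.sleep(poll)  # yield so tick callbacks can run
    return False


async def phase_tick_subscription(ib, contract, timeout: float = 5.0) -> bool:
    """Phase 2 sketch: subscribe, then require a real bid/ask within `timeout`."""
    if not getattr(contract, "conId", 0):
        # Unqualified contract: per the notes above, no callbacks would fire.
        return False
    ticker = ib.reqMktData(contract, "", False, False)
    return await wait_for_quote(ticker, timeout)
```

Checking conId before subscribing turns the silent failure into an immediate one, and the timeout keeps an illiquid contract from hanging the run. No order is placed, so this phase costs nothing to run.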
The principle
Mocked unit tests prove the code does what the test author thought it should. Real-broker integration tests prove the code does what IBKR actually does. The gap between those two is where production bugs live. Build the harness BEFORE you ship the fix.
What this changes going forward
Every dashboard / engine / broker-integration fix from 2026-05-22 forward should:
- Ship with mocked unit tests for fast iteration.
- Run through the live harness before being declared “ship-quality.”
- Treat harness PASS as the trust gate, not Codex’s self-confidence.
This is the only way the project gets to high confidence on a non-trivial live trading stack without paying for validation in real-money paper losses every time something ships.
References
- Plan: plans/2026-05-21-codex-handoff-P0-dashboard-live-integration-harness.md (commit e4faefb)
- Day P&L cost of mock-only testing: ~$50-200 in commissions + spread across 6+ test trades. Bigger cost: hours of frustration and lost confidence in the system.