2026-05-05 TabPFN Spike + XGBoost Baseline (n=125)
The fixed-label dataset (post-Task #49) has 125 resolved scoring_events. XGBoost trained on 100/25 time-series split: AUC 0.430 - actively anti-predictive (worse than random). TabPFN comparison gated on a one-time manual license acceptance from Cody (Task #62). Empirical confirmation of the literature’s verdict: at our sample size + feature dimensionality, XGBoost has no signal. Meta-labeling is the right pivot, not “more data into the same model.”
What we measured
scripts/tabpfn_spike.py - read-only research script, no deployment.
Reads decisions.db resolved ENTERED rows, time-series split (latest 25
holdout), median imputation from training set only, fits XGBoost as
control + TabPFN as candidate.
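The split and imputation steps above can be sketched as follows. This is a minimal stdlib-only illustration, not the contents of scripts/tabpfn_spike.py: the row layout (a list of dicts), the `resolved_at` key, and the function names are all assumptions made for the example.

```python
from statistics import median


def time_series_split(rows, holdout_n=25, time_key="resolved_at"):
    """Sort rows chronologically and hold out the latest `holdout_n` rows.

    `time_key` is a hypothetical column name; any sortable timestamp works.
    """
    if not 0 < holdout_n < len(rows):
        raise ValueError("holdout_n must be between 1 and len(rows) - 1")
    ordered = sorted(rows, key=lambda r: r[time_key])
    return ordered[:-holdout_n], ordered[-holdout_n:]


def fit_medians(train_rows, feature_names):
    """Compute per-feature medians from the training rows only."""
    medians = {}
    for name in feature_names:
        observed = [r[name] for r in train_rows if r.get(name) is not None]
        # Fall back to 0.0 if a feature is entirely missing in training.
        medians[name] = median(observed) if observed else 0.0
    return medians


def impute(rows, medians):
    """Return copies of rows with missing features filled from `medians`."""
    return [
        {**r, **{k: m for k, m in medians.items() if r.get(k) is None}}
        for r in rows
    ]
```

Fitting the medians on the training rows and then applying them to both sets keeps holdout values from leaking into the imputation, which is the point of "from training set only."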
Results
- Sample size: train=100, holdout=25, total=125
- Holdout span: 2026-05-01 13:07 → 2026-05-04 12:40
- Base win rate (holdout): 0.680 (17 of 25)
XGBoost
| Metric | Value | Reading |
|---|---|---|
| AUC | 0.430 | Anti-predictive; ranks losers higher than winners |
| Brier | 0.253 | Slightly worse than the 0.25 scored by a constant 0.5 forecast (lower is better) |
| Bottom-decile win rate | 0.500 | Trades model says are worst still win 50% |
| E[y \| p≥0.55] | 0.682 | Equal to base rate; threshold gate adds nothing |
| n above threshold | 22 / 25 | Model classifies almost everything as “win” |
The XGBoost model has not learned anything that generalizes to the most recent 25 trades. The conditional-expectancy result (0.682 above threshold vs 0.680 base rate) means the model’s high-probability predictions are statistically indistinguishable from blind betting.
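For reference, the three headline metrics in the table can be computed without any ML library. The sketch below is an illustration under assumed inputs (a list of 0/1 outcomes and a list of predicted win probabilities); the function names are hypothetical and not taken from the script.

```python
def auc(y_true, p):
    """Probability that a random winner is scored above a random loser.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    pos = [s for s, y in zip(p, y_true) if y == 1]
    neg = [s for s, y in zip(p, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one winner and one loser")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos
        for b in neg
    )
    return wins / (len(pos) * len(neg))


def brier(y_true, p):
    """Mean squared error between predicted probability and outcome."""
    return sum((s - y) ** 2 for s, y in zip(p, y_true)) / len(y_true)


def win_rate_above(y_true, p, threshold=0.55):
    """Return (win rate, count) for trades with p >= threshold."""
    picked = [y for s, y in zip(p, y_true) if s >= threshold]
    rate = sum(picked) / len(picked) if picked else float("nan")
    return rate, len(picked)


# Reference points for a holdout with 17 winners in 25 trades.
y = [1] * 17 + [0] * 8
print(brier(y, [0.5] * 25))      # constant coin-flip forecast
print(brier(y, [17 / 25] * 25))  # constant base-rate forecast
```

On a holdout with 17 winners in 25, a constant 0.5 forecast scores a Brier of 0.25 and a constant base-rate forecast scores 0.68 × 0.32 ≈ 0.218, so a Brier of 0.253 is worse than both uninformed baselines.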
TabPFN
Failed cleanly: the first-time weight download requires license
acceptance and an API token. Filed as Task #62 for Cody. Once the token
is set in the TABPFN_TOKEN env var, re-running the script produces a real A/B.
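One way to get this "fail cleanly" behavior is to always run the control arm and gate the candidate arm on the token. The sketch below is an assumption about structure, not a description of how scripts/tabpfn_spike.py is written; `run_arms`, `tabpfn_ready`, and the callables passed in are hypothetical names.

```python
import os


def tabpfn_ready(env=os.environ):
    """True when a non-empty TABPFN_TOKEN is present in the environment."""
    return bool(env.get("TABPFN_TOKEN", "").strip())


def run_arms(run_control, run_candidate, env=os.environ):
    """Always run the control; run the candidate only when the token is set.

    `run_control` and `run_candidate` are hypothetical zero-argument
    callables that return whatever metrics each arm produces.
    """
    report = {"xgboost": run_control()}
    if not tabpfn_ready(env):
        report["tabpfn"] = "skipped: TABPFN_TOKEN not set (see Task #62)"
        return report
    try:
        report["tabpfn"] = run_candidate()
    except Exception as exc:
        # Keep the control result even if the candidate arm fails.
        report["tabpfn"] = f"failed: {exc}"
    return report
```

With this shape, a missing token or a failed download still yields a usable XGBoost report instead of aborting the whole run.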
What this proves
The Codex adversarial review and the 0DTE ML literature both said: at n<200 with 38 features, no tree ensemble produces stable feature ranking; AUC=0.5 is what you get when there’s no information to split on. We had AUC=0.50 on n=26. Now at n=100, we have AUC=0.430 - not random, anti-predictive. The model has fit noise.
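To gauge how much weight an AUC measured on 25 trades can bear, a rough standard error helps. The sketch below uses the Hanley-McNeil approximation, which is a technique added here for illustration and not something the spike script is known to compute; the class counts (17 winners, 8 losers) come from the holdout base rate above.

```python
import math


def auc_standard_error(auc, n_pos, n_neg):
    """Hanley-McNeil approximation to the standard error of an AUC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# Holdout: AUC 0.430 with 17 winners and 8 losers.
se = auc_standard_error(0.430, n_pos=17, n_neg=8)
low, high = 0.430 - 1.96 * se, 0.430 + 1.96 * se
print(f"SE={se:.3f}, approx 95% interval=({low:.3f}, {high:.3f})")
```

Under this approximation the standard error is about 0.13, so the interval around 0.430 runs from roughly 0.18 to 0.68 and includes 0.5. A reading of 0.430 on 25 trades therefore supports "no usable signal" more firmly than it supports a systematically inverted ranking.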
This is the empirical case for:
- Don’t trust the existing XGBoost predictions. The model_win_prob column in scoring_events is worse than a coin flip on the most recent 25 trades (the holdout).
- Meta-labeling has a much narrower hypothesis space. Instead of “predict win probability from raw features,” it learns “given the primary’s already-made decision, how confident am I?” - a much easier problem at our sample size.
- 80% via ML alone is a multi-month horizon. This run is the data point that quantifies how far we are: AUC 0.430 means we’re not just below 80% - we’re below blind betting.
Reproducing
.venv/bin/python -m scripts.tabpfn_spike
# Outputs markdown report to ~/cortanaroi-data/audit/tabpfn_spike_<UTC-ISO>.md

For TabPFN section: see Task #62 setup steps (one-time license accept).
Timeline
2026-05-05 04:52 CDT | observed - Ran XGBoost on a 100/25 time-series split of post-Task-#49 resolved scoring events. AUC 0.430, Brier 0.253. Set against the literature claim that XGBoost needs ~1000+ samples for a stable AUC, this is the real-data confirmation that we’re well below the useful threshold. Filed Task #62 for the TabPFN license to complete the A/B.