2026-05-05 TabPFN Spike + XGBoost Baseline (n=125)

The fixed-label dataset (post-Task #49) has 125 resolved scoring_events. XGBoost trained on 100/25 time-series split: AUC 0.430 - actively anti-predictive (worse than random). TabPFN comparison gated on a one-time manual license acceptance from Cody (Task #62). Empirical confirmation of the literature’s verdict: at our sample size + feature dimensionality, XGBoost has no signal. Meta-labeling is the right pivot, not “more data into the same model.”

What we measured

scripts/tabpfn_spike.py - read-only research script, no deployment. Reads decisions.db resolved ENTERED rows, time-series split (latest 25 holdout), median imputation from training set only, fits XGBoost as control + TabPFN as candidate.

Results

Sample size: train=100, holdout=25, total=125 Holdout span: 2026-05-01 13:07 → 2026-05-04 12:40 Base win rate (holdout): 0.680 (17 of 25)

XGBoost

MetricValueReading
AUC0.430Anti-predictive; ranks losers higher than winners
Brier0.253Barely above the 0.25 random-baseline floor
Bottom-decile win rate0.500Trades model says are worst still win 50%
E[y | p≥0.55]0.682Equal to base rate; threshold gate adds nothing
n above threshold22 / 25Model classifies almost everything as “win”

The XGBoost model has not learned anything that generalizes to the most recent 25 trades. The conditional-expectancy result (0.682 above threshold vs 0.680 base rate) means the model’s high-probability predictions are statistically indistinguishable from blind betting.

TabPFN

Failed cleanly: license acceptance + API token required for first-time weight download. Filed as Task #62 for Cody. Once tokens are in TABPFN_TOKEN env var, re-running the script produces a real A/B.

What this proves

The Codex adversarial review and the 0DTE ML literature both said: at n<200 with 38 features, no tree ensemble produces stable feature ranking; AUC=0.5 is what you get when there’s no information to split on. We had AUC=0.50 on n=26. Now at n=100, we have AUC=0.430 - not random, anti-predictive. The model has fit noise.

This is the empirical case for:

  1. Don’t trust the existing XGBoost predictions. The model_win_prob column in scoring_events is worse than a coin flip for the most recent month.
  2. Meta-labeling has a much narrower hypothesis space. Instead of “predict win probability from raw features,” it learns “given the primary’s already-made decision, how confident am I?” - a much easier problem at our sample size.
  3. 80% via ML alone is a multi-month horizon. This run is the data point that quantifies how far we are: AUC 0.430 means we’re not just below 80% - we’re below blind betting.

Reproducing

.venv/bin/python -m scripts.tabpfn_spike
# Outputs markdown report to ~/cortanaroi-data/audit/tabpfn_spike_<UTC-ISO>.md

For TabPFN section: see Task #62 setup steps (one-time license accept).

See Also


Timeline

2026-05-05 04:52 CDT | observed - Ran XGBoost on 100/25 time-series split of post-Task-#49 resolved scoring events. AUC 0.430, Brier 0.253. Compared to literature claim that XGBoost needs ~1000+ samples for stable AUC, this is the real-data confirmation that we’re well below useful threshold. Filed task #62 for TabPFN license to complete the A/B.