Can AI Beat the Sports Betting Market? 8 of the Top Models Tried

In brief

Frontier AI models blew up betting on real-world soccer markets.
They knew the right strategy, but failed to execute it.
A simple 1990s model was able to beat most of them.

General Reasoning just gave frontier AI its worst report card yet. Eight top models, including Claude, Grok, Gemini, and GPT-5.4, were each given a virtual bankroll and asked to build a machine learning betting strategy across a full 2023-24 English Premier League season.

Every single one lost money. Several went completely bankrupt.

The benchmark is called KellyBench, named after the Kelly criterion, a 1956 formula that tells you exactly how much to bet when you have an edge over the market. Every model could recite the Kelly formula. None of them could actually use it.
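For readers who have not seen the formula, here is a minimal sketch of the Kelly criterion for a single win-or-lose bet. The function name and the example numbers are illustrative assumptions, not taken from the KellyBench paper, and a football match has three outcomes, so a real strategy would need more than this.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake on a single binary bet.

    p_win is the bettor's own estimate of the win probability and
    decimal_odds is the bookmaker's payout per unit staked (stake included).
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability")
    net_odds = decimal_odds - 1.0  # profit per unit staked if the bet wins
    if net_odds <= 0.0:
        raise ValueError("decimal_odds must be greater than 1")
    # Kelly: f* = (b*p - q) / b, with b = net odds and q = 1 - p
    fraction = (net_odds * p_win - (1.0 - p_win)) / net_odds
    # A negative value means no edge, so the formula says not to bet.
    return max(0.0, fraction)


if __name__ == "__main__":
    # Hypothetical numbers: a 55% win estimate at even money (decimal odds 2.0)
    print(f"{kelly_fraction(0.55, 2.0):.3f}")  # prints 0.100
```

The formula is only as good as the probability fed into it: an overestimated `p_win` produces an oversized stake, which is one reason bettors often stake only a fraction of the Kelly amount.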

xAI's Grok 4.20 failed all three runs, going fully bankrupt in one and forfeiting mid-season in the other two. Google's Gemini Flash forfeited two of three runs after placing a single wager of about £273,000 on a three-percentage-point historical win-rate edge, and losing it. Claude Opus 4.6, Anthropic's best model, lost 11% on average and somehow came out looking like the responsible adult in the room.

In fact, the research paper mentions that the old Dixon-Coles model from the late 1990s outperformed most of the frontier models evaluated, finishing ahead of six out of eight, even with limited data.

“Dixon-Coles is an outdated 2000s baseline which doesn’t utilise all available data or account for non-stationarity in a principled way,” the researchers note. “It is therefore even more surprising that many frontier models, such as Gemini 3.1 Pro, are unable to beat or match it on KellyBench.”

This matters beyond football. Earlier this year, AI benchmarks showed that Claude could dominate business simulations through price-fixing, cartel agreements, and strategic deception.

That decision-making process involved static competition, limited opponents, clear scoring, and so on. KellyBench is the opposite: 120 matchdays, constantly shifting data, a market that gets smarter every week, and promoted teams with zero historical records.

The researchers call the core problem a "knowledge-action gap." It is exactly what it sounds like.

Business decisions are mostly based on fixed conditions, while sports betting is a more fluid and mutable market, which makes things hard for these models. “KellyBench requires agents to maintain coherent intent across potentially thousands of sequential decisions, monitor the consequences of those decisions, and close the loop between observation and action,” the researchers argue.

We’re not there yet, obviously.

The models could articulate the right strategy, diagnose when something was broken, and identify the cause of their losses, but then failed to verify that their code actually implemented what they planned, failed to notice when execution diverged from intent, and failed to act on their own findings.

GLM-5 wrote three separate self-critique documents during its run. Each one correctly identified that its hardcoded 25% draw rate and overestimation of home advantage were destroying its returns. At one point, with its bankroll around £44,200, it noted that its predicted 40% home win rate was only hitting 30% in reality. It never changed the code. It kept betting the same way until the money was gone.
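The gap GLM-5 spotted, a 40% predicted home win rate against 30% observed, is the kind of thing a short calibration check can surface automatically. The sketch below is an illustration under stated assumptions: the function names, the sample size, and the z-score threshold of 2.0 are hypothetical and do not come from the paper or from any model's actual code.

```python
import math
from typing import Sequence


def calibration_check(predicted: Sequence[float], outcomes: Sequence[int]) -> dict:
    """Compare mean predicted probability with the observed hit rate.

    predicted[i] is the model's probability for an event (e.g. a home win)
    and outcomes[i] is 1 if the event happened, else 0.
    """
    if len(predicted) != len(outcomes) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(predicted)
    mean_pred = sum(predicted) / n
    hit_rate = sum(outcomes) / n
    # Variance of the hit count if the predicted probabilities were correct.
    var = sum(p * (1.0 - p) for p in predicted)
    # z-score of observed minus expected hits; NaN if the variance is zero.
    z = (sum(outcomes) - sum(predicted)) / math.sqrt(var) if var > 0 else float("nan")
    return {"mean_predicted": mean_pred, "hit_rate": hit_rate, "z_score": z}


def needs_recalibration(report: dict, z_threshold: float = 2.0) -> bool:
    """Flag the model when the gap is unlikely to be noise (threshold is arbitrary)."""
    return abs(report["z_score"]) > z_threshold


if __name__ == "__main__":
    # Hypothetical sample: 100 matches at a 40% prediction, 30 actual home wins.
    report = calibration_check([0.40] * 100, [1] * 30 + [0] * 70)
    print(report, needs_recalibration(report))
```

A check like this only helps if its result is wired back into the betting code, for example by pausing bets or refitting the model when the flag is raised. Writing the diagnosis down without acting on it is the failure described above.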

Kimi K2.5 did something arguably more impressive and more tragic. It wrote a mathematically correct fractional Kelly staking function: the right formula, properly structured. Then it never called it. A formatting bug caused the model to send a broken bash command about 50 times in a row. Its reasoning noted the problem. It then sent the identical broken command again. An accidental £114,000 bet (98% of its remaining bankroll) on a Burnley versus Luton match finished the job.
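To make the idea concrete, here is a rough sketch of what a fractional Kelly staking function with a safety guard might look like. It is not Kimi K2.5's code, which the article does not reproduce. The names, the 0.5 multiplier, and the 5% cap are all assumptions chosen for illustration.

```python
def fractional_kelly_stake(
    bankroll: float,
    full_kelly_fraction: float,
    multiplier: float = 0.5,
    max_share: float = 0.05,
) -> float:
    """Stake in currency units: a scaled-down Kelly fraction with a hard cap."""
    if bankroll <= 0.0:
        return 0.0
    if not 0.0 < multiplier <= 1.0:
        raise ValueError("multiplier must be in (0, 1]")
    if not 0.0 < max_share <= 1.0:
        raise ValueError("max_share must be in (0, 1]")
    # No edge (fraction <= 0) means no bet.
    fraction = max(0.0, full_kelly_fraction) * multiplier
    # Cap exposure so no single bet can exceed max_share of the bankroll.
    fraction = min(fraction, max_share)
    return round(bankroll * fraction, 2)


def check_order(stake: float, bankroll: float, max_share: float = 0.05) -> None:
    """Last-line guard, run on the order that is actually being submitted."""
    if stake < 0.0 or stake > bankroll * max_share + 1e-9:
        raise ValueError(f"stake {stake:.2f} exceeds {max_share:.0%} of bankroll")


if __name__ == "__main__":
    stake = fractional_kelly_stake(100_000, 0.04)  # hypothetical inputs
    check_order(stake, 100_000)                    # passes silently
    print(stake)                                   # prints 2000.0
```

The second function is the part that matters here. Because it validates the submitted order rather than the intended one, a £114,000 bet against a bankroll of roughly £116,000 would raise an error no matter how it was generated.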

GPT-5.4 was the most methodical. It spent 160 tool calls building models before placing a single bet, then calculated that its log-loss (0.974) was barely worse than the market's (0.971) and concluded it had no edge. It spent the rest of the season placing penny bets to preserve capital. Sound reasoning.
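Log-loss is the average negative log-probability a forecaster assigned to what actually happened, so lower is better. A uniform guess over home, draw, and away scores ln 3, or about 1.099. The sketch below shows one way such a model-versus-market comparison could be set up. The function names are hypothetical, and the proportional normalization used to strip the bookmaker's margin is just one simple convention, not necessarily what GPT-5.4 or the paper used.

```python
import math
from typing import List, Sequence


def log_loss(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the outcome that occurred.

    probs[i] holds probabilities for (home, draw, away); outcomes[i] is the
    index (0, 1 or 2) of the actual result.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("inputs must be non-empty and the same length")
    eps = 1e-15  # avoid log(0) for overconfident forecasts
    total = 0.0
    for p, k in zip(probs, outcomes):
        total -= math.log(max(p[k], eps))
    return total / len(probs)


def implied_probs(decimal_odds: Sequence[float]) -> List[float]:
    """Turn bookmaker odds into probabilities by proportional normalization."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [r / total for r in raw]


if __name__ == "__main__":
    # Hypothetical two-match sample: model forecasts vs. odds-implied forecasts.
    results = [0, 2]
    model = [[0.50, 0.27, 0.23], [0.30, 0.28, 0.42]]
    market = [implied_probs([1.9, 3.6, 4.2]), implied_probs([3.1, 3.4, 2.3])]
    print(log_loss(model, results), log_loss(market, results))
```

If the model's log-loss is not lower than the market's on held-out matches, as with the 0.974 versus 0.971 reported above, there is no measurable edge for Kelly staking to exploit.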

OpenAI’s model lost 13.6% on average. One run alone cost about $2,012.

Ross Taylor, General Reasoning's CEO and a former Meta AI researcher, told the Financial Times that most AI benchmarks operate in "very static environments" that bear little resemblance to the real world. "There's a lot of excitement about AI automation, but there haven't been many attempts to evaluate AI in long-term, real-world environments," he said.

The General Reasoning team didn’t immediately respond to a request for comment from Decrypt.

To measure strategy quality beyond raw returns, the researchers built a 44-point sophistication rubric with quantitative betting fund experts, covering feature development, stake sizing, non-stationarity handling, and execution. Claude Opus 4.6 scored highest at 32.6%. Less than a third of available points. On the best model.

Higher sophistication scores significantly predicted lower bankruptcy rates (p = 0.008) and correlated with better overall returns. The models are not failing because the market is unbeatable. They are failing because they are not using what they have.

This fits a pattern. Research published last year found that AI models develop something resembling gambling addiction when told to maximize rewards, going bankrupt up to 48% of the time in simulated slot machine tests. A separate real-money crypto trading competition found the same reliability problems over extended periods.

The best-performing model averaged a final bankroll of £89,035, a net loss of £10,965 on a normalized £100,000 starting stake. Gradient boosting, fractional Kelly staking, months of Premier League football, state-of-the-art performance… all just to get rekt.