Lee Harden

An after-action on an LLM-driven Kalshi trading bot

It had a 59% win rate. It still lost money. Three lessons about edge, prompt-spec leakage, and why a strong reasoning model is not a trading edge.

6 min read · by Lee Harden

I built an automated trading bot for short-horizon cryptocurrency prediction markets on Kalshi. Over the first month of live trading the bot won seventeen of its twenty-nine bets — a 59% win rate. Over the same period the running profit-and-loss was negative $51.54 on a $20 bankroll that had been topped up several times. After three iterations of tighter prompts and tuned parameters, I disabled all seven cron jobs in a single change and shut the system down.

This is the after-action.

What the bot did

The Kalshi 15-minute crypto markets ask, every fifteen minutes, whether the price of BTC, ETH, SOL, DOGE, BNB, or XRP will be above a strike at the next quarter hour. Each contract resolves yes or no. The bot's job, run on a cron, was to:

  1. Pull the current price and recent volatility from a market data API.
  2. Read the Kalshi orderbook for the next-expiring market in each series.
  3. Ask Claude to take a position and a bet size.
  4. Place the order via the Kalshi API.
  5. Wait for resolution and book the result.

Seven cron jobs total: one master that fired every fifteen minutes, and six per-series triggers that ran in the final minutes before each close. The whole rig was a couple of hundred lines of Node.

The hypothesis was that a thinking model, given the orderbook, recent price history, and a prompt that explained Kalshi's payout structure, could find an edge in markets that humans were mostly ignoring at off-peak hours. Markets that get heavy attention during the day and almost none in the evening should, in theory, be the easiest to beat.

It turns out: no. Or at least not the way I built it.

The early signal that wasn't

The first week looked plausible. The bot won more than it lost. The win rate, calculated the way a casual reader would calculate it, was 59%. That is well above the 50/50 a coin would deliver, and most retail traders don't sustain it on prediction markets.

A 59% win rate on a Kalshi-style yes/no market is also worth approximately nothing.

The reason is the payout structure. On Kalshi you pay a price between zero and one for a contract that pays exactly one dollar if you're right and zero if you're wrong. If you buy at sixty-five cents and you're right, you make thirty-five cents. If you buy at sixty-five cents and you're wrong, you lose sixty-five cents. Break-even on that trade requires a win rate of price / 1.00 — sixty-five percent. Anything below that and the trade is a money-loser in expected value, regardless of how often it lands on the correct side. A directional bot, in practice, hunts the markets that feel most confident, and those are precisely the markets the orderbook has already priced near the boundary where the math turns against you.
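The arithmetic reduces to one line. For a binary contract bought at `price` with true win probability `p`, the expected value per contract is `p * (1 - price) - (1 - p) * price`, which simplifies to `p - price`:

```javascript
// Expected value per contract on a Kalshi-style binary market.
// p: true win probability; price: entry price in dollars (0..1).
// A win pays (1 - price); a loss costs price.
function expectedValue(p, price) {
  return p * (1 - price) - (1 - p) * price; // simplifies to p - price
}

expectedValue(0.59, 0.65); // ≈ -0.06: 59% accuracy at 65 cents loses ~6 cents per contract
expectedValue(0.59, 0.50); // ≈  0.09: the same accuracy at 50 cents makes ~9 cents
```

Break-even is exactly `p === price`, which is why a confident-looking market priced at sixty-five cents demands sixty-five percent accuracy just to tread water.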

This is the first lesson, and the most expensive: a high win rate on prediction markets is not the same as a positive expected value. The price you paid to enter is the load-bearing variable, and "we won more than we lost" is a vanity metric that hides the question of whether you should have bet at all.

What broke structurally

Beyond the win-rate-versus-EV problem, the bot kept doing things it had been instructed not to.

Every few sessions it would place a range bet — a contract on whether the price would land inside a band — when the prompt had explicitly asked for a directional bet on the next-quarter close. Same model, same prompt, occasional category drift mid-reasoning. I added an explicit range_only=false flag in the prompt, sharpened the system message, and wrote a post-decision validator that rejected any order whose market type did not match what the prompt had asked for. The validator caught some violations. It did not catch all of them.
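A validator of that shape is short. This is a hypothetical reconstruction, not the bot's actual code; the field names (`marketType`, `size`) and constraint names are illustrative:

```javascript
// Reject any model-proposed order that violates constraints set outside
// the model: wrong market category, or size above a hard cap.
function validateOrder(order, constraints) {
  const errors = [];
  if (order.marketType !== constraints.marketType) {
    errors.push(`market type "${order.marketType}" != required "${constraints.marketType}"`);
  }
  if (order.size > constraints.maxSize) {
    errors.push(`size ${order.size} exceeds cap ${constraints.maxSize}`);
  }
  return { ok: errors.length === 0, errors };
}
```

The point of putting this outside the model is that it cannot be argued with: a range bet proposed against a `directional` constraint is rejected no matter how persuasive the reasoning trace is.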

A separate class of failure: the bot would occasionally trigger a "stop loss reached" outcome and halt on a position that had not moved enough to justify it. The bankroll for that stretch of testing was small — twenty dollars — and no individual trade should have produced position-level risk events. The cause was prompt-spec leakage. A comment in the system prompt about stop-loss as a general principle of trading was being interpreted by the model as a live operational gate, and the model would sometimes invoke it on its own initiative.

Both failures had the same shape. The model was, loosely, doing what it understood the instructions to ask for. The instructions were ambiguous enough that it could justify behaviour I had not authorised. Tightening the instructions reduced the rate. It did not eliminate it.

Why iteration didn't help

I went through three rounds of fixes. Each iteration tightened the prompt, added validators, and adjusted the bet-sizing curve. The win rate moved a couple of points up and down. The profit-and-loss stayed negative.

The fundamental issue was not a tuning issue. The data the model saw at decision time — orderbook, recent price history, volatility — was the same data Kalshi market makers and other algorithmic traders had been watching for longer, and they had already priced it in. A reasoning model can do well on a prompt nobody has thought about. A reasoning model does not have an information edge over an orderbook that is being continuously updated by people whose job is to price these contracts.

The third lesson, then: a strong reasoning model is not automatically a strong trader. Reasoning is about working through a question carefully. Trading edge is about knowing something, or modelling something, that your counterparty does not. A general-purpose model with public-internet context has neither. Adding "think harder" to the prompt does not produce edge from no edge.

The decommission

On 2026-04-11 I disabled all seven cron jobs in a single change and let the open positions resolve out. Total realised profit-and-loss over the experiment: negative $51.54. Total time invested, including the three iterations: somewhere around forty hours.

The infrastructure is still on disk. The bot is not coming back in its current form. If a successor emerges, it will be a human-in-the-loop ranker that proposes candidate trades, posts them to Slack with the model's reasoning, and lets a person place the bet. The autonomous loop is the part that did not work, and that is the part that came out.

What I would tell past-me

Three things, in roughly the order of how expensive they were to learn:

  1. Win rate without payout context is a vanity metric. Expected value requires the entry price, not just the directional accuracy. A 59% win rate at sixty-five-cent average entry is a money-losing strategy by construction.
  2. Prompt-spec leakage is not a tuning problem. If a model occasionally violates explicit instructions, more explicit instructions reduce frequency but rarely close the failure mode. For loops that move real money, validators outside the model are mandatory, and the right answer is often to not let the model close the loop at all.
  3. A reasoning model is not, on its own, a trading edge. Public reasoning on public information cannot, on average, beat a market that already prices public information. An LLM-driven trader needs a private input — proprietary data, faster access, a context the market cannot process — to outperform. None of those described what I built.

The cron jobs are gone. The lesson, expensively, is not.