
Running Gemma 4 31B on Cloud Run: A Performance and Cost Report

What it actually feels like to use Gemma 4 31B on Cloud Run GPUs after the deploy is done. Above Grok, below Opus, fast enough for real work, and cheap enough that the cost line stops mattering.

8 min read · by Lee Harden

The previous post was a trip report on getting Gemma 4 31B running on a Cloud Run RTX PRO 6000. This is the follow-up: what it has actually been like to use the deployment in anger over the weeks since.

The short version is that Gemma 4 on Cloud Run is the most surprising piece of infrastructure I have stood up in some time. The performance is real, the cost line is small enough to disappear into rounding, and the cold-start trade-off is more livable than the published numbers suggest. There are workloads it is not the right answer for, but the set is narrower than I expected going in.

The latency profile in practice

Three numbers describe how the deployment behaves once the wrapper around it is settled.

A basic chat completion on a short prompt lands in roughly two seconds. That includes the model's reasoning step, not just the first token. For an interactive surface — a chat client, a CLI, a notebook — the latency is noticeable but not annoying.

A tool-call emission lands in roughly half a second. Gemma 4's tool-call output is structured cleanly when vLLM is started with --enable-auto-tool-choice --tool-call-parser gemma4, and the model decides to call a tool quickly when the prompt invites it.
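
For concreteness, a tool-call request against the endpoint looks roughly like this. It is a sketch, not code lifted from the deployment: the service URL, the model id, and the get_weather tool are placeholders, and it assumes the Cloud Run service is reachable without extra auth headers (vLLM itself only enforces a key if started with --api-key).

    # Illustrative only: the URL, the model id, and the get_weather tool are
    # placeholders, not values from the actual deployment.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://<your-service>.run.app/v1",  # vLLM's OpenAI-compatible server
        api_key="unused",  # vLLM only checks a key if started with --api-key
    )

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gemma-4-31b",  # placeholder; use whatever the server lists at /v1/models
        messages=[{"role": "user", "content": "What's the weather in Lisbon right now?"}],
        tools=tools,
    )
    print(resp.choices[0].message.tool_calls)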

A multi-turn follow-up lands in roughly two-tenths of a second once the KV cache is warm. This one is the surprise. For a back-and-forth conversation, Gemma 4 31B feels closer to a small model than to a thirty-one-billion-parameter one. The KV cache is doing its job and the resulting interaction is genuinely snappy.

The unhappy number is the cold start. Twenty minutes of model load on first request after the instance has scaled to zero. That is not a typo. Streaming fifty-eight gigabytes of weights from same-region Cloud Storage onto the GPU is the slowest part of the system by an order of magnitude. Everything that follows in this post is about workloads where that one-time cost is acceptable, and what they look like.

A side-by-side test

The clearest read I have on Gemma 4 31B's quality came from a small, deliberately unscientific bake-off I ran shortly after the deployment was up. Three models, the same prompt — build a small marketing website — and three different substrates underneath:

  • Claude Opus ran through Claude CLI, which is itself a mature agent loop with shell access, file read and write, search, and the rest of the standard tool set built in.
  • Grok 4 was driven through xAI's API, wrapped in a small hand-rolled agent loop — bash, file read and write, glob, and grep — running on a host VM that could compile, serve, and inspect whatever the model produced.
  • Gemma 4 31B was driven through the vLLM endpoint described in the previous post, wrapped in the same hand-rolled loop as the Grok path.

Only the two non-Claude paths needed a custom harness; Claude CLI already is one. That asymmetry is worth naming — it means the comparison is not strictly model-to-model — but it also reflects the real-world choice an operator faces: when you reach for Claude, you reach for Claude with its harness; when you reach for an open-weights model or a raw API, you build the harness yourself.
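
For the curious, the hand-rolled loop is nothing exotic. The sketch below captures its shape (call the model, run whatever tool it asks for, feed the result back, repeat), with only a bash tool shown and the endpoint URL and model id as placeholders; the real harness also carried file read and write, glob, and grep.

    # A minimal sketch of the loop, not the exact harness from the bake-off.
    import json
    import subprocess
    from openai import OpenAI

    client = OpenAI(base_url="https://<your-service>.run.app/v1", api_key="unused")

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "bash",
            "description": "Run a shell command and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }]

    def run_tool(name, args):
        if name == "bash":
            done = subprocess.run(args["command"], shell=True, capture_output=True,
                                  text=True, timeout=120)
            return (done.stdout + done.stderr)[-4000:]  # keep tool output bounded
        return f"unknown tool: {name}"

    def agent(task, max_turns=20):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_turns):
            resp = client.chat.completions.create(
                model="gemma-4-31b", messages=messages, tools=TOOLS)
            msg = resp.choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                return msg.content  # no more tool calls: the model is done
            for call in msg.tool_calls:
                result = run_tool(call.function.name, json.loads(call.function.arguments))
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

    print(agent("Build a small marketing website in ./site."))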

The results, in the order they landed:

  1. Claude Opus produced the most complete and most visually polished site. Layout was clean, the copy was on-brand, and the project structure on disk was the kind of thing a senior front-end engineer would commit. No surprise.
  2. Gemma 4 31B produced the second-best result. The structure was sensible, the markup was clean, and the output looked like a real product page rather than a placeholder. For an open-weights model serving a thirty-one-billion-parameter checkpoint on commodity GPU hardware, this was the moment that turned the deployment from a curiosity into something I would actually reach for.
  3. Grok 4 produced the weakest result. The model is capable on many tasks, but visual web work — layout, taste, the small judgments that separate a passable landing page from an ugly one — is not where it shines. Functional output, but not the kind of output you would ship.

This is one prompt, one round, one judge. It is not a benchmark. But the gap between Gemma 4 31B and the frontier closed-weights option above it was visibly smaller than the gap between Gemma 4 31B and the closed-weights option below it, and that pattern has held up on the other tasks I have thrown at the deployment since.

Where Gemma 4 31B is most impressive is on focused tasks. Code generation, structured extraction from text, and tool selection are all clean. The model reaches for the right tool, fills in the right arguments, and rarely emits the kind of low-grade hallucination that smaller open-weights models still tend to produce.
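
A structured-extraction call, for instance, is only a few lines. The sketch below assumes the endpoint honors the OpenAI-style JSON response format; the model id, file name, and field names are illustrative.

    # Illustrative structured extraction; assumes the endpoint accepts the
    # OpenAI-style JSON response_format. Model id and file name are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="https://<your-service>.run.app/v1", api_key="unused")

    text = open("invoice.txt").read()
    resp = client.chat.completions.create(
        model="gemma-4-31b",
        messages=[{"role": "user", "content":
            "Extract vendor, invoice_number, and total (a number) from the text below. "
            "Reply with a single JSON object using exactly those keys.\n\n" + text}],
        response_format={"type": "json_object"},
    )
    record = json.loads(resp.choices[0].message.content)
    print(record["vendor"], record["total"])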

Where it lags the frontier is on long-horizon planning and on tasks that require holding many constraints in mind at once. For an agent that needs to plan a multi-step refactor across a large codebase, the frontier closed-weights models still feel a level above. For a task you can describe in a paragraph and verify in a function call, Gemma 4 31B is more than enough.

What it costs

This is the section that makes the architecture worth thinking about.

The GPU instance bills at roughly a tenth of a cent per second while warm. That works out to approximately three to four dollars per active hour at the time of writing — Cloud Run GPU pricing moves, so confirm against the current rate before quoting it. The Cloud Storage bucket holding the fifty-eight gigabytes of weights bills at about a dollar fifty per month. There is no other persistent infrastructure cost.

Because Cloud Run scales to zero, the instance only bills while a request is in flight or the instance is in its idle keep-alive window. For a workload that fires intermittently — an internal assistant, a nightly evaluation run, an offline batch job — the monthly bill is dominated by the few minutes per day the instance is actually up. A tool that handles a few dozen requests a day, each finishing in seconds, costs single-digit dollars per month to keep a 31B-parameter model behind it.
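
The arithmetic behind that claim is short enough to write down. The rates and usage pattern below are illustrative, not billing data from this deployment:

    # Back-of-envelope only; re-check the per-second rate against current
    # Cloud Run GPU pricing before relying on it.
    gpu_rate_per_second = 0.001   # roughly a tenth of a cent per warm second
    requests_per_day = 40         # "a few dozen requests a day"
    seconds_per_request = 5       # each request finishes in seconds

    busy_seconds_per_day = requests_per_day * seconds_per_request
    monthly_gpu = busy_seconds_per_day * gpu_rate_per_second * 30
    monthly_storage = 1.50        # the 58 GB weights bucket

    # Idle keep-alive time and any cold starts bill on top of this, so treat
    # the number as a floor rather than a forecast.
    print(f"~${monthly_gpu + monthly_storage:.2f}/month")  # about $7.50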

For comparison, the cheapest sustained API alternatives in the same quality band cost more than that for one hour of equivalent throughput. The economics of the scale-to-zero pattern matter to the order of magnitude, not to the second decimal place.

Where this configuration shines

Three workload shapes are particularly well served by this deployment.

Batch processing. The cold-start cost is amortized across the work the instance does once it is warm. Spin up, churn through a thousand documents, classify them, extract from them, summarize them, then let the instance scale back down. The total cost is the warm time, which is short relative to the work done. For overnight evaluation harnesses, periodic content pipelines, and any work that does not care about first-token latency, this is the right tool.
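
The shape of that job is a plain loop with a little concurrency. The sketch below is illustrative (endpoint, model id, prompt, and corpus path are placeholders), and it deliberately keeps the pool small because a single instance only handles a handful of requests at once:

    # Sketch of the batch shape. Endpoint, model id, prompt, and corpus path
    # are placeholders; the pool is kept small on purpose.
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI(base_url="https://<your-service>.run.app/v1", api_key="unused")

    def classify(path):
        text = path.read_text()[:8000]
        resp = client.chat.completions.create(
            model="gemma-4-31b",
            messages=[{"role": "user", "content":
                "Classify this document as invoice, contract, or other:\n\n" + text}],
        )
        return path.name, resp.choices[0].message.content.strip()

    docs = sorted(Path("corpus").glob("*.txt"))
    with ThreadPoolExecutor(max_workers=4) as pool:  # the first requests wait out the cold start
        for name, label in pool.map(classify, docs):
            print(name, label)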

Internal tools and assistants. A tool that a small team uses throughout the working day pays the cold-start once in the morning, then runs warm-cheap for the rest of the day. The latency profile is fine for a chat assistant or a CLI helper. The cost is small enough that no one needs to negotiate over it.

Comparison harnesses. Standing the same prompt set against multiple models becomes feasible when one of the models costs cents per session to run. Because vLLM exposes an OpenAI-compatible endpoint, the deployment slots into the same client that talks to closed-weights APIs — the comparison code is the same code path with a different base URL.
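
In practice that means the harness holds one client per backend and nothing else changes. A sketch, with every URL, key, and model id as a placeholder:

    # One client per backend; everything downstream is identical.
    # All URLs, keys, and model ids here are placeholders.
    from openai import OpenAI

    backends = {
        "gemma-cloud-run": (OpenAI(base_url="https://<your-service>.run.app/v1",
                                   api_key="unused"), "gemma-4-31b"),
        "hosted-api":      (OpenAI(base_url="https://<provider>/v1",
                                   api_key="<key>"), "<hosted-model-id>"),
    }

    prompt = [{"role": "user", "content": "Build a small marketing website."}]
    for name, (client, model) in backends.items():
        out = client.chat.completions.create(model=model, messages=prompt)
        print(f"--- {name} ---\n{out.choices[0].message.content[:300]}")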

Where it falters

Two cases push back on the configuration.

Cold first-token latency. A consumer-facing API where the user expects first-token in under five seconds from a cold queue cannot tolerate a twenty-minute startup. The fix is to set min-instances above zero, which keeps an instance permanently warm — and the bill goes from "negligible" to "a real line item" the moment it does. The scale-to-zero economics are the whole point of the configuration; abandoning them changes the calculation.

Sustained high concurrency. A single instance, configured to fit a 31B model into the GPU memory, comfortably handles a handful of concurrent requests. For sustained tens of queries per second or higher, a horizontal scaling story on a real Kubernetes cluster, or a fully managed inference endpoint product, is the right tool. The Cloud Run shape is for spiky, intermittent, or batch loads, not for steady high-throughput serving.

For everything else — and everything else is most internal AI workloads — the configuration is a quietly excellent default.

What I would tell someone considering this

Three things are worth knowing before committing to the pattern.

The cold-start length is shocking the first time and routine the second time. Plan for it explicitly. If the workload can tolerate twenty minutes once, then never again until the instance idles back down, the pattern works. If it cannot, do not start.

The quality of an open-weights 31B model in 2026 is genuinely in the conversation with the frontier. It is not better, but it is close enough that the cost differential makes it the right answer for a much wider set of jobs than the frontier-or-nothing framing suggests.

The Cloud Run scale-to-zero economics are the actual product. Without them, this is a normal GPU deployment competing with every other GPU deployment, and the open-weights model is just one option among many. With them, it is something different: a 31B model that costs nothing when no one is using it, and costs almost nothing when someone is.

That is an unusual primitive. It is worth designing workloads around.