The idle GPU tax: What it is, why it’s getting worse, and how you can fix it

The "secure capacity at all costs" mindset of the last few years forced AI companies to accept years-long depreciation cycles and fixed GPU-hour contracts as the cost of doing business. But when your inference bill doesn't flex with actual usage, you end up paying a premium for hardware that’s sitting idle.

Underutilized capacity is a big problem. A recent enterprise infrastructure audit estimates that average GPU utilization across the tech industry sits at just 5%, meaning 95% of paid-for capacity is idle waste. Infrastructure that looked like a necessary investment two years ago is now a fixed cost depreciating on a schedule, disconnected from the actual output it produces.

As inference becomes a top P&L line item, AI leaders are moving from just asking "how do we get more GPUs?" to also asking "how do we maximize the economic output of what we already have?" This post will help you answer that question.

What is the idle GPU tax? 

The idle GPU tax, also called the AI compute tax or model parking tax, is the power and cost you pay to keep a model loaded in GPU memory and ready to serve, even when zero tokens are being processed. It's the gap between the GPU-hours you're paying for and the productive output you're getting. 

When you run a dedicated inference endpoint, that idle cost is baked in as the price of keeping the model warm and avoiding cold-start latency. The more your GPU is in use, the less idle tax you pay. At 100% utilization around the clock, you'd avoid it altogether. But that's not the reality for most applications.

Take a meeting notes application where traffic peaks during workdays and flatlines overnight. On a dedicated endpoint, a GPU running at 10% utilization costs the same per hour as one running at 100%. That means your effective cost per token is 10x what you'd pay at full utilization. This is the result of a billing structure that charges for availability, not inference. 
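The 10x figure follows directly from dividing a fixed hourly price by fewer tokens. Here is a minimal sketch of that arithmetic; the hourly rate and peak throughput are made-up numbers chosen only to illustrate the ratio, not real pricing.

```python
def effective_cost_per_token(hourly_rate, peak_tokens_per_hour, utilization):
    """Cost per token actually produced on a GPU that is billed by the hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_produced = peak_tokens_per_hour * utilization
    return hourly_rate / tokens_produced

# Hypothetical numbers (assumptions, not quoted prices).
HOURLY_RATE = 3.00                # dollars per GPU-hour
PEAK_TOKENS_PER_HOUR = 3_600_000  # tokens per hour at full load

full = effective_cost_per_token(HOURLY_RATE, PEAK_TOKENS_PER_HOUR, 1.0)
low = effective_cost_per_token(HOURLY_RATE, PEAK_TOKENS_PER_HOUR, 0.10)
print(f"cost multiplier at 10% utilization: {low / full:.1f}x")
```

The multiplier is simply 1 / utilization, so it does not depend on the specific rate or throughput you plug in.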

How much does an idle GPU cost? 

The energy cost of keeping a model "parked" and ready to go was recently quantified in research around The Model Parking Tax.

The study ran 18 days of production telemetry across 335,267 samples on 14 H100 GPUs, combined with controlled experiments on three architectures: the H100 (HBM3), A100 (HBM2e), and L40S (GDDR6).

Here’s what it found: 

Cost comes from the CUDA context, not the model weights. When a model is loaded and ready to serve, the GPU initializes a CUDA context. That alone forces the GPU's streaming multiprocessor clocks to jump to maximum boost frequency and stay there, even at 0% utilization. The power jump from that DVFS transition runs 26–50W over bare idle on HBM architectures like the H100 and A100, and 66W on GDDR6 architectures like the L40S.

Evicting weights doesn't help. The intuitive fix is clearing VRAM when traffic is low. Unfortunately, that doesn't move the needle. The marginal power effect of VRAM allocation is less than 0.02W per GB, too small to be meaningful. The CUDA context accounts for more than 98% of the idle GPU tax, regardless of memory occupancy.

The idle GPU tax is a fixed, binary cost tied to keeping a CUDA context alive, not to memory usage or workload volume. You pay an energy overhead of 26–66W the instant a model is loaded, whether it's processing 1,000 tokens/second or zero, and there's no way to reduce it through memory management.

The only lever left is architectural: either keep the GPU busier (raise utilization) or stop holding the context continuously when it's not needed.
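To put the 26–66W overhead in energy terms, here is a rough sketch that converts it to kilowatt-hours. It assumes the CUDA context stays alive for a full year on one GPU, which is an illustrative assumption rather than a figure from the study.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def parked_energy_kwh(overhead_watts, gpus, hours):
    """Energy drawn by the idle overhead alone, in kilowatt-hours."""
    return overhead_watts * gpus * hours / 1000

# Low and high end of the overhead range reported above.
for watts in (26, 66):
    kwh = parked_energy_kwh(watts, gpus=1, hours=HOURS_PER_YEAR)
    print(f"{watts} W overhead on 1 GPU for a year: {kwh:.0f} kWh")
```

Under that assumption the overhead works out to roughly 228 to 578 kWh per GPU per year, before any tokens are processed. Multiply by fleet size and your local electricity price to estimate the cost.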

How to reduce idle inference costs

There are several techniques for reducing inference waste. Each one helps, but each comes with its own limitations.

Disaggregated serving 

Disaggregated serving splits the two distinct phases of inference, prefill and decode, onto dedicated, purpose-optimized hardware pools:

  1. Prefill is bound by compute. The GPU processes the input prompt in a single forward pass. 
  2. Decode is bound by memory bandwidth. The GPU generates output tokens one at a time, autoregressively, with each step requiring a read of the full KV cache.

These two phases have fundamentally different resource requirements, and forcing a single GPU to handle both sequentially means it's operating suboptimally during each. Splitting them onto separate pools reduces cross-phase idle time and improves hardware efficiency. 
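One way to see why the two phases differ is a back-of-the-envelope arithmetic intensity estimate (FLOPs per byte of weights read). The sketch below is a simplification: it assumes a dense model, roughly 2 FLOPs per parameter per token, 16-bit weights read once per forward pass, and it ignores attention and KV-cache traffic. The ridge point of 300 FLOPs per byte is an assumed round number for illustration, not a hardware spec.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """Approximate FLOPs per byte of weights read in one forward pass.

    FLOPs ~= 2 * params * tokens; bytes ~= params * bytes_per_param.
    The parameter count cancels out of the ratio.
    """
    return 2 * tokens_per_pass / bytes_per_param

def likely_bottleneck(tokens_per_pass, ridge_flops_per_byte=300.0):
    """Classify a pass against an assumed hardware ridge point."""
    if arithmetic_intensity(tokens_per_pass) >= ridge_flops_per_byte:
        return "compute"
    return "memory bandwidth"

# Prefill: a 2,048-token prompt goes through in one pass.
print("prefill:", arithmetic_intensity(2048), likely_bottleneck(2048))
# Decode: one new token per pass for a single sequence.
print("decode: ", arithmetic_intensity(1), likely_bottleneck(1))
```

In this simplified model, prefill does thousands of FLOPs for every byte of weights it reads, while single-sequence decode does about one, which is why the same GPU is poorly matched to at least one of the phases.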

The limitation: You still hold the hardware for both pools, and you still hold the CUDA context on both. The idle GPU tax scales with the fleet, not with the tokens you produce.

For a deeper technical breakdown of why prefill and decode have such different resource profiles, Julia Turc's walkthrough of the memory wall in LLM inference is worth watching.

In-flight batching and PagedAttention

In-flight batching, as implemented in engines like vLLM (where it is called continuous batching) and NVIDIA TensorRT-LLM served through Triton, packs multiple requests dynamically into the same inference pass rather than processing them sequentially. Instead of a GPU sitting idle between individual requests, it handles several concurrently, improving throughput.
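The scheduling idea can be illustrated with a toy simulation that counts decode steps. This is a sketch under simplifying assumptions (prefill cost ignored, every decode step costs the same, all requests queued up front) and is not how any particular engine implements its scheduler; the function names are hypothetical.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps

def in_flight_batching_steps(output_lengths, batch_size):
    """A slot freed by a finished request is refilled before the next step."""
    queue = deque(output_lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step emits one token for every active request.
        steps += 1
        # Requests with one token left have just finished; drop them.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # output tokens per request
print("static:   ", static_batching_steps(lengths, batch_size=4))
print("in-flight:", in_flight_batching_steps(lengths, batch_size=4))
```

With this workload, static batching needs 16 decode steps because short requests wait on the longest one in their batch, while the in-flight version finishes in 10.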

PagedAttention improves VRAM efficiency by managing the key-value cache in non-contiguous memory blocks, allowing more model context to fit on a single GPU. Both techniques push utilization up from the floor and are worth deploying on any production inference stack. 
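The non-contiguous block idea behind PagedAttention can be sketched with a toy allocator. The class below is a hypothetical illustration of a per-sequence block table, not vLLM's implementation: it tracks only block ids, stores no tensors, and uses a tiny block size for readability.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new block only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for seq_id in ["a", "b", "a", "b", "a"]:  # two sequences decoding in turn
    cache.append_token(seq_id)
print(cache.block_tables)  # sequence "a" ends up in blocks 0 and 2
```

Because blocks are handed out on demand, no sequence reserves memory for its maximum possible length, and a finished sequence's blocks can be reused immediately.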

The limitation: When traffic is low and there are no requests to batch, the CUDA context still stays live, meaning the idle GPU tax keeps running. Batching improves the economics of a busy GPU, but doesn’t address the economics of an underutilized one. 

KV-cache management and speculative decoding

KV-cache sharing reduces memory footprint by reusing cached computations across requests with shared prefixes, allowing more concurrent requests to fit on a single GPU without memory thrashing.
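As a rough illustration of prefix reuse, the sketch below counts how many prompt tokens still need prefill once the longest cached prefix is reused. It is a toy: real engines typically hash fixed-size blocks of tokens rather than storing every prefix, and the function name and token ids here are hypothetical.

```python
def tokens_to_prefill(prompt, cache):
    """Return how many tokens need prefill after reusing the longest cached prefix.

    `cache` is a set of token tuples standing in for stored KV entries.
    """
    reused = 0
    for end in range(len(prompt), 0, -1):
        if tuple(prompt[:end]) in cache:
            reused = end
            break
    # Register every prefix of this prompt so later requests can reuse it.
    for end in range(1, len(prompt) + 1):
        cache.add(tuple(prompt[:end]))
    return len(prompt) - reused

cache = set()
system_prompt = [1, 2, 3, 4]    # shared instructions, as token ids
first = system_prompt + [10, 11]
second = system_prompt + [20]

print(tokens_to_prefill(first, cache))   # nothing cached yet
print(tokens_to_prefill(second, cache))  # shared prefix is reused
```

The first request pays for all six tokens; the second only pays for the one token that follows the shared system prompt.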

Speculative decoding uses a smaller draft model to propose multiple tokens ahead, which the main model then verifies in parallel, accelerating output generation without changing model quality. Both are meaningful throughput improvements. 
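The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is the greedy variant, where a proposed token is accepted only if it matches what the main model would have picked; `target_next` and `draft_next` are hypothetical functions that return the next token for a context, and a real engine would run the verification as one batched forward pass rather than a Python loop.

```python
def speculative_step(target_next, draft_next, context, k):
    """One round of greedy speculative decoding; returns the tokens emitted."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks each proposal in order.
    accepted = []
    ctx = list(context)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            # First mismatch: emit the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched, so the target's next token comes for free.
    accepted.append(target_next(ctx))
    return accepted

# Toy models: the target counts upward mod 10; the draft errs after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step(target, draft, context=[0], k=4))
```

The round emits four tokens, and they are exactly the tokens the target would have generated on its own, which is why output quality is unchanged in the greedy case.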

The limitation: These techniques make an always-on GPU fleet more efficient, but they don't change your billing when traffic dips. You're still paying for GPU-hours whether your hardware is busy or not.

The structural fix: Only pay for the tokens you use

The pattern across all of these approaches is consistent. Each one reduces the idle tax at the margin. None of them eliminate it, because none of them change the way inference is billed. 

When you pay by the GPU-hour, you pay whether or not a token is being processed. Every optimization technique above reduces waste within that structure. But none of them change the unit of exchange. The GPU-hour is still the meter, and it runs continuously.

The only way to structurally eliminate the idle tax is to decouple the billing from hardware occupancy. When you pay per token produced rather than per GPU-hour, you bypass the idle GPU tax completely. You don't pay for off-peak readiness, and the cost is directly proportional to output.
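To make the difference between the two meters concrete, here is a small sketch that bills the same day of traffic both ways. Every number in it is hypothetical (the hourly rate, the per-token price, the peak throughput, and the traffic shape), so treat the result as an illustration of the mechanics rather than a price comparison.

```python
# Hypothetical inputs (assumptions, not real pricing).
HOURLY_RATE = 3.00                # dollars per GPU-hour
PRICE_PER_MILLION_TOKENS = 2.00   # dollars per 1M tokens
PEAK_TOKENS_PER_HOUR = 3_600_000  # tokens per hour at full load

# Utilization for each hour of a day: quiet night, busy workday, quiet evening.
hourly_utilization = [0.02] * 8 + [0.60] * 8 + [0.02] * 8

def gpu_hour_bill(hourly_rate, hours):
    """The meter runs every hour, regardless of traffic."""
    return hourly_rate * hours

def per_token_bill(price_per_million, utilization_by_hour, peak_tokens_per_hour):
    """The meter only advances when tokens are produced."""
    tokens = sum(u * peak_tokens_per_hour for u in utilization_by_hour)
    return price_per_million * tokens / 1_000_000

reserved = gpu_hour_bill(HOURLY_RATE, len(hourly_utilization))
metered = per_token_bill(
    PRICE_PER_MILLION_TOKENS, hourly_utilization, PEAK_TOKENS_PER_HOUR
)
print(f"GPU-hour billing:  ${reserved:.2f} per day")
print(f"per-token billing: ${metered:.2f} per day")
```

The GPU-hour bill is the same whatever the traffic looks like, while the per-token bill shrinks to zero as traffic does. Which one is cheaper for a given workload depends on the actual prices and how high sustained utilization is.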

Get dedicated endpoints on usage-based billing 

The shift AI decision-makers need to make is from securing capacity to maximizing the economic output of their GPU fleet. But the fix isn't just on individual companies. The AI industry itself needs to change how billing is structured. 

Billing should be based on token output, not GPU-hours. When cost tracks directly with what the infrastructure produces, the incentives align around efficiency and productivity rather than renting out space. 

The reason most inference providers can't offer this is their rigid infrastructure. A GPU reserved for one customer has nowhere to go when traffic drops. The idle cost has to land somewhere, so it lands on your bill.

Parasail's fleet is built around agility. Dedicated and shared workloads run on one fluid pool of compute, with traffic routed between them in real time. When a dedicated endpoint goes quiet, that GPU capacity shifts to shared traffic instead of sitting idle. 

That's what keeps our fleet utilization at ~95% and makes per-token billing on dedicated hardware possible. You get the performance of a dedicated endpoint and pay for what you actually produce, not what you reserve.

Curious about our dedicated endpoints on per-token pricing? Contact us to learn more.