The idle GPU tax: What it is, why it’s getting worse, and how you can fix it
Learn what the idle GPU tax is, what it costs, and how usage-based billing on dedicated endpoints helps you avoid it altogether.
Engineering
How to choose the right managed inference architecture: Serverless, dedicated, dedicated serverless, or batch
Use this decision framework to choose the right managed inference mode based on latency requirements, GPU breakeven utilization, and whether your workload needs a dedicated endpoint.
Product
Serverless vs. Dedicated Inference: Why We Built Dedicated Serverless
With dedicated serverless, you get dedicated hardware on per-token pricing, with no idle-hour charges or long-term GPU commitment.
Product
Making an EAGLE fly: How We Got 2.6x Faster LLM Inference (Without Cheating)
We trained a custom EAGLE-3 speculative decoding head for OLMo-3.1-32B-Think and got 2.6x faster inference.
Engineering