GPU time is the most expensive thing in an AI startup's infrastructure. Here's exactly where it gets wasted — and how we got it back.
A fintech startup came to us with a problem. They had two models running in production — a fraud detection model and an LLM powering their customer assistant — plus a fine-tuning pipeline that ran weekly. Their AWS bill had crept past $18,000 a month, and nobody was quite sure why.
The answer, almost entirely, was GPU waste. Not storage. Not EC2 web servers. GPUs — either the wrong size, running at the wrong time, or doing work that a managed service could have done for a fraction of the cost.
First: understand your AI cost map
AI workloads split into two fundamentally different cost shapes. Training (including fine-tuning) is a burst workload: you need a lot of GPU for a few hours, then nothing. Inference is the opposite: steady, always-on, and billed whether you have traffic or not.
Most cost waste happens because teams treat both the same way. They run burst training jobs on always-on GPU instances, and they serve inference from always-on instances far bigger than their traffic needs. The fix is different for each.
The two workload shapes have different cost patterns, and different fixes.
Fix 1: Use spot instances for all training and fine-tuning
The startup was running their weekly fine-tuning jobs on on-demand p3.8xlarge instances at $12.24/hr. The same job on a spot instance costs roughly $3–4/hr, a drop of around 70% for identical compute.
Spot instances are AWS's way of selling spare GPU capacity cheaply. The catch: AWS can reclaim them with two minutes' notice. For training jobs, that sounds scary. In practice, it's completely manageable if you do one thing: checkpoint frequently.
SageMaker managed training jobs handle this natively. You set a checkpoint S3 path, enable spot training with one parameter, and SageMaker does the rest — including automatic retry on interruption. The startup's fine-tuning job went from ~$180/run to ~$55/run. It runs weekly, so that's $500/month saved from one setting change.
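Here's roughly what that looks like with the SageMaker Python SDK. A minimal sketch, assuming a PyTorch fine-tuning script; the entry point, role ARN, and S3 paths are placeholders:

```python
# Minimal sketch of a managed spot training job (SageMaker Python SDK).
# Entry point, role, and S3 paths below are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="finetune.py",                        # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.p3.8xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,                          # the one-parameter switch
    max_run=3 * 3600,                                 # cap on actual training time (seconds)
    max_wait=6 * 3600,                                # training time plus time to wait for spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # SageMaker syncs /opt/ml/checkpoints here
)
estimator.fit({"training": "s3://my-bucket/data/"})
```

If the spot capacity is reclaimed, SageMaker retries the job and restores the checkpoint files to /opt/ml/checkpoints, so the training script just needs to save there frequently and resume from the latest checkpoint on startup.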
Spot is not right for everything, though. The rule is simple: if your workload serves real users in real time, spot is too risky. If it can pause and resume, spot is almost always worth it.
The decision is simple: interruptible workloads go on spot. Real-time workloads stay on-demand.
Fix 2: Right-size your GPU instances for inference
This was one of the biggest savings. The team was serving their fraud detection model, a fine-tuned 7B-parameter model, on a g5.12xlarge. That instance has four GPUs and 96GB of GPU memory. Their model needed about 14GB. They were paying for four GPUs and using one.
GPU memory is what matters for model serving, not raw compute. A 7B model in 16-bit precision takes roughly 14GB of VRAM. A g5.xlarge has 24GB and costs about $1/hr. A g5.12xlarge costs $5.67/hr. Same model, same latency, 80% cost difference.
The matching logic is straightforward: take your model's parameter count, multiply by 2 (for 16-bit weights), and that's your minimum VRAM requirement. Add 20% headroom for the KV cache during inference. Then pick the smallest instance that clears that number.
Rule of thumb: model parameters × 2 bytes = minimum VRAM. Add 20% for inference overhead. Pick the smallest instance that fits.
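The arithmetic is simple enough to script as a sanity check. A minimal sketch:

```python
def min_vram_gb(params_billion: float, bytes_per_param: int = 2, headroom: float = 0.20) -> float:
    """Minimum serving VRAM: weights in 16-bit precision plus KV-cache headroom."""
    return params_billion * bytes_per_param * (1 + headroom)

print(min_vram_gb(7))  # 7B model -> 16.8 GB: fits a 24GB g5.xlarge, nowhere near needing 96GB
```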
Fix 3: Stop hosting models you don't need to host
The team's LLM — the customer assistant — was a large general-purpose model running on a dedicated g5.48xlarge. That instance runs at around $16/hr, 24 hours a day, seven days a week. That's $11,500/month before a single user query.
The question we asked: why are you hosting this yourself? The honest answer was "we set it up early and never questioned it." There was no fine-tuning, no custom weights, no proprietary data in the model. It was a stock model anyone can access via Bedrock.
We moved it to Bedrock. Cost went from $11,500/month (always-on GPU) to roughly $1,800/month (pay-per-token at their actual query volume). The model quality was identical — it was the same underlying model. The only thing that changed was who managed the GPU.
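The migration itself is small: the whole self-hosted serving stack reduces to an API call. A minimal sketch using boto3's Bedrock runtime client; the model ID shown is just an example, not necessarily the model they were running:

```python
# Minimal sketch: querying a stock model on Bedrock instead of a self-hosted endpoint.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID, swap for yours
    messages=[{"role": "user", "content": [{"text": "Summarise my last three transactions."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```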
Self-hosting makes sense when you have custom weights or strict data residency requirements. For standard models, managed services are almost always cheaper.
The managed service decision at a glance
| Scenario | Self-host | Bedrock / Managed |
|---|---|---|
| Stock model, no fine-tuning | Wasteful | Use this |
| Fine-tuned model, proprietary weights | Makes sense | Limited support |
| Low to moderate query volume | Idle GPU billing | Pay only for use |
| Very high query volume (millions/day) | May be cheaper | Run the maths |
| Strict data residency / air-gap required | Use this | May not qualify |
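For that last row, the maths is a back-of-envelope calculation. A sketch with illustrative numbers; the per-token rate and tokens-per-query figures are placeholders, not actual Bedrock pricing:

```python
# Break-even point between an always-on GPU endpoint and pay-per-token pricing.
# All rates below are illustrative placeholders; plug in your own.
gpu_cost_per_month = 16.29 * 730      # e.g. a g5.48xlarge running 24/7

price_per_1k_tokens = 0.004           # hypothetical blended input/output rate
tokens_per_query = 1_500              # prompt + completion, assumed average

cost_per_query = tokens_per_query / 1_000 * price_per_1k_tokens
breakeven = gpu_cost_per_month / cost_per_query
print(f"Self-hosting wins above ~{breakeven:,.0f} queries/month")  # ~2.0M at these rates
```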
The full result

| Fix | Before | After | Approx. monthly saving |
|---|---|---|---|
| Spot for weekly fine-tuning | ~$180/run on-demand | ~$55/run on spot | ~$500 |
| Right-sized inference (g5.12xlarge to g5.xlarge) | ~$4,100/month | ~$730/month | ~$3,400 |
| Bedrock instead of a self-hosted g5.48xlarge | ~$11,500/month | ~$1,800/month | ~$9,700 |

All told, roughly $13,600/month came off a bill that had crept past $18,000, with no change to model quality or latency.
The playbook, in four questions
- Is your training job running on on-demand GPU? If so, does it checkpoint? If yes — switch to spot today.
- What is your model's VRAM requirement? Is your current inference instance within 30% of that? If not, downsize.
- Are you self-hosting a stock, unmodified model? If there's no fine-tuning, Bedrock is almost certainly cheaper.
- Do you have non-production endpoints (dev, staging, testing) running 24/7? Replace them with SageMaker Serverless Inference or shut them down outside working hours; a minimal sketch follows below.
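For that last question, redeploying a dev or staging endpoint onto SageMaker Serverless Inference looks roughly like this. The container image, model artifact, and role are placeholders; note that serverless endpoints are CPU-only, so this suits smaller models, and GPU dev endpoints are better shut down on a schedule instead:

```python
# Minimal sketch: a dev/staging endpoint on SageMaker Serverless Inference,
# billed per request instead of 24/7. All names below are placeholders.
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<inference-container-image>",
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
)

model.deploy(
    endpoint_name="fraud-model-staging",
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,  # 1024-6144, in 1GB increments
        max_concurrency=5,       # cap on concurrent invocations
    ),
)
```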