GPU time is the most expensive thing in an AI startup's infrastructure. Here's exactly where it gets wasted — and how we got it back.

A fintech startup came to us with a problem. They had two models running in production — a fraud detection model and an LLM powering their customer assistant — plus a fine-tuning pipeline that ran weekly. Their AWS bill had crept past $18,000 a month, and nobody was quite sure why.

The answer, almost entirely, was GPU waste. Not storage. Not EC2 web servers. GPUs — either the wrong size, running at the wrong time, or doing work that a managed service could have done for a fraction of the cost.

  • Monthly AI infra spend before: $18,400 (GPU instances, training jobs, endpoints)
  • Monthly AI infra spend after: $12,000 (same models, same latency, same outputs)
  • Saving per month: $6,400 ($76,800 a year back into the product)
  • Time to first saving: 48 hrs (spot instances for training, the fastest win)

First: understand your AI cost map

AI workloads split into two fundamentally different cost shapes. Training (including fine-tuning) is a burst workload — you need a lot of GPU for a few hours, then nothing. Inference is the opposite — steady, always-on, and billing whether you have traffic or not.

Most cost waste happens because teams treat both the same way: they run burst training jobs on always-on GPU instances, and they leave the always-on instances that serve inference sitting half idle. The fix is different for each.

Training / fine-tuning: GPU spot instances (p3 / p4d) are the biggest cost, alongside SageMaker managed jobs, S3 checkpoints for model state, and the data pipeline (ECR, preprocessing jobs).
Inference / serving: on-demand GPU (g5 / inf2) is always-on and always billing, alongside the Bedrock API (pay per token), SageMaker managed endpoints, and a caching layer (prompt / KV cache).

The two zones have different cost patterns — and different fixes.


Fix 1: Use spot instances for all training and fine-tuning

The startup was running their weekly fine-tuning jobs on on-demand p3.8xlarge instances. That's about $12.24/hr. The same job on a spot instance costs roughly $3–4/hr — a 70% drop for identical compute.

Spot instances are AWS's way of selling spare GPU capacity cheaply. The catch: AWS can reclaim them with two minutes' notice. For training jobs, that sounds scary. In practice, it's completely manageable if you do one thing: checkpoint frequently.

Checkpointing in plain terms: every 15–30 minutes, your training job saves its current state to S3. If the instance gets interrupted, SageMaker restarts the job from the last checkpoint — not from scratch. You lose at most 30 minutes of compute, not the whole run.

SageMaker managed training jobs handle this natively. You set a checkpoint S3 path, enable spot training with one parameter, and SageMaker does the rest — including automatic retry on interruption. The startup's fine-tuning job went from ~$180/run to ~$55/run. It runs weekly, so that's $500/month saved from one setting change.
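A minimal sketch of that configuration with the SageMaker Python SDK; the script name, IAM role, and S3 paths below are placeholders:

```python
from sagemaker.pytorch import PyTorch

# Managed spot training: SageMaker uses spare capacity, retries on interruption,
# and resumes from the last checkpoint synced to S3.
estimator = PyTorch(
    entry_point="finetune.py",                             # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder IAM role
    instance_type="ml.p3.8xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,                  # the one-parameter switch to spot
    max_run=3 * 60 * 60,                      # max training time, in seconds
    max_wait=6 * 60 * 60,                     # max total time incl. waiting for spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",   # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",       # your script must save/load here
)

estimator.fit({"train": "s3://my-bucket/train-data/"})   # placeholder dataset path
```

The only requirement on your side is that the training script writes checkpoints to the local checkpoint path and resumes from one if it exists at startup.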

Spot is not right for everything though. The rule is simple: if your workload serves real users in real time, spot is too risky. If it can pause and resume, spot is almost always worth it.

A GPU workload is needed. Can it survive interruption (checkpoint + resume support)?
  • Yes: spot GPU instances, up to 90% cheaper. Fits training, fine-tuning, batch inference, and offline scoring.
  • No: on-demand GPU, predictable and always-on. Fits live inference, user-facing APIs, and low-latency endpoints with sub-100ms SLAs.

The decision is simple: interruptible workloads go on spot. Real-time workloads stay on-demand.


Fix 2: Right-size your GPU instances for inference

This was the single biggest saving. The team was serving their fraud detection model — a fine-tuned 7B parameter model — on a g5.12xlarge. That instance has 4 GPUs and 96GB of GPU memory. Their model needed about 14GB. They were paying for four GPUs and using one.

GPU memory is what matters for model serving, not raw compute. A 7B model in 16-bit precision takes roughly 14GB of VRAM. A g5.xlarge has 24GB and costs about $1/hr. A g5.12xlarge costs $5.67/hr. Same model, same latency, 80% cost difference.

The trap: teams pick larger instances "for safety" at launch and never revisit. Six months later the model hasn't grown but the instance size is locked in because nobody wants to touch prod. Schedule a GPU utilisation review every quarter.
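One way to make that review concrete is to pull the endpoint's GPU memory utilisation from CloudWatch. A rough sketch with boto3; the endpoint and variant names are placeholders for whatever you actually run:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# SageMaker endpoints publish instance metrics, including GPU memory utilisation,
# to the /aws/sagemaker/Endpoints namespace.
resp = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="GPUMemoryUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "fraud-model-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=30),
    EndTime=datetime.now(timezone.utc),
    Period=86400,                        # one datapoint per day
    Statistics=["Average", "Maximum"],
)

# If the maximum never gets near capacity, the instance is a downsizing candidate.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), f"avg {point['Average']:.0f}%", f"max {point['Maximum']:.0f}%")
```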

The matching logic is straightforward: take your model's parameter count, multiply by 2 (for 16-bit weights), and that's your minimum VRAM requirement. Add 20% headroom for the KV cache during inference. Then pick the smallest instance that clears that number.

Model size | Right-sized instance | If over-provisioned
7B (~14GB VRAM needed) | g5.xlarge (24GB), ~$1.00/hr on-demand | p4d.24xlarge at $32/hr, 32x waste
13–34B (~28–70GB VRAM) | g5.12xlarge or inf2, ~$5–8/hr with good throughput | p4d cluster, idle GPU memory billed
70B+ (140GB+ VRAM needed) | p4d, or Bedrock to skip the GPU entirely | Multi-node cluster, $100s/hr if unnecessary

Rule of thumb: model parameters × 2 bytes = minimum VRAM. Add 20% for inference overhead. Pick the smallest instance that fits.
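That rule of thumb is simple enough to keep as a small helper. A sketch in Python; the instance list and memory figures below are illustrative, not exhaustive:

```python
# Rough VRAM sizing: 2 bytes per parameter at 16-bit precision, plus ~20% headroom
# for the KV cache and runtime overhead.
def min_vram_gb(params_billion: float, headroom: float = 0.20) -> float:
    weights_gb = params_billion * 2            # 2 GB per billion params at fp16/bf16
    return weights_gb * (1 + headroom)

# Illustrative total GPU memory (GB) for a few AWS instance types.
INSTANCE_VRAM_GB = {
    "g5.xlarge": 24,       # 1x A10G
    "g5.12xlarge": 96,     # 4x A10G
    "p4d.24xlarge": 320,   # 8x A100 40GB
}

def smallest_fit(params_billion: float):
    """Return the cheapest-fitting instance name, or None if nothing clears the bar."""
    need = min_vram_gb(params_billion)
    fits = [(vram, name) for name, vram in INSTANCE_VRAM_GB.items() if vram >= need]
    return min(fits)[1] if fits else None

print(min_vram_gb(7))     # ~16.8 GB, so a 24GB g5.xlarge clears it
print(smallest_fit(7))    # g5.xlarge
```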


Fix 3: Stop hosting models you don't need to host

The team's LLM — the customer assistant — was a large general-purpose model running on a dedicated g5.48xlarge. That instance runs at around $16/hr, 24 hours a day, seven days a week. That's $11,500/month before a single user query.

The question we asked: why are you hosting this yourself? The honest answer was "we set it up early and never questioned it." There was no fine-tuning, no custom weights, no proprietary data in the model. It was a stock model anyone can access via Bedrock.

We moved it to Bedrock. Cost went from $11,500/month (always-on GPU) to roughly $1,800/month (pay-per-token at their actual query volume). The model quality was identical — it was the same underlying model. The only thing that changed was who managed the GPU.
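For context, the serving call after the move is a few lines against the Bedrock runtime. A sketch using the Converse API via boto3; the model ID shown is only an example, swap in whichever model you actually use:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Pay-per-token invocation: no endpoint to provision, no idle GPU billing.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
    messages=[
        {"role": "user", "content": [{"text": "Summarise my last three transactions."}]}
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])   # input/output token counts, i.e. what you're billed for
```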

Self-hosted on EC2 GPU: an always-on g5.48xlarge at ~$16/hr running 24/7; model serving (vLLM / TGI) that you run; scaling logic you build and maintain; billed even at zero traffic, plus engineering time to operate it.
Bedrock / SageMaker Serverless: no GPU instance to manage; AWS provisions on demand and scales to zero, so there is no idle GPU cost; model updates are managed by AWS; you pay per token / invocation, so cost is tied to actual usage.

Self-hosting makes sense when you have custom weights or strict data residency requirements. For standard models, managed services are almost always cheaper.

When does self-hosting win? When you have fine-tuned weights you can't share with a third party, when you need sub-50ms latency at very high volume, or when your query volume is so high that per-token pricing exceeds the cost of a dedicated instance. At early-to-mid stage, that breakeven point is usually much higher than teams assume.
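Before committing either way, it's worth running the maths on that break-even. A sketch with illustrative prices; swap in your real per-token rates, request sizes, and instance cost:

```python
# Rough break-even: how many requests per month before per-token pricing costs
# more than a dedicated GPU instance? All figures below are illustrative.
INSTANCE_COST_PER_MONTH = 16.0 * 730          # e.g. a ~$16/hr instance, always on

PRICE_PER_1K_INPUT_TOKENS = 0.003             # assumed per-token pricing
PRICE_PER_1K_OUTPUT_TOKENS = 0.015
TOKENS_IN, TOKENS_OUT = 800, 300              # assumed average request size

cost_per_request = (
    TOKENS_IN / 1000 * PRICE_PER_1K_INPUT_TOKENS
    + TOKENS_OUT / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
)

breakeven_requests = INSTANCE_COST_PER_MONTH / cost_per_request
print(f"${cost_per_request:.4f} per request")
print(f"~{breakeven_requests:,.0f} requests/month before the dedicated instance pays off")
```

With these assumed numbers the break-even sits well over a million requests a month, which is why the dedicated instance rarely wins at early stage.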

The managed service decision at a glance

Scenario | Self-host | Bedrock / Managed
Stock model, no fine-tuning | Wasteful | Use this
Fine-tuned model, proprietary weights | Makes sense | Limited support
Low to moderate query volume | Idle GPU billing | Pay only for use
Very high query volume (millions/day) | May be cheaper | Run the maths
Strict data residency / air-gap required | Use this | May not qualify

The full result

  • Training on spot (fine-tuning): $500/mo (on-demand → spot with checkpointing)
  • GPU right-sizing (fraud model): $3,400/mo (g5.12xlarge → g5.xlarge)
  • LLM moved to Bedrock: $1,900/mo (always-on GPU → pay per token)
  • SageMaker Serverless for dev endpoints: $600/mo (replaced always-on dev endpoints)
Total: $6,400/month. Every dollar came from GPU decisions — instance type, hosting model, and spot eligibility. Nothing else changed.

The playbook, in four questions

  • Is your training job running on on-demand GPU? If so, does it checkpoint? If yes — switch to spot today.
  • What is your model's VRAM requirement? Is your current inference instance within 30% of that? If not, downsize.
  • Are you self-hosting a stock, unmodified model? If there's no fine-tuning, Bedrock is almost certainly cheaper.
  • Do you have non-production endpoints (dev, staging, testing) running 24/7? Replace them with SageMaker Serverless, as in the sketch below, or shut them down outside working hours.
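For that last point, here's a minimal sketch of deploying a dev model behind a SageMaker Serverless endpoint with the Python SDK; the artifact path, role, and endpoint name are placeholders:

```python
from sagemaker.pytorch import PyTorchModel
from sagemaker.serverless import ServerlessInferenceConfig

# Serverless inference: no always-on instance, scales to zero between requests.
# NB: serverless endpoints run on AWS-managed CPU capacity, which is usually fine
# for low-traffic dev and staging use.
model = PyTorchModel(
    model_data="s3://my-bucket/models/fraud-model.tar.gz",   # placeholder artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",     # placeholder IAM role
    entry_point="inference.py",                              # your handler script
    framework_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,   # 1024–6144 MB, in 1GB steps
        max_concurrency=5,        # cap concurrent invocations for a dev endpoint
    ),
    endpoint_name="fraud-model-dev",   # placeholder endpoint name
)
```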