GPU time is the most expensive thing in an AI startup's infrastructure. Here's exactly where it gets wasted — and how we got it back.
A fintech startup came to us with a problem. They had two models running in production — a fraud detection model and an LLM powering their customer assistant — plus a fine-tuning pipeline that ran weekly. Their AWS bill had crept past $18,000 a month, and nobody was quite sure why.
The answer, almost entirely, was GPU waste. Not storage. Not EC2 web servers. GPUs — either the wrong size, running at the wrong time, or doing work that a managed service could have done for a fraction of the cost.
First: understand your AI cost map
AI workloads split into two fundamentally different cost shapes. Training (including fine-tuning) is a burst workload: you need a lot of GPU for a few hours, then nothing. Inference is the opposite: steady, always-on, and billed whether you have traffic or not.
Most cost waste happens because teams treat both the same way. They run burst training jobs on always-on GPU instances, and they serve inference from always-on instances far bigger than their traffic needs. The fix is different for each.
The two workload shapes have different cost patterns, and different fixes.
Fix 1: Use spot instances for all training and fine-tuning
The startup was running their weekly fine-tuning jobs on on-demand p3.8xlarge instances at $12.24/hr. The same job on a spot instance costs roughly $3–4/hr, a drop of around 70% for identical compute.
Spot instances are AWS's way of selling spare GPU capacity cheaply. The catch: AWS can reclaim them with two minutes' notice. For training jobs, that sounds scary. In practice, it's completely manageable if you do one thing: checkpoint frequently.
SageMaker managed training jobs handle this natively. You set a checkpoint S3 path, enable spot training with one parameter, and SageMaker does the rest — including automatic retry on interruption. The startup's fine-tuning job went from ~$180/run to ~$55/run. It runs weekly, so that's $500/month saved from one setting change.
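Here's roughly what that looks like with the SageMaker Python SDK. A minimal sketch, assuming a PyTorch fine-tuning script; the entry point, role ARN, and S3 paths are placeholders:

```python
# Minimal sketch of a managed spot training job (SageMaker Python SDK).
# Entry point, role, and S3 paths below are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="finetune.py",                        # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.p3.8xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,                          # the one-parameter switch
    max_run=3 * 3600,                                 # cap on actual training time (seconds)
    max_wait=6 * 3600,                                # training time plus time to wait for spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # SageMaker syncs /opt/ml/checkpoints here
)
estimator.fit({"training": "s3://my-bucket/data/"})
```

If the spot capacity is reclaimed, SageMaker retries the job and restores the checkpoint files to /opt/ml/checkpoints, so the training script just needs to save there frequently and resume from the latest checkpoint on startup.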
Spot is not right for everything, though. The rule is simple: if your workload serves real users in real time, spot is too risky. If it can pause and resume, spot is almost always worth it.
The decision is simple: interruptible workloads go on spot. Real-time workloads stay on-demand.
Fix 2: Right-size your GPU instances for inference
This was one of the biggest savings. The team was serving their fraud detection model, a fine-tuned 7B-parameter model, on a g5.12xlarge. That instance has four GPUs and 96GB of GPU memory. Their model needed about 14GB. They were paying for four GPUs and using one.
GPU memory is what matters for model serving, not raw compute. A 7B model in 16-bit precision takes roughly 14GB of VRAM. A g5.xlarge has 24GB and costs about $1/hr. A g5.12xlarge costs $5.67/hr. Same model, same latency, 80% cost difference.
The matching logic is straightforward: take your model's parameter count, multiply by 2 (for 16-bit weights), and that's your minimum VRAM requirement. Add 20% headroom for the KV cache during inference. Then pick the smallest instance that clears that number.
Rule of thumb: model parameters × 2 bytes = minimum VRAM. Add 20% for inference overhead. Pick the smallest instance that fits.
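The arithmetic is simple enough to script as a sanity check. A minimal sketch:

```python
def min_vram_gb(params_billion: float, bytes_per_param: int = 2, headroom: float = 0.20) -> float:
    """Minimum serving VRAM: weights in 16-bit precision plus KV-cache headroom."""
    return params_billion * bytes_per_param * (1 + headroom)

print(min_vram_gb(7))  # 7B model -> 16.8 GB: fits a 24GB g5.xlarge, nowhere near needing 96GB
```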
Fix 3: Stop hosting models you don't need to host
The team's LLM — the customer assistant — was a large general-purpose model running on a dedicated g5.48xlarge. That instance runs at around $16/hr, 24 hours a day, seven days a week. That's $11,500/month before a single user query.
The question we asked: why are you hosting this yourself? The honest answer was "we set it up early and never questioned it." There was no fine-tuning, no custom weights, no proprietary data in the model. It was a stock model anyone can access via Bedrock.
We moved it to Bedrock. Cost went from $11,500/month (always-on GPU) to roughly $1,800/month (pay-per-token at their actual query volume). The model quality was identical — it was the same underlying model. The only thing that changed was who managed the GPU.
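The migration itself is small: the whole self-hosted serving stack reduces to an API call. A minimal sketch using boto3's Bedrock runtime client; the model ID shown is just an example, not necessarily the model they were running:

```python
# Minimal sketch: querying a stock model on Bedrock instead of a self-hosted endpoint.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID, swap for yours
    messages=[{"role": "user", "content": [{"text": "Summarise my last three transactions."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```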
Self-hosting makes sense when you have custom weights or strict data residency requirements. For standard models, managed services are almost always cheaper.
The managed service decision at a glance
| Scenario | Self-host | Bedrock / Managed |
|---|---|---|
| Stock model, no fine-tuning | Wasteful | Use this |
| Fine-tuned model, proprietary weights | Makes sense | Limited support |
| Low to moderate query volume | Idle GPU billing | Pay only for use |
| Very high query volume (millions/day) | May be cheaper | Run the maths |
| Strict data residency / air-gap required | Use this | May not qualify |
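For that last row, the maths is a back-of-envelope calculation. A sketch with illustrative numbers; the per-token rate and tokens-per-query figures are placeholders, not actual Bedrock pricing:

```python
# Break-even point between an always-on GPU endpoint and pay-per-token pricing.
# All rates below are illustrative placeholders; plug in your own.
gpu_cost_per_month = 16.29 * 730      # e.g. a g5.48xlarge running 24/7

price_per_1k_tokens = 0.004           # hypothetical blended input/output rate
tokens_per_query = 1_500              # prompt + completion, assumed average

cost_per_query = tokens_per_query / 1_000 * price_per_1k_tokens
breakeven = gpu_cost_per_month / cost_per_query
print(f"Self-hosting wins above ~{breakeven:,.0f} queries/month")  # ~2.0M at these rates
```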
The full result

| Fix | Before | After | Approx. monthly saving |
|---|---|---|---|
| Spot for weekly fine-tuning | ~$180/run on-demand | ~$55/run on spot | ~$500 |
| Right-sized inference (g5.12xlarge to g5.xlarge) | ~$4,100/month | ~$730/month | ~$3,400 |
| Bedrock instead of a self-hosted g5.48xlarge | ~$11,500/month | ~$1,800/month | ~$9,700 |

All told, roughly $13,600/month came off a bill that had crept past $18,000, with no change to model quality or latency.
The playbook, in four questions
- Is your training job running on on-demand GPU? If so, does it checkpoint? If yes — switch to spot today.
- What is your model's VRAM requirement? Is your current inference instance within 30% of that? If not, downsize.
- Are you self-hosting a stock, unmodified model? If there's no fine-tuning, Bedrock is almost certainly cheaper.
- Do you have non-production endpoints (dev, staging, testing) running 24/7? Replace them with SageMaker Serverless Inference or shut them down outside working hours; a minimal sketch follows below.
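For that last question, redeploying a dev or staging endpoint onto SageMaker Serverless Inference looks roughly like this. The container image, model artifact, and role are placeholders; note that serverless endpoints are CPU-only, so this suits smaller models, and GPU dev endpoints are better shut down on a schedule instead:

```python
# Minimal sketch: a dev/staging endpoint on SageMaker Serverless Inference,
# billed per request instead of 24/7. All names below are placeholders.
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<inference-container-image>",
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
)

model.deploy(
    endpoint_name="fraud-model-staging",
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,  # 1024-6144, in 1GB increments
        max_concurrency=5,       # cap on concurrent invocations
    ),
)
```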