Qwen3-8B is a capable open-source AI model from Alibaba. It's small enough to run on a few GPUs, smart enough for most business tasks, and free to use. This guide covers how to run it in production on AWS — reliably, securely, and without a runaway cloud bill.

We'll use two key pieces of infrastructure. vLLM is serving software that handles multiple AI requests at once efficiently — think of it as a smart queue manager for your model. EKS is Amazon's managed Kubernetes service — it handles starting, stopping, and scaling the servers that run your model.

Why this combination
Running your own model instead of using OpenAI or Anthropic's API means you own your data, you control costs at scale, and you can customise the model. vLLM handles 2–5× more requests per GPU than a basic setup. EKS gives you automation — servers spin up when traffic grows and shut down when it drops.

How a request gets to your model

Before touching anything, it helps to understand the full journey. A user sends a request. Several things check and route it before it ever reaches the model. Here's the path:

[Figure: the request path. A request travels from the client through CloudFront (blocks attacks), an ALB (routes traffic), and API Gateway (checks API keys) before reaching the private EKS cluster, where the vLLM router picks a free GPU server to run the model. Model weights are stored in S3; monitoring runs through CloudWatch, Prometheus, and Grafana.]
Fig 1 — A request passes through three checks before reaching your model

Nothing in your private servers is directly reachable from the internet. The model only sees traffic that has already been verified by the API Gateway. The model weights themselves — the actual "brain" file that's around 16GB — live in S3 and get downloaded to each server when it starts up.

What the cluster looks like

EKS runs two types of servers. Regular (CPU) servers handle routing and monitoring. GPU servers run the actual model. You want them separate so a traffic spike on the model doesn't starve your routing layer.

[Figure: the VPC (10.0.0.0/16) spans two availability zones. A private subnet (10.0.1.0/24) holds the CPU servers (8 vCPU / 32 GB each) running the API gateway, dashboards and alerts, and system services such as internal DNS, traffic routing, and the autoscaler that watches the queue. The GPU servers (2 to 8 of them, each with 4 directly connected A10G GPUs and 96 GB of combined GPU memory) run the vLLM process; additional GPU servers start when the queue gets too long.]
Fig 2 — CPU servers handle traffic; GPU servers run the model and scale with demand
Instance type to use
Use g5.12xlarge for GPU servers. That's 4 A10G GPUs with 96 GB of combined GPU memory, enough for Qwen3-8B with room to handle many parallel requests. Expect roughly $5–6/hour on-demand in us-east-1 (check current pricing for your region), and 60–70% less on Spot (more on that later).
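
To keep that separation in practice, give the GPU node group a taint and a label, and have the vLLM pods tolerate the taint, select the label, and claim the GPUs explicitly. A minimal sketch of the pod spec side; the node-type label and the nvidia.com/gpu taint key are illustrative names you'd set yourself, not defaults:

# Pod spec fragment: land only on the GPU node group and reserve all 4 GPUs
nodeSelector:
  node-type: gpu                 # illustrative label on the g5.12xlarge node group
tolerations:
- key: nvidia.com/gpu            # illustrative taint that keeps other pods off GPU nodes
  operator: Exists
  effect: NoSchedule
containers:
- name: vllm
  resources:
    limits:
      nvidia.com/gpu: 4          # all 4 A10G GPUs on the node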

The five things you can't skip

AWS has a framework called "Well-Architected" — basically a checklist of mistakes people commonly make when running things in the cloud. For an AI model deployment, five areas matter most.

Security — keep the model private
Your model endpoint should never be reachable directly from the internet. API keys should be stored in AWS Secrets Manager, not hardcoded. The servers running your model should only be able to read the model weights from S3 — nothing else.
Reliability — survive a crashed server
The model takes 60–90 seconds to load. If Kubernetes kills a server during an update before a new one is ready, you get downtime. You need to tell Kubernetes to wait for the model to be ready before switching traffic over.
Performance — don't waste GPU memory
vLLM splits the model across all 4 GPUs so each GPU handles part of the work in parallel. You also want to use 92% of GPU memory for caching — leaving too much headroom wastes throughput; using too much causes crashes.
Cost — only pay for what you use
GPU servers are expensive. Use AWS Spot instances (unused capacity sold cheap) for background jobs. Keep at least one on-demand server running for live requests. Set up auto-shutdown overnight if you don't need 24/7 availability.
Operations — know when something breaks
You need dashboards showing: how long users are waiting for the first word of a response, how full the GPU memory cache is, how many requests are queued up. Without these, you're flying blind.

Security: lock it down properly

The most common mistake is putting the model endpoint behind only an API key check — and nothing else. You want multiple layers, so a single failure doesn't expose everything.

[Figure: six nested security zones. Layer 1, edge firewall: blocks known attacks and enforces HTTPS. Layer 2, network boundary: only whitelisted traffic enters. Layer 3, private subnet: servers have no public IP address. Layer 4, access controls: servers can only talk to what they need. Layer 5, server restrictions: the vLLM pod runs as non-root with a read-only filesystem and can only read from S3. Innermost: encrypted model weights in S3 and API keys in Secrets Manager.]
Fig 3 — Six layers of security between the internet and your model
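
Layer 5 in the figure (non-root, read-only filesystem) maps directly onto the container's securityContext. A minimal sketch; the writable volume for the downloaded weights is covered further down:

# Container-level restrictions for the vLLM pod
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]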

Controlling what the model server can access

Each GPU server's pod gets temporary AWS credentials via IRSA (IAM Roles for Service Accounts) that only allow reading the model weights from one specific S3 prefix. It can't write to S3, can't access other buckets, can't touch anything else in your AWS account. This is the principle of least privilege: if a server is compromised, the blast radius is tiny.

# What the GPU server is allowed to do in AWS — nothing more
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-model-bucket/qwen3-8b/*"
    }
  ]
}
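
That policy hangs off an IAM role, and IRSA delivers the role to the pod through an annotation on its Kubernetes service account. A sketch with a placeholder account ID and role name:

# The pod's service account points at the IAM role that carries the policy above
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-qwen3
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/vllm-s3-readonly   # placeholder role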
Common mistake
Don't store the model weights directly on the server's disk as a permanent file. Pull them from S3 into a temporary folder when the server starts. This way, if a server is terminated, nothing sensitive is left on disk.
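
One way to do that is an init container that syncs the weights from S3 into an emptyDir volume on every pod start, so nothing survives once the pod is gone. A sketch, assuming the bucket path from the policy above and the official AWS CLI image:

# Download weights into a throwaway volume at startup; gone when the pod is deleted
initContainers:
- name: fetch-model
  image: amazon/aws-cli
  command: ["aws", "s3", "sync", "s3://my-model-bucket/qwen3-8b/", "/mnt/model/qwen3-8b/"]
  volumeMounts:
  - name: model-cache
    mountPath: /mnt/model
containers:
- name: vllm
  volumeMounts:
  - name: model-cache
    mountPath: /mnt/model
volumes:
- name: model-cache
  emptyDir: {}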

Reliability: surviving updates and crashes

The model takes a minute or two to load. Kubernetes doesn't know this by default: it will assume the server is broken and kill it before it's ready. You fix this by configuring health checks with a long enough startup window.

# Tell Kubernetes: give the server 5 minutes to start before worrying
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30
  periodSeconds: 10    # 30 checks × 10s = 5 minutes

# Only send traffic once the model is actually ready
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 90

You also need to tell Kubernetes never to shut down your last running server during an update. If it does, there's a gap where no model is available.

# Never reduce below 1 running server, even during updates
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-qwen3-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-qwen3
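
The disruption budget covers node drains and cluster maintenance. For your own rollouts, also tell the Deployment to bring the replacement pod up, and wait for its readiness probe, before removing the old one. A sketch of the strategy block, assuming there is room for one extra pod during the rollout:

# Bring up the replacement before removing the old pod during a rollout
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never drop below the current replica count
    maxSurge: 1         # allow one extra pod while the new one loads the model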

Performance: getting the most out of your GPUs

vLLM has a smart trick: instead of processing one request at a time, it batches multiple requests together on the GPU. Think of it like a shared taxi versus individual cars: the same GPU does more work per minute. It also splits the model across all 4 GPUs, so each GPU holds only part of the weights, leaving far more memory free for caching requests and letting one server handle many more of them in parallel.

[Figure: left, request batching: requests A, B, and C are processed as one batch while request D waits. Centre, GPU memory divided into fixed-size pages, with pages assigned per request and a pool of free pages. Right, the model split across the 4 GPUs, each handling part of the computation, with all 4 syncing after each pass.]
Fig 4 — vLLM batches requests together and splits the model across all 4 GPUs

The launch command

This is what actually starts vLLM. The key numbers to understand:

python -m vllm.entrypoints.openai.api_server \
  --model /mnt/model/qwen3-8b \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --dtype bfloat16 \
  --port 8000

# --tensor-parallel-size 4        use all 4 GPUs
# --gpu-memory-utilization 0.92   use 92% of GPU memory
# --max-model-len 32768           max 32k tokens per request
# --enable-chunked-prefill        don't let one big request block others
# --dtype bfloat16                half precision, uses half the memory
The 92% setting
The --gpu-memory-utilization 0.92 flag tells vLLM what fraction of each GPU's memory it may use in total; whatever is left after the model weights becomes the cache for in-flight conversation context. Set it too low (say, 0.70) and you can handle fewer parallel requests. Set it above 0.95 and you'll get out-of-memory crashes on long prompts. 0.90–0.93 is the sweet spot for most workloads.

Autoscaling: adding servers when it gets busy

GPU servers are expensive and slow to start — they take 3–5 minutes to be ready. You can't wait until things are already broken to scale up. The system watches a queue of waiting requests and starts adding servers proactively.

[Figure: the autoscaling chain. vLLM reports queue length every 15 seconds; Prometheus stores the metric history; KEDA reads the queue depth and decides whether more servers are needed; Kubernetes requests a new GPU server; Karpenter starts a new EC2 GPU instance. From the first signal to a ready server takes roughly 3 to 5 minutes.]
Fig 5 — When the queue grows, a new GPU server is ready in about 4 minutes

The rule: when more than 10 requests are waiting per server, add another server. When the queue drops and stays empty, remove the extra servers after 5 minutes.

# Add a server when >10 requests are queued per server
spec:
  cooldownPeriod: 300          # wait 5 min before removing a server
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # your Prometheus endpoint
      query: avg(vllm:num_requests_waiting)
      threshold: "10"
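
KEDA only changes the replica count; Karpenter is what actually launches the EC2 box when the new pod has nowhere to run. A rough sketch of a NodePool in the v1beta1 schema (field names shift between Karpenter versions, and the EC2NodeClass it points at is assumed to exist already):

# Let Karpenter launch g5.12xlarge nodes when GPU pods are unschedulable
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g5.12xlarge"]
      nodeClassRef:
        name: default               # assumed EC2NodeClass
  limits:
    nvidia.com/gpu: "32"            # hard cap: 8 servers × 4 GPUs
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s          # remove empty GPU nodes after 5 minutes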
Cut costs with Spot
AWS sells unused GPU capacity at 60–70% off as "Spot instances." They can be reclaimed by AWS with 2 minutes' notice. Use Spot for non-urgent batch jobs (summarisation, embeddings). Keep one on-demand server always running for live requests. On a g5.12xlarge, that's a saving of several dollars per hour, per server.
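
If the batch pool is an EKS managed node group running Spot capacity, the batch replicas can be pinned to it with the capacity-type label EKS puts on managed nodes (Karpenter-provisioned nodes carry karpenter.sh/capacity-type instead). A sketch:

# Batch-only replicas: schedule onto Spot nodes, keep the live server on on-demand
nodeSelector:
  eks.amazonaws.com/capacityType: SPOT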

Monitoring: four numbers to watch

You don't need 50 dashboards. Start with these four metrics and set alerts on them:

Time to first word
How long from sending a request to getting the first token back. Alert if this exceeds 2 seconds for 95% of requests. This is what users feel most.
GPU memory fullness
How full is the GPU memory cache. Alert if it stays above 95% — that means you're close to running out of room for new requests and should add a server.
Requests waiting
How many requests are sitting in the queue right now. If this stays above 50, your autoscaler isn't keeping up — either tune the threshold or raise the max server count.
Error rate
Percentage of requests that fail. Anything above 1% needs investigation. Normal causes: OOM on a very long prompt, a server crash mid-response, or a bad deployment.
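
If you run the Prometheus Operator, the first-token and queue alerts above can be expressed as a PrometheusRule along these lines. The metric names match what recent vLLM builds expose on /metrics; check your version before relying on them:

# Alert on slow first tokens (P95 > 2s) and a queue the autoscaler isn't clearing
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
spec:
  groups:
  - name: vllm
    rules:
    - alert: SlowFirstToken
      expr: histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
      for: 5m
      labels:
        severity: warning
    - alert: QueueBacklog
      expr: avg(vllm:num_requests_waiting) > 50
      for: 10m
      labels:
        severity: critical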

For structured logs: enable JSON output in vLLM and add a request ID header at the load balancer. This lets you trace a single failing request across every service it touched, which is otherwise painful to debug.

Before you go to production

Run through this. Most outages and security issues come from skipping items in the middle.

What | Why it matters | Pillar
Run EKS across 2+ availability zones | If one AWS data center goes down, the other picks up traffic | Reliability
Put all servers in private subnets | No server should have a public IP; traffic comes through the load balancer only | Security
Use IRSA for S3 access | Temporary permissions only; no hardcoded AWS keys on the server | Security
Store API keys in Secrets Manager | Secrets in environment variables get leaked in logs | Security
Set startup probe to a 5-minute budget | Prevents Kubernetes from killing a server that's still loading the model | Reliability
Set minAvailable: 1 on the deployment | Prevents all servers being shut down simultaneously during updates | Reliability
Set --tensor-parallel-size 4 | Uses all 4 GPUs; without this, only 1 GPU runs the model | Performance
Set --gpu-memory-utilization 0.92 | Balances throughput vs crash risk on long prompts | Performance
Set KEDA to scale on queue depth | CPU/memory metrics don't reflect GPU model load accurately | Cost
Use Spot for batch replicas | 60–70% cheaper for non-urgent work | Cost
Alert on P95 first-token time | The single metric most correlated with user complaints | Operations
Write infra as code (Terraform/CDK) | Undocumented manual changes cause most incidents | Operations

Wrapping up

The hardest part of running an AI model in production isn't the model — it's everything around it. The model itself is a file you download. The tricky bits are: keeping it private, making sure it doesn't go down when a server crashes, and not paying $30k/month for idle GPUs at 3am.

Start with a single GPU server, get the health probes and security config right, then add autoscaling once you have real traffic to tune against. The checklist above is roughly in priority order — the top items are the ones that will bite you first.

What to try next
Once this is stable, look into speculative decoding, a technique where a tiny draft model predicts the next few tokens and the big model verifies them, cutting average response time by 20–40%. vLLM supports this with a couple of launch flags. Also worth looking at: LoRA adapters, which let you load small customisations on top of the base model without running a separate deployment for each one.

Tested with vLLM 0.4.x on EKS 1.30, g5.12xlarge instances. Benchmark your own workload before locking in memory utilization settings — prompt length distributions vary a lot by use case.