Qwen3-8B is a capable open-source AI model from Alibaba. It's small enough to run on a few GPUs, smart enough for most business tasks, and free to use. This guide covers how to run it in production on AWS — reliably, securely, and without a runaway cloud bill.
We'll use two key pieces of infrastructure. vLLM is serving software that handles multiple AI requests at once efficiently — think of it as a smart queue manager for your model. EKS is Amazon's managed Kubernetes service — it handles starting, stopping, and scaling the servers that run your model.
How a request gets to your model
Before touching anything, it helps to understand the full journey. A user sends a request, and several things check and route it before it ever reaches the model. The path: user → API Gateway (verifies the request) → internal load balancer → vLLM on a private GPU server.
None of your private servers are directly reachable from the internet; the model only sees traffic that has already been verified by the API Gateway. The model weights themselves, the actual "brain" file at around 16 GB, live in S3 and get downloaded to each server when it starts up.
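One common way to handle that startup download is an initContainer that syncs the weights from S3 into a volume shared with the vLLM container. A minimal sketch; the volume name is an assumption, and note that `aws s3 sync` also needs s3:ListBucket on the bucket, on top of the read-only object policy shown later:

```yaml
# Runs to completion before the vLLM container starts:
# pull the ~16 GB of weights from S3 into a shared volume
initContainers:
  - name: fetch-weights
    image: amazon/aws-cli
    command: ["aws", "s3", "sync",
              "s3://my-model-bucket/qwen3-8b/",
              "/mnt/model/qwen3-8b/"]
    volumeMounts:
      - name: model-cache       # assumption: an emptyDir or PVC the vLLM container also mounts
        mountPath: /mnt/model
```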
What the cluster looks like
EKS runs two types of servers. Regular (CPU) servers handle routing and monitoring. GPU servers run the actual model. You want them separate so a traffic spike on the model doesn't starve your routing layer.
We'll use g5.12xlarge for the GPU servers: 4 A10G GPUs with 96 GB of combined GPU memory, enough for Qwen3-8B with room to handle many parallel requests. Expect around $5–6/hour on-demand in us-east-1, and roughly 60–70% less on Spot (more on that later).
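Enforcing that separation is a scheduling exercise: taint the GPU node group so nothing else lands there, and have the vLLM pod tolerate the taint and claim all four GPUs. A sketch of the relevant pod-spec fields; the taint key is a common convention you apply yourself, not something EKS sets for you:

```yaml
# On the vLLM pod: only schedule onto the GPU node group, and
# reserve all 4 GPUs so Kubernetes won't co-locate other pods
nodeSelector:
  node.kubernetes.io/instance-type: g5.12xlarge
tolerations:
  - key: nvidia.com/gpu        # matches the taint you put on the GPU nodes
    operator: Exists
    effect: NoSchedule
resources:
  limits:
    nvidia.com/gpu: 4
```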
The five things you can't skip
AWS has a framework called "Well-Architected": essentially a checklist, organized into pillars, of the mistakes people commonly make when running things in the cloud. For an AI model deployment, five areas matter most.
Security: lock it down properly
The most common mistake is putting the model endpoint behind an API key check and nothing else. You want multiple layers, so a single failure doesn't expose everything.
Controlling what the model server can access
Each GPU server gets temporary AWS credentials via IRSA (IAM Roles for Service Accounts) that only allow reading the model weights from one specific S3 folder. It can't write to S3, can't access other buckets, can't touch anything else in your AWS account. This is the principle of least privilege: if a server gets compromised, the blast radius is tiny.
What the GPU server is allowed to do in AWS, and nothing more:

```json
{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": "arn:aws:s3:::my-model-bucket/qwen3-8b/*"
}
```
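On the Kubernetes side, IRSA works by annotating the pod's ServiceAccount with the IAM role to assume; EKS then injects short-lived credentials for that role into the pod. A minimal sketch, with placeholder account ID, role name, and namespace:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-qwen3
  namespace: inference           # assumption: wherever the model pods run
  annotations:
    # EKS injects temporary credentials for this role into any pod
    # running as this ServiceAccount -- no long-lived keys on the server
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/vllm-s3-readonly
```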
Reliability: surviving updates and crashes
The model takes about 2 minutes to load. Kubernetes doesn't know this by default: it will assume the server is broken and restart it before it ever becomes ready. You fix that by configuring health checks with a long enough startup window.
```yaml
# Tell Kubernetes: give the server 5 minutes to start before worrying
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30
  periodSeconds: 10        # 30 checks × 10s = 5 minutes

# Only send traffic once the model is actually ready
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 90
```
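The startup probe protects a slow boot; a liveness probe covers the opposite failure, a server that came up fine and later wedged. A sketch, with thresholds as assumptions to tune (Kubernetes holds liveness checks off until the startup probe passes, so the two don't conflict):

```yaml
# Restart the container if /health stops answering after startup
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 15
  failureThreshold: 4      # tolerate ~1 minute of failures before restarting
```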
You also need to tell Kubernetes never to shut down your last running server during an update. If it does, there's a gap where no model is available.
```yaml
# Never go below 1 running server, even during updates
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-qwen3-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-qwen3
```
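The PodDisruptionBudget guards against evictions (node drains, cluster upgrades). For your own deploys, pair it with a surge-first rollout so the replacement server is fully up before the old one goes away; these are standard Deployment fields, sketched here:

```yaml
# During a deploy: start the new pod first (maxSurge: 1), and never
# take a pod away before its replacement is ready (maxUnavailable: 0)
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```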
Performance: getting the most out of your GPUs
vLLM has a smart trick called continuous batching: instead of processing one request at a time, it batches multiple requests together on the GPU. Think of it as a shared taxi versus individual cars: the same GPU does more work per minute. It can also split the model across all 4 GPUs (tensor parallelism), which frees most of each GPU's memory for caching in-flight requests instead of squeezing the ~16 GB of weights onto a single 24 GB A10G.
The launch command
This is what actually starts vLLM. The key numbers to understand:
```bash
# --tensor-parallel-size 4       use all 4 GPUs
# --gpu-memory-utilization 0.92  use 92% of GPU memory
# --max-model-len 32768          max 32k tokens per request
# --enable-chunked-prefill       don't let one big request block others
# --dtype bfloat16               half precision: halves the memory footprint
python -m vllm.entrypoints.openai.api_server \
  --model /mnt/model/qwen3-8b \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --dtype bfloat16 \
  --port 8000
```
The --gpu-memory-utilization flag tells vLLM what fraction of GPU memory it may claim; whatever is left after the weights are loaded goes to the cache that holds in-flight conversation state. Set it too low (say, 0.70) and you can handle fewer parallel requests. Set it above 0.95 and you risk out-of-memory crashes on long prompts. 0.90–0.93 is the sweet spot for most workloads.
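Once a server is up, a quick end-to-end check is one request against vLLM's OpenAI-compatible API. By default vLLM registers the model under the path you passed to --model; if you set --served-model-name, use that instead:

```bash
# Smoke test from inside the cluster (port 8000, per the launch command)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/mnt/model/qwen3-8b",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```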
Autoscaling: adding servers when it gets busy
GPU servers are expensive and slow to start: between provisioning the node and loading the model, a new one takes 3–5 minutes to be ready. You can't wait until things are already broken to scale up. Instead, an autoscaler (KEDA, reading vLLM's Prometheus metrics) watches the queue of waiting requests and adds servers proactively.
The rule: when more than 10 requests are waiting per server, add another server. When the queue drops and stays empty, remove the extra servers after 5 minutes.
```yaml
# Add a server when >10 requests are queued per server
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-qwen3-scaler
spec:
  scaleTargetRef:
    name: vllm-qwen3                 # the vLLM Deployment
  cooldownPeriod: 300                # wait 5 min before removing a server
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # your Prometheus endpoint
        threshold: "10"
        query: avg(vllm:num_requests_waiting)
```
On g5.12xlarge, every idle server scaled down saves roughly $5–6/hour on-demand, which adds up quickly overnight.
Monitoring: four numbers to watch
You don't need 50 dashboards. Start with these four metrics and set alerts on them:

- vllm:num_requests_waiting (queue depth): the same signal the autoscaler uses; sustained growth means you're under-provisioned
- P95 time to first token: the latency users actually feel, and the first alert to set
- GPU memory utilization: a steady climb toward the limit means out-of-memory crashes on long prompts are coming
- Load balancer 5xx rate: catches crashed servers and failed health checks
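As a concrete starting point, here is the first-token alert as a Prometheus rule. The histogram is one vLLM exports out of the box; the 2-second threshold is an assumption to tune against your own latency budget:

```yaml
# Page when P95 time-to-first-token stays above 2s for 5 minutes
- alert: SlowFirstToken
  expr: >
    histogram_quantile(0.95,
      sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 2
  for: 5m
  labels:
    severity: page
```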
For structured logs: enable JSON output in vLLM and add a request ID header at the load balancer (an AWS ALB injects X-Amzn-Trace-Id on every request by default). This lets you trace a single failing request across every service it touched, which is otherwise one of the slowest things to debug.
Before you go to production
Run through this. Most outages and security issues come from skipping items in the middle.
| What | Why it matters | Pillar |
|---|---|---|
| Run EKS across 2+ availability zones | If one AWS data center goes down, the other picks up traffic | Reliability |
| Put all servers in private subnets | No server should have a public IP — traffic comes through the load balancer only | Security |
| Use IRSA for S3 access | Temporary permissions only — no hardcoded AWS keys on the server | Security |
| Store API keys in Secrets Manager | Secrets in environment variables get leaked in logs | Security |
| Set startup probe to 5 min budget | Prevents Kubernetes from killing a server that's still loading the model | Reliability |
| Set minAvailable: 1 on the deployment | Prevents all servers being shut down simultaneously during updates | Reliability |
| Set --tensor-parallel-size 4 | Uses all 4 GPUs; without this, only 1 GPU runs the model | Performance |
| Set --gpu-memory-utilization 0.92 | Balances throughput vs crash risk on long prompts | Performance |
| Set KEDA to scale on queue depth | CPU/memory metrics don't reflect GPU model load accurately | Cost |
| Use Spot for batch replicas | 60–70% cheaper for non-urgent work | Cost |
| Alert on P95 first-token time | The single metric most correlated with user complaints | Operations |
| Write infra as code (Terraform/CDK) | Undocumented manual changes cause most incidents | Operations |
Wrapping up
The hardest part of running an AI model in production isn't the model — it's everything around it. The model itself is a file you download. The tricky bits are: keeping it private, making sure it doesn't go down when a server crashes, and not paying $30k/month for idle GPUs at 3am.
Start with a single GPU server, get the health probes and security config right, then add autoscaling once you have real traffic to tune against. The checklist above is roughly in priority order — the top items are the ones that will bite you first.
Tested with vLLM 0.4.x on EKS 1.30, g5.12xlarge instances. Benchmark your own workload before locking in memory utilization settings — prompt length distributions vary a lot by use case.