Qwen3-8B is a capable open-source AI model from Alibaba. It's small enough to run on a few GPUs, smart enough for most business tasks, and free to use. This guide covers how to run it in production on AWS — reliably, securely, and without a runaway cloud bill.

We'll use two key pieces of infrastructure. vLLM is serving software that handles multiple AI requests at once efficiently — think of it as a smart queue manager for your model. EKS is Amazon's managed Kubernetes service — it handles starting, stopping, and scaling the servers that run your model.

Why this combination
Running your own model instead of using OpenAI or Anthropic's API means you own your data, you control costs at scale, and you can customise the model. vLLM handles 2–5× more requests per GPU than a basic setup. EKS gives you automation — servers spin up when traffic grows and shut down when it drops.

How a request gets to your model

Before touching anything, it helps to understand the full journey. A user sends a request. Several things check and route it before it ever reaches the model. Here's the path:

[Figure: the request path. A request travels from the client through CloudFront (blocks attacks), an ALB (routes traffic), and API Gateway (checks API keys) before reaching the private EKS cluster, where the vLLM router picks a free GPU server to run the model. Model weights are stored in S3; monitoring runs through CloudWatch, Prometheus, and Grafana.]
Fig 1 — A request passes through three checks before reaching your model

Nothing in your private servers is directly reachable from the internet. The model only sees traffic that has already been verified by the API Gateway. The model weights themselves — the actual "brain" file that's around 16GB — live in S3 and get downloaded to each server when it starts up.

What the cluster looks like

EKS runs two types of servers. Regular (CPU) servers handle routing and monitoring. GPU servers run the actual model. You want them separate so a traffic spike on the model doesn't starve your routing layer.

[Figure: the VPC (10.0.0.0/16) spans two availability zones. A private subnet (10.0.1.0/24) holds the CPU servers (8 vCPU / 32 GB each) running the API gateway, dashboards and alerts, and system services such as internal DNS, traffic routing, and the autoscaler that watches the queue. The GPU servers (2 to 8 of them, each with 4 directly connected A10G GPUs and 96 GB of combined GPU memory) run the vLLM process; additional GPU servers start when the queue gets too long.]
Fig 2 — CPU servers handle traffic; GPU servers run the model and scale with demand
Instance type to use
Use g5.12xlarge for GPU servers. That's 4 A10G GPUs with 96 GB of combined GPU memory, enough for Qwen3-8B with room to handle many parallel requests. Expect roughly $5–6/hour on-demand in us-east-1 (check current pricing for your region), and 60–70% less on Spot (more on that later).
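
To keep that separation in practice, give the GPU node group a taint and a label, and have the vLLM pods tolerate the taint, select the label, and claim the GPUs explicitly. A minimal sketch of the pod spec side; the node-type label and the nvidia.com/gpu taint key are illustrative names you'd set yourself, not defaults:

# Pod spec fragment: land only on the GPU node group and reserve all 4 GPUs
nodeSelector:
  node-type: gpu                 # illustrative label on the g5.12xlarge node group
tolerations:
- key: nvidia.com/gpu            # illustrative taint that keeps other pods off GPU nodes
  operator: Exists
  effect: NoSchedule
containers:
- name: vllm
  resources:
    limits:
      nvidia.com/gpu: 4          # all 4 A10G GPUs on the node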

The five things you can't skip

AWS has a framework called "Well-Architected" — basically a checklist of mistakes people commonly make when running things in the cloud. For an AI model deployment, five areas matter most.

Security — keep the model private
Your model endpoint should never be reachable directly from the internet. API keys should be stored in AWS Secrets Manager, not hardcoded. The servers running your model should only be able to read the model weights from S3 — nothing else.
Reliability — survive a crashed server
The model takes 60–90 seconds to load. If Kubernetes kills a server during an update before a new one is ready, you get downtime. You need to tell Kubernetes to wait for the model to be ready before switching traffic over.
Performance — don't waste GPU memory
vLLM splits the model across all 4 GPUs so each GPU handles part of the work in parallel. You also want to use 92% of GPU memory for caching — leaving too much headroom wastes throughput; using too much causes crashes.
Cost — only pay for what you use
GPU servers are expensive. Use AWS Spot instances (unused capacity sold cheap) for background jobs. Keep at least one on-demand server running for live requests. Set up auto-shutdown overnight if you don't need 24/7 availability.
Operations — know when something breaks
You need dashboards showing: how long users are waiting for the first word of a response, how full the GPU memory cache is, how many requests are queued up. Without these, you're flying blind.

Security: lock it down properly

The most common mistake is putting the model endpoint behind only an API key check — and nothing else. You want multiple layers, so a single failure doesn't expose everything.

[Figure: six nested security zones. Layer 1, edge firewall: blocks known attacks and enforces HTTPS. Layer 2, network boundary: only whitelisted traffic enters. Layer 3, private subnet: servers have no public IP address. Layer 4, access controls: servers can only talk to what they need. Layer 5, server restrictions: the vLLM pod runs as non-root with a read-only filesystem and can only read from S3. Innermost: encrypted model weights in S3 and API keys in Secrets Manager.]
Fig 3 — Six layers of security between the internet and your model
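
Layer 5 in the figure (non-root, read-only filesystem) maps directly onto the container's securityContext. A minimal sketch; the writable volume for the downloaded weights is covered further down:

# Container-level restrictions for the vLLM pod
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]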

Controlling what the model server can access

Each GPU server's pod gets temporary AWS credentials via IRSA (IAM Roles for Service Accounts) that only allow reading the model weights from one specific S3 prefix. It can't write to S3, can't access other buckets, can't touch anything else in your AWS account. This is the principle of least privilege: if a server is compromised, the blast radius is tiny.

# What the GPU server is allowed to do in AWS — nothing more
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-model-bucket/qwen3-8b/*"
    }
  ]
}
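
That policy hangs off an IAM role, and IRSA delivers the role to the pod through an annotation on its Kubernetes service account. A sketch with a placeholder account ID and role name:

# The pod's service account points at the IAM role that carries the policy above
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-qwen3
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/vllm-s3-readonly   # placeholder role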
Common mistake
Don't store the model weights directly on the server's disk as a permanent file. Pull them from S3 into a temporary folder when the server starts. This way, if a server is terminated, nothing sensitive is left on disk.
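
One way to do that is an init container that syncs the weights from S3 into an emptyDir volume on every pod start, so nothing survives once the pod is gone. A sketch, assuming the bucket path from the policy above and the official AWS CLI image:

# Download weights into a throwaway volume at startup; gone when the pod is deleted
initContainers:
- name: fetch-model
  image: amazon/aws-cli
  command: ["aws", "s3", "sync", "s3://my-model-bucket/qwen3-8b/", "/mnt/model/qwen3-8b/"]
  volumeMounts:
  - name: model-cache
    mountPath: /mnt/model
containers:
- name: vllm
  volumeMounts:
  - name: model-cache
    mountPath: /mnt/model
volumes:
- name: model-cache
  emptyDir: {}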

Reliability: surviving updates and crashes

The model takes a minute or two to load. Kubernetes doesn't know this by default: it will assume the server is broken and kill it before it's ready. You fix this by configuring health checks with a long enough startup window.

# Tell Kubernetes: give the server 5 minutes to start before worrying
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30
  periodSeconds: 10    # 30 checks × 10s = 5 minutes

# Only send traffic once the model is actually ready
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 90

You also need to tell Kubernetes never to shut down your last running server during an update. If it does, there's a gap where no model is available.

# Never reduce below 1 running server, even during updates
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-qwen3-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-qwen3
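
The disruption budget covers node drains and cluster maintenance. For your own rollouts, also tell the Deployment to bring the replacement pod up, and wait for its readiness probe, before removing the old one. A sketch of the strategy block, assuming there is room for one extra pod during the rollout:

# Bring up the replacement before removing the old pod during a rollout
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never drop below the current replica count
    maxSurge: 1         # allow one extra pod while the new one loads the model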

Performance: getting the most out of your GPUs

vLLM has a smart trick: instead of processing one request at a time, it batches multiple requests together on the GPU. Think of it like a shared taxi versus individual cars: the same GPU does more work per minute. It also splits the model across all 4 GPUs, so each GPU holds only part of the weights, leaving far more memory free for caching requests and letting one server handle many more of them in parallel.

[Figure: left, request batching: requests A, B, and C are processed as one batch while request D waits. Centre, GPU memory divided into fixed-size pages, with pages assigned per request and a pool of free pages. Right, the model split across the 4 GPUs, each handling part of the computation, with all 4 syncing after each pass.]
Fig 4 — vLLM batches requests together and splits the model across all 4 GPUs

The launch command

This is what actually starts vLLM. The key numbers to understand:

python -m vllm.entrypoints.openai.api_server \
  --model /mnt/model/qwen3-8b \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --dtype bfloat16 \
  --port 8000

# --tensor-parallel-size 4        use all 4 GPUs
# --gpu-memory-utilization 0.92   use 92% of GPU memory
# --max-model-len 32768           max 32k tokens per request
# --enable-chunked-prefill        don't let one big request block others
# --dtype bfloat16                half precision, uses half the memory
The 92% setting
The --gpu-memory-utilization 0.92 flag tells vLLM what fraction of each GPU's memory it may use in total; whatever is left after the model weights becomes the cache for in-flight conversation context. Set it too low (say, 0.70) and you can handle fewer parallel requests. Set it above 0.95 and you'll get out-of-memory crashes on long prompts. 0.90–0.93 is the sweet spot for most workloads.

Autoscaling: adding servers when it gets busy

GPU servers are expensive and slow to start — they take 3–5 minutes to be ready. You can't wait until things are already broken to scale up. The system watches a queue of waiting requests and starts adding servers proactively.

[Figure: the autoscaling chain. vLLM reports queue length every 15 seconds; Prometheus stores the metric history; KEDA reads the queue depth and decides whether more servers are needed; Kubernetes requests a new GPU server; Karpenter starts a new EC2 GPU instance. From the first signal to a ready server takes roughly 3 to 5 minutes.]
Fig 5 — When the queue grows, a new GPU server is ready in about 4 minutes

The rule: when more than 10 requests are waiting per server, add another server. When the queue drops and stays empty, remove the extra servers after 5 minutes.

# Add a server when >10 requests are queued per server
spec:
  cooldownPeriod: 300          # wait 5 min before removing a server
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # your Prometheus endpoint
      query: avg(vllm:num_requests_waiting)
      threshold: "10"
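
KEDA only changes the replica count; Karpenter is what actually launches the EC2 box when the new pod has nowhere to run. A rough sketch of a NodePool in the v1beta1 schema (field names shift between Karpenter versions, and the EC2NodeClass it points at is assumed to exist already):

# Let Karpenter launch g5.12xlarge nodes when GPU pods are unschedulable
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g5.12xlarge"]
      nodeClassRef:
        name: default               # assumed EC2NodeClass
  limits:
    nvidia.com/gpu: "32"            # hard cap: 8 servers × 4 GPUs
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s          # remove empty GPU nodes after 5 minutes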
Cut costs with Spot
AWS sells unused GPU capacity at 60–70% off as "Spot instances." They can be reclaimed by AWS with 2 minutes' notice. Use Spot for non-urgent batch jobs (summarisation, embeddings). Keep one on-demand server always running for live requests. On a g5.12xlarge, that's a saving of several dollars per hour, per server.
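
If the batch pool is an EKS managed node group running Spot capacity, the batch replicas can be pinned to it with the capacity-type label EKS puts on managed nodes (Karpenter-provisioned nodes carry karpenter.sh/capacity-type instead). A sketch:

# Batch-only replicas: schedule onto Spot nodes, keep the live server on on-demand
nodeSelector:
  eks.amazonaws.com/capacityType: SPOT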

Monitoring: four numbers to watch

You don't need 50 dashboards. Start with these four metrics and set alerts on them:

Time to first word
How long from sending a request to getting the first token back. Alert if this exceeds 2 seconds for 95% of requests. This is what users feel most.
GPU memory fullness
How full is the GPU memory cache. Alert if it stays above 95% — that means you're close to running out of room for new requests and should add a server.
Requests waiting
How many requests are sitting in the queue right now. If this stays above 50, your autoscaler isn't keeping up — either tune the threshold or raise the max server count.
Error rate
Percentage of requests that fail. Anything above 1% needs investigation. Normal causes: OOM on a very long prompt, a server crash mid-response, or a bad deployment.
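
If you run the Prometheus Operator, the first-token and queue alerts above can be expressed as a PrometheusRule along these lines. The metric names match what recent vLLM builds expose on /metrics; check your version before relying on them:

# Alert on slow first tokens (P95 > 2s) and a queue the autoscaler isn't clearing
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
spec:
  groups:
  - name: vllm
    rules:
    - alert: SlowFirstToken
      expr: histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
      for: 5m
      labels:
        severity: warning
    - alert: QueueBacklog
      expr: avg(vllm:num_requests_waiting) > 50
      for: 10m
      labels:
        severity: critical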

For structured logs: enable JSON output in vLLM and add a request ID header at the load balancer. This lets you trace a single failing request across every service it touched, which is otherwise painful to debug.

Before you go to production

Run through this. Most outages and security issues come from skipping items in the middle.

What | Why it matters | Pillar
Run EKS across 2+ availability zones | If one AWS data center goes down, the other picks up traffic | Reliability
Put all servers in private subnets | No server should have a public IP; traffic comes through the load balancer only | Security
Use IRSA for S3 access | Temporary permissions only; no hardcoded AWS keys on the server | Security
Store API keys in Secrets Manager | Secrets in environment variables get leaked in logs | Security
Set startup probe to a 5-minute budget | Prevents Kubernetes from killing a server that's still loading the model | Reliability
Set minAvailable: 1 on the deployment | Prevents all servers being shut down simultaneously during updates | Reliability
Set --tensor-parallel-size 4 | Uses all 4 GPUs; without this, only 1 GPU runs the model | Performance
Set --gpu-memory-utilization 0.92 | Balances throughput vs crash risk on long prompts | Performance
Set KEDA to scale on queue depth | CPU/memory metrics don't reflect GPU model load accurately | Cost
Use Spot for batch replicas | 60–70% cheaper for non-urgent work | Cost
Alert on P95 first-token time | The single metric most correlated with user complaints | Operations
Write infra as code (Terraform/CDK) | Undocumented manual changes cause most incidents | Operations

Wrapping up

The hardest part of running an AI model in production isn't the model — it's everything around it. The model itself is a file you download. The tricky bits are: keeping it private, making sure it doesn't go down when a server crashes, and not paying $30k/month for idle GPUs at 3am.

Start with a single GPU server, get the health probes and security config right, then add autoscaling once you have real traffic to tune against. The checklist above is roughly in priority order — the top items are the ones that will bite you first.

What to try next
Once this is stable, look into speculative decoding, a technique where a tiny draft model predicts the next few tokens and the big model verifies them, cutting average response time by 20–40%. vLLM supports this with a couple of launch flags. Also worth looking at: LoRA adapters, which let you load small customisations on top of the base model without running a separate deployment for each one.

Tested with vLLM 0.4.x on EKS 1.30, g5.12xlarge instances. Benchmark your own workload before locking in memory utilization settings — prompt length distributions vary a lot by use case.