Production Hardening for vLLM

Follow-up to: Securing vLLM with Caddy Reverse Proxy

This guide adds hardening to your vLLM deployment. It assumes you’ve completed both the base installation and reverse proxy configuration.

Security Note: This guide implements systemd security directives and preflight checks. Combined with the previous guides (HTTPS, firewall, endpoint restriction), this provides a production-ready deployment. Review all Notes before deploying.


Part I: Systemd Hardening

Stop service:

sudo systemctl stop vllm.service

Create Preflight Check Script

sudo tee /opt/vllm/preflight-check.sh >/dev/null <<'EOF'
#!/bin/bash
set -e

log_error() { echo "[ERROR] $1" >&2; }
log_success() { echo "[OK] $1"; }

# Check GPU
if ! nvidia-smi &>/dev/null; then
    log_error "GPU not available"
    exit 1
fi
log_success "GPU available"

# Check CUDA
if ! /opt/vllm/.venv/bin/python -c "import torch; assert torch.cuda.is_available()" 2>/dev/null; then
    log_error "CUDA not accessible"
    exit 1
fi
log_success "CUDA accessible"

# Check port
if ss -tlnp 2>/dev/null | grep -q ':8000 '; then
    log_error "Port 8000 in use"
    exit 1
fi
log_success "Port available"

# Check vLLM
if [ ! -f /opt/vllm/.venv/bin/vllm ]; then
    log_error "vLLM not found"
    exit 1
fi
log_success "vLLM found"

log_success "All checks passed"
EOF

Make executable:

sudo chmod +x /opt/vllm/preflight-check.sh

Test:

sudo -u vllm /opt/vllm/preflight-check.sh
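
The Notes below suggest extending the preflight script with custom checks such as disk space. As one hedged sketch, a helper like this could be appended to the script; the 90% threshold and the /opt/models path are illustrative values, not requirements:

```shell
# Illustrative extra preflight check: fail when a filesystem is nearly full.
# check_disk FAIL_PCT DIR returns nonzero if DIR's usage is at or above FAIL_PCT.
check_disk() {
    local threshold="$1" dir="$2" usage
    # df --output=pcent prints e.g. " 43%"; strip everything but the digits
    usage=$(df --output=pcent "$dir" | tail -n 1 | tr -dc '0-9') || return 1
    if [ "$usage" -ge "$threshold" ]; then
        echo "[ERROR] Disk usage at ${usage}% on $dir" >&2
        return 1
    fi
    echo "[OK] Disk usage at ${usage}% on $dir"
}

# Example call to add before the final "All checks passed" line:
# check_disk 90 /opt/models
```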

Update Service File

Create hardened service file:

sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM Inference Server
After=network-online.target nvidia-persistenced.service
Wants=network-online.target
Documentation=https://docs.vllm.ai
# Rate-limit restarts: after 5 failed starts in 5 minutes, stop retrying
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm

# Environment
EnvironmentFile=/etc/vllm/vllm.env

# Pre-start checks
ExecStartPre=/bin/sleep 5
ExecStartPre=/opt/vllm/preflight-check.sh

# Start service
ExecStart=/opt/vllm/start-vllm.sh

# Graceful shutdown
TimeoutStopSec=30

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65535

# Security: Filesystem
ProtectSystem=strict
ReadWritePaths=/opt/vllm /opt/models
ProtectHome=true
PrivateTmp=true

# Security: Privileges
NoNewPrivileges=true

# Security: Kernel
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true

# Security: System calls
LockPersonality=true
RestrictSUIDSGID=true
RestrictRealtime=true
RestrictNamespaces=true

# Security: Capabilities
CapabilityBoundingSet=
AmbientCapabilities=

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm

[Install]
WantedBy=multi-user.target
EOF

Reload and start:

sudo systemctl daemon-reload
sudo systemctl start vllm.service
sudo systemctl status vllm.service

Verify Service

sudo journalctl -u vllm.service -n 50
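
Beyond reading the logs, systemd can audit how much of the hardening actually applies. systemd-analyze security scores each sandboxing directive and prints an overall "exposure" level (lower is better); expect a middling score here, since the CUDA-related exclusions described in the Notes leave some surface open:

```shell
# Audit the unit's sandboxing; guarded so this is a no-op on hosts
# without systemd-analyze, and non-fatal if the unit is not loaded.
if command -v systemd-analyze >/dev/null 2>&1; then
    systemd-analyze security vllm.service || echo "[WARN] audit unavailable (is the unit loaded?)"
fi
```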

Test API:

API_KEY=$(sudo grep VLLM_API_KEY /etc/vllm/vllm.env | cut -d= -f2-)
curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer $API_KEY"
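
If the models endpoint answers, the server is fully up. For scripted liveness checks, vLLM's OpenAI-compatible server also exposes a lightweight /health endpoint (no request body needed); this probe prints just the HTTP status code, 200 when healthy:

```shell
# Liveness probe: print only the HTTP status code (200 = healthy).
# The || true keeps scripted checks from aborting when the server is
# unreachable; curl then reports 000.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/health || true
```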

Part II: Rate Limiting

Rate limiting prevents API abuse and resource exhaustion. Options:

  • Caddy – Requires xcaddy build with rate limit plugin
  • Nginx – Built-in limit_req module
  • Fail2ban – Ban IPs after authentication failures
  • Cloudflare – Edge-level rate limiting
  • Application – Custom middleware

Implementation varies by platform and is beyond the scope of this guide.
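
As one illustration of the Nginx option, a limit_req sketch might look like the following. This is a config fragment, not a drop-in: the zone size, 10 r/s rate, burst value, and server_name are placeholders, and your real server block would carry the TLS and proxy settings from your existing reverse proxy configuration.

```nginx
# Sketch: token-bucket rate limiting in front of vLLM.
# Define a shared zone keyed by client IP (goes in the http{} context)...
limit_req_zone $binary_remote_addr zone=vllm_api:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name example.com;  # placeholder

    location /v1/ {
        # ...and apply it per location: allow short bursts, reject excess with 429
        limit_req zone=vllm_api burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8000;
    }
}
```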


Notes

  • Preflight Checks: Run before every service start. Add custom checks (disk space, memory) as needed.
  • Graceful Shutdown: TimeoutStopSec=30 allows in-flight requests to complete.
  • Restart Policy: The service restarts on failure with a 10s delay. After 5 failures in 5 minutes, systemd stops retrying.
  • Filesystem Protection: ProtectSystem=strict makes the filesystem read-only except for the writable paths (/opt/vllm and /opt/models). Cache is stored in /opt/vllm/.cache.
  • Kernel Protection: ProtectKernelTunables, ProtectKernelModules, ProtectKernelLogs, and ProtectControlGroups prevent kernel modification.
  • Privilege Restrictions: NoNewPrivileges=true prevents privilege escalation. All capabilities are dropped via CapabilityBoundingSet=.
  • System Call Restrictions: LockPersonality, RestrictSUIDSGID, RestrictRealtime, and RestrictNamespaces are enabled.
  • Address Families: RestrictAddressFamilies cannot be enabled; vLLM requires address families beyond TCP/UDP/Unix for internal IPC.
  • CUDA Limitations: MemoryDenyWriteExecute and SystemCallFilter are omitted; both are incompatible with CUDA.
  • Monitoring: View logs with journalctl -u vllm.service -f.
  • Troubleshooting: Check journalctl -u vllm.service for failures. Common issues: GPU permissions, port conflicts, missing model files.

Dave Ziegler

I’m a full-stack AI/LLM practitioner and solutions architect with 30+ years of enterprise IT, application development, consulting, and technical communication experience.

While I currently engage in LLM consulting, application development, integration, local deployments, and technical training, my focus is on AI safety, ethics, education, and industry transparency.

Open to opportunities in technical education, system design consultation, practical deployment guidance, model evaluation, red teaming/adversarial prompting, and technical communication.

My passion is bridging the gap between theory and practice by making complex systems comprehensible and actionable.

Founding Member, AI Mental Health Collective

Community Moderator / SME, The Human Line Project

Let’s connect

Discord: AightBits