Production Hardening for vLLM

Follow-up to: Securing vLLM with Caddy Reverse Proxy

This guide adds hardening to your vLLM deployment. It assumes you’ve completed both the base installation and reverse proxy configuration.

Security Note: This guide implements systemd security directives and preflight checks. Combined with the previous guides (HTTPS, firewall, endpoint restriction), this provides a production-ready deployment. Review all Notes before deploying.


Part I: Systemd Hardening

Stop service:

sudo systemctl stop vllm.service

Create Preflight Check Script

sudo tee /opt/vllm/preflight-check.sh >/dev/null <<'EOF'
#!/bin/bash
set -e

log_error() { echo "[ERROR] $1" >&2; }
log_success() { echo "[OK] $1"; }

# Check GPU
if ! nvidia-smi &>/dev/null; then
    log_error "GPU not available"
    exit 1
fi
log_success "GPU available"

# Check CUDA
if ! /opt/vllm/.venv/bin/python -c "import torch; assert torch.cuda.is_available()" 2>/dev/null; then
    log_error "CUDA not accessible"
    exit 1
fi
log_success "CUDA accessible"

# Check port
if ss -tlnp 2>/dev/null | grep -q ':8000 '; then
    log_error "Port 8000 in use"
    exit 1
fi
log_success "Port available"

# Check vLLM
if [ ! -f /opt/vllm/.venv/bin/vllm ]; then
    log_error "vLLM not found"
    exit 1
fi
log_success "vLLM found"

log_success "All checks passed"
EOF

Make executable:

sudo chmod +x /opt/vllm/preflight-check.sh

Test:

sudo -u vllm /opt/vllm/preflight-check.sh
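
The Notes below suggest extending the preflight script with custom checks such as disk space. As one hedged sketch, a helper like this could be appended to the script; the 90% threshold and the /opt/models path are illustrative values, not requirements:

```shell
# Illustrative extra preflight check: fail when a filesystem is nearly full.
# check_disk FAIL_PCT DIR returns nonzero if DIR's usage is at or above FAIL_PCT.
check_disk() {
    local threshold="$1" dir="$2" usage
    # df --output=pcent prints e.g. " 43%"; strip everything but the digits
    usage=$(df --output=pcent "$dir" | tail -n 1 | tr -dc '0-9') || return 1
    if [ "$usage" -ge "$threshold" ]; then
        echo "[ERROR] Disk usage at ${usage}% on $dir" >&2
        return 1
    fi
    echo "[OK] Disk usage at ${usage}% on $dir"
}

# Example call to add before the final "All checks passed" line:
# check_disk 90 /opt/models
```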

Update Service File

Create hardened service file:

sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM Inference Server
After=network-online.target nvidia-persistenced.service
Wants=network-online.target
Documentation=https://docs.vllm.ai
# Rate-limit restarts: after 5 failed starts in 5 minutes, stop retrying
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm

# Environment
EnvironmentFile=/etc/vllm/vllm.env

# Pre-start checks
ExecStartPre=/bin/sleep 5
ExecStartPre=/opt/vllm/preflight-check.sh

# Start service
ExecStart=/opt/vllm/start-vllm.sh

# Graceful shutdown
TimeoutStopSec=30

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65535

# Security: Filesystem
ProtectSystem=strict
ReadWritePaths=/opt/vllm /opt/models
ProtectHome=true
PrivateTmp=true

# Security: Privileges
NoNewPrivileges=true

# Security: Kernel
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true

# Security: System calls
LockPersonality=true
RestrictSUIDSGID=true
RestrictRealtime=true
RestrictNamespaces=true

# Security: Capabilities
CapabilityBoundingSet=
AmbientCapabilities=

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm

[Install]
WantedBy=multi-user.target
EOF

Reload and start:

sudo systemctl daemon-reload
sudo systemctl start vllm.service
sudo systemctl status vllm.service

Verify Service

sudo journalctl -u vllm.service -n 50
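
Beyond reading the logs, systemd can audit how much of the hardening actually applies. systemd-analyze security scores each sandboxing directive and prints an overall "exposure" level (lower is better); expect a middling score here, since the CUDA-related exclusions described in the Notes leave some surface open:

```shell
# Audit the unit's sandboxing; guarded so this is a no-op on hosts
# without systemd-analyze, and non-fatal if the unit is not loaded.
if command -v systemd-analyze >/dev/null 2>&1; then
    systemd-analyze security vllm.service || echo "[WARN] audit unavailable (is the unit loaded?)"
fi
```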

Test API:

API_KEY=$(sudo grep VLLM_API_KEY /etc/vllm/vllm.env | cut -d= -f2-)
curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer $API_KEY"
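
If the models endpoint answers, the server is fully up. For scripted liveness checks, vLLM's OpenAI-compatible server also exposes a lightweight /health endpoint (no request body needed); this probe prints just the HTTP status code, 200 when healthy:

```shell
# Liveness probe: print only the HTTP status code (200 = healthy).
# The || true keeps scripted checks from aborting when the server is
# unreachable; curl then reports 000.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/health || true
```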

Part II: Rate Limiting

Rate limiting prevents API abuse and resource exhaustion. Options:

  • Caddy – Requires xcaddy build with rate limit plugin
  • Nginx – Built-in limit_req module
  • Fail2ban – Ban IPs after authentication failures
  • Cloudflare – Edge-level rate limiting
  • Application – Custom middleware

Implementation varies by platform and is beyond the scope of this guide.
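
As one illustration of the Nginx option, a limit_req sketch might look like the following. This is a config fragment, not a drop-in: the zone size, 10 r/s rate, burst value, and server_name are placeholders, and your real server block would carry the TLS and proxy settings from your existing reverse proxy configuration.

```nginx
# Sketch: token-bucket rate limiting in front of vLLM.
# Define a shared zone keyed by client IP (goes in the http{} context)...
limit_req_zone $binary_remote_addr zone=vllm_api:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name example.com;  # placeholder

    location /v1/ {
        # ...and apply it per location: allow short bursts, reject excess with 429
        limit_req zone=vllm_api burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8000;
    }
}
```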


Notes

  • Preflight Checks: Run before every service start. Add custom checks (disk space, memory) as needed.
  • Graceful Shutdown: TimeoutStopSec=30 allows in-flight requests to complete.
  • Restart Policy: The service restarts on failure with a 10s delay. After 5 failures in 5 minutes, systemd stops retrying.
  • Filesystem Protection: ProtectSystem=strict makes the filesystem read-only except for the writable paths (/opt/vllm and /opt/models). Cache is stored in /opt/vllm/.cache.
  • Kernel Protection: ProtectKernelTunables, ProtectKernelModules, ProtectKernelLogs, and ProtectControlGroups prevent kernel modification.
  • Privilege Restrictions: NoNewPrivileges=true prevents privilege escalation. All capabilities are dropped via CapabilityBoundingSet=.
  • System Call Restrictions: LockPersonality, RestrictSUIDSGID, RestrictRealtime, and RestrictNamespaces are enabled.
  • Address Families: RestrictAddressFamilies cannot be enabled; vLLM requires address families beyond TCP/UDP/Unix for internal IPC.
  • CUDA Limitations: MemoryDenyWriteExecute and SystemCallFilter are omitted; both are incompatible with CUDA.
  • Monitoring: View logs with journalctl -u vllm.service -f.
  • Troubleshooting: Check journalctl -u vllm.service for failures. Common issues: GPU permissions, port conflicts, missing model files.

Dave Ziegler

I’m a full-stack AI/LLM practitioner and solutions architect with 30+ years of enterprise IT, application development, consulting, and technical communication experience.

While I currently engage in LLM consulting, application development, integration, local deployments, and technical training, my focus is on AI safety, ethics, education, and industry transparency.

Open to opportunities in technical education, system design consultation, practical deployment guidance, model evaluation, red teaming/adversarial prompting, and technical communication.

My passion is bridging the gap between theory and practice by making complex systems comprehensible and actionable.

Founding Member, AI Mental Health Collective

Community Moderator / SME, The Human Line Project

Let’s connect

Discord: AightBits