Follow-up to: Securing vLLM with Caddy Reverse Proxy
This guide adds hardening to your vLLM deployment. It assumes you’ve completed both the base installation and reverse proxy configuration.
Security Note: This guide implements systemd security directives and preflight checks. Combined with the previous guides (HTTPS, firewall, endpoint restriction), this provides a production-ready deployment. Review all Notes before deploying.
Part I: Systemd Hardening
Stop service:
sudo systemctl stop vllm.service
Create Preflight Check Script
sudo tee /opt/vllm/preflight-check.sh >/dev/null <<'EOF'
#!/bin/bash
set -e

log_error() { echo "[ERROR] $1" >&2; }
log_success() { echo "[OK] $1"; }

# Check GPU
if ! nvidia-smi &>/dev/null; then
  log_error "GPU not available"
  exit 1
fi
log_success "GPU available"

# Check CUDA
if ! /opt/vllm/.venv/bin/python -c "import torch; assert torch.cuda.is_available()" 2>/dev/null; then
  log_error "CUDA not accessible"
  exit 1
fi
log_success "CUDA accessible"

# Check port
if ss -tlnp 2>/dev/null | grep -q ':8000 '; then
  log_error "Port 8000 in use"
  exit 1
fi
log_success "Port available"

# Check vLLM
if [ ! -f /opt/vllm/.venv/bin/vllm ]; then
  log_error "vLLM not found"
  exit 1
fi
log_success "vLLM found"

log_success "All checks passed"
EOF
Make executable:
sudo chmod +x /opt/vllm/preflight-check.sh
Test:
sudo -u vllm /opt/vllm/preflight-check.sh
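The Notes below suggest adding custom checks such as disk space. A sketch of one such check, in the same style as the script above (the 10 GB threshold and the MODELS_DIR fallback are illustrative assumptions; on the server this would point at /opt/models):

```shell
#!/bin/bash
# Hypothetical extra preflight check: require ~10 GB free where models live.
log_error() { echo "[ERROR] $1" >&2; }
log_success() { echo "[OK] $1"; }

MODELS_DIR="${MODELS_DIR:-/tmp}"     # assumption: /opt/models in production
MIN_FREE_KB=$((10 * 1024 * 1024))    # 10 GB expressed in KiB

# df -k reports available KiB on the filesystem holding MODELS_DIR
free_kb=$(df --output=avail -k "$MODELS_DIR" | tail -n 1 | tr -d ' ')
if [ "$free_kb" -lt "$MIN_FREE_KB" ]; then
  log_error "Low disk space on $MODELS_DIR (${free_kb} KiB free)"
else
  log_success "Disk space sufficient (${free_kb} KiB free)"
fi
```

Append a block like this (with an `exit 1` on failure) before the final "All checks passed" line if you want the service start to abort on low disk.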
Update Service File
Create hardened service file:
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM Inference Server
After=network-online.target nvidia-persistenced.service
Wants=network-online.target
Documentation=https://docs.vllm.ai

[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm

# Environment
EnvironmentFile=/etc/vllm/vllm.env

# Pre-start checks
ExecStartPre=/bin/sleep 5
ExecStartPre=/opt/vllm/preflight-check.sh

# Start service
ExecStart=/opt/vllm/start-vllm.sh

# Graceful shutdown
TimeoutStopSec=30

# Restart policy
Restart=on-failure
RestartSec=10s
StartLimitInterval=300s
StartLimitBurst=5

# Resource limits
LimitNOFILE=65535

# Security: Filesystem
ProtectSystem=strict
ReadWritePaths=/opt/vllm /opt/models
ProtectHome=true
PrivateTmp=true

# Security: Privileges
NoNewPrivileges=true

# Security: Kernel
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true

# Security: System calls
LockPersonality=true
RestrictSUIDSGID=true
RestrictRealtime=true
RestrictNamespaces=true

# Security: Capabilities
CapabilityBoundingSet=
AmbientCapabilities=

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm

[Install]
WantedBy=multi-user.target
EOF
Reload and start:
sudo systemctl daemon-reload
sudo systemctl start vllm.service
sudo systemctl status vllm.service
Verify Service
sudo journalctl -u vllm.service -n 50
Test API:
API_KEY=$(sudo grep VLLM_API_KEY /etc/vllm/vllm.env | cut -d '=' -f 2)
curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer $API_KEY"
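As an optional extra verification step, systemd can score how tightly the unit is sandboxed; the hardening directives above should noticeably lower its "exposure" rating. The fallback message is an assumption added here so the snippet stays runnable on hosts where the unit is not loaded:

```shell
# systemd-analyze rates a unit's sandboxing (lower exposure = tighter).
report=$(systemd-analyze security vllm.service 2>/dev/null) \
  || report="systemd-analyze security unavailable on this host"
echo "$report"
```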
Part II: Rate Limiting
Rate limiting prevents API abuse and resource exhaustion. Options:
- Caddy – Requires xcaddy build with rate limit plugin
- Nginx – Built-in limit_req module
- Fail2ban – Ban IPs after authentication failures
- Cloudflare – Edge-level rate limiting
- Application – Custom middleware
Implementation varies by platform and is beyond the scope of this guide.
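For orientation, a minimal nginx limit_req sketch is shown below; the zone name, rate, burst, and server_name are illustrative placeholders, not tuned values:

```nginx
# Shared zone keyed by client IP: 10 MB of state, 5 requests/s steady rate.
limit_req_zone $binary_remote_addr zone=vllm_api:10m rate=5r/s;

server {
    listen 443 ssl;
    server_name example.com;   # placeholder

    location /v1/ {
        # Allow short bursts of 10 requests; reject the overflow with 429.
        limit_req zone=vllm_api burst=10 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8000;
    }
}
```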
Notes
- Preflight Checks: Run before every service start. Add custom checks (disk space, memory) as needed.
- Graceful Shutdown: TimeoutStopSec=30 allows in-flight requests to complete.
- Restart Policy: The service restarts on failure after a 10s delay. After 5 failures within 5 minutes, systemd stops retrying.
- Filesystem Protection: ProtectSystem=strict makes the filesystem read-only except for the writable paths (/opt/vllm, /opt/models). The cache is stored in /opt/vllm/.cache.
- Kernel Protection: ProtectKernelTunables, ProtectKernelModules, ProtectKernelLogs, and ProtectControlGroups prevent kernel modification.
- Privilege Restrictions: NoNewPrivileges=true prevents privilege escalation. All capabilities are dropped via CapabilityBoundingSet=.
- System Call Restrictions: LockPersonality, RestrictSUIDSGID, RestrictRealtime, and RestrictNamespaces are enabled.
- Address Families: RestrictAddressFamilies cannot be enabled – vLLM requires address families beyond TCP/UDP/Unix for internal IPC.
- CUDA Limitations: MemoryDenyWriteExecute and SystemCallFilter are omitted – both are incompatible with CUDA.
- Monitoring: View logs with journalctl -u vllm.service -f.
- Troubleshooting: Check journalctl -u vllm.service for failures. Common issues: GPU permissions, port conflicts, missing model files.
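When chasing the port-conflict failure mode, a small helper along these lines can be pasted into a shell session. It mirrors the ss-based check in the preflight script; the function name and port are illustrative:

```shell
# Hypothetical triage helper: is a TCP port already bound on this host?
port_in_use() {
  ss -tln 2>/dev/null | grep -q ":$1 "
}

if port_in_use 8000; then
  # The -p flag (needs root) shows which process owns the socket.
  echo "port 8000 is busy; inspect the owner with: sudo ss -tlnp | grep ':8000 '"
else
  echo "port 8000 is free"
fi
```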