Fresh Install, Headless (no GUI), NVIDIA
Last Updated: 1/16/2026
This is a step-by-step guide for installing vLLM as a systemd service on Ubuntu Server 24.04.3. It assumes the following:
- You are starting from a clean, basic installation.
- You have performed all basic OS configuration (initial updates, firewall, etc.).
- You are using one or more NVIDIA GPUs.
- Your hardware is capable of running the selected model (Microsoft Phi-4-mini, 3.8B parameters, is used as a test).
Security Note: This guide configures vLLM with baseline security defaults: localhost-only binding, proper API key file ownership, GPU device permissions, and systemd service isolation. However, this is NOT a production-hardened deployment. For production use, you must also implement:
- OS hardening (firewall rules)
- SSH configuration
- User access controls
- Network security (reverse proxy with TLS)
- Restricting API calls to localhost, or:
- Rate limiting and IP allowlisting (if exposing beyond localhost)
- Additional systemd security directives
- Monitoring/logging
Carefully review all security considerations in the Notes section below before deploying.
Part I: NVIDIA Driver Installation
System Updates
Update system:
sudo apt update
sudo apt upgrade
Prerequisites
Install build tools:
sudo apt install -y build-essential pkg-config linux-headers-$(uname -r) python3.12-dev
Remove Existing Drivers and Block Nouveau
Remove existing NVIDIA drivers:
sudo apt purge -y '*nvidia*' 'cuda*'
sudo apt autoremove -y
sudo apt autoclean
Block nouveau drivers:
sudo tee /etc/modprobe.d/blacklist-nouveau.conf >/dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
Install Official NVIDIA Drivers
Download CUDA repository pin:
cd /tmp
wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
Install CUDA repository keyring:
wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-archive-keyring.gpg
sudo install -m 644 cuda-archive-keyring.gpg /usr/share/keyrings/cuda-archive-keyring.gpg
Add CUDA repository:
sudo tee /etc/apt/sources.list.d/nvidia-cuda.list >/dev/null <<'EOF'
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /
EOF
Install NVIDIA drivers and (optional) CUDA toolkit:
sudo apt update
sudo apt install -y nvidia-open-590 cuda-toolkit-12-8
Reboot to load drivers:
sudo reboot
Verify Installation
Verify NVIDIA installation:
nvidia-smi
/usr/local/cuda-12.8/bin/nvcc --version
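You can also confirm that the nouveau blacklist took effect after the reboot:
lsmod | grep nouveau
No output means nouveau is no longer loaded.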
Part II: vLLM User and Directory Setup
Create Dedicated User and Directories
Create vllm user and group:
sudo groupadd --system vllm
sudo useradd --system --gid vllm --create-home --home-dir /opt/vllm --shell /bin/bash vllm
sudo usermod -aG video,render vllm
Note: Adding the user to video and render groups grants access to GPU devices (/dev/nvidia*). Without this, the service may fail with permission errors.
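You can verify the group membership and device permissions (the /dev/nvidia* nodes exist only after the driver is loaded):
groups vllm
ls -l /dev/nvidia*
The first command should list vllm, video, and render.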
Create directories:
sudo mkdir -p /opt/vllm /opt/models /etc/vllm
Set ownership and permissions:
sudo chown -R vllm:vllm /opt/vllm /opt/models
sudo chown root:vllm /etc/vllm
sudo chmod 750 /etc/vllm
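A quick sanity check of the resulting ownership and modes:
ls -ld /opt/vllm /opt/models /etc/vllm
/etc/vllm should show drwxr-x--- root vllm; the other two directories should be owned by vllm:vllm.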
Install uv Package Manager
Install uv as vllm user:
sudo -u vllm bash -c 'curl -LsSf https://astral.sh/uv/install.sh | sh'
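Verify the installation (the installer places uv in the vllm user's ~/.local/bin, which is the path used throughout the rest of this guide):
sudo -u vllm bash -c '~/.local/bin/uv --version'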
Part III: vLLM Installation
Install vLLM in Virtual Environment
Install vLLM and huggingface-hub:
sudo -u vllm bash <<'EOF'
cd /opt/vllm
~/.local/bin/uv venv .venv
source .venv/bin/activate
~/.local/bin/uv pip install vllm huggingface-hub
vllm --version
EOF
Verify CUDA Access
Verify PyTorch can access CUDA (critical for catching misconfigured installations):
sudo -u vllm bash <<'EOF'
cd /opt/vllm
source .venv/bin/activate
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"
EOF
Expected output: CUDA available: True and a CUDA version matching your installation (e.g., 12.8). If False, check driver installation and group membership.
Optional: Download Model from Hugging Face
Note: Some models (e.g., Llama) require Hugging Face authentication. If needed, run hf login as the vllm user before downloading.
Download model:
sudo -u vllm bash <<'EOF'
cd /opt/vllm
source .venv/bin/activate
hf download microsoft/Phi-4-mini-instruct \
  --local-dir /opt/models/phi-4-mini-instruct
EOF
Verify model download:
ls -lh /opt/models/phi-4-mini-instruct
Note: Verify key files exist: config.json, tokenizer files, and model weights. You can also manually copy models into /opt/models instead of downloading.
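A quick spot-check (a sketch; exact weight filenames vary by model, and this assumes safetensors-format weights):
cd /opt/models/phi-4-mini-instruct
for f in config.json tokenizer_config.json; do
  [ -f "$f" ] && echo "OK: $f" || echo "MISSING: $f"
done
ls -lh *.safetensors   # assumes safetensors weights; other models may ship .bin files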
Part IV: vLLM Service Configuration
Generate API Key and Environment File
Generate API key and create environment file:
GENERATED_KEY=$(openssl rand -hex 32)
sudo tee /etc/vllm/vllm.env >/dev/null <<EOF
VLLM_API_KEY=${GENERATED_KEY}
VLLM_SLEEP_WHEN_IDLE=1
EOF
Secure the environment file:
sudo chown root:vllm /etc/vllm/vllm.env
sudo chmod 640 /etc/vllm/vllm.env
Note: Setting ownership to root:vllm with 640 permissions ensures only root can write and the vllm user can read the API key.
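To confirm:
ls -l /etc/vllm/vllm.env
Expected output begins with -rw-r----- 1 root vllm.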
Create Launch Script
Create vLLM launch script (adjust for your downloaded model):
sudo -u vllm tee /opt/vllm/start-vllm.sh >/dev/null <<'EOF'
#!/bin/bash
set -e
cd /opt/vllm
source /opt/vllm/.venv/bin/activate
if [ -f /etc/vllm/vllm.env ]; then
  source /etc/vllm/vllm.env
fi
exec vllm serve /opt/models/phi-4-mini-instruct \
  --host 127.0.0.1 \
  --served-model-name phi-4-mini \
  --max-model-len 16384 \
  --max-num-seqs 1 \
  --trust-remote-code \
  --api-key "${VLLM_API_KEY}"
EOF
sudo chmod +x /opt/vllm/start-vllm.sh
Note: By default, this binds to 127.0.0.1:8000 (localhost only) for security. The --served-model-name phi-4-mini flag allows clients to use "model": "phi-4-mini" instead of the full model path. To allow network access, change --host 127.0.0.1 to --host 0.0.0.0, and consider using a reverse proxy with TLS, IP allowlisting, and rate limiting.
Test the script (press Ctrl+C after verifying it starts successfully):
sudo -u vllm /opt/vllm/start-vllm.sh
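While the test instance is running, you can confirm from a second terminal that it is listening on localhost only:
sudo ss -tlnp | grep 8000
The local address should read 127.0.0.1:8000, not 0.0.0.0:8000.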
Create Systemd Service
Create systemd service:
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM Inference Server
After=network-online.target
Wants=network-online.target
Documentation=https://docs.vllm.ai

[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm
EnvironmentFile=/etc/vllm/vllm.env
ExecStart=/opt/vllm/start-vllm.sh
Restart=on-failure
RestartSec=10s
LimitNOFILE=65535
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm

[Install]
WantedBy=multi-user.target
EOF
Enable and start service:
sudo systemctl daemon-reload
sudo systemctl enable vllm.service
sudo systemctl start vllm.service
Check service status:
sudo systemctl status vllm.service
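Model loading can take a while on first start. You can follow the startup logs until the server reports it is ready:
sudo journalctl -u vllm.service -f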
Part V: Verification and Testing
Test the Service
Test vLLM API:
API_KEY=$(sudo grep VLLM_API_KEY /etc/vllm/vllm.env | cut -d '=' -f 2)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "phi-4-mini",
    "messages": [
      {"role": "user", "content": "Say hello in French"}
    ],
    "max_tokens": 50,
    "temperature": 0
  }'
If successful, you should see something like:
{"id":"chatcmpl-b54378bebb216a8f","object":"chat.completion","created":1768590077,"model":"phi-4-mini","choices":[{"index":0,"message":{"role":"assistant","content":"Bonjour!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":200020,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":7,"total_tokens":10,"completion_tokens":3,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
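You can also list the served models, or query the health endpoint (per the Notes below, it is not protected by the API key; it should return HTTP 200 when the server is up):
curl -H "Authorization: Bearer $API_KEY" http://127.0.0.1:8000/v1/models
curl -i http://127.0.0.1:8000/health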
Maintenance
Part VI: Updating vLLM
To update vLLM to the latest version:
Stop the service:
sudo systemctl stop vllm.service
Update vLLM:
sudo -u vllm bash <<'EOF'
cd /opt/vllm
source .venv/bin/activate
~/.local/bin/uv pip install --upgrade vllm
vllm --version
EOF
Restart the service:
sudo systemctl start vllm.service
sudo systemctl status vllm.service
Note: Major vLLM updates may introduce breaking changes. Review the vLLM changelog and release notes before upgrading production systems.
Notes
- Network Binding: This guide defaults to 127.0.0.1:8000 (localhost only) for security. For network access, change to --host 0.0.0.0 in the launch script and place vLLM behind a reverse proxy with TLS, IP allowlisting, and rate limiting.
- API Key Security: The --api-key parameter only protects the OpenAI-compatible /v1/* endpoints. Other endpoints (health checks, metrics) remain unauthenticated. For production deployments, combine API key authentication with network restrictions (localhost binding + reverse proxy) or firewall rules.
- GPU Access: The vllm user must be in the video and render groups to access GPU devices. Without this, you'll see permission denied errors on /dev/nvidia*.
- Version Pinning: For production stability, consider pinning vLLM and CUDA versions: uv pip install vllm==0.x.x. vLLM moves quickly; pinning prevents unexpected breakage during rebuilds.
- Performance Tuning: The --max-num-seqs 1 setting is conservative to prevent out-of-memory errors. Increase this value for better throughput once you've verified the model runs stably.
- CUDA Toolkit: This guide installs the full CUDA toolkit (12.8). Many vLLM deployments only require NVIDIA drivers and runtime libraries. The toolkit is included here for completeness and compatibility.
- Systemd Hardening: For production deployments, consider adding systemd security directives like NoNewPrivileges=true, PrivateTmp=true, ProtectSystem=strict, and ProtectHome=true to the service file; see the sketch after this list.
- Gated Models: Some models (e.g., Llama) require Hugging Face authentication. Run hf login before downloading.
- Model Verification: After download, verify key files exist: config.json, tokenizer files, and model weights.
- Troubleshooting: View logs with journalctl -u vllm.service -f or restart with sudo systemctl restart vllm.service.
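As an illustration of the Systemd Hardening note above, a drop-in override could look like the following (a sketch, not a vetted production profile; hardening.conf is an arbitrary name chosen here). Because ProtectSystem=strict makes the filesystem read-only for the service, ReadWritePaths is added as an assumption to keep /opt/vllm writable:
sudo mkdir -p /etc/systemd/system/vllm.service.d
sudo tee /etc/systemd/system/vllm.service.d/hardening.conf >/dev/null <<'EOF'
[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
# assumption: the venv and caches under /opt/vllm must stay writable
ReadWritePaths=/opt/vllm
EOF
sudo systemctl daemon-reload
sudo systemctl restart vllm.service
Test the service afterwards; directives like ProtectSystem=strict can break model or cache writes outside the listed paths.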