Installing vLLM as a Service on Ubuntu Server 24.04.3 LTS

Fresh Install, Headless (no GUI), NVIDIA

Last Updated: 1/16/2026

This is a step-by-step guide for installing vLLM as a service with Ubuntu Server 24.04.3. It assumes the following:

  • You are starting from a clean, basic installation.
  • You have performed all basic OS configuration (initial updates, firewall, etc.).
  • You are using one or more NVIDIA GPUs.
  • Your hardware is capable of running the selected model (Microsoft Phi-4-mini, with 3.8B parameters, is used as a test).

Security Note: This guide configures vLLM with baseline security defaults: localhost-only binding, proper API key file ownership, GPU device permissions, and systemd service isolation. However, this is NOT a production-hardened deployment. For production use, you must also implement:

  • OS hardening (firewall rules)
  • SSH configuration
  • User access controls
  • Network security (reverse proxy with TLS)
  • Restricting API access to localhost, or rate limiting and IP allowlisting if exposing beyond localhost
  • Additional systemd security directives
  • Monitoring/logging

Carefully review all security considerations in the Notes section below before deploying.


Part I: NVIDIA Driver Installation

System Updates

Update system:

sudo apt update
sudo apt upgrade

Prerequisites

Install build tools:

sudo apt install -y build-essential pkg-config linux-headers-$(uname -r) python3.12-dev

Remove Existing Drivers and Block Nouveau

Remove existing NVIDIA drivers:

sudo apt purge -y '*nvidia*' 'cuda*'
sudo apt autoremove -y
sudo apt autoclean

Block nouveau drivers:

sudo tee /etc/modprobe.d/blacklist-nouveau.conf >/dev/null <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u

Install Official NVIDIA Drivers

Download CUDA repository pin:

cd /tmp
wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600

Install CUDA repository keyring:

wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-archive-keyring.gpg
sudo install -m 644 cuda-archive-keyring.gpg /usr/share/keyrings/cuda-archive-keyring.gpg

Add CUDA repository:

sudo tee /etc/apt/sources.list.d/nvidia-cuda.list >/dev/null <<'EOF'
deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /
EOF

Install NVIDIA drivers and (optional) CUDA toolkit:

sudo apt update
sudo apt install -y nvidia-open-590 cuda-toolkit-12-8

Reboot to load drivers:

sudo reboot

Verify Installation

Verify NVIDIA installation:

nvidia-smi
/usr/local/cuda-12.8/bin/nvcc --version
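
As an optional sanity check, you can also confirm that nouveau stayed blocked and the NVIDIA kernel modules loaded (module names below assume the open kernel driver packages installed above):

lsmod | grep nouveau || echo "nouveau not loaded (expected)"
lsmod | grep '^nvidia'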

Part II: vLLM User and Directory Setup

Create Dedicated User and Directories

Create vllm user and group:

sudo groupadd --system vllm
sudo useradd --system --gid vllm --create-home --home-dir /opt/vllm --shell /bin/bash vllm
sudo usermod -aG video,render vllm

Note: Adding the user to video and render groups grants access to GPU devices (/dev/nvidia*). Without this, the service may fail with permission errors.
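
To confirm the group membership and device permissions took effect, you can check the vllm user's groups and the GPU device nodes (the exact set of /dev/nvidia* nodes varies by GPU count and driver version):

id vllm
ls -l /dev/nvidia*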

Create directories:

sudo mkdir -p /opt/vllm /opt/models /etc/vllm

Set ownership and permissions:

sudo chown -R vllm:vllm /opt/vllm /opt/models
sudo chown root:vllm /etc/vllm
sudo chmod 750 /etc/vllm

Install uv Package Manager

Install uv as vllm user:

sudo -u vllm bash -c 'curl -LsSf https://astral.sh/uv/install.sh | sh'
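
You can confirm uv installed correctly; the installer places the binary in ~/.local/bin by default, which is the path used throughout this guide:

sudo -u vllm bash -c '~/.local/bin/uv --version'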

Part III: vLLM Installation

Install vLLM in Virtual Environment

Install vLLM and huggingface-hub:

sudo -u vllm bash <<'EOF'
cd /opt/vllm
~/.local/bin/uv venv .venv
source .venv/bin/activate
~/.local/bin/uv pip install vllm huggingface-hub
vllm --version
EOF

Verify CUDA Access

Verify PyTorch can access CUDA (critical for catching misconfigured installations):

sudo -u vllm bash <<'EOF'
cd /opt/vllm
source .venv/bin/activate
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"
EOF

Expected output: CUDA available: True and a CUDA version matching your installation (e.g., 12.8). If False, check driver installation and group membership.
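
If CUDA is not available, a quick way to separate driver problems from permission problems is to run nvidia-smi as the vllm user and re-check its group membership:

sudo -u vllm nvidia-smi
groups vllm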

Optional: Download Model from Hugging Face

Note: Some models (e.g., Llama) require Hugging Face authentication. If needed, run hf login as the vllm user before downloading.

Download model:

sudo -u vllm bash <<'EOF'
cd /opt/vllm
source .venv/bin/activate
hf download microsoft/Phi-4-mini-instruct \
--local-dir /opt/models/phi-4-mini-instruct
EOF

Verify model download:

ls -lh /opt/models/phi-4-mini-instruct

Note: Verify key files exist: config.json, tokenizer files, and model weights. You can also manually copy models into /opt/models instead of downloading.
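
A quick spot check (the weight filenames below assume the safetensors format used by most current Hugging Face models; exact names vary by model):

test -f /opt/models/phi-4-mini-instruct/config.json && echo "config.json present"
ls -lh /opt/models/phi-4-mini-instruct/*.safetensors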


Part IV: vLLM Service Configuration

Generate API Key and Environment File

Generate API key and create environment file:

GENERATED_KEY=$(openssl rand -hex 32)
sudo tee /etc/vllm/vllm.env >/dev/null <<EOF
VLLM_API_KEY=${GENERATED_KEY}
VLLM_SLEEP_WHEN_IDLE=1
EOF

Secure the environment file:

sudo chown root:vllm /etc/vllm/vllm.env
sudo chmod 640 /etc/vllm/vllm.env

Note: Setting ownership to root:vllm with 640 permissions ensures only root can write and the vllm user can read the API key.
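
To confirm the permissions are correct and that the vllm user can read the key:

ls -l /etc/vllm/vllm.env
sudo -u vllm cat /etc/vllm/vllm.env >/dev/null && echo "vllm user can read the file"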

Create Launch Script

Create vLLM launch script (adjust for your downloaded model):

sudo -u vllm tee /opt/vllm/start-vllm.sh >/dev/null <<'EOF'
#!/bin/bash
set -e
cd /opt/vllm
source /opt/vllm/.venv/bin/activate
if [ -f /etc/vllm/vllm.env ]; then
source /etc/vllm/vllm.env
fi
exec vllm serve /opt/models/phi-4-mini-instruct \
--host 127.0.0.1 \
--served-model-name phi-4-mini \
--max-model-len 16384 \
--max-num-seqs 1 \
--trust-remote-code \
--api-key "${VLLM_API_KEY}"
EOF
sudo chmod +x /opt/vllm/start-vllm.sh

Note: By default, this binds to 127.0.0.1:8000 (localhost only) for security. The --served-model-name phi-4-mini flag allows clients to use "model": "phi-4-mini" instead of the full model path. To allow network access, change --host 127.0.0.1 to --host 0.0.0.0, and consider using a reverse proxy with TLS, IP allowlisting, and rate limiting.
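
If you do expose the port beyond localhost, one simple layer of IP allowlisting is a firewall rule. This is only a sketch using UFW (assumed from the firewall prerequisite above); the subnet 192.168.1.0/24 is an example and should be replaced with your trusted network:

sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp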

Test the script (press Ctrl+C after verifying it starts successfully):

sudo -u vllm /opt/vllm/start-vllm.sh

Create Systemd Service

Create systemd service:

sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM Inference Server
After=network-online.target
Wants=network-online.target
Documentation=https://docs.vllm.ai
[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm
EnvironmentFile=/etc/vllm/vllm.env
ExecStart=/opt/vllm/start-vllm.sh
Restart=on-failure
RestartSec=10s
LimitNOFILE=65535
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm
[Install]
WantedBy=multi-user.target
EOF

Enable and start service:

sudo systemctl daemon-reload
sudo systemctl enable vllm.service
sudo systemctl start vllm.service

Check service status:

sudo systemctl status vllm.service
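
Model loading can take a while on first start. You can follow the startup logs, and once the server reports it is ready, the unauthenticated /health endpoint (see the Notes section) should return HTTP 200:

journalctl -u vllm.service -f
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8000/health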

Part V: Verification and Testing

Test the Service

Test vLLM API:

API_KEY=$(sudo grep VLLM_API_KEY /etc/vllm/vllm.env | cut -d '=' -f 2)
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{
"model": "phi-4-mini",
"messages": [
{"role": "user", "content": "Say hello in French"}
],
"max_tokens": 50,
"temperature": 0
}'

If successful, you should see something like:

{"id":"chatcmpl-b54378bebb216a8f","object":"chat.completion","created":1768590077,"model":"phi-4-mini","choices":[{"index":0,"message":{"role":"assistant","content":"Bonjour!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":200020,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":7,"total_tokens":10,"completion_tokens":3,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}


Maintenance

Part VI: Updating vLLM

To update vLLM to the latest version:

Stop the service:

sudo systemctl stop vllm.service

Update vLLM:

sudo -u vllm bash <<'EOF'
cd /opt/vllm
source .venv/bin/activate
~/.local/bin/uv pip install --upgrade vllm
vllm --version
EOF

Restart the service:

sudo systemctl start vllm.service
sudo systemctl status vllm.service

Note: Major vLLM updates may introduce breaking changes. Review the vLLM changelog and release notes before upgrading production systems.


Notes

  • Network Binding: This guide defaults to 127.0.0.1:8000 (localhost only) for security. For network access, change to --host 0.0.0.0 in the launch script and place vLLM behind a reverse proxy with TLS, IP allowlisting, and rate limiting.
  • API Key Security: The --api-key parameter only protects the OpenAI-compatible /v1/* endpoints. Other endpoints (health checks, metrics) remain unauthenticated. For production deployments, combine API key authentication with network restrictions (localhost binding + reverse proxy) or firewall rules.
  • GPU Access: The vllm user must be in the video and render groups to access GPU devices. Without this, you’ll see permission denied errors on /dev/nvidia*.
  • Version Pinning: For production stability, consider pinning vLLM and CUDA versions: uv pip install vllm==0.x.x. vLLM moves quickly; pinning prevents unexpected breakage during rebuilds.
  • Performance Tuning: The --max-num-seqs 1 setting is conservative to prevent out-of-memory errors. Increase this value for better throughput once you’ve verified the model runs stably.
  • CUDA Toolkit: This guide installs the full CUDA toolkit (12.8). Many vLLM deployments only require NVIDIA drivers and runtime libraries. The toolkit is included here for completeness and compatibility.
  • Systemd Hardening: For production deployments, consider adding systemd security directives like NoNewPrivileges=true, PrivateTmp=true, ProtectSystem=strict, and ProtectHome=true to the service file (a sample drop-in appears after this list).
  • Gated Models: Some models (e.g., Llama) require Hugging Face authentication. Run hf login before downloading.
  • Model Verification: After download, verify key files exist: config.json, tokenizer files, and model weights.
  • Troubleshooting: View logs with journalctl -u vllm.service -f or restart with sudo systemctl restart vllm.service.
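
As referenced in the Systemd Hardening note above, a sample drop-in is sketched below. This is a starting point only: ProtectSystem=strict makes the filesystem read-only for the service, so ReadWritePaths must cover anything vLLM writes to (its home directory and cache here); adjust to your layout and test before relying on it.

sudo mkdir -p /etc/systemd/system/vllm.service.d
sudo tee /etc/systemd/system/vllm.service.d/hardening.conf >/dev/null <<'EOF'
[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/vllm
EOF
sudo systemctl daemon-reload
sudo systemctl restart vllm.service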

Dave Ziegler

I’m a full-stack AI/LLM practitioner and solutions architect with 30+ years of enterprise IT, application development, consulting, and technical communication experience.

While I currently engage in LLM consulting, application development, integration, local deployments, and technical training, my focus is on AI safety, ethics, education, and industry transparency.

Open to opportunities in technical education, system design consultation, practical deployment guidance, model evaluation, red teaming/adversarial prompting, and technical communication.

My passion is bridging the gap between theory and practice by making complex systems comprehensible and actionable.

Founding Member, AI Mental Health Collective

Community Moderator / SME, The Human Line Project

Let’s connect

Discord: AightBits