Kubernetes NVIDIA GPU
A GPU can be used to run applications that leverage machine learning models as well as, as the name suggests, graphics applications. In the homelab I will use a GPU to run AI models, transcode media files, and any other application that can take advantage of GPU hardware acceleration.
Using a GPU in a Kubernetes cluster involves setting up the cluster to recognize and allocate GPU resources to pods that require them.
Prerequisites
First of all, you need a Kubernetes cluster and a computer with an NVIDIA GPU that is joined to the cluster as a node. If you are not familiar with Kubernetes, check out a previous post.
Install NVIDIA drivers
Install the drivers from your distribution's third-party repositories or by downloading them directly from NVIDIA.
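As a rough sketch, on a Debian-based system the drivers are usually available from the distribution's repositories once the non-free components are enabled; the exact package name varies by release and GPU generation, so treat the package below as a placeholder:
sudo apt-get update
# Package name varies: nvidia-driver on Debian, or a versioned package
# such as nvidia-driver-535 on Ubuntu.
sudo apt-get install nvidia-driver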
Verify the drivers are working:
nvidia-smi
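If nvidia-smi is missing or reports an error, a quick sanity check is to confirm the kernel modules actually loaded:
# The nvidia kernel modules should be listed once the driver is installed and loaded.
lsmod | grep nvidia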
Alternatively, you can have the operator Helm chart install the drivers later.
Install nvidia-container-toolkit
The NVIDIA Container Toolkit allows users to build and run GPU accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs.
Here is an example of installing the toolkit on a Debian-based system. First you install the apt repo, then you use apt-get commands to install the toolkit.
#!/usr/bin/env bash
# Add the NVIDIA container toolkit apt repository and its signing key.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit from the newly added repository.
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
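Depending on how the node was set up, the container runtime may also need to be pointed at the NVIDIA runtime. Assuming the node runs containerd (common for kubeadm and k3s clusters), the toolkit's nvidia-ctk helper can do this; note that the GPU operator installed in the next step can also manage this configuration for you.
# Configure containerd to use the NVIDIA container runtime, then restart it.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd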
Install the NVIDIA GPU operator on the k8s cluster
This operator scans the nodes in the cluster for NVIDIA GPUs and applies labels to those nodes so that pods requesting a GPU are scheduled onto a node that has one.
Prereqs
helm and kubectl CLI utilities
First create a namespace for the operator and allow privileged pods:
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
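You can confirm the label was applied:
kubectl get ns gpu-operator --show-labels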
The operator is installed via a Helm chart; it will identify available GPUs and add labels that can be used for scheduling in k8s.
Add the helm repo:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Install the helm chart with a few value overrides:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set psp.enabled=true
If you want the Helm chart to install the NVIDIA drivers, omit the driver.enabled=false option.
Once the install completes, you should see output similar to:
NAME: gpu-operator-1720620673
LAST DEPLOYED: Wed Jul 10 08:11:19 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
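Before deploying a workload, it is worth confirming that the operator pods are healthy and that the GPU node was labeled. The nvidia.com/gpu.present label used below is the one the operator's feature discovery normally applies, though label names can vary between operator versions:
# All operator pods should eventually be Running or Completed.
kubectl get pods -n gpu-operator

# Nodes with a detected GPU should carry nvidia.com/gpu labels and
# advertise an allocatable nvidia.com/gpu resource.
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'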
Deploy a sample application to test the operator
You can create a pod that requests a GPU to confirm that the operator applied all of the proper labels to the cluster and that the GPU is schedulable.
Create a file pod.yaml with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
To create this pod:
kubectl apply -f pod.yaml
To see if the pod worked, check its logs. First list the pods:
kubectl get pods
Once the pod has run, view its logs (the pod is named cuda-vectoradd in the manifest above):
kubectl logs pod/cuda-vectoradd
The logs should indicate that the vector addition test passed.
Next steps
If the validation succeeded, other pods can now be scheduled to use the GPU in the same way.
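As a sketch, any workload can request the GPU exactly as the test pod did, by adding an nvidia.com/gpu limit to its container spec. The deployment below is hypothetical; the name and image are placeholders for your own application:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-workload
  template:
    metadata:
      labels:
        app: gpu-workload
    spec:
      containers:
        - name: app
          image: your-gpu-enabled-image:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # request one GPU from the device plugin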