Kubernetes NVIDIA GPU
As the name suggests, a GPU can be used to run graphics applications, but it can also accelerate applications that leverage machine learning models. In the homelab I will use a GPU to run AI models, transcode media files, and speed up any other application that can take advantage of GPU hardware acceleration.
Using a GPU in a Kubernetes cluster involves setting up the cluster to recognize and allocate GPU resources to pods that require them.
Prerequisites
First of all, you need a Kubernetes cluster and a computer with an NVIDIA GPU that is joined to that cluster as a node. If you are not familiar with Kubernetes, check out a previous post.
Install NVIDIA drivers
Install the drivers from your distribution's third-party repositories or by downloading them directly from NVIDIA.
Verify the drivers are working:
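```bash
# Prints the driver version, CUDA version, and every detected GPU.
nvidia-smi
```
If the command prints a table listing your GPU instead of an error, the driver is loaded.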
Alternatively, you can install drivers via the operator Helm chart later.
Install nvidia-container-toolkit
The NVIDIA Container Toolkit allows users to build and run GPU accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs.
Here is an example of installing the toolkit on a Debian-based system: first add the apt repository, then use apt-get to install the toolkit.
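A sketch following NVIDIA's install guide at the time of writing; verify the repository setup against the current documentation:

```bash
# Add NVIDIA's apt repository and signing key (per NVIDIA's install guide).
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
```

If your nodes use containerd (typical for Kubernetes), also point the runtime at the NVIDIA runtime and restart it:

```bash
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```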
Install NVIDIA GPU Operator on k8s cluster
This operator scans the nodes in the cluster for NVIDIA GPUs and applies labels to those nodes so that pods needing a GPU device can be scheduled onto them.
Prereqs
The helm and kubectl CLI utilities.
First create a namespace for the operator and allow privileged pods:
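A sketch, using gpu-operator as the namespace name and the Pod Security Admission label to allow privileged pods:

```bash
# Create the namespace the operator will run in.
kubectl create namespace gpu-operator

# The operator's components need privileged access to the host,
# so relax Pod Security Admission for this namespace.
kubectl label --overwrite namespace gpu-operator \
  pod-security.kubernetes.io/enforce=privileged
```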
The operator is installed via a Helm chart; once running, it identifies the available GPUs and adds the node labels that can be used for scheduling in k8s.
Add the helm repo:
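The operator lives in NVIDIA's chart repository:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```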
Install the helm chart with a few value overrides:
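Something like the following: driver.enabled=false skips the in-cluster driver because we installed it on the host, and toolkit.enabled=false is my assumption to match the manual toolkit install above:

```bash
# driver.enabled=false: the NVIDIA driver was installed on the host earlier.
# toolkit.enabled=false is an assumption to match the manual toolkit install;
# drop it if you want the operator to manage the container toolkit.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --wait \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```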
If you want the Helm chart to install the NVIDIA drivers instead, omit the driver.enabled=false option.
Once the install completes, you should see output similar to the following:
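Roughly this, with your own timestamp:

```
NAME: gpu-operator
LAST DEPLOYED: <timestamp>
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
```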
Deploy a sample application to test the operator
You can create a pod that requests a GPU to confirm that the operator applied all of the proper labels to the cluster.
Create a file named pod.yaml:
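A minimal sketch based on NVIDIA's CUDA vectorAdd sample; the image tag below is one published tag at the time of writing, so check nvcr.io for a current one:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # NVIDIA's sample workload; the tag is an example -- check nvcr.io for current tags.
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU; the device plugin handles allocation
```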
To create this pod:
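```bash
kubectl apply -f pod.yaml
```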
To see whether the pod worked, first list the pods to find its name:
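```bash
kubectl get pods
```
The sample runs to completion, so the pod's status should eventually show Completed.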
Once you see the name of the pod, check its logs:
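```bash
kubectl logs cuda-vectoradd
```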
The logs should indicate success; for the vectorAdd sample, they end with Test PASSED.
Next steps
If the validation was a success, other pods can now be scheduled onto the GPU node by requesting the nvidia.com/gpu resource in the same way.