---
title: "Setting up Ollama with CUDA on Podman Quadlets"
date: 2025-03-29T09:59:55-04:00
draft: false
tags: []
math: false
medium_enabled: false
---

[Open WebUI](https://www.openwebui.com/) provides a nice chat interface for interacting with LLMs over Ollama and OpenAI-compatible APIs. Using [Ollama](https://ollama.com/), we can self-host many different open-source LLMs! This post documents the steps I took to get Ollama working with CUDA in my Podman setup. However, given how fast machine learning projects iterate, I wouldn't be surprised if these exact steps no longer work. In that case, I'll provide links to the official documentation, which hopefully can help.

I'll assume that you already have the NVIDIA driver installed on your machine. The steps vary by OS/distribution and by how modern a driver you want, but I generally recommend sticking with what's packaged in your distribution's repository. This is to minimize headaches...

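If you're not sure whether the driver is actually loaded, a quick sanity check on the host (outside of any container) looks something like this:

```bash
# Prints the driver version and any GPUs it can see
nvidia-smi

# Alternatively, report the version of the installed kernel module
modinfo -F version nvidia
```
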
With that, our first step is to install the `nvidia-container-toolkit`. This package contains a collection of libraries and scripts that let containers access the GPU.

```bash
sudo dnf install nvidia-container-toolkit
```

As of the time of writing, instructions for installing the toolkit can be found on [NVIDIA's website](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

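On `dnf`-based distributions, that guide (at the time I checked) boils down to adding NVIDIA's package repository before running the install command above. Roughly:

```bash
# Add the nvidia-container-toolkit repository.
# Taken from NVIDIA's install guide; double-check the current URL there first.
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
```
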
We can use this toolkit to generate a Container Device Interface (CDI) file, which Podman will use to talk to the GPU.

```bash
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

**Note:** Every time you update your NVIDIA driver, you'll have to re-run this command.

NVIDIA also documents the steps for configuring CDI on [their website](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html#running-a-workload-with-cdi).

From here, we should make sure that the NVIDIA toolkit found the appropriate GPU(s) and generated CDI devices for them.

```bash
nvidia-ctk cdi list
```

I only have one GPU on my machine, so it outputs something like the following:

```
INFO[0000] Found 3 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=GPU-52785a8a-f8ca-99b9-0312-01a1f59e789b
nvidia.com/gpu=all
```

If you want your container to be able to access all of the GPUs, use the `nvidia.com/gpu=all` device name. Otherwise, use one of the specific devices.

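For example, a container that should only see the first GPU would be given that specific device name instead of `all`:

```bash
# Only GPU index 0 is injected into the container;
# the GPU-<uuid> name from `nvidia-ctk cdi list` works the same way
sudo podman run --rm \
  --device nvidia.com/gpu=0 \
  docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
  nvidia-smi
```
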
Then, we restart Podman so that the CDI files are loaded.

```bash
sudo systemctl restart podman
```

For our first test, we'll make sure that the container can properly access the GPU by running the `nvidia-smi` command.

```bash
sudo podman run --rm \
  --device nvidia.com/gpu=all \
  docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
  nvidia-smi
```

For my GPU it outputs:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.04             Driver Version: 570.124.04     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:02:00.0  On |                  N/A |
|  0%   50C    P8             19W /  170W |    1546MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

```

Now we are ready to set up Ollama! To save time when running our `systemd` commands, let's pull the image ahead of time.

```bash
sudo podman pull docker.io/ollama/ollama
```

We'll have to save the models somewhere, so in this example we'll save them to `/opt/ollama`.

```bash
sudo mkdir /opt/ollama
```

Let's configure the Quadlet. Save the following to `/etc/containers/systemd/ollama.container`:

```ini
[Container]
ContainerName=ollama
HostName=ollama
Image=docker.io/ollama/ollama
AutoUpdate=registry
Volume=/opt/ollama:/root/.ollama
PublishPort=11434:11434
AddDevice=nvidia.com/gpu=all

[Unit]

[Service]
Restart=always

[Install]
WantedBy=default.target
```

This file specifies the flags that we would otherwise pass to the `podman run` command:

- Publish port 11434: This is the port we'll use when sending messages to Ollama from Open WebUI. Of course, you're welcome to use other networking tricks to pull that off.
- Mount the folder `/opt/ollama` on the filesystem to `/root/.ollama` within the container: We don't want to have to re-download the LLM models each time!

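Quadlet turns this file into a regular systemd service the next time the systemd generators run. Before starting anything, it's worth confirming that the unit was actually generated (a quick check, assuming the file path above):

```bash
# Re-run the systemd generators so the new .container file is picked up
sudo systemctl daemon-reload

# Print the ollama.service unit that Quadlet generated from ollama.container
sudo systemctl cat ollama.service
```
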
For the moment of truth, let's start it!

```bash
sudo systemctl start ollama
```

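If something doesn't come up, the first places to look are the service status and the container's logs:

```bash
# Confirm the generated service started cleanly
sudo systemctl status ollama

# Ollama's own startup logs, including whether it detected the GPU
sudo podman logs ollama
```
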
I won't show how to configure Open WebUI in this post, but we can make sure that everything is working by looking at the Ollama container itself.

```bash
sudo podman exec -it ollama /bin/bash
```

We'll perform a test with a smaller model (1.2 GB):

```bash
ollama run llama3.2:1b
```

Depending on your Internet connection, this will take a couple of minutes to download and load onto the GPU.

When it's done, the prompt will be replaced with:

```
>>>
```

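At this point the model is loaded, so a second shell inside the container (or `nvidia-smi` on the host) can confirm that it actually landed on the GPU:

```bash
# The PROCESSOR column should report something like "100% GPU" rather than CPU
ollama ps
```
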
From here you can chat with the LLM!

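Since port 11434 is published to the host, anything that speaks the Ollama API (Open WebUI included) can reach it there. As a final check, the HTTP API can be exercised directly from the host with `curl`:

```bash
# One-off completion over the published port, using the model we just pulled
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```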