How to Upgrade CUDA on NVIDIA DGX A100

NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. Featuring the NVIDIA A100 Tensor Core GPU, DGX A100 enables enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.

Newer CUDA releases may have become available since the system was installed. To upgrade CUDA, follow this procedure:

Download desired deb files:

wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/common/n/nvidia-repo-keys/nvidia-repo-keys_22.04-1_all.deb

wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/dgx/n/nvidia-repos/dgx-repo_21.07-1_amd64.deb

wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/common/n/nvidia-repos/cuda-compute-repo_21.07-1_amd64.deb
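Before installing, it can help to confirm that all three downloads actually completed. The following is a minimal sketch, assuming the files were saved to the current directory; `check_debs` is a hypothetical helper name, not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# check_debs: hypothetical helper (an assumption for illustration).
# Returns 0 only if every file passed as an argument exists and is non-empty.
check_debs() {
    local f missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as `check_debs ./nvidia-repo-keys_22.04-1_all.deb ./dgx-repo_21.07-1_amd64.deb ./cuda-compute-repo_21.07-1_amd64.deb` and only proceed to the install step when it succeeds.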

Install downloaded debs:

sudo dpkg --force-confnew -i ./nvidia-repo-keys_22.04-1_all.deb ./dgx-repo_21.07-1_amd64.deb ./cuda-compute-repo_21.07-1_amd64.deb

Update repo list and upgrade packages if there are updates:

sudo apt update
sudo apt upgrade -y

Remove old driver and install the new one:

sudo apt-get purge '*nvidia*450*'
sudo apt install -y linux-modules-nvidia-515-server-generic nvidia-driver-515-server libnvidia-nscq-515 nvidia-modprobe nvidia-fabricmanager-515
sudo systemctl unmask nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
sudo apt install -y --reinstall nvidia-peer-memory-dkms
sudo /usr/sbin/update-rc.d nv_peer_mem defaults
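Before rebooting, it may be worth confirming that no packages from the old 450 branch remain installed. The sketch below mirrors the purge pattern used above; `list_driver_branch` is a hypothetical helper that simply filters package names read from standard input.

```shell
#!/usr/bin/env bash
# list_driver_branch: hypothetical helper (an assumption for illustration).
# Reads package names on stdin and prints those containing "nvidia"
# followed later by the given branch number, e.g. 450.
list_driver_branch() {
    # grep exits non-zero when nothing matches; treat that as success,
    # since "no old packages left" is the desired outcome.
    grep -- "nvidia.*$1" || true
}

# Example use on the DGX (left commented out so the snippet has no side effects):
#   dpkg-query -W -f='${Package}\n' | list_driver_branch 450
```

Empty output from the commented pipeline would suggest the purge removed everything matching the pattern.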

Reboot the DGX system.

After the reboot, install the CUDA toolkit and then check that the upgrade worked using the following commands:

sudo apt install cuda-toolkit-11-7
apt list --installed nvidia-driver*server
apt list --installed cuda-toolkit*
systemctl status nvidia-fabricmanager.service
cat /etc/dgx-release
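The checks above can be complemented by asking the driver itself which version is loaded. This is a sketch under the assumption that `nvidia-smi` is on the PATH after the reboot; `driver_branch` is a hypothetical helper, and the version string in its comment is only an example value.

```shell
#!/usr/bin/env bash
# driver_branch: hypothetical helper (an assumption for illustration).
# Prints the major branch of a driver version string,
# e.g. "515.65.01" -> "515" (example value only).
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Only query the GPU when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if [ "$(driver_branch "$version")" = "515" ]; then
        echo "driver branch 515 is active ($version)"
    else
        echo "unexpected driver version: $version" >&2
    fi
fi
```

Comparing only the branch number keeps the check valid across point releases within the 515 series.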

Published by razvan on November 10, 2022