NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. Featuring the NVIDIA A100 Tensor Core GPU, DGX A100 enables enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.
New CUDA releases may have become available since the system was installed. To upgrade CUDA, follow the procedure below:
Download desired deb files:
wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/common/n/nvidia-repo-keys/nvidia-repo-keys_22.04-1_all.deb
wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/dgx/n/nvidia-repos/dgx-repo_21.07-1_amd64.deb
wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/common/n/nvidia-repos/cuda-compute-repo_21.07-1_amd64.deb
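The three downloads above share a common base URL, so they can be scripted if the step needs to be repeated on several nodes. This is only a minimal sketch, assuming `wget` is installed. `BASE_URL`, `deb_urls` and `download_debs` are illustrative names and not part of any NVIDIA tooling; the package paths are copied from the commands above.

```shell
# Sketch: build the three package URLs used in this procedure and fetch them.
# BASE_URL, deb_urls and download_debs are illustrative names only.
BASE_URL="https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool"

deb_urls() {
  # Print one URL per line; paths are copied from the wget commands above.
  printf '%s\n' \
    "$BASE_URL/common/n/nvidia-repo-keys/nvidia-repo-keys_22.04-1_all.deb" \
    "$BASE_URL/dgx/n/nvidia-repos/dgx-repo_21.07-1_amd64.deb" \
    "$BASE_URL/common/n/nvidia-repos/cuda-compute-repo_21.07-1_amd64.deb"
}

download_debs() {
  # Stop at the first failed download so a partial set is never installed.
  # The URLs contain no whitespace, so word splitting is safe here.
  for url in $(deb_urls); do
    wget "$url" || return 1
  done
}

# Usage: download_debs
```

Stopping on the first failure is a deliberate choice: the install step expects all three files to be present.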
Install downloaded debs:
sudo dpkg --force-confnew -i ./nvidia-repo-keys_22.04-1_all.deb ./dgx-repo_21.07-1_amd64.deb ./cuda-compute-repo_21.07-1_amd64.deb
Update repo list and upgrade packages if there are updates:
sudo apt update
sudo apt upgrade -y
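The driver removal in the next step assumes the installed driver belongs to the 450 branch. A hedged sketch for confirming that first: `driver_branch` and `installed_driver_branches` are hypothetical helper names, and the code assumes package names follow the `nvidia-driver-<branch>[-server]` pattern seen in this procedure.

```shell
# Sketch: report which NVIDIA driver branches are currently installed.
# driver_branch and installed_driver_branches are hypothetical helpers,
# not NVIDIA or apt commands.
driver_branch() {
  # Assumes names like nvidia-driver-450-server or nvidia-driver-515;
  # prints nothing for names that do not match.
  printf '%s\n' "$1" | sed -n 's/^nvidia-driver-\([0-9][0-9]*\).*$/\1/p'
}

installed_driver_branches() {
  # Keep only fully installed packages, then reduce each name to its branch.
  dpkg-query -W -f='${Status} ${Package}\n' 'nvidia-driver-*' 2>/dev/null |
    sed -n 's/^install ok installed //p' |
    while IFS= read -r pkg; do driver_branch "$pkg"; done | sort -u
}

# Usage: installed_driver_branches
# On a system still on the old driver this should list the 450 branch.
```

If the output shows a different branch, the purge pattern would need to be adjusted to match it.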
Remove old driver and install the new one:
sudo apt-get purge "*nvidia*450*"
sudo apt install -y linux-modules-nvidia-515-server-generic nvidia-driver-515-server libnvidia-nscq-515 nvidia-modprobe nvidia-fabricmanager-515
sudo systemctl unmask nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
sudo apt install -y --reinstall nvidia-peer-memory-dkms
sudo /usr/sbin/update-rc.d nv_peer_mem defaults
Reboot DGX box.
Install the CUDA toolkit, then verify the update using the following commands:
sudo apt install cuda-toolkit-11-7
apt list --installed "nvidia-driver*server"
apt list --installed "cuda-toolkit*"
systemctl status nvidia-fabricmanager.service
cat /etc/dgx-release
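The checks above can be complemented by asking the loaded driver for its version and comparing it with the branch that was installed. This is a sketch under assumptions: `EXPECTED_BRANCH`, `version_major` and `check_driver` are illustrative names, and 515 is taken from the package names used in this procedure.

```shell
# Sketch: compare the loaded driver's branch with the expected one.
# EXPECTED_BRANCH, version_major and check_driver are illustrative names.
EXPECTED_BRANCH=515

version_major() {
  # Strip everything from the first dot onward, e.g. "515.65.01" -> "515".
  printf '%s\n' "${1%%.*}"
}

check_driver() {
  # nvidia-smi prints one version line per GPU; the first is enough here.
  # If nvidia-smi is missing or fails, ver is empty and the check fails.
  ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  [ "$(version_major "$ver")" = "$EXPECTED_BRANCH" ]
}

# Usage, after the reboot:
#   check_driver && echo "driver branch OK" || echo "driver branch mismatch"
```

A mismatch after reboot usually means the old kernel module is still loaded or the new driver packages did not install cleanly, so the output of the `apt list` commands is worth re-checking in that case.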