Image Maintenance
From UF HPC Wiki
We use SystemImager to provision our nodes, using its "Use Your Own Kernel" (UYOK) capability. That is, we maintain our own installation kernel and initrd.
Occasionally, we want to provision nodes whose hardware differs from what we currently use. To do that, we may need to add the kernel modules that support this "new" hardware to both the installation initrd and the provisioned initrd images.
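One way to find out which modules the "new" hardware needs is to match each device's modalias string (from sysfs) against the modules.alias file that depmod generates. This is roughly what modprobe does when it resolves an alias. The function below is a minimal sketch of that lookup and is not part of our tooling. The name modules_for_alias is hypothetical, and the sketch assumes the alias patterns use only shell-style wildcards.

```shell
# modules_for_alias MODALIAS ALIAS_FILE
# Hypothetical helper: print every module whose "alias" pattern in
# ALIAS_FILE (e.g. /lib/modules/<kver>/modules.alias) matches MODALIAS.
# Patterns such as "pci:v00008086d...sv*sd*bc*sc*i*" are matched with
# shell glob rules via the unquoted $pattern in the case statement.
modules_for_alias() {
    local modalias="$1" aliasfile="$2" kw pattern module
    while read -r kw pattern module; do
        # Skip comments, blank lines, and anything that is not an alias line.
        [ "$kw" = alias ] || continue
        case $modalias in
            $pattern) echo "$module" ;;
        esac
    done < "$aliasfile"
}
```

For example, `modules_for_alias "$(cat /sys/bus/pci/devices/0000:01:00.0/modalias)" /lib/modules/$(uname -r)/modules.alias | sort -u` would list candidate modules for one device. The PCI address there is a placeholder for the device in question.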
Adding a module to the Installation initrd
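Here are the steps to add a module to the installation initrd. In short, we load the necessary modules on a "golden client" (osg in the transcript below). We then run si_prepareclient (invoked here as ufsi_prepareclient) with --my-modules so that it incorporates the loaded modules into an initrd image. Finally, we copy that initrd to /tftpboot on our image server (imgsrv). The Lustre client and openibd services are stopped while the initrd is built and restarted afterward. Explicitly: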
[root@imgsrv ~]# ssh -x osg
Last login: Wed Oct 15 12:30:59 2008 from imgsrv.ufhpc
[root@osg ~]# modprobe <new modules>
[root@osg ~]# /etc/init.d/lustre-client stop
[root@osg ~]# /etc/init.d/openibd stop
[root@osg ~]# ufsi_prepareclient --server imgsrv.ufhpc --yes --my-modules
<...>
<lots of output suppressed>
<...>
[root@osg ~]# /etc/init.d/openibd start
[root@osg ~]# /etc/init.d/lustre-client start
[root@osg ~]# exit
[root@imgsrv ~]# cd /tftpboot/
[root@imgsrv tftpboot]# cp initrd.img-centos5.1-x86_64-uyok initrd.img-centos5.1-x86_64-uyok.bak
[root@imgsrv tftpboot]# scp osg:/etc/systemimager/boot/initrd.img initrd.img-centos5.1-x86_64-uyok
initrd.img 100% 41MB 40.9MB/s 00:01
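si_prepareclient incorporates the modules that are loaded at the time it runs. It can therefore help to confirm that the new modules really are loaded, after the Lustre and openibd services have been stopped, before running ufsi_prepareclient. The helper below is a hypothetical sketch that checks /proc/modules. missing_modules is not one of our scripts.

```shell
# missing_modules MODULES_FILE MODULE...
# Hypothetical helper: print each MODULE that is not listed in
# MODULES_FILE (normally /proc/modules); return 1 if any are missing,
# 0 if all of them are loaded.
missing_modules() {
    local modfile="$1" rc=0 mod name
    shift
    for mod in "$@"; do
        # The kernel lists module names with underscores, while modprobe
        # accepts '-' and '_' interchangeably, so normalise before comparing.
        name=$(printf '%s\n' "$mod" | sed 's/-/_/g')
        # The first field of each /proc/modules line is the module name.
        # If MODULES_FILE cannot be read, awk exits non-zero and the
        # module is reported as missing.
        if ! awk -v m="$name" '$1 == m { found = 1 } END { exit (found ? 0 : 1) }' "$modfile"; then
            echo "$mod"
            rc=1
        fi
    done
    return $rc
}
```

On the golden client this would be run as `missing_modules /proc/modules <new modules>`, just before ufsi_prepareclient. Any names it prints still need to be loaded.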
Adding a module to the Provisioned initrd
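If we add modules to the installation initrd, we also need to add them to the provisioned initrd, i.e., the one that resides on a compute node's disk. The steps are:

- Add the modules to our ufmkinitrd script, which runs mkinitrd with the --with options needed to build an initrd that works on any of our machines.
- Push the script out with rdist.
- Rebuild the initrd on a golden client (r1a-s42 below) and re-run si_prepareclient there.
- Pull a new image with si_getimage.

Explicitly: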
[root@imgsrv ~]# vi /opt/cluster/config/usr/local/sbin/ufmkinitrd # Add necessary modules
[root@imgsrv ~]# rdist -P /usr/bin/ssh -f /opt/cluster/Distfile -M 16 local-sbin
<...>
<lots of output suppressed>
<...>
[root@imgsrv ~]# ssh -x r1a-s42
Last login: Wed Oct 15 17:21:00 2008 from imgsrv.ufhpc
[root@r1a-s42 ~]# ufmkinitrd -v -f /boot/initrd-2.6.18-8.1.14.el5.L-1642.img 2.6.18-8.1.14.el5.L-1642
<...>
<lots of output suppressed>
<...>
[root@r1a-s42 ~]# ufsi_prepareclient --server imgsrv.ufhpc --yes
<...>
<lots of output suppressed>
<...>
[root@r1a-s42 ~]# exit
[root@imgsrv ~]# si_getimage --golden-client r1a-s42.ufhpc --image ComputeNodes --exclude "/local/scratch/*"
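The body of ufmkinitrd is not shown on this page, so the following is only a hypothetical sketch of what a wrapper of that kind might look like. It turns a module list into mkinitrd --with=MODULE options and then lists the new image to check that each module made it in. EXTRA_MODULES, with_args, missing_from_listing, and build_initrd are illustrative names. The check assumes a gzip-compressed cpio initrd, which is what the CentOS 5 mkinitrd produces.

```shell
# Placeholder module list; the real list lives in ufmkinitrd.
EXTRA_MODULES="mod_a mod_b"

# with_args MODULE...: print one mkinitrd --with=MODULE option per line.
with_args() {
    for m in "$@"; do
        printf '%s\n' "--with=$m"
    done
}

# missing_from_listing MODULE...: read an archive listing (one path per
# line) on stdin and print each MODULE that has no MODULE.ko entry.
# Returns 1 if anything is missing, 0 otherwise.
missing_from_listing() {
    local listing m rc=0
    listing=$(cat)
    for m in "$@"; do
        if ! printf '%s\n' "$listing" | grep -Eq "(^|/)$m\.ko\$"; then
            echo "$m"
            rc=1
        fi
    done
    return $rc
}

# build_initrd IMAGE KVER: run mkinitrd with the extra --with options,
# then verify the image contents (assumes gzip-compressed cpio).
build_initrd() {
    local image="$1" kver="$2"
    # $(with_args ...) is deliberately unquoted so each --with=... line
    # becomes a separate argument to mkinitrd.
    mkinitrd -v -f $(with_args $EXTRA_MODULES) "$image" "$kver" || return 1
    gzip -dc "$image" | cpio -it 2>/dev/null | missing_from_listing $EXTRA_MODULES
}
```

On a golden client this would be called as `build_initrd /boot/initrd-2.6.18-8.1.14.el5.L-1642.img 2.6.18-8.1.14.el5.L-1642`. Any module names it prints were not found in the image.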
Adding CUDA modules
The NVIDIA CUDA driver installer fails on nodes that have no NVIDIA graphics hardware, because it cannot load (insmod) the kernel module it has just built. Here is what we normally do to update the NVIDIA drivers:
- Install the driver, SDK, and toolkit on one of the Tesla nodes for testing. Build the whole SDK.
- Note: beginning with CUDA 2.3, the 64-bit libraries are placed in lib64/ instead of lib/. common.mk must be edited to point to the correct CUDA library directory for the SDK to build.
- Assuming the tests pass, deploy the driver, SDK, and toolkit on a non-Tesla node that we use as a golden client. Build the whole SDK. To prevent the driver installer from failing at insmod time, we use the --no-kernel-module installer option:
./cudadriver_2.3_linux_64_190.16.run --no-kernel-module
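- Copy the compiled driver module (nvidia.ko) from the Tesla node used for testing to the golden client, then run depmod -a to update the module dependency files. The module must have been built for the kernel the golden client runs (here, 2.6.18-128.1.6.el5):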
scp tesla1:/lib/modules/2.6.18-128.1.6.el5/kernel/drivers/video/nvidia.ko /lib/modules/2.6.18-128.1.6.el5/kernel/drivers/video/nvidia.ko
depmod -a
- Update the image from the golden client (e.g., with si_getimage, as in the provisioned-initrd procedure above).
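A few of the steps above can be checked mechanically. The sketch below is hypothetical: none of these functions are part of the NVIDIA installer or our scripts.

- has_nvidia_pci and installer_opts decide whether --no-kernel-module is needed by looking for NVIDIA's PCI vendor ID (10de) in lspci -n output.
- nvidia_ko_ok compares the vermagic string of a copied nvidia.ko with the running kernel.
- patch_common_mk rewrites a -L$(CUDA_INSTALL_PATH)/lib link path to lib64. It assumes common.mk spells that path exactly this way, so check your copy of common.mk first.

```shell
# has_nvidia_pci: read `lspci -n` output on stdin and succeed if any
# device reports NVIDIA's PCI vendor ID (10de).
has_nvidia_pci() {
    grep -Eqi '[[:space:]]10de:[0-9a-f]{4}'
}

# installer_opts: print extra driver-installer options for this node:
# nothing if NVIDIA hardware is present, --no-kernel-module otherwise.
installer_opts() {
    if lspci -n 2>/dev/null | has_nvidia_pci; then
        return 0
    fi
    echo "--no-kernel-module"
}

# kernel_matches VERMAGIC KVER: succeed if the first word of a module's
# vermagic string (the kernel it was built for) equals KVER.
kernel_matches() {
    [ "${1%% *}" = "$2" ]
}

# nvidia_ko_ok FILE: check a copied nvidia.ko against the running kernel.
nvidia_ko_ok() {
    local vermagic
    vermagic=$(modinfo -F vermagic "$1") || return 1
    kernel_matches "$vermagic" "$(uname -r)"
}

# patch_common_mk FILE: change -L$(CUDA_INSTALL_PATH)/lib to .../lib64
# (CUDA 2.3 and later), keeping a FILE.orig backup. "lib" must be
# followed by whitespace or end of line, so lib64 entries are untouched
# and running it twice is harmless.
patch_common_mk() {
    sed -i.orig \
        -e 's#-L\$(CUDA_INSTALL_PATH)/lib\([[:space:]]\)#-L$(CUDA_INSTALL_PATH)/lib64\1#g' \
        -e 's#-L\$(CUDA_INSTALL_PATH)/lib$#-L$(CUDA_INSTALL_PATH)/lib64#' \
        "$1"
}
```

For example, `./cudadriver_2.3_linux_64_190.16.run $(installer_opts)` would add --no-kernel-module only on nodes without NVIDIA hardware. After the copy, `nvidia_ko_ok /lib/modules/$(uname -r)/kernel/drivers/video/nvidia.ko` should succeed on the golden client.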
