Image Maintenance

From UF HPC Wiki

We use SystemImager to provision our nodes, relying on its "Use Your Own Kernel" (UYOK) capability. That is, we maintain our own installation kernel and initrd.

Occasionally, we want to provision nodes whose hardware differs from what we are currently using. To do that, we may need to add modules that support this "new" hardware to both the installation and provisioned initrd images.

Adding a module to the Installation initrd

Here are the steps to add a module to the installation initrd. In short, we load the necessary modules on a "golden client", instruct si_prepareclient to incorporate those loaded modules into an initrd image, and copy that initrd to our image server. Explicitly:

[root@imgsrv ~]# ssh -x osg
Last login: Wed Oct 15 12:30:59 2008 from imgsrv.ufhpc
[root@osg ~]# modprobe <new modules>
[root@osg ~]# /etc/init.d/lustre-client stop
[root@osg ~]# /etc/init.d/openibd stop
[root@osg ~]# ufsi_prepareclient --server imgsrv.ufhpc --yes --my-modules
<...>
<lots of output suppressed>
<...>
[root@osg ~]# /etc/init.d/openibd start
[root@osg ~]# /etc/init.d/lustre-client start
[root@osg ~]# exit
[root@imgsrv ~]# cd /tftpboot/
[root@imgsrv tftpboot]# cp initrd.img-centos5.1-x86_64-uyok initrd.img-centos5.1-x86_64-uyok.bak
[root@imgsrv tftpboot]# scp osg:/etc/systemimager/boot/initrd.img initrd.img-centos5.1-x86_64-uyok
initrd.img                                                                           100%   41MB  40.9MB/s   00:01    
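
The backup-and-copy step at the end of the session above can be wrapped so that a missing or empty file never replaces the working initrd. This is only a minimal sketch and not part of our tooling: the function name install_initrd and the .tmp staging suffix are hypothetical.

```shell
# Hypothetical helper: replace an initrd while keeping a .bak copy of the old one.
install_initrd() {
    new=$1      # freshly built initrd, e.g. copied from the golden client
    target=$2   # initrd that installing nodes fetch from the image server

    # Refuse to install a missing or zero-length file.
    if [ ! -s "$new" ]; then
        echo "new initrd missing or empty: $new" >&2
        return 1
    fi

    # Keep the previous initrd so we can roll back.
    if [ -e "$target" ]; then
        cp -p "$target" "$target.bak" || return 1
    fi

    # Stage next to the target, then rename, so a partial copy is never served.
    cp "$new" "$target.tmp" && mv "$target.tmp" "$target"
}
```

For example, after copying the new initrd to a scratch location on imgsrv (the /tmp path is illustrative): install_initrd /tmp/initrd.img /tftpboot/initrd.img-centos5.1-x86_64-uyok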

Adding a module to the Provisioned initrd

If we add modules to the installation initrd, we also need to add them to the provisioned initrd that will reside on, for example, a compute node's disk. To do that, we run mkinitrd with the --with options needed to build an initrd that works on any of our machines, and then pull a new image. Explicitly:

[root@imgsrv ~]# vi /opt/cluster/config/usr/local/sbin/ufmkinitrd   # Add necessary modules
[root@imgsrv ~]# rdist -P /usr/bin/ssh -f /opt/cluster/Distfile -M 16 local-sbin
<...>
<lots of output suppressed>
<...>
[root@imgsrv ~]# ssh -x r1a-s42
Last login: Wed Oct 15 17:21:00 2008 from imgsrv.ufhpc
[root@r1a-s42 ~]# ufmkinitrd -v -f /boot/initrd-2.6.18-8.1.14.el5.L-1642.img 2.6.18-8.1.14.el5.L-1642
<...>
<lots of output suppressed>
<...>
[root@r1a-s42 ~]# ufsi_prepareclient --server imgsrv.ufhpc --yes
<...>
<lots of output suppressed>
<...>
[root@r1a-s42 ~]# exit
[root@imgsrv ~]# si_getimage --golden-client r1a-s42.ufhpc --image ComputeNodes --exclude "/local/scratch/*"
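
The contents of ufmkinitrd are not shown on this page. As a rough sketch of the idea only, a wrapper of this kind could turn a module list into mkinitrd --with= options and pass the remaining arguments through. The module names and the with_args helper below are hypothetical placeholders, not our actual configuration.

```shell
# Hypothetical module list; the real list is maintained in ufmkinitrd.
MODULES="new_mod_a new_mod_b"

# Print one --with=<module> option per line for each module name given.
with_args() {
    for m in "$@"; do
        printf '%s\n' "--with=$m"
    done
}

# A wrapper could then end with something like the following, leaving
# $MODULES and the command substitution unquoted on purpose so that
# they split into separate words:
#   mkinitrd $(with_args $MODULES) "$@"
```

With a wrapper shaped like this, adding support for new hardware means appending a name to MODULES, and the ufmkinitrd invocation shown in the session above stays the same.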

Adding CUDA modules

The NVIDIA installer for the CUDA drivers will fail on nodes that have no NVIDIA graphics hardware. Here is what we normally do to update the NVIDIA drivers.

  • Install the driver, SDK, and toolkit for testing on one of the Tesla nodes. Build the whole SDK.
  • Note: beginning with CUDA 2.3, 64-bit libraries are placed in lib64/ instead of lib/. common.mk must be edited to point to the correct CUDA libdir before the SDK will build.
  • Assuming the tests work, deploy the driver, SDK, and toolkit on a non-Tesla node that we use as a golden client. Build the whole SDK. To prevent the driver installer from failing at insmod time, we use the --no-kernel-module installer option:
./cudadriver_2.3_linux_64_190.16.run --no-kernel-module
  • Copy the driver kernel module itself (nvidia.ko) from the Tesla node used for testing to the golden client:
scp tesla1:/lib/modules/2.6.18-128.1.6.el5/kernel/drivers/video/nvidia.ko /lib/modules/2.6.18-128.1.6.el5/kernel/drivers/video/nvidia.ko
depmod -a
  • Update the image.
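
The libdir change noted in the list above can be detected rather than remembered. The following is a minimal sketch, assuming the toolkit root is passed as an argument; cuda_libdir is a hypothetical helper name, and /usr/local/cuda in the usage note is only an illustrative location.

```shell
# Hypothetical helper: print the library directory of a CUDA toolkit tree.
# Prefers lib64/ (CUDA 2.3 and later on 64-bit) and falls back to lib/.
cuda_libdir() {
    root=$1
    if [ -d "$root/lib64" ]; then
        echo "$root/lib64"
    else
        echo "$root/lib"
    fi
}
```

Running cuda_libdir /usr/local/cuda prints the directory that common.mk should reference before the SDK is built.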