[GE users] Installing and Configuring SGE (as execute-host) on Diskless-Cluster Nodes

jbhoren jbhoren at alaska.edu
Tue Apr 6 01:03:53 BST 2010

Installing SGE on a Diskless Cluster

The cluster's head-node is used as the "Golden Node", and its operating system (RHEL 5.5) was copied to /pxe-diskless/root using this command:

rsync -a --exclude='/proc' --exclude='/sys' --exclude='/pxe-diskless' / /pxe-diskless/root/
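If you need to refresh the image later, the copy can be wrapped in a small function. This is a minimal sketch, not the exact command used above. The function name golden_sync and the DRY_RUN switch are hypothetical. The '/proc/*' and '/sys/*' patterns are a variant of the original excludes: they skip the contents but keep the empty directories as mount points. The '/pxe-diskless' exclude stops rsync from copying the image into itself.

```shell
# Hypothetical wrapper around the Golden Node copy; $1 overrides the
# destination. With DRY_RUN set to any non-empty value, ${DRY_RUN:+echo}
# prints the command instead of running it.
golden_sync() {
    dest=${1:-/pxe-diskless/root/}
    # '/proc/*' and '/sys/*' skip the contents but keep the empty
    # directories as mount points; '/pxe-diskless' keeps the image
    # from being copied into itself.
    ${DRY_RUN:+echo} rsync -a \
        --exclude='/proc/*' \
        --exclude='/sys/*' \
        --exclude='/pxe-diskless' \
        / "$dest"
}
```

Preview with DRY_RUN=1 golden_sync, then run golden_sync as root.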

system-config-netboot was used to create the TFTP boot images and to add the cluster nodes to /pxe-diskless/snapshot.

Install the RPMs (sun-sge-common-6.2-5 and sun-sge-bin-linux24-x64-6.2-5) on the head-node, then chroot into /pxe-diskless/root and install them there as well.
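A sketch of both installs, assuming the .rpm files are in the current directory. The function name install_sge_rpms and the IMG override are hypothetical. It uses rpm's --root option in place of an interactive chroot session. That is a swapped-in technique, so fall back to the chroot if your rpm behaves differently.

```shell
# Hypothetical helper: install the SGE RPM files given as arguments on the
# head-node, then into the diskless image via rpm --root (instead of an
# interactive chroot). IMG overrides the image path; DRY_RUN prints only.
install_sge_rpms() {
    img=${IMG:-/pxe-diskless/root}
    # Head-node (Golden Node) itself.
    ${DRY_RUN:+echo} rpm -ivh "$@"
    # Same packages into the read-only root image served to the nodes.
    ${DRY_RUN:+echo} rpm --root "$img" -ivh "$@"
}
```

For example, install_sge_rpms sun-sge-common-*.rpm sun-sge-bin-linux24-x64-*.rpm (the exact file names depend on your download).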

On the head-node, cd /gridengine/sge and run ./inst_sge -m -x to install the qmaster and an execution daemon together.

Copy the entire "default" cell directory (/gridengine/sge/default) into /pxe-diskless/root/gridengine/sge.

Copy the sgeexecd startup script from /etc/init.d into /pxe-diskless/root/etc/init.d.

On the head-node, cd /gridengine/sge and run ./inst_sge -ux to uninstall the execution daemon from the head-node again.
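The head-node steps above can be collected into one function. This sketch uses the SGE 6.2 installer script inst_sge, which prompts interactively. The function name prepare_head_node and the SGE_ROOT/IMG overrides are hypothetical. cp -a is used so that ownership and permissions of the cell directory survive the copy.

```shell
# Hypothetical wrapper for the head-node steps. SGE_ROOT and IMG default
# to the paths used in this post; DRY_RUN prints the commands only.
prepare_head_node() {
    sge_root=${SGE_ROOT:-/gridengine/sge}
    img=${IMG:-/pxe-diskless/root}
    ${DRY_RUN:+echo} cd "$sge_root" || return 1
    # Install qmaster and execd together, creating the "default" cell.
    ${DRY_RUN:+echo} ./inst_sge -m -x || return 1
    # Copy the cell directory and the execd init script into the image.
    ${DRY_RUN:+echo} cp -a "$sge_root/default" "$img$sge_root/"
    ${DRY_RUN:+echo} cp -a /etc/init.d/sgeexecd "$img/etc/init.d/"
    # Remove the execd from the head-node again.
    ${DRY_RUN:+echo} ./inst_sge -ux
}
```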

Edit /pxe-diskless/snapshot/files.custom and add the line: /gridware/sge/default/spool/
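To make this edit safe to repeat, append the path only if it is not already listed. A minimal sketch. The function name add_custom_entry is hypothetical, and it assumes files.custom holds one path per line, as in the step above.

```shell
# Hypothetical helper: $1 = files.custom path, $2 = line to add.
add_custom_entry() {
    file=${1:-/pxe-diskless/snapshot/files.custom}
    entry=${2:-/gridware/sge/default/spool/}
    # -x matches whole lines only, -F treats the entry as a fixed string.
    if ! grep -qxF "$entry" "$file" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$file"
    fi
}
```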

This causes each node's own read-write copy of /gridware/sge/default/spool, kept under /pxe-diskless/snapshot, to take the place of the SGE spool directory inherited from the Golden Node. This is vital: everything from /pxe-diskless/root is mounted read-only, while everything from /pxe-diskless/snapshot is mounted read-write, and each node's SGE execd needs to write to its local spool directory.

Make sure that the SGE qmaster daemon is running on the cluster's head-node, and use /sbin/chkconfig to ensure that it starts automatically at boot.
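A sketch of both checks. The init-script name sgemaster is an assumption made by analogy with sgeexecd, so check /etc/init.d for the actual name on your system and pass it as $1. The function names are hypothetical. Runlevels 345 mirror the sgeexecd command later in this post.

```shell
# Hypothetical helper. "sgemaster" is an ASSUMED init-script name;
# pass the real one from /etc/init.d as $1. DRY_RUN prints only.
enable_qmaster() {
    svc=${1:-sgemaster}
    ${DRY_RUN:+echo} /sbin/chkconfig --add "$svc"
    ${DRY_RUN:+echo} /sbin/chkconfig --level 345 "$svc" on
}

# Succeeds if an sge_qmaster process is running on this host.
qmaster_running() {
    pgrep -x sge_qmaster >/dev/null
}
```

For example: enable_qmaster && qmaster_running && echo "qmaster is up".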

On the head-node, create /gridengine/sge/default/common/sge_qstat with the line -u * as its only contents, so that qstat lists the jobs of all users by default rather than only your own.
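Writing the file from a script needs the asterisk quoted, or the shell will expand it as a glob. A minimal sketch; the function name write_sge_qstat is hypothetical.

```shell
# Hypothetical helper: $1 = the cell's "common" directory.
write_sge_qstat() {
    common=${1:-/gridengine/sge/default/common}
    # Quoted so that '*' reaches the file literally.
    printf '%s\n' '-u *' > "$common/sge_qstat"
}
```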

Log in to each cluster compute node, cd /gridengine/sge, and run ./install_execd.

Accept the default values, but do not let the installer write or replace the SGE execd startup scripts (after all, /etc/init.d is on a filesystem that is mounted read-only). Verify that the SGE execd is running.
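With many nodes, the installer and the check can be driven from the head-node over ssh. This is a sketch assuming password-less ssh to the nodes. The node names and function names are hypothetical. ssh -t allocates a terminal, which the interactive installer needs. You still answer its prompts by hand, including declining the startup scripts.

```shell
# Hypothetical loop over compute nodes (names passed as arguments).
# DRY_RUN prints the ssh commands instead of running them.
install_execd_on_nodes() {
    sge_root=${SGE_ROOT:-/gridengine/sge}
    for node in "$@"; do
        ${DRY_RUN:+echo} ssh -t "$node" "cd $sge_root && ./install_execd"
    done
}

# Report whether an sge_execd process is running on each node.
check_execd_on_nodes() {
    for node in "$@"; do
        if ssh "$node" pgrep -x sge_execd >/dev/null 2>&1; then
            echo "$node: sge_execd running"
        else
            echo "$node: sge_execd NOT running"
        fi
    done
}
```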

When you have finished with all nodes, chroot /pxe-diskless/root and run /sbin/chkconfig --add sgeexecd, followed by /sbin/chkconfig --level 345 sgeexecd on.
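The same two commands can be run non-interactively by passing them to chroot directly. A minimal sketch; the function name and IMG override are hypothetical.

```shell
# Hypothetical helper: register sgeexecd inside the diskless image so that
# every node starts it at boot. DRY_RUN prints the commands only.
enable_execd_in_image() {
    img=${IMG:-/pxe-diskless/root}
    ${DRY_RUN:+echo} chroot "$img" /sbin/chkconfig --add sgeexecd
    ${DRY_RUN:+echo} chroot "$img" /sbin/chkconfig --level 345 sgeexecd on
}
```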

Reboot all of the cluster's compute nodes. On the head-node, run qstat -f and qmon to verify that all execute nodes and their queues are "present and accounted for".
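For a scripted check alongside qstat -f and qmon, you can compare the execution-host list printed by qconf -sel against the nodes you expect. This is a sketch. The function name is hypothetical, and the names you pass must match the form qconf prints (short or fully qualified).

```shell
# Hypothetical check: reads an execution-host list on stdin (e.g. from
# `qconf -sel`) and reports each expected node that is missing.
# Returns non-zero if any node is missing.
missing_exec_hosts() {
    listed=$(cat)
    status=0
    for node in "$@"; do
        if ! printf '%s\n' "$listed" | grep -qxF "$node"; then
            echo "missing: $node"
            status=1
        fi
    done
    return $status
}
```

For example: qconf -sel | missing_exec_hosts node01 node02 node03.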

Systems Administrator
UAF Life Science Informatics
Center for Research Services
jbhoren at alaska.edu

