[GE users] Housekeeping, cleanup processes etc.

Lönroth Erik erik.lonroth at scania.com
Fri Dec 14 09:18:31 GMT 2007


Hello!
 
A while ago I asked for hints about removing "rogue/runaway/leftover processes" from a running SGE cluster. I didn't find any appealing solutions, so I wrote my own, and I thought I'd share it with the community.
 
I can't send attachments from where I am, so this mail contains the script inline. I'm also no bash guru, so don't expect art in the code. It needs some modifications if your nodes run something other than Linux, and you will also need to change a search string if your nodes are on a "different" domain.
 
Schedule this script as a cron job that runs every hour or so on each node except the master nodes. It asks SGE for the list of known jobs (qstat) and for every user who has ever run a job (qacct). It then loops over those users and kills all of their processes, except for users who, according to SGE, have jobs running on the node. It removes semaphores and shared memory segments (shm) in the same manner.
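As a rough sketch of the selection step described above (this is not part of the script itself): the function name users_to_clean and the user names are made up for illustration. Note that this sketch compares names exactly, whereas the script below filters with a regular expression and therefore matches substrings.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Usage: users_to_clean ALL_USERS USERS_WITH_JOBS EXCLUDED_USERS
# Each argument is a whitespace-separated list of user names.
# Prints, one per line, every user in ALL_USERS that appears in neither
# USERS_WITH_JOBS nor EXCLUDED_USERS (exact name comparison).
users_to_clean () {
   local all_users=$1 busy_users=$2 excluded_users=$3
   local user keep protected
   for user in $all_users; do
      protected=false
      for keep in $busy_users $excluded_users; do
         if [ "$user" = "$keep" ]; then
            protected=true
            break
         fi
      done
      if [ "$protected" = false ]; then
         echo "$user"
      fi
   done
}

# Made-up example: alice has a job on this node, root and sge are excluded,
# so only bob and carol would be cleaned up.
users_to_clean "alice bob carol root" "alice" "root sge"
```

In the real script the three lists come from qacct, qstat and the exclude_users_from_kill variable respectively.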
 
We have run this in production for quite a while now, and it has eliminated our problem with "leftover processes".
 
Here is the script. I hope it helps you get a "cleaner" cluster.
 
#!/bin/bash
 
# set to true for debug
#DEBUG=true
SYSLOG_PRIORITY="local1.info"
 
# Debug print to syslog function
debug_msg () {
   if [ "$DEBUG" == "true" ]; then
      echo "$@" | logger -t HOUSEKEEPING -p "$SYSLOG_PRIORITY"
   fi
}
 
# Source in current SGE environment
if [ -f /etc/profile.d/sge-binaries.sh ]; then
  . /etc/profile.d/sge-binaries.sh
else 
  debug_msg "Could not find SGE environment"
  exit 1
fi
 
debug_msg "Running cleanup script at `hostname`: $0" 
 
# Add users here whose processes must never be killed by this cleanup.
exclude_users_from_kill="apache|dbus|nobody|ntp|postfix|root|rpc|sge"
 
debug_msg "Excluding the following users from kill: $exclude_users_from_kill"
 
# If this node has at least one job running in SGE (qstat), spare the owners of those jobs and clean up everyone else.
# It looks for ".sss" as part of the domain name; you might need to change this for your situation.
if qstat -t | grep -q "$(hostname -s)\.sss"; then
   # Who has registered jobs here?
   # The list ends with a trailing "|" so it can be joined with the exclude list below.
   users_here=$(qstat -t | grep "$(hostname -s)\.sss" | awk '{ print $4 }' | grep -v '^$' | sort -u | tr '\n' '|')
   
   debug_msg "Found that $users_here have jobs on this node ($(hostname)), those will be ignored."
 
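   # Loop over all users that do not have a job running on this node.
   # Note: the pattern matches substrings, so an entry such as "sge" also protects a user named "sgeadmin".
   for user_name in $(qacct -o | tail -n +3 | awk '{ print $1 }' | grep -Ev "${users_here}${exclude_users_from_kill}"); do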
      debug_msg "Killing processes, removing semaphores & shared memory for user ($user_name)"
      script=$(mktemp)
      cat << EOT > $script
#!/bin/sh
# Select entries by the owner column (third field); the id is the second field.
for semaphore in \$(ipcs -s | awk -v owner="\${USER}" '\$3 == owner { print \$2 }'); do
   ipcrm -s \$semaphore
done
for shm in \$(ipcs -m | awk -v owner="\${USER}" '\$3 == owner { print \$2 }'); do
   ipcrm -m \$shm
done
rm -f \$0
kill -9 -1
EOT
      chown $user_name $script
      chmod 755 $script
      su $user_name -c $script
    done
else
   debug_msg "No jobs should be on this node, so I will clean up all users except the excluded ones defined in this script."
   # Loop over all users that have ever had a job running.
   for user_name in $(qacct -o | tail -n +3 | awk '{ print $1 }' | grep -Ev "$exclude_users_from_kill"); do
      debug_msg "Killing processes, removing semaphores & shared memory for user ($user_name)"
      # build semaphore killer
      script=$(mktemp)
      cat << EOT > $script
#!/bin/sh
# Select entries by the owner column (third field); the id is the second field.
for semaphore in \$(ipcs -s | awk -v owner="\${USER}" '\$3 == owner { print \$2 }'); do
   ipcrm -s \$semaphore
done
for shm in \$(ipcs -m | awk -v owner="\${USER}" '\$3 == owner { print \$2 }'); do
   ipcrm -m \$shm
done
rm -f \$0
kill -9 -1
EOT
      chown $user_name $script
      chmod 755 $script
      su $user_name -c $script
   done
fi 
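
The script finds a user's semaphores and shared memory segments by reading ipcs output. A minimal sketch of picking entries by the owner column, so that it can be tried on canned text before running it on a live node, might look like the following. The function name ipc_ids_for_owner and the sample values are made up, and the column layout (key, id, owner, ...) is assumed to be the one Linux ipcs prints; other systems may differ.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Usage: ipcs -s | ipc_ids_for_owner OWNER
# Reads "ipcs -s" or "ipcs -m" style output on stdin and prints the id
# (second column) of every entry whose owner (third column) is exactly OWNER.
ipc_ids_for_owner () {
   awk -v owner="$1" '$3 == owner { print $2 }'
}

# Canned sample in the assumed Linux ipcs layout; all values are made up.
sample_output='
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 32768      alice      600        1
0x00000000 65537      bob        600        1
0x00000000 98306      alice      600        2
'

# Prints the two semaphore ids owned by alice, one per line.
printf '%s\n' "$sample_output" | ipc_ids_for_owner alice
```

On a real node the input would come from ipcs itself, for example: ipcs -m | ipc_ids_for_owner "$USER"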



