[GE users] sge orphans

Paul MacInnis macinnis at dal.ca
Tue Jul 31 18:42:57 BST 2007


It would be nice if SGE always did what was intended but, in the
real world, unexpected things happen ...

I've attached a shell script that we've been using for over
2 years to clear out processes started by SGE but now detached from
any current SGE job (watch out, some lines may wrap).

Here's how it works.
On each slave node it peeks at each process's environment looking
for a "JOB_ID=" entry, extracts the job number, asks qstat if the
number is valid, and kills the process if it's not.
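
As a rough sketch of the steps above (this is not the attached script;
the function names are made up for illustration), the detection part
could look like the following on Linux.  It assumes that "qstat -j <id>"
exits non-zero when the job is unknown, which is worth checking on your
SGE version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the JOB_ID value found in a NUL-separated
# environment file such as /proc/<pid>/environ; print nothing if absent.
job_id_from_environ() {
    [ -r "$1" ] || return 0
    tr '\0' '\n' < "$1" 2>/dev/null | sed -n 's/^JOB_ID=//p' | head -n 1
}

# Hypothetical helper: print "pid jobid" for every process whose JOB_ID
# is not known to qstat.  Nothing is killed here.
list_orphans() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        job=$(job_id_from_environ "$dir/environ")
        [ -n "$job" ] || continue
        # Assumption: qstat -j exits non-zero for a job that does not exist.
        if ! qstat -j "$job" >/dev/null 2>&1; then
            echo "$pid $job"
        fi
    done
}
```

Calling list_orphans as root prints the candidates.  Note that in this
sketch a qstat that fails for any other reason (qmaster unreachable,
for example) would make every SGE process look like an orphan, so a
real script should guard against that.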

We've used it on SGE 5.2 and 6.1, both using tight integration of
parallel environments (PEs).

It is a Linux script in that it uses /proc/$pid/environ to see a
process's environment.  For Solaris this must be replaced by
    pargs  -e  $pid
Perhaps a conditional based on "uname -s" is a way to generalize
the script but I don't have a Solaris system to test on.
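
One possible shape for that conditional is sketched below.  The
function name and the optional second argument are invented for
illustration, and the Solaris branch is untested; it assumes pargs -e
prints lines of the form "envp[N]: VAR=value".

```shell
#!/bin/sh
# Hypothetical helper: print a process's environment, one VAR=value per
# line.  An optional second argument overrides the detected OS name.
print_environ() {
    pid=$1
    os=${2:-$(uname -s)}
    case "$os" in
        Linux)
            tr '\0' '\n' < "/proc/$pid/environ"
            ;;
        SunOS)
            # Untested assumption: strip the "envp[N]: " prefix.
            pargs -e "$pid" | sed -n 's/^envp\[[0-9]*\]: //p'
            ;;
        *)
            echo "unsupported OS: $os" >&2
            return 1
            ;;
    esac
}
```

With something like this, the JOB_ID lookup would read from
"print_environ $pid" instead of opening /proc/$pid/environ directly.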

Running the script with no parameters on a slave node will list
processes that should be killed, but no action is taken.  You must
use the -kill option to have the script actually kill anything.
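
The list-by-default behaviour can be sketched as below.  This is only
an illustration of the idea, not the code from the attachment;
handle_orphan and its output strings are made up.

```shell
#!/bin/sh
# Hypothetical helper: report a PID, and signal it only when the first
# argument is exactly "-kill".
handle_orphan() {
    mode=$1
    pid=$2
    if [ "$mode" = "-kill" ]; then
        kill "$pid" && echo "killed $pid"
    else
        echo "would kill $pid"
    fi
}
```

Making the destructive path opt-in means a mistyped or missing option
falls back to listing only.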

To safely try it out, simply log onto a slave node and run the
script with no parameters.  You should be root because otherwise
Linux won't let you read /proc entries that don't belong to you.
And $SGE_ROOT must be set.

Once an hour our master node logs into each slave node and runs
something like this:

kill_sge_orphans.sh -kill
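
How the master reaches each node is site-specific.  A minimal sketch,
assuming ssh access as root and a list of host names on stdin, might
look like this (run_on_hosts and the REMOTE_SH variable are
hypothetical, not part of SGE or of the attached script):

```shell
#!/bin/sh
# Hypothetical helper: run a command on every host named on stdin,
# one host per line.  REMOTE_SH picks the remote shell (default: ssh).
run_on_hosts() {
    while read -r host; do
        [ -n "$host" ] || continue
        # </dev/null stops the remote shell from swallowing the host list.
        ${REMOTE_SH:-ssh} "$host" "$@" </dev/null ||
            echo "failed on $host" >&2
    done
}
```

An hourly cron job on the master could then run something like
"qconf -sel | run_on_hosts /path/to/kill_sge_orphans.sh -kill", where
qconf -sel lists the execution hosts and /path/to is a placeholder.
Remember that $SGE_ROOT must also be set in the remote,
non-interactive shell.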

All processes killed are noted in the syslog file using logger.
We've set up logwatch to display these messages in its daily
report.

The number of processes killed this way varies a lot, due mainly
to what our users are up to - we sometimes go weeks with none but
if a user is trying something new, this saves us a lot of tedious
cleanup work.  We haven't had to do a manual process cleanup in
years.

Paul


    [ Part 2, ""  Text/PLAIN (Name: "kill_sge_orphans.sh") ~6 KB. ]
    [ Unable to print this part. ]


