[GE users] sge orphans

Kogan, Felix Felix-Kogan at deshaw.com
Fri Aug 24 15:26:57 BST 2007


Hmm, interesting. This option is not described in either the sge_conf(5)
or the sge_execd(8) man page. Where can I read more about it?

Thanks,

Felix Kogan

-----Original Message-----
From: Beadles, Jeff [mailto:jeff_beadles at mentor.com] 
Sent: Tuesday, July 31, 2007 1:58 PM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] sge orphans

Have you looked at the execd_params setting ENABLE_ADDGRP_KILL=true
(as shown by qconf -sconf)?

It's been doing a pretty good job of keeping these cleaned up,
automatically.

FYI,
	-Jeff
-- 
Jeff Beadles


-----Original Message-----
From: Paul MacInnis [mailto:macinnis at dal.ca] 
Sent: Tuesday, July 31, 2007 10:43 AM
To: users at gridengine.sunsource.net
Subject: [GE users] sge orphans


It would be nice if SGE always did what was intended but, in the
real world, unexpected things happen ...

I've attached a shell script that we've been using for over
2 years to clear out processes started by SGE but now detached from
any current SGE job (watch out, some lines may wrap).

Here's how it works.
On each slave node it peeks at each process's environment looking
for a "JOB_ID=" entry, extracts the job number, asks qstat if the
number is valid, and kills the process if it's not.
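
As a rough illustration of the steps above (a sketch, not the attached
script): the function names and the proc_root parameter are invented for
this example, and it assumes that "qstat -j ID" exits non-zero when the
job is unknown, which is worth checking on your SGE version.

```shell
#!/bin/sh
# Sketch only. Names below are hypothetical, not taken from the attached script.

# Print the numeric JOB_ID found in a NUL-separated environ file, if any.
job_id_from_environ() {
    tr '\000' '\n' < "$1" | sed -n 's/^JOB_ID=\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Succeed if qstat still knows the job.
# Assumption: "qstat -j ID" returns non-zero for an unknown job.
job_is_active() {
    qstat -j "$1" >/dev/null 2>&1
}

# Print "pid job_id" for every process whose JOB_ID qstat does not know.
# proc_root defaults to /proc; it is a parameter only so the loop can be
# exercised against a fake directory tree.
list_orphans() {
    proc_root=${1:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/environ" ] || continue
        id=$(job_id_from_environ "$dir/environ")
        [ -n "$id" ] || continue
        job_is_active "$id" || echo "${dir##*/} $id"
    done
}
```

Calling list_orphans with no argument scans /proc and only prints
candidates; nothing is killed by this sketch.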

We've used it on SGE 5.2 and 6.1, both using tight integration of
pe's.

It is a Linux script in that it uses /proc/$pid/environ to see a
process's environment.  For Solaris this must be replaced by
    pargs  -e  $pid
Perhaps a conditional based on "uname -s" is a way to generalize
the script but I don't have a Solaris system to test on.
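
One possible shape for that conditional, as an untested sketch: the
function name print_environ and its optional second argument are made
up here, and the sed pattern assumes that "pargs -e" prints lines of
the form "envp[N]: NAME=value". It has not been run on Solaris.

```shell
#!/bin/sh
# Sketch only: print one NAME=value per line for the given pid.
# Usage: print_environ PID [OS_NAME]
# OS_NAME defaults to the output of "uname -s".
print_environ() {
    pid=$1
    os=${2:-$(uname -s)}
    case "$os" in
        Linux)
            # /proc/PID/environ is NUL-separated.
            tr '\000' '\n' < "/proc/$pid/environ"
            ;;
        SunOS)
            # Assumption: pargs -e output looks like "envp[0]: NAME=value".
            pargs -e "$pid" | sed -n 's/^envp\[[0-9]*\]: //p'
            ;;
        *)
            echo "print_environ: unsupported OS: $os" >&2
            return 1
            ;;
    esac
}
```

Either branch could then be piped through something like
grep '^JOB_ID=' to pick out the job number.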

Running the script with no parameters on a slave node will list
processes that should be killed, but no action is taken.  You must
use the -kill option to have the script actually kill anything.
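
The "list by default, kill only on request" behaviour could look
roughly like this. It is a sketch, not the attached script: the
function name, the message wording and the use of the default TERM
signal are all assumptions.

```shell
#!/bin/sh
# Sketch only. Usage: act_on_orphan MODE PID JOB_ID
# MODE is "-kill" to really kill; anything else is a dry run.
act_on_orphan() {
    mode=$1 pid=$2 job=$3
    if [ "$mode" != "-kill" ]; then
        # Dry run: report what would happen and touch nothing.
        echo "would kill pid $pid (job $job)"
        return 0
    fi
    # Assumption: plain kill (SIGTERM); the original may use another signal.
    if kill "$pid" 2>/dev/null; then
        echo "killed pid $pid (job $job)"
    else
        echo "could not kill pid $pid (job $job)" >&2
        return 1
    fi
}
```

Making the destructive path opt-in means a mistyped or forgotten
argument falls back to the harmless listing.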

To safely try it out, simply log onto a slave node and run the
script with no parameters.  You should be root because otherwise
Linux won't let you read /proc entries that don't belong to you.
And $SGE_ROOT must be set.

Once an hour our master node logs into each slave node and runs
something like this:

kill_sge_orphans.sh -kill
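
For illustration only, the hourly pass from the master might be
driven by a loop like the one below. This is a guess at the mechanics,
not a description of the actual setup: it assumes ssh with key-based
login as the transport, takes the host list from "qconf -sel", and
uses a made-up function name.

```shell
#!/bin/sh
# Sketch only. Usage: run_on_exec_hosts COMMAND [ARGS...]
# Runs the command on every execution host reported by "qconf -sel".
run_on_exec_hosts() {
    for host in $(qconf -sel); do
        # -n keeps ssh away from stdin; BatchMode avoids password prompts.
        ssh -n -o BatchMode=yes "$host" "$@" ||
            echo "warning: command failed on $host" >&2
    done
}
```

It could be called from an hourly cron job as, for example,
run_on_exec_hosts kill_sge_orphans.sh -kill, bearing in mind that
$SGE_ROOT has to be set in the remote, non-interactive environment.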

All processes killed are noted in the syslog file using logger.
We've set up logwatch to display these messages in its daily
report.
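
A logging call of this kind might look as follows. The tag, the
priority and the message layout are assumptions chosen for the
example, not what the attached script writes.

```shell
#!/bin/sh
# Sketch only. Usage: log_kill PID JOB_ID USER
# Sends one line to syslog via logger(1).
# The tag "kill_sge_orphans" and priority "daemon.notice" are assumptions.
log_kill() {
    logger -t kill_sge_orphans -p daemon.notice \
        "killed orphan pid=$1 job_id=$2 user=$3"
}
```

A fixed tag gives a log summarizer such as logwatch a single string
to match on.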

The number of processes killed this way varies a lot, due mainly
to what our users are up to - we sometimes go weeks with none but
if a user is trying something new, this saves us a lot of tedious
cleanup work.  We haven't had to do a manual process cleanup in
years.

Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




