[GE users] Epilog scripts and job abort/deletion
jbecker at northwestern.edu
Mon Jul 12 17:12:12 BST 2004
I am currently having a few problems with epilog scripts and job
deletion. My cluster is running SGE 5.3p5 under the Rocks
distribution (kernel 2.4.21-4.0.1.ELsmp), and I've configured my own
queues and PEs.
My first question: are epilog scripts run when a job is either deleted
(using qdel) or aborts for some reason? Specifically, I would like to
run a small script to clean up the IPC mess (visible via ipcs) left
behind when mpich jobs die.
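For reference, the cleanup I have in mind is something like the following sketch (hypothetical script, not what I'm running; it assumes $USER is set in the epilog environment and that the util-linux ipcs/ipcrm tools are on the PATH):

```shell
#!/bin/sh
# Hypothetical epilog sketch: remove SysV semaphores and shared
# memory segments still owned by the job's user after the job ends.
# Assumes ipcs prints the owner in column 3 (util-linux layout).
for id in $(ipcs -s | awk -v u="$USER" '$3 == u { print $2 }'); do
    ipcrm -s "$id"      # remove leftover semaphore array
done
for id in $(ipcs -m | awk -v u="$USER" '$3 == u { print $2 }'); do
    ipcrm -m "$id"      # remove leftover shared memory segment
done
```

Obviously this nukes everything the user owns on the node, which is only safe if the scheduler runs one job per user per node at a time.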
Second question/problem: I have found that processes involved in jobs
are not always killed, and continue to hang around, even after the job
is finished. I believe that I am using tight integration (partly to try
and avoid this problem with MPI jobs). I should note that many of these
"mpi" jobs are actually not parallel at all; the users are recycling the
wrapper scripts that have "#$ -pe mpi" embedded in them. The relevant
parts of the process tree look like this:

[process tree listing not preserved in the archive]
If I delete job 2029, there is a fair chance that it won't actually
die, but merely become a child of init (the process named "2029" does go
away properly). This actually has become enough of a problem that I have
written a few tools to find nodes with high loads, but no jobs assigned.
Is there a way around this by chance?
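For what it's worth, the sort of check I cobbled together looks roughly like this (a sketch only; the 0.50 threshold is arbitrary, and it assumes qhost's default output layout and that "qhost -j -h <host>" lists the jobs on a host):

```shell
#!/bin/sh
# Sketch: flag hosts whose load is high even though no jobs are
# listed there. Assumes qhost's default layout: three header lines
# (titles, dashes, "global"), then one line per host with the load
# average in column 4.
THRESH=0.50
qhost | awk -v t="$THRESH" 'NR > 3 && $4 + 0 > t { print $1 }' |
while read -r host; do
    # qhost -j prints job lines (leading job-ID) under each host;
    # if none appear, the load is coming from something untracked.
    if ! qhost -j -h "$host" | grep -q '^[[:space:]]*[0-9]'; then
        echo "suspect: $host"
    fi
done
```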
Third question: Is there any harm in running non-MPI jobs under a PE
designed for MPI? I don't really see how it could be an issue, except
perhaps bumping into whatever accounting limits are in place for the
number of jobs in the queue...
Various configuration details:
The 'mpi' PE:
[root at hydra Hydra]# qconf -sp mpi
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh
One of the queues (one queue per host, all hosts and queues are the same):
[root at hydra Hydra]# qconf -sq cp0-20.q
qtype BATCH INTERACTIVE PARALLEL
[root at hydra Hydra]# qconf -sconf
rlogin_daemon /usr/sbin/sshd -i
rsh_daemon /usr/sbin/sshd -i
I am aware that there are no epilog or prolog scripts configured in the
examples above; I tried adding them, and deleting a test job, but to no
avail. I reset the configuration after testing, but can change it back.
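If it helps, what I had tried was along these lines (the script path is hypothetical; the epilog can be set per queue or cluster-wide):

```shell
# Set a per-queue epilog (opens $EDITOR on the queue definition):
qconf -mq cp0-20.q
#   ... then change the line:
#   epilog                /opt/gridengine/local/epilog.sh

# Or set it for all hosts in the global configuration:
qconf -mconf
#   epilog                /opt/gridengine/local/epilog.sh
```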
Thanks for any suggestions anyone can offer.
GPG-fingerprint: BD00 7AA4 4483 AFCC 82D0 2720 0083 0931 9A2B 06A2