[GE users] Jobs stuck in delete status

seandavi seandavi at gmail.com
Tue Jun 9 17:30:21 BST 2009


I'm using SGE 6.2 and have managed to get a couple of jobs stuck in "dr"
status.  Both were parallel jobs running across multiple machines, and
both appear to have had their "master" task running on the same machine.
I have restarted the qmaster, as well as the execd on the machine where
the "master" tasks appear to have been running.  Here is what I have in
the execd messages file:

06/09/2009 12:18:46|  main|pressa|I|controlled shutdown 6.2
06/09/2009 12:18:53|  main|pressa|I|starting up SGE 6.2 (lx24-amd64)
06/09/2009 12:18:53|  main|pressa|W|reaping job "28147" ptf complains:
Job does not exist
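
In case it helps with sifting through a longer messages file, here is a
minimal Python sketch for pulling out the non-informational entries.  It
assumes every entry follows the pipe-delimited layout seen above
(timestamp|component|host|level|message), that "I" marks informational
entries and anything else (such as "W") is worth a look, and that a line
without those fields is a wrapped continuation of the previous entry.
The names ExecdEntry and parse_execd_log are made up for illustration
and are not part of SGE.

```python
from typing import Iterable, List, NamedTuple


class ExecdEntry(NamedTuple):
    """One entry from an execd messages file (field names are illustrative)."""
    timestamp: str
    component: str
    host: str
    level: str
    message: str


def parse_execd_log(lines: Iterable[str]) -> List[ExecdEntry]:
    """Parse pipe-delimited entries, folding wrapped lines into the previous one."""
    entries: List[ExecdEntry] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split into at most 5 fields so a "|" inside the message survives.
        parts = line.split("|", 4)
        if len(parts) == 5:
            entries.append(ExecdEntry(*(p.strip() for p in parts)))
        elif entries:
            # Assumption: a line without the 5 fields continues the last message.
            last = entries[-1]
            entries[-1] = last._replace(message=last.message + " " + line.strip())
    return entries


def non_info(entries: Iterable[ExecdEntry]) -> List[ExecdEntry]:
    """Keep entries whose level is not "I" (assumed to mean informational)."""
    return [e for e in entries if e.level != "I"]


if __name__ == "__main__":
    sample = [
        "06/09/2009 12:18:46|  main|pressa|I|controlled shutdown 6.2",
        "06/09/2009 12:18:53|  main|pressa|I|starting up SGE 6.2 (lx24-amd64)",
        '06/09/2009 12:18:53|  main|pressa|W|reaping job "28147" ptf complains:',
        "Job does not exist",
    ]
    for entry in non_info(parse_execd_log(sample)):
        print(entry.timestamp, entry.host, entry.level, entry.message)
```

To run it against a real file, pass an open file handle to
parse_execd_log instead of the sample list.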

Any ideas as to what is going on, or how to go further with diagnosing
the problem?  The cluster has been up and running for months without
problems.  The only new addition is openmpi integration; it turns out
that one of the jobs stuck in "dr" status is an mpirun job.

Thanks,
Sean

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201328
