[GE users] Restarting sge_execd does not clear hung job status

reuti reuti at staff.uni-marburg.de
Thu Oct 7 17:14:47 BST 2010


On 07.10.2010 at 16:01, coffman wrote:

> <snip>
> Some installations I have seen are diskless and create the local spool directories fresh with each restart. What is inside the messages file - anything like "job ... not found"?
> 
> 
> The common directory is an NFS mount point, but the spool directory is local to the node.
> 
> 10/07/2010 07:30:41|  main|cs408|E|fopen("/opt/grid-6.2u5/ftcrnd/common/act_qmaster") failed: No such file or directory
> 10/07/2010 07:31:03|  main|cs408|E|shepherd of job 825292.1 died through signal = 15
> 10/07/2010 07:31:03|  main|cs408|E|abnormal termination of shepherd for job 825292.1: "exit_status" file is empty
> 10/07/2010 07:31:03|  main|cs408|E|can't open usage file "active_jobs/825292.1/usage" for job 825292.1: No such file or directory
> 10/07/2010 07:31:03|  main|cs408|E|shepherd exited with exit status 19: before writing exit_status
> 10/07/2010 07:31:03|  main|cs408|I|controlled shutdown 6.2u5

I'm not sure, but could it be that there was a network problem with the node and you then rebooted it cleanly with `reboot` or `init 6`? In that case the node thinks it already sent the message about the lost job, but because the network was down nothing was actually sent - and after the reboot the old job information on the node is gone.

As you might know, you can use `qdel -f <job_id>` to remove such jobs by force.
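For example, a minimal sketch using job 825292 from the log above (substitute the id of the actual stuck job):

    # check whether the qmaster still knows about the job
    qstat -j 825292

    # force-delete the job record at the qmaster, since the execd
    # can no longer report anything about it
    qdel -f 825292

    # confirm the job no longer shows up
    qstat -u '*' | grep 825292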

-- Reuti

