[GE users] Restarting sge_execd does not clear hung job status

reuti reuti at staff.uni-marburg.de
Thu Oct 7 18:25:06 BST 2010


On 07.10.2010 at 19:18, coffman wrote:

> On Thu, Oct 7, 2010 at 10:14 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> On 07.10.2010 at 16:01, coffman wrote:
> 
> > <snip>
> > Some installations I have seen are diskless and create the local spool directories fresh on each restart. What is in the messages file: anything like "job ... not found"?
> >
> >
> > The common directory is an NFS mount point, but the spool directory is local to the node.
> >
> > 10/07/2010 07:30:41|  main|cs408|E|fopen("/opt/grid-6.2u5/ftcrnd/common/act_qmaster") failed: No such file or directory
> > 10/07/2010 07:31:03|  main|cs408|E|shepherd of job 825292.1 died through signal = 15
> > 10/07/2010 07:31:03|  main|cs408|E|abnormal termination of shepherd for job 825292.1: "exit_status" file is empty
> > 10/07/2010 07:31:03|  main|cs408|E|can't open usage file "active_jobs/825292.1/usage" for job 825292.1: No such file or directory
> > 10/07/2010 07:31:03|  main|cs408|E|shepherd exited with exit status 19: before writing exit_status
> > 10/07/2010 07:31:03|  main|cs408|I|controlled shutdown 6.2u5
> 
> I'm not sure, but could it be that there was a network problem with the node and you rebooted it properly with `reboot` or `init 6`? In that case the node thinks it has already sent the message about the lost job, but because the network was down nothing was actually sent, and after the reboot the old job information is gone.

I don't know the boot order for sure, but when you shut down the execd this way, it removes the entries for the active jobs and no information is left. Maybe the network driver had already been shut down (or was in the process of shutting down), so the shutdown was never reported to the qmaster.
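
A rough way to check whether a particular reboot went through such a controlled shutdown is to filter the execd messages file. The sketch below assumes the pipe-separated layout seen in the quoted log (timestamp, thread, host, level, text); the function name `execd_shutdown_events` is made up for illustration and is not part of SGE.

```shell
# execd_shutdown_events MESSAGES_FILE
# Print every error line (level "E") plus any informational line that
# reports a controlled shutdown. Field positions are an assumption taken
# from the log excerpt in this thread: date time|thread|host|level|text.
execd_shutdown_events() {
    awk -F'|' '$4 == "E" || ($4 == "I" && $5 ~ /controlled shutdown/)' "$1"
}
```

If shepherd errors are directly followed by a "controlled shutdown" line, as in the excerpt above, the execd was stopped by the init scripts rather than killed by a hard reset.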

When you just press RESET, the information should still be there.
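
To see whether that information survived, one could list the job directories under the execd's local spool before sge_execd is started again. This is a minimal sketch, assuming the `active_jobs` layout shown in the listing quoted further down; `list_active_jobs` is a hypothetical helper, and the spool path has to be supplied by the caller.

```shell
# list_active_jobs SPOOL_DIR
# Print the name of each job directory found under SPOOL_DIR/active_jobs.
# Prints nothing if the directory exists but is empty; returns 1 if the
# active_jobs directory itself is missing.
list_active_jobs() {
    spool_dir=$1
    if [ ! -d "$spool_dir/active_jobs" ]; then
        echo "no active_jobs directory under $spool_dir" >&2
        return 1
    fi
    for job_dir in "$spool_dir"/active_jobs/*; do
        # Skip the unexpanded glob and any stray plain files.
        [ -d "$job_dir" ] || continue
        echo "${job_dir##*/}"
    done
}
```

Run against the host's spool directory right after a reboot, an empty result would suggest the entries were removed during shutdown, while leftover job IDs would point to a hard reset.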

I'm also not aware that this behavior was different in earlier versions of SGE, but perhaps it was.

-- Reuti


> 
> No network issues.
> 
> The above log info comes from a system that I had just tested this on. I used qrsh to get onto the machine, then logged in as root in another window and ran the command `reboot`.
> 
>  
> As you might know, you can use `qdel -f <job_id>` to remove such jobs.
> 
> 
> Yes :) I just had not had to do this before, and it confused one of our guys who was doing system repairs.
> 
> Any suggestions on how to better understand what is going on?
> 
> On 6.0u8, after a simple reboot, the following info still exists on the exechost:
> 
> ./cs201/active_jobs/4139152.1
> ./cs201/active_jobs/4139152.1/addgrpid
> ./cs201/active_jobs/4139152.1/error
> ./cs201/active_jobs/4139152.1/environment
> ./cs201/active_jobs/4139152.1/job_pid
> ./cs201/active_jobs/4139152.1/pid
> ./cs201/active_jobs/4139152.1/config
> ./cs201/active_jobs/4139152.1/exit_status
> ./cs201/active_jobs/4139152.1/pe_hostfile
> ./cs201/active_jobs/4139152.1/trace
> ./cs201/execd.pid
> 
> It is gone on the 6.2u5 exechost. No matter how I run the reboot, I can't see any reason for this info to be removed. This is before sge_execd has been started.
> 
> 
> -- Reuti
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286417
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> 
> 
> 
> -- 
> -MichaelC

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286420
