[GE users] Restarting sge_execd does not clear hung job status

coffman michael.coffman at avagotech.com
Fri Oct 8 21:54:44 BST 2010


Just as an FYI: this does seem to have been network related. It appears that RHEL5 does not deal well with the LSB start/stop info in the sgeexecd script:

### BEGIN INIT INFO
# Provides:       sgeexecd
# Required-Start: $network $remote_fs
# Required-Stop:
# Default-Start:  3 5
# Default-Stop: 0 1 2 6
# Description:  start Grid Engine execd
### END INIT INFO
#---------------------------------------

Using chkconfig to add and enable the service produces the following kill and start links, both of which run without networking or NFS being functional:

K50sgeexecd
S50sgeexecd

We NFS-mount the common directory, so the start/stop scripts were failing. I added the following to the script:

# chkconfig: 345 73 27

With that line, chkconfig creates start and stop links that let NFS and automount be running when execd starts and stops, so things work much better at startup and shutdown.
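
For anyone who wants to see how that header line turns into link names, here is a minimal sketch. It assumes the usual chkconfig convention that the header fields are runlevels, start priority, and stop priority, in that order. The function name chkconfig_links is hypothetical and is not part of chkconfig or SGE; it only prints the names and does not touch /etc/rc.d.

```shell
#!/bin/sh
# Hypothetical helper (not part of chkconfig or SGE): print the rc link
# names implied by a "# chkconfig: <runlevels> <start> <stop>" header line.
chkconfig_links() {
    header=$1
    service=$2
    # Drop everything up to and including "chkconfig:", then let the shell
    # split the remainder into: $1 = runlevels, $2 = start, $3 = stop.
    set -- ${header#*chkconfig:}
    # Start links are named S<start><service>, kill links K<stop><service>.
    printf 'S%s%s\nK%s%s\n' "$2" "$service" "$3" "$service"
}

# Prints S73sgeexecd and K27sgeexecd, one per line.
chkconfig_links "# chkconfig: 345 73 27" sgeexecd
```

After editing the real script, the links have to be regenerated (for example with chkconfig --del sgeexecd followed by chkconfig --add sgeexecd) before the new priorities show up in the rc directories.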

Thanks for the input...

On Thu, Oct 7, 2010 at 11:25 AM, reuti <reuti at staff.uni-marburg.de> wrote:
Am 07.10.2010 um 19:18 schrieb coffman:

> On Thu, Oct 7, 2010 at 10:14 AM, reuti <reuti at staff.uni-marburg.de> wrote:
> Am 07.10.2010 um 16:01 schrieb coffman:
>
> > <snip>
> > Some installations I saw are diskless, and create the local spool directories fresh with each new restart. What is inside messages - anything about "job ... not found" or so?
> >
> >
> > The common directory is an NFS mount point, but the spool directory is local to the node.
> >
> > 10/07/2010 07:30:41|  main|cs408|E|fopen("/opt/grid-6.2u5/ftcrnd/common/act_qmaster") failed: No such file or directory
> > 10/07/2010 07:31:03|  main|cs408|E|shepherd of job 825292.1 died through signal = 15
> > 10/07/2010 07:31:03|  main|cs408|E|abnormal termination of shepherd for job 825292.1: "exit_status" file is empty
> > 10/07/2010 07:31:03|  main|cs408|E|can't open usage file "active_jobs/825292.1/usage" for job 825292.1: No such file or directory
> > 10/07/2010 07:31:03|  main|cs408|E|shepherd exited with exit status 19: before writing exit_status
> > 10/07/2010 07:31:03|  main|cs408|I|controlled shutdown 6.2u5
>
> I'm not sure, but could it be that there was a network problem with the node and you rebooted it properly with `reboot` or `init 6`? In that case the node thinks it already sent the message about the lost job, but because the network was down nothing was actually sent, and after the reboot the old job information is gone.

I don't know for sure about the boot order, but when you shut down the execd this way, it removes the entries for the active jobs and no information is left. Maybe the network driver had already been shut down (or was on its way down), so the shutdown was never reported to the qmaster.

When you press just RESET, the information should still be there.

I'm also not aware that this behavior was different in former versions of SGE. Perhaps.

-- Reuti


>
> No network issues.
>
> The above log info comes from a system that I had just tested this on. I qrsh'ed to the machine, then logged in as root via another window and ran the command reboot.
>
>
> You can use `qdel -f <job_id>` as you might know for such jobs to remove them.
>
>
> Yes :) I just had not needed to do this before, and it confused one of our guys who was doing system repairs.
>
> Any suggestions on how to better understand what is going on?
>
> On 6.0u8, after a simple reboot, the following info still exists on the exechost:
>
> ./cs201/active_jobs/4139152.1
> ./cs201/active_jobs/4139152.1/addgrpid
> ./cs201/active_jobs/4139152.1/error
> ./cs201/active_jobs/4139152.1/environment
> ./cs201/active_jobs/4139152.1/job_pid
> ./cs201/active_jobs/4139152.1/pid
> ./cs201/active_jobs/4139152.1/config
> ./cs201/active_jobs/4139152.1/exit_status
> ./cs201/active_jobs/4139152.1/pe_hostfile
> ./cs201/active_jobs/4139152.1/trace
> ./cs201/execd.pid
>
> It is gone on the 6.2u5 exechost. No matter how I run the reboot, I can't see any reason for this info to be removed. This is before sge_execd has been started.
>
>
> -- Reuti
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286417
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
>
>
> --
> -MichaelC

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286420



--
-MichaelC


