[GE users] Restarting sge_execd does not clear hung job status

coffman michael.coffman at avagotech.com
Thu Oct 7 18:18:32 BST 2010





On Thu, Oct 7, 2010 at 10:14 AM, reuti <reuti at staff.uni-marburg.de> wrote:
On 07.10.2010 at 16:01, coffman wrote:

> <snip>
> Some installations I have seen are diskless and create the local spool directories fresh on each restart. What is in the messages file: anything like "job ... not found"?
>
>
> The common directory is an NFS mount point, but the spool directory is local to the node.
>
> 10/07/2010 07:30:41|  main|cs408|E|fopen("/opt/grid-6.2u5/ftcrnd/common/act_qmaster") failed: No such file or directory
> 10/07/2010 07:31:03|  main|cs408|E|shepherd of job 825292.1 died through signal = 15
> 10/07/2010 07:31:03|  main|cs408|E|abnormal termination of shepherd for job 825292.1: "exit_status" file is empty
> 10/07/2010 07:31:03|  main|cs408|E|can't open usage file "active_jobs/825292.1/usage" for job 825292.1: No such file or directory
> 10/07/2010 07:31:03|  main|cs408|E|shepherd exited with exit status 19: before writing exit_status
> 10/07/2010 07:31:03|  main|cs408|I|controlled shutdown 6.2u5
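
For anyone sifting through a larger execd messages file, here is a minimal Python sketch that splits lines of the form shown above and picks out the ones mentioning a given job. The field names (`timestamp`, `thread`, `host`, `level`, `text`) and the helper names `parse_line` and `lines_for_job` are assumptions made for illustration; only the pipe-separated layout is taken from the log excerpt.

```python
from typing import Iterable, List, NamedTuple, Optional


class ExecdLogLine(NamedTuple):
    # Field names are illustrative guesses based on the excerpt above.
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[ExecdLogLine]:
    """Split one messages line into its five pipe-separated fields."""
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected layout
    timestamp, thread, host, level, text = parts
    return ExecdLogLine(timestamp.strip(), thread.strip(),
                        host.strip(), level.strip(), text)


def lines_for_job(lines: Iterable[str], job_id: str) -> List[ExecdLogLine]:
    """Return parsed entries whose message text mentions job_id."""
    hits = []
    for raw in lines:
        entry = parse_line(raw)
        if entry is not None and job_id in entry.text:
            hits.append(entry)
    return hits


sample = [
    '10/07/2010 07:31:03|  main|cs408|E|shepherd of job 825292.1 died through signal = 15',
    '10/07/2010 07:31:03|  main|cs408|I|controlled shutdown 6.2u5',
]
```

Calling `lines_for_job(sample, "825292.1")` keeps only the shepherd line; passing an open file object instead of `sample` should work the same way.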

I'm not sure, but could it be that there was a network problem with the node and you rebooted it properly with `reboot` or `init 6`? In that case the node thinks it has already sent the message about the lost job, but because the network was down nothing was actually sent, and after the reboot the old job information is gone.


No network issues.

The log above comes from a system I had just tested this on. I used qrsh to get onto the machine, then logged in as root from another window and ran `reboot`.


As you probably know, you can remove such jobs with `qdel -f <job_id>`.


Yes :) We just never had to do this before, and it confused one of our guys who was doing system repairs.

Any suggestions on how to better understand what is going on?

On 6.0u8, after a simple reboot, the following files still exist on the exechost:

./cs201/active_jobs/4139152.1
./cs201/active_jobs/4139152.1/addgrpid
./cs201/active_jobs/4139152.1/error
./cs201/active_jobs/4139152.1/environment
./cs201/active_jobs/4139152.1/job_pid
./cs201/active_jobs/4139152.1/pid
./cs201/active_jobs/4139152.1/config
./cs201/active_jobs/4139152.1/exit_status
./cs201/active_jobs/4139152.1/pe_hostfile
./cs201/active_jobs/4139152.1/trace
./cs201/execd.pid

They are gone on the 6.2u5 exechost. No matter how I run the reboot, I can't see any reason for this information to be removed. This is before sge_execd has been started.
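
To compare the two versions more systematically, something like the following Python sketch could be run on each exechost after the reboot and before sge_execd is started. It only lists what is left under `active_jobs` in a host spool directory. The function name `leftover_jobs` is hypothetical, and the spool directory path has to be supplied by the caller, since the full path is not given here.

```python
import os
from typing import Dict, List


def leftover_jobs(host_spool_dir: str) -> Dict[str, List[str]]:
    """Map each job directory under <host_spool_dir>/active_jobs
    to the sorted list of file names it contains."""
    active = os.path.join(host_spool_dir, "active_jobs")
    result: Dict[str, List[str]] = {}
    if not os.path.isdir(active):
        return result  # directory missing: nothing survived the reboot
    for name in sorted(os.listdir(active)):
        job_dir = os.path.join(active, name)
        if os.path.isdir(job_dir):
            result[name] = sorted(os.listdir(job_dir))
    return result
```

On a host like the 6.0u8 one above, the result would be expected to contain an entry for `4139152.1`; an empty dictionary would match what I see on 6.2u5.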


-- Reuti




--
-MichaelC


