[GE users] Job in Eqw state, and I don't know why ?

Pradeilles Christoph christoph.pradeilles at consultant.volvo.com
Wed Nov 7 11:17:00 GMT 2007



Hello,

I have a very strange issue. I think I have found the cause, but I don't know how to resolve it. I'm using SGE 6.0u4 ("sge6u4").

When I launch a series of jobs from the same directory, some of the jobs in the series crash.
Here are the steps performed by my launch script.

Let's take a series of 10 jobs:

casename01 to casename10

- 1 - Copy the initial data (casename01) to a temporary folder (/data/depot/$USER/CaseName/) before the SGE submission.
- 2 - Launch the submission script (the one containing the qsub command) from /data/depot/$USER/CaseName/.
- 3 - The computation runs on a compute node. It is always the same node (in this case the node is called vxxs0025).
- 4 - During the calculation, the launch script waits for the results file in a while loop.
- 5 - After the computation, the results are copied to /data/depot/$USER/CaseName/ and then to the user's initial folder. The result file is called casename_result01.
- 6 - When the script detects the result file casename_result01, the SGE job is finished (indeed, it has vanished from the qstat output). The results are good.
- 7 - The script removes the contents of /data/depot/$USER/CaseName/.
- 8 - The script leaves the while loop and can submit the second case (casename02) as described above, starting from step 1.
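
The steps above can be sketched roughly as follows. This is only an illustrative Python outline, not the actual launch script (which is not shown in this message). The names run_series, submit and result_name are hypothetical, the submit callable stands in for whatever wraps the qsub call, each case is assumed to be a single input file, and the launcher itself is assumed to copy the result back to the user's folder.

```python
import shutil
import time
from pathlib import Path


def run_series(case_names, user_dir, staging_dir, submit, result_name,
               poll_seconds=5.0):
    """Run the cases strictly one after another (steps 1 to 8)."""
    user_dir = Path(user_dir)
    staging_dir = Path(staging_dir)
    for case in case_names:
        # Step 1: copy the initial data into the staging folder.
        staging_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(user_dir / case, staging_dir / case)

        # Step 2: submit from the staging folder. `submit` is a
        # hypothetical callable; a real one would invoke qsub there.
        submit(case, staging_dir)

        # Steps 4 to 6: poll until the result file appears.
        result = staging_dir / result_name(case)
        while not result.exists():
            time.sleep(poll_seconds)

        # Step 5: copy the result back to the user's folder
        # (assumed here to be done by the launcher).
        shutil.copy2(result, user_dir / result.name)

        # Step 7: remove the contents of the staging folder,
        # keeping the folder itself.
        for entry in staging_dir.iterdir():
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
        # Step 8: the loop continues with the next case.
```

In this outline the next submission starts immediately after the staging folder has been emptied, which matches the timing described in steps 7 and 8.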

In a series of 10 jobs, only one calculation out of two works.

The first is OK.
The second one enters the Eqw state just after the qsub (which succeeds), before the SGE prolog file runs: prolog exited with exit status 26
The third is OK.
The fourth one enters the Eqw state just after the qsub (which succeeds), before the SGE prolog file runs: prolog exited with exit status 26
And so on, until the last job.

When a job crashes, the SGE output file (the one named $casename.o$jobid) is not created!

Here is the error message:

Job 123371 caused action: Job 123371 set to ERROR
 User        = r060963
 Queue       = aix at vxxs0025.lyon.volvo.net
 Host        = vxxs0025.lyon.volvo.net
 Start Time  = <unknown>
 End Time    = <unknown>
failed opening input/output file:11/07/2007 10:46:29 [4603:536778]: can't stat() "/data/depot/r060963/890H_ELLIPTICS" as stdout_path:
Shepherd trace:
11/07/2007 10:46:28 [144:851982]: shepherd called with uid = 0, euid = 144
11/07/2007 10:46:28 [144:851982]: starting up 6.0u4
11/07/2007 10:46:29 [144:851982]: setpgid(851982, 851982) returned 0
11/07/2007 10:46:29 [144:851982]: forked "prolog" with pid 536778
11/07/2007 10:46:29 [144:536778]: pid=536778 pgrp=536778 sid=536778 old pgrp=851982 getlogin()=<no login set>
11/07/2007 10:46:29 [144:851982]: using signal delivery delay of 120 seconds
11/07/2007 10:46:29 [144:851982]: child: prolog - pid: 536778
11/07/2007 10:46:29 [4603:536778]: closing all filedescriptors
11/07/2007 10:46:29 [4603:536778]: further messages are in "error" and "trace"
11/07/2007 10:46:29 [4603:536778]: using "/bin/csh" as shell of user "r060963"
11/07/2007 10:46:29 [144:851982]: wait3 returned 536778 (status: 6656; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
11/07/2007 10:46:29 [144:851982]: prolog exited with exit status 26
11/07/2007 10:46:29 [144:851982]: reaped "prolog" with pid 536778
11/07/2007 10:46:29 [144:851982]: prolog exited not due to signal
11/07/2007 10:46:29 [144:851982]: prolog exited with status 26
11/07/2007 10:46:29 [144:851982]: no tasker to notify
11/07/2007 10:46:29 [144:851982]: exit states increased from 0 to 1

11/07/2007 10:46:29 [144:851982]: failed starting prolog

Shepherd error:
11/07/2007 10:46:29 [4603:536778]: can't stat() "/data/depot/r060963/890H_ELLIPTICS" as stdout_path: Missing file or filesystem KRB5CCNAME=none uid=4603 gid=442 442 416 2060 2066 

Shepherd pe_hostfile:
vxxs0025.lyon.volvo.net 1 aix at vxxs0025.lyon.volvo.net <NULL>
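
As a side note on reading the trace, the raw value in the wait3 line (status: 6656) and the reported exit status 26 are the same information: 6656 = 26 * 256. A minimal sketch of the decoding, assuming the usual Unix wait-status encoding as exposed by Python's os module (this says nothing about what exit status 26 means to SGE):

```python
import os

# "status: 6656" as reported by the wait3 line in the shepherd trace.
raw_status = 6656

# Decode the wait status into its components.
exited_normally = os.WIFEXITED(raw_status)    # process called exit()
killed_by_signal = os.WIFSIGNALED(raw_status)  # process died from a signal
exit_code = os.WEXITSTATUS(raw_status)         # high byte of the status

# On Linux this prints: True False 26
print(exited_normally, killed_by_signal, exit_code)
```

This agrees with the trace lines "prolog exited not due to signal" and "prolog exited with status 26".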


But the very odd thing is the following:

If I add a step - 9 - to my script that runs the command sleep 60, all 10 jobs work fine!

If I use sleep 30 instead, 2 out of 3 calculations work fine.

It seems that the longer I wait after the end of a job, the better the chance that the next computation will work... -_-.

Is SGE not cleaning up its environment properly (even though the job no longer appears in qstat)? Why must I wait 1 minute before launching the next computation? Why can't SGE run its prolog file, giving me "prolog exited with exit status 26" instead? Why don't I get a $casename.o$jobid file? Does that mean the qsub failed (and if so, why do I get the message "...has been submitted" or "submit job $jobid")?

Thank you in advance for your help. This issue is driving me mad.

Regards,

Christophe PRADEILLES - SOLUTEC -
Volvo IT Supplier - CAE Lyon Application Production 
Global  I&O - Service PDEV
+33 (0) 4 72 96 94 02
christoph.pradeilles at consultant.volvo.com
 



