[GE users] Job in Eqw state, and I don't know why ?
christoph.pradeilles at consultant.volvo.com
Wed Nov 7 11:17:00 GMT 2007
I have a very strange issue. I think I have found the cause, but I don't know how to resolve it. I'm using SGE 6.0u4 ("sge6u4").
When I launch a series of jobs from the same directory, some of the jobs in the series crash.
Here are the steps performed by my launching script.
Let's take a series of 10 jobs:
casename01 to casename10
- 1 - Copy the initial data (casename01) to a temporary folder (/data/depot/$USER/CaseName/) before the SGE submission.
- 2 - Launch the submission script (the one with the qsub command) from /data/depot/$USER/CaseName/.
- 3 - The computation runs on a compute node. It is always the same node (in this case, the node is called vxxs0025).
- 4 - During the computation, the launching script waits for the results file in a while loop.
- 5 - After the computation, the results are copied to /data/depot/$USER/CaseName/ and then to the user's initial folder. The result file is called casename_result01.
- 6 - When the script detects the result file casename_result01, the SGE job is finished (indeed, it has vanished from the qstat output). The results are good.
- 7 - The script removes the contents of /data/depot/$USER/CaseName/.
- 8 - The script leaves the while loop and can submit the second case (casename02), as described above from step 1.
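The steps above can be sketched roughly as follows. This is only a minimal Python illustration, not the actual launching script: the names `run_series` and `submit`, the polling interval and the exact file layout are assumptions made for the example. `submit` stands in for whatever wraps the qsub call, and the job itself is assumed to write the result file into the temporary folder.

```python
import shutil
import time
from pathlib import Path


def run_series(cases, source_dir, depot_dir, submit, poll_seconds=5.0):
    """Run the cases one after another, following steps 1 to 8.

    `submit` is a hypothetical callable standing in for the script that
    calls qsub; it receives the case name and the temporary folder.
    """
    source_dir, depot_dir = Path(source_dir), Path(depot_dir)
    finished = []
    for index, case in enumerate(cases, start=1):
        # Step 1: copy the initial data to the temporary folder.
        depot_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(source_dir / case, depot_dir / case)

        # Step 2: submit the job from the temporary folder.
        submit(case, depot_dir)

        # Steps 3 to 6: the job runs elsewhere; poll until its result
        # file (assumed to be written by the job) shows up.
        result = depot_dir / f"casename_result{index:02d}"
        while not result.exists():
            time.sleep(poll_seconds)
        finished.append(result.name)

        # Step 7: remove the contents of the temporary folder.
        for entry in depot_dir.iterdir():
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
        # Step 8: the loop continues with the next case.
    return finished
```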
In a series of 10 jobs, only half of the calculations work.
The first is OK.
The second one enters the Eqw state just after the qsub (which itself succeeds), before the SGE prolog file is executed: prolog exited with exit status 26
The third is OK.
The fourth one enters the Eqw state just after the qsub (which itself succeeds), before the SGE prolog file is executed: prolog exited with exit status 26
And so on until the last job.
When a job crashes, the SGE output file (the one named $casename.o$jobid) is not created!
Here is the error message:
Job 123371 caused action: Job 123371 set to ERROR
User = r060963
Queue = aix at vxxs0025.lyon.volvo.net
Host = vxxs0025.lyon.volvo.net
Start Time = <unknown>
End Time = <unknown>
failed opening input/output file:11/07/2007 10:46:29 [4603:536778]: can't stat() "/data/depot/r060963/890H_ELLIPTICS" as stdout_path:
11/07/2007 10:46:28 [144:851982]: shepherd called with uid = 0, euid = 144
11/07/2007 10:46:28 [144:851982]: starting up 6.0u4
11/07/2007 10:46:29 [144:851982]: setpgid(851982, 851982) returned 0
11/07/2007 10:46:29 [144:851982]: forked "prolog" with pid 536778
11/07/2007 10:46:29 [144:536778]: pid=536778 pgrp=536778 sid=536778 old pgrp=851982 getlogin()=<no login set>
11/07/2007 10:46:29 [144:851982]: using signal delivery delay of 120 seconds
11/07/2007 10:46:29 [144:851982]: child: prolog - pid: 536778
11/07/2007 10:46:29 [4603:536778]: closing all filedescriptors
11/07/2007 10:46:29 [4603:536778]: further messages are in "error" and "trace"
11/07/2007 10:46:29 [4603:536778]: using "/bin/csh" as shell of user "r060963"
11/07/2007 10:46:29 [144:851982]: wait3 returned 536778 (status: 6656; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 26)
11/07/2007 10:46:29 [144:851982]: prolog exited with exit status 26
11/07/2007 10:46:29 [144:851982]: reaped "prolog" with pid 536778
11/07/2007 10:46:29 [144:851982]: prolog exited not due to signal
11/07/2007 10:46:29 [144:851982]: prolog exited with status 26
11/07/2007 10:46:29 [144:851982]: no tasker to notify
11/07/2007 10:46:29 [144:851982]: exit states increased from 0 to 1
11/07/2007 10:46:29 [144:851982]: failed starting prolog
11/07/2007 10:46:29 [4603:536778]: can't stat() "/data/depot/r060963/890H_ELLIPTICS" as stdout_path: Missing file or filesystem KRB5CCNAME=none uid=4603 gid=442 442 416 2060 2066
vxxs0025.lyon.volvo.net 1 aix at vxxs0025.lyon.volvo.net <NULL>
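As a side note on the trace, the raw status 6656 returned by wait3 and the exit status 26 are the same information. A minimal Python sketch of the decoding, assuming the conventional Unix layout of the wait status (terminating signal in the low 7 bits, exit code in the next byte), which is consistent with the values printed above; the function name `decode_wait_status` is made up for the example, and stopped processes are not handled:

```python
def decode_wait_status(status):
    """Split a raw wait status into (exited_normally, exit_code, signal).

    Assumes the conventional Unix encoding; simplified, it does not
    handle stopped or continued processes.
    """
    signal_number = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF   # next byte: value passed to exit()
    exited_normally = signal_number == 0
    return (exited_normally, exit_code if exited_normally else None, signal_number)


print(decode_wait_status(6656))  # status reported by wait3 in the trace
```

For the status in the trace this gives exit code 26 and no signal, which matches the "prolog exited not due to signal" line.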
But the very odd thing is the following:
If I add a step - 9 - to my script that runs the command sleep 60,
all 10 jobs work fine!
If I use sleep 30 instead, 2/3 of the calculations work.
It seems that the longer I wait after the end of a job, the better the chance that the next computation will work... -_-
Is SGE not cleaning up its environment properly (even though the job no longer appears in qstat)? Why must I wait 1 minute before launching the next computation? Why can't SGE launch its prolog file, giving me "prolog exited with exit status 26"? Why don't I get a $casename.o$jobid file? Does that mean the qsub failed (and if so, why do I get the message "...has been submitted" or "submit job $jobid")?
Thank you in advance for your help. This issue is driving me mad.
Christophe PRADEILLES - SOLUTEC -
Volvo IT Supplier - CAE Lyon Application Production
Global I&O - Service PDEV
+33 (0) 4 72 96 94 02
christoph.pradeilles at consultant.volvo.com