[GE users] Job in Eqw state, and I don't know why ?

Reuti reuti at staff.uni-marburg.de
Wed Nov 7 17:42:14 GMT 2007



Hi,

On 07.11.2007 at 12:17, Pradeilles Christoph wrote:

> I have a very strange issue and I think I found the cause, but I
> don't know how to resolve it. I'm using SGE 6.0u4.
>
> When I launch a series of jobs from the same directory, in some
> cases jobs in the series crash.
> Here are the steps performed by my launch script.
>
> Let's take a 10-job series:
>
> casename01 to casename10
>
> - 1 - Copy the initial data (casename01) to a temporary folder
>       (/data/depot/$USER/CaseName/) before the SGE submission.
> - 2 - Launch the submission script (the one with the qsub command)
>       from /data/depot/$USER/CaseName/.
> - 3 - The computation is processed on a compute node. It is always
>       the same node (in this case the node is called vxxs0025).
>
> - 4 - During the calculation, the launch script waits for the
>       results file in a while loop.
> - 5 - After the computation, the results are copied to
>       /data/depot/$USER/CaseName/ and then to the user's initial
>       folder. The result file is called casename_result01.
>
> - 6 - When the script detects the result file casename_result01, it
>       means the SGE job is finished (indeed, it has vanished from
>       the qstat output). The results are good.
>
> - 7 - The script removes the contents of /data/depot/$USER/CaseName/.
> - 8 - The script leaves the while loop and can submit the second
>       case (casename02) as described above, starting from step 1.
>
> In a series of 10 jobs, only half of the calculations work.
>
> The first is OK.
> The second one enters the Eqw state just after the qsub (which
> succeeds), before the SGE prolog file is executed: prolog exited
> with exit status 26
What is the prolog doing? Did you define a custom one at the queue or
cluster level? Error 26 means "Text file busy". Maybe it is an NFS
delay, until the distributed/changed directory is propagated again?

Is /data/depot necessary because of the size of the files?
Otherwise the job script could first copy the input files to the
$TMPDIR that SGE creates for the job, and copy the result back after
the computation. Avoiding NFS usage during the computation might also
speed it up.
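
A rough sketch of such a job script body, under a few assumptions: the
function name run_staged and the solver command are placeholders, and
the fallback to /tmp is only there so the function can be tried outside
of SGE (inside a job, SGE sets $TMPDIR itself).

```shell
#!/bin/sh
# run_staged SRC_DIR RESULT_NAME COMMAND [ARGS...]
#   SRC_DIR      directory holding the input files
#   RESULT_NAME  file the command is expected to write in its working dir
#   COMMAND      the computation to run (placeholder for the real solver)
run_staged() {
    src_dir=$1
    result_name=$2
    shift 2

    # Work in a private subdirectory of the node-local scratch area.
    work_dir="${TMPDIR:-/tmp}/stage.$$"
    mkdir -p "$work_dir" || return 1

    # Stage in: copy the input files to local disk.
    cp -R "$src_dir"/. "$work_dir"/ || { rm -rf "$work_dir"; return 1; }

    # Run the computation inside the scratch directory.
    ( cd "$work_dir" && "$@" )
    status=$?

    # Stage out: copy only the result file back to the source directory.
    if [ "$status" -eq 0 ]; then
        if [ -f "$work_dir/$result_name" ]; then
            cp "$work_dir/$result_name" "$src_dir/" || status=1
        else
            status=1    # the command succeeded but wrote no result file
        fi
    fi

    rm -rf "$work_dir"
    return "$status"
}
```

In the job script this could be called as, for example,
run_staged /data/depot/$USER/CaseName casename_result01 ./solver casename01
where ./solver stands for the real application.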

-- Reuti

> The third is OK.
> The fourth one enters the Eqw state just after the qsub (which
> succeeds), before the SGE prolog file is executed: prolog exited
> with exit status 26
>
> And so on, until the last job.
>
> When a job crashes, the SGE output file is not created!! (the one
> named $casename.o$jobid)
>
> Here is the error message:
>
> Job 123371 caused action: Job 123371 set to ERROR
>  User        = r060963
>  Queue       = aix@vxxs0025.lyon.volvo.net
>  Host        = vxxs0025.lyon.volvo.net
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed opening input/output file:11/07/2007 10:46:29 [4603:536778]:  
> can't stat() "/data/depot/r060963/890H_ELLIPTICS" as stdout_path:
>
> Shepherd trace:
> 11/07/2007 10:46:28 [144:851982]: shepherd called with uid = 0,  
> euid = 144
> 11/07/2007 10:46:28 [144:851982]: starting up 6.0u4
> 11/07/2007 10:46:29 [144:851982]: setpgid(851982, 851982) returned 0
> 11/07/2007 10:46:29 [144:851982]: forked "prolog" with pid 536778
> 11/07/2007 10:46:29 [144:536778]: pid=536778 pgrp=536778 sid=536778  
> old pgrp=851982 getlogin()=<no login set>
> 11/07/2007 10:46:29 [144:851982]: using signal delivery delay of  
> 120 seconds
> 11/07/2007 10:46:29 [144:851982]: child: prolog - pid: 536778
> 11/07/2007 10:46:29 [4603:536778]: closing all filedescriptors
> 11/07/2007 10:46:29 [4603:536778]: further messages are in "error"  
> and "trace"
> 11/07/2007 10:46:29 [4603:536778]: using "/bin/csh" as shell of  
> user "r060963"
> 11/07/2007 10:46:29 [144:851982]: wait3 returned 536778 (status:  
> 6656; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
> 11/07/2007 10:46:29 [144:851982]: prolog exited with exit status 26
> 11/07/2007 10:46:29 [144:851982]: reaped "prolog" with pid 536778
> 11/07/2007 10:46:29 [144:851982]: prolog exited not due to signal
> 11/07/2007 10:46:29 [144:851982]: prolog exited with status 26
> 11/07/2007 10:46:29 [144:851982]: no tasker to notify
> 11/07/2007 10:46:29 [144:851982]: exit states increased from 0 to 1
>
> 11/07/2007 10:46:29 [144:851982]: failed starting prolog
>
> Shepherd error:
> 11/07/2007 10:46:29 [4603:536778]: can't stat() "/data/depot/ 
> r060963/890H_ELLIPTICS" as stdout_path: Missing file or filesystem  
> KRB5CCNAME=none uid=4603 gid=442 442 416 2060 2066
>
> Shepherd pe_hostfile:
> vxxs0025.lyon.volvo.net 1 aix@vxxs0025.lyon.volvo.net <NULL>
>
>
> But the very odd thing is the following:
>
> If I add a step - 9 - to my script which runs the command
> "sleep 60", all 10 jobs work fine!!
>
> With "sleep 30", 2/3 of the calculations work fine.
>
> It seems that the longer I wait after the end of a job, the better
> the chance that the next computation will work fine... -_-.
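
Instead of a fixed sleep, the launch script could poll for a condition
with an upper time limit. A minimal sketch: wait_until is a made-up
helper, and which condition to test (for example, whether the directory
is visible on the compute node) is an assumption that depends on how the
node can be reached at your site.

```shell
#!/bin/sh
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-run COMMAND every INTERVAL seconds until it succeeds.
# Returns 0 on success, 1 once about TIMEOUT seconds have been waited.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    while ! "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Hypothetical use before the next qsub (assumes ssh access to the node):
#   wait_until 120 5 ssh vxxs0025 test -d "/data/depot/$USER/CaseName" \
#       || echo "directory still not visible on the node" >&2
```

This waits only as long as necessary and reports a failure instead of
silently submitting a job that would end up in Eqw.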
>
> Is SGE not cleaning up its environment properly (even though we no
> longer see the jobs in qstat)? Why must I wait 1 minute before
> launching the next computation? Why can't SGE launch its prolog
> file, and why does it give me "prolog exited with exit status 26"?
> Why don't I have a $casename.o$jobid file? Does it mean the qsub
> failed? If so, why do I get the message "...has been submitted" or
> "submit job $jobid"?
>
> Thank you in advance for your help. This issue is driving me mad.
>
> Regards,
>
> Christophe PRADEILLES - SOLUTEC -
> Volvo IT Supplier - CAE Lyon Application Production
> Global  I&O - Service PDEV
> +33 (0) 4 72 96 94 02
> christoph.pradeilles at consultant.volvo.com
>
>
