[GE users] Job in Eqw state, and I don't know why ?

Pradeilles Christoph christoph.pradeilles at consultant.volvo.com
Thu Nov 8 09:55:56 GMT 2007


Hi,

>>what is the prolog doing - did you define a custom one at the queue
>>or cluster level? Error 26 means "Text file busy". Maybe an NFS
>>delay, until the distributed/changed dir is propagated again?

The main goal of our prolog script (preExec) is to copy the initial
data from /data/depot/ to the compute node's $SCRATCHDIR/$TMPDIR before
the computation. The prolog also initializes some environment variables
needed by the application and creates the "host.list" file. It is
defined at the cluster level and takes into account the specifics of
each queue; it is not dedicated to one queue in particular.
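
In outline, the prolog does something like this (a simplified sketch;
the environment variables set here are illustrative, and the real
script is more involved):

    #!/bin/sh
    # prolog (preExec) -- runs on the compute node before the job starts
    # 1. stage the initial data from the NFS depot to node-local scratch
    cp -r /data/depot/$USER/$CASENAME/* $SCRATCHDIR/$TMPDIR/ || exit 1
    # 2. set application-specific environment variables (illustrative)
    APPL_HOME=/opt/our_application
    export APPL_HOME
    # 3. create the host.list file used by the application
    hostname > $SCRATCHDIR/$TMPDIR/host.list
    exit 0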

When you say "Text file busy", do you mean it is not an SGE problem but
a Unix one? Is it the same issue as when you try to delete a file that
is in use?
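
If I understand correctly, it is the classic Unix errno ETXTBSY (whose
message is "Text file busy"), which one can reproduce like this (just
an illustration, not our actual setup):

    cp /bin/sleep ./mybin       # make a private copy of a binary
    ./mybin 60 &                # execute it in the background
    echo test > ./mybin         # try to write to the running binary
    # -> fails with "Text file busy" (ETXTBSY) on most Unix systems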

>>Is the /data/depot necessary because of the size of the files?
>>Otherwise it might be possible, inside the jobscript, to copy the
>>input files first to the SGE-created $TMPDIR, and to copy the results
>>back after the computation. Avoiding NFS usage during the computation
>>might also speed it up.

Unfortunately, we cannot change the way submissions are done for now,
because my company designed this architecture for the whole cluster,
all applications, and all users. Let me explain in a few words.

If users launched their computations directly from their HOMEDIR (NFS),
or from their workstation's TMPDIR on the compute node, then when the
result files were copied back, there would be a chance that the
original file system did not have enough space to receive them (some of
them can be 10 GB in size), and the calculation could be lost.

To avoid this kind of incident, which could be dramatic for a 10-day
calculation, we put in place a "gateway" folder called /data/depot/, a
1.2 TB file system through which all calculations run. With that, there
are no more disk space issues.

Our calculation workflow is the following:

Original user submission folder ($HOMEDIR or $SCRATCHDIR) --> copy to
/data/depot/$USER/$CASENAME via a script --> after the copy, the script
runs qsub --> a compute node is chosen --> computation --> result files
retrieved to /data/depot/$USER/$CASENAME --> result files copied back
to the original user submission folder.

If this last step fails because of a lack of disk space, the user can
still retrieve the result files from /data/depot/.
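
In script form, the flow is roughly the following (a simplified sketch;
the real wrapper handles more cases, and the file names are
illustrative):

    #!/bin/sh
    # launch_case.sh -- simplified sketch of our submission wrapper
    SRC=$1                                # original user folder
    CASENAME=`basename $SRC`
    DEPOT=/data/depot/$USER/$CASENAME

    mkdir -p $DEPOT
    cp -r $SRC/* $DEPOT/                  # stage through the gateway FS
    cd $DEPOT
    qsub $CASENAME.sh                     # SGE chooses a compute node

    # wait until the computation has produced its result file
    while [ ! -f $DEPOT/${CASENAME}_result ]; do
        sleep 10
    done

    # copy results back; on failure they stay available in the depot
    cp $DEPOT/${CASENAME}_result $SRC/ ||
        echo "Not enough space in $SRC, results kept in $DEPOT"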

I will continue to investigate.

Thanks for your help.

Christophe.

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Wednesday, November 7, 2007 18:42
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Job in Eqw state, and I don't know why ?

Hi,

On 07.11.2007 at 12:17, Pradeilles Christoph wrote:

> I have a very strange issue and I think I found the cause, but I  
> don't know how to resolve it. I'm using SGE 6.0u4.
>
> When I launch a series of jobs from the same directory, in some  
> cases jobs in the series crash.
> Here are the different steps performed by my launching script.
>
> Let's take a series of 10 jobs.
>
> casename01 to casename10
>
> - 1 - Copy the initial data (casename01) to a temporary folder  
> (/data/depot/$USER/CaseName/) before the SGE submission.
> - 2 - Launch the submission script (the one with the qsub command)  
> from /data/depot/$USER/CaseName/
> - 3 - The computation runs on a compute node. It is always the same  
> node (in this case the node is called vxxs0025).
>
> - 4 - During the calculation, the launching script waits for the  
> result file in a while loop.
> - 5 - After the computation, results are copied to  
> /data/depot/$USER/CaseName/ and then to the user's initial folder.  
> The result file is called casename_result01.
>
> - 6 - When the script detects the result file casename_result01, it  
> means the SGE job is finished (indeed, it has vanished from qstat).  
> The results are good.
>
> - 7 - The script removes the contents of /data/depot/$USER/CaseName/
> - 8 - The script exits the while loop and can submit the second  
> casename (casename02) as described above, starting from step 1.
>
> In a series of 10 jobs, only one calculation out of two works.
>
> The first one is OK.
> The second one enters an Eqw state just after the qsub (which  
> succeeds), before the SGE prolog executes: prolog exited with exit  
> status 26
what is the prolog doing - did you define a custom one at the queue or  
cluster level? Error 26 means "Text file busy". Maybe an NFS delay,  
until the distributed/changed dir is propagated again?

Is the /data/depot necessary because of the size of the files?  
Otherwise it might be possible, inside the jobscript, to copy the input  
files first to the SGE-created $TMPDIR, and to copy the results back  
after the computation. Avoiding NFS usage during the computation might  
also speed it up.
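
For example, a jobscript along these lines (just a sketch; the solver
and file names are placeholders):

    #!/bin/sh
    #$ -cwd
    # stage the input into the node-local $TMPDIR created by SGE
    cp /data/depot/$USER/$CASENAME/input.dat $TMPDIR/
    cd $TMPDIR
    /opt/our_application/solver input.dat   # compute without NFS traffic
    # copy the results back only once, at the end
    cp result.dat /data/depot/$USER/$CASENAME/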

-- Reuti

> The third one is OK.
> The fourth one enters an Eqw state just after the qsub (which  
> succeeds), before the SGE prolog executes: prolog exited with exit  
> status 26
>
> Etc... until the last job
>
> When a job crashes, the SGE output file (the one named  
> $casename.o$jobid) is not created!!
>
> Here is the error message:
>
> Job 123371 caused action: Job 123371 set to ERROR
>  User        = r060963
>  Queue       = aix at vxxs0025.lyon.volvo.net
>  Host        = vxxs0025.lyon.volvo.net
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed opening input/output file:11/07/2007 10:46:29 [4603:536778]:  
> can't stat() "/data/depot/r060963/890H_ELLIPTICS" as stdout_path:
>
> Shepherd trace:
> 11/07/2007 10:46:28 [144:851982]: shepherd called with uid = 0,  
> euid = 144
> 11/07/2007 10:46:28 [144:851982]: starting up 6.0u4
> 11/07/2007 10:46:29 [144:851982]: setpgid(851982, 851982) returned 0
> 11/07/2007 10:46:29 [144:851982]: forked "prolog" with pid 536778
> 11/07/2007 10:46:29 [144:536778]: pid=536778 pgrp=536778 sid=536778  
> old pgrp=851982 getlogin()=<no login set>
> 11/07/2007 10:46:29 [144:851982]: using signal delivery delay of  
> 120 seconds
> 11/07/2007 10:46:29 [144:851982]: child: prolog - pid: 536778
> 11/07/2007 10:46:29 [4603:536778]: closing all filedescriptors
> 11/07/2007 10:46:29 [4603:536778]: further messages are in "error"  
> and "trace"
> 11/07/2007 10:46:29 [4603:536778]: using "/bin/csh" as shell of  
> user "r060963"
> 11/07/2007 10:46:29 [144:851982]: wait3 returned 536778 (status:  
> 6656; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
> 11/07/2007 10:46:29 [144:851982]: prolog exited with exit status 26
> 11/07/2007 10:46:29 [144:851982]: reaped "prolog" with pid 536778
> 11/07/2007 10:46:29 [144:851982]: prolog exited not due to signal
> 11/07/2007 10:46:29 [144:851982]: prolog exited with status 26
> 11/07/2007 10:46:29 [144:851982]: no tasker to notify
> 11/07/2007 10:46:29 [144:851982]: exit states increased from 0 to 1
>
> 11/07/2007 10:46:29 [144:851982]: failed starting prolog
>
> Shepherd error:
> 11/07/2007 10:46:29 [4603:536778]: can't stat() "/data/depot/ 
> r060963/890H_ELLIPTICS" as stdout_path: Missing file or filesystem  
> KRB5CCNAME=none uid=4603 gid=442 442 416 2060 2066
>
> Shepherd pe_hostfile:
> vxxs0025.lyon.volvo.net 1 aix at vxxs0025.lyon.volvo.net <NULL>
>
>
> But the very odd thing is the following:
>
> If I add a step - 9 - to my script which runs the following  
> command: sleep 60,
> all 10 jobs work fine!!
>
> If I use sleep 30, 2 out of 3 calculations work fine.
>
> It seems the longer I wait after the end of a job, the better the  
> chance that the next computation will work... -_-
>
> Is SGE not cleaning up its environment properly (even though we no  
> longer see the jobs in qstat)? Why must I wait one minute before  
> launching the next computation? Why does SGE fail to launch its  
> prolog and give me "prolog exited with exit status 26"? Why don't I  
> have a $casename.o$jobid file? Does that mean the qsub failed (and  
> if so, why do I get the message "...has been submitted" or "submit  
> job $jobid")?
>
> Thank you in advance for your help. I am going mad over this issue.
>
> Regards,
>
> Christophe PRADEILLES - SOLUTEC -
> Volvo IT Supplier - CAE Lyon Application Production
> Global  I&O - Service PDEV
> +33 (0) 4 72 96 94 02
> christoph.pradeilles at consultant.volvo.com
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
