[GE users] Job in Eqw state, and I don't know why ?

Reuti reuti at staff.uni-marburg.de
Sat Nov 10 14:12:52 GMT 2007


Hi,

On 08.11.2007 at 10:55, Pradeilles Christoph wrote:

>>> what is the prolog doing - you defined any custom one on a queue or
>>> cluster level? Error 26 means "Text file busy". Maybe an NFS delay,
>>> until the distributed/changed dir is propagated again?
>
> The main goal of our prolog file (preExec) is to copy the initial data
> from /data/depot/ to $SCRATCHDIR/$TMPDIR on the calculation node before
> the computation. The prolog also initializes some environment variables
> linked to the application and creates the "host.list" file. It is used
> at the cluster level and takes into account the specifics of each
> queue; it is not dedicated to one queue in particular.
>
> When you say "Text file busy", does that mean it is not an SGE problem
> but a Unix one? It is the same issue as when you try to erase a file
> which is in use, isn't it?
>
>>> Is the /data/depot necessary because of the size of the files?
>>> Otherwise it might be possible inside the jobscript to copy the input
>>> files first to the SGE created $TMPDIR, and after the computation the
>>> result back. Avoiding NFS usage during the computation might also
>>> speed it up.
>
> Unfortunately, we can't change the way jobs are submitted for now,
> because my company has designed this architecture for the whole
> cluster, all applications, and all users. Let me explain in a few words.
>
> If users launch their computations from their HOMEDIR (NFS), or from
> their workstation TMPDIR, directly on a calculation node, then when the
> result files are copied back there is a chance that the original file
> system does not have enough space to receive them (some of them can be
> 10 GB in size), and the calculation could be lost in that case.
>
> To avoid this kind of incident, which could be dramatic for a 10-day
> calculation, we put in place a "gateway" folder called /data/depot/,
> which is a 1.2 TB file system, and all calculations run through it. No
> more disk space issues with that.
>
> Our calculation workflow is this one:
>
> Original submission user folder ($HOMEDIR or $SCRATCHDIR) --> Copy to
> /data/depot/$USER/$CASENAME by a script --> After the copy, the script
> launches the qsub --> A calculation node is chosen --> Computation -->
> Result files are written back to /data/depot/$USER/$CASENAME --> Copy
> of the result files to the original submission user folder.

To get a unique directory per job in /data/depot (to be on the safe
side), what about this:

qsub the job with -h (on hold), so that you already have the job number
create a directory /data/depot/<job_number>
copy the data as before
qrls the job
proceed as before
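
The steps above could be sketched as a small shell function along these lines. This is only a sketch under assumptions: the names DEPOT, job_id_from_qsub_output and stage_and_submit are made up for illustration and are not part of SGE, and the job number is parsed from the usual 'Your job <number> ("<name>") has been submitted' line that qsub prints for a plain (non-array) job.

```shell
#!/bin/sh
# Sketch only: DEPOT and both function names are hypothetical helpers.

DEPOT="${DEPOT:-/data/depot}"   # gateway file system named in this thread

# Extract the job number from qsub's confirmation line, e.g.
#   Your job 123371 ("case01") has been submitted
# Prints nothing if the line does not have that shape.
job_id_from_qsub_output() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# stage_and_submit <source_dir> <job_script>
# Prints the job number on success.
stage_and_submit() {
    src_dir=$1
    job_script=$2

    # 1. Submit on hold, so the job cannot start before the data is staged.
    out=$(qsub -h "$job_script") || return 1
    job_id=$(job_id_from_qsub_output "$out")
    if [ -z "$job_id" ]; then
        echo "could not parse a job number from: $out" >&2
        return 1
    fi

    # 2. Create a directory that is unique to this job and copy the data.
    #    If staging fails, delete the held job instead of releasing it.
    job_dir="$DEPOT/$job_id"
    mkdir -p "$job_dir" && cp -R "$src_dir"/. "$job_dir"/ || {
        qdel "$job_id"
        return 1
    }

    # 3. Release the hold; from here on everything proceeds as before.
    qrls "$job_id" >/dev/null || return 1
    echo "$job_id"
}
```

It would be called as, for example, stage_and_submit "$HOME/casename01" run_case.sh (both arguments are placeholders). The prolog and the job script would then have to use /data/depot/$JOB_ID instead of /data/depot/$USER/$CASENAME; SGE sets $JOB_ID in the job's environment.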

-- Reuti


> If this last step fails because of a lack of disk space, the user is
> still able to retrieve his result files from /data/depot/.
>
> I will continue to investigate.
>
> Thanks for your help.
>
> Christophe.
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, 7 November 2007 18:42
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Job in Eqw state, and I don't know why ?
>
> Hi,
>
> On 07.11.2007 at 12:17, Pradeilles Christoph wrote:
>
>> I have a very strange issue and I think I found the cause, but I
>> don't know how to resolve it. I'm using "sge6u4".
>>
>> When I launch a series of jobs from the same directory, in some
>> cases some jobs in the series crash.
>> Here are the different steps performed by my launching script.
>>
>> Let's take a series of 10 jobs.
>>
>> casename01 to casename10
>>
>> - 1 - Copy the initial data (casename01) to a temporary folder
>> (/data/depot/$USER/CaseName/) before the SGE submission.
>> - 2 - Launch the submission script (the one with the qsub command)
>> from /data/depot/$USER/CaseName/
>> - 3 - The computation is processed on a calculation node. It is
>> always the same node (in this case the node is called vxxs0025).
>>
>> - 4 - During the calculation the launching script waits for
>> the result file... in a while loop.
>> - 5 - After the computation, results are copied to
>> /data/depot/$USER/CaseName/ and then to the user's initial folder.
>> The result file is called casename_result01.
>>
>> - 6 - When the script detects the result file casename_result01, it
>> means the SGE job is finished (indeed, it has vanished from "qstat").
>> The results are good.
>>
>> - 7 - The script removes the contents of /data/depot/$USER/CaseName/
>> - 8 - The script leaves the while loop and can submit the
>> second case (casename02) as described above, starting from
>> step 1.
>>
>> In a series of 10 jobs, only 1 calculation out of 2 works well.
>>
>> The first is ok,
>> The second one enters the Eqw state just after the qsub (which is
>> ok), before executing the SGE prolog file: prolog exited with exit
>> status 26
> what is the prolog doing - you defined any custom one on a queue or
> cluster level? Error 26 means "Text file busy". Maybe an NFS delay,
> until the distributed/changed dir is propagated again?
>
> Is the /data/depot necessary because of the size of the files?
> Otherwise it might be possible inside the jobscript to copy the input
> files first to the SGE created $TMPDIR, and after the computation the
> result back. Avoiding NFS usage during the computation might also
> speed it up.
>
> -- Reuti
>
>> The third is ok
>> The fourth one enters the Eqw state just after the qsub (which is
>> ok), before executing the SGE prolog file: prolog exited with exit
>> status 26
>>
>> Etc... until the last job
>>
>> When a job crashes, the SGE output file is not created!! (the one
>> named $casename.o$jobid)
>>
>> Here is the error message:
>>
>> Job 123371 caused action: Job 123371 set to ERROR
>>  User        = r060963
>>  Queue       = aix@vxxs0025.lyon.volvo.net
>>  Host        = vxxs0025.lyon.volvo.net
>>  Start Time  = <unknown>
>>  End Time    = <unknown>
>> failed opening input/output file:11/07/2007 10:46:29 [4603:536778]:
>> can't stat() "/data/depot/r060963/890H_ELLIPTICS" as stdout_path:
>>
>> Shepherd trace:
>> 11/07/2007 10:46:28 [144:851982]: shepherd called with uid = 0,
>> euid = 144
>> 11/07/2007 10:46:28 [144:851982]: starting up 6.0u4
>> 11/07/2007 10:46:29 [144:851982]: setpgid(851982, 851982) returned 0
>> 11/07/2007 10:46:29 [144:851982]: forked "prolog" with pid 536778
>> 11/07/2007 10:46:29 [144:536778]: pid=536778 pgrp=536778 sid=536778
>> old pgrp=851982 getlogin()=<no login set>
>> 11/07/2007 10:46:29 [144:851982]: using signal delivery delay of
>> 120 seconds
>> 11/07/2007 10:46:29 [144:851982]: child: prolog - pid: 536778
>> 11/07/2007 10:46:29 [4603:536778]: closing all filedescriptors
>> 11/07/2007 10:46:29 [4603:536778]: further messages are in "error"
>> and "trace"
>> 11/07/2007 10:46:29 [4603:536778]: using "/bin/csh" as shell of
>> user "r060963"
>> 11/07/2007 10:46:29 [144:851982]: wait3 returned 536778 (status:
>> 6656; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 26)
>> 11/07/2007 10:46:29 [144:851982]: prolog exited with exit status 26
>> 11/07/2007 10:46:29 [144:851982]: reaped "prolog" with pid 536778
>> 11/07/2007 10:46:29 [144:851982]: prolog exited not due to signal
>> 11/07/2007 10:46:29 [144:851982]: prolog exited with status 26
>> 11/07/2007 10:46:29 [144:851982]: no tasker to notify
>> 11/07/2007 10:46:29 [144:851982]: exit states increased from 0 to 1
>>
>> 11/07/2007 10:46:29 [144:851982]: failed starting prolog
>>
>> Shepherd error:
>> 11/07/2007 10:46:29 [4603:536778]: can't stat()
>> "/data/depot/r060963/890H_ELLIPTICS" as stdout_path: Missing file or
>> filesystem KRB5CCNAME=none uid=4603 gid=442 442 416 2060 2066
>>
>> Shepherd pe_hostfile:
>> vxxs0025.lyon.volvo.net 1 aix@vxxs0025.lyon.volvo.net <NULL>
>>
>>
>> But the very odd thing is the following:
>>
>> If I add a step - 9 - to my script which runs the command
>> "sleep 60", all 10 jobs work fine!!
>>
>> If I put sleep 30 --> 2 calculations out of 3 work fine.
>>
>> It seems that the longer I wait after the end of a job, the better the
>> chance that the next computation will work fine... -_-.
>>
>> Is SGE not cleaning up its environment properly (even if we no longer
>> see the job in qstat)? Why must I wait 1 minute before
>> launching the next computation? Why can't SGE launch its prolog
>> file, giving me "prolog exited with exit status 26"? Why
>> don't I have a $casename.o$jobid file? Does it mean the qsub failed (so
>> why do I get the message "...has been submitted" or "submit job
>> $jobid")?
>>
>> Thank you in advance for your help. I am going mad with this issue.
>>
>> Regards,
>>
>> Christophe PRADEILLES - SOLUTEC -
>> Volvo IT Supplier - CAE Lyon Application Production
>> Global  I&O - Service PDEV
>> +33 (0) 4 72 96 94 02
>> christoph.pradeilles at consultant.volvo.com
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
