[GE users] Eqw random errors when submitting batches of jobs

Reuti reuti at staff.uni-marburg.de
Thu Oct 16 15:24:43 BST 2008



Hi,

On 16.10.2008 at 13:01, Neil Baker wrote:

> We seem to be experiencing random Eqw errors when submitting jobs.
> We've been experiencing this for years with Grid Engine 5.3, but we
> migrated to Grid Engine 6.1 a few months ago and this is the first
> time we've experienced it on this new version.
>
> This may be due to our environment, but as we've already done
> extensive testing with no successful solution on Grid Engine 5.3, I
> was wondering if this is a known bug in Grid Engine with either a
> solution or a workaround?
>
> We're only talking about 1 job experiencing this problem for every
> 9,000 jobs submitted, but as our jobs have dependencies on results
> from previous jobs, an Eqw causes our jobs to stop completely.  As a
> group of jobs takes over a day to run, this is becoming a big
> headache for us.
>
> We're submitting 50 batches of 48 jobs.  When the jobs are
> submitted there are enough free slots on the grid to enable all the
> jobs to be run.
>
> On the Queue Master machine (stg-qmaster):
>
> stg-qmaster:~ # qstat | grep Eqw
> 1443430 0.50617 qsub.20    aabella      Eqw   10/16/2008 06:09:43
>
> From: /rmt/sge61/default/spool/qmaster/messages
>
> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|job 1435700.1 failed on  
> host stg-dell31.crl.toshiba.co.uk general opening input/output file  
> because: 10/15/2008 14:14:00 [715:32504]: error: can't open output  
> file "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory
> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|rescheduling job 1435700.1
>
> Although it says it rescheduled the job at the same time as the  
> error, we actually had to manually reschedule it.
>
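
Until the cause is found, the stuck jobs at least do not have to be hunted down by hand. A minimal sketch, assuming the default qstat column layout shown above (state in the fifth column) and job names without spaces; eqw_job_ids is a made-up helper name, not part of Grid Engine:

```shell
#!/bin/sh
# Print the IDs of jobs whose state column reads "Eqw".
# Reads plain `qstat` output on stdin. Assumes the default layout
# (job-ID, prior, name, user, state, ...) and job names without spaces.
# The header and separator lines never have "Eqw" in field 5, so they
# are skipped automatically. sort -un drops duplicate IDs (array tasks).
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }' | sort -un
}

# Example use (not executed here): clear the error state of each job
# so that the scheduler tries to dispatch it again.
#   for id in $(qstat | eqw_job_ids); do qmod -cj "$id"; done
```

Clearing the error state with qmod -cj only lets the scheduler try again; it does nothing about whatever made the output path unusable.
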
> On the Execution Host machine (stg-dell31):
>
> /var/log/messages:
> Oct 15 14:14:00 stg-dell31 automount[4107]: do_mount_indirect:  
> indirect trigger not valid or already mounted /rmt/exp

This seems to be a race condition. Can you use a static mount instead
of the automounter? Or at least put the spool directory on local disk
on each node and on the qmaster:

http://gridengine.sunsource.net/howto/nfsreduce.html
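
To see what a given path currently sits on, one can look up its filesystem type in the mount table. A rough sketch, assuming Linux hosts with /proc/mounts; fs_type_of is a hypothetical helper, and the paths in the usage lines are the ones from the logs quoted in this mail:

```shell
#!/bin/sh
# Print the filesystem type of the mount that holds the path given as $1.
# Reads /proc/mounts-style lines (device mountpoint fstype options ...)
# on stdin and picks the longest mount point that is a prefix of the path.
fs_type_of() {
    awk -v p="$1" '
        BEGIN { best = 0 }
        {
            m = $2
            # a mount point matches if it is "/" or a whole-component prefix of p
            if (m == "/" || p == m || index(p, m "/") == 1) {
                # ">=" lets a later mount on the same point override an earlier one
                if (length(m) >= best) { best = length(m); t = $3 }
            }
        }
        END { if (t != "") print t }
    '
}

# Example use (not executed here):
#   fs_type_of /rmt/sge61/default/spool/qmaster < /proc/mounts
#   fs_type_of /local/sge/spool/stg-dell31      < /proc/mounts
```

A result of autofs or nfs for a spool directory would mean it still depends on the automounter or the network; a local type such as ext3 would not.
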

-- Reuti


> /local/sge/spool/stg-dell31/messages:
> 10/15/2008 14:14:00|execd|stg-dell31|E|shepherd of job 1435700.1  
> exited with exit status = 26
> 10/15/2008 14:14:00|execd|stg-dell31|W|reaping job "1435700" ptf  
> complains: Job does not exist
> 10/15/2008 14:14:00|execd|stg-dell31|E|can't open usage file  
> "active_jobs/1435700.1/usage" for job 1435700.1: No such file or  
> directory
> 10/15/2008 14:14:00|execd|stg-dell31|E|10/15/2008 14:14:00
> [715:32504]: error: can't open output file
> "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory
>
> The directory sits on a very expensive NetApp NAS device and
> nothing is appearing in its logs to indicate a problem with it.
> However, there are a fair number of files in the directory:
>
> stg-dell31:~ # ls -l /rmt/exp/enUK/aabella/ASR_expt/s2st_oct08 | wc -l
>
> 9707 files
>
> The size of the directory is only 301K though, so I don't think this
> is near any limit.  New jobs successfully write to that directory
> after this error occurs.  Well, until the next time the same error
> occurs.
>
> Any advice would be great.
>
> Regards
>
> Neil
>
>

