[GE users] Eqw random errors when submitting batches of jobs

Neil Baker neil.baker at crl.toshiba.co.uk
Thu Oct 16 15:58:12 BST 2008


Hi Reuti,

We are already writing to local spool directories to reduce NFS traffic, but
this error occurs when Grid Engine tries to write the error and output files
back to the current working directory (currently on an NFS volume, so that
all the results are gathered in one location).

I suppose we could point the output path at a local directory instead, so
that results are written locally on each execution host?
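
Something along these lines, perhaps -- the /local/results path and script
name are only examples, and the directory would have to exist on every
execution host:

    # send stdout/stderr to a local directory instead of the NFS cwd;
    # if the path is an existing directory, SGE writes the usual
    # <jobname>.o<jobid> / <jobname>.e<jobid> files into it
    qsub -o /local/results/ -e /local/results/ run_expt.sh

We'd then have to copy the results back to the NFS volume ourselves once
each job finishes.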

Neil

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 16 October 2008 15:25
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Eqw random errors when submitting batches of jobs

Hi,

Am 16.10.2008 um 13:01 schrieb Neil Baker:

> We seem to be hitting random Eqw errors when submitting jobs.  We saw
> this for years with Grid Engine 5.3; we migrated to Grid Engine 6.1 a
> few months ago, and this is the first time it has happened on the new
> version.
>
> This may be down to our environment, but as extensive testing on Grid
> Engine 5.3 never produced a fix, I was wondering whether this is a
> known Grid Engine bug with either a solution or a workaround?
>
> We're only talking about 1 job in every 9,000 submitted hitting this
> problem, but as our jobs depend on results from previous jobs, a
> single Eqw brings everything to a halt.  As a group of jobs takes
> over a day to run, this is becoming a big headache for us.
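>
> (The dependencies are expressed with qsub's -hold_jid option, along
> these lines -- job and script names here are just an illustration:
>
>     qsub -N stage1 stage1.sh
>     qsub -N stage2 -hold_jid stage1 stage2.sh
>
> so a single job stuck in Eqw holds up everything queued behind it.)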
>
> We're submitting 50 batches of 48 jobs.  When the jobs are submitted
> there are enough free slots on the grid to run them all.
>
> On the Queue Master machine (stg-qmaster):
>
> stg-qmaster:~ # qstat | grep Eqw
> 1443430 0.50617 qsub.20    aabella      Eqw   10/16/2008 06:09:43
>
> From: /rmt/sge61/default/spool/qmaster/messages
>
> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|job 1435700.1 failed on host stg-dell31.crl.toshiba.co.uk general opening input/output file because: 10/15/2008 14:14:00 [715:32504]: error: can't open output file "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory
> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|rescheduling job 1435700.1
>
> Although the log says the job was rescheduled at the same time as the
> error, we actually had to reschedule it manually.
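>
> (Manual rescheduling here means clearing the job's error state.  A
> sketch of what we run, using the job ID from the messages above:
>
>     qstat -j 1435700 | grep -i error   # show the reason for the Eqw state
>     qmod -cj 1435700                   # clear the error state so the job can be scheduled again
> )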
>
> On the Execution Host machine (stg-dell31):
>
> /var/log/messages:
> Oct 15 14:14:00 stg-dell31 automount[4107]: do_mount_indirect: indirect trigger not valid or already mounted /rmt/exp

This seems to be a race condition. Can you use a static mount instead
of the automounter? Or at least put the spool directory on local disk
on each node and on the qmaster:

http://gridengine.sunsource.net/howto/nfsreduce.html
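
A static mount would just be a fixed entry in /etc/fstab on each node
instead of the automount map, e.g. (server name and export path below
are placeholders for your NetApp):

    # static NFS mount replacing the automount trigger for /rmt/exp
    # nas-server:/vol/exp is a placeholder; use your filer's real export
    nas-server:/vol/exp  /rmt/exp  nfs  rw,hard,intr  0 0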

-- Reuti


> /local/sge/spool/stg-dell31/messages:
> 10/15/2008 14:14:00|execd|stg-dell31|E|shepherd of job 1435700.1 exited with exit status = 26
> 10/15/2008 14:14:00|execd|stg-dell31|W|reaping job "1435700" ptf complains: Job does not exist
> 10/15/2008 14:14:00|execd|stg-dell31|E|can't open usage file "active_jobs/1435700.1/usage" for job 1435700.1: No such file or directory
> 10/15/2008 14:14:00|execd|stg-dell31|E|10/15/2008 14:14:00 [715:32504]: error: can't open output file "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory
>
> The directory sits on a very expensive NetApp NAS device, and nothing
> in its logs indicates a problem with it.  However, there are a fair
> number of files in the directory:
>
> stg-dell31:~ # ls -l /rmt/exp/enUK/aabella/ASR_expt/s2st_oct08 | wc -l
>
> 9707 files
>
> The directory itself is only 301K, though, so I don't think we're
> near any limit.  New jobs write to that directory successfully after
> the error occurs, at least until the next time the same error
> appears.
>
> Any advice would be great.
>
> Regards
>
> Neil
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list