[GE users] Eqw random errors when submitting batches of jobs
reuti at staff.uni-marburg.de
Thu Oct 16 15:24:43 BST 2008
On 16.10.2008 at 13:01, Neil Baker wrote:
> We seem to be experiencing random Eqw errors when submitting jobs.
> We've been experiencing this for years with Grid Engine 5.3, but we
> migrated to Grid Engine 6.1 a few months ago and this is the first
> time we've experienced it on this new version.
> This may be due to our environment, but as we've already done
> extensive testing with no successful solution on Grid Engine 5.3, I
> was wondering if this is a known bug in Grid Engine with either a
> solution or a work around?
> We're only talking about 1 job experiencing this problem for 9,000
> jobs submitted, but as our jobs have dependencies on results from
> previous jobs, an Eqw causes our jobs to stop completely. As a
> group of jobs takes over a day to run, this is becoming a big
> headache for us.
> We're submitting 50 batches of 48 jobs. When the jobs are
> submitted there are enough free slots on the grid to enable all the
> jobs to be run.
> On the Queue Master machine (stg-qmaster):
> stg-qmaster:~ # qstat | grep Eqw
> 1443430 0.50617 qsub.20 aabella Eqw 10/16/2008 06:09:43
> From: /rmt/sge61/default/spool/qmaster/messages
> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|job 1435700.1 failed on
> host stg-dell31.crl.toshiba.co.uk general opening input/output file
> because: 10/15/2008 14:14:00 [715:32504]: error: can't open output
> file "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory
> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|rescheduling job 1435700.1
> Although it says it rescheduled the job at the same time as the
> error, we actually had to manually reschedule it.
> On the Execution Host machine (stg-dell31):
> Oct 15 14:14:00 stg-dell31 automount: do_mount_indirect:
> indirect trigger not valid or already mounted /rmt/exp
This looks like a race condition. Can you use a static mount instead
of the automounter? Or at least put the spool directory on a local
disk on each node and on the qmaster:
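A rough way to check whether a spool directory currently sits on a
network or automounted filesystem is sketched below. It assumes GNU
stat (for `stat -f -c %T`); the function names and the list of
filesystem type names are illustrative assumptions, not part of Grid
Engine.

```shell
#!/bin/sh
# Classify a filesystem type name as network-backed or local.
# The set of names is an assumption; extend it for your site.
is_network_fs() {
    case "$1" in
        nfs|autofs|cifs|smb) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether a directory sits on a network filesystem.
# Relies on GNU stat: `stat -f -c %T` prints the filesystem type name.
check_spool_dir() {
    fstype=$(stat -f -c %T "$1") || return 2
    if is_network_fs "$fstype"; then
        echo "$1: $fstype (network)"
    else
        echo "$1: $fstype (local)"
    fi
}
```

For example, `check_spool_dir /rmt/sge61/default/spool/qmaster` on the
qmaster, and the same on each node for the path that
`qconf -sconf | grep execd_spool_dir` reports.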
> 10/15/2008 14:14:00|execd|stg-dell31|E|shepherd of job 1435700.1
> exited with exit status = 26
> 10/15/2008 14:14:00|execd|stg-dell31|W|reaping job "1435700" ptf
> complains: Job does not exist
> 10/15/2008 14:14:00|execd|stg-dell31|E|can't open usage file
> "active_jobs/1435700.1/usage" for job 1435700.1: No such file or directory
> 10/15/2008 14:14:00|execd|stg-dell31|E|10/15/2008 14:14:00
> [715:32504]: error: can't open output file "/rmt/exp/enUK/aabella/
> ASR_expt/s2st_oct08": Is a directory
> The directory sits on a very expensive NetApp NAS device and
> nothing is appearing in its logs to indicate a problem with it.
> However there are a fair amount of files in the directory:
> stg-dell31:~ # ls -l /rmt/exp/enUK/aabella/ASR_expt/s2st_oct08 | wc -l
> 9707 files
> The size of the directory is only 301K though so I don't think this
> is near any limit. New jobs successfully write to that directory
> after this error occurs. Well, until the next time the same error
> occurs.
> Any advice would be great.
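
Until the mount problem is sorted out, the manual rescheduling could
be scripted. The following is only a sketch: the function names are
made up, and it assumes the default qstat column layout shown above
(job ID in column 1, state in column 5), so a job name containing
spaces would break the parsing. `qstat -j` prints the recorded error
reason and `qmod -cj` clears the error state so the job can be
scheduled again.

```shell
#!/bin/sh
# Print the IDs of jobs whose state column is exactly "Eqw".
# Reads qstat-style lines on stdin: job-ID prior name user state ...
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }'
}

# Show the recorded error reason for each Eqw job, then clear the
# error state so the scheduler will consider the job again.
clear_eqw_jobs() {
    qstat | eqw_job_ids | while read -r id; do
        qstat -j "$id" | grep -i 'error reason'
        qmod -cj "$id"
    done
}
```

Running `clear_eqw_jobs` from cron on the qmaster would hide the
symptom, not fix the cause, so it is worth logging the error reasons
it prints.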