[GE users] Eqw random errors when submitting batches of jobs

Neil Baker neil.baker at crl.toshiba.co.uk
Thu Oct 16 12:01:16 BST 2008



We are seeing random Eqw errors when submitting jobs.  We saw this for
years with Grid Engine 5.3; we migrated to Grid Engine 6.1 a few months
ago, and this is the first time it has happened on the new version.


This may be due to our environment, but we already tested extensively on
Grid Engine 5.3 without finding a solution, so I was wondering whether this
is a known bug in Grid Engine with either a fix or a workaround.


This affects only about 1 job in every 9,000 submitted, but because our
jobs depend on results from previous jobs, a single Eqw stops the whole run.
As a group of jobs takes over a day to run, this is becoming a big headache
for us.


We're submitting 50 batches of 48 jobs.  When the jobs are submitted there
are enough free slots on the grid for all of them to run.


On the Queue Master machine (stg-qmaster):


stg-qmaster:~ # qstat | grep Eqw
1443430 0.50617 qsub.20    aabella      Eqw   10/16/2008 06:09:43


From: /rmt/sge61/default/spool/qmaster/messages


10/15/2008 14:14:01|qmaster|stg-qmaster|W|job 1435700.1 failed on host stg-dell31.crl.toshiba.co.uk general opening input/output file because: 10/15/2008 14:14:00 [715:32504]: error: can't open output file "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory
10/15/2008 14:14:01|qmaster|stg-qmaster|W|rescheduling job 1435700.1


Although the log says the job was rescheduled at the same time as the error,
we actually had to reschedule it manually.
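
For anyone who wants to script the manual step, here is a minimal sketch. It assumes that clearing the job's error state with `qmod -cj` is an acceptable way to get the job retried (the original message does not say how the job was rescheduled), that `qstat` and `qmod` are on the PATH, and that the caller has sufficient rights. The function names `eqw_job_ids` and `clear_eqw_jobs` are hypothetical.

```shell
#!/bin/sh
# Print the IDs of all jobs that qstat reports in the Eqw state.
# Reads qstat output on stdin so it can be tried without a live cluster.
eqw_job_ids() {
    # In the default qstat listing, column 1 is the job ID and
    # column 5 is the job state.
    awk '$5 == "Eqw" { print $1 }'
}

# Clear the error state of every Eqw job so the scheduler can retry it.
# Assumption: qstat and qmod are on PATH and the caller may modify the jobs.
clear_eqw_jobs() {
    qstat | eqw_job_ids | while read -r job_id; do
        qmod -cj "$job_id"
    done
}
```

Running `clear_eqw_jobs` only retries the affected jobs; it does nothing about whatever caused the Eqw state in the first place.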


On the Execution Host machine (stg-dell31):



Oct 15 14:14:00 stg-dell31 automount[4107]: do_mount_indirect: indirect trigger not valid or already mounted /rmt/exp



10/15/2008 14:14:00|execd|stg-dell31|E|shepherd of job 1435700.1 exited with exit status = 26
10/15/2008 14:14:00|execd|stg-dell31|W|reaping job "1435700" ptf complains: Job does not exist
10/15/2008 14:14:00|execd|stg-dell31|E|can't open usage file "active_jobs/1435700.1/usage" for job 1435700.1: No such file or directory
10/15/2008 14:14:00|execd|stg-dell31|E|10/15/2008 14:14:00 [715:32504]: error: can't open output file "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory


The directory sits on a very expensive NetApp NAS device, and nothing in
its logs indicates a problem with it.  However, there are a fair number of
files in the directory:


stg-dell31:~ # ls -l /rmt/exp/enUK/aabella/ASR_expt/s2st_oct08 | wc -l


9707 files


The size of the directory is only 301K though, so I don't think this is near
any limit.  New jobs successfully write to that directory after this error
occurs, at least until the next time the same error occurs.
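
A side note on the file count: `ls -l | wc -l` also counts the leading `total` line and any subdirectories, so the figure can be slightly high. Below is a minimal sketch that counts only regular files directly inside a directory. It assumes a `find` that supports `-maxdepth` (GNU and BSD versions do); the function name `count_files` is hypothetical.

```shell
#!/bin/sh
# Count only the regular files directly inside the given directory.
# Unlike `ls -l | wc -l`, this skips the "total" line and subdirectories.
# Note: file names containing newlines would be counted more than once.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l
}
```

For example, `count_files /rmt/exp/enUK/aabella/ASR_expt/s2st_oct08` would give the count for the directory discussed here.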


Any advice would be great.






