[GE users] Eqw random errors when submitting batches of jobs

Tore Sundqvist tore at it.uu.se
Fri Oct 17 13:08:54 BST 2008


Hi,
I'm experiencing the same, seemingly random, problem with entering NFS 
auto-mounted home directories. I have all SGE files (with the exception of 
the common directory) on the local nodes.

It is worst in a cluster with 8 slots per execution host, where all the 
slots often start jobs at the same time for the same user. That asks the 
automounter to mount the same home directory 8 times at virtually the same 
time. Perhaps 1 job out of 1000 enters Eqw at startup.
It might be related to high server load, since we often start a few 
hundred jobs at the same time, but I have never been able to provoke the 
error by submitting a few hundred of my own test jobs - it only happens 
with important production jobs.
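
For what it's worth, a crude way to stress the automounter in the same way 
outside of SGE might be something like this (the username is just an 
example):

  # force an unmount so the automounter has to mount again, then ask
  # for the same auto-mounted directory from 8 processes at once
  umount /home/griduser003 2>/dev/null
  for i in $(seq 1 8); do
      ls /home/griduser003 > /dev/null &
  done
  wait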

I'm inclined to blame the automounter but don't really know how to 
continue with this. We are running Red Hat-based Scientific Linux 5.2 on 
both servers and clients, kernel 2.6.18-92.1.13.el5 and autofs 
5.0.1-0.rc2.88. My next step would be to install the latest autofs 
software from kernel.org.
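
The other option would be to skip autofs for the home directories 
altogether and use static NFS mounts - roughly an /etc/fstab line like 
this (server and export names are made up):

  # static NFS mount instead of an autofs map - names are examples only
  nfsserver:/export/home  /home  nfs  rw,hard,intr  0 0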

The errors come together with a syslog entry on the execution node like 
this:
  Oct 12 08:45:16 gr38 automount[2823]: do_mount_indirect: indirect 
trigger not valid or already mounted /home/griduser003

While SGE reports:
failed changing into working directory: ...
  error: can't chdir to /home/griduser003: No such file or directory

There are no related log messages on the NFS server.
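
Another workaround I'm considering is a queue prolog that pokes the job's 
working directory a few times before the job itself needs it, so the mount 
is already in place when the shepherd does its chdir. A rough, untested 
sketch (assuming the prolog inherits the job's environment, including 
SGE_O_WORKDIR):

  #!/bin/sh
  # retry until the (auto-mounted) working directory is reachable;
  # always exit 0 so a persistent failure behaves exactly as it does today
  dir="${SGE_O_WORKDIR:-$HOME}"
  for i in 1 2 3 4 5; do
      cd "$dir" 2>/dev/null && break
      sleep 2
  done
  exit 0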

Tore Sundqvist
Uppsala Universitet
Sweden


Neil Baker wrote:
> Hi Reuti,
> 
> We are already writing to local spool directories to reduce NFS traffic, but
> this error occurs when it is trying to write the error and output files back
> to the current working directory (currently on an NFS volume to gather up
> all the results into one location).
> 
> I suppose we could set the directory to be a local directory path, so that
> results are written to a local directory on each execution host?
> 
> Neil
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: 16 October 2008 15:25
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Eqw random errors when submitting batches of jobs
> 
> Hi,
> 
> Am 16.10.2008 um 13:01 schrieb Neil Baker:
> 
>> We seem to be experiencing random Eqw errors when submitting jobs.   
>> We've been experiencing this for years with Grid Engine 5.3, but we  
>> migrated to Grid Engine 6.1 a few months ago and this is the first  
>> time we've experienced it on this new version.
>>
>> This may be due to our environment, but as we've already done  
>> extensive testing with no successful solution on Grid Engine 5.3, I  
>> was wondering if this is a known bug in Grid Engine with either a  
>> solution or a workaround?
>>
>> We're only talking about 1 job experiencing this problem for every  
>> 9,000 jobs submitted, but as our jobs have dependencies on results from  
>> previous jobs, an Eqw causes our jobs to stop completely.  As a  
>> group of jobs takes over a day to run, this is becoming a big  
>> headache for us.
>>
>> We're submitting 50 batches of 48 jobs.  When the jobs are  
>> submitted there are enough free slots on the grid to enable all the  
>> jobs to be run.
>>
>> On the Queue Master machine (stg-qmaster):
>>
>> stg-qmaster:~ # qstat | grep Eqw
>> 1443430 0.50617 qsub.20    aabella      Eqw   10/16/2008 06:09:43
>>
>> From: /rmt/sge61/default/spool/qmaster/messages
>>
>> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|job 1435700.1 failed on  
>> host stg-dell31.crl.toshiba.co.uk general opening input/output file  
>> because: 10/15/2008 14:14:00 [715:32504]: error: can't open output  
>> file "/rmt/exp/enUK/aabella/ASR_expt/s2st_oct08": Is a directory
>> 10/15/2008 14:14:01|qmaster|stg-qmaster|W|rescheduling job 1435700.1
>>
>> Although it says it rescheduled the job at the same time as the  
>> error, we actually had to manually reschedule it.
>>
>> On the Execution Host machine (stg-dell31):
>>
>> /var/log/messages:
>> Oct 15 14:14:00 stg-dell31 automount[4107]: do_mount_indirect:  
>> indirect trigger not valid or already mounted /rmt/exp
> 
> Seems to be a race condition. Can you use a static mount without the  
> automounter? Or at least put the spool directory locally on each node  
> and on the qmaster:
> 
> http://gridengine.sunsource.net/howto/nfsreduce.html
> 
> -- Reuti
> 
> 
>> /local/sge/spool/stg-dell31/messages:
>> 10/15/2008 14:14:00|execd|stg-dell31|E|shepherd of job 1435700.1  
>> exited with exit status = 26
>> 10/15/2008 14:14:00|execd|stg-dell31|W|reaping job "1435700" ptf  
>> complains: Job does not exist
>> 10/15/2008 14:14:00|execd|stg-dell31|E|can't open usage file  
>> "active_jobs/1435700.1/usage" for job 1435700.1: No such file or  
>> directory
>> 10/15/2008 14:14:00|execd|stg-dell31|E|10/15/2008 14:14:00  
>> [715:32504]: error: can't open output file "/rmt/exp/enUK/aabella/ 
>> ASR_expt/s2st_oct08": Is a directory
>>
>> The directory sits on a very expensive NetApp NAS device and  
>> nothing is appearing in its logs to indicate a problem with it.  
>> However, there are a fair number of files in the directory:
>>
>> stg-dell31:~ # ls -l /rmt/exp/enUK/aabella/ASR_expt/s2st_oct08 | wc -l
>>
>> 9707 files
>>
>> The size of the directory is only 301K though, so I don't think this  
>> is near any limit.  New jobs successfully write to that directory  
>> after this error occurs - well, until the next time the same error  
>> occurs.
>>
>> Any advice would be great.
>>
>> Regards
>>
>> Neil
>>
>>
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



