[GE users] Scheduler stops transferring queued jobs after GDI error

futurity neil at futurity.co.uk
Tue Jan 12 12:03:52 GMT 2010

Hi Reuti and fellow grid users,

I've just noticed that when I log in as the "sgeadmin" user, its home
directory is set to our old legacy grid path "/rmt/sge" and not the newer
grid's path "/rmt/sge61".  Also, $SGE_ROOT isn't set in its environment.

Although processes are run as user "sgeadmin", the path to the binaries they
use seems to be correct.  e.g. "sh /rmt/sge61/dbwriter/util/dbwriter.sh"

I take it I should fix this home directory issue ASAP, although I guess it's
always been like this and has never caused us any problems over the last
year or so until now.
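
One way to confirm what the account database and a login shell actually report for this user is sketched below. It assumes a Linux host where getent is available; the helper name passwd_home is hypothetical and only extracts the home-directory field from a passwd-format line.

```shell
#!/bin/sh
# Hypothetical helper: print the home-directory field (6th colon-separated
# field) of passwd-format lines read from stdin.
passwd_home() {
    cut -d: -f6
}

# Home directory recorded for the sgeadmin account (prints nothing if the
# user is unknown on this host).
getent passwd sgeadmin | passwd_home

# Show whether SGE_ROOT is set in a login shell for that account.
# This needs root, so it is left commented out here:
# su - sgeadmin -c 'echo "SGE_ROOT=${SGE_ROOT:-<unset>}"'
```

If the first command prints "/rmt/sge" rather than "/rmt/sge61", the account entry itself still points at the legacy path.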

I've also looked for large files in the $SGE_ROOT directory structure in
case any file has hit a size limit (although we have log rotation in place).
Below is a list of files larger than 10MB, found with the command
"find ./default/ -type f -size +10240k":

94MB	./default/common/accounting
24MB	./default/common/schedule
93MB	./default/spool/qmaster/messages
19MB	./default/spool/dbwriter/dbwriter.log

I take it that these file sizes are fine as they haven't exceeded the OS
maximum file size?
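
The search above can be wrapped so that sizes and paths come out together, largest first. This is only a sketch: the function name list_large_files and its defaults are hypothetical, and du -k reports disk usage in KB, which can differ slightly from the apparent file size.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under DIR that are larger than
# MIN_KB kilobytes, largest first, as "<disk usage in KB><TAB><path>".
list_large_files() {
    dir=${1:-.}
    min_kb=${2:-10240}
    # find selects the files by size; du -k prints usage and path for each.
    find "$dir" -type f -size +"${min_kb}k" -exec du -k {} + | sort -rn
}

# Example, run from $SGE_ROOT as in the message above:
# list_large_files ./default 10240
```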

Many thanks for your help,


-----Original Message-----
From: Neil Baker [mailto:futuritysolutions at googlemail.com] On Behalf Of Neil
Sent: 12 January 2010 10:13
To: 'users'
Subject: RE: [GE users] Scheduler stops transferring queued jobs after GDI

Hi Reuti,

Thank you for your reply and help.  We're a bit stuck with this problem,
given that the hardware, OS and grid configuration haven't been modified for
over 6 months, other than adding a few submit hosts with the "qconf -as
<hostname>" command.

Our grid uses local spool directories (execution hosts storing their logs on
their own hard drives).

On the qmaster machine "df -h" displays:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              65G  6.0G   56G  10% /
                       11G  3.1G  7.4G  30% /rmt/sge61
udev                  2.0G   76K  2.0G   1% /dev

"/home" isn't mounted or used by the qmaster machine.

The sgeadmin home directory is stored on the /rmt/sge61 remote file system
(a NetApp NAS device) mounted over NFS.

"/tmp", "/var", etc all sit on the "/" local volume and there is 56G free.

On Friday afternoon we upgraded our grid from 6.1u3 to 6.1u6 in the hope
that this would resolve our problem.  Unfortunately the problem still came
back, although not for a few days (before the upgrade it was hanging after
only a few hours).

You mention waiting for 6.2u5.  Is there a reason why we should wait for
6.2u5 rather than use 6.2u4?

Kind Regards


-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 11 January 2010 15:08
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Scheduler stops transferring queued jobs after GDI


On 08.01.2010 at 16:33, futurity wrote:

> Hi,
> I was wondering if anyone may be able to help me?
> We're using Grid Engine 6.1u3 and experiencing problems where
> queued jobs aren't transferred from state "qw / queue waiting" to
> machines to be run.  This has been ongoing for the last few months:
> at the start the problem only occurred about once every 2 weeks,
> but since the new year it has started to happen multiple times
> a day.  Rebooting the host machines doesn't seem to make it
> happen any less frequently.
> When the qmaster is soft stopped and started again, the queued jobs
> then transfer and run fine until the problem reoccurs.
> The sequence of events leading up to the problem is as follows:
> 1. Everything on the grid is working fine.
> 2. A user experiences an error message "error: failed receiving gdi
>    request".
> 3. Subsequent job submissions appear to work without the gdi error
>    being received.
> 4. Jobs in state "qw" or jobs submitted after step 2 stay in state
>    "qw" and are never transferred.
> We haven't modified our grid configuration for 6 months, possibly a
> year, and it's been running without any problems whatsoever for
> months before this started to happen.

Are the spool directories local or on the file server?

> Disk space is fine (7GB free).

Where: in /tmp, /var? No disk quota in place in /home?

> Top shows that the machine's load is negligible both when the grid is
> working fine and when it is in this problem state.
> Has anyone else experienced this problem or has any other suggestions?
> Would upgrading to 6.1u6 help?

I would wait for the 6.2u5 binaries to become available, although I can't
guarantee that it will solve your issue.

-- Reuti

> Kind Regards
> Neil


To unsubscribe from this discussion, e-mail:
[users-unsubscribe at gridengine.sunsource.net].


