[GE users] Scheduler stops transferring queued jobs after GDI error
reuti at staff.uni-marburg.de
Tue Jan 12 15:49:28 GMT 2010
On 12.01.2010 at 13:03, futurity wrote:
> Hi Reuti and fellow grid users,
> I've just noticed that when I log in as the "sgeadmin" user, its
> home
> directory is set to our old legacy grid path "/rmt/sge" and not the
> grid's path "/rmt/sge61". Also $SGE_ROOT isn't set in its
> Although processes are run as user "sgeadmin", the path to the
> binaries they
> use seems to be correct. e.g. "sh /rmt/sge61/dbwriter/util/
but they were started by root? The output could be misleading, so
it's best to check both the effective and the real user:
$ ps -e f -o user,ruser,command
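As a rough sketch of how that output could be filtered, assuming the first two columns are the effective and the real user as in the call above: the helper below prints only those processes where the two differ. The name `mismatched_users` is purely illustrative and not part of SGE.

```shell
# Hypothetical helper (not part of SGE): reads "user ruser command..."
# lines on stdin, skips the header line, and prints only the processes
# whose effective user (column 1) differs from the real user (column 2).
mismatched_users() {
    awk 'NR > 1 && $1 != $2 { print }'
}

# Possible usage:
#   ps -e -o user,ruser,command | mismatched_users
```

An empty result would suggest that no process is running with an effective user different from the one that started it.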
> I take it I should fix this home directory issue ASAP, although I
> guess it's
> always been like this and never caused us any problems over the
> last year or
> so until now.
> I've also looked for large files in the $SGE_ROOT directory
> structure in
> case any files are full (although we have log rotation in place).
> Below is
You mean SGE's log rotation scripts? You start them once a day via cron?
> a list of files larger than 10MB found with the command "find ./
> -type f -size +10240k":
> 94MB ./default/common/accounting
> 24MB ./default/common/schedule
> 93MB ./default/spool/qmaster/messages
> 19MB ./default/spool/dbwriter/dbwriter.log
> I take it that these file sizes are fine as they haven't exceeded
> the OS
> maximum file size?
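To keep an eye on these files over time, a small sketch along the lines of the find command above may help. It assumes a find and du that behave as on a typical Linux system; `large_files` is a hypothetical helper name, and the 10240k threshold is simply the one already used above.

```shell
# Hypothetical helper: list regular files larger than 10 MB below a
# directory (default: current directory), with their disk usage in KB,
# largest first.
large_files() {
    find "${1:-.}" -type f -size +10240k -exec du -k {} + | sort -rn
}

# Possible usage:
#   large_files "$SGE_ROOT"
```

Running it from cron and comparing the output from day to day would show which of the files keeps growing.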
> Many thanks for your help,
> -----Original Message-----
> From: Neil Baker [mailto:futuritysolutions at googlemail.com] On
> Behalf Of Neil
> Sent: 12 January 2010 10:13
> To: 'users'
> Subject: RE: [GE users] Scheduler stops transferring queued jobs
> after GDI
> Hi Reuti,
> Thank you for your reply and help. We're a bit stuck with this
> being that the hardware, OS, grid configuration haven't been
> modified for
> over 6 months other than adding a few submit hosts with the "qconf -as
> <hostname>" command.
> Our grid uses local spool directories (execution hosts storing
> their logs on
> their own hard drives).
> On the qmaster machine "df -h" displays:
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda2        65G  6.0G   56G  10% /
>                  11G  3.1G  7.4G  30% /rmt/sge61
> udev            2.0G   76K  2.0G   1% /dev
> "/home" isn't mounted or used by the qmaster machine.
> The sgeadmin home directory is stored on the /rmt/sge61 remote file
> (a NetApp NAS device) mounted over NFS.
> "/tmp", "/var", etc all sit on the "/" local volume and there is
> 56G free.
> On Friday afternoon we upgraded our grid from 6.1u3 to 6.1u6 in
> case this
> may resolve our problem. Unfortunately the problem still came back,
> although not for a few days (before it was hanging after only a few
> You mention waiting for 6.2u5. Is there a reason why we should not
> 6.2u4 and wait for 6.2u5?
> Kind Regards
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 11 January 2010 15:08
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Scheduler stops transferring queued jobs
> after GDI
> On 08.01.2010 at 16:33, futurity wrote:
>> I was wondering if anyone may be able to help me?
>> We're using Grid Engine 6.1u3 and experiencing problems where
>> queued jobs aren't transferred from state "qw / queue waiting" to
>> machines to be run. This has been ongoing for the last few months
>> where this problem used to only occur once every 2 weeks at the
>> start, but since the new year it's started to happen multiple times
>> a day. Rebooting the host machines doesn't seem to make it
>> happen any less frequently.
>> When the qmaster is soft stopped and started again, the queued jobs
>> then transfer and run fine until the problem reoccurs.
>> The sequence of events leading up to the problem is as follows:
>> 1. Everything on the grid is working fine.
>> 2. A user experiences an error message "error: failed receiving gdi
>> 3. Subsequent job submissions appear to work without the gdi error
>>    being received.
>> 4. Jobs in state "qw" or jobs submitted after step 2 stay in state
>>    "qw" and are never transferred.
>> We haven't modified our grid configuration for 6 months, possibly a
>> year, and it's been running without any problems whatsoever for
>> months before this started to happen.
> Are the spool directories local or on the file server?
>> Disk space is fine (7GB free).
> Where: in /tmp, /var? No disk quota in place in /home?
>> Top shows that the machine's load is nothing when the grid is
>> working fine and when in this problem state.
>> Has anyone else experienced this problem or has any other
>> Would upgrading to 6.1u6 help?
> I would wait for the 6.2u5 binaries to become available, although I
> can't guarantee that it will solve your issue.
> -- Reuti
>> Kind Regards
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].