[GE users] Scheduler stops transferring queued jobs after GDI error

reuti reuti at staff.uni-marburg.de
Tue Jan 12 15:49:28 GMT 2010


Hi,

On 12.01.2010 at 13:03, futurity wrote:

> Hi Reuti and fellow grid users,
>
> I've just noticed that when I log in as the "sgeadmin" user, its home
> directory is set to our old legacy grid path "/rmt/sge" and not the
> newer grid's path "/rmt/sge61".  Also, $SGE_ROOT isn't set in its
> environment.
>
> Although processes are run as user "sgeadmin", the path to the
> binaries they use seems to be correct, e.g.
> "sh /rmt/sge61/dbwriter/util/dbwriter.sh"

But were they started by root? The output could be misleading, so
it's best to check both the effective and the real user:

$ ps -e f -o user,ruser,command
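
As a rough sketch (assuming a POSIX-style ps; the helper name
mismatched_users is made up for illustration), the two columns can
also be compared automatically, so that only processes whose effective
and real user differ are listed:

```shell
#!/bin/sh
# Hypothetical helper: reads "ps" output (USER RUSER PID COMMAND...) on stdin
# and prints only the rows where the effective user differs from the real user.
mismatched_users() {
    # Skip the header line, then compare column 1 (user) with column 2 (ruser).
    awk 'NR > 1 && $1 != $2'
}

# Effective user, real user, PID and full command line of every process.
ps -e -o user,ruser,pid,args | mismatched_users
```

An empty result just means that no process on the machine is running
with differing effective and real users.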


> I take it I should fix this home directory issue ASAP, although I
> guess it's always been like this and never caused us any problems
> over the last year or so until now.
>
> I've also looked for large files in the $SGE_ROOT directory
> structure in case any files are full (although we have log rotation
> in place).  Below is

You mean SGE's log rotation scripts? Do you start them once a day via cron?

-- Reuti


> a list of files larger than 10MB found with the command
> "find ./default/ -type f -size +10240k":
>
> 94MB	./default/common/accounting
> 24MB	./default/common/schedule
> 93MB	./default/spool/qmaster/messages
> 19MB	./default/spool/dbwriter/dbwriter.log
>
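
A minimal sketch of the same search that also prints the sizes,
largest first. The function name large_files and its default threshold
are only illustrative, and the path is the one mentioned in this
thread:

```shell
#!/bin/sh
# Hypothetical helper: list files under directory $1 that are larger than
# $2 KiB (default 10240, i.e. 10MB), with sizes in KiB, largest first.
large_files() {
    find "$1" -type f -size +"${2:-10240}"k -exec du -k {} + | sort -rn
}

# Path taken from this thread; adjust it for your own cell directory.
cell_dir="${SGE_ROOT:-/rmt/sge61}/default"
if [ -d "$cell_dir" ]; then
    large_files "$cell_dir"
fi
```
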
> I take it that these file sizes are fine as they haven't exceeded
> the OS maximum file size?
>
> Many thanks for your help,
>
> Neil
>
> -----Original Message-----
> From: Neil Baker [mailto:futuritysolutions at googlemail.com] On Behalf Of Neil Baker
> Sent: 12 January 2010 10:13
> To: 'users'
> Subject: RE: [GE users] Scheduler stops transferring queued jobs after GDI error
>
> Hi Reuti,
>
> Thank you for your reply and help.  We're a bit stuck with this
> problem, given that the hardware, OS and grid configuration haven't
> been modified for over 6 months other than adding a few submit hosts
> with the "qconf -as <hostname>" command.
>
> Our grid uses local spool directories (execution hosts storing their
> logs on their own hard drives).
>
> On the qmaster machine "df -h" displays:
>
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda2              65G  6.0G   56G  10% /
> stg-nas1c:/vol/sge/sge61
>                        11G  3.1G  7.4G  30% /rmt/sge61
> udev                  2.0G   76K  2.0G   1% /dev
>
> "/home" isn't mounted or used by the qmaster machine.
>
> The sgeadmin home directory is stored on the /rmt/sge61 remote file
> system (a NetApp NAS device) mounted over NFS.
>
> "/tmp", "/var", etc. all sit on the "/" local volume and there is
> 56G free.
>
> On Friday afternoon we upgraded our grid from 6.1u3 to 6.1u6 in case
> this would resolve our problem.  Unfortunately the problem still came
> back, although not for a few days (before the upgrade it was hanging
> after only a few hours).
>
> You mention waiting for 6.2u5.  Is there a reason why we should not
> use 6.2u4 and should wait for 6.2u5 instead?
>
> Kind Regards
>
> Neil
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 11 January 2010 15:08
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Scheduler stops transferring queued jobs after GDI error
>
> Hi,
>
> On 08.01.2010 at 16:33, futurity wrote:
>
>> Hi,
>>
>> I was wondering if anyone may be able to help me?
>>
>> We're using Grid Engine 6.1u3 and experiencing problems where
>> queued jobs aren't transferred from state "qw / queue waiting" to
>> machines to be run.  This has been ongoing for the last few months:
>> at the start the problem occurred only once every 2 weeks, but
>> since the new year it's started to happen multiple times a day.
>> Rebooting the host machines doesn't seem to make it happen any
>> less frequently.
>>
>> When the qmaster is soft stopped and started again, the queued jobs
>> then transfer and run fine until the problem reoccurs.
>>
>> The sequence of events leading up to the problem is as follows:
>> 1. Everything on the grid is working fine.
>> 2. A user experiences an error message "error: failed receiving gdi
>>    request".
>> 3. Subsequent job submissions appear to work without the gdi error
>>    being received.
>> 4. Jobs in state "qw" or jobs submitted after step 2 stay in state
>>    "qw" and are never transferred.
>>
>> We haven't modified our grid configuration for 6 months, possibly a
>> year, and it's been running without any problems whatsoever for
>> months before this started to happen.
>
> Are the spool directories local or on the file server?
>
>
>> Disk space is fine (7GB free).
>
> Where: in /tmp, /var? No disk quota in place in /home?
>
>
>> Top shows that the machine's load is negligible, both when the grid
>> is working fine and when it is in this problem state.
>>
>> Has anyone else experienced this problem, or does anyone have any
>> other suggestions?
>>
>> Would upgrading to 6.1u6 help?
>
> I would wait for the 6.2u5 binaries to become available, although I
> can't guarantee that it will solve your issue.
>
> -- Reuti
>
>
>> Kind Regards
>>
>> Neil
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238113
>
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238351

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


