[GE users] Scheduler stops transferring queued jobs after GDI error

futurity neil at futurity.co.uk
Tue Jan 12 16:21:48 GMT 2010


Hi Reuti and everyone,

Reuti, you are right in that the real user for qmaster and schedd is root.
Therefore I doubt the sgeadmin home directory not pointing to $SGE_ROOT will
be the cause of our scheduler problem.  

# ps -eo user,ruser,command | grep sge
sgeadmin root     /rmt/sge61/bin/lx24-x86/sge_qmaster
sgeadmin root     /rmt/sge61/bin/lx24-x86/sge_schedd
root     root     /bin/sh /etc/init.d/sgedbwriter start
sgeadmin sgeadmin sh /rmt/sge61/dbwriter/util/dbwriter.sh
sgeadmin sgeadmin sh /rmt/sge61/dbwriter/util/dbwriter.sh
sgeadmin sgeadmin /usr/lib/jvm/java-1.5.0-sun-1.5.0_update16/jre/bin/java -server -Djava.library.path=/rmt/sge61/lib/lx24-x86 -classpath /rmt/sge61/dbwriter/lib/arco_common.jar:/rmt/sge61/dbwriter/lib/dbwriter.jar:/rmt/sge61/dbwriter/lib/jaxb-api.jar:/rmt/sge61/dbwriter/lib/jaxb-impl.jar:/rmt/sge61/dbwriter/lib/jaxb-libs.jar:/rmt/sge61/dbwriter/lib/jaxb-xjc.jar:/rmt/sge61/dbwriter/lib/jax-qname.jar:/rmt/sge61/dbwriter/lib/mysql-connector-java-5.1.7-bin.jar:/rmt/sge61/dbwriter/lib/namespace.jar:/rmt/sge61/dbwriter/lib/postgresql-7.4.2.jar:/rmt/sge61/dbwriter/lib/relaxngDatatype.jar:/rmt/sge61/dbwriter/lib/xsdlib.jar com/sun/grid/reporting/dbwriter/ReportingDBWriter -pid /rmt/sge61/default/spool/dbwriter/dbwriter.pid -logfile /rmt/sge61/default/spool/dbwriter/dbwriter.log

I've been looking at the $SGE_ROOT/default/spool/qmaster/messages file.  The
problem reoccurred earlier today and nothing at all was logged for that
period.  Is it possible to increase the debug level for the messages file?
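
For reference, the verbosity of the messages file appears to be governed by the "loglevel" parameter of the global cluster configuration (see sge_conf(5)), which accepts log_err, log_warning and log_info. The sketch below is only an illustration: the helper name current_loglevel is made up, it simply reads "qconf -sconf" style output on stdin, and whether log_info actually records anything useful about the stalled scheduler is an open question.

```shell
# Hypothetical helper (name is an assumption): print the loglevel value
# found in "qconf -sconf" style output read from stdin.
current_loglevel() {
    awk '$1 == "loglevel" { print $2 }'
}

# Usage on the qmaster host (not run here):
#   qconf -sconf | current_loglevel   # show the current setting
#   qconf -mconf                      # edit the global configuration and
#                                     # set "loglevel log_info"
```

Remember to set the level back afterwards, since a messages file that is already 93MB will grow faster at log_info.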

Many thanks again for your help.  The more we look into this problem, the
harder it seems to get an answer.

Neil


-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 12 January 2010 15:49
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Scheduler stops transferring queued jobs after GDI error

Hi,

Am 12.01.2010 um 13:03 schrieb futurity:

> Hi Reuti and fellow grid users,
>
> I've just noticed that when I log in as the "sgeadmin" user, its home
> directory is set to our old legacy grid path "/rmt/sge" and not the newer
> grid's path "/rmt/sge61".  Also $SGE_ROOT isn't set in its environment.
>
> Although processes are run as user "sgeadmin", the path to the binaries they
> use seems to be correct, e.g. "sh /rmt/sge61/dbwriter/util/dbwriter.sh".

But they were started by root? The output could be misleading, so
it's best to check both the effective and the real user:

$ ps -e f -o user,ruser,command
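
As a rough sketch of that check, the filter below prints only the processes whose effective user (first column) differs from the real user (second column). The function name show_user_mismatch is made up for illustration; it reads "ps -eo user,ruser,command" style output on stdin and skips the header line.

```shell
# Hypothetical helper (name is an assumption): keep only the rows where the
# effective user (column 1) differs from the real user (column 2).
# Expects "ps -eo user,ruser,command" style input with a header line first.
show_user_mismatch() {
    awk 'NR > 1 && $1 != $2'
}

# Usage on a live system (not run here):
#   ps -eo user,ruser,command | show_user_mismatch
```

Daemons that were started by root and then switched to the admin user should show up in this list.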


> I take it I should fix this home directory issue ASAP, although I guess it's
> always been like this and never caused us any problems over the last year or
> so until now.
>
> I've also looked for large files in the $SGE_ROOT directory structure in
> case any files are full (although we have log rotation in place).  Below is
You mean SGE's logrotation scripts? You start them once a day via cron?

-- Reuti


> a list of files larger than 10MB found with the command
> "find ./default/ -type f -size +10240k":
>
> 94MB	./default/common/accounting
> 24MB	./default/common/schedule
> 93MB	./default/spool/qmaster/messages
> 19MB	./default/spool/dbwriter/dbwriter.log
>
> I take it that these file sizes are fine as they haven't exceeded the OS
> maximum file size?
>
> Many thanks for your help,
>
> Neil
>
> -----Original Message-----
> From: Neil Baker [mailto:futuritysolutions at googlemail.com] On Behalf Of Neil Baker
> Sent: 12 January 2010 10:13
> To: 'users'
> Subject: RE: [GE users] Scheduler stops transferring queued jobs after GDI error
>
> Hi Reuti,
>
> Thank you for your reply and help.  We're a bit stuck with this problem,
> given that the hardware, OS and grid configuration haven't been modified for
> over 6 months other than adding a few submit hosts with the "qconf -as
> <hostname>" command.
>
> Our grid uses local spool directories (execution hosts storing their logs on
> their own hard drives).
>
> On the qmaster machine "df -h" displays:
>
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda2              65G  6.0G   56G  10% /
> stg-nas1c:/vol/sge/sge61
>                        11G  3.1G  7.4G  30% /rmt/sge61
> udev                  2.0G   76K  2.0G   1% /dev
>
> "/home" isn't mounted or used by the qmaster machine.
>
> The sgeadmin home directory is stored on the /rmt/sge61 remote file system
> (a NetApp NAS device) mounted over NFS.
>
> "/tmp", "/var", etc. all sit on the "/" local volume and there is 56G free.
>
> On Friday afternoon we upgraded our grid from 6.1u3 to 6.1u6 in case this
> might resolve our problem.  Unfortunately the problem still came back,
> although not for a few days (before, it was hanging after only a few hours).
>
> You mention waiting for 6.2u5.  Is there a reason why we should not use
> 6.2u4 and should wait for 6.2u5 instead?
>
> Kind Regards
>
> Neil
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 11 January 2010 15:08
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Scheduler stops transferring queued jobs after GDI error
>
> Hi,
>
> Am 08.01.2010 um 16:33 schrieb futurity:
>
>> Hi,
>>
>> I was wondering if anyone may be able to help me?
>>
>> We're using Grid Engine 6.1u3 and experiencing problems where
>> queued jobs aren't transferred from state "qw / queue waiting" to
>> machines to be run.  This has been ongoing for the last few months.
>> At the start the problem only occurred once every 2 weeks, but since
>> the new year it's started to happen multiple times a day.  Rebooting
>> the host machines doesn't seem to make it happen any less frequently.
>>
>> When the qmaster is soft stopped and started again, the queued jobs
>> then transfer and run fine until the problem reoccurs.
>>
>> The sequence of events leading up to the problem is as follows:
>> 1. Everything on the grid is working fine.
>> 2. A user experiences an error message "error: failed receiving gdi
>>    request".
>> 3. Subsequent job submissions appear to work without the gdi error
>>    being received.
>> 4. Jobs in state "qw" or jobs submitted after step 2 stay in state
>>    "qw" and are never transferred.
>>
>> We haven't modified our grid configuration for 6 months, possibly a
>> year, and it's been running without any problems whatsoever for
>> months before this started to happen.
>
> are the spool directories local or on the file server?
>
>
>> Disk space is fine (7GB free).
>
> Where: in /tmp, /var? No disk quota in place in /home?
>
>
>>  Top shows that the machine's load is nothing when the grid is
>> working fine and when in this problem state.
>>
>> Has anyone else experienced this problem, or does anyone have any other
>> suggestions?
>>
>> Would upgrading to 6.1u6 help?
>
> I would wait for the 6.2u5 binaries to become available, although I can't
> guarantee that it will solve your issue.
>
> -- Reuti
>
>
>> Kind Regards
>>
>> Neil
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238113
>
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].
>


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238355



