[GE users] Scheduler stops transferring queued jobs after GDI error

pollinger harald.pollinger at sun.com
Fri Jan 15 22:21:26 GMT 2010


futurity wrote:
> Hi Reuti and everyone,
> 
> Reuti, you are right in that the real user for qmaster and schedd is root.
> Therefore I doubt the sgeadmin home directory not pointing to $SGE_ROOT will
> be the cause of our scheduler problem.  
> 
> # ps -eo user,ruser,command | grep sge
> sgeadmin root     /rmt/sge61/bin/lx24-x86/sge_qmaster
> sgeadmin root     /rmt/sge61/bin/lx24-x86/sge_schedd
> root     root     /bin/sh /etc/init.d/sgedbwriter start
> sgeadmin sgeadmin sh /rmt/sge61/dbwriter/util/dbwriter.sh
> sgeadmin sgeadmin sh /rmt/sge61/dbwriter/util/dbwriter.sh
> sgeadmin sgeadmin /usr/lib/jvm/java-1.5.0-sun-1.5.0_update16/jre/bin/java -server -Djava.library.path=/rmt/sge61/lib/lx24-x86 -classpath /rmt/sge61/dbwriter/lib/arco_common.jar:/rmt/sge61/dbwriter/lib/dbwriter.jar:/rmt/sge61/dbwriter/lib/jaxb-api.jar:/rmt/sge61/dbwriter/lib/jaxb-impl.jar:/rmt/sge61/dbwriter/lib/jaxb-libs.jar:/rmt/sge61/dbwriter/lib/jaxb-xjc.jar:/rmt/sge61/dbwriter/lib/jax-qname.jar:/rmt/sge61/dbwriter/lib/mysql-connector-java-5.1.7-bin.jar:/rmt/sge61/dbwriter/lib/namespace.jar:/rmt/sge61/dbwriter/lib/postgresql-7.4.2.jar:/rmt/sge61/dbwriter/lib/relaxngDatatype.jar:/rmt/sge61/dbwriter/lib/xsdlib.jar com/sun/grid/reporting/dbwriter/ReportingDBWriter -pid /rmt/sge61/default/spool/dbwriter/dbwriter.pid -logfile /rmt/sge61/default/spool/dbwriter/dbwriter.log
> 
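
As a side note on reading the listing above: with "-o user,ruser" the first
column is the effective user and the second the real user, so sge_qmaster and
sge_schedd run with effective user sgeadmin but were started by root.  A
minimal sketch of a filter that picks out such processes follows.  The
function name euid_mismatch is made up for illustration, and it assumes
nothing beyond the two leading user columns shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): read "user ruser command..."
# lines on stdin and print only those where the effective user (column 1)
# differs from the real user (column 2).
euid_mismatch() {
    # Skip the ps header line (USER RUSER COMMAND), then compare the columns.
    awk '$1 == "USER" && $2 == "RUSER" { next } $1 != $2 { print }'
}

# Example use with the same ps options as in the listing:
#   ps -eo user,ruser,command | euid_mismatch
```

Applied to the listing above, this would keep the sge_qmaster and sge_schedd
lines and drop the dbwriter ones.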
> I've been looking at the $SGE_ROOT/default/spool/qmaster/messages file.  The
> problem reoccurred earlier today and nothing at all was logged for that
> period.  Is it possible to increase the debug level for the messages file?

See "loglevel" in sge_conf(5).
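
The parameter lives in the global cluster configuration and accepts log_err,
log_warning (the default) or log_info.  Interactively it can be changed with
"qconf -mconf".  For a scripted change, a rough sketch is below.  The helper
name set_loglevel and the /tmp/sgeconf path are made up for illustration, and
the dump/edit/reload workflow in the trailing comments is an assumption that
should be checked against qconf(1) for your version.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the "loglevel" line in a file that holds a
# dump of the global configuration.  Only the three values documented in
# sge_conf(5) are accepted.
set_loglevel() {
    conf_file=$1
    level=$2
    case "$level" in
        log_err|log_warning|log_info) ;;
        *) echo "unknown loglevel: $level" >&2; return 1 ;;
    esac
    # Replace the whole loglevel line, leaving every other line untouched.
    sed "s/^loglevel[[:space:]].*/loglevel                     $level/" \
        "$conf_file" > "$conf_file.new" &&
        mv "$conf_file.new" "$conf_file"
}

# Assumed workflow on the qmaster host (not executed here):
#   mkdir /tmp/sgeconf
#   qconf -sconf global > /tmp/sgeconf/global
#   set_loglevel /tmp/sgeconf/global log_info
#   qconf -Mconf /tmp/sgeconf/global
```

Expect the messages file to grow faster with log_info, so set it back to
log_warning once the problem has been captured.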

Harald


> Many thanks again for your help.  The more we look into this problem, the
> harder it seems to get an answer.
> 
> Neil
> 
> 
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: 12 January 2010 15:49
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Scheduler stops transferring queued jobs after GDI
> error
> 
> Hi,
> 
> On 12.01.2010 at 13:03, futurity wrote:
> 
>> Hi Reuti and fellow grid users,
>>
>> I've just noticed that when I log in as the "sgeadmin" user, its home
>> directory is set to our old legacy grid path "/rmt/sge" and not the newer
>> grid's path "/rmt/sge61".  Also, $SGE_ROOT isn't set in its environment.
>>
>> Although processes are run as user "sgeadmin", the path to the binaries
>> they use seems to be correct, e.g. "sh /rmt/sge61/dbwriter/util/dbwriter.sh"
> 
> but they were started by root? The output could be misleading, so
> it's best to check both the effective and the real user:
> 
> $ ps -e f -o user,ruser,command
> 
> 
>> I take it I should fix this home directory issue ASAP, although I guess
>> it's always been like this and never caused us any problems over the last
>> year or so until now.
>>
>> I've also looked for large files in the $SGE_ROOT directory structure in
>> case any files are full (although we have log rotation in place).  Below is
> 
> You mean SGE's log rotation scripts? Do you start them once a day via cron?
> 
> -- Reuti
> 
> 
>> a list of files larger than 10MB found with the command
>> "find ./default/ -type f -size +10240k":
>>
>> 94MB	./default/common/accounting
>> 24MB	./default/common/schedule
>> 93MB	./default/spool/qmaster/messages
>> 19MB	./default/spool/dbwriter/dbwriter.log
>>
>> I take it that these file sizes are fine as they haven't exceeded the OS
>> maximum file size?
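
To watch how these files grow between rotations, the find command above can
be wrapped so that it also reports sizes, largest first.  This is only a
sketch.  The function name large_files is made up, and it relies on the same
"k" size suffix for find that the command above already uses.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory that are larger
# than a threshold given in KB, with their disk usage in KB, largest first.
large_files() {
    dir=$1
    min_kb=$2
    find "$dir" -type f -size +"${min_kb}"k -exec du -k {} + | sort -rn
}

# Example, equivalent to the 10MB threshold used above:
#   cd "$SGE_ROOT" && large_files ./default 10240
```

Run from cron shortly before the rotation job, this would show whether
messages and accounting are actually being rotated.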
>>
>> Many thanks for your help,
>>
>> Neil
>>
>> -----Original Message-----
>> From: Neil Baker [mailto:futuritysolutions at googlemail.com] On Behalf Of
>> Neil Baker
>> Sent: 12 January 2010 10:13
>> To: 'users'
>> Subject: RE: [GE users] Scheduler stops transferring queued jobs after GDI
>> error
>>
>> Hi Reuti,
>>
>> Thank you for your reply and help.  We're a bit stuck with this problem,
>> given that the hardware, OS and grid configuration haven't been modified
>> for over 6 months other than adding a few submit hosts with the
>> "qconf -as <hostname>" command.
>>
>> Our grid uses local spool directories (execution hosts storing their logs
>> on their own hard drives).
>>
>> On the qmaster machine "df -h" displays:
>>
>> Filesystem            Size  Used Avail Use% Mounted on
>> /dev/sda2              65G  6.0G   56G  10% /
>> stg-nas1c:/vol/sge/sge61
>>                        11G  3.1G  7.4G  30% /rmt/sge61
>> udev                  2.0G   76K  2.0G   1% /dev
>>
>> "/home" isn't mounted or used by the qmaster machine.
>>
>> The sgeadmin home directory is stored on the /rmt/sge61 remote file system
>> (a NetApp NAS device) mounted over NFS.
>>
>> "/tmp", "/var", etc. all sit on the "/" local volume and there is 56G free.
>>
>> On Friday afternoon we upgraded our grid from 6.1u3 to 6.1u6 in case this
>> might resolve our problem.  Unfortunately the problem still came back,
>> although not for a few days (before, it was hanging after only a few hours).
>>
>> You mention waiting for 6.2u5.  Is there a reason why we should not use
>> 6.2u4 and should wait for 6.2u5 instead?
>>
>> Kind Regards
>>
>> Neil
>>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: 11 January 2010 15:08
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Scheduler stops transferring queued jobs after GDI
>> error
>>
>> Hi,
>>
>> On 08.01.2010 at 16:33, futurity wrote:
>>
>>> Hi,
>>>
>>> I was wondering if anyone may be able to help me?
>>>
>>> We're using Grid Engine 6.1u3 and experiencing problems where
>>> queued jobs aren't transferred from state "qw / queue waiting" to
>>> machines to be run.  This has been going on for the last few months.
>>> At the start the problem occurred only once every 2 weeks, but since
>>> the new year it has started to happen multiple times a day.  Rebooting
>>> the host machines doesn't seem to make it happen any less frequently.
>>>
>>> When the qmaster is soft stopped and started again, the queued jobs
>>> then transfer and run fine until the problem reoccurs.
>>>
>>> The sequence of events leading up to the problem is as follows:
>>> 1. Everything on the grid is working fine.
>>> 2. A user experiences an error message "error: failed receiving gdi
>>>    request".
>>> 3. Subsequent job submissions appear to work without the gdi error
>>>    being received.
>>> 4. Jobs in state "qw" or jobs submitted after step 2 stay in state
>>>    "qw" and are never transferred.
>>>
>>> We haven't modified our grid configuration for 6 months, possibly a
>>> year, and it's been running without any problems whatsoever for
>>> months before this started to happen.
>> Are the spool directories local or on the file server?
>>
>>
>>> Disk space is fine (7GB free).
>> Where: in /tmp, /var? No disk quota in place in /home?
>>
>>
>>> top shows that the machine's load is negligible both when the grid is
>>> working fine and when it is in this problem state.
>>>
>>> Has anyone else experienced this problem or has any other suggestions?
>>>
>>> Would upgrading to 6.1u6 help?
>> I would wait until the 6.2u5 binaries are available, although I can't
>> guarantee that it will solve your issue.
>>
>> -- Reuti
>>
>>
>>> Kind Regards
>>>
>>> Neil
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238113
>>
>> To unsubscribe from this discussion, e-mail:
>> [users-unsubscribe at gridengine.sunsource.net].


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         Sun Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Registered office:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court: HRB 161028
Managing Directors: Thomas Schroeder, Wolfgang Engels, Wolf Frenkel
Chairman of the Supervisory Board: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=239045

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


