[GE users] Scheduler stops transferring queued jobs after GDI error

futurity neil at futurity.co.uk
Tue Jan 12 10:13:01 GMT 2010


Hi Reuti,

Thank you for your reply and help.  We're a bit stuck with this problem,
given that the hardware, OS and grid configuration haven't been modified for
over 6 months, other than adding a few submit hosts with the "qconf -as
<hostname>" command.

Our grid uses local spool directories (execution hosts storing their logs on
their own hard drives).

On the qmaster machine "df -h" displays:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              65G  6.0G   56G  10% /
stg-nas1c:/vol/sge/sge61
                       11G  3.1G  7.4G  30% /rmt/sge61
udev                  2.0G   76K  2.0G   1% /dev

"/home" isn't mounted or used by the qmaster machine.

The sgeadmin home directory is stored on the /rmt/sge61 remote file system
(a NetApp NAS device) mounted over NFS.

"/tmp", "/var", etc. all sit on the "/" local volume, and there is 56G free.
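
To run the same free-space check repeatedly instead of reading "df -h" by
hand, a minimal Python sketch could look like the one below.  The function
name low_space and the 1 GiB default threshold are assumptions made for
illustration; it only reports free space and does not look at disk quotas.

```python
import shutil

GIB = 1024 ** 3


def low_space(paths, min_free_bytes=GIB):
    """Return {path: free_bytes} for every path whose filesystem has less
    than min_free_bytes free. Paths that cannot be queried map to None."""
    result = {}
    for path in paths:
        try:
            # disk_usage reports total, used and free bytes for the
            # filesystem that contains the given path.
            free = shutil.disk_usage(path).free
        except OSError:
            # Missing path or unreachable mount (e.g. an NFS volume).
            result[path] = None
            continue
        if free < min_free_bytes:
            result[path] = free
    return result
```

Called as low_space(["/", "/tmp", "/var", "/rmt/sge61"]), it returns an
empty dict when every location has at least the threshold free.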

On Friday afternoon we upgraded our grid from 6.1u3 to 6.1u6 in case this
may resolve our problem.  Unfortunately the problem still came back,
although not for a few days (before it was hanging after only a few hours).
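
To spot the hang before users report it, the pending jobs could be checked
from the plain-text qstat listing.  The sketch below is only an illustration:
it assumes the default column layout (job-ID, prior, name, user, state,
submit date, submit time), and the function name stuck_pending_jobs and the
30 minute default are hypothetical choices, not Grid Engine settings.

```python
from datetime import datetime, timedelta


def stuck_pending_jobs(qstat_text, now, max_wait=timedelta(minutes=30)):
    """Return the IDs of jobs in state "qw" that were submitted more than
    max_wait before now, given the text of a qstat listing."""
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the separator line and anything too short to be
        # a job row; job rows start with a numeric job ID.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        if fields[4] != "qw":
            continue
        # Assumed timestamp format: MM/DD/YYYY HH:MM:SS in fields 5 and 6.
        submitted = datetime.strptime(
            fields[5] + " " + fields[6], "%m/%d/%Y %H:%M:%S"
        )
        if now - submitted > max_wait:
            stuck.append(int(fields[0]))
    return stuck
```

A long list of old "qw" jobs alone does not prove the scheduler has stopped
(the grid may simply be full), so the result is a hint to investigate rather
than a diagnosis.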

You mention waiting for 6.2u5.  Is there a reason why we should wait for
6.2u5 rather than use 6.2u4?

Kind Regards

Neil

-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 11 January 2010 15:08
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Scheduler stops transferring queued jobs after GDI
error

Hi,

On 08.01.2010 at 16:33, futurity wrote:

> Hi,
>
> I was wondering if anyone may be able to help me?
>
> We're using Grid Engine 6.1u3 and experiencing problems where
> queued jobs aren't transferred from state "qw / queue waiting" to
> machines to be run.  This has been ongoing for the last few months.
> At the start the problem only occurred about once every 2 weeks,
> but since the new year it has started to happen multiple times
> a day.  Rebooting the host machines doesn't seem to make it
> happen any less frequently.
>
> When the qmaster is soft stopped and started again, the queued jobs  
> then transfer and run fine until the problem reoccurs.
>
> The sequence of events leading up to the problem is as follows:
> 1. Everything on the grid is working fine.
> 2. A user experiences an error message "error: failed receiving gdi
>    request".
> 3. Subsequent job submissions appear to work without the gdi error
>    being received.
> 4. Jobs in state "qw" or jobs submitted after step 2 stay in state
>    "qw" and are never transferred.
>
> We haven't modified our grid configuration for 6 months, possibly a
> year, and it's been running without any problems whatsoever for
> months before this started to happen.

Are the spool directories local or on the file server?


> Disk space is fine (7GB free).

Where: in /tmp, /var? No disk quota in place in /home?


> top shows that the machine's load is negligible both when the grid is
> working fine and when it is in this problem state.
>
> Has anyone else experienced this problem or has any other suggestions?
>
> Would upgrading to 6.1u6 help?

I would wait for the 6.2u5 binaries to become available, although I can't
guarantee that it will solve your issue.

-- Reuti


> Kind Regards
>
> Neil

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238113

To unsubscribe from this discussion, e-mail:
[users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=238288

