[GE users] qmaster dying again....

Iwona Sakrejda isakrejda at lbl.gov
Wed Jul 18 15:36:37 BST 2007





Andreas.Haas at Sun.COM wrote:
> Hi Iwona,
>
> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>
>> Hi,
>>
>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning my 
>> qmaster started dying.
>
> You did this as foreseen?
>
>    http://gridengine.sunsource.net/install60patch.txt
Yes, everything went through OK; I encountered no problems during the upgrade.
I was very happy about that.
>
>
>> When I look at the logs I see messages:
>>
>> 7/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is 
>> set to 8192
>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is 
>> set to 8192
>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file 
>> descriptors for communication
>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 
>> dynamic event clients
>
> That is fine. It says qmaster got enough file descriptors available.
My cluster consists of ~250 nodes with 2 CPUs each. We run 1 job per CPU.
We routinely have a few thousand jobs pending, and at peak it goes up to
~15k.
I am not sure what file descriptors and dynamic event clients are used for.
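
For anyone sifting through the same messages: the log lines quoted above
appear to be pipe-separated into five fields (timestamp, daemon, host,
level, message). Below is a minimal Python sketch for pulling out anything
that is not informational. It assumes that layout holds for every line and
that "I" is the informational level; the names LogEntry, parse_line and
non_info are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> LogEntry:
    # Split on the first four pipes only, so a "|" inside the message survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected log line: {line!r}")
    return LogEntry(*parts)


def non_info(lines: Iterable[str]) -> List[LogEntry]:
    # Assumption: level "I" marks informational entries; keep everything else.
    entries = (parse_line(line) for line in lines if line.strip())
    return [entry for entry in entries if entry.level != "I"]
```

Feeding it the lines of the qmaster messages file would leave only the
entries whose level is something other than "I", which is probably where
to look first after a crash.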
>
>> Other than that nothing special.
>>
>> Also when I restart the qmaster I get messages:
>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>  starting sge_qmaster
>>  starting sge_schedd
>> daemonize error: timeout while waiting for daemonize state
>
> That means scheduler is having some problem during start-up. From the 
> message one can not say what is causing the problems, but it could be 
> due to qmaster in-turn having problems.
I am restarting them after the crash, when the cluster is fully loaded. Is
it possible that the scheduler just needs more time to re-read all the info
about running and pending jobs? Where would the scheduler print any
messages about problems it is having?

>
>>  starting sge_shadowd
>> error: getting configuration: failed receiving gdi request
>
> Next indication for a crashed or sick qmaster.
>
>>  starting up GE 6.0u11 (lx24-x86)
>>
>> How bad is any of that, could crashes be related to it?
>
> Very likely.
>
>> I am running on RHEL3 .
>
> Have you tried some other OS?
We will be upgrading shortly, but at this time I have no choice: I have
to keep the cluster running with the OS I have.

Yesterday I gathered some more empirical evidence about the crashes, though
it might be just a coincidence. The story is long and related to a
filesystem we are using (GPFS), but here is the part related to SGE.

Sometimes on the client host the filesystem daemons get killed, and that
leaves the SGE processes on the client defunct: still there, but the master
cannot communicate with them. qdel will not dispose of the user's job, and
the load is not reported. The easiest fix is to just reboot the node. It
does not happen very often, just a few nodes per day at most.

But even if I reboot the node, the client will not start properly unless
I clean the local spool directory. I did not figure out which files are
interfering, but if I delete the whole local spool, the directory gets
recreated and everything is OK, so that's what I have been doing: reboot,
delete the local spool subdirectory, restart the SGE client.

Yesterday I decided to streamline my procedure and delete that local
spool directory before rebooting the node. The moment I delete that local
spool, the master, which runs on a different host, crashes right away.

I managed to crash it a few times, then I went back to my old procedure
(first reboot, then remove the local spool) and all has been running well.
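
The ordering that worked can be expressed as a guard: refuse to touch the
local spool while an execution daemon process still exists on the node.
Below is a minimal Python sketch of that idea. It assumes the daemon's
process name is sge_execd and that pgrep is available on the node;
remove_local_spool and execd_is_running are hypothetical helper names, and
the spool path is whatever the node is configured to use.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def execd_is_running() -> bool:
    # Assumption: the execution daemon shows up as "sge_execd".
    # pgrep exits with 0 when at least one process matches.
    result = subprocess.run(["pgrep", "-x", "sge_execd"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def remove_local_spool(spool_dir: Path,
                       execd_running: Callable[[], bool] = execd_is_running) -> bool:
    """Delete spool_dir only when no execution daemon is present.

    Returns True if the directory was removed, False if it did not exist.
    """
    if execd_running():
        raise RuntimeError("sge_execd is still present; reboot the node "
                           "before removing the local spool")
    if not spool_dir.is_dir():
        return False
    shutil.rmtree(spool_dir)
    return True
```

This only encodes the order of operations described above (reboot first,
then delete, then restart the client); it does not explain why the master
crashes when the spool disappears underneath a defunct client.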

(The startup messages about problems are still there, but once started
SGE runs well and I do not see any other problems.)

Thank you,

Iwona

>
> Regards,
> Andreas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
