[GE users] qmaster dying again....

Iwona Sakrejda isakrejda at lbl.gov
Fri Jul 27 20:01:54 BST 2007



Since my qmaster and scheduler daemons have lately toppled over for
"no good reason", I started watching their size. I started them ~27h
ago at ~50MB each; they have both since tripled in size.

When I started there were about 4k jobs in the system; now there are
about 9k. But during the last 27h the number of jobs would sometimes
decrease, yet the daemons keep growing slowly but steadily. I run only
serial jobs, about 450 running at any time on ~230 hosts, the rest
pending.
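
(For the record, I am watching their size with something along these
lines - just a crude sketch using the stock procps ps, logging RSS/VSZ
every ten minutes:

   # while true; do date; ps -C sge_qmaster,sge_schedd -o pid=,rss=,vsz=,comm=; sleep 600; done
)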

I run 6.0u11 on RHEL3.

Is that growth normal or should it be a reason for concern?
Does anybody run a comparable configuration and load?
I cannot enable reporting either; when I try, those daemons
(the master and the scheduler) crash right away too.
I enabled core dumping so I hope to have more info next time
the system crashes.
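
(Concretely, something along these lines before restarting - with the
caveat that the init script must be started from the same shell so the
limit is inherited:

   # ulimit -c unlimited
   # /etc/rc.d/init.d/sgemaster start

so the next crash should leave a core file in the daemon's working
directory that gdb can open.)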

Thank You,

Iwona


Andreas.Haas at Sun.COM wrote:
> Hi Iwona,
>
> On Wed, 18 Jul 2007, Iwona Sakrejda wrote:
>
>>
>>
>> Andreas.Haas at Sun.COM wrote:
>>> Hi Iwona,
>>>
>>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>>>
>>>> Hi,
>>>>
>>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning my 
>>>> qmaster started dying.
>>>
>>> You did this as described here?
>>>
>>>    http://gridengine.sunsource.net/install60patch.txt
>> Yes, all went through ok, no problems encountered during the upgrade.
>> I was very happy about that.
>
> Ok.
>
>>>
>>>
>>>> When I look at the logs I see messages:
>>>>
>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is set to 8192
>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is set to 8192
>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file descriptors for communication
>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 dynamic event clients
>>>
>>> That is fine. It says qmaster has enough file descriptors available.
>> My cluster consists of ~250 nodes, 2 CPUs each. We run 1 job per CPU.
>> We routinely have a few thousand jobs pending, and at peak it goes up 
>> to ~15k.
>> I am not sure what file descriptors and dynamic event clients are used for....
>
> Dynamic event clients are only needed for DRMAA clients and when
>
>    qsub -sync y
>
> is used. Usually the default of 99 is an ample amount. The same is true 
> of the 8192 file descriptors: if you estimate 1 file descriptor for each 
> node, you still have 8192 - 250 = 7942 spare fd's for client commands 
> connecting to qmaster. So this one can safely be excluded as the root 
> of your qmaster problem.
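>
> (If you ever want to double-check the limits the daemons actually run 
> with, something like this in the shell that starts qmaster should do:
>
>    # ulimit -Hn    # hard descriptor limit
>    # ulimit -Sn    # soft descriptor limit
>
> and both should match the 8192 reported in qmaster's messages file.)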
>
>>>
>>>> Other than that nothing special.
>>>>
>>>> Also when I restart the qmaster I get messages:
>>>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>>  starting sge_qmaster
>>>>  starting sge_schedd
>>>> daemonize error: timeout while waiting for daemonize state
>>>
>>> That means the scheduler is having some problem during start-up. From 
>>> the message one cannot say what is causing it, but it could be that 
>>> qmaster is in turn having problems.
>> I am restarting them after the crash, when the cluster is fully loaded. 
>> Is it possible that it just needs more time to re-read all
>> the info about running and pending jobs?
>
> Actually this I would rule out.
>
>> Where would the scheduler print any messages about problems it is 
>> having?
>
> For investigating the problem I suggest you launch qmaster and the 
> scheduler separately as binaries rather than using the sgemaster script. 
> All you need are two root shells with the Grid Engine environment 
> (settings.{sh|csh}) sourced.
>
> Then you do this:
>
>    # setenv SGE_ND
>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>
> If you see that everything went well with the qmaster start-up (e.g. test 
> whether qhost gets you reasonable output), you continue by launching 
> the scheduler from the other shell:
>
>    # setenv SGE_ND
>    # $SGE_ROOT/bin/lx24-x86/sge_schedd
>
> but my expectation is that qmaster will already report some problem and exit.
> Normally qmaster does not exit with SGE_ND in the environment, as it 
> prevents daemonizing.
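>
> (The setenv commands above assume a csh-style shell; under sh/bash the 
> equivalent would be something like
>
>    # SGE_ND=true; export SGE_ND
>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>
> and as far as I know only the presence of the variable matters, not 
> its value.)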
>
>>>>  starting sge_shadowd
>>>> error: getting configuration: failed receiving gdi request
>>>
>>> Another indication of a crashed or sick qmaster.
>>>
>>>>  starting up GE 6.0u11 (lx24-x86)
>>>>
>>>> How bad is any of that, could crashes be related to it?
>>>
>>> Very likely.
>>>
>>>> I am running on RHEL3 .
>>>
>>> Have you tried some other OS?
>> We will be upgrading shortly, but at this time I have no choice; I 
>> have to keep the cluster running with the OS I have.
>>
>> Yesterday I gathered some more empirical evidence about the crashes - 
>> might be just
>> a coincidence. The story is long and related to a filesystem we are 
>> using (GPFS) but here is the part related to SGE.
>
> Actually I'm not aware of any problem with GPFS, but it could be related.
> Is qmaster spooling located on the GPFS volume? Are you using classic 
> or BDB spooling?
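>
> (You can check both in the bootstrap file of your cell, e.g. assuming 
> the default cell name:
>
>    # egrep 'spooling_method|qmaster_spool_dir' $SGE_ROOT/default/common/bootstrap
>
> spooling_method will read either "classic" or "berkeleydb".)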
>
>
>> Sometimes the filesystem daemons get killed on a client host, and 
>> that leaves the SGE processes on that client defunct - still there, 
>> but the master cannot communicate with them. qdel will not dispose of 
>> the user's job and the load is not reported.
>> The easiest fix is to just reboot the node - it does not happen very 
>> often, a few nodes per day at most.
>>
>> But even if I reboot the node, the client will not start properly 
>> unless I clean the local spool directory. I have not figured out which 
>> files are interfering, but if I delete the whole local spool, the 
>> directory gets recreated and everything is OK, so that's what I have 
>> been doing: reboot, delete the local spool subdirectory, restart the 
>> SGE client.
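>>
>> (Roughly, after the reboot - the spool path below is only an example, 
>> the real one is whatever execd_spool_dir points to in our 
>> configuration, and sgeexecd is what our execd init script is called:
>>
>>    # rm -rf /var/spool/sge/`hostname`   # hypothetical local spool path
>>    # /etc/rc.d/init.d/sgeexecd start
>> )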
>
> Usually there are no problems with execution nodes if local spooling 
> is used. Ugh!
>
>
>> Yesterday I decided to streamline my procedure and delete that local
>> spool directory before rebooting the node. The moment I delete that 
>> local spool, the master that runs on a different host crashes right away.
>>
>> I managed to crash it a few times; then I went back to my old procedure
>> - first reboot, then remove the local scratch - and all has been 
>> running well.
>>
>> (The startup messages about problems are still there, but once 
>> started, SGE runs well and I do not see any other problems.)
>
> Bah, Ugh, Igitt!!! Well, it sounds as if it were a good idea to move away
> from GPFS ... at least for SGE spooling. Can't you switch to a more 
> conventional FS for that purpose?
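>
> (To see which filesystem the qmaster spool actually lives on, something 
> like this should do, again assuming the "default" cell name:
>
>    # df -h `grep qmaster_spool_dir $SGE_ROOT/default/common/bootstrap | awk '{print $2}'`
> )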
>
> Regards,
> Andreas
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



