[GE users] qmaster dying again....

Iwona Sakrejda isakrejda at lbl.gov
Thu Aug 9 18:16:07 BST 2007



Hi,

I have maintenance scheduled for next Tuesday to look into the
scheduler and master daemon crashes I keep experiencing. I'll be able to
follow Andreas's suggestions then, but in the meantime I hoped to
learn something from the core dumps. Do the daemons dump core when they crash?
Where should I look for a core dump from the master and the scheduler?
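
(For reference, all I have done so far to enable core dumps is roughly the
following, run in the root shell before starting the daemons - just a rough
sketch, assuming a bash shell and that the limit is inherited by the daemons:)

   # ulimit -c unlimited
   # /etc/rc.d/init.d/sgemaster start

My guess is that a core would land in the daemon's working directory
(the qmaster spool directory?), but I am not sure.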

Thanks again,

Iwona



Andreas.Haas at Sun.COM wrote:
> Hi Iwona,
>
> watching memory consumption patterns of daemons can be like reading 
> tea leaves. Since
>
>    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2187
>
> was fixed for 6.0u11, I have not heard of anything that sounds like a 
> memory leak, and Andrea's memory consumption records show that qmaster 
> was already free of memory leaks before 6.0u11.
>
> Below you say
>
>   "I cannot enable reporting either. When I try those daemons
>    (the master and the scheduler) crash right away too."
>
> Are you referring here to reporting(5), or is this the outcome of 
> running the daemons undaemonized, as I suggested?
>
> Regards,
> Andreas
>
>
> On Tue, 31 Jul 2007, Iwona Sakrejda wrote:
>
>> Hi,
>>
>> Nobody picked up on this thread, and today both the master and the 
>> scheduling daemon are at 0.5GB each. Is that normal? They have not 
>> crashed since 07/27, but even when the load goes down they never 
>> shrink; they just grow more slowly. That looks to me like a memory 
>> leak, but I am not sure how to approach debugging this problem.
>>
>> I can schedule a maintenance period and try debugging, but I would like
>> to have a better plan of what to debug and how.
>>
>>
>> Thank You,
>>
>> Iwona
>>
>> Iwona Sakrejda wrote:
>>> Since my qmaster and the scheduler daemons toppled over lately for
>>> "no good reason" I started watching their size. I started them ~27h
>>> ago and they were at ~50MB each. Now they have both tripled in size.
>>>
>>> When I started there were about 4k jobs in the system. Now there are
>>> about 9k. But during the last 27h the number of jobs would sometimes
>>> decrease, yet the daemons keep growing slowly but steadily. I have only
>>> serial jobs, about 450 running at any time on ~230 hosts; the rest are pending.
>>>
>>> I run 6.0u11 on RHEL3.
>>>
>>> Is that growth normal or should it be a reason for concern?
>>> Does anybody run a comparable configuration and load?
>>> I cannot enable reporting either: when I try, those daemons
>>> (the master and the scheduler) also crash right away.
>>> I enabled core dumping so I hope to have more info next time
>>> the system crashes.
>>>
>>> Thank You,
>>>
>>> Iwona
>>>
>>>
>>> Andreas.Haas at Sun.COM wrote:
>>>> Hi Iwona,
>>>>
>>>> On Wed, 18 Jul 2007, Iwona Sakrejda wrote:
>>>>
>>>>>
>>>>>
>>>>> Andreas.Haas at Sun.COM wrote:
>>>>>> Hi Iwona,
>>>>>>
>>>>>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning 
>>>>>>> my qmaster started dying.
>>>>>>
>>>>>> Did you do this as described here?
>>>>>>
>>>>>>    http://gridengine.sunsource.net/install60patch.txt
>>>>> Yes, all went through ok, no problems encountered during the upgrade.
>>>>> I was very happy about that.
>>>>
>>>> Ok.
>>>>
>>>>>>
>>>>>>
>>>>>>> When I look at the logs I see messages:
>>>>>>>
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is set to 8192
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is set to 8192
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file descriptors for communication
>>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 dynamic event clients
>>>>>>
>>>>>> That is fine. It says qmaster has enough file descriptors available.
>>>>> My cluster consists of ~250 nodes with 2 CPUs each, and we run 1 job 
>>>>> per CPU. We routinely have a few thousand jobs pending, and at peak 
>>>>> it goes up to ~15k. I am not sure what the file descriptors and 
>>>>> dynamic event clients are used for...
>>>>
>>>> Dynamic event clients are only needed for DRMAA clients and when
>>>>
>>>>    qsub -sync y
>>>>
>>>> is used. Usually the default of 99 is an ample amount. The same is 
>>>> true for the 8192 file descriptors: if you estimate one file 
>>>> descriptor for each node, you still have about 8192-250 spare fds for 
>>>> client commands connecting to qmaster. So this can safely be excluded 
>>>> as the root of your qmaster problem.
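>>>>
>>>> If you want to double-check this on your side, a rough way (assuming 
>>>> the Linux /proc filesystem and a single sge_qmaster process) would be:
>>>>
>>>>    # ls /proc/`pidof sge_qmaster`/fd | wc -l
>>>>
>>>> which counts the descriptors qmaster currently has open.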
>>>>
>>>>>>
>>>>>>> Other than that nothing special.
>>>>>>>
>>>>>>> Also when I restart the qmaster I get messages:
>>>>>>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>>>>>  starting sge_qmaster
>>>>>>>  starting sge_schedd
>>>>>>> daemonize error: timeout while waiting for daemonize state
>>>>>>
>>>>>> That means the scheduler is having some problem during start-up. 
>>>>>> From the message one cannot say what is causing it, but it could be 
>>>>>> that qmaster is in turn having problems.
>>>>> I am restarting them after a crash, when the cluster is fully 
>>>>> loaded. Is it possible that it just needs more time to re-read all
>>>>> the info about running and pending jobs?
>>>>
>>>> Actually, I would rule this out.
>>>>
>>>>> Where would the scheduler print any messages about problems it is 
>>>>> having?
>>>>
>>>> To investigate the problem I suggest you launch qmaster and the 
>>>> scheduler separately as binaries rather than via the sgemaster 
>>>> script. All you need is two root shells with the Grid Engine 
>>>> environment (settings.{sh|csh}) sourced.
>>>>
>>>> Then you do this:
>>>>
>>>>    # setenv SGE_ND
>>>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>>>>
>>>> If you see that everything went well with the qmaster start-up (e.g. 
>>>> test whether qhost gives you reasonable output), you continue with 
>>>> launching the scheduler from the other shell:
>>>>
>>>>    # setenv SGE_ND
>>>>    # $SGE_ROOT/bin/lx24-x86/sge_schedd
>>>>
>>>> but my expectation is that qmaster itself will already report some 
>>>> problem and exit. Normally qmaster will not exit with SGE_ND in the 
>>>> environment, as it prevents daemonizing.
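>>>>
>>>> (If your root shells are sh/bash rather than csh, the equivalent of 
>>>> the setenv lines above would be something like
>>>>
>>>>    # SGE_ND=1; export SGE_ND
>>>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>>>>
>>>> and likewise for sge_schedd.)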
>>>>
>>>>>>>  starting sge_shadowd
>>>>>>> error: getting configuration: failed receiving gdi request
>>>>>>
>>>>>> Another indication of a crashed or sick qmaster.
>>>>>>
>>>>>>>  starting up GE 6.0u11 (lx24-x86)
>>>>>>>
>>>>>>> How bad is any of that? Could the crashes be related to it?
>>>>>>
>>>>>> Very likely.
>>>>>>
>>>>>>> I am running on RHEL3.
>>>>>>
>>>>>> Have you tried some other OS?
>>>>> We will be upgrading shortly, but at this time I have no choice: I 
>>>>> have to keep the cluster running with the OS I have.
>>>>>
>>>>> Yesterday I gathered some more empirical evidence about the 
>>>>> crashes - it might be just a coincidence. The story is long and 
>>>>> related to a filesystem we are using (GPFS), but here is the part 
>>>>> related to SGE.
>>>>
>>>> Actually I'm not aware of any problem with GPFS, but it could be 
>>>> related. Is the qmaster spooling located on the GPFS volume? Are you 
>>>> using classic or BDB spooling?
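>>>>
>>>> If you are not sure, the spooling setup is recorded in the bootstrap 
>>>> file; assuming the default cell name, something like
>>>>
>>>>    # grep spooling $SGE_ROOT/default/common/bootstrap
>>>>    # grep qmaster_spool_dir $SGE_ROOT/default/common/bootstrap
>>>>
>>>> should show the spooling method and the qmaster spool location.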
>>>>
>>>>
>>>>> Sometimes the filesystem daemons on a client host get killed, and 
>>>>> that leaves the SGE processes on that client defunct - still there, 
>>>>> but the master cannot communicate with them. qdel will not dispose 
>>>>> of the user's job, and the load is not reported.
>>>>> The easiest fix is to just reboot the node - it does not happen very 
>>>>> often, a few nodes per day at most.
>>>>>
>>>>> But even if I reboot the node, the client will not start properly 
>>>>> unless I clean the local spool directory. I have not figured out 
>>>>> which files are interfering, but if I delete the whole local 
>>>>> spool, the directory gets recreated and everything is fine, so 
>>>>> that is what I have been doing: reboot, delete the local spool 
>>>>> subdirectory, restart the SGE client.
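>>>>>
>>>>> Roughly, on the rebooted node it comes down to something like this 
>>>>> (the spool path is just how my setup looks, and the init script 
>>>>> name may differ per installation):
>>>>>
>>>>>    # rm -rf /var/spool/sge/<nodename>
>>>>>    # /etc/rc.d/init.d/sgeexecd start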
>>>>
>>>> Usually there are no problems with execution nodes if local 
>>>> spooling is used. Ugh!
>>>>
>>>>
>>>>> Yesterday I decided to streamline my procedure and delete that local
>>>>> spool directory before rebooting the node. The moment I delete that 
>>>>> local spool, the master, which runs on a different host, crashes 
>>>>> right away.
>>>>>
>>>>> I managed to crash it a few times; then I went back to my old 
>>>>> procedure - first reboot, then remove the local scratch - and all 
>>>>> has been running well.
>>>>>
>>>>> (The start-up messages about problems are still there, but once 
>>>>> started, SGE runs well and I do not see any other problems.)
>>>>
>>>> Bah, ugh, yuck!!! Well, it sounds as if it would be a good idea to 
>>>> move away from GPFS ... at least for SGE spooling. Can't you switch 
>>>> to a more conventional FS for that purpose?
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>
>>
>>
>>
>
> http://gridengine.info/
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



