[GE users] qmaster dying again....

Andreas.Haas at Sun.COM
Wed Aug 1 13:50:53 BST 2007


Hi Iwona,

watching memory consumption patterns of daemons can be like reading
tea leaves. Since

    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2187

was fixed for 6.0u11 I have not heard of anything that sounds like
a memory leak, and Andrea's memory consumption records indicate
qmaster was already free of memory leaks before 6.0u11.
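
If you want harder numbers than occasionally looking at top, a crude
sampling loop is usually enough to tell a leak from normal growth. A
minimal sketch (sh/bash root shell; the log path is just an example):

    # while true; do date; ps -eo pid,rss,vsz,comm | egrep 'sge_qmaster|sge_schedd'; sleep 300; done >> /tmp/sge_mem.log

RSS that climbs steadily under constant load is suspicious; growth that
merely tracks the number of jobs held by qmaster usually is not.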

Below you say

   "I cannot enable reporting either. When I try, those daemons
    (the master and the scheduler) crash right away too."

Are you referring here to reporting(5), or is this the outcome
of running the daemons undaemonized as I suggested?
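
In case you do mean reporting(5): that gets switched on in the global
cluster configuration via the reporting_params attribute, roughly like
this (the values below are only an example):

    # qconf -mconf
    ...
    reporting_params   accounting=true reporting=true flush_time=00:00:15 joblog=true sharelog=00:00:00

qconf -sconf shows what is configured there at the moment.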

Regards,
Andreas


On Tue, 31 Jul 2007, Iwona Sakrejda wrote:

> Hi,
>
> Nobody picked up on this thread and today both the master and the scheduling
> daemon are 0.5GB each. Is that normal? They have not crashed since 07/27,
> but even if the load goes down they never shrink, they just grow slower.
> That looks to me like a memory leak, but I am not sure how to approach
> debugging this problem.
>
> I can schedule a maintenance period and try debugging, but would like to
> have a better plan for what to debug and how.
>
>
> Thank You,
>
> Iwona
>
> Iwona Sakrejda wrote:
>> Since my qmaster and the scheduler daemons toppled over lately for
>> "no good reason" I started watching their size. I started them ~27h
>> ago and they were at ~50MB each. Now they both tripled in size.
>> 
>> When I started there were about 4k jobs in the system. Now there are
>> about 9k. But during the last 27h the number of jobs would sometimes
>> decrease, yet the daemons keep growing slowly but steadily. I have only
>> serial jobs, about 450 running at any time on ~230 hosts, the rest pending.
>> 
>> I run 6.0u11 on RHEL3.
>> 
>> Is that growth normal or should it be a reason for concern?
>> Does anybody run a comparable configuration and load?
>> I cannot enable reporting either. When I try, those daemons
>> (the master and the scheduler) crash right away too.
>> I enabled core dumping, so I hope to have more info the next time
>> the system crashes.
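>>
>> (Enabling core dumps here just means raising the core file limit in the
>> shell the daemons are started from, e.g.
>>
>>    # ulimit -c unlimited
>>
>> in a sh/bash root shell, or "limit coredumpsize unlimited" in csh,
>> before running the sgemaster script.)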
>> 
>> Thank You,
>> 
>> Iwona
>> 
>> 
>> Andreas.Haas at Sun.COM wrote:
>>> Hi Iwona,
>>> 
>>> On Wed, 18 Jul 2007, Iwona Sakrejda wrote:
>>> 
>>>> 
>>>> 
>>>> Andreas.Haas at Sun.COM wrote:
>>>>> Hi Iwona,
>>>>> 
>>>>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning my 
>>>>>> qmaster started dying.
>>>>> 
>>>>> Did you do the upgrade as described here?
>>>>>
>>>>>    http://gridengine.sunsource.net/install60patch.txt
>>>> Yes, all went through ok, no problems encountered during the upgrade.
>>>> I was very happy about that.
>>> 
>>> Ok.
>>> 
>>>>> 
>>>>> 
>>>>>> When I look at the logs I see messages:
>>>>>> 
>>>>>> 7/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is 
>>>>>> set to 8192
>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is 
>>>>>> set to 8192
>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file 
>>>>>> descriptors for communication
>>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 
>>>>>> dynamic event clients
>>>>> 
>>>>> That is fine. It says qmaster has enough file descriptors available.
>>>> My cluster consists of ~250 nodes, 2 CPUs each. We run 1 job per CPU.
>>>> We routinely have a few thousand jobs pending, and at peak it goes up to
>>>> ~15k.
>>>> I am not sure what file descriptors and dynamic events are used for....
>>> 
>>> Dynamic event clients are only needed for DRMAA clients and when
>>>
>>>    qsub -sync y
>>> 
>>> is used. Usually the 99 default is an ample amount. The same is true for
>>> the 8192 file descriptors. If you estimate 1 file descriptor for each node
>>> you still have 8192-250 spare fd's for client commands connecting to
>>> qmaster. So this one can safely be excluded as the root of your qmaster
>>> problem.
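>>>
>>> If you want to double-check what the start-up shell really hands to
>>> qmaster, the limits can be inspected directly, e.g. in bash:
>>>
>>>    # ulimit -Hn
>>>    # ulimit -Sn
>>>
>>> and those should normally match the two "descriptor limit" lines in
>>> your messages file.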
>>> 
>>>>> 
>>>>>> Other than that nothing special.
>>>>>> 
>>>>>> Also when I restart the qmaster I get messages:
>>>>>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>>>>  starting sge_qmaster
>>>>>>  starting sge_schedd
>>>>>> daemonize error: timeout while waiting for daemonize state
>>>>> 
>>>>> That means the scheduler is having some problem during start-up. From
>>>>> the message one cannot say what is causing the problem, but it could be
>>>>> that qmaster is in turn having problems.
>>>> I am restarting them after the crash when the cluster is fully loaded.
>>>> Is it possible that it just needs more time to re-read all
>>>> the info about running and pending jobs?
>>> 
>>> Actually, I would rule this out.
>>> 
>>>> Where would the scheduler print any messages about problems it is having?
>>> 
>>> For investigating the problem I suggest you launch qmaster and the
>>> scheduler separately as binaries rather than using the sgemaster script.
>>> All you need are two root shells with the Grid Engine environment
>>> (settings.{sh|csh}) sourced.
>>> 
>>> Then you do this:
>>>
>>>    # setenv SGE_ND
>>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>>> 
>>> If everything went well with the qmaster start-up (e.g. test whether
>>> qhost gets you reasonable output), you continue with launching the
>>> scheduler from the other shell:
>>>
>>>    # setenv SGE_ND
>>>    # $SGE_ROOT/bin/lx24-x86/sge_schedd
>>> 
>>> but my expectation is that qmaster will already report some problem and
>>> exit. Normally qmaster would not return to the prompt at all, since
>>> SGE_ND prevents it from daemonizing and keeps it in the foreground.
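>>>
>>> (If your root shells are sh/bash rather than csh, the equivalent of the
>>> setenv line is
>>>
>>>    # SGE_ND=1; export SGE_ND
>>>
>>> as far as I know the variable only needs to be set, its value does not
>>> matter.)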
>>>
>>>>>>  starting sge_shadowd
>>>>>> error: getting configuration: failed receiving gdi request
>>>>> 
>>>>> Next indication for a crashed or sick qmaster.
>>>>>
>>>>>>  starting up GE 6.0u11 (lx24-x86)
>>>>>> 
>>>>>> How bad is any of that, could crashes be related to it?
>>>>> 
>>>>> Very likely.
>>>>> 
>>>>>> I am running on RHEL3 .
>>>>> 
>>>>> Have you tried some other OS?
>>>> We will be upgrading shortly, but at this time I have no choice; I have
>>>> to keep the cluster running with the OS I have.
>>>> 
>>>> Yesterday I gathered some more empirical evidence about the crashes -
>>>> it might be just a coincidence. The story is long and related to a
>>>> filesystem we are using (GPFS), but here is the part related to SGE.
>>> 
>>> Actually I'm not aware of any problem with GPFS, but it could be related.
>>> Is qmaster spooling located on the GPFS volume? Are you using classic or 
>>> BDB spooling?
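>>>
>>> Both the spooling method and the qmaster spool directory can be read
>>> from the bootstrap file, e.g. (cell name "default" assumed):
>>>
>>>    # egrep 'spooling_method|qmaster_spool_dir' $SGE_ROOT/default/common/bootstrap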
>>> 
>>> 
>>>> Sometimes on the client host the filesystem daemons get killed and that 
>>>> leaves the SGE processes on the client defunct - still there, but master 
>>>> cannot communicate with them. qdel will not dispose of the user's job, 
>>>> the load is not reported.
>>>> The easiest is to just reboot the node - it does not happen very often,
>>>> just a few nodes per day at most.
>>>> 
>>>> But even if I reboot the node, the client will not start properly unless
>>>> I clean the local spool directory. I have not figured out which files are
>>>> interfering, but if I delete the whole local spool, the directory gets
>>>> recreated and everything is ok, so that's what I have been doing: reboot,
>>>> delete the local spool subdirectory, restart the SGE client.
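>>>>
>>>> (In script form that is roughly the following, run after the reboot;
>>>> the spool path and the rc script name are specific to our setup:
>>>>
>>>>    # /etc/init.d/sgeexecd stop
>>>>    # rm -rf /var/spool/sge/`hostname`
>>>>    # /etc/init.d/sgeexecd start
>>>>
>>>> and the execd recreates its spool directory on start-up.)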
>>> 
>>> Usually there are no problems with execution nodes if local spooling is 
>>> used. Ugh!
>>> 
>>> 
>>>> Yesterday I decided to streamline my procedure and delete that local
>>>> spool directory before rebooting the node. The moment I delete that
>>>> local spool, the master that runs on a different host crashes right away.
>>>> 
>>>> I managed to crash it a few times, then I went back to my old procedure
>>>> - first reboot, then remove the local scratch - and all has been running
>>>> well.
>>>> 
>>>> (The startup messages about problems are still there, but once started
>>>> SGE runs well and I do not see any other problems.)
>>> 
>>> Bah, ugh, yuck!!! Well, it sounds as if it were a good idea to move away
>>> from GPFS ... at least for SGE spooling. Can't you switch to a more
>>> conventional FS for that purpose?
>>> 
>>> Regards,
>>> Andreas
>>> 
>> 
>

http://gridengine.info/


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



