[GE users] qmaster dying again....

Iwona Sakrejda isakrejda at lbl.gov
Tue Jul 31 18:47:56 BST 2007


Hi,

Nobody picked up on this thread, and today both the master and the
scheduling daemon are at 0.5GB each. Is that normal? They have not
crashed since 07/27, but even when the load goes down they never
shrink; they just grow more slowly. That looks like a memory leak to
me, but I am not sure how to approach debugging this problem.

I can schedule a maintenance period and try debugging, but I would like
to have a better plan for what to debug and how.
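
In the meantime my rough plan is simply to log the daemons' sizes at
regular intervals, something like the sketch below (the log file
location is arbitrary):

    #!/bin/sh
    # log RSS/VSZ (in KB) of the qmaster and scheduler every 10 minutes
    while true; do
        date >> /tmp/sge_mem.log
        ps -e -o pid,rss,vsz,comm | egrep 'sge_qmaster|sge_schedd' >> /tmp/sge_mem.log
        sleep 600
    done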


Thank You,

Iwona

Iwona Sakrejda wrote:
> Since my qmaster and scheduler daemons toppled over lately for
> "no good reason", I started watching their size. I started them ~27h
> ago and they were at ~50MB each. Now they have both tripled in size.
>
> When I started there were about 4k jobs in the system. Now there are
> about 9k. But over the last 27h the number of jobs has sometimes
> decreased, and the daemons are still growing slowly but steadily. I
> have only serial jobs, about 450 running at any time on ~230 hosts;
> the rest are pending.
>
> I run 6.0u11 on RHEL3.
>
> Is that growth normal or should it be a reason for concern?
> Does anybody run a comparable configuration and load?
> I cannot enable reporting either. When I try, those daemons
> (the master and the scheduler) crash right away too.
> I enabled core dumping, so I hope to have more info next time
> the system crashes.
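>
> (Roughly what I did to get cores - just a sketch; the actual core file
> name depends on the kernel's core settings:)
>
>    # ulimit -c unlimited
>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>
> and after a crash, inspect the core with gdb:
>
>    # gdb $SGE_ROOT/bin/lx24-x86/sge_qmaster core.<pid>
>    (gdb) bt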
>
> Thank You,
>
> Iwona
>
>
> Andreas.Haas at Sun.COM wrote:
>> Hi Iwona,
>>
>> On Wed, 18 Jul 2007, Iwona Sakrejda wrote:
>>
>>>
>>>
>>> Andreas.Haas at Sun.COM wrote:
>>>> Hi Iwona,
>>>>
>>>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning my 
>>>>> qmaster started dying.
>>>>
>>>> Did you do this as described here?
>>>>
>>>>    http://gridengine.sunsource.net/install60patch.txt
>>> Yes, all went through ok, no problems encountered during the upgrade.
>>> I was very happy about that.
>>
>> Ok.
>>
>>>>
>>>>
>>>>> When I look at the logs I see messages:
>>>>>
>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit 
>>>>> is set to 8192
>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit 
>>>>> is set to 8192
>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 
>>>>> file descriptors for communication
>>>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 
>>>>> dynamic event clients
>>>>
>>>> That is fine. It says qmaster has enough file descriptors available.
>>> My cluster consists of ~250 nodes with 2 CPUs each, and we run 1 job per CPU.
>>> We routinely have a few thousand jobs pending, and at peak it goes up
>>> to ~15k.
>>> I am not sure what the file descriptors and dynamic event clients are used for....
>>
>> Dynamic event clients are only needed for DRMAA clients and when
>>
>>    qsub -sync y
>>
>> is used. Usually the default of 99 is ample. The same is true of
>> the 8192 file descriptors: if you estimate 1 file descriptor for
>> each node, you still have 8192-250 spare fd's for client commands
>> connecting to qmaster. So this one can safely be excluded as the root
>> of your qmaster problem.
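>>
>> If you want to double-check, you can count the descriptors qmaster
>> actually holds open, e.g. (assuming a Linux /proc and that pidof
>> finds the daemon):
>>
>>    # ls /proc/`pidof sge_qmaster`/fd | wc -l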
>>
>>>>
>>>>> Other than that nothing special.
>>>>>
>>>>> Also when I restart the qmaster I get messages:
>>>>> [root@pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>>>  starting sge_qmaster
>>>>>  starting sge_schedd
>>>>> daemonize error: timeout while waiting for daemonize state
>>>>
>>>> That means the scheduler is having some problem during start-up. From
>>>> the message one cannot say what is causing the problems, but it
>>>> could be that qmaster is in turn having problems.
>>> I am restarting them after the crash, when the cluster is fully
>>> loaded. Is it possible that it just needs more time to re-read all
>>> the info about running and pending jobs?
>>
>> Actually, I would rule that out.
>>
>>> Where would the scheduler print any messages about problems it is 
>>> having?
>>
>> For investigating the problem I suggest you launch qmaster and the
>> scheduler separately as binaries rather than using the sgemaster script.
>> All you need is two root shells with the Grid Engine environment
>> (settings.{sh|csh}) sourced.
>>
>> Then you do this:
>>
>>    # setenv SGE_ND
>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster
>>
>> If everything went well with the qmaster start-up (e.g. test
>> whether qhost gives you reasonable output), you continue by launching
>> the scheduler from the other shell:
>>
>>    # setenv SGE_ND
>>    # $SGE_ROOT/bin/lx24-x86/sge_schedd
>>
>> but my expectation is that qmaster itself will already report some
>> problem and exit. Note that with SGE_ND in the environment qmaster
>> normally does not return to the prompt, since SGE_ND prevents daemonizing.
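>>
>> If your root shell happens to be sh/bash rather than csh, the
>> equivalent would of course be:
>>
>>    # export SGE_ND=1
>>    # $SGE_ROOT/bin/lx24-x86/sge_qmaster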
>>
>>>>>  starting sge_shadowd
>>>>> error: getting configuration: failed receiving gdi request
>>>>
>>>> Another indication of a crashed or sick qmaster.
>>>>
>>>>>  starting up GE 6.0u11 (lx24-x86)
>>>>>
>>>>> How bad is any of that? Could the crashes be related to it?
>>>>
>>>> Very likely.
>>>>
>>>>> I am running on RHEL3 .
>>>>
>>>> Have you tried some other OS?
>>> We will be upgrading shortly, but at this time I have no choice; I
>>> have to keep the cluster running with the OS I have.
>>>
>>> Yesterday I gathered some more empirical evidence about the crashes
>>> - it might just be a coincidence. The story is long and related to a
>>> filesystem we are using (GPFS), but here is the part related to SGE.
>>
>> Actually I'm not aware of any problem with GPFS, but it could be 
>> related.
>> Is qmaster spooling located on the GPFS volume? Are you using classic 
>> or BDB spooling?
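>>
>> Both can be checked in the bootstrap file, e.g. (assuming your cell
>> is named "default"):
>>
>>    # egrep 'spooling_method|qmaster_spool_dir' $SGE_ROOT/default/common/bootstrap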
>>
>>
>>> Sometimes on the client host the filesystem daemons get killed, and
>>> that leaves the SGE processes on the client defunct - still there,
>>> but the master cannot communicate with them. qdel will not dispose of
>>> the user's job, and the load is not reported.
>>> The easiest thing is to just reboot the node - it does not happen very
>>> often, just a few nodes per day at most.
>>>
>>> But even if I reboot the node, the client will not start properly
>>> unless I clean the local spool directory. I have not figured out which
>>> files are interfering, but if I delete the whole local spool, the
>>> directory gets recreated and everybody is OK, so that's what I have
>>> been doing: reboot, delete the local spool subdirectory, restart the
>>> SGE client.
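>>>
>>> (In shell terms the workaround is roughly the following, run on the
>>> node after the reboot; the local spool path is only an example:)
>>>
>>>    # rm -rf /var/spool/sge/`hostname`
>>>    # /etc/rc.d/init.d/sgeexecd start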
>>
>> Usually there are no problems with execution nodes if local spooling 
>> is used. Ugh!
>>
>>
>>> Yesterday I decided to streamline my procedure and delete that local
>>> spool directory before rebooting the node. The moment I delete that
>>> local spool, the master, which runs on a different host, crashes right away.
>>>
>>> I managed to crash it a few times, then I went back to my old procedure
>>> - first reboot, then remove the local scratch - and all has been
>>> running well.
>>>
>>> (The startup messages about problems are still there, but once
>>> started, SGE runs well and I do not see any other problems.)
>>
>> Bah, ugh, yuck!!! Well, it sounds as if it would be a good idea to
>> move away from GPFS ... at least for SGE spooling. Can't you switch
>> to a more conventional FS for that purpose?
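>>
>> If it turns out you are on classic spooling, the move itself should be
>> fairly painless - roughly along these lines, with qmaster down (the
>> paths are only examples):
>>
>>    # /etc/rc.d/init.d/sgemaster stop
>>    # cp -a <gpfs_spool>/qmaster /var/spool/sge/qmaster
>>      (then point qmaster_spool_dir in $SGE_ROOT/default/common/bootstrap
>>       at the new location)
>>    # /etc/rc.d/init.d/sgemaster start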
>>
>> Regards,
>> Andreas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



