[GE users] sge_qmaster bug - PE wildcards
jerry37 at seznam.cz
Tue Feb 9 18:51:37 GMT 2010
That seems to be it. So just to get this straight: this behaviour is caused by schedd_mes_find_others(), as stated in issue 2464, when it tries to generate scheduling information messages for all of the currently scheduled jobs. Also, it's not necessarily linked only to PE wildcards, but to any request that causes a larger number of messages to be generated.
Just one thing doesn't match up: I was able to reproduce this with a single queue, 6 PEs and 4 exec hosts in total, trying to schedule 16 jobs at once (the queue was empty apart from these jobs). Given that, it's quite easy to crash the qmaster even with such a minimal configuration. IMO there should at least be a warning about the 'schedd_job_info' setting in the sched_conf manpage or somewhere similar, especially since, as far as I understand, the aim is to abolish this completely: http://wiki.gridengine.info/wiki/index.php/DispatchingDiagnosisOnDemand
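For anyone who wants to check whether their own setup has 'schedd_job_info' switched on, here is a minimal Python sketch. It assumes the scheduler configuration is printed by 'qconf -ssconf' as one "name value" pair per line; the helper names (parse_sched_conf, job_info_enabled, read_sched_conf) are made up for this example and are not part of SGE.

```python
import subprocess


def parse_sched_conf(text):
    """Parse 'qconf -ssconf' style output into a dict of name -> value.

    Assumes one whitespace-separated "name value" pair per line.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def job_info_enabled(conf):
    """Return True if schedd_job_info is present and not 'false'.

    A 'job_list ...' value counts as enabled, since messages are still
    collected for the listed jobs. A missing entry is reported as False,
    which only means "not found in this output".
    """
    value = conf.get("schedd_job_info")
    return value is not None and value.lower() != "false"


def read_sched_conf():
    """Fetch the live scheduler configuration (needs qconf on PATH)."""
    result = subprocess.run(
        ["qconf", "-ssconf"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    if job_info_enabled(parse_sched_conf(read_sched_conf())):
        print("schedd_job_info is enabled")
    else:
        print("schedd_job_info is disabled or not listed")
```

If it reports the setting as enabled, it can be changed by editing the scheduler configuration with 'qconf -msconf' and setting schedd_job_info to false.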
> Is this bug http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464 ?
> On 02/09/2010 02:01 PM, jerry37 wrote:
> > Hello,
> > I've encountered a bug in SGE that relates to jobs with parallel
> > environment request when using wildcard expansion.
> > It takes place when the following conditions are met:
> > 1) schedd_job_info in the scheduler configuration is set to true
> > 2) several PEs are available (reproducible with at least 6)
> > 3) a certain number of jobs that use PE wildcards are submitted
> > *and* scheduled within the *same* scheduling interval (use around
> > 16+ jobs to reproduce easily)
> > In such a scenario, the qmaster starts to allocate massive amounts
> > (easily up to 8GB) of system memory, and it usually ends either
> > with the OOM-killer kicking in or with a simple crash of the sge_qmaster process.
> > This happens either instantly (when a bigger number of jobs is
> > about to be scheduled - well .. 'bigger' meaning 10+) or after some time,
> > while the sge_qmaster process is struggling for system memory (the OOM case).
> > In the OOM case, there is usually a message like this in the qmaster messages file:
> > 01/15/2010 18:44:34|event_|sgemaster1|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "sgemaster1"
> > 01/15/2010 18:44:34|event_|sgemaster1|E|removing event client (schedd:0) on host "sgemaster1" after acknowledge timeout from event client list
> > 01/15/2010 18:44:34|event_|sgemaster1|I|event client "scheduler" with id 1 deregistered
> > I found this issue on SGE 6.2u4 and was also able to reproduce it
> > on SGE 6.2u5. I am running it on Linux x86_64 with a recent SuSE distro,
> > using the courtesy binaries distribution.
> > The interesting thing is that when any one of the conditions is not met,
> > the bug cannot be reproduced in my scenario.
> > You can download the sge_qmaster debug log file produced by setting 'dl 10'
> > and starting qmaster. It was started, jobs were submitted and it instantly
> > crashed after eating up all the system memory.
> > URL: http://jerry37.xf.cz/sgemaster1.log.bz2
> > Let me know if there is any other information I can provide to help track this down.
> > Best regards,
> > Jerry
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=244086
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> Richard Ems mail: Richard.Ems at Cape-Horn-Eng.com
> Cape Horn Engineering S.L.
> C/ Dr. J.J. Dómine 1, 5º piso
> 46011 Valencia
> Tel : +34 96 3242923 / Fax 924