[GE users] sge_qmaster bug - PE wildcards

rems0 Richard.Ems at cape-horn-eng.com
Tue Feb 9 15:53:38 GMT 2010



Is this the same bug as http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464 ?


On 02/09/2010 02:01 PM, jerry37 wrote:
> Hello,
> 
>   I've encountered a bug in SGE that affects jobs which request a parallel
> environment (PE) using wildcard expansion.
> 
> It occurs when the following conditions are met:
> 
> 1) schedd_job_info in the scheduler configuration is set to true
> 2) several PEs are available (reproducible with at least 6)
> 3) a number of jobs that use PE wildcards are submitted
>    *and* scheduled within the *same* scheduling interval (use around
>    16+ jobs to reproduce easily)
> 
>   In this scenario, the qmaster starts to allocate massive amounts
> (easily up to 8GB) of system memory, and it usually ends either with
> the OOM killer stepping in or with a plain crash of the sge_qmaster process.
>   This happens either instantly (when a larger number of jobs is
> about to be scheduled - well .. 'larger' meaning 10+) or after some time,
> while the sge_qmaster process is struggling for system memory (the OOM case).
> In the OOM case, there is usually a message like this in the qmaster messages file:
> 
> 01/15/2010 18:44:34|event_|sgemaster1|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "sgemaster1"
> 01/15/2010 18:44:34|event_|sgemaster1|E|removing event client (schedd:0) on host "sgemaster1" after acknowledge timeout from event client list
> 01/15/2010 18:44:34|event_|sgemaster1|I|event client "scheduler" with id 1 deregistered
> 
>   I found this issue on SGE 6.2u4 and was also able to reproduce it
> in SGE 6.2u5. I am running it on Linux x86_64 with a recent SuSE release,
> using the courtesy binaries distribution.
> 
>   Interestingly, when any one of the conditions is not met,
> the bug cannot be reproduced in my scenario.
> 
>   You can download the sge_qmaster debug log file produced by setting 'dl 10'
> and starting qmaster. It was started, jobs were submitted, and it instantly
> crashed after eating up all the system memory.
> 
> URL: http://jerry37.xf.cz/sgemaster1.log.bz2
> 
> Let me know if there is any other information I can provide to help track this down.
> 
> Best regards,
> 
>   Jerry
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=244086
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> 
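
For anyone trying to reproduce the three conditions quoted above, here is a minimal sketch of the submission step. It is not taken from the original report. The PE pattern "mpi*", the slot count, the job names and the script name sleeper.sh are all placeholders, and it assumes at least six PEs matching the pattern already exist and that schedd_job_info is set to true (qconf -ssconf shows the current scheduler configuration). It only prints the qsub commands unless dry_run is switched off.

```python
import shlex
import subprocess


def build_submit_commands(n_jobs=16, pe_pattern="mpi*", slots=2, script="sleeper.sh"):
    """Return one qsub argument list per job, each requesting a wildcard PE.

    All defaults are hypothetical placeholders; adjust them to the local setup.
    """
    return [
        ["qsub", "-N", f"pewild{i}", "-pe", pe_pattern, str(slots), script]
        for i in range(1, n_jobs + 1)
    ]


def submit_all(commands, dry_run=True):
    """Print each command; only call qsub when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: nothing is submitted to the cluster.
    submit_all(build_submit_commands())
```

Submitting the jobs in one quick loop like this makes it likely, though not certain, that they are all considered within the same scheduling interval.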


-- 
Richard Ems       mail: Richard.Ems at Cape-Horn-Eng.com

Cape Horn Engineering S.L.
C/ Dr. J.J. Dómine 1, 5º piso
46011 Valencia
Tel : +34 96 3242923 / Fax 924
http://www.cape-horn-eng.com

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=244104

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


