[GE users] sge_qmaster bug - PE wildcards
jerry37 at seznam.cz
Tue Feb 9 13:01:37 GMT 2010
I've encountered a bug in SGE that affects jobs which request a
parallel environment (PE) by wildcard.
It occurs when all of the following conditions are met:
1) schedd_job_info in the scheduler configuration is set to true
2) several PEs are available (reproducible with at least 6)
3) a number of jobs that use PE wildcards are submitted
*and* scheduled within the *same* scheduling interval (around
16 or more jobs reproduce it easily)
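The third condition can be sketched as a small submission script. This is only an illustration, not the exact commands I ran: the wildcard pattern "*", the slot count, the job names and the "sleep 60" payload are all assumptions, and it simply submits the jobs back to back in the hope that they land in one scheduling interval. By default it only prints the qsub command lines.

```python
import shlex
import subprocess


def build_qsub_commands(n_jobs=16, pe_pattern="*", slots=2, payload=("sleep", "60")):
    """Return one qsub argument list per job, each requesting a wildcard PE.

    All defaults are illustrative assumptions, not values from the report.
    """
    commands = []
    for i in range(n_jobs):
        commands.append([
            "qsub",
            "-N", f"pewild{i}",               # hypothetical job name
            "-pe", pe_pattern, str(slots),    # PE requested by wildcard
            "-b", "y",                        # treat the payload as a binary
            *payload,
        ])
    return commands


def submit_all(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: nothing is submitted unless dry_run=False is passed.
    submit_all(build_qsub_commands())
```

Passing the arguments as a list keeps the shell from expanding the wildcard before qsub sees it.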
In this scenario, the qmaster starts to allocate massive amounts
(easily up to 8GB) of system memory, and it usually ends either with
the OOM killer stepping in or with a plain crash of the sge_qmaster process.
This happens either instantly (when a larger number of jobs is about
to be scheduled, 'larger' meaning 10+) or after some time, while the
sge_qmaster process is struggling for system memory (the OOM case).
In the OOM case, the qmaster messages file usually contains:
01/15/2010 18:44:34|event_|sgemaster1|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "sgemaster1"
01/15/2010 18:44:34|event_|sgemaster1|E|removing event client (schedd:0) on host "sgemaster1" after acknowledge timeout from event client list
01/15/2010 18:44:34|event_|sgemaster1|I|event client "scheduler" with id 1 deregistered
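To check whether a messages file shows the same symptom, the lines above can be scanned with a short script. This is a minimal sketch that assumes the pipe-delimited layout seen in the excerpt; the field names (timestamp, source, host, level, text) are my own labels, not official ones.

```python
from typing import NamedTuple, Optional


class QmasterMessage(NamedTuple):
    timestamp: str
    source: str   # e.g. "event_"; the meaning of this field is a guess
    host: str
    level: str    # single letter such as I, W or E, as in the excerpt
    text: str


def parse_line(line: str) -> Optional[QmasterMessage]:
    """Split one messages-file line into its five pipe-separated fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return QmasterMessage(*parts)


def scheduler_timeouts(lines):
    """Return the entries reporting an acknowledge timeout for schedd."""
    hits = []
    for line in lines:
        msg = parse_line(line)
        if msg and "(schedd:" in msg.text and "acknowledge timeout" in msg.text:
            hits.append(msg)
    return hits
```

Splitting at most four times keeps any "|" inside the message text intact.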
I found this issue on SGE 6.2u4 and was also able to reproduce it
on SGE 6.2u5. I am running it on Linux x86_64 with a recent SuSE
distribution, using the courtesy binaries.
Interestingly, when any one of the conditions is not met,
the bug cannot be reproduced in my scenario.
You can download the sge_qmaster debug log file produced by setting 'dl 10'
and starting the qmaster. It was started, jobs were submitted, and it crashed
almost instantly after eating up all the system memory.
Let me know if there is any other information I can provide to help track this down.