Opened 9 years ago
Closed 7 years ago
#1409 closed defect (fixed)
Qmaster process uses all memory and gets killed by the OOM killer
Reported by: | bdeluca@… | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 8.0.0d |
Severity: | minor | Keywords: | |
Cc: |
Description
Hi!
I am experiencing a problem on 8.0.0c and later code, where
the qmaster will suddenly increase the amount of RAM it is using and then
be killed by the OOM killer. On 8.0.0c this removes the parallel
environments, causing all processing with PEs to stop, even when we
detect the failure and recover.
I am working backwards to determine when this starts happening.
Change History (4)
comment:1 Changed 9 years ago by bdeluca@…
comment:2 Changed 9 years ago by bdeluca@…
I have further details regarding this issue.
Our grid runs with PEs and consumable complexes.
I copy our default queue and submit a job that uses a PE and consumables.
With 10 exec hosts I eventually run out of RAM; I tried up to 12 GB.
Backtraces from the qmaster while it is running out of RAM look like this:
<CUTANDPASTEFAIL> lCreateElem (dp=0x7ffa0619af20) at
../libs/cull/cull_list.c:909
#1 0x000000000059d7be in lCopyElemHash (ep=0x7ffa061cc580,
isHash=true) at ../libs/cull/cull_list.c:176
#2 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>,
src=<value optimized out>, hash=true)
at ../libs/cull/cull_list.c:1586
#3 0x00000000005a8a27 in lSelectHashPack (name=0x7ffa061c9438 "job
ids", slp=0x7ffa061af860, cp=0x0, enp=0x0, isHash=true, pb=0x0)
at ../libs/cull/cull_db.c:838
#4 0x000000000059bfe1 in lCopySwitchPack (sep=<value optimized out>,
dep=0x7ff95ad2d1c0, src_idx=<value optimized out>,
dst_idx=<value optimized out>, isHash=<value optimized out>,
ep=<value optimized out>, pb=0x0) at ../libs/cull/cull_list.c:398
#5 0x000000000059d807 in lCopyElemHash (ep=0x7ffa061cb700,
isHash=true) at ../libs/cull/cull_list.c:182
#6 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>,
src=<value optimized out>, hash=true)
at ../libs/cull/cull_list.c:1586
#7 0x00000000004d0e54 in parallel_tag_queues_suitable4job
(a=0x7ffa34bfd110, use_category=<value optimized out>,
available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:4207
#8 parallel_assignment (a=0x7ffa34bfd110, use_category=<value
optimized out>, available_slots=0x7ffa34bfd3bc)
at ../libs/sched/sge_select_queue.c:4976
#9 0x00000000004d3045 in parallel_maximize_slots_pe
(best=0x7ffa34bfd620, available_slots=0x7ffa34bfd3bc)
at ../libs/sched/sge_select_queue.c:927
#10 0x00000000004d421c in sge_select_parallel_environment
(best=0x7ffa34bfd620, pe_list=<value optimized out>)
at ../libs/sched/sge_select_queue.c:524
#11 0x000000000043b8f4 in select_assign_debit (evc=0x7ffa4195af60,
answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0,
order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:1085
#12 dispatch_jobs (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50,
lists=0x7ffa34bfdae0, order=<value optimized out>)
at ../daemons/qmaster/sge_sched_thread.c:764
#13 scheduler_method (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50,
lists=0x7ffa34bfdae0, order=<value optimized out>)
at ../daemons/qmaster/sge_sched_thread.c:250
#14 0x00000000004337f4 in sge_scheduler_main (arg=0x7ffa38668f10) at
../daemons/qmaster/sge_thread_scheduler.c:866
#15 0x00000037efc06ccb in start_thread (arg=0x7ffa34bfe700) at
pthread_create.c:301
#16 0x00000037ef4e0c2d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115
#0 lCopyElemHash (ep=0x7ff9d1809000, isHash=true) at
../libs/cull/cull_list.c:174
#1 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>,
src=<value optimized out>, hash=true)
at ../libs/cull/cull_list.c:1586
#2 0x00000000005a8a27 in lSelectHashPack (name=0x7ff9d19ff4b0 "job
ids", slp=0x7ff9d19ec7d0, cp=0x0, enp=0x0, isHash=true, pb=0x0)
at ../libs/cull/cull_db.c:838
#3 0x000000000059bfe1 in lCopySwitchPack (sep=<value optimized out>,
dep=0x7ff922638b80, src_idx=<value optimized out>,
dst_idx=<value optimized out>, isHash=<value optimized out>,
ep=<value optimized out>, pb=0x0) at ../libs/cull/cull_list.c:398
#4 0x000000000059d807 in lCopyElemHash (ep=0x7ff9d18085c0,
isHash=true) at ../libs/cull/cull_list.c:182
#5 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>,
src=<value optimized out>, hash=true)
at ../libs/cull/cull_list.c:1586
#6 0x00000000004d0e54 in parallel_tag_queues_suitable4job
(a=0x7ffa34bfd110, use_category=<value optimized out>,
available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:4207
#7 parallel_assignment (a=0x7ffa34bfd110, use_category=<value
optimized out>, available_slots=0x7ffa34bfd3bc)
at ../libs/sched/sge_select_queue.c:4976
#8 0x00000000004d3045 in parallel_maximize_slots_pe
(best=0x7ffa34bfd620, available_slots=0x7ffa34bfd3bc)
at ../libs/sched/sge_select_queue.c:927
#9 0x00000000004d421c in sge_select_parallel_environment
(best=0x7ffa34bfd620, pe_list=<value optimized out>)
at ../libs/sched/sge_select_queue.c:524
#10 0x000000000043b8f4 in select_assign_debit (evc=0x7ffa4195af60,
answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0,
order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:1085
#11 dispatch_jobs (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50,
lists=0x7ffa34bfdae0, order=<value optimized out>)
at ../daemons/qmaster/sge_sched_thread.c:764
#12 scheduler_method (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50,
lists=0x7ffa34bfdae0, order=<value optimized out>)
at ../daemons/qmaster/sge_sched_thread.c:250
#13 0x00000000004337f4 in sge_scheduler_main (arg=0x7ffa38668f10) at
../daemons/qmaster/sge_thread_scheduler.c:866
#14 0x00000037efc06ccb in start_thread (arg=0x7ffa34bfe700) at
pthread_create.c:301
#15 0x00000037ef4e0c2d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115
I will look closer tomorrow.
comment:3 Changed 9 years ago by bdeluca@…
It looks like this code
    if (use_category->use_category) {
        lList *temp = schedd_mes_get_tmp_list();
        lWriteListToStr(temp, &temp_string);
        DPRINTF(("temp_list %s", sge_dstring_get_string(&temp_string)));
        if (temp){
            DPRINTF(("temp list processing\n"));
            lSetList(use_category->cache, CCT_job_messages,
                     lschedCopyList(NULL, temp));
seems to be the culprit. Sometimes the list from schedd_mes_get_tmp_list
seems to be very large. That might contain.
I don't think the reservation code has anything to do with it any more.
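To make the suspicion above concrete, here is a minimal toy model in C of why deep-copying a large temporary message list into a per-category cache is expensive. It is only a sketch under stated assumptions: `msg_node`, `category_cache`, `msg_copy` and `cache_store` are hypothetical names invented for illustration, not the CULL `lList` API and not the change made in [4735]. It only shows that every copy allocates one new node per message, so a temporary list that keeps growing makes every cached copy grow with it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a scheduler message list; NOT the real CULL lList. */
typedef struct msg_node {
    char *text;
    struct msg_node *next;
} msg_node;

/* Number of nodes currently allocated, so the cost of a copy is visible. */
static size_t live_nodes = 0;

/* Prepend one message. Returns the new head, or the old head if
 * allocation fails. */
static msg_node *msg_push(msg_node *head, const char *text)
{
    size_t len = strlen(text) + 1;
    msg_node *n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->text = malloc(len);
    if (n->text == NULL) {
        free(n);
        return head;
    }
    memcpy(n->text, text, len);
    n->next = head;
    live_nodes++;
    return n;
}

/* Free every node of a list. */
static void msg_free(msg_node *head)
{
    while (head != NULL) {
        msg_node *next = head->next;
        assert(live_nodes > 0);
        free(head->text);
        free(head);
        live_nodes--;
        head = next;
    }
}

/* Deep copy: allocates one new node per source node, preserving order.
 * Returns NULL for an empty source or on allocation failure. */
static msg_node *msg_copy(const msg_node *src)
{
    msg_node *head = NULL;
    msg_node **tail = &head;
    for (; src != NULL; src = src->next) {
        msg_node *n = msg_push(NULL, src->text);
        if (n == NULL) {
            msg_free(head);
            return NULL;
        }
        *tail = n;
        tail = &n->next;
    }
    return head;
}

/* Hypothetical per-category cache slot holding its own copy of the list. */
typedef struct {
    msg_node *messages;
} category_cache;

/* Store a fresh deep copy in the cache, releasing the previous copy
 * first so that repeated stores do not accumulate old lists. */
static void cache_store(category_cache *c, const msg_node *tmp)
{
    msg_free(c->messages);
    c->messages = msg_copy(tmp);
}
```

In this toy, a temporary list of 1000 messages copied into 10 cache slots leaves 11000 nodes allocated (1000 originals plus 10 copies of 1000). Storing into a slot again keeps the total constant only because `cache_store` frees the old copy; if the temporary list itself is never emptied between passes, each pass copies an ever larger list.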
comment:4 Changed 7 years ago by dlove
- Resolution set to fixed
- Status changed from new to closed
Bother; that might have saved time. I failed to find this when searching for leak reports,
and must have forgotten about it. Should be fixed by [4735] anyhow.
Sorry, we are using 8.0.0d,
not 8.0.0c.
On Wed, Feb 8, 2012 at 11:51 AM, SGE <sge-bugs@…> wrote: