Ticket #1409 (new defect)

Opened 2 years ago

Last modified 2 years ago

Qmaster process uses all memory and gets killed by the OOM killer

Reported by: bdeluca@…
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.0.0d
Severity: minor
Keywords:
Cc:

Description

Hi!

I am experiencing a problem on 8.0.0c and later code, where the qmaster
suddenly increases the amount of RAM it is using and is then killed by
the OOM killer. On 8.0.0c this removes the parallel environments,
causing all processing with PEs to stop, even when we detect the
failure and recover.

I am working backwards to determine when this starts happening.
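
While narrowing down when the growth starts, it may help to log the qmaster's resident memory at intervals. The following is only a sketch, assuming a Linux host where /proc/<pid>/status contains a "VmRSS:" line; parse_vmrss_kb and read_rss_kb are hypothetical helper names and are not part of Grid Engine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the "VmRSS:" line in the text of a
 * /proc/<pid>/status file and return its value in kB, or -1 if absent. */
long parse_vmrss_kb(const char *status_text)
{
    assert(status_text != NULL);
    const char *line = status_text;
    while (line && *line) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            long kb;
            /* %ld skips the tabs/spaces between the label and the number */
            return (sscanf(line + 6, "%ld", &kb) == 1) ? kb : -1;
        }
        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line) {
            line++;
        }
    }
    return -1;
}

/* Hypothetical helper: read /proc/<pid>/status (Linux only, assumed)
 * and return the resident set size in kB, or -1 on any failure. */
long read_rss_kb(long pid)
{
    char path[64];
    char buf[8192];
    snprintf(path, sizeof path, "/proc/%ld/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_vmrss_kb(buf);
}
```

Calling read_rss_kb with the qmaster's PID once per scheduling interval and logging the result would show whether the memory use climbs steadily or jumps in a single scheduling pass.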

Change History

comment:1 Changed 2 years ago by bdeluca@…

Sorry, we are using 8.0.0d, not 8.0.0c.

On Wed, Feb 8, 2012 at 11:51 AM, SGE <sge-bugs@…> wrote:

--
Ticket URL: <https://arc.liv.ac.uk/trac/SGE/ticket/1409>
SGE <https://arc.liv.ac.uk/trac/SGE>
Son of Grid Engine:  Community continuation of work on Grid Engine

comment:2 Changed 2 years ago by bdeluca@…

I have further details regarding this issue.

Our grid runs with PEs and consumable complexes.
I copy our default queue and submit a job that uses a PE and consumables.
With 10 exec hosts I eventually run out of RAM; I tried with up to 12 GB.

Backtraces from the qmaster while it is running out of RAM look like this:

<CUTANDPASTEFAIL> lCreateElem (dp=0x7ffa0619af20) at ../libs/cull/cull_list.c:909
#1 0x000000000059d7be in lCopyElemHash (ep=0x7ffa061cc580, isHash=true) at ../libs/cull/cull_list.c:176
#2 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>, src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1586
#3 0x00000000005a8a27 in lSelectHashPack (name=0x7ffa061c9438 "job ids", slp=0x7ffa061af860, cp=0x0, enp=0x0, isHash=true, pb=0x0) at ../libs/cull/cull_db.c:838
#4 0x000000000059bfe1 in lCopySwitchPack (sep=<value optimized out>, dep=0x7ff95ad2d1c0, src_idx=<value optimized out>, dst_idx=<value optimized out>, isHash=<value optimized out>, ep=<value optimized out>, pb=0x0) at ../libs/cull/cull_list.c:398
#5 0x000000000059d807 in lCopyElemHash (ep=0x7ffa061cb700, isHash=true) at ../libs/cull/cull_list.c:182
#6 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>, src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1586
#7 0x00000000004d0e54 in parallel_tag_queues_suitable4job (a=0x7ffa34bfd110, use_category=<value optimized out>, available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:4207
#8 parallel_assignment (a=0x7ffa34bfd110, use_category=<value optimized out>, available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:4976
#9 0x00000000004d3045 in parallel_maximize_slots_pe (best=0x7ffa34bfd620, available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:927
#10 0x00000000004d421c in sge_select_parallel_environment (best=0x7ffa34bfd620, pe_list=<value optimized out>) at ../libs/sched/sge_select_queue.c:524
#11 0x000000000043b8f4 in select_assign_debit (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0, order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:1085
#12 dispatch_jobs (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0, order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:764
#13 scheduler_method (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0, order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:250
#14 0x00000000004337f4 in sge_scheduler_main (arg=0x7ffa38668f10) at ../daemons/qmaster/sge_thread_scheduler.c:866
#15 0x00000037efc06ccb in start_thread (arg=0x7ffa34bfe700) at pthread_create.c:301
#16 0x00000037ef4e0c2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

#0 lCopyElemHash (ep=0x7ff9d1809000, isHash=true) at ../libs/cull/cull_list.c:174
#1 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>, src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1586
#2 0x00000000005a8a27 in lSelectHashPack (name=0x7ff9d19ff4b0 "job ids", slp=0x7ff9d19ec7d0, cp=0x0, enp=0x0, isHash=true, pb=0x0) at ../libs/cull/cull_db.c:838
#3 0x000000000059bfe1 in lCopySwitchPack (sep=<value optimized out>, dep=0x7ff922638b80, src_idx=<value optimized out>, dst_idx=<value optimized out>, isHash=<value optimized out>, ep=<value optimized out>, pb=0x0) at ../libs/cull/cull_list.c:398
#4 0x000000000059d807 in lCopyElemHash (ep=0x7ff9d18085c0, isHash=true) at ../libs/cull/cull_list.c:182
#5 0x000000000059d8f0 in lCopyListHash (name=<value optimized out>, src=<value optimized out>, hash=true) at ../libs/cull/cull_list.c:1586
#6 0x00000000004d0e54 in parallel_tag_queues_suitable4job (a=0x7ffa34bfd110, use_category=<value optimized out>, available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:4207
#7 parallel_assignment (a=0x7ffa34bfd110, use_category=<value optimized out>, available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:4976
#8 0x00000000004d3045 in parallel_maximize_slots_pe (best=0x7ffa34bfd620, available_slots=0x7ffa34bfd3bc) at ../libs/sched/sge_select_queue.c:927
#9 0x00000000004d421c in sge_select_parallel_environment (best=0x7ffa34bfd620, pe_list=<value optimized out>) at ../libs/sched/sge_select_queue.c:524
#10 0x000000000043b8f4 in select_assign_debit (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0, order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:1085
#11 dispatch_jobs (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0, order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:764
#12 scheduler_method (evc=0x7ffa4195af60, answer_list=0x7ffa34bfde50, lists=0x7ffa34bfdae0, order=<value optimized out>) at ../daemons/qmaster/sge_sched_thread.c:250
#13 0x00000000004337f4 in sge_scheduler_main (arg=0x7ffa38668f10) at ../daemons/qmaster/sge_thread_scheduler.c:866
#14 0x00000037efc06ccb in start_thread (arg=0x7ffa34bfe700) at pthread_create.c:301
#15 0x00000037ef4e0c2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Will look closer tomorrow

comment:3 Changed 2 years ago by bdeluca@…

It looks like this code,

if (use_category->use_category) {
   lList *temp = schedd_mes_get_tmp_list();
   lWriteListToStr(temp, &temp_string);
   DPRINTF(("temp_list %s", sge_dstring_get_string(&temp_string)));
   if (temp){
      DPRINTF(("temp list processing\n"));
      lSetList(use_category->cache, CCT_job_messages,
               lschedCopyList(NULL, temp));

seems to be the culprit. Sometimes the list returned by
schedd_mes_get_tmp_list seems to be very large. That might contain.

I don't think the reservation code has anything to do with it any more.
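
To illustrate the suspected pattern, here is a minimal, self-contained C sketch. It does not use the real CULL API: msg_list, msg_list_append, msg_list_copy and category_cache_store are hypothetical stand-ins for the temporary message list, a deep list copy, and the store into the category cache. It also assumes that the temporary list keeps growing between scheduling passes, which is a guess based on the backtraces above and is not confirmed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one scheduler message in a list. */
typedef struct msg {
    char *text;
    struct msg *next;
} msg;

/* Hypothetical stand-in for the temporary message list. */
typedef struct {
    msg *head;
    msg *tail;
    size_t count;
} msg_list;

/* Hypothetical stand-in for the per-category cache. */
typedef struct {
    msg_list *job_messages;
} category_cache;

msg_list *msg_list_new(void)
{
    return calloc(1, sizeof(msg_list));
}

void msg_list_free(msg_list *l)
{
    if (!l) {
        return;
    }
    for (msg *m = l->head; m != NULL; ) {
        msg *next = m->next;
        free(m->text);
        free(m);
        m = next;
    }
    free(l);
}

/* Append a copy of text; returns 0 on success, -1 if allocation fails. */
int msg_list_append(msg_list *l, const char *text)
{
    size_t n = strlen(text) + 1;
    msg *m = calloc(1, sizeof(*m));
    if (!m) {
        return -1;
    }
    m->text = malloc(n);
    if (!m->text) {
        free(m);
        return -1;
    }
    memcpy(m->text, text, n);
    if (l->tail) {
        l->tail->next = m;
    } else {
        l->head = m;
    }
    l->tail = m;
    l->count++;
    return 0;
}

/* Deep copy: allocates a new element and string for every source element. */
msg_list *msg_list_copy(const msg_list *src)
{
    msg_list *dst = msg_list_new();
    if (!dst) {
        return NULL;
    }
    for (const msg *m = src->head; m != NULL; m = m->next) {
        if (msg_list_append(dst, m->text) != 0) {
            msg_list_free(dst);
            return NULL;
        }
    }
    return dst;
}

/* Store a full copy of tmp in the cache, freeing the previous copy. */
int category_cache_store(category_cache *c, const msg_list *tmp)
{
    assert(c != NULL && tmp != NULL);
    msg_list *copy = msg_list_copy(tmp);
    if (!copy) {
        return -1;
    }
    msg_list_free(c->job_messages);
    c->job_messages = copy;
    return 0;
}
```

In this model, if the temporary list is never cleared between passes, every call to category_cache_store copies everything accumulated so far. The copy held by each category then grows with every pass, and with many job categories the total multiplies, which would be consistent with the qmaster spending its time in list-copy functions while memory climbs.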
