Opened 19 months ago

Last modified 15 months ago

#1622 new defect

Array jobs (tasks) being killed without log entries

Reported by: caldodge
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.1.9
Severity: major
Keywords:
Cc:

Description

We are running array jobs (4 nodes, 72 tasks per node). Out of the 288 tasks, anywhere between 2 and 8 mysteriously die partway through. There are no log entries associated with the failed tasks, other than qmaster/messages entries like:

01/10/2018 20:12:23|worker|CLUSTER|W|job 37376.17 failed on host HOSTNAME assumedly after job because: job 37376.17 died through signal KILL (9)
01/10/2018 20:12:23|worker|CLUSTER|W|job 37376.20 failed on host HOSTNAME assumedly after job because: job 37376.20 died through signal KILL (9)

There are no log messages related to the failed tasks in the nodes' SGE messages files. Other errors do appear, but those reference tasks which DID finish:

01/10/2018 17:07:53| main|HOSTNAME|E|removing unreferenced job 37376.15 without job report from ptf
01/10/2018 17:08:12| main|HOSTNAME|W|reaping job "37376.14" PTF complains: Job does not exist
01/10/2018 17:12:23| main|HOSTNAME|W|reaping job "37376.21" PTF complains: Job does not exist
01/10/2018 17:13:23| main|HOSTNAME|E|removing unreferenced job 37376.23 without job report from ptf

Tasks are doing genome mapping, using a statically compiled build of STAR 2.5.3ab (https://github.com/alexdobin/STAR).

1) The genome is loaded into node memory (so specific nodes are specified in qsub).
2) qsub for the mapping jobs is run. Each job references a different genome by name, does some genetic mapping and sorting, and creates output files.
3) 2-8 tasks mysteriously fail.

We've confirmed that system resources and run time are not the cause. We've gone all the way down to allowing only one task at a time ("-tc 1") and telling STAR to use only one thread (so just one core of a 12-core VM). We even ran each set on the local filesystem, rather than on NFS storage via Isilon.
Failed tasks run fine when rerun interactively rather than via the scheduler, and the failed tasks are never the same ones, so it's not a matter of bad data.
The OS is CentOS 7.3.1611, kernel 3.10.0-514.el7.x86_64. Grid Engine is 8.1.9-1, installed from the RPMs in the Fedora Project COPR repository (loveshack/SGE). Nodes are 96 GB, 12-core VMs running under VMware. The customer's VMware personnel report that no errors are seen in the virtual environment.

It's not just the STAR program - the customer has run array jobs with other software and sees the same intermittent failures.

Can anyone suggest a possible cause, or a diagnostic step we haven't tried?


Change History (25)

comment:1 Changed 19 months ago by w.hay@…

On Thu, Jan 11, 2018 at 09:16:28PM +0000, SGE wrote:

We've confirmed that system resources and run time are not the cause.

How did you verify system resources aren't the cause?

Can anyone suggest a possible cause, or a diagnostic step we haven't
tried?

Enable KEEP_ACTIVE and have a look in the job specific log files?

You have to be a bit careful with this though. IIRC it can cause issues
if jobs are set to restart so setting aside a single node for testing purposes
would be best.
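For reference, a minimal sketch of how KEEP_ACTIVE is usually switched on (the qconf workflow below is standard SGE administration and an assumption about this site's setup, not something confirmed in the thread):

# edit the global or a host-local cluster configuration
qconf -mconf
# then add or extend the execd_params line, e.g.:
execd_params   KEEP_ACTIVE=TRUE

With that set, the per-job active_jobs directory in the execd spool (trace, error, exit_status, ...) is kept after the job ends instead of being deleted.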

This sort of thing is best initially brought up on sge-discuss@…
or (preferably) users@…, in case it is working as designed even
though it isn't behaving as you expect.

William


comment:2 in reply to: ↑ 1 Changed 19 months ago by caldodge

Replying to w.hay@…:

On Thu, Jan 11, 2018 at 09:16:28PM +0000, SGE wrote:

We've confirmed that system resources and run time are not the cause.

How did you verify system resources aren't the cause?

LOTS of measuring during execution (memory, CPU, disk, open files, locks, etc.), as well as reducing the impact of active tasks (originally four 3-thread tasks could run at the same time, and we reduced that gradually to one 1-thread task). We've also tried monitoring the tasks and manually rerunning a task which had failed (while other tasks were still being run via the scheduler). In every case the manual rerun finishes without a problem.

Can anyone suggest a possible cause, or a diagnostic step we haven't
tried?

Enable KEEP_ACTIVE and have a look in the job specific log files?

Thanks, we'll try that.

You have to be a bit careful with this though. IIRC it can cause issues
if jobs are set to restart so setting aside a single node for testing purposes
would be best.

Jobs are not set to restart, so we should be good there. Thanks for your advice.

This sort of thing is best initially brought up in sge-discuss@…
or (preferably) users@… in case it is working as designed even when
this isn't working as expected.

My apologies - I just figured I should submit a ticket to start things off.

comment:4 Changed 19 months ago by w.hay@…


The most obvious possibility is that you've run into one of grid engine's configured limits.
The erratic enforcement might be because you exceed the limit only for a few seconds and
sometimes grid engine doesn't notice. The most likely candidate is h_vmem. A dump of the
complex configuration, the configuration of the queue and host where the job ran, and a
qstat -j of a job that triggers the issue would help verify whether this is the case.
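The commands that generate those dumps, for anyone following along (standard SGE client commands; the queue, host and job names below are placeholders taken from elsewhere in this ticket):

qconf -sc              # complex (resource) definitions
qconf -sq highmem.q    # queue configuration
qconf -se NODE1        # execution host configuration
qstat -j 37376         # full request list of a job while it is pending or running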

William


comment:6 Changed 19 months ago by caldodge

It doesn't appear to be a limit on a complex value. Memory usage stays
low, especially with the current limits of one task at a time and one
thread per task.

[root@HEADNODE ~]# qconf -sc
#name shortcut type relop requestable consumable default urgency
#--------------------------------------------------------------------------------------
arch a STRING == YES NO NONE 0
calendar c STRING == YES NO NONE 0
cpu cpu DOUBLE >= YES NO 0 0
display_win_gui dwg BOOL == YES NO 0 0
h_core h_core MEMORY <= YES NO 0 0
h_cpu h_cpu TIME <= YES NO 0:0:0 0
h_data h_data MEMORY <= YES NO 0 0
h_fsize h_fsize MEMORY <= YES NO 0 0
h_rss h_rss MEMORY <= YES NO 0 0
h_rt h_rt TIME <= YES NO 0:0:0 0
h_stack h_stack MEMORY <= YES NO 0 0
h_vmem h_vmem MEMORY <= YES NO 0 0
hostname h HOST == YES NO NONE 0
load_avg la DOUBLE >= NO NO 0 0
load_long ll DOUBLE >= NO NO 0 0
load_medium lm DOUBLE >= NO NO 0 0
load_short ls DOUBLE >= NO NO 0 0
m_core core INT <= YES NO 0 0
m_socket socket INT <= YES NO 0 0
m_thread thread INT <= YES NO 0 0
m_topology topo STRING == YES NO NONE 0
m_topology_inuse utopo STRING == YES NO NONE 0
mem_free mf MEMORY <= YES NO 0 0
mem_total mt MEMORY <= YES NO 0 0
mem_used mu MEMORY >= YES NO 0 0
min_cpu_interval mci TIME <= NO NO 0:0:0 0
np_load_avg nla DOUBLE >= NO NO 0 0
np_load_long nll DOUBLE >= NO NO 0 0
np_load_medium nlm DOUBLE >= NO NO 0 0
np_load_short nls DOUBLE >= NO NO 0 0
num_proc p INT == YES NO 0 0
qname q STRING == YES NO NONE 0
rerun re BOOL == NO NO 0 0
s_core s_core MEMORY <= YES NO 0 0
s_cpu s_cpu TIME <= YES NO 0:0:0 0
s_data s_data MEMORY <= YES NO 0 0
s_fsize s_fsize MEMORY <= YES NO 0 0
s_rss s_rss MEMORY <= YES NO 0 0
s_rt s_rt TIME <= YES NO 0:0:0 0
s_stack s_stack MEMORY <= YES NO 0 0
s_vmem s_vmem MEMORY <= YES NO 0 0
seq_no seq INT == NO NO 0 0
slots s INT <= YES YES 1 1000
swap_free sf MEMORY <= YES NO 0 0
swap_rate sr MEMORY >= YES NO 0 0
swap_rsvd srsv MEMORY >= YES NO 0 0
swap_total st MEMORY <= YES NO 0 0
swap_used su MEMORY >= YES NO 0 0
tmpdir tmp STRING == NO NO NONE 0
virtual_free vf MEMORY <= YES NO 0 0
virtual_total vt MEMORY <= YES NO 0 0
virtual_used vu MEMORY >= YES NO 0 0

[root@HEADNODE Test_Data]# qconf -sq highmem.q
qname highmem.q
hostlist @himemhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make smp mpi
rerun FALSE
slots 1,LISTOFNODES
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY

[root@HEADNODE spool]# cat qmaster/exec_hosts/NODE1
# Version: 8.1.9
#
# DO NOT MODIFY THIS FILE MANUALLY!
#
hostname NODE1
load_scaling NONE
complex_values mem_free=94G
load_values
arch=lx-amd64,num_proc=12,mem_total=96520.636719M,swap_total=4095.996094M,virtual_total=100616.632812M,m_topology=SCSCSCSCSCSCSCSCSCSCSCSC,m_socket=12,m_core=12,m_thread=12
processors 12
reschedule_unknown_list NONE
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE

Array jobs are invoked with "qsub -q highmem.q -l h=$NODENAME -hold_jid $LOADMEMPID -N $JOBNAME -t $TASKCOUNT -tc 1 -cwd $SCRIPTNAME+PARAMETERS"

It's almost never the same tasks which fail. The following are from one node:

12/26/2017 17:43:21|worker|csclprd1|W|job 36980.65 failed on host NODE4 assumedly after job because: job 36980.65 died through signal KILL (9)
12/26/2017 17:43:21|worker|csclprd1|W|job 36980.66 failed on host NODE4 assumedly after job because: job 36980.66 died through signal KILL (9)
12/27/2017 02:33:30|worker|csclprd1|W|job 37064.1 failed on host NODE4 assumedly after job because: job 37064.1 died through signal KILL (9)
01/04/2018 15:44:38|worker|csclprd1|W|job 37302.4 failed on host NODE4 assumedly after job because: job 37302.4 died through signal KILL (9)
01/04/2018 15:44:38|worker|csclprd1|W|job 37302.5 failed on host NODE4 assumedly after job because: job 37302.5 died through signal KILL (9)
01/10/2018 20:03:39|worker|csclprd1|W|job 37378.5 failed on host NODE4 assumedly after job because: job 37378.5 died through signal KILL (9)
01/10/2018 20:03:39|worker|csclprd1|W|job 37378.7 failed on host NODE4 assumedly after job because: job 37378.7 died through signal KILL (9)

For the last two failed tasks, here are the start time, end time, and
memory usage for them, and a successful task which ran at the same
time:

taskid 5
start_time Wed Jan 10 17:00:38 2018
end_time Wed Jan 10 17:03:38 2018
maxvmem 11.571GB
taskid 6
start_time Wed Jan 10 17:01:23 2018
end_time Wed Jan 10 17:03:38 2018
maxvmem 50.880GB
taskid 7
start_time Wed Jan 10 17:01:53 2018
end_time Wed Jan 10 17:03:38 2018
maxvmem 0.000B

Here are the same values for the two tasks which failed on 1/4:

taskid 1
start_time Thu Jan 4 12:42:23 2018
end_time Thu Jan 4 12:44:37 2018
maxvmem 512.801MB
taskid 4
start_time Thu Jan 4 12:42:53 2018
end_time Thu Jan 4 12:44:37 2018
maxvmem 15.249GB
taskid 5
start_time Thu Jan 4 12:44:08 2018
end_time Thu Jan 4 12:44:37 2018
maxvmem 6.291GB

As mentioned before, the nodes have 96 GB of RAM, so the jobs aren't
exhausting memory (especially the ones from 1/4).

But the common end time makes me suspicious. Is it possible that SGE
is killing jobs mistakenly?


comment:7 Changed 19 months ago by w.hay@…

On Mon, Jan 15, 2018 at 09:10:09PM +0000, SGE wrote:

slots 1,LISTOFNODES

Can I assume that LISTOFNODES is actually a long list of nodes and values in square brackets?

Array jobs are invoked with "qsub -q highmem.q -l h=$NODENAME -hold_jid $LOADMEMPID -N $JOBNAME -t $TASKCOUNT -tc 1 -cwd $SCRIPTNAME+PARAMETERS"

The complete output of qstat -j on a job would be more useful than this. Jobs can pick up requests
from lots of different locations (sge_request, values inside the script, JSV twiddling). It's best to ask
grid engine what requests apply to the job rather than try to work it out. If the output of qstat -j
contains information you consider confidential, then edit it to remove just that information and indicate
where you have done so.

Extracting the appropriate line from the accounting file for the job task that failed might also provide a clue.
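A sketch of how that accounting record can be pulled (assumptions: this qacct supports per-task selection with -t, and the cell is the default one so the accounting file lives under $SGE_ROOT/default/common; the job and task IDs are from the failures reported above):

qacct -j 37378 -t 5                                 # per-task accounting summary
grep ':37378:' $SGE_ROOT/default/common/accounting  # or the raw colon-separated record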

As mentioned before, the nodes have 96G of RAM, so the jobs aren't
exhausting memory (especially 1/4).

But the common end time makes me suspicious. Is it possible that SGE
is killing jobs mistakenly?

That looks like the job is dying on startup for some reason. It is possible
that SGE is killing jobs mistakenly, but in the past mysterious job deaths have
almost always turned out to be something else, and fixes have been applied where
that wasn't the case. In any case, one still needs to investigate the circumstances
that provoke the kill.

Are there any clues in the output of the job (.e and .o files)?

If the problem shows up frequently enough (and it sounds like it does), then running
the execd on a node in the foreground with debugging turned up high might provide some clues
as to what is going on. Unfortunately, getting the debug level right (informative but not
a firehose) can be tricky. The man page for sge_dl should help you get started.
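Roughly what that looks like on a single test node (a sketch only; it assumes the SGE_ND no-daemonize variable and the dl helper shipped in $SGE_ROOT/util/dl.sh, and the service name will differ per installation):

# stop the normally daemonized execd on the test node first
/etc/init.d/sgeexecd stop                      # or whatever init script / systemd unit this cluster uses
. $SGE_ROOT/util/dl.sh                         # defines the dl function that sets SGE_DEBUG_LEVEL
dl 2                                           # moderate verbosity; adjust up or down as needed
SGE_ND=true $SGE_ROOT/bin/lx-amd64/sge_execd   # run execd in the foreground with debug output on the terminal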

William


comment:9 Changed 19 months ago by caldodge

William,

Thanks for your reply. FWIW, I sent a message to
users@…, and also registered with the mailing list on the
appropriate page, but have gotten no response from either action.

If you need data from "qstat -j" I'll need to run another test job,
then watch for failures during the job, since "qstat -j" returns
nothing after the job is finished. I'll run that test tomorrow.

There are no clues in the job output files - the "o" files for
failures differ from successes only in that the failed ones don't have
the last "job completed" line.

I ran strace on each task the last few times, but the output wasn't
enlightening, except in demonstrating that the job death could happen
at any point, rather than the same point in job execution. Based on
the strace output I have to say the jobs are NOT failing immediately
after being started.

"qacct -j" shows nothing remarkably different for the failed tasks,
but I'll include that output, too, after I run that test job tomorrow.
I'll also run sge_execd in the foreground with debugging, as you
suggested.

Thanks for your help. I hope this will help us work toward a solution
for this issue.

Sincerely,

Calvin Dodge


comment:10 Changed 19 months ago by caldodge

"Can I assume that LISTOFNODES is actually a long list of nodes and
values in square brackets?"

Yes, it is.

Calvin Dodge


comment:11 Changed 19 months ago by w.hay@…

On Wed, Jan 17, 2018 at 06:50:03PM -0600, Calvin Dodge wrote:

William,

Thanks for your reply. FWIW, I sent a message to
users@…, and also registered with the mailing list on the
appropriate page, but have gotten no response from either action.

It appears users@… was slightly broken for the last few
days. I mailed the chap who owns gridengine.org and it now appears to
be catching up but your message didn't show up. You may need to resend
your message and possibly rejoin the list (my messages sent during the
list outage showed up but I've been a member for a while).

We probably need a bit more info. The errors and trace files from
the active_jobs directory of the job (preserved after the end of
the job via KEEP_ACTIVE) might give us a clue.
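Concretely, with KEEP_ACTIVE in effect those files are kept in the execd spool on the node. The path below assumes the default cell name and the usual spool layout, with one of the failed tasks from the logs above as the example:

ls $SGE_ROOT/default/spool/NODE4/active_jobs/37378.5/
# of interest here: trace, error and exit_status (alongside config, environment, usage, ...)
less $SGE_ROOT/default/spool/NODE4/active_jobs/37378.5/trace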

William


comment:12 Changed 19 months ago by caldodge

I enabled "KEEP_ACTIVE=TRUE", and ran test jobs. Interestingly, as long as I limited the tasks to "one at a time", all the jobs finished, and there were no mentions of "reaping job "XXXXX.XX" PTF complains: Job does not exist" in the compute node's SGE "messages" file.

When I ran a test with "-tc 4" (up to 4 tasks at a time), I had the usual job failures, and multiple "PTF complains: Job does not exist" messages.

What files would you like from the job spool directories? I did a quick diff between a failed task and a successful one, but other than the usual mysterious "killed by signal 9", I saw nothing that I recognized as a cause for the failure.

comment:14 Changed 19 months ago by w.hay@…

The error and trace files would probably be helpful.

William



comment:15 Changed 19 months ago by caldodge

My current working theory is that this is a bug in the scheduler - either in the version installed, or in a library, or something in the environment.

I say that because yesterday I rewrote the implementation script to run the genome jobs separately, rather than as tasks in an array job. I ran the test 3 times, each on a different filesystem (local, test NFS, and production NFS), and ALL the jobs finished without errors or being mysteriously killed. Also, there were no "PTF complains: Job does not exist" messages.

I'm going to run one more test this way using ALL of the nodes the customer was using. If that succeeds, I'll just suggest that the customer needs to stay away from array jobs.

comment:16 Changed 19 months ago by caldodge

The test using all four nodes failed. That is, each node had a few job failures. I looked in the job spool directories, but again found nothing that looked like a smoking gun. Other issues are leading me to suspect it's something with the VMs that are being used as nodes, so I'm running yet another test at reduced load (using 9 of the 12 available slots on each VM) to see if that helps.

comment:17 Changed 19 months ago by caldodge

We finally found a fix for this issue. With this software and/or OS version, SGE assigns a supplementary group ID to each job for process tracking. While one such group matched an existing group ID, other jobs failed with different (non-existent) group IDs.

So our working theory is that the configured range of group IDs (20000-20100) was insufficient for queueing approximately 300 jobs. To cover both possibilities (insufficient groups OR overlap with a real group ID) we changed the range to 20200-21000. Since then we've run many tests, and not one job has failed.

The key to this WAS the suggestion that we set "KEEP_ACTIVE=TRUE", as examining the "trace" file of a failed job jogged a memory regarding supplementary groups. Thanks for the help!
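For anyone landing here with the same symptom, the gid_range change described above is made in the global cluster configuration (a sketch assuming the stock qconf workflow; the range values are the ones quoted in this comment):

qconf -mconf
# change the gid_range line from the old value to the new, wider range:
gid_range   20200-21000

# sanity check that no real groups fall inside the new range (should print nothing):
getent group | awk -F: '$3 >= 20200 && $3 <= 21000'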

comment:18 Changed 19 months ago by caldodge

I've been asked to inquire - is there any "best practice" regarding the gid_range for those older versions of SGE and OS? Is there any drawback to using a really big range?

comment:19 Changed 19 months ago by caldodge

The available documentation says "gid_range should be greater than the number of jobs which will run concurrently on one host". But our experience suggests that it needs to be greater than the total number of jobs which will be in the queue. Does anyone else have experience with this issue?

comment:20 Changed 19 months ago by caldodge

Here are selected portions of 3 trace files. They WERE all assigned different GIDs, yet the middle job (38266.1) obviously killed the other two.

38263.1/trace:01/22/2018 15:14:32 [0:1578]: setpgid(1578, 1578) returned 0
38263.1/trace:01/22/2018 15:14:32 [0:1578]: parent: forked "job" with pid 1579
38263.1/trace:01/22/2018 15:14:32 [0:1579]: pid=1579 pgrp=1579 sid=1579 old pgrp=1578 getlogin()=<no login set>
38263.1/trace:01/22/2018 15:14:32 [0:1578]: parent: job-pid: 1579
38263.1/trace:01/22/2018 15:14:32 [0:1579]: setting additional gid=20000
38263.1/trace:01/22/2018 15:15:48 [0:1578]: reaped "job" with pid 1579
38263.1/trace:01/22/2018 15:15:48 [0:1578]: now sending signal KILL to pid -1579
38263.1/trace:01/22/2018 15:15:48 [0:1578]: pdc_kill_addgrpid: 20000 9
38263.1/trace:01/22/2018 15:15:48 [0:1578]: killing pid 1659/2
38263.1/trace:01/22/2018 15:15:48 [0:1578]: killing pid 1663/2

38266.1/trace:01/22/2018 15:14:47 [0:1667]: setpgid(1667, 1667) returned 0
38266.1/trace:01/22/2018 15:14:47 [0:1667]: parent: forked "job" with pid 1668
38266.1/trace:01/22/2018 15:14:47 [0:1667]: parent: job-pid: 1668
38266.1/trace:01/22/2018 15:14:47 [0:1668]: pid=1668 pgrp=1668 sid=1668 old pgrp=1667 getlogin()=<no login set>
38266.1/trace:01/22/2018 15:14:47 [0:1668]: setting additional gid=20001
38266.1/trace:01/22/2018 15:15:48 [0:1667]: reaped "job" with pid 1668
38266.1/trace:01/22/2018 15:15:48 [0:1667]: now sending signal KILL to pid -1668
38266.1/trace:01/22/2018 15:15:48 [0:1667]: pdc_kill_addgrpid: 20001 9
38266.1/trace:01/22/2018 15:15:48 [0:1667]: killing pid 1579/2
38266.1/trace:01/22/2018 15:15:48 [0:1667]: killing pid 1659/2
38266.1/trace:01/22/2018 15:15:48 [0:1667]: killing pid 1663/2
38266.1/trace:01/22/2018 15:15:48 [0:1667]: killing pid 1766/2
38266.1/trace:01/22/2018 15:15:48 [0:1667]: killing pid 1846/2
38266.1/trace:01/22/2018 15:15:48 [0:1667]: killing pid 1850/2
38266.1/trace:01/22/2018 15:15:48 [0:1667]: killing pid 1851/2

38269.1/trace:01/22/2018 15:15:32 [0:1765]: setpgid(1765, 1765) returned 0
38269.1/trace:01/22/2018 15:15:32 [0:1765]: parent: forked "job" with pid 1766
38269.1/trace:01/22/2018 15:15:32 [0:1765]: parent: job-pid: 1766
38269.1/trace:01/22/2018 15:15:32 [0:1766]: pid=1766 pgrp=1766 sid=1766 old pgrp=1765 getlogin()=<no login set>
38269.1/trace:01/22/2018 15:15:32 [0:1766]: setting additional gid=20002
38269.1/trace:01/22/2018 15:15:48 [0:1765]: reaped "job" with pid 1766
38269.1/trace:01/22/2018 15:15:48 [0:1765]: now sending signal KILL to pid -1766
38269.1/trace:01/22/2018 15:15:48 [0:1765]: pdc_kill_addgrpid: 20002 9
38269.1/trace:01/22/2018 15:15:48 [0:1765]: killing pid 1846/2
38269.1/trace:01/22/2018 15:15:48 [0:1765]: killing pid 1850/2

Do you want the full trace files? I included just the above portions to (IMHO) increase the signal-to-noise ratio. Given the GIDs which were assigned to such jobs (20000 and a few above, or 20100 and one or two below), I suspect some sort of edge case when a counter wraps around.

comment:22 Changed 19 months ago by w.hay@…


We need to determine whether:
1) Grid Engine is mistakenly killing a job that isn't in the group it is supposed to kill,
OR
2) The jobs are in groups that they're not supposed to be in (as far as grid engine is concerned).
Can you check:
i) that the GID(s) in question (20001 above) aren't otherwise used on your system;
ii) what groups the erroneously targeted jobs are actually in (as opposed to what they are
supposed to be in). As you are using Linux, grep G /proc/self/status as the first line of the
job script should record this information (see the sketch below).
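A sketch of check (ii) as the first line of the job script (the comment below is an assumption: on this kernel the lines in /proc/self/status containing a capital G are "Gid:" and "Groups:", which is what the grep is meant to capture):

#!/bin/sh
grep G /proc/self/status    # records Gid: and Groups: - the supplementary GIDs the task really runs with
# ... remainder of the existing job script unchanged ...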

As you are using 8.1.9 on a modern Linux, you could use USE_CGROUPS rather than
ENABLE_ADDGRP_KILL as a workaround, although switching it on in a running system
might be a tad tricky.

William

