Opened 4 years ago

Last modified 4 years ago

#1567 new defect

Automatic job suspension, tight integration and bad cleanup

Reported by: matthieu Owned by:
Priority: normal Milestone:
Component: sge Version: current development
Severity: major Keywords:
Cc:

Description

Recently we switched from a loose integration to a tight integration of our workhorse commercial simulation software, in order to limit the occurrence of runaway processes left on slave nodes when job termination fails.

Unfortunately, our lowp.q infrastructure, which worked very well before that change, started to refuse to run rescheduled jobs that were voluntarily suspended (with the queue suspend threshold feature based on a custom complex) to leave room for other higher priority jobs. By inspecting SGE's behavior and the source code, we think we have clearly identified the reason for this apparent bug.
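
For context, the production mechanism relies on a custom complex attached to lowp.q as a suspend threshold. A minimal sketch of that part of the setup, assuming a hypothetical INT complex named suspend_flag (the actual complex name and values in our setup differ): add a line to the complex configuration via qconf -mc (which opens it in an editor),

#name          shortcut  type  relop  requestable  consumable  default  urgency
suspend_flag   sf        INT   >=     YES          NO          0        0

and attach it as a suspend threshold to the queue:

$ qconf -mattr queue suspend_thresholds suspend_flag=1 lowp.q

Note that the minimal reproduction below deliberately leaves suspend_thresholds at NONE and triggers the suspension manually instead.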

Here is a minimal setup to reproduce the bug.

We have a tight integration parallel environment called mpi with its allocation rule set to 4:

$ qconf -sp mpi
pe_name            mpi
slots              216
user_lists         NONE
xuser_lists        NONE
start_proc_args    /import/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /import/gridengine/mpi/stopmpi.sh
allocation_rule    4
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

We define a checkpointing environment:

$ qconf -sckpt ckpt-Test
ckpt_name          ckpt-Test
interface          APPLICATION-LEVEL
ckpt_command       NONE
migr_command       /import/gridengine/poi/scripts/Ckpt/test-suspend.sh
restart_command    NONE
clean_command      NONE
ckpt_dir           NONE
signal             NONE
when               xsr

The test-suspend.sh is a no-op:

$ cat test-suspend.sh
#!/bin/sh
exit 0

We define a test queue with instances on 2 nodes (cn100 and cn101) of 8 slots each. We attach the parallel environment and the checkpointing environment previously defined:

$ qconf -sq lowp.q_test
qname                 lowp.q_test
hostlist              @cn100-101
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:02:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             ckpt-Test
pe_list               mpi
rerun                 TRUE
slots                 8
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

We run a waiting loop on 8 slots (so 4 slots on each node):

qsub -pe mpi 8 -ckpt ckpt-Test -q lowp.q_test waiting-loop.sh
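
For reference, waiting-loop.sh can be as simple as the following sketch (our actual script differs; the exact content is unimportant, as long as the job only waits and never spawns anything on the slave node). Since the queue uses /bin/csh with posix_compliant shell_start_mode, a csh loop is enough:

$ cat waiting-loop.sh
#!/bin/csh
# hypothetical minimal waiting loop used for the reproduction
while (1)
    sleep 60
end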

Let's say the JOB_ID is 1000. It is scheduled and eventually runs as expected on 2 nodes of 4 slots each with cn100 being the master node and cn101 the slave. Then we decide to suspend the job:

qalter -sj 1000

After some delay, the job is actually suspended. As there is no other job running in lowp.q_test and this is the only queue instance defined on those 2 nodes, we would expect the job to be rescheduled to run immediately. Unfortunately, it waits forever and the qstat command gives:

$ qstat -j 1000
# ...
cannot run on host "cn101" until clean up of an previous run has finished
cannot run in PE "mpi" because it only offers 4 slots

The job was a waiting loop, so technically it never started any processes on the slave node (cn101). It is somewhat surprising that it might have left anything to be cleaned up on cn101...

If we try to schedule a second job on the same queue:

qsub -pe mpi 8 -ckpt ckpt-Test -q lowp.q_test waiting-loop.sh

It will never be scheduled either (for the same reason as the rescheduled job) until we qdel the first job.

The reason SGE thinks it needs to "clean up" the first job is that the JOB_ID is included in the reschedule_unknown_list of the slave node:

$ cat /import/gridengine/poi/spool/qmaster/exec_hosts/cn101
# Version: 2011.11p1
# 
# DO NOT MODIFY THIS FILE MANUALLY!
# 
hostname              cn101
load_scaling          NONE
complex_values        exclusive=1
load_values           arch=linux-x64,num_proc=8,mem_total=64321.257812M,swap_total=32255.992188M,virtual_total=96577.250000M
processors            8
reschedule_unknown_list 1000=1=8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

If we look at the code in https://arc.liv.ac.uk/trac/SGE/browser/sge/source/daemons/qmaster/sge_give_jobs.c#L1126, starting at l.1126:

case COMMIT_ST_RESCHEDULED:
case COMMIT_ST_USER_RESCHEDULED:
case COMMIT_ST_FAILED_AND_ERROR:
// ...

then the if block later on (l.1142):

if (pe && lGetBool(pe, PE_control_slaves)) {

makes sense, in our opinion, only if the job was rescheduled because qmaster lost network communication with one of the job's slave nodes and then decided, after some configured delay (reschedule_unknown), to reschedule the job on other nodes.

This is not our main use case. We want to reschedule our jobs arbitrarily, and we have no problem rescheduling them on the same nodes as before. The if block should discriminate, if possible, between voluntary suspension (queue suspension due to some load threshold, or user suspension) and "accidental" rescheduling due to a loss of communication with one of the compute nodes.

Best regards

Change History (8)

comment:1 in reply to: ↑ description Changed 4 years ago by matthieu

Just a small correction to the previous message. To suspend a job, we do not use

qalter -sj 1000

but

qmod -sj 1000

comment:2 Changed 4 years ago by dlove

Thanks for the diagnosis. I'll look at it as soon as I can.

comment:3 Changed 4 years ago by bougui

Hello Mr Love,

Any news on this issue? Is there any way we can work around it?

Could there be a CLI command that would allow us to clear that state on a node where we have previously suspended a job?

Maybe a way to remove a node from the reschedule_unknown_list?

TIA.

Guillaume

comment:4 Changed 4 years ago by dlove

Any news on this issue? Is there any way we can work around it?

Sorry I neglected that. I see I added a comment in the code but wasn't
sure I understood.

Looking at it again, I'm definitely confused: Why do you expect the job
to be rescheduled when it is suspended? Was there a step missing in the
description? (For tight integration I wouldn't expect it to be relevant
whether or not there was actually a process running on a slave node.)

Could there be a CLI command that would allow us to clear that state on a node
where we have previously suspended a job?

Maybe a way to remove a node from the reschedule_unknown_list?

Not as far as I know.

comment:5 Changed 4 years ago by matthieu

Hello,

The need is the following (from the beginning of my first message):

[...] jobs that were voluntarily suspended (with the queue suspend threshold feature based on a custom complex) to leave room for other higher priority jobs.

Once the high priority job(s) (in a business sense) have finished, the suspended jobs can be restarted from where they were suspended.

In the example I gave in my first message to reproduce the bug, I left out this "automatically decide to make room for an incoming high priority job" part and simply replaced it with a voluntary manual command that leads to the same problem we are seeing:

qmod -sj JOB_ID

In that example, I would expect the job to restart immediately, which is what happens when the job is loosely integrated. Admittedly, in that test case it is not really useful, but again, it was just meant to show a minimal way of reproducing the bug.

This is a real show stopper for us because, on a small-scale cluster (a dozen nodes or so), once SGE has suspended several tightly integrated jobs, all waiting jobs (the very important ones and the low priority ones alike) are blocked from further execution, because all our nodes end up in that reschedule_unknown_list.

So the workaround we have now is that all jobs that are candidates for suspension have to be launched in a loosely integrated parallel environment (sketched below). That is acceptable, but because those jobs can be stopped and restarted several times (sometimes 10 to 20 times), they are the ones most at risk of ending up in a runaway state. That is why those low priority suspendable jobs would benefit the most from being launched with tightly integrated PEs.
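
For completeness, the loosely integrated PE used for this workaround differs from mpi essentially by control_slaves (and by not catching rsh in start_proc_args). A sketch, with the hypothetical name mpi_loose and the other values taken from our mpi PE:

$ qconf -sp mpi_loose
pe_name            mpi_loose
slots              216
user_lists         NONE
xuser_lists        NONE
start_proc_args    /import/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /import/gridengine/mpi/stopmpi.sh
allocation_rule    4
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

Jobs submitted with -pe mpi_loose are not affected by the reschedule_unknown_list problem, at the price of losing the runaway-process control that tight integration gives us.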

comment:6 Changed 4 years ago by dlove

The need is the following (from the beginning of my first message):

[...] jobs that were voluntarily suspended (with the queue suspend threshold feature based on a custom complex) to leave room for other higher priority jobs.

Once the high priority job(s) (in a business sense) have finished, the suspended jobs can be restarted from where they were suspended.

Is there some reason not to use a subordinate queue?

I confess I'm not very familiar with that, as I threw subordinate queues out
here long ago and we can't afford to tie up the resources of suspended jobs.

In the example I gave in my first message to reproduce the bug, I left out this
"automatically decide to make room for an incoming high priority job" part and
simply replaced it with a voluntary manual command that leads to the same
problem we are seeing:

qmod -sj JOB_ID

That's probably what was confusing me. I don't expect job suspension
and queue suspension to be the same.

I'll have another look at it when I can.

comment:7 Changed 4 years ago by matthieu

Is there some reason not to use a subordinate queue?

Well, essentially subordinate queue instances are suspended per host based on the number of slots occupied by jobs in the higher priority queue instances. That didn't give us the required flexibility. What we wanted to achieve was to target one specific job running in lowp.q based on business ($$$) criteria. The infrastructure implementing this decision-making process is pretty complex and hackish (it involves flag files, cron jobs, scripts launching qmod, custom complexes monitoring license usage, etc.) but overall surprisingly reliable (after some debugging, obviously).

The lowp.q is configured with a suspend threshold triggered by a custom complex. Once the targeted job is identified, the custom complex is automatically set to 1 on the lowp.q queue instance of the targeted job's master node. The targeted job is immediately suspended, but unfortunately its slave nodes are put into this reschedule_unknown_list. What we would like is for SGE to be aware that the job was not rescheduled because of a loss of network communication but because SGE itself was asked to suspend it. See the last part of the first message for the place in the source code where, in our opinion, the logic flaw occurs.
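
To make the trigger concrete, it boils down to something like the following sketch (again assuming the hypothetical suspend_flag complex attached to lowp.q as a suspend threshold; our actual scripts set an equivalent complex):

# suspend the targeted job by tripping the threshold on its master node's queue instance
$ qconf -mattr queue complex_values suspend_flag=1 lowp.q@cn100

# later, clear the flag so the job can be resumed/rescheduled
$ qconf -mattr queue complex_values suspend_flag=0 lowp.q@cn100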

In any case, thank you very much for your consideration.

Matthieu

comment:8 Changed 4 years ago by matthieu

That's probably what was confusing me. I don't expect job suspension and queue suspension to be the same.

So, to be a little more explicit: indeed, queue suspension (used by our lowp.q "infrastructure" to target a specific job) and job suspension (the qmod -sj command used to minimally reproduce the bug) are different. But apparently both lead to the same problem of cluster nodes (hosts) being put in the reschedule_unknown_list.
