Opened 5 years ago
Last modified 5 years ago
#1567 new defect
Automatic job suspension, tight integration and bad cleanup
Reported by: matthieu
Owned by:
Priority: normal
Milestone:
Component: sge
Version: current development
Severity: major
Keywords:
Cc:
Description
Recently we have switched from a loose integration to a tight integration of our workhorse commercial simulation software in order to limit occurrences of runaway processes left on slave nodes in case of job termination failure.
Unfortunately, our lowp.q infrastructure, which worked very well before that, started refusing to run rescheduled jobs that were voluntarily suspended (with the queue suspend threshold feature based on a custom complex) to leave room for other higher priority jobs. By inspecting SGE's behavior and the source code, we think we have clearly identified the reason for this apparent bug.
Here is a minimal setup to reproduce the bug.
We have a tight integration parallel environment called mpi with its allocation rule set to 4:
$ qconf -sp mpi
pe_name            mpi
slots              216
user_lists         NONE
xuser_lists        NONE
start_proc_args    /import/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /import/gridengine/mpi/stopmpi.sh
allocation_rule    4
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
We define a checkpointing environment:
$ qconf -sckpt ckpt-Test
ckpt_name          ckpt-Test
interface          APPLICATION-LEVEL
ckpt_command       NONE
migr_command       /import/gridengine/poi/scripts/Ckpt/test-suspend.sh
restart_command    NONE
clean_command      NONE
ckpt_dir           NONE
signal             NONE
when               xsr
The test-suspend.sh is a no-op:
$ cat test-suspend.sh
#!/bin/sh
exit 0
We define a test queue with instances on 2 nodes (cn100 and cn101) of 8 slots each. We attach the parallel environment and the checkpointing environment previously defined:
$ qconf -sq lowp.q_test
qname                 lowp.q_test
hostlist              @cn100-101
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:02:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             ckpt-Test
pe_list               mpi
rerun                 TRUE
slots                 8
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
We run a waiting loop on 8 slots (so 4 slots on each node):
qsub -pe mpi 8 -ckpt ckpt-Test -q lowp.q_test waiting-loop.sh
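The content of waiting-loop.sh is not reproduced here; for the purpose of following along, any script that simply keeps the job alive will do. A minimal stand-in (assumed, not the actual script) could be:

$ cat waiting-loop.sh
#!/bin/sh
# Placeholder busy-wait: keep the job alive until it is suspended or deleted.
while true; do
    sleep 60
done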
Let's say the JOB_ID is 1000. It is scheduled and eventually runs as expected on 2 nodes of 4 slots each with cn100 being the master node and cn101 the slave. Then we decide to suspend the job:
qalter -sj 1000
After some delay, the job is actually suspended. As there is no other job running in lowp.q_test, and this is the only queue defined on those 2 nodes, we would expect the job to be rescheduled and to run again immediately. Unfortunately, it waits forever, and qstat gives:
$ qstat -j 1000
# ...
cannot run on host "cn101" until clean up of an previous run has finished
cannot run in PE "mpi" because it only offers 4 slots
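As an aside, these per-job scheduling messages are only collected when schedd_job_info is enabled in the scheduler configuration; if that section of qstat -j is empty, the parameter is worth checking:

$ qconf -ssconf | grep schedd_job_info
schedd_job_info                   true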
The job was a waiting loop, so technically it never started any processes on the slave node (cn101). It is somewhat surprising that it might have left anything to be cleaned up on cn101...
If we try to schedule a second job on the same queue:
qsub -pe mpi 8 -ckpt ckpt-Test -q lowp.q_test waiting-loop.sh
It is never scheduled either (for the same reason as the rescheduled job) until we qdel the first job.
The reason SGE thinks it needs to "clean up" the first job is that its JOB_ID is included in the reschedule_unknown_list of the slave node:
$ cat /import/gridengine/poi/spool/qmaster/exec_hosts/cn101
# Version: 2011.11p1
#
# DO NOT MODIFY THIS FILE MANUALLY!
#
hostname                  cn101
load_scaling              NONE
complex_values            exclusive=1
load_values               arch=linux-x64,num_proc=8,mem_total=64321.257812M,swap_total=32255.992188M,virtual_total=96577.250000M
processors                8
reschedule_unknown_list   1000=1=8
user_lists                NONE
xuser_lists               NONE
projects                  NONE
xprojects                 NONE
usage_scaling             NONE
report_variables          NONE
If we look at the code in https://arc.liv.ac.uk/trac/SGE/browser/sge/source/daemons/qmaster/sge_give_jobs.c#L1126, starting at line 1126:
case COMMIT_ST_RESCHEDULED:
case COMMIT_ST_USER_RESCHEDULED:
case COMMIT_ST_FAILED_AND_ERROR:
   // ...
the if block further down (line 1142):
if (pe && lGetBool(pe, PE_control_slaves)) {
makes sense, in our opinion, only if the job was rescheduled because qmaster lost network communication with one of the job's slave nodes and decided, after some configured delay (reschedule_unknown), to reschedule the job on other nodes.
This is not our main use case. We want to reschedule our jobs arbitrarily, and we have no problem rescheduling them on the same nodes as before. The if block should discriminate, if that is possible, between the voluntary suspension case (queue suspension due to some load threshold, or user suspension) and "accidental" rescheduling due to a loss of communication with one of the compute nodes.
Best regards
Change History (8)
comment:1 in reply to: ↑ description Changed 5 years ago by matthieu
comment:2 Changed 5 years ago by dlove
Thanks for the diagnosis. I'll look at it as soon as I can.
comment:3 Changed 5 years ago by bougui
Hello Mr Love,
Any news on this issue? Is there any way we can work around it?
Could there be a CLI command that would allow us to clear the state on a node where we have previously suspended a job?
Maybe a way to remove a node from the reschedule_unknown_list?
TIA.
Guillaume
comment:4 Changed 5 years ago by dlove
Any news on this issue? Is there any way we can work around it?
Sorry I neglected that. I see I added a comment in the code but wasn't
sure I understood.
Looking at it again, I'm definitely confused: Why do you expect the job
to be rescheduled when it is suspended? Was there a step missing in the
description? (For tight integration I wouldn't expect it to be relevant
whether or not there was actually a process running on a slave node.)
Could there be a CLI command that would allow us to clear the state on a node
where we have previously suspended a job?
Maybe a way to remove a node from the reschedule_unknown_list?
Not as far as I know.
comment:5 Changed 5 years ago by matthieu
Hello,
The need is the following (from the beginning of the first message):
[...] jobs that were voluntarily suspended (with the queue suspend threshold feature based on a custom complex) to leave room for other higher priority jobs.
Once the high priority job(s) (high priority in a business sense) have finished, the suspended jobs can be restarted from where they were suspended.
In the example I gave in my first message to reproduce the bug, I left out this "automatically decide to make room for an incoming high priority job" part; I just replaced it with a voluntary manual command that leads to the same problem we are seeing:
qmod -sj JOB_ID
In that example I would expect the job to restart immediately, which is what happens when the job is loosely integrated. Admittedly, in that test case it is not really useful; again, it was just a minimal way of reproducing the bug.
This is a real show stopper for us because, on a small cluster (a dozen nodes or so), once SGE has suspended several tightly integrated jobs, all waiting jobs (the very important ones and the low priority ones) are blocked from further execution, because all our nodes end up in that reschedule_unknown_list.
So the workaround we have now is that all jobs that are candidates for suspension have to be launched in a loosely integrated parallel environment. That is acceptable, but because those jobs may be stopped and restarted several times (sometimes 10 to 20 times), they are the ones most at risk of ending up in a runaway state. That is why those low priority suspendable jobs would benefit the most from being launched with tightly integrated PEs.
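For illustration, such a loosely integrated PE is essentially the mpi PE shown in the first message with control_slaves set to FALSE (and without the -catch_rsh option of the start script); the name mpi_loose below is made up:

$ qconf -sp mpi_loose
pe_name            mpi_loose
slots              216
user_lists         NONE
xuser_lists        NONE
start_proc_args    /import/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /import/gridengine/mpi/stopmpi.sh
allocation_rule    4
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE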
comment:6 Changed 5 years ago by dlove
The necessity is the following (beginning of first message):
[...] jobs that were voluntarily suspended (with the queue suspend
threshold feature based on a custom complex) to leave room for other
higher priority jobs.
Once the high priority job(s) (in a business sense) have finished,
suspended jobs can be restarted from where they were suspended
Is there some reason not to use a subordinate queue?
I confess I'm not very familiar with that, as I threw subordinate queues
out here long ago; we can't afford to tie up the resources of suspended
jobs.
In the example I gave in my first message to reproduce the bug, I left out
this "automatically decide to make room for an incoming high priority job"
part; I just replaced it with a voluntary manual command that leads to the
same problem we are seeing:
qmod -sj JOB_ID
That's probably what was confusing me. I don't expect job suspension
and queue suspension to be the same.
I'll have another look at it when I can.
comment:7 Changed 5 years ago by matthieu
Is there some reason not to use a subordinate queue?
Well, essentially subordinate queue instances are suspended per host, based on the number of slots occupied by jobs in the higher priority queue instances. That didn't give us the required flexibility. What we wanted was to target one specific job running in lowp.q based on business ($$$) criteria. The infrastructure implementing this decision making is fairly complex and hackish (it involves flag files, cron jobs, scripts launching qmod, custom complexes monitoring license usage, etc.) but overall surprisingly reliable (after some debugging, obviously).
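For reference, the subordinate queue mechanism is driven by the subordinate_list attribute of the higher priority queue; a made-up example, where lowp.q_test would be suspended on a host once 4 slots of a hypothetical highp.q are occupied there:

$ qconf -sq highp.q | grep subordinate_list
subordinate_list      lowp.q_test=4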
The lowp.q is configured with a suspend threshold triggered by a custom complex. Once the targeted job is identified, the custom complex is automatically set to 1 on the lowp.q queue instance hosting the targeted job's master task. Immediately, the targeted job is suspended. But unfortunately, the slave nodes are put into this reschedule_unknown_list. What we would like is for SGE to be aware that the job was not suspended due to a loss of network communication, but because SGE itself was asked to suspend it. See the last part of the first message for where, in our view, the flaw in the source code logic occurs.
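Roughly, the moving parts look like the following sketch (the complex name lowp_suspend and the exact values are made up; our real setup uses different names):

$ qconf -sq lowp.q | grep suspend_thresholds
suspend_thresholds    lowp_suspend=1

# The decision-taking process then raises the complex on the queue
# instance holding the targeted job's master task, which suspends it:
$ qconf -mattr queue complex_values lowp_suspend=1 lowp.q@cn100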
In any case, thank you very much for your consideration.
Matthieu
comment:8 Changed 5 years ago by matthieu
That's probably what was confusing me. I don't expect job suspension and queue suspension to be the same.
So, to be a bit more explicit: indeed, queue suspension (used by our lowp.q "infrastructure" to target a specific job) and job suspension (the qmod -sj command used to minimally reproduce the bug) are different. But apparently both lead to the same problem of cluster nodes (hosts) being put in the reschedule_unknown_list.
Just a small correction to the previous message. To suspend a job, we do not use
Replying to matthieu:
but