Opened 11 years ago
Closed 10 years ago
#775 closed defect (fixed)
IZ3233: slotwise preemption fails to unsuspend one job per host
Reported by: | stephendennis | Owned by: |
---|---|---|---
Priority: | normal | Milestone: |
Component: | sge | Version: | current
Severity: | minor | Keywords: | scheduling
Cc: | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3233]
    Issue #: 3233                       Platform: All
    Reporter: stephendennis (stephendennis)
    Component: gridengine               OS: All
    Subcomponent: scheduling            Version: current
    CC: None defined                    Status: NEW
    Priority: P3                        Resolution:
    Issue type: DEFECT                  Target milestone: ---
    Assigned to: andreas (andreas)      QA Contact: andreas
    URL:
    Summary: slotwise preemption fails to unsuspend one job per host
    Status whiteboard:                  Attachments:
    Issue 3233 blocks:                  Votes for issue 3233: 6
    Opened: Wed Jan 27 13:59:00 -0700 2010

The following simple sequence leaves one job per host suspended after the superordinate jobs have completed. A terminal capture demonstrating the bug follows the details below. Changes to the subordinate_list of the form

    subordinate_list      slots=4(low:0:sr), \
                          [@2slot=slots=2(low:0:sr)], \
                          [@4slot=slots=4(low:0:sr)]

show the same flaw. The bug is also demonstrable with multiple execution hosts.

    sd@pursuit:~$ qstat -help | head -1
    SGE 6.2u5
    sd@pursuit:~$ qconf -sq low | grep -v INFIN | grep -v NONE
    qname                 low
    hostlist              @allhosts
    seq_no                0
    load_thresholds       np_load_avg=1.75
    nsuspend              1
    suspend_interval      00:05:00
    priority              0
    min_cpu_interval      00:05:00
    processors            UNDEFINED
    qtype                 BATCH INTERACTIVE
    pe_list               make
    rerun                 FALSE
    slots                 3
    tmpdir                /tmp
    shell                 /bin/csh
    shell_start_mode      posix_compliant
    notify                00:00:60
    initial_state         default
    sd@pursuit:~$ qconf -sq high | grep -v INFIN | grep -v NONE
    qname                 high
    hostlist              @allhosts
    seq_no                0
    load_thresholds       np_load_avg=1.75
    nsuspend              1
    suspend_interval      00:05:00
    priority              0
    min_cpu_interval      00:05:00
    processors            UNDEFINED
    qtype                 BATCH INTERACTIVE
    pe_list               make
    rerun                 FALSE
    slots                 3
    tmpdir                /tmp
    shell                 /bin/csh
    shell_start_mode      posix_compliant
    notify                00:00:60
    subordinate_list      slots=3(low:1:sr)
    initial_state         default
    sd@pursuit:~$ qstat -f
    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    all.q@pursuit                  BIP   0/0/2          0.22     lx24-amd64
    ---------------------------------------------------------------------------------
    high@pursuit                   BIP   0/0/3          0.22     lx24-amd64
    ---------------------------------------------------------------------------------
    low@pursuit                    BIP   0/0/3          0.22     lx24-amd64
    sd@pursuit:~$ for i in `seq 1 3` ; do qsub -b y -q low sleep 1000; done
    Your job 14 ("sleep") has been submitted
    Your job 15 ("sleep") has been submitted
    Your job 16 ("sleep") has been submitted
    sd@pursuit:~$ for i in `seq 1 9` ; do qsub -b y -q high sleep 30; done
    Your job 17 ("sleep") has been submitted
    Your job 18 ("sleep") has been submitted
    Your job 19 ("sleep") has been submitted
    Your job 20 ("sleep") has been submitted
    Your job 21 ("sleep") has been submitted
    Your job 22 ("sleep") has been submitted
    Your job 23 ("sleep") has been submitted
    Your job 24 ("sleep") has been submitted
    Your job 25 ("sleep") has been submitted
    sd@pursuit:~$ qstat
    job-ID  prior    name   user  state  submit/start at      queue         slots  ja-task-ID
    ------------------------------------------------------------------------------------------
         14 0.55500  sleep  sd    S      12/03/2009 16:43:23  low@pursuit       1
         15 0.55500  sleep  sd    S      12/03/2009 16:43:23  low@pursuit       1
         16 0.55500  sleep  sd    S      12/03/2009 16:43:23  low@pursuit       1
         17 0.55500  sleep  sd    r      12/03/2009 16:43:55  high@pursuit      1
         18 0.55500  sleep  sd    r      12/03/2009 16:43:55  high@pursuit      1
         19 0.55500  sleep  sd    r      12/03/2009 16:43:55  high@pursuit      1
         20 0.00000  sleep  sd    qw     12/03/2009 16:43:52                    1
         21 0.00000  sleep  sd    qw     12/03/2009 16:43:52                    1
         22 0.00000  sleep  sd    qw     12/03/2009 16:43:52                    1
         23 0.00000  sleep  sd    qw     12/03/2009 16:43:52                    1
         24 0.00000  sleep  sd    qw     12/03/2009 16:43:52                    1
         25 0.00000  sleep  sd    qw     12/03/2009 16:43:52                    1
    sd@pursuit:~$ sleep 300;qstat
    sd@pursuit:~$ qstat
    job-ID  prior    name   user  state  submit/start at      queue         slots  ja-task-ID
    ------------------------------------------------------------------------------------------
         14 0.55500  sleep  sd    S      12/03/2009 16:43:23  low@pursuit       1
         15 0.55500  sleep  sd    r      12/03/2009 16:43:23  low@pursuit       1
         16 0.55500  sleep  sd    r      12/03/2009 16:43:23  low@pursuit       1
    sd@pursuit:~$ sleep 300;qstat
    job-ID  prior    name   user  state  submit/start at      queue         slots  ja-task-ID
    ------------------------------------------------------------------------------------------
         14 0.55500  sleep  sd    S      12/03/2009 16:43:23  low@pursuit       1
         15 0.55500  sleep  sd    r      12/03/2009 16:43:23  low@pursuit       1
         16 0.55500  sleep  sd    r      12/03/2009 16:43:23  low@pursuit       1

------- Additional comments from aja Thu Jan 28 08:30:07 -0700 2010 -------

I couldn't reproduce the issue when the job was deleted by qdel. In that case slotwise preemption behaves as it should: the number of running jobs corresponds to the number of slots defined by the subordinate_list rule, regardless of whether the simple rule (slots=2(low:0:sr)) or a more complex rule (e.g. all.q,[@2slot=slots=2(low:0:sr)],[@4slot=slots=4(low:0:sr)]) is configured.

However, when the job finishes properly, which is the most common case, and there are suspended jobs that should be unsuspended, only (slots - 1) jobs end up running in total. This is not the expected behavior and should be fixed.
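A quick way to tell which of the two cases described above applies is to count the low-queue jobs per state once all high-queue jobs are gone. The snippet below is a minimal sketch, assuming the low/high queue setup from the capture (high subordinates low via slots=3(low:1:sr), both queues have 3 slots); it uses only standard qstat options.

```sh
# Minimal sketch, assuming the "low"/"high" queues configured as in the capture.
# Run it after the last high-queue job has finished normally.
running=$(qstat -q low -s r | tail -n +3 | wc -l)     # low-queue jobs currently running
suspended=$(qstat -q low -s s | tail -n +3 | wc -l)   # low-queue jobs still suspended
echo "low queue: $running running, $suspended suspended"
# Expected: 3 running, 0 suspended. With the bug present, one job per host
# stays suspended, i.e. only (slots - 1) jobs run.
```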
------- Additional comments from stephendennis Tue Jul 20 20:10:44 -0700 2010 -------

Issue 3207 is probably a duplicate of this one. Alena already created a one-line fix for this issue in January; it has been in production for some time. Was the fix included in 6.2u6?

------- Additional comments from rayson Tue Jul 20 20:18:43 -0700 2010 -------

Stephen, can you point me at the one-line fix? The fix is in SGE 6.2u6:

    3233  6920926  slotwise preemption fails to unsuspend one job per host

But many people are still running SGE 6.2u5 because SGE 6.2u6 is not free.

Thanks,
Rayson
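For anyone checking whether their installation (for example a cluster still on SGE 6.2u5) shows this behavior, the capture in the report boils down to roughly the following. This is a reproduction sketch only, assuming queues named low and high configured as shown above.

```sh
#!/bin/sh
# Reproduction sketch based on the capture in the report; assumes a queue
# "high" with subordinate_list slots=3(low:1:sr) and a queue "low", each
# with 3 slots on the execution host.
for i in `seq 1 3`; do qsub -b y -q low sleep 1000; done   # fill the low queue
for i in `seq 1 9`; do qsub -b y -q high sleep 30; done    # slotwise-suspend the low jobs
sleep 300                                                  # let every high job finish normally
qstat                                                      # with the bug, one low job remains in state "S"
```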
Change History (1)
comment:1 Changed 10 years ago by dlove
- Resolution set to fixed
- Severity set to minor
- Status changed from new to closed
Fixed by [3549].