Opened 9 years ago

Closed 8 years ago

#775 closed defect (fixed)

IZ3233: slotwise preemption fails to unsuspend one job per host

Reported by:  stephendennis      Owned by:
Priority:     normal             Milestone:
Component:    sge                Version:    current
Severity:     minor              Keywords:   scheduling
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3233]

         Issue #:  3233             Platform:          All
        Component: gridengine       OS:                All
     Subcomponent: scheduling       Version:           current
           Status: NEW              Priority:          P3
       Resolution:                  Issue type:        DEFECT
                                    Target milestone:  ---
         Reporter: stephendennis (stephendennis)
               CC: None defined
      Assigned to: andreas (andreas)
       QA Contact: andreas
              URL:
        * Summary: slotwise preemption fails to unsuspend one job per host
Status whiteboard:
      Attachments:

     Issue 3233 blocks:
   Votes for issue 3233:  6


   Opened: Wed Jan 27 13:59:00 -0700 2010 
------------------------


The following simple sequence leaves one job per host
suspended after the superordinate jobs have completed.

A terminal capture demonstrating the bug follows the
configuration details below.

Changes to the subordinate_list of the form

 subordinate_list      slots=4(low:0:sr), \
                       [@2slot=slots=2(low:0:sr)], \
                       [@4slot=slots=4(low:0:sr)]

show the same flaw.

The bug is also demonstrable with multiple execution hosts.
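
For reference, each slotwise preemption rule has the form slots=<threshold>(<queue>:<seq_no>:<action>), where the sr action suspends the shortest-running job in the subordinate queue once more than <threshold> slots are in use on a host (see queue_conf(5)). As a minimal sketch, the rule used in the capture below could be attached to the superordinate queue like this; the exact qconf invocation is an assumption, not part of the original report:

 # hypothetical: attach the slotwise rule to queue "high" so that jobs in
 # "low" are suspended (shortest-running first) once more than 3 slots
 # are busy on a host
 qconf -mattr queue subordinate_list "slots=3(low:1:sr)" high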

 sd@pursuit:~$ qstat -help | head -1
 SGE 6.2u5
 sd@pursuit:~$ qconf -sq low |grep -v INFIN|grep -v NONE
 qname                 low
 hostlist              @allhosts
 seq_no                0
 load_thresholds       np_load_avg=1.75
 nsuspend              1
 suspend_interval      00:05:00
 priority              0
 min_cpu_interval      00:05:00
 processors            UNDEFINED
 qtype                 BATCH INTERACTIVE
 pe_list               make
 rerun                 FALSE
 slots                 3
 tmpdir                /tmp
 shell                 /bin/csh
 shell_start_mode      posix_compliant
 notify                00:00:60
 initial_state         default
 sd@pursuit:~$ qconf -sq high |grep -v INFIN|grep -v NONE
 qname                 high
 hostlist              @allhosts
 seq_no                0
 load_thresholds       np_load_avg=1.75
 nsuspend              1
 suspend_interval      00:05:00
 priority              0
 min_cpu_interval      00:05:00
 processors            UNDEFINED
 qtype                 BATCH INTERACTIVE
 pe_list               make
 rerun                 FALSE
 slots                 3
 tmpdir                /tmp
 shell                 /bin/csh
 shell_start_mode      posix_compliant
 notify                00:00:60
 subordinate_list      slots=3(low:1:sr)
 initial_state         default
 sd@pursuit:~$ qstat -f
 queuename                      qtype resv/used/tot. load_avg arch          states
 ---------------------------------------------------------------------------------
 all.q@pursuit                  BIP   0/0/2          0.22     lx24-amd64
 ---------------------------------------------------------------------------------
 high@pursuit                   BIP   0/0/3          0.22     lx24-amd64
 ---------------------------------------------------------------------------------
 low@pursuit                    BIP   0/0/3          0.22     lx24-amd64
 sd@pursuit:~$ for i in `seq 1 3` ; do qsub -b y -q low sleep 1000; done
 Your job 14 ("sleep") has been submitted
 Your job 15 ("sleep") has been submitted
 Your job 16 ("sleep") has been submitted
 sd@pursuit:~$ for i in `seq 1 9` ; do qsub -b y -q high sleep 30; done
 Your job 17 ("sleep") has been submitted
 Your job 18 ("sleep") has been submitted
 Your job 19 ("sleep") has been submitted
 Your job 20 ("sleep") has been submitted
 Your job 21 ("sleep") has been submitted
 Your job 22 ("sleep") has been submitted
 Your job 23 ("sleep") has been submitted
 Your job 24 ("sleep") has been submitted
 Your job 25 ("sleep") has been submitted
 sd@pursuit:~$ qstat
 job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
 -----------------------------------------------------------------------------------------------------------------
      14 0.55500 sleep      sd           S     12/03/2009 16:43:23 low@pursuit                        1
      15 0.55500 sleep      sd           S     12/03/2009 16:43:23 low@pursuit                        1
      16 0.55500 sleep      sd           S     12/03/2009 16:43:23 low@pursuit                        1
      17 0.55500 sleep      sd           r     12/03/2009 16:43:55 high@pursuit                       1
      18 0.55500 sleep      sd           r     12/03/2009 16:43:55 high@pursuit                       1
      19 0.55500 sleep      sd           r     12/03/2009 16:43:55 high@pursuit                       1
      20 0.00000 sleep      sd           qw    12/03/2009 16:43:52                                    1
      21 0.00000 sleep      sd           qw    12/03/2009 16:43:52                                    1
      22 0.00000 sleep      sd           qw    12/03/2009 16:43:52                                    1
      23 0.00000 sleep      sd           qw    12/03/2009 16:43:52                                    1
      24 0.00000 sleep      sd           qw    12/03/2009 16:43:52                                    1
      25 0.00000 sleep      sd           qw    12/03/2009 16:43:52                                    1
 sd@pursuit:~$ sleep 300;qstat
 sd@pursuit:~$ qstat
 job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
 -----------------------------------------------------------------------------------------------------------------
      14 0.55500 sleep      sd           S     12/03/2009 16:43:23 low@pursuit                        1
      15 0.55500 sleep      sd           r     12/03/2009 16:43:23 low@pursuit                        1
      16 0.55500 sleep      sd           r     12/03/2009 16:43:23 low@pursuit                        1
 sd@pursuit:~$ sleep 300;qstat
 job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
 -----------------------------------------------------------------------------------------------------------------
      14 0.55500 sleep      sd           S     12/03/2009 16:43:23 low@pursuit                        1
      15 0.55500 sleep      sd           r     12/03/2009 16:43:23 low@pursuit                        1
      16 0.55500 sleep      sd           r     12/03/2009 16:43:23 low@pursuit                        1
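
Until a fixed build is deployed, the leftover suspended job (job 14 above) can be cleared by hand. A minimal workaround sketch, assuming the queue name from the capture and that no superordinate jobs remain; qmod -us lifts the suspension:

 # hypothetical workaround: unsuspend every job still in state "S" in queue
 # "low" (qstat -s s lists suspended jobs; awk skips the two header lines)
 for j in `qstat -q low -s s | awk 'NR > 2 {print $1}'`; do qmod -us $j; done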

   ------- Additional comments from aja Thu Jan 28 08:30:07 -0700 2010 -------
I couldn't reproduce the issue when the job was deleted with qdel. In that case slotwise preemption behaves as it
should: the number of running jobs matches the number of slots defined by the subordinate_list rule, regardless of
whether a simple rule (slots=2(low:0:sr)) or a more complex one
(e.g. all.q,[@2slot=slots=2(low:0:sr)],[@4slot=slots=4(low:0:sr)]) is configured.

However, when a job finishes normally, which is the most common case, and there are suspended jobs that should be
unsuspended, only (slots - 1) jobs end up running in total. This is not the expected behavior and should be fixed.
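
The off-by-one is easy to confirm from the shell; a quick check, with the queue name taken from the capture above and the counting logic an assumption:

 # count running jobs in the subordinate queue: with slots=3 this should
 # print 3 once the high jobs drain, but the bug leaves it at 2
 qstat -q low -s r | awk 'NR > 2' | wc -l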

   ------- Additional comments from stephendennis Tue Jul 20 20:10:44 -0700 2010 -------
Issue 3207 is probably a duplicate of this one.

Alena already created a one-line fix for this issue in January.
It has been in production for some time.
Was the fix included in 6.2u6?

   ------- Additional comments from rayson Tue Jul 20 20:18:43 -0700 2010 -------
Stephen, can you point me at the one-line fix?

The fix is in SGE6.2u6:
3233 6920926 slotwise preemption fails to unsuspend one job per host

But many people are still running SGE6.2u5 because SGE6.2u6 is not free.

Thanks,
Rayson

Change History (1)

comment:1 Changed 8 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed

Fixed by [3549].
