Opened 11 years ago

Closed 7 years ago

#532 closed defect (fixed)

IZ2628: Tasks held with array dependency may get deleted prematurely

Reported by: johna Owned by:
Priority: normal Milestone:
Component: sge Version: 6.2beta
Severity: minor Keywords: qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2628]

Issue #: 2628
Summary: Tasks held with array dependency may get deleted prematurely
Component: gridengine
Subcomponent: qmaster
Version: 6.2beta
Platform: All
OS: All
Status: NEW
Priority: P3
Issue type: DEFECT
Target milestone: ---
Reporter: johna (johna)
Assigned to: ernst (ernst)
QA Contact: ernst
CC: None defined


   Opened: Tue Jun 24 21:25:00 -0700 2008 
------------------------


It seems that tasks in the JB_ja_a_h_ids hold range can be ignored, so the
parent job is deleted before those tasks are scheduled to run.

This bug does not appear in the ARI branch and seems to occur only when the
dependent job, held with the -hold_jid_ad option, has a higher priority than
the job it depends on. The QA testing procedure therefore probably does not
detect this issue, since it probably does not submit the jobs with different
priorities.

This can be reproduced as follows (aimk options are '-spool-classic -parallel 3
-no-dump -debug -no-secure -no-jni -no-java'):

[root@xen-grid1 johna]# qsub -t 1-10 -p -100 -b y /bin/sleep 20
Your job-array 1.1-10:1 ("sleep") has been submitted
[root@xen-grid1 johna]# qsub -t 1-10 -p 100 -hold_jid_ad 1 -b y /bin/sleep 20
Your job-array 2.1-10:1 ("sleep") has been submitted
[root@xen-grid1 johna]# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.50617 sleep      root         r     06/25/2008 12:52:18 all.q@xen-grid1.rsp.com.au         1 1
      1 0.00000 sleep      root         qw    06/25/2008 12:52:13                                    1 2-10:1
      2 0.00000 sleep      root         hqw   06/25/2008 12:52:20                                    1 1-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.50617 sleep      root         qw    06/25/2008 12:52:13                                    1 2-10:1
      2 0.00000 sleep      root         qw    06/25/2008 12:52:20                                    1 1
      2 0.00000 sleep      root         hqw   06/25/2008 12:52:20                                    1 2-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      2 0.60383 sleep      root         r     06/25/2008 12:52:48 all.q@xen-grid1.rsp.com.au         1 1
      1 0.50617 sleep      root         qw    06/25/2008 12:52:13                                    1 2-10:1
      2 0.00000 sleep      root         hqw   06/25/2008 12:52:20                                    1 2-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      2 0.00000 sleep      root         hqw   06/25/2008 12:52:20                                    1 2-10:1
      1 0.50617 sleep      root         qw    06/25/2008 12:52:13                                    1 2-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.50617 sleep      root         r     06/25/2008 12:53:18 all.q@xen-grid1.rsp.com.au         1 2
      1 0.50617 sleep      root         qw    06/25/2008 12:52:13                                    1 3-10:1

The end result is that job 2 is "gone" even though its tasks 2-10 are still
held by the array dependency (AD). A preliminary investigation on MT found some
missing code lines in sge_job_qmaster.c, but I have not yet been able to
isolate the defect.
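
The suspected failure mode can be illustrated with a small toy model. This is only a sketch of the hypothesis stated above (a completion check that overlooks the array-dependency hold range), not the actual qmaster code. The class ArrayJob and all of its methods and attributes are hypothetical names invented for illustration; only the idea of a separate hold range for array-dependency tasks (cf. JB_ja_a_h_ids) comes from this report.

```python
from dataclasses import dataclass, field


@dataclass
class ArrayJob:
    """Toy stand-in for an array job; NOT the real qmaster data structure."""
    job_id: int
    pending: set = field(default_factory=set)   # tasks eligible for scheduling
    running: set = field(default_factory=set)   # tasks currently executing
    ad_held: set = field(default_factory=set)   # tasks held by an array dependency

    def release(self, task_id):
        """Release one held task once the matching predecessor task finished."""
        if task_id in self.ad_held:
            self.ad_held.remove(task_id)
            self.pending.add(task_id)

    def start(self, task_id):
        self.pending.remove(task_id)
        self.running.add(task_id)

    def finish(self, task_id):
        self.running.remove(task_id)

    def is_done_buggy(self):
        # Hypothetical faulty check: ignores the array-dependency hold range.
        return not self.pending and not self.running

    def is_done(self):
        # Intended check: held tasks still count as outstanding work.
        return not self.pending and not self.running and not self.ad_held


# Job 2 from the transcript: tasks 1-10, all initially held on job 1.
job2 = ArrayJob(job_id=2, ad_held=set(range(1, 11)))
job2.release(1)   # task 1 of job 1 has finished
job2.start(1)     # job 2 has higher priority, so its task 1 runs next
job2.finish(1)

print(job2.is_done_buggy())  # True: the job would be deleted with tasks 2-10 held
print(job2.is_done())        # False: the job must be kept
```

In this model the window opens when the released task of job 2 finishes before any further task of job 1 does, which is consistent with the observation that the problem only shows up when the dependent job has the higher priority.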

Change History (1)

comment:1 Changed 7 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed

JA-2008-07-11-0
