Opened 11 years ago

Closed 9 years ago

#802 closed defect (worksforme)

IZ3265: array jobs with PE and dependencies killing qmaster

Reported by: kisielk Owned by:
Priority: normal Milestone:
Component: sge Version: 6.2u5
Severity: minor Keywords: qmaster


[Imported from gridengine issuezilla]

        Issue #:      3265             Platform:     All      Reporter: kisielk (kisielk)
       Component:     gridengine          OS:        All
     Subcomponent:    qmaster          Version:      6.2u5       CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
       * Summary:     array jobs with PE and dependencies killing qmaster
   Status whiteboard:

     Issue 3265 blocks:
   Votes for issue 3265:

   Opened: Mon Apr 26 09:41:00 -0700 2010 

I'm able to reproduce this rather consistently in my 6.2u5 install.

If an array job that uses a PE is submitted and other jobs depend on it, the qmaster process crashes while the tasks in the array job are completing.

The messages log shows:

04/26/2010 09:27:57|worker|master|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!
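
To check a messages file for this kind of entry, the pipe-delimited line above can be split into fields. The following is a minimal sketch, not part of the original report. The field names (time, thread, host, level, text) are assumptions inferred from the single sample line, as is the reading of "C" as the critical level.

```python
from datetime import datetime

# The log line quoted in this ticket, used as sample input.
SAMPLE = ("04/26/2010 09:27:57|worker|master|C|"
          "!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!")


def parse_message_line(line):
    """Split one messages-log line into named fields.

    Field names are assumptions based on the sample line only.
    """
    # Split on the first four pipes so a "|" inside the message text survives.
    stamp, thread, host, level, text = line.rstrip("\n").split("|", 4)
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "thread": thread,
        "host": host,
        "level": level,
        "text": text,
    }


def critical_lines(lines):
    """Yield parsed entries whose level field is "C"; skip unparsable lines."""
    for line in lines:
        try:
            entry = parse_message_line(line)
        except ValueError:
            # Too few fields or an unexpected timestamp format.
            continue
        if entry["level"] == "C":
            yield entry


if __name__ == "__main__":
    for entry in critical_lines([SAMPLE]):
        print(entry["time"].isoformat(), entry["text"])
```

In practice the list passed to critical_lines would be the open messages file of the qmaster spool rather than the sample line.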

Restarting the qmaster just causes it to crash again. Sometimes there is enough time for me to fire off a qdel, but other times I have to manually delete the job directory in the qmaster spool.

I have a copy of the spool directory of a job that exhibits this behaviour if that would help in diagnosing the problem.

   ------- Additional comments from kisielk Mon Apr 26 10:45:58 -0700 2010 -------
We did some further experiments. It seems this only happens if the dependent job is also an array job that uses -hold_jid_ad to depend on the job using a PE. If the dependent job uses just -hold_jid, there is no problem.
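
The combination described in this comment can be written out as a pair of qsub invocations. The sketch below only builds and prints the command lines and does not submit anything, since on an affected 6.2u5 install the submission is reported to crash the qmaster. The PE name "mpi", the slot count, the task range, the job names and the sleep payload are all placeholders, not the reporter's actual jobs.

```python
import shlex


def build_repro_commands(pe_name="mpi", slots=2, tasks="1-10"):
    """Return (parent, child) qsub argument lists for the reported scenario.

    All default values are hypothetical placeholders.
    """
    # Parent: an array job (-t) that requests a parallel environment (-pe).
    parent = ["qsub", "-N", "parent", "-t", tasks,
              "-pe", pe_name, str(slots),
              "-b", "y", "sleep", "30"]
    # Child: also an array job, with an array dependency (-hold_jid_ad)
    # on the parent. Using -hold_jid here reportedly avoids the crash.
    child = ["qsub", "-N", "child", "-t", tasks,
             "-hold_jid_ad", "parent",
             "-b", "y", "sleep", "30"]
    return parent, child


if __name__ == "__main__":
    for cmd in build_repro_commands():
        # Print only; run these by hand on a test cluster if at all.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Swapping "-hold_jid_ad" for "-hold_jid" in the child command gives the variant that was reported not to trigger the problem.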

Change History (1)

comment:1 Changed 9 years ago by dlove

  • Resolution set to worksforme
  • Severity set to minor
  • Status changed from new to closed

Probably fixed by [3511]
