Opened 11 years ago
Closed 9 years ago
#802 closed defect (worksforme)
IZ3265: array jobs with PE and dependencies killing qmaster
Reported by: | kisielk | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | minor | Keywords: | qmaster |
Cc: |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3265]
Issue #: 3265
Platform: All
Reporter: kisielk (kisielk)
Component: gridengine
OS: All
Subcomponent: qmaster
Version: 6.2u5
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
* Summary: array jobs with PE and dependencies killing qmaster
Status whiteboard:
Attachments:
Issue 3265 blocks:
Votes for issue 3265:

Opened: Mon Apr 26 09:41:00 -0700 2010
------------------------

I'm able to reproduce this rather consistently in my 6.2u5 install. If an array job that uses a PE is submitted and other jobs are dependent on it, the qmaster process will crash while the tasks in the array job are completing. The messages log shows:

04/26/2010 09:27:57|worker|master|C|!!!!!!!!!! JB_ja_tasks not found in element !!!!!!!!!!

Restarting the qmaster just causes it to crash again. Sometimes there is enough time for me to fire off a qdel, but other times I have to manually delete the job directory in the qmaster spool.

I have a copy of the spool directory of a job that exhibits this behaviour, if that would help in diagnosing the problem.

------- Additional comments from kisielk Mon Apr 26 10:45:58 -0700 2010 -------

We did some further experiments. It seems this only happens if the dependent job is also an array job that uses -hold_jid_ad to depend on the job using a PE. If the dependent job uses just -hold_jid, there is no problem.
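The scenario described above could be sketched roughly as the script below. This is not the reporter's actual submission: the PE name (mpi), slot count, task range, job names, and sleep payloads are all hypothetical placeholders. Only the qsub options themselves (-t, -pe, -N, -b, -hold_jid, -hold_jid_ad) are standard. By default the script only prints the commands; it submits nothing unless SUBMIT=1 is set, and it should only be run against a test cluster, since the report says the qmaster may crash.

```shell
#!/bin/sh
# Hypothetical reproduction sketch for the reported scenario.
# All names and values below are assumptions, not taken from the report.
PE_NAME="mpi"     # assumed PE name; substitute a PE configured at your site
TASKS="1-4"       # assumed task range; the same range is used for both jobs

# Parent: an array job that requests a parallel environment.
parent_cmd() {
    echo "qsub -N pe_parent -t $TASKS -pe $PE_NAME 2 -b y sleep 30"
}

# Child: an array job that depends on the parent.
# $1 is the dependency option: -hold_jid_ad (reported to crash the qmaster)
# or -hold_jid (reported to work).
child_cmd() {
    echo "qsub -N array_child -t $TASKS $1 pe_parent -b y sleep 5"
}

if [ "${SUBMIT:-0}" = "1" ]; then
    # Actually submit; intended for a disposable test cluster only.
    eval "$(parent_cmd)"
    eval "$(child_cmd -hold_jid_ad)"
else
    # Dry run: print the commands that would be submitted.
    parent_cmd
    child_cmd -hold_jid_ad
fi
```

Replacing -hold_jid_ad with -hold_jid in the child submission gives the variant that, according to the follow-up comment, does not trigger the crash.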
Change History (1)
comment:1 Changed 9 years ago by dlove
- Resolution set to worksforme
- Severity set to minor
- Status changed from new to closed
Probably fixed by [3511]