Opened 10 years ago

#1347 new defect

reschedule_unknown settings can result in all jobs being killed

Reported by: dlove      Owned by:
Priority:    normal     Milestone:
Component:   sge        Version:   6.2u5
Severity:    major      Keywords:


See (Stuart Barkley).

To summarize:

I'm seeing an issue where SGE appears to kill all jobs, logging the
following in the qmaster messages file:

  07/09/2011 02:14:07|worker|betsy-qmaster|E| reports running job (16648.32/master) in queue "" that was not supposed to be there - killing

All jobs are killed on all nodes in the cluster.  This occurs about 15
minutes after a node dies.
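
To see how many jobs were affected, the messages file can be scanned for this error. The following is a minimal Python sketch, assuming every line has the pipe-separated layout of the line quoted above (timestamp, thread, host, level, message). The function name and the field names are illustrative only, not part of SGE.

```python
import re

# Marker text taken from the log line quoted in this ticket.
KILL_MARKER = "that was not supposed to be there - killing"
# Job id as it appears in the quoted line, e.g. "(16648.32/master)".
JOB_RE = re.compile(r"running job \(([^)]+)\)")


def find_unexpected_job_kills(lines):
    """Return (timestamp, job) tuples for every matching error line."""
    hits = []
    for line in lines:
        # Assumed layout: timestamp|thread|host|level|message
        parts = line.strip().split("|", 4)
        if len(parts) != 5:
            continue  # not a line in the assumed format
        timestamp, _thread, _host, level, message = parts
        if level != "E" or KILL_MARKER not in message:
            continue
        match = JOB_RE.search(message)
        hits.append((timestamp, match.group(1) if match else None))
    return hits


# The line quoted in this ticket, used as sample input.
sample = [
    '07/09/2011 02:14:07|worker|betsy-qmaster|E| reports running job '
    '(16648.32/master) in queue "" that was not supposed to be there - killing',
]
kills = find_unexpected_job_kills(sample)
```

Against a real file, the same function could be called with an open file object in place of the sample list. Grouping the timestamps would show whether the kills cluster around 15 minutes after the node went down.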

My settings (from qconf -sconf) are:
  load_report_time             00:00:40
  max_unheard                  00:05:00
  reschedule_unknown           00:15:00
  qmaster_params               ENABLE_RESCHEDULE_KILL=true \

Other notes:
  Running Sun SGE 6.2u5.
  Compute nodes are diskless and do not mount a shared sge_root.
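
A quick way to check whether a cluster has the same combination of settings is to inspect a saved copy of the qconf -sconf output. The sketch below is an assumption-laden illustration: it treats the output as "name value" lines with backslash continuations (as in the excerpt above), and the helper names are made up for this example. It only reports the two settings this ticket changed; it does not establish which one triggers the problem.

```python
import re


def parse_sconf(text):
    """Parse 'name value' lines, joining backslash-continued lines."""
    joined, pending = [], ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            # Continuation: keep the text and wait for the next line.
            pending += line[:-1].rstrip() + " "
            continue
        joined.append(pending + line.strip())
        pending = ""
    if pending.strip():
        joined.append(pending.strip())  # dangling continuation at end of input
    conf = {}
    for line in joined:
        fields = line.split(None, 1)
        if len(fields) == 2:
            conf[fields[0]] = fields[1].strip()
    return conf


def hms_to_seconds(value):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def flag_reschedule_settings(conf):
    """List the settings from this ticket that are away from their defaults."""
    findings = []
    interval = conf.get("reschedule_unknown", "00:00:00")
    if hms_to_seconds(interval) > 0:
        findings.append(f"reschedule_unknown is enabled ({interval})")
    tokens = re.split(r"[,\s]+", conf.get("qmaster_params", "none"))
    if any(t.upper() == "ENABLE_RESCHEDULE_KILL=TRUE" for t in tokens):
        findings.append("ENABLE_RESCHEDULE_KILL=true is set in qmaster_params")
    return findings


# Settings as quoted in this ticket, including the trailing backslash.
sample = """\
  load_report_time             00:00:40
  max_unheard                  00:05:00
  reschedule_unknown           00:15:00
  qmaster_params               ENABLE_RESCHEDULE_KILL=true \\
"""
conf = parse_sconf(sample)
findings = flag_reschedule_settings(conf)
```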

My partial solution was to restore reschedule_unknown and
qmaster_params to their default values:

  reschedule_unknown           00:00:00
  qmaster_params               none

This seems to have solved my immediate problem.  I changed both
settings at once and did not test which one was causing the problem.
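
The workaround can be prepared offline as a text edit on a saved configuration dump. The following Python sketch rewrites only the two entries named above, using the default values stated in this ticket. The function name and the column width are assumptions for illustration, and applying the result to the cluster is left to the site's usual procedure.

```python
# Default values as stated in this ticket.
TICKET_DEFAULTS = {
    "reschedule_unknown": "00:00:00",
    "qmaster_params": "none",
}


def restore_defaults(text, defaults=TICKET_DEFAULTS):
    """Return the configuration text with the listed entries reset."""
    out = []
    skipping = False  # True while dropping continuation lines of a reset entry
    for raw in text.splitlines():
        continued = raw.rstrip().endswith("\\")
        if skipping:
            skipping = continued
            continue
        fields = raw.split(None, 1)
        if fields and fields[0] in defaults:
            # Pad the name to 28 columns to match the dump's layout (assumed).
            out.append(f"{fields[0]:<28} {defaults[fields[0]]}")
            skipping = continued
        else:
            out.append(raw)
    return "\n".join(out) + "\n"


before = (
    "load_report_time             00:00:40\n"
    "max_unheard                  00:05:00\n"
    "reschedule_unknown           00:15:00\n"
    "qmaster_params               ENABLE_RESCHEDULE_KILL=true\n"
)
after = restore_defaults(before)
```

Passing a dictionary with a single entry as the defaults argument resets one setting at a time, which would help narrow down which of the two is responsible.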

Change History (0)