Opened 8 years ago

#1347 new defect

reschedule_unknown settings can result in all jobs being killed

Reported by: dlove
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.2u5
Severity: major
Keywords:
Cc:

Description

See http://gridengine.org/pipermail/users/2011-August/001454.html (Stuart Barkley).

To summarize (quoting the report):

I'm seeing an issue where SGE appears to be killing all jobs, logging
the following in the qmaster messages file:

  07/09/2011 02:14:07|worker|betsy-qmaster|E|execd@bc098.fda.gov reports running job (16648.32/master) in queue "green@bc098.fda.gov" that was not supposed to be there - killing

All jobs are killed on all nodes in the cluster.  This occurs about 15
minutes after a node dies.
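
To check whether the kills really span every node, the qmaster messages
file can be tallied per host. The following is a minimal sketch, not part
of SGE: the pattern is modelled only on the single log line quoted above
(other versions may word the message differently), and the names KILL_RE,
SAMPLE and kills_per_host are hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this ticket (assumption:
# the field layout is date time|thread|host|level|message).
KILL_RE = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|'
    r'[^|]*\|[^|]*\|E\|execd@(?P<host>\S+) reports running job '
    r'\((?P<job>[^)]+)\) in queue "(?P<queue>[^"]+)" '
    r'that was not supposed to be there - killing'
)

# The log line from this ticket, used as sample input.
SAMPLE = (
    '07/09/2011 02:14:07|worker|betsy-qmaster|E|execd@bc098.fda.gov '
    'reports running job (16648.32/master) in queue "green@bc098.fda.gov" '
    'that was not supposed to be there - killing'
)


def kills_per_host(lines):
    """Count 'not supposed to be there - killing' messages per exec host."""
    counts = Counter()
    for line in lines:
        match = KILL_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts


# Usage: pass an open messages file (or any iterable of lines).
print(dict(kills_per_host([SAMPLE, "some unrelated log line"])))
```

If the tally lists hosts other than the one that died, that would be
consistent with the cluster-wide kill described here.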

My settings (from qconf -sconf) are:
  load_report_time             00:00:40
  max_unheard                  00:05:00
  reschedule_unknown           00:15:00
  qmaster_params               ENABLE_RESCHEDULE_KILL=true \
                               ENABLE_RESCHEDULE_SLAVE=true

Other notes:
  Running Sun SGE 6.2u5.
  Compute nodes are diskless and do not mount a shared sge_root.

My partial solution was to restore reschedule_unknown and
qmaster_params to their default values:

  reschedule_unknown           00:00:00
  qmaster_params               none

This seems to have solved my immediate problem.  I changed both
settings at once, so I have not determined which one was causing
the problem.
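
To see whether a cluster still carries the settings implicated here, the
output of qconf -sconf can be compared against the two default values
quoted above. This is a rough sketch under stated assumptions: it treats
the text as "name value" lines with backslash continuations, as in the
excerpt in this ticket, and it flags any qmaster_params value other than
"none", even one unrelated to rescheduling. The helper names are
hypothetical.

```python
import re

# Configuration excerpt from this ticket, used as sample input.
SAMPLE = """\
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:15:00
qmaster_params               ENABLE_RESCHEDULE_KILL=true \\
                             ENABLE_RESCHEDULE_SLAVE=true
"""


def parse_sconf(text):
    """Parse `qconf -sconf`-style text into a {name: value} dict."""
    # Join each line ending in a backslash with the line that follows it.
    joined = re.sub(r"\\[ \t]*\n[ \t]*", " ", text)
    settings = {}
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace inside the value.
            settings[parts[0]] = " ".join(parts[1].split())
    return settings


def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def non_default_reschedule_settings(settings):
    """List which of the two settings differ from the quoted defaults."""
    changed = []
    if to_seconds(settings.get("reschedule_unknown", "00:00:00")) != 0:
        changed.append("reschedule_unknown")
    if settings.get("qmaster_params", "none").lower() != "none":
        changed.append("qmaster_params")
    return changed


print(non_default_reschedule_settings(parse_sconf(SAMPLE)))
```

An empty list means both settings are at the defaults used in the
workaround. It does not show which setting triggers the kills, since that
was not isolated.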
