Opened 14 years ago

Last modified 9 years ago

#249 new defect

IZ1634: Suspend/Resume Problems on Red Hat 3.0

Reported by: templedf
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0u4
Severity:
Keywords: PC Linux execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1634]

        Issue #:      1634             Platform:     PC       Reporter: templedf (templedf)
       Component:     gridengine          OS:        Linux
     Subcomponent:    execution        Version:      6.0u4       CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     pollinger
          URL:
       * Summary:     Suspend/Resume Problems on Red Hat 3.0
   Status whiteboard:
      Attachments:

     Issue 1634 blocks:
   Votes for issue 1634:


   Opened: Thu May 26 08:38:00 -0700 2005 
------------------------


I just had a very odd issue with the 6.0u4 courtesy binaries on RHEL 3.0/Athlon.
Here's the story.

I installed the cluster with the Max scheduler setting and 7 exec hosts and no
shadowd.  2 of the exec hosts were offline.  I set up a custom complex value and
added load sensors to each exec host to produce the complex value.  I created a
queue that existed only on one host, with no load threshold, a suspend threshold
that depended on the complex value, and a suspend method script.  I created a
second queue that existed only on that same host, with no load threshold, a
suspend threshold that depended on the complex value, and a resume method
script.  I changed the scheduler interval to 1s.  I then submitted an array job
with two tasks to each of the two queues.

I caused the custom complex value to increase, causing the queues to suspend one
of their jobs.  When the suspend threshold was triggered on the queue with the
suspend method, the suspend script was called *twice*.

With the resume method queue in alarm state and the suspend method queue not in
alarm state, I deleted the second array task of the job in each queue.  Both
tasks were immediately rescheduled.  No matter how many times I deleted them,
they kept reappearing in the pending job list.

I set a suspend and resume method for both queues.  Only the suspend method for
the queue which originally had a suspend method was ever called.  The suspend
method for the other queue and both resume methods were never called.

I deleted all jobs in the system and submitted 2 regular jobs to each queue, and
then the suspend and resume methods worked for a little while.  Then,
spontaneously, they stopped working again.  Deleting and resubmitting the jobs
and/or restarting the qmaster does not help.
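
For anyone trying to reproduce this setup, a load sensor of the kind described
above can be sketched roughly as follows.  This is not the script used in this
report.  It only illustrates the usual load sensor exchange: the execd writes a
line to the sensor's stdin for each load interval and "quit" on shutdown, and
the sensor answers with a begin / host:name:value / end block.  The names
custom_load, VALUE_FILE, SENSOR_HOST and RUN_SENSOR are assumptions made for
this example.

```shell
#!/bin/sh
# Sketch of a Grid Engine load sensor (illustrative, not the reporter's script).
# COMPLEX_NAME, VALUE_FILE, SENSOR_HOST and RUN_SENSOR are hypothetical names.
COMPLEX_NAME="${COMPLEX_NAME:-custom_load}"
VALUE_FILE="${VALUE_FILE:-/tmp/custom_load.value}"
SENSOR_HOST="${SENSOR_HOST:-$(uname -n)}"

# Print one load report in the begin / host:name:value / end format.
# The value is read from VALUE_FILE so it can be raised by hand to
# push a queue over its suspend threshold; a missing file reports 0.
report_load() {
    if [ -r "$VALUE_FILE" ]; then
        value=$(cat "$VALUE_FILE")
    else
        value=0
    fi
    echo "begin"
    echo "$SENSOR_HOST:$COMPLEX_NAME:$value"
    echo "end"
}

# Answer every request line with a report until "quit" (or EOF) arrives.
sensor_loop() {
    while read -r request; do
        if [ "$request" = "quit" ]; then
            return 0
        fi
        report_load
    done
}

# Guard used only so the functions can be sourced for testing; a deployed
# sensor would simply call sensor_loop unconditionally.
if [ "${RUN_SENSOR:-0}" = "1" ]; then
    sensor_loop
fi
```

Writing a larger number into the file named by VALUE_FILE is then enough to
raise the reported complex value on the next load interval.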

I have tested the suspend and resume scripts exhaustively.  They work when run
by hand.  They work occasionally when run by the execd.  If I set the suspend
method on both queues to the same path, it still only works for one of the
queues.  If I create a new queue, the suspend method works, but the resume
method does not.  This is clearly not a bug in the scripts.
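
One way to check how often the execd really invokes a method is to have the
script log every call.  The following is a minimal sketch, not the script used
in this report.  The log path, the function names and the argument order
(action, queue, job id, job pid) are assumptions made for this example, and it
signals only the single job pid rather than a whole process group.

```shell
#!/bin/sh
# Sketch of a combined suspend/resume method that records every invocation.
# METHOD_LOG, the function names and the argument order are hypothetical.
METHOD_LOG="${METHOD_LOG:-/tmp/sge_method_calls.log}"

# Append one line per call: timestamp, action, queue, job id, job pid.
log_call() {
    printf '%s %s queue=%s job=%s pid=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2" "$3" "$4" >> "$METHOD_LOG"
}

# Print how many times an action was logged for a given job id.
count_calls() {
    grep -c " $1 .*job=$2 " "$METHOD_LOG"
}

# Log the call, then stop or continue the job's process.
handle_method() {
    log_call "$1" "$2" "$3" "$4"
    case "$1" in
        suspend) kill -STOP "$4" ;;
        resume)  kill -CONT "$4" ;;
    esac
}

# Act only when called with all four arguments: action queue job_id job_pid.
if [ "$#" -eq 4 ]; then
    handle_method "$@"
fi
```

Assuming the script is installed at a path of your choosing, the queue's
suspend_method could be set to that path followed by
"suspend $queue $job_id $job_pid", and resume_method to the same path followed
by "resume $queue $job_id $job_pid".  Two "suspend" lines for the same job in
the log would then confirm a double invocation like the one described above.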

   ------- Additional comments from andreas Fri Jun 17 06:59:51 -0700 2005 -------
Changing to execution.

Change History (0)
