#939 new defect

IZ713: Resource is not moved even though there is a need for it

Reported by: rhierlmeier      Owned by:
Priority:    normal           Milestone:
Component:   hedeby           Version:  1.0u5
Severity:                     Keywords: resource_provider
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=713]

        Issue #:          713                      Platform:   All
        Reporter:         rhierlmeier (rhierlmeier)
        Component:        hedeby                   OS:         All
        Subcomponent:     resource_provider        Version:    1.0u5
        Status:           NEW                      Priority:   P3
        Resolution:                                Issue type: DEFECT
        Target milestone: 1.0u5next                CC:         None defined
        Assigned to:      adoerr (adoerr)
        QA Contact:       adoerr
        URL:
        Summary:          Resource is not moved even though there is a need
                          for it
        Status whiteboard:
        Attachments:

        Issue 713 blocks:
        Votes for issue 713:


   Opened: Thu Dec 10 07:49:00 -0700 2009 
------------------------


   Description

   I ran into a problem while testing a MaxPendingJobsSLO use case. I had
   three services in the system: one cloud service, one GE service, and one
   spare_pool. I had the following SLO setup:

   o MPJ SLO at the GE service (max=1, averageSlotsPerHost=1, urgency=13)
   o PermanentRequestSLO at the cloud service (urgency=2, quantity=1000)
   o PermanentRequestSLO at the spare_pool (urgency=1, quantity=10)

   After I submitted some jobs into the GE cluster, resources from the cloud
   service moved into the GE service. The jobs ran, and most resources moved
   correctly back to the cloud service.

   Two resources, however, moved to the spare_pool and stayed there for two
   minutes even though there was an open resource request from the
   PermanentRequestSLO of the cloud service. I would expect the resources to
   move back into the cloud service immediately.


   Evaluation:

   The issue has only a minor effect on the SDM system: after a short
   waiting time (two minutes) the resource moves to the correct service.
   The problem is only visible to the customer if the quantity of the
   PermanentRequestSLO is very high, so that a resource request has a long
   lifetime.

   Analysis:


   The problem occurs whenever an order for a resource movement has been
   canceled. In the concrete scenario, the resource at the GE service went
   into ERROR state during the uninstallation of the execd, and the order
   was canceled. The following entries were found in the history:


   12/10/2009 10:49:50.53  RESOURCE_REMOVE             sge            foo(res#204)
       Uninstalling execd
   12/10/2009 10:49:51.153 RESOURCE_ERROR              sge            foo(res#204)
       Script uninstall_execd_sim.sh failed with status 126...
   12/10/2009 10:50:05.10  RESOURCE_ADDED              sge            foo(res#204)
       Got execd update event
   12/10/2009 10:50:05.25  RESOURCE_REMOVE             sge            foo(res#204)
       Uninstalling execd
   12/10/2009 10:50:08.11  RESOURCE_REMOVED            sge            foo(res#204)
       Execd is not running
   12/10/2009 10:50:08.17  RESOURCE_ADD                spare_pool     foo(res#204)
                           RESOURCE_PROPERTIES_CHANGED spare_pool     foo(res#204)
       [[U:usage=0->inf]]
   12/10/2009 10:50:08.18  RESOURCE_ADDED              spare_pool     foo(res#204)
   12/10/2009 10:50:08.19  RESOURCE_PROPERTIES_CHANGED spare_pool     foo(res#204)
       [[U:usage=inf->1]]
   12/10/2009 10:51:36.65  RESOURCE_REMOVE             spare_pool     foo(res#204)
                           RESOURCE_REMOVED            spare_pool     foo(res#204)
   12/10/2009 10:51:36.71  RESOURCE_ADD                cloud          foo(res#204)
   12/10/2009 10:51:36.72  RESOURCE_PROPERTIES_CHANGED cloud          foo(res#204)
       [[U:annotation=Shutting down resource]]
   12/10/2009 10:51:36.73  RESOURCE_PROPERTIES_CHANGED cloud          foo(res#204)
       [[U:annotation=Shutting down cloudhost]]
   12/10/2009 10:51:36.74  RESOURCE_PROPERTIES_CHANGED cloud          foo(res#204)
       [[U:usage=inf->2]]
   12/10/2009 10:51:36.274 RESOURCE_PROPERTIES_CHANGED cloud          cloud47(res#204)
       [[D:resourceHostname=foo]]
                           RESOURCE_ADDED              cloud          cloud47(res#204)
       Resource was shutdown

   Before 10:49:50 the resource res#204 had been requested by the
   PermanentRequestSLO of the cloud service, and the uninstall started. At
   10:49:51 the resource moved into ERROR state. However, qmaster reported
   the execd again and the resource went back into ASSIGNED state. Due to
   the ERROR state, the resource provider canceled the order that had been
   placed for this resource on behalf of the PermanentRequestSLO of the
   cloud service. The resource was no longer considered for this resource
   request.

   After the resource went back into ASSIGNED state, the PermanentRequestSLO
   of the spare_pool requested the resource. The PermanentRequestSLO of the
   cloud service did not consider the resource, because the order associated
   with its request had been canceled.

   The cause of the problem is the way the class NeedProcessor handles this
   situation. If the resource provider (RP) receives a ResourceErrorEvent
   from a service, it executes the ResourceErrorAction:

   private class ResourceErrorAction extends AbstractAction {
      ...
      protected boolean doExecute() {
          // We must cancel all orders which have the service as source
          orders.cancelOrders(
              OrderVariableResolver.newResourceIdAndSourceFilter(
                  bad.getId(), source));
          return true;
      }
   }

   The OrderStore#cancelOrders method finds the order and cancels it. The
   NeedProcessor of the order is informed that the order has been canceled.
   It increments its canceled order count and no longer considers the
   resource for the request.
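
   A minimal sketch of the bookkeeping described above, assuming
   hypothetical names (awaitedResources, canceledOrderCount, onOrderCanceled)
   that only illustrate the behavior and are not taken from the hedeby
   sources:

   import java.util.HashSet;
   import java.util.Set;

   // Hypothetical sketch, not actual hedeby code.
   class NeedProcessorSketch {

      /** Resources for which an order has been placed for this request. */
      private final Set<String> awaitedResources = new HashSet<String>();
      private int canceledOrderCount;

      /** Called by the order store when an order of this processor has
          been canceled. */
      void onOrderCanceled(String resourceId) {
          // Only the counter is updated; the resource stays in the awaited
          // set and is therefore never offered to this resource request
          // again, even after it goes back into ASSIGNED state.
          canceledOrderCount++;
      }

      /** A resource with a pending (or canceled!) order is not a candidate. */
      boolean isCandidate(String resourceId) {
          return !awaitedResources.contains(resourceId);
      }
   }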

   How to fix

   In this particular case the resource goes into ERROR state, and the
   NeedProcessor must be informed that the order will not be executed.
   However, the awaited resource must be removed from the NeedProcessor
   rather than counted as canceled. Once the resource goes back into
   ASSIGNED state it can then be reconsidered for this resource request.
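
   Extending the sketch above, the fix would look roughly as follows;
   removeAwaitedResource is an assumed new method, not an existing hedeby
   API:

   // Hypothetical sketch of the fix, added to NeedProcessorSketch above.
   void removeAwaitedResource(String resourceId) {
       // Instead of counting the order as canceled (which permanently
       // excludes the resource from this request), forget the resource ...
       awaitedResources.remove(resourceId);
       // ... so that isCandidate(resourceId) becomes true again once the
       // resource leaves ERROR state and is re-assigned.
   }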

   The issue can be seen as a side effect of the fix for issue 707. Before
   707, a new NeedProcessor was created for each new resource request. The
   orders and the old NeedProcessor were still available, but they were not
   associated with the new resource request. With the fix for 707, new
   requests are re-associated with existing orders and NeedProcessors, so a
   canceled order and its NeedProcessor live longer.


   Workaround:

   The problem is not visible if the quantity of the PermanentRequestSLOs is
   low (~ 10). The quantity of an SLO ends up as the quantity of a need, and
   the quantity of a need is directly proportional to the lifetime of a
   resource request. A request with quantity=10 is satisfied quickly and
   expires, while one with quantity=1000 stays alive long enough for the
   canceled order to become visible. If the resource request lives only for
   a short time, the issue is not visible in the system.

   How to test

   Write a TS test:

     o  one sge service and one spare_pool are active
     o  modify the execd configuration of the sge service so that the execd
        uninstall procedure fails if a certain resource property is set
     o  make all resources except one in the sge service static
     o  reconfigure the SLO setup:
        o  sge_service has a fixed usage SLO with urgency 20
        o  spare_pool has a PermanentRequestSLO with urgency 50
     o  check that the resource goes into ERROR state
     o  reset the resource property of the resource so that the next execd
        uninstall succeeds
     o  reset the resource
     o  check that the resource moves to the spare_pool

   It is important that the test does not restart the services after the
   setup. With a restart all orders would be purged and the issue could not
   be tested.


   Enhance the

   ETC: 3PD (with TS test)
