Opened 11 years ago
Last modified 10 years ago
#939 new defect
IZ713: Resource is not moved despite there is a need for it
Reported by: | rhierlmeier | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | hedeby | Version: | 1.0u5 |
Severity: | | Keywords: | resource_provider |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=713]
Issue #: 713
Platform: All
Reporter: rhierlmeier (rhierlmeier)
Component: hedeby
OS: All
Subcomponent: resource_provider
Version: 1.0u5
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: 1.0u5next
Assigned to: adoerr (adoerr)
QA Contact: adoerr
URL:
Summary: Resource is not moved despite there is a need for it
Status whiteboard:
Attachments:
Issue 713 blocks:
Votes for issue 713:
Opened: Thu Dec 10 07:49:00 -0700 2009
------------------------
Description

I ran into a problem while testing a MaxPendingJobsSLO use case. I had three services in the system: one cloud service, one GE service and one spare_pool. I had the following SLO setup:

o MaxPendingJobsSLO at GE service (max=1, averageSlotsPerHost=1, urgency=13)
o PermanentRequestSLO at cloud service (urgency=2, quantity=1000)
o PermanentRequestSLO at spare_pool (urgency=1, quantity=10)

After I submitted some jobs into the GE cluster, resources moved from cloud service into the GE service. The jobs ran and most resources moved correctly back to cloud service. Two resources moved to spare_pool and stayed there for 2 minutes, even though there was a resource request from the PermanentRequestSLO of cloud service. I would expect the resources to move immediately back into cloud service.

Evaluation:

The issue has a minor effect on the SDM system. After a short waiting time (two minutes) the resource moves to the correct service. The problem is only visible to the customer if the quantity of the PermanentRequestSLO is very high, so that a resource request has a long lifetime.

Analysis:

The problem occurs whenever an order for a resource movement has been canceled. In the concrete scenario the resource at GE service went into ERROR state during the uninstallation of execd, and the order was canceled.
The following entries were found in the history:

12/10/2009 10:49:50.53   RESOURCE_REMOVE sge foo(res#204) Uninstalling execd
12/10/2009 10:49:51.153  RESOURCE_ERROR sge foo(res#204) Script uninstall_execd_sim.sh failed with status 126...
12/10/2009 10:50:05.10   RESOURCE_ADDED sge foo(res#204) Got execd update event
12/10/2009 10:50:05.25   RESOURCE_REMOVE sge foo(res#204) Uninstalling execd
12/10/2009 10:50:08.11   RESOURCE_REMOVED sge foo(res#204) Execd is not running
12/10/2009 10:50:08.17   RESOURCE_ADD spare_pool foo(res#204)
                         RESOURCE_PROPERTIES_CHANGED spare_pool foo(res#204) [[U:usage=0->inf]]
12/10/2009 10:50:08.18   RESOURCE_ADDED spare_pool foo(res#204)
12/10/2009 10:50:08.19   RESOURCE_PROPERTIES_CHANGED spare_pool foo(res#204) [[U:usage=inf->1]]
12/10/2009 10:51:36.65   RESOURCE_REMOVE spare_pool foo(res#204)
                         RESOURCE_REMOVED spare_pool foo(res#204)
12/10/2009 10:51:36.71   RESOURCE_ADD cloud foo(res#204)
12/10/2009 10:51:36.72   RESOURCE_PROPERTIES_CHANGED cloud foo(res#204) [[U:annotation=Shutting down resource]]
12/10/2009 10:51:36.73   RESOURCE_PROPERTIES_CHANGED cloud foo(res#204) [[U:annotation=Shutting down cloudhost]]
12/10/2009 10:51:36.74   RESOURCE_PROPERTIES_CHANGED cloud foo(res#204) [[U:usage=inf->2]]
12/10/2009 10:51:36.274  RESOURCE_PROPERTIES_CHANGED cloud cloud47(res#204) [[D:resourceHostname=foo]]
                         RESOURCE_ADDED cloud cloud47(res#204) Resource was shutdown

Before 10:49:50 the resource res#204 had been requested by the PermanentRequestSLO of cloud service. The uninstall started. At 10:49:51 the resource moved into ERROR state. However, qmaster reported the execd again and the resource went back into ASSIGNED state. Due to the ERROR state, resource provider canceled the order for this resource, which had been placed for the PermanentRequestSLO of cloud service. The resource was no longer considered for this resource request. After the resource went back into ASSIGNED state, the PermanentRequestSLO of spare_pool requested the resource.
The PermanentRequestSLO of cloud service did not consider this resource because the order associated with the request had been canceled.

The cause of the problem is the way the class NeedProcessors handles the situation. If RP receives a ResourceErrorEvent from a service, it executes the ResourceErrorAction:

    private class ResourceErrorAction extends AbstractAction {
        ...
        protected boolean doExecute() {
            // We must cancel orders which have the service as source
            orders.cancelOrders(OrderVariableResolver.newResourceIdAndSourceFilter(bad.getId(), source));
            return true;
        }
    }

The OrderStore#cancelOrders method finds the order and cancels it. The NeedProcessor of the order is informed that the order has been canceled. It increments the canceled order count and no longer considers the resource for the request.

How to fix

In the particular case that the resource goes into ERROR state, the NeedProcessor must be informed that the order will not be executed. However, the awaited resource must be removed from the NeedProcessor and not canceled. Once the resource goes back into ASSIGNED state it can be reconsidered for this resource request.

The issue can be seen as a side effect of the fix for issue 707. Before 707, new NeedProcessors were created with each new resource request. The orders and the old NeedProcessor were still available, but they were not associated with the new resource request. With the fix for 707, new requests are re-associated with existing orders and NeedProcessors, so a canceled order and its NeedProcessor live longer.

Workaround:

The problem is not visible if the quantity of the PermanentRequestSLOs is low (~ 10). The quantity of the SLO becomes the quantity of a need, and the quantity of a need is directly proportional to the lifetime of a resource request. If a resource request lives only for a short time, the issue is not visible in the system.
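
The difference between canceling an order and removing an awaited resource, as described under "How to fix", can be illustrated with a minimal sketch. It is not the Hedeby implementation: the class NeedProcessorSketch and all of its methods are hypothetical stand-ins, and the bookkeeping (two sets of resource ids) is an assumption made only to show why a canceled resource stays excluded while a removed one can be requested again.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for a NeedProcessor; not the Hedeby API. */
public class NeedProcessorSketch {

    /** Resources for which an order is pending on behalf of this need. */
    private final Set<String> awaited = new HashSet<>();

    /** Resources whose order was canceled; excluded for the lifetime of the need. */
    private final Set<String> canceled = new HashSet<>();

    /** An order for the resource has been placed for this need. */
    public void orderPlaced(String resourceId) {
        awaited.add(resourceId);
    }

    /** Behavior from the analysis: a cancel excludes the resource from the request. */
    public void orderCanceled(String resourceId) {
        if (awaited.remove(resourceId)) {
            canceled.add(resourceId);
        }
    }

    /** Proposed handling of the ERROR case: drop the order, keep the resource eligible. */
    public void awaitedResourceRemoved(String resourceId) {
        awaited.remove(resourceId);
    }

    /** True if the resource may be requested (again) for this need. */
    public boolean canConsider(String resourceId) {
        return !awaited.contains(resourceId) && !canceled.contains(resourceId);
    }

    /** Number of orders canceled for this need so far. */
    public int getCanceledOrderCount() {
        return canceled.size();
    }

    public static void main(String[] args) {
        // ERROR state handled as a cancel: res#204 is lost for this request.
        NeedProcessorSketch current = new NeedProcessorSketch();
        current.orderPlaced("res#204");
        current.orderCanceled("res#204");
        System.out.println(current.canConsider("res#204"));   // prints "false"

        // ERROR state handled as a removal: res#204 can be requested again
        // once it is back in ASSIGNED state.
        NeedProcessorSketch proposed = new NeedProcessorSketch();
        proposed.orderPlaced("res#204");
        proposed.awaitedResourceRemoved("res#204");
        System.out.println(proposed.canConsider("res#204"));  // prints "true"
    }
}
```

In this sketch the only difference between the two paths is whether the resource id ends up in the canceled set. With the long-lived NeedProcessors introduced by the fix for issue 707, that set would persist as long as the request does, which matches the two-minute delay seen in the history.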

How to test

Write a TS test:

o One sge service and one spare_pool are active
o Modify the execd configuration of sge service so that the execd uninstall procedure fails if a certain resource property is set
o Make all resources except one in sge service static
o Reconfigure the SLO setup:
  o sge_service has fixed usage SLO with urgency 20
  o spare_pool has PermanentRequestSLO with urgency 50
o Check that the resource goes into error state
o Reset the resource property of the resource so that the next execd uninstall succeeds
o Reset the resource
o Check that the resource moves to spare_pool

It is important that the test does not restart the services after the setup. With a restart all orders are purged and the issue cannot be tested.

Enhance the ETC: 3PD (with TS test)