Opened 50 years ago

Last modified 9 years ago

#937 new defect

IZ711: Restart of service can cause an ERROR resource to be released by the service

Reported by: zwierzak Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0u5_Beta
Severity: Keywords: Sun gridengine_adapter
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=711]

        Issue #:      711                      Platform:     Sun          Reporter: zwierzak (zwierzak)
       Component:     hedeby                      OS:        All
     Subcomponent:    gridengine_adapter       Version:      1.0u5_Beta      CC:    None defined
        Status:       NEW                      Priority:     P3
      Resolution:                             Issue type:    DEFECT
                                           Target milestone: 1.0u5next
      Assigned to:    rhierlmeier (rhierlmeier)
      QA Contact:     rhierlmeier
          URL:
       * Summary:     Restart of service can cause an ERROR resource to be released by the service
   Status whiteboard:
      Attachments:


     Issue 711 blocks:


   Opened: Tue Dec 8 08:19:00 -0700 2009 
------------------------


   Description:

   The problem can be reproduced as follows. Move a resource to the GE adapter
   so that the ASSIGNING resource goes to ERROR state in that service because of
   an error during installation. (NOTE: The installation has to fail in such a
   way that the resource is not reported by qmaster.)
   Then trigger sdmadm uc -c ge_adapter. The ERROR resource disappears from the
   GE adapter, is moved to the resource provider, and is set to UNASSIGNED state.
   Later this resource is added to the first service that makes a request.

   #sdmadm shist -r res#11

   12/07/2009 17:26:24.142 RESOURCE_ERROR    CrashGrid         domU-12-31-39-00-CC-82(res#11) Script install_execd_cloud.sh failed, executor did not return an exit code, probably it timed out
   12/07/2009 17:27:54.587 RESOURCE_REMOVE   CrashGrid         domU-12-31-39-00-CC-82(res#11) Disappeared while service was down
                           RESOURCE_REMOVED  CrashGrid         domU-12-31-39-00-CC-82(res#11) Disappeared while service was down
   12/07/2009 17:27:54.638 RESOURCE_ADD      resource_provider domU-12-31-39-00-CC-82(res#11)
   12/07/2009 17:27:54.639 RESOURCE_ADDED    resource_provider domU-12-31-39-00-CC-82(res#11)
   12/07/2009 17:27:54.640 RESOURCE_REMOVE   resource_provider domU-12-31-39-00-CC-82(res#11)
                           RESOURCE_REMOVED  resource_provider domU-12-31-39-00-CC-82(res#11)
   12/07/2009 17:27:54.642 RESOURCE_ADD      resource_provider domU-12-31-39-00-CC-82(res#11)
   12/07/2009 17:27:54.643 RESOURCE_ADDED    resource_provider domU-12-31-39-00-CC-82(res#11)
   12/07/2009 17:28:51.197 RESOURCE_REMOVE   resource_provider domU-12-31-39-00-CC-82(res#11)
                           RESOURCE_REMOVED  resource_provider domU-12-31-39-00-CC-82(res#11)
   12/07/2009 17:28:51.222 RESOURCE_ADD      ExternalCloud     domU-12-31-39-00-CC-82(res#11)

   "sdmadm uc" was performed between 17:26:24.142 and 17:27:54.587.

   Evaluation:

   The ERROR resource is not lost. It stays in the system and is moved to the
   first service that sends a request to the RP (resource provider). P3.

   Suggested fix:
   The resource should stay in the GE service after a restart, regardless of
   whether it is in ASSIGNED or ERROR state.

   Analysis:

   The problem is in the GE adapter; the faulty implementation is in
   AbstractServiceAdapter.java. During startup/reload of the service, the
   StateHandle merges resources (spooled resources vs. resources reported by the
   service). Because a resource in ERROR state is not reported by the service,
   RESOURCE_REMOVE is triggered. See the mergeResources() method in
   AbstractServiceAdapter (lines 2047-2093). The fix should ensure that such
   resources are not released from the service.

   For proper behavior, mergeResources() should keep both the spooled resources
   and the resources reported by the service. Open question: should it never
   remove any resources?
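
   The difference between the behavior described in the analysis and the
   proposed behavior can be sketched as below. This is a simplified model only,
   not the Hedeby code: the class MergeSketch, both method names, the Map-based
   representation, and the second resource res#12 are assumptions made for
   illustration, and letting the reported state win for resources present on
   both sides is likewise an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the merge; not taken from AbstractServiceAdapter.java. */
public class MergeSketch {

    /** Resource states mentioned in this issue. */
    enum State { ASSIGNING, ASSIGNED, ERROR }

    /**
     * Behavior described in the analysis: only resources reported by the
     * service survive, so a spooled resource in ERROR state is dropped
     * (which corresponds to RESOURCE_REMOVE).
     */
    static Map<String, State> mergeDroppingUnreported(Map<String, State> spooled,
                                                      Map<String, State> reported) {
        return new LinkedHashMap<>(reported);
    }

    /**
     * Proposed behavior: keep the union of spooled and reported resources.
     * Assumption: for a resource present in both maps, the reported state wins.
     */
    static Map<String, State> mergeKeepingSpooled(Map<String, State> spooled,
                                                  Map<String, State> reported) {
        Map<String, State> merged = new LinkedHashMap<>(spooled);
        merged.putAll(reported);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, State> spooled = new LinkedHashMap<>();
        spooled.put("res#11", State.ERROR);     // failed installation, unknown to qmaster
        spooled.put("res#12", State.ASSIGNED);  // hypothetical healthy resource

        Map<String, State> reported = new LinkedHashMap<>();
        reported.put("res#12", State.ASSIGNED); // the service reports only res#12

        System.out.println(mergeDroppingUnreported(spooled, reported)); // {res#12=ASSIGNED}
        System.out.println(mergeKeepingSpooled(spooled, reported));     // {res#11=ERROR, res#12=ASSIGNED}
    }
}
```

   A pure union never removes anything, so it would also keep resources that
   have legitimately left the service; that is the open question raised above.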

   Test:

   Testsuite test reproducing the situation from the description using the GE
   adapter:
   1. Prepare an execd_install.sh script that causes ASSIGNING -> ERROR
   2. Move a resource from the spare pool to GE
   3. Check that the resource is in GE in ERROR state
   4. Run sdmadm uc -c GE
   5. Check that the resource stays in GE in ERROR state
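
   Step 5 could be automated by scanning the resource history for removal
   events, as in the following sketch. It is an illustration under stated
   assumptions, not part of the Hedeby testsuite: the class HistoryCheckSketch
   and the method resourceStayed are hypothetical, and the code assumes that
   each history event is available as a single line containing the event name,
   the service name, and the resource id in parentheses. The sample lines are
   taken from the history in the description, with the first message abridged.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical check for step 5; not part of the Hedeby testsuite. */
public class HistoryCheckSketch {

    /**
     * Returns true if the history contains no removal event for the given
     * resource in the given service.
     * Assumption: one event per line, with the service name as its own token.
     */
    static boolean resourceStayed(List<String> historyLines, String service, String resourceId) {
        for (String line : historyLines) {
            // "RESOURCE_REMOVE" also matches "RESOURCE_REMOVED"
            boolean isRemoval = line.contains("RESOURCE_REMOVE");
            boolean sameService = Arrays.asList(line.trim().split("\\s+")).contains(service);
            boolean sameResource = line.contains("(" + resourceId + ")");
            if (isRemoval && sameService && sameResource) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> observed = Arrays.asList(
            "12/07/2009 17:26:24.142 RESOURCE_ERROR CrashGrid domU-12-31-39-00-CC-82(res#11) Script install_execd_cloud.sh failed",
            "12/07/2009 17:27:54.587 RESOURCE_REMOVE CrashGrid domU-12-31-39-00-CC-82(res#11) Disappeared while service was down");

        // History as reported in this issue: the resource was released.
        System.out.println(resourceStayed(observed, "CrashGrid", "res#11"));               // false
        // History without the removal event: the resource stayed.
        System.out.println(resourceStayed(observed.subList(0, 1), "CrashGrid", "res#11")); // true
    }
}
```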

   ETC: 4PD (2 fix + 2 ts test)

Change History (0)
