Opened 50 years ago

Last modified 9 years ago

#877 new defect

IZ522: resource provider ignores administrative removal of resource

Reported by: easymf Owned by:
Priority: normal Milestone:
Component: hedeby Version: 1.0u1
Severity: Keywords: Sun resource_provider
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=522]

        Issue #:      522                     Platform:     Sun         Reporter: easymf (easymf)
       Component:     hedeby                     OS:        All
     Subcomponent:    resource_provider       Version:      1.0u1          CC:    None defined
        Status:       NEW                     Priority:     P3
      Resolution:                            Issue type:    DEFECT
                                          Target milestone: 1.0u5next
      Assigned to:    easymf (easymf)
      QA Contact:     adoerr
          URL:
       * Summary:     resource provider ignores administrative removal of resource
   Status whiteboard:
      Attachments:


     Issue 522 blocks:
   Votes for issue 522:


   Opened: Sun Jul 20 06:12:00 -0700 2008 
------------------------


   If resources are removed in a batch, the resource provider may ignore the
   removal of a resource: it removes the resource from its current owner and then
   immediately assigns it to a service, instead of removing it from the system.

   it seems like a problem with Order processing
               ------- Additional comments from easymf Sun Jul 20 06:14:54 -0700 2008 -------
   it's mine
               ------- Additional comments from easymf Sun Jul 20 06:22:10 -0700 2008 -------
   bad priority .. it has easy workaround.
               ------- Additional comments from adoerr Wed Aug 20 07:40:36 -0700 2008 -------
   New target milestone.
               ------- Additional comments from easymf Thu Aug 28 04:46:41 -0700 2008 -------
   Description:

   If a system is under heavy load, the resource provider may ignore the
   administrative removal of a resource. The problem can be reproduced using the
   following steps (to avoid GE adapter interference, use only 2 spare pools):

   1. have a system with at least 1500 resources and have them all assigned to ONE
   spare pool
   2. remove all resources at once from the spare pool (using a command "sdmadm rr
   -r `sdmadm sr -o res_ids`")
   3. wait until the command run is finished
   4. check that all resources were removed from system - "sdmadm sr" output should
   show no resource
   5. if you see any resource, you hit the bug

   Evaluation:

   The problem may occur if the system is under heavy load, which should not happen
   too often - system is designed to distribute the load (thanks to immediate
   processing of requests), thus the high load may occur only sporadically. There
   is also an easy workaround, so the priority of the issue is P3.

   Suggested Fix / Work Around:

   To work around the problem, either remove the resources one after another or,
   if a resource is not removed on the first attempt, remove it again.
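
   The "remove it again" workaround can be sketched as a small retry loop. This
   is only an illustration: remove_until_gone, list_resources and
   remove_resources are hypothetical names, and the two callables are assumed to
   wrap whatever lists and removes resources on the system (for example the
   "sdmadm sr -o res_ids" and "sdmadm rr -r" commands used in this issue).

```python
def remove_until_gone(list_resources, remove_resources, max_attempts=3):
    """Remove all listed resources, retrying while any of them remain.

    list_resources:   hypothetical callable returning the current resource ids
    remove_resources: hypothetical callable removing the given resource ids
    Returns the number of removal calls that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        remaining = list(list_resources())
        if not remaining:
            # Nothing left, so the previous attempt (if any) succeeded.
            return attempt - 1
        remove_resources(remaining)
    leftover = list(list_resources())
    if leftover:
        raise RuntimeError("resources still present: %s" % leftover)
    return max_attempts
```

   Bounding the number of attempts keeps the loop from spinning forever if a
   resource cannot be removed for some unrelated reason.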

   The fix has to ensure that a resource removed using "sdmadm rr -r RES" does
   not remain in the system.

   Analysis:

   The problem is caused by a lack of strict atomicity when RP synchronizes the
   remote service using SCP (distributed transaction). Specifically, the following
   sequence leads to the problem described in the issue:

   1. remote service (S) sends a pair of resource events (E1, E2) reporting that
   resource (R) is being removed (remove resource, resource removed)
   2. E1, E2 have not yet been received by RP (SCP)
   3. RP performs synchronization with S
   4. as a result of step 3, resource R is removed from the cache
   5. just after step 4, RP (SCP) receives resource events E1, E2 and, as a
   result, assumes that the resource was added and then removed from the service
   again (processing of RemoveResourceEvent).
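
   The ordering above can be replayed with a toy model. This is a sketch for
   illustration only and is not Hedeby code: ToyScp, run and the event tuples
   are hypothetical, and the handling of an uncached resource simply follows the
   description in this issue (ADD, ADDED, REMOVE are forwarded as in the
   auto-discovery case).

```python
class ToyScp:
    """Toy stand-in for the SCP cache; all names here are hypothetical."""

    def __init__(self, cached):
        self.cache = set(cached)
        self.forwarded = []  # events passed on to the resource provider

    def full_sync(self, service_resources):
        # Steps 3 and 4: the cache is replaced by the service's current state.
        self.cache = set(service_resources)

    def on_remove(self, res):
        # E1 (remove resource)
        if res in self.cache:
            self.forwarded.append(("REMOVE", res))
        else:
            # Uncached resource: treated like auto-discovery, as described.
            self.forwarded += [("ADD", res), ("ADDED", res), ("REMOVE", res)]

    def on_removed(self, res):
        # E2 (resource removed)
        self.cache.discard(res)
        self.forwarded.append(("REMOVED", res))


def run(sync_before_events):
    scp = ToyScp(cached={"R"})
    if sync_before_events:
        # The service has already dropped R when the synchronization happens.
        scp.full_sync(service_resources=set())
    scp.on_remove("R")
    scp.on_removed("R")
    return scp.forwarded
```

   In this model, run(False) forwards only REMOVE and REMOVED, while run(True)
   additionally forwards ADD and ADDED for a resource that the administrator
   has already removed.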

   The problem actually affects the whole process of service synchronization: if
   RP misses (e.g. due to network problems) the following set of events "ADD,
   ADDED, REMOVE, REMOVED", the resource is lost for RP. Full service
   synchronization is not able to find out that resource info was lost, only
   that some events were lost.

   To properly fix the root cause, we need to introduce distributed transactions
   (topic for technical post mortem).

   To fix the problem described in the issue (ignoring of administrative resource
   removal), it should be enough to change the handling of RemoveResourceEvent in
   the case when SCP does not cache the subject resource (currently, SCP forwards
   ADD, ADDED, REMOVE events as in the resource auto-discovery case). Instead of
   forwarding "ADD, ADDED, REMOVE" events, SCP should do a full refresh.
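
   A minimal sketch of the proposed handling, assuming a simplified cache model.
   It is not the actual SCP implementation: RefreshingScp, fetch_service_resources
   and the event tuples are hypothetical names chosen for illustration.

```python
class RefreshingScp:
    """Toy cache that refreshes instead of forwarding ADD, ADDED, REMOVE."""

    def __init__(self, fetch_service_resources):
        # fetch_service_resources: hypothetical callable returning the ids of
        # the resources the remote service currently holds.
        self._fetch = fetch_service_resources
        self.cache = set(self._fetch())
        self.forwarded = []  # events passed on to the resource provider
        self.refreshes = 0

    def full_refresh(self):
        self.cache = set(self._fetch())
        self.refreshes += 1

    def on_remove(self, res):
        if res in self.cache:
            self.forwarded.append(("REMOVE", res))
        else:
            # Unknown resource: resynchronize with the service rather than
            # pretending the resource was just discovered.
            self.full_refresh()
```

   With this handling, a late remove event for a resource that is no longer
   cached results in a refresh and no forwarded events, so the resource provider
   is never told that the resource was added.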

   How to test:

   JUnit testing - adjust the SCP JUnit tests (they cover current functionality).
   TestSuite test - possibly only in performance cluster, as the issue is hard to
   spot when fewer than 1000 resources are used. The TS test could look like this:

   1. move all resources to spare pool
   2. remove all resources from system
   3. check that all resources were removed
   4. reset TS
   5. repeat the test xxx times

   Manually - add several thousand resources to the system (using the
   printypres.sh tool), remove all resources from the system, and watch the output.

   ETC:

   3 PD for basic fix
   ? PD for introducing distributed transactions

   ATC:

   2.5 PD
               ------- Additional comments from easymf Thu Aug 28 05:42:45 -0700 2008 -------
   forgot to raise priority and change the version
               ------- Additional comments from easymf Fri Aug 29 02:54:43 -0700 2008 -------
   priority was still p4 ..
               ------- Additional comments from adoerr Fri Nov 21 02:25:27 -0700 2008 -------
   Target milestone update.

               ------- Additional comments from rhierlmeier Tue Apr 7 00:07:21 -0700 2009 -------
   It is possible that this problem has been fixed with issue 595 because the full
   refresh mechanism of SCP has been redesigned.


               ------- Additional comments from easymf Wed Jul 15 07:26:54 -0700 2009 -------
   comment:

   TS test could leverage the simhost feature. By generating the list of resources
   (using a 'util/printypres.sh' from sdm sources) the test should add no less than
   2000 resources to system and then remove them by calling:

   'sdmadm rr -r `sdmadm sr`'

   The test could have more runlevels (short, long) - in short, the test is
   performed only once, in long runlevel it is performed multiple times (3-5).

               ------- Additional comments from torsten Fri Nov 27 00:40:10 -0700 2009 -------
   changed milestone to 1.0u5next
               ------- Additional comments from torsten Fri Nov 27 03:32:36 -0700 2009 -------
   This should be tested in TS with some changes to cloud_simhost test, possibly
   moving the resource over to a GE service and at the end of the test removing all
   resources instead of simply removing the service.
