Opened 10 years ago
Last modified 9 years ago
#877 new defect
IZ522: resource provider ignores administrative removal of resource
Reported by: | easymf | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | hedeby | Version: | 1.0u1 |
Severity: | | Keywords: | Sun resource_provider |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=522]
Issue #: 522
Platform: Sun
Reporter: easymf (easymf)
Component: hedeby
Subcomponent: resource_provider
OS: All
Version: 1.0u1
CC: None defined
Status: NEW
Priority: P3
Issue type: DEFECT
Target milestone: 1.0u5next
Assigned to: easymf (easymf)
QA Contact: adoerr
Summary: resource provider ignores administrative removal of resource

Opened: Sun Jul 20 06:12:00 -0700 2008
------------------------
If resources are removed in a batch, the resource provider may ignore the removal of a resource - it actually removes the resource from its current owner and then immediately assigns it to a service, instead of removing the resource from the system. It looks like a problem with order processing.

------- Additional comments from easymf Sun Jul 20 06:14:54 -0700 2008 -------
It's mine.

------- Additional comments from easymf Sun Jul 20 06:22:10 -0700 2008 -------
Bad priority .. it has an easy workaround.

------- Additional comments from adoerr Wed Aug 20 07:40:36 -0700 2008 -------
New target milestone.

------- Additional comments from easymf Thu Aug 28 04:46:41 -0700 2008 -------
Description:
If the system is under heavy load, it may happen that the resource provider ignores an administrative removal of a resource. The problem can be reproduced using the following steps (to avoid GE adapter interference, use only 2 spare pools):

1. Have a system with at least 1500 resources and have them all assigned to ONE spare pool.
2. Remove all resources at once from the spare pool (using the command "sdmadm rr -r `sdmadm sr -o res_ids`").
3. Wait until the command run is finished.
4. Check that all resources were removed from the system - the "sdmadm sr" output should show no resources.
5. If you see any resource, you hit the bug.

(A scripted sketch of these steps follows the analysis below.)

Evaluation:
The problem may occur if the system is under heavy load, which should not happen too often - the system is designed to distribute the load (thanks to immediate processing of requests), so high load should occur only sporadically. There is also an easy workaround, so the priority of the issue is P3.

Suggested fix / workaround:
To work around the problem, either remove the resources one after another, or, if a resource is not removed on the first attempt, remove it again. The fix has to ensure that a resource removed using "sdmadm rr -r RES" does not remain in the system.

Analysis:
The problem is caused by the lack of strict atomicity when the RP synchronizes the remote service using SCP (a distributed transaction). Exactly these conditions lead to the problem described in the issue:

1. The remote service (S) sends a pair of resource events (E1, E2) carrying the information that resource R is removed (remove resource, resource removed).
2. E1 and E2 are not yet received by the RP (SCP).
3. The RP performs a synchronization with S.
4. As a result of step 3, resource R is removed from the cache.
5. Just after step 4, the RP (SCP) receives the resource events E1 and E2 and, as a result, thinks that the resource was added and then removed from the service again (processing of RemoveResourceEvent).

The problem actually affects the whole process of service synchronization - if the RP misses (e.g. due to network problems) the event set "ADD, ADDED, REMOVE, REMOVED", the resource is lost for the RP; a full service synchronization is not able to find out that resource info was lost, only that some events were lost. To properly fix the root cause, we need to introduce distributed transactions (a topic for a technical post mortem).
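The reproduction steps above can be scripted. A minimal sketch, assuming sdmadm is on the PATH, the system is prepared as in step 1, and that "sdmadm sr -o res_ids" prints plain resource ids (as the backtick usage in step 2 suggests); the settle time is an arbitrary choice, not taken from the ticket:

```sh
#!/bin/sh
# Sketch of the reproduction steps for issue 522.
# Assumes >= 1500 resources, all assigned to ONE spare pool.

# step 2: remove all resources from the spare pool in one batch
sdmadm rr -r `sdmadm sr -o res_ids`

# step 3: the command has returned; give the system time to settle
sleep 60    # arbitrary settle time

# steps 4 and 5: no resource may be left in the system
remaining=`sdmadm sr -o res_ids | wc -w`
if [ "$remaining" -ne 0 ]; then
    echo "issue 522 reproduced: $remaining resource(s) still in the system"
fi
```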
To fix the problem described in the issue (ignoring of administrative resource removal), it should be enough to change the handling of RemoveResourceEvent in the case where SCP does not cache the subject resource (currently, SCP forwards the ADD, ADDED, REMOVE events like in the resource auto-discovery case). Instead of forwarding the "ADD, ADDED, REMOVE" events, SCP should do a full refresh.

How to test:

JUnit testing - adjust the SCP JUnit tests (they cover the current functionality).

TestSuite test - possibly only in the performance cluster, as the issue is hard to spot when fewer than 1000 resources are used. The TS test could look like this:

1. Move all resources to the spare pool.
2. Remove all resources from the system.
3. Check that all resources were removed.
4. Reset TS.
5. Repeat the test xxx times.

Manually - add several thousand resources to the system (using the printypres.sh tool), remove all resources from the system, and watch the output.

ETC: 3 PD for the basic fix, ? PD for introducing distributed transactions
ATC: 2.5 PD

------- Additional comments from easymf Thu Aug 28 05:42:45 -0700 2008 -------
Forgot to raise the priority and change the version.

------- Additional comments from easymf Fri Aug 29 02:54:43 -0700 2008 -------
Priority was still P4 ..

------- Additional comments from adoerr Fri Nov 21 02:25:27 -0700 2008 -------
Target milestone update.

------- Additional comments from rhierlmeier Tue Apr 7 00:07:04 -0700 2009 -------
It is possible that this problem has been fixed with issue 595, because the full refresh mechanism of SCP has been redesigned.

------- Additional comments from easymf Wed Jul 15 07:26:54 -0700 2009 -------
Comment: the TS test could leverage the simhost feature. By generating the list of resources (using util/printypres.sh from the sdm sources), the test should add no less than 2000 resources to the system and then remove them by calling 'sdmadm rr -r `sdmadm sr`'. The test could have more runlevels (short, long) - in the short runlevel the test is performed only once, in the long runlevel it is performed multiple times (3-5).

------- Additional comments from torsten Fri Nov 27 00:40:10 -0700 2009 -------
Changed milestone to 1.0u5next.

------- Additional comments from torsten Fri Nov 27 03:32:36 -0700 2009 -------
This should be tested in TS with some changes to the cloud_simhost test, possibly moving the resources over to a GE service and, at the end of the test, removing all resources instead of simply removing the service.
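A minimal sketch of the runlevel-aware TS loop described in the two comments above, assuming the simhost resources were already generated (e.g. with util/printypres.sh) and assigned to a spare pool; the RUNS variable, the settle time, and the reset step are illustrative, not existing TS code:

```sh
#!/bin/sh
# Sketch of the proposed TS test loop for issue 522.
RUNS=${RUNS:-1}    # 1 for the short runlevel, 3-5 for the long one

i=1
while [ "$i" -le "$RUNS" ]; do
    # remove every resource in one batch, as in the reported scenario
    sdmadm rr -r `sdmadm sr -o res_ids`
    sleep 60    # arbitrary settle time
    if [ "`sdmadm sr -o res_ids | wc -w`" -ne 0 ]; then
        echo "run $i: resources were left behind (issue 522 reproduced)"
        exit 1
    fi
    # a real TS run would reset the system here before the next pass
    i=`expr $i + 1`
done
```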