[Hedeby users] Re: [GE users] SDM issues

cbyun cbyun at ll.mit.edu
Thu Jul 30 21:28:11 BST 2009


Hi Richard,

Your procedure for cleaning up the cloud service corruption worked well.
I have now made some more progress: the power startup process is working, and the hosts are able to serve SGE jobs once jobs are submitted.

However, the SDM cloud service seems to power on the cloud hosts one at a time, even though many more hosts are needed to serve the jobs.

How can this behavior be changed? It was not clear to me from looking at the cloud service configuration.


07/30/2009 15:38:55|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
07/30/2009 15:42:52|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-1.local, instanceId: i-blade-0-1.local, launchTime: 2009-07-30T15:42:52.000Z] ].
07/30/2009 15:44:15|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
07/30/2009 15:48:17|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ].
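
To put a number on the serial startup, the gap between each "Starting up" line and the matching "Started up" line can be read off the log. The following is only a minimal sketch in Python, assuming the pipe-delimited log format shown above; the function name `startup_durations` is made up for illustration and is not part of SDM.

```python
from datetime import datetime

# The four log lines quoted above, copied verbatim.
LOG = """\
07/30/2009 15:38:55|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
07/30/2009 15:42:52|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-1.local, instanceId: i-blade-0-1.local, launchTime: 2009-07-30T15:42:52.000Z] ].
07/30/2009 15:44:15|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
07/30/2009 15:48:17|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ].
"""


def startup_durations(log_text):
    """Return the seconds between each 'Starting up' line and the next 'Started up' line."""
    durations = []
    pending = None  # timestamp of the most recent unmatched 'Starting up' line
    for line in log_text.splitlines():
        # Assumed layout: timestamp|thread|source|level|message
        fields = line.split("|", 4)
        if len(fields) < 5:
            continue
        stamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
        message = fields[4]
        if "Starting up" in message:
            pending = stamp
        elif "Started up" in message and pending is not None:
            durations.append(int((stamp - pending).total_seconds()))
            pending = None
    return durations


print(startup_durations(LOG))  # [237, 242]
```

Each host takes roughly four minutes to come up, and the second startup request is only issued after the first host has been started.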

Another problem occurred with one particular cloud host: blade-0-2 ended up in an error state.

# sdmadm sr
service id              state    type flags usage annotation
--------------------------------------------------------------
gesvc2  blade-0-0.local ASSIGNED host       60    Got execd ..
        blade-0-1.local ASSIGNED host       60    Got execd ..
        blade-0-2.local ERROR    host SA    1     Resource is used by two or more services
power   blade-0-2.local ERROR    host       2     Service power:Could not shut down the cloud host.
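
When the resource list is long, the rows in a given state can be picked out of this output with a short script. This is a rough sketch in Python, assuming the whitespace-separated layout shown above, where a blank service column means "same service as the previous row"; the function name `resources_in_state` is hypothetical and not an SDM command.

```python
# Output of "sdmadm sr" as quoted above, with trailing whitespace removed.
SDMADM_SR_OUTPUT = """\
service id              state    type flags usage annotation
--------------------------------------------------------------
gesvc2  blade-0-0.local ASSIGNED host       60    Got execd ..
        blade-0-1.local ASSIGNED host       60    Got execd ..
        blade-0-2.local ERROR    host SA    1     Resource is used by two or more services
power   blade-0-2.local ERROR    host       2     Service power:Could not shut down the cloud host.
"""


def resources_in_state(table_text, wanted_state="ERROR"):
    """Return (service, resource id) pairs for rows whose state column matches."""
    matches = []
    service = None  # carried over for rows with an empty service column
    for line in table_text.splitlines():
        # Skip blank lines, the header row, and the separator row.
        if not line.strip() or line.startswith("service id") or line.startswith("---"):
            continue
        tokens = line.split()
        if not line[0].isspace():
            # A row that starts in column 0 names a new service.
            service = tokens.pop(0)
        if len(tokens) >= 2 and tokens[1] == wanted_state:
            matches.append((service, tokens[0]))
    return matches


for service, resource in resources_in_state(SDMADM_SR_OUTPUT):
    print(service, resource)
```

For the output above this lists blade-0-2.local twice, once under gesvc2 and once under power, which matches the "Resource is used by two or more services" annotation.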

So I shut down the power service and reset the resource in the gesvc2 service.
However, when I started the power service again after manually powering off blade-0-2, it ran into the following errors.

Any suggestions for cleaning up this issue?

07/30/2009 16:18:05|30|ectionNotificationListener.handleNotification|W|Connection to JVM cs_vm at blade-0-2_local[0] has been closed remotely
07/30/2009 16:18:26|31|.cloud.CloudServiceAdapterImpl.doStartService|I|Service power:Started cloud service adapter.
07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
07/30/2009 16:18:53|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
07/30/2009 16:18:53|32|e.impl.cloud.CloudResourceAutoRecoverTask.run|W|Service power:Case NOT_REPORTED__REGISTERED__RESOURCE: It seems that the the cloud host blade-0-2.local was shutted down externally. Trying to remove the resource from the system!
07/30/2009 16:19:21|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
07/30/2009 16:19:21|32|ice.impl.cloud.CloudResourceAmountOptTask.run|I|Service power:Service is in error recovery mode. Skipping resource amount optimization cylce.

07/30/2009 16:19:49|33|ResourceManagerImpl.maintenanceRemoveResource|W|SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.
07/30/2009 16:19:49|34|mpl.cloud.CloudResourceAdapter.prepareDestroy|E|Service power:RemoveResourceCommand for blade-0-2.local returned: SCP power: will not remove resource blade-0-2.local, resource is in ERROR state..
07/30/2009 16:19:49|32|pterImpl.startDestroyCloudResourceRAOperation|W|Service power:Unexpected problems trying to remove resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from the system. Caused by Service power:Prepare destroy failed! Caused by Service power:Error executing RemoveResourceCommand for blade-0-2.local. Caused by SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.

07/30/2009 16:24:21|32|vice.impl.cloud.CloudServiceAdapterImpl$2.run|E|Service power:Failed waiting for removal of resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from system!


Thanks,
- Chansup

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210307
