[Hedeby users] Re: [GE users] SDM issues

rhierlmeier richard.hierlmeier at sun.com
Fri Jul 31 12:25:52 BST 2009


Hi Chansup,

cbyun wrote:
> Hi Richard,
>
> Your procedure for cleaning up the cloud service corruption worked well.
> I have now made some more progress: the power startup process is working and is able to serve SGE jobs when jobs are submitted.
>
> However, it seems that the SDM cloud service turns on the power of the cloud hosts one at a time, although many more hosts are needed to serve the jobs.
>
> How can this behavior be modified? It was not clear to me when I looked at the cloud service configuration.
>

This is a limitation of SDM version 1.0u3. The GE service can only request
existing resources. The cloud service adds resources to the system only once the
startup script has finished successfully. After running the shutdown script,
it removes the resources from the system. This means other services cannot
request virtual (shut down) hosts; they can only request existing resources.

For the next feature release we will support virtual resources. The Cloud
Adapter will have a resource for every virtual cloud host. If a virtual cloud
host is unassigned from the cloud service, the service starts the provisioning
(by running the startup scripts) of the cloud host. The cloud service will give
the resource away once the provisioning is finished. Other services will be able
to request resources even if the virtual host is stopped (or, in the power
saving use case, if the host is shut down).

In SDM 1.0u3 you can only adjust the caching strategy of the cloud service. Each
cloud service has a resource amount optimizer. This component is responsible for
shutting down/starting up cloud resources. In the configuration of the cloud
adapter you can define how many resources the resource amount optimizer should
keep alive by setting the min/max attributes.

See
http://wikis.sun.com/display/gridengine62u3/Configuring+the+Cloud+Adapter#ConfiguringtheCloudAdapter-ConfiguringtheResourceAmountOptimizer

If you set both min and max to 2, the cloud adapter will always keep 2 resources
alive and the GE service can request 2 resources.
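To illustrate the effect of the min/max attributes, here is a minimal Python
sketch. It is only a toy model of the bounding described above, not the actual
logic of the resource amount optimizer; the function names
clamp_alive_hosts and hosts_to_change are hypothetical and do not exist in SDM.

```python
def clamp_alive_hosts(current_alive: int, min_alive: int, max_alive: int) -> int:
    """Return the number of hosts a min/max-bounded policy would keep alive.

    Toy model only: the real optimizer may take other inputs into account.
    """
    if min_alive > max_alive:
        raise ValueError("min must not exceed max")
    # Raise the count to at least min, and cap it at max.
    return max(min_alive, min(current_alive, max_alive))


def hosts_to_change(current_alive: int, min_alive: int, max_alive: int) -> int:
    """Positive: hosts to start up. Negative: hosts to shut down. Zero: no change."""
    return clamp_alive_hosts(current_alive, min_alive, max_alive) - current_alive


if __name__ == "__main__":
    # With min and max both set to 2, the policy always converges on 2 hosts.
    for alive in (0, 1, 2, 5):
        print(alive, "alive ->", hosts_to_change(alive, 2, 2), "to start(+)/stop(-)")
```

With both bounds set to 2, the sketch asks for hosts to be started or stopped
until exactly 2 are alive, which is the behavior described above.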

>
> 07/30/2009 15:38:55|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
> 07/30/2009 15:42:52|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-1.local, instanceId: i-blade-0-1.local, launchTime: 2009-07-30T15:42:52.000Z] ].
> 07/30/2009 15:44:15|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
> 07/30/2009 15:48:17|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ].
>
> Another problem happened with one particular cloud host.  This particular host (blade-0-2) was in an error state.
>
> # sdmadm sr
> service id              state    type flags usage annotation
> --------------------------------------------------------------
> gesvc2  blade-0-0.local ASSIGNED host       60    Got execd ..
>         blade-0-1.local ASSIGNED host       60    Got execd ..
>         blade-0-2.local ERROR    host SA    1     Resource is used by two or more services
> power   blade-0-2.local ERROR    host       2     Service power:Could not shut down the cloud host.
>
> So I shut down the power service and reset the resource in the gesvc2 service.
>
> However, when I started up the power service after manually turning off the power of blade-0-2, it ran into the following error.
>
> Any suggestions for cleaning up this issue?
>
> 07/30/2009 16:18:05|30|ectionNotificationListener.handleNotification|W|Connection to JVM cs_vm at blade-0-2_local[0] has been closed remotely
> 07/30/2009 16:18:26|31|.cloud.CloudServiceAdapterImpl.doStartService|I|Service power:Started cloud service adapter.
> 07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
> 07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
> 07/30/2009 16:18:53|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
> 07/30/2009 16:18:53|32|e.impl.cloud.CloudResourceAutoRecoverTask.run|W|Service power:Case NOT_REPORTED__REGISTERED__RESOURCE: It seems that the the cloud host blade-0-2.local was shutted down externally. Trying to remove the resource from the system!
> 07/30/2009 16:19:21|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
> 07/30/2009 16:19:21|32|ice.impl.cloud.CloudResourceAmountOptTask.run|I|Service power:Service is in error recovery mode. Skipping resource amount optimization cylce.
>
> 07/30/2009 16:19:49|33|ResourceManagerImpl.maintenanceRemoveResource|W|SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.
> 07/30/2009 16:19:49|34|mpl.cloud.CloudResourceAdapter.prepareDestroy|E|Service power:RemoveResourceCommand for blade-0-2.local returned: SCP power: will not remove resource blade-0-2.local, resource is in ERROR state..
> 07/30/2009 16:19:49|32|pterImpl.startDestroyCloudResourceRAOperation|W|Service power:Unexpected problems trying to remove resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from the system. Caused by Service power:Prepare destroy failed! Caused by Service power:Error executing RemoveResourceCommand for blade-0-2.local. Caused by SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.
>
> 07/30/2009 16:24:21|32|vice.impl.cloud.CloudServiceAdapterImpl$2.run|E|Service power:Failed waiting for removal of resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from system!
>

You ran into a bug. Resetting an ambiguous resource should be possible. However,
the reset_resource command has no -s switch, so the command will always fail
with the error that it cannot reset an ambiguous resource.

For a manual cleanup, please start/install the execd manually on the host
blade-0-2.local. The resource (owned by the GE service) will go into ASSIGNED
state. Finally, call

   sdmadm remove_resource -r blade-0-2.local -s gesvc2

The command will remove the resource from the GE service. The A flag (ambiguous)
will disappear. After restarting the cloud service, the resource at the cloud
service will also go into ASSIGNED state.
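If several resources are affected, the ambiguous ones can be picked out of the
"sdmadm sr" output with a small script. The following Python sketch is only an
illustration: it assumes the column layout shown in the output quoted above,
assumes that the flags column never consists of digits only, and uses the
hypothetical helper names parse_sr_output and suggest_cleanup. It only prints
the remove_resource command and does not run it.

```python
from typing import Dict, List

# Sample taken from the "sdmadm sr" output quoted in this thread.
SAMPLE = """\
service id              state    type flags usage annotation
--------------------------------------------------------------
gesvc2  blade-0-0.local ASSIGNED host       60    Got execd ..
        blade-0-1.local ASSIGNED host       60    Got execd ..
        blade-0-2.local ERROR    host SA    1     Resource is used by two or more services
power   blade-0-2.local ERROR    host       2     Service power:Could not shut down the cloud host.
"""


def parse_sr_output(text: str) -> List[Dict[str, str]]:
    """Parse the resource table into one dict per row (layout is an assumption)."""
    rows: List[Dict[str, str]] = []
    service = ""
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines, the header line and the dashed separator.
        if not tokens or tokens[:2] == ["service", "id"] or set(line.strip()) == {"-"}:
            continue
        # A row starting with whitespace belongs to the previous service.
        if not line[0].isspace():
            service = tokens.pop(0)
        if len(tokens) < 4:
            continue
        res_id, state, res_type = tokens[:3]
        rest = tokens[3:]
        # The flags column may be empty; usage is assumed to be numeric.
        flags = rest.pop(0) if not rest[0].isdigit() else ""
        usage = rest.pop(0) if rest else ""
        rows.append({"service": service, "id": res_id, "state": state,
                     "type": res_type, "flags": flags, "usage": usage,
                     "annotation": " ".join(rest)})
    return rows


def suggest_cleanup(rows: List[Dict[str, str]]) -> List[str]:
    """Build a remove_resource command for every row carrying the A flag."""
    return ["sdmadm remove_resource -r {id} -s {service}".format(**row)
            for row in rows if "A" in row["flags"]]


if __name__ == "__main__":
    for command in suggest_cleanup(parse_sr_output(SAMPLE)):
        print(command)
```

The printed command should only be called after the execd has been started on
the host and the resource has gone into ASSIGNED state, as described above.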

In the current system it is also not possible to remove a resource that is in
ERROR state. However, we have a work package for the next release that will
solve this problem.


Richard


>
> Thanks,
> - Chansup
>
> ------------------------------------------------------
> http://hedeby.sunsource.net/ds/viewMessage.do?dsForumId=160&dsMessageId=210306
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at hedeby.sunsource.net].


--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Richard Hierlmeier           Phone: ++49 (0)941 3075-223
Software Engineering         Fax:   ++49 (0)941 3075-222
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7        mailto: richard.hierlmeier at sun.com
D-93049 Regensburg           http://www.sun.com/grid

Registered office:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court: HRB 161028
Managing Directors: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Chairman of the Supervisory Board: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210408

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


