[Hedeby users] Re: [GE users] SDM issues

cbyun cbyun at ll.mit.edu
Fri Jul 31 14:26:53 BST 2009


Hi Richard,

I appreciate your detailed explanation.
I was able to clean up the second issue that I reported below.
I think I also understand what's going on between the cloud and ge adapter service.

However, my question in the first issue was about the startupCloudHosts hook in the cloud service.

According to your blog:

startupCloudHosts - This scripting hook is used to startup one or more hosts in the cloud. The Cloud Adapter passes the number of needed hosts as the first parameter to this script.
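
As a minimal sketch of that contract, assuming the hook is an ordinary executable that receives the host count as its first argument, a script could read the value like this. The function name requested_host_count is hypothetical, and how the script reports the started hosts back to the Cloud Adapter is not covered here.

```python
from typing import Sequence


def requested_host_count(argv: Sequence[str]) -> int:
    """Return the number of hosts passed as the first script parameter.

    argv is expected to look like sys.argv: script name first, then
    the host count that the Cloud Adapter passes in.
    """
    if len(argv) < 2:
        raise SystemExit("usage: startupCloudHosts <number-of-hosts>")
    count = int(argv[1])  # raises ValueError if the parameter is not a number
    if count < 1:
        raise SystemExit("number of hosts must be at least 1")
    return count


# Example call with a hand-built argument list instead of sys.argv.
print(requested_host_count(["startupCloudHosts", "50"]))  # prints 50
```

Logging this value at the top of the hook is a cheap way to confirm what the adapter really passes.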

However, based on my monitoring of the sdmadm master daemon, the cloud adapter always passes 1 host instead of the 50 hosts that are actually needed. That number comes from each machine having 2 slots and my having submitted 100 tasks.

# sdmadm sslo
service    slo                 quantity urgency request
--------------------------------------------------------------------
gesvc2     fixed_usage         0        0       SLO has no needs
           maxPendingJobs      50       60      true
power      PermanentRequestSLO 10       2       type = "host" & owner = "power"
spare_pool PermanentRequestSLO 5        1       type = "host"
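
To spell out the arithmetic behind the expected 50 hosts, here is a small sketch. It assumes every pending task needs exactly one slot and that all hosts offer the same number of slots. The function name hosts_needed is hypothetical and is not part of SDM.

```python
def hosts_needed(pending_tasks: int, slots_per_host: int) -> int:
    """Hosts required to run all pending tasks at once, rounded up."""
    if slots_per_host <= 0:
        raise ValueError("slots_per_host must be positive")
    # Integer ceiling division: a partly filled host still counts as one host.
    return (pending_tasks + slots_per_host - 1) // slots_per_host


print(hosts_needed(100, 2))  # prints 50
```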


If I read correctly, your script can start up the zones (in my case, turn on the power) for the needed number of hosts. Then, in a loop, it starts installing the SDM managed hosts, which in turn allows the GE service to use them one host after another.

But I observed that the startupCloudHosts hook only wakes up one host at a time instead of multiple hosts. So I am wondering how the cloud startup hook calculates the number of needed hosts.
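
For a rough sense of what one-at-a-time startup costs, the sketch below takes the two "Starting up 1 additional cloud host(s)" timestamps from the log quoted further down (15:38:55 and 15:44:15) and extrapolates to 50 hosts. It assumes every cycle takes as long as that single observed one, which is only a guess from one sample.

```python
from datetime import datetime

LOG_FORMAT = "%m/%d/%Y %H:%M:%S"

# Timestamps of two consecutive "Starting up 1 additional cloud host(s)" entries.
first_start = datetime.strptime("07/30/2009 15:38:55", LOG_FORMAT)
second_start = datetime.strptime("07/30/2009 15:44:15", LOG_FORMAT)

cycle = second_start - first_start  # time between two single-host startups
hosts = 50                          # hosts needed for 100 tasks at 2 slots each
total = cycle * hosts               # assumes every cycle takes equally long

print(cycle)  # prints 0:05:20
print(total)  # prints 4:26:40
```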

Thanks,
- Chansup



> -----Original Message-----
> From: Richard.Hierlmeier at sun.com [mailto:Richard.Hierlmeier at sun.com]
> Sent: Friday, July 31, 2009 7:26 AM
> To: users at hedeby.sunsource.net
> Cc: users at gridengine.sunsource.net
> Subject: Re: [Hedeby users] Re: [GE users] SDM issues
>
> Hi Chansup,
>
> cbyun wrote:
> > Hi Richard,
> >
> > Your procedure for cleaning up the cloud service corruption worked well.
> > I have now made some more progress: the power startup process is working
> > and can serve SGE jobs when jobs are submitted.
> >
> > However, it seems that the SDM cloud service turns on the power of the
> > cloud hosts one at a time, although many more hosts are needed to serve
> > the jobs.
> >
> > How can this behavior be modified? It was not clear to me when I looked
> > at the cloud service configuration.
> >
>
> This is a limitation of SDM version 1.0u3. The GE service can only request
> existing resources. The cloud service adds resources to the system only
> after the startup script has finished successfully, and it removes them
> from the system after running the shutdown script. This means other
> services cannot request virtual resources (hosts that are shut down); they
> can only request existing resources.
>
> For the next feature release we will support virtual resources. The Cloud
> Adapter will have a resource for every virtual cloud host. If a virtual
> cloud host is unassigned from the cloud service, the service starts
> provisioning the cloud host (by running the startup scripts). The cloud
> service will give the resource away once the provisioning is finished.
> Other services will be able to request resources even if the virtual host
> is stopped or, in the power saving use case, shut down.
>
> In SDM 1.0u3 you can only adjust the caching strategy of the cloud service.
> Each cloud service has a resource amount optimizer. This component is
> responsible for shutting down and starting up cloud resources. In the
> configuration of the cloud adapter you can define how many resources the
> resource amount optimizer should keep alive by setting the min/max
> attributes.
>
> See
> http://wikis.sun.com/display/gridengine62u3/Configuring+the+Cloud+Adapter#ConfiguringtheCloudAdapter-ConfiguringtheResourceAmountOptimizer
>
> If you set min/max=2, the cloud adapter will always keep 2 resources alive
> and the GE service can request 2 resources.
>
> >
> > 07/30/2009 15:38:55|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
> > 07/30/2009 15:42:52|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-1.local, instanceId: i-blade-0-1.local, launchTime: 2009-07-30T15:42:52.000Z] ].
> > 07/30/2009 15:44:15|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
> > 07/30/2009 15:48:17|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ].
> >
> > Another problem happened with one particular cloud host. This
> > particular host (blade-0-2) was in an error state.
> >
> > # sdmadm sr
> > service id              state    type flags usage annotation
> > --------------------------------------------------------------
> > gesvc2  blade-0-0.local ASSIGNED host       60    Got execd ..
> >         blade-0-1.local ASSIGNED host       60    Got execd ..
> >         blade-0-2.local ERROR    host SA    1     Resource is used by two or more services
> > power   blade-0-2.local ERROR    host       2     Service power:Could not shut down the cloud host.
> >
> > So I shut down the power service and reset the resource in the gesvc2
> > service.
>
> > However, when I started up the power service after manually turning off
> > the power of blade-0-2, it ran into the following error.
> >
> > Any suggestions for cleaning up this issue?
> >
> > 07/30/2009 16:18:05|30|ectionNotificationListener.handleNotification|W|Connection to JVM cs_vm at blade-0-2_local[0] has been closed remotely
> > 07/30/2009 16:18:26|31|.cloud.CloudServiceAdapterImpl.doStartService|I|Service power:Started cloud service adapter.
> > 07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
> > 07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
> > 07/30/2009 16:18:53|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
> > 07/30/2009 16:18:53|32|e.impl.cloud.CloudResourceAutoRecoverTask.run|W|Service power:Case NOT_REPORTED__REGISTERED__RESOURCE: It seems that the the cloud host blade-0-2.local was shutted down externally. Trying to remove the resource from the system!
> > 07/30/2009 16:19:21|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
> > 07/30/2009 16:19:21|32|ice.impl.cloud.CloudResourceAmountOptTask.run|I|Service power:Service is in error recovery mode. Skipping resource amount optimization cylce.
> >
> > 07/30/2009 16:19:49|33|ResourceManagerImpl.maintenanceRemoveResource|W|SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.
> > 07/30/2009 16:19:49|34|mpl.cloud.CloudResourceAdapter.prepareDestroy|E|Service power:RemoveResourceCommand for blade-0-2.local returned: SCP power: will not remove resource blade-0-2.local, resource is in ERROR state..
> > 07/30/2009 16:19:49|32|pterImpl.startDestroyCloudResourceRAOperation|W|Service power:Unexpected problems trying to remove resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from the system. Caused by Service power:Prepare destroy failed! Caused by Service power:Error executing RemoveResourceCommand for blade-0-2.local. Caused by SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.
> >
> > 07/30/2009 16:24:21|32|vice.impl.cloud.CloudServiceAdapterImpl$2.run|E|Service power:Failed waiting for removal of resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from system!
> >
>
> You ran into a bug. Resetting an ambiguous resource should be possible.
> However, the reset_resource command has no -s switch, so the command will
> always fail with the error that it cannot reset an ambiguous resource.
>
> For a manual cleanup, please start/install the execd manually on the host
> blade-0-2.local. The resource (owned by the GE service) will go into the
> ASSIGNED state. Finally, call
>
>    sdmadm remove_resource -r blade-0-2.local -s gesvc2
>
> The command will remove the resource from the GE service. The A flag
> (ambiguous) will disappear. After restarting the cloud service, the
> resource at the cloud service will also go into the ASSIGNED state.
>
> In the current system it is also not possible to remove a resource that
> is in ERROR state. However, we have a work package for the next release
> that will solve this problem.
>
>
> Richard
>
>
> >
> > Thanks,
> > - Chansup
> >
> > ------------------------------------------------------
> > http://hedeby.sunsource.net/ds/viewMessage.do?dsForumId=160&dsMessageId=210306
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at hedeby.sunsource.net].
>
>
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> -
> Richard Hierlmeier           Phone: ++49 (0)941 3075-223
> Software Engineering         Fax:   ++49 (0)941 3075-222
> Sun Microsystems GmbH
> Dr.-Leo-Ritter-Str. 7      mailto: richard.hierlmeier at sun.com
> D-93049 Regensburg           http://www.sun.com/grid
>
> Sitz der Gesellschaft:
> Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
> Amtsgericht Muenchen: HRB 161028
> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
> Vorsitzender des Aufsichtsrates: Martin Haering
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210408
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210429

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


