[Hedeby users] Re: [GE users] SDM issues

rhierlmeier richard.hierlmeier at sun.com
Fri Jul 31 15:46:07 BST 2009


Hi Chansup,

cbyun wrote:
> Hi Richard,
>
> I appreciate your detailed explanation.
> I was able to clean up the second issue that I reported below.

Great

> I think I also understand what's going on between the cloud and ge adapter service.
>
> However, my question in the first issue was about the startupCloudHosts hook in the cloud service.
>

OK, we had a misunderstanding.

> According to your blog, it says that:
>
> startupCloudHosts - This scripting hook is used to startup one or more hosts in the cloud. The Cloud Adapter passes the number of needed hosts as the first parameter to this script.
>
> However, based on my monitoring of the sdmadm master daemon, the cloud adapter always passes 1 host instead of the number actually needed, which is 50 hosts. This is based on the fact that each machine has 2 slots and I have submitted 100 tasks.
>
> # sdmadm sslo
> service    slo                 quantity urgency request
> --------------------------------------------------------------------
> gesvc2     fixed_usage         0        0       SLO has no needs
>            maxPendingJobs      50       60      true
> power      PermanentRequestSLO 10       2       type = "host" & owner = "power"
> spare_pool PermanentRequestSLO 5        1       type = "host"
>
>
> If I read correctly, your script can start up the zones (in my case, turn on the power) for the needed number of hosts. Then, in a loop, it starts installing SDM managed hosts, which in turn allows them to be used by the GE service one host after another.
>

Correct.

> But I observed that the startupCloudHosts hook only wakes up one host at a time instead of multiple hosts. So I am wondering how the cloud startup hook calculates the number of needed hosts.
>

Try setting the min/max parameter in the cloud adapter config to, let's say, 5.
In a first step the Cloud Adapter will start one cloud host. The first cloud
host is treated as a VPN server and is always started as a single first host.
In a second step it will start the 4 missing hosts in a single action (by
calling startupCloudHosts 4).
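
As a rough illustration of how a startupCloudHosts hook might consume that
first parameter, here is a minimal Python sketch. Only the hook name and the
meaning of the first parameter come from this thread. The helper names
(parse_host_count, power_on, start_hosts), the host pool, and the printed
message are assumptions made for illustration; what the Cloud Adapter expects
the script to report back is not covered here.

```python
import sys


def parse_host_count(argv):
    """Return the requested host count from the first script parameter."""
    if len(argv) < 2:
        raise ValueError("expected the number of hosts as first parameter")
    count = int(argv[1])
    if count < 1:
        raise ValueError("host count must be positive, got %d" % count)
    return count


def power_on(hostname):
    """Placeholder (assumption): replace with the site-specific power-on step."""
    print("powering on %s" % hostname, file=sys.stderr)


def start_hosts(count, idle_hosts):
    """Power on up to `count` hosts from `idle_hosts`; return the ones started."""
    started = list(idle_hosts[:count])
    for host in started:
        power_on(host)
    return started
```

A real hook script would end with something like
start_hosts(parse_host_count(sys.argv), pool), where pool is the site's own
list of powered-off machines (hypothetical here).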

The GE adapter (if it requests 100 hosts) will first get the 1 host (VPN server)
and later the 4 additional hosts. If the cloud adapter has no more active hosts
(= resources), the resource request of the GE adapter cannot be fulfilled. The GE
adapter must wait until the resource amount optimizer has provided the
additional hosts. When a resource is moved out of the cloud adapter, the resource
amount optimizer tries to fill the gap. It will start up additional cloud hosts,
and it can start more than one host in a single action.
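
The sequence described above can be modelled in a few lines of Python. This is
a simplified sketch, not the optimizer's real algorithm: it assumes min and max
are set to the same value, that the only special case is the lone first host,
and that the adapter knows how many hosts it currently holds. The function
names hosts_needed and startup_batches are hypothetical.

```python
def hosts_needed(pending_tasks, slots_per_host):
    """Hosts required to run all tasks at once (ceiling division)."""
    return -(-pending_tasks // slots_per_host)


def startup_batches(active_hosts, target):
    """Host counts passed to successive startupCloudHosts calls.

    `target` stands for the min/max value; `active_hosts` is the number of
    cloud hosts the adapter currently holds.
    """
    batches = []
    if active_hosts == 0 and target > 0:
        # The very first host (VPN server) is always started on its own.
        batches.append(1)
        active_hosts = 1
    missing = target - active_hosts
    if missing > 0:
        # The remaining gap is filled in a single action.
        batches.append(missing)
    return batches


print(hosts_needed(100, 2))    # 50
print(startup_batches(0, 5))   # [1, 4]
print(startup_batches(3, 5))   # [2]
```

In this model the number passed to the hook depends on the min/max setting and
on how many hosts the adapter currently holds, not on the 50 hosts that 100
tasks on 2-slot machines would need.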

Have a nice weekend

Richard

> Thanks,
> - Chansup
>
>
>
>> -----Original Message-----
>> From: Richard.Hierlmeier at sun.com [mailto:Richard.Hierlmeier at sun.com]
>> Sent: Friday, July 31, 2009 7:26 AM
>> To: users at hedeby.sunsource.net
>> Cc: users at gridengine.sunsource.net
>> Subject: Re: [Hedeby users] Re: [GE users] SDM issues
>>
>> Hi Chansup,
>>
>> cbyun wrote:
>>> Hi Richard,
>>>
>>> Your procedure for cleaning up the cloud service corruption worked well.
>>> Now I was able to make some more progress: the power startup process is working and is able to serve SGE jobs when jobs are submitted.
>>> However, it seems that the SDM cloud service turns on the power of the cloud hosts one at a time, although many more hosts are needed to serve the jobs.
>>> How can this behavior be modified? It was not clear to me when I looked at the cloud service configuration.
>>
>> This is a limitation of SDM version 1.0u3. The GE service can only request
>> existing resources. The cloud service adds resources to the system only
>> once the startup script has finished successfully. After running the
>> shutdown script it removes the resources from the system. This means other
>> services cannot request virtual (shut-down) hosts; they can only request
>> existing resources.
>>
>> For the next feature release we will support virtual resources. The Cloud
>> Adapter will have a resource for every virtual cloud host. If a virtual
>> cloud host is unassigned from the cloud service, the service starts the
>> provisioning of the cloud host (by running the startup scripts). The cloud
>> service will give the resource away once the provisioning is finished.
>> Other services will be able to request resources even if the virtual host
>> is stopped (or, in the power saving use case, if the host is shut down).
>>
>> In SDM 1.0u3 you can only play with the caching strategy of the cloud
>> service. Each cloud service has a resource amount optimizer. This component
>> is responsible for shutting down and starting up cloud resources. In the
>> configuration of the cloud adapter you can define how many resources the
>> resource amount optimizer should keep alive by setting the min/max
>> attributes (see
>> http://wikis.sun.com/display/gridengine62u3/Configuring+the+Cloud+Adapter#ConfiguringtheCloudAdapter-ConfiguringtheResourceAmountOptimizer).
>>
>> If you set min/max=2, the cloud adapter will always keep 2 resources alive
>> and the GE service can request 2 resources.
>>
>>
>>> 07/30/2009 15:38:55|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
>>> 07/30/2009 15:42:52|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-1.local, instanceId: i-blade-0-1.local, launchTime: 2009-07-30T15:42:52.000Z] ].
>>> 07/30/2009 15:44:15|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Starting up 1 additional cloud host(s)
>>> 07/30/2009 15:48:17|21|.cloud.CloudScriptInterface.startupCloudHosts|I|Service power: Started up 1 cloud hosts: [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ].
>>>
>>> Another problem happened with one particular cloud host. This particular host (blade-0-2) was in an error state.
>>> # sdmadm sr
>>> service id              state    type flags usage annotation
>>> --------------------------------------------------------------
>>> gesvc2  blade-0-0.local ASSIGNED host       60    Got execd ..
>>>         blade-0-1.local ASSIGNED host       60    Got execd ..
>>>         blade-0-2.local ERROR    host SA    1     Resource is used by two or more services
>>> power   blade-0-2.local ERROR    host       2     Service power:Could not shut down the cloud host.
>>>
>>> So I shut down the power service and reset resource in the gesvc2 service.
>>
>>
>>
>>> However, when I start up power service after manually turning off the power of blade-0-2, it got into the following error.
>>> Any suggestions for cleaning up this issue?
>>>
>>> 07/30/2009 16:18:05|30|ectionNotificationListener.handleNotification|W|Connection to JVM cs_vm at blade-0-2_local[0] has been closed remotely
>>> 07/30/2009 16:18:26|31|.cloud.CloudServiceAdapterImpl.doStartService|I|Service power:Started cloud service adapter.
>>> 07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
>>> 07/30/2009 16:18:26|28|.grm.util.EventListenerSupport$Worker.deliver|E|Event delivery problem: Timer already cancelled.
>>> 07/30/2009 16:18:53|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
>>> 07/30/2009 16:18:53|32|e.impl.cloud.CloudResourceAutoRecoverTask.run|W|Service power:Case NOT_REPORTED__REGISTERED__RESOURCE: It seems that the the cloud host blade-0-2.local was shutted down externally. Trying to remove the resource from the system!
>>> 07/30/2009 16:19:21|32|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:The registered set of cloud host does not match the reported set! Registered mismatches [[hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z] ]. Reported mismatches []
>>> 07/30/2009 16:19:21|32|ice.impl.cloud.CloudResourceAmountOptTask.run|I|Service power:Service is in error recovery mode. Skipping resource amount optimization cylce.
>>> 07/30/2009 16:19:49|33|ResourceManagerImpl.maintenanceRemoveResource|W|SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.
>>> 07/30/2009 16:19:49|34|mpl.cloud.CloudResourceAdapter.prepareDestroy|E|Service power:RemoveResourceCommand for blade-0-2.local returned: SCP power: will not remove resource blade-0-2.local, resource is in ERROR state..
>>> 07/30/2009 16:19:49|32|pterImpl.startDestroyCloudResourceRAOperation|W|Service power:Unexpected problems trying to remove resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from the system. Caused by Service power:Prepare destroy failed! Caused by Service power:Error executing RemoveResourceCommand for blade-0-2.local. Caused by SCP power: will not remove resource blade-0-2.local, resource is in ERROR state.
>>> 07/30/2009 16:24:21|32|vice.impl.cloud.CloudServiceAdapterImpl$2.run|E|Service power:Failed waiting for removal of resource [hostname: blade-0-2.local, instanceId: i-blade-0-2.local, launchTime: 2009-07-30T15:48:17.000Z]  from system!
>>
>> You ran into a bug. Resetting an ambiguous resource should be possible.
>> However, the reset_resource command has no -s switch, so the command will
>> always fail with the error that it cannot reset an ambiguous resource.
>>
>> For a manual cleanup, please start/install the execd manually on the host
>> blade-0-2.local. The resource (owned by the GE service) will go into
>> ASSIGNED state. Finally call
>>
>>    sdmadm remove_resource -r blade-0-2.local -s gesvc2
>>
>> The command will remove the resource from the GE service. The A flag
>> (ambiguous) will disappear. After restarting the cloud service, the
>> resource at the cloud service will also go into ASSIGNED state.
>>
>> In the current system it is also not possible to remove a resource which
>> is in ERROR state. However, we have a work package for the next release
>> that will solve this problem.
>>
>>
>> Richard
>>
>>
>>> Thanks,
>>> - Chansup
>>>
>>> ------------------------------------------------------
>>> http://hedeby.sunsource.net/ds/viewMessage.do?dsForumId=160&dsMessageId=210306
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at hedeby.sunsource.net].
>>
>>
>> --
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> Richard Hierlmeier           Phone: ++49 (0)941 3075-223
>> Software Engineering         Fax:   ++49 (0)941 3075-222
>> Sun Microsystems GmbH
>> Dr.-Leo-Ritter-Str. 7             mailto: richard.hierlmeier at sun.com
>> D-93049 Regensburg           http://www.sun.com/grid
>>
>> Sitz der Gesellschaft:
>> Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
>> Amtsgericht Muenchen: HRB 161028
>> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
>> Vorsitzender des Aufsichtsrates: Martin Haering
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=210408
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


