[GE users] SDM issues

torsten torsten.blix at sun.com
Thu Jul 23 09:25:50 BST 2009


Hi Chansup,

we are assuming that you are trying out the power saving use case, is that 
correct? For this use case the VPN configuration is irrelevant.

On 07/22/09 19:46, cbyun wrote:
> Issue 2: How to resolve the VPN server issue?
> 
> I used the following config for cloud service:
> 
>     <cloud_adapter:vpn xsi:type="cloud_adapter:OpenVPNConfig"
>                        vpnBinDir="/usr/sbin"
>                        vpnConfigDir="/tmp"
>                        vpnRequired="true"/>
> 
> Which is taken from the Richard's blog: http://blogs.sun.com/rhierlmeier/entry/using_sdm_cloud_adapter_to
> 
> What does the following actually mean?
> Service power:Problem: VPN server is corrupted! Registered but server-less resources:

The cloud adapter is designed for usage with a VPN. The power saving 
feature in a local network does not need the VPN; however, the VPN 
configuration parameters are mandatory anyway. The logic connected with 
handling the VPN server is also used for the power saving use case. This 
logic says that if the "VPN server" is not reachable, all other "cloud" 
hosts are not reachable either. "VPN server" here simply stands for a 
specific host (the first discovered resource); no VPN is actually running.

The cloud adapter regularly observes the state of all resources. For 
this purpose it calls the showCloudHostsScript. This script returns a 
list of all resources and their state.

> 07/22/2009 13:37:22|48|vice.impl.cloud.CloudSnapshot.checkCloudState|W|Service power:Problem: VPN server is corrupted! Registered but server-less resources: [[hostname: blade-0-0.local, instanceId: i-blade-0-0, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-1.local, instanceId: i-blade-0-1, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-2.local, instanceId: i-blade-0-2, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-3.local, instanceId: i-blade-0-3, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-4.local, instanceId: i-blade-0-4, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-5.local, instanceId: i-blade-0-5, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-6.local, instanceId: i-blade-0-6, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-7.local, instanceId: i-blade-0-7, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-8.local, instanceId: i-blade-0-8, launchTime: 2009-07-21T09:56:03.000Z] , [hostname: blade-0-9.local, instanceId: i-blade-0-9, launchTime: 2009-07-21T09:56:03.000Z] ].
> 07/22/2009 13:37:22|48|ice.impl.cloud.CloudResourceAmountOptTask.run|I|Service power:Service is in error recovery mode. Skipping resource amount optimization cylce.

The above error message indicates that the showCloudHostsScript somehow 
reported an error for the "VPN server". The cloud adapter therefore 
assumes that all cloud resources cannot be used (error recovery mode).
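
To make the rule described above concrete, here is a minimal Python 
sketch. It is not the cloud adapter's code. The function name, the 
(hostname, state) tuples and the state values "ok"/"error" are 
assumptions for illustration only; the real output format of the 
showCloudHostsScript is not shown in this mail.

```python
def error_recovery_mode(hosts):
    """Return True if the whole service should be treated as unusable.

    hosts: list of (hostname, state) tuples in discovery order.
    The state values "ok" and "error" are hypothetical placeholders.
    """
    if not hosts:
        # Nothing discovered yet, so there is no "VPN server" to check.
        return False
    # The first discovered resource plays the role of the "VPN server".
    _vpn_server_name, vpn_server_state = hosts[0]
    # If that host reports an error, all other hosts count as unreachable.
    return vpn_server_state != "ok"
```

With this rule a single bad state on the first host is enough to put 
the service into error recovery mode, even if all other hosts are fine.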

> # sdmadm sr
> service    id              state    type flags usage annotation
> ---------------------------------------------------------------------------------------------
> gesvc2     blade-0-1.local ASSIGNED host SA    1     Got execd update event
>            blade-0-2.local ASSIGNED host SA    1     Got execd update event
>            blade-0-3.local ASSIGNED host SA    1     Got execd update event
>            blade-0-4.local ASSIGNED host SA    1     Got execd update event
>            blade-0-5.local ASSIGNED host SA    1     Got execd update event
>            blade-0-6.local ASSIGNED host SA    1     Got execd update event
>            blade-0-8.local ASSIGNED host SA    1     Got execd update event
>            blade-0-9.local ASSIGNED host SA    1     Got execd update event
> power      blade-0-0.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-1.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-2.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-3.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-4.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-5.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-6.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-7.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-8.local ASSIGNED host A     2     Resource is used by two or more services
>            blade-0-9.local ASSIGNED host A     2     Resource is used by two or more services
> spare_pool blade-0-0.local ASSIGNED host A     1     Resource is used by two or more services
>            blade-0-7.local ASSIGNED host A     1     Resource is used by two or more services

This output indicates some further problems. All resources in the GE 
service are static (flag S). Unless you set this flag by hand, it 
indicates that the SDM executor component is not running on these 
resources. The GE adapter will not be able to uninstall static hosts. 
It seems that the GE adapter has autodiscovered them. Does the execd 
start automatically at boot time? For the power saving use case it 
must not.

However, we are assuming that the SDM executor on the power saving hosts 
IS installed in such a way that the SDM executor is started 
automatically at boot time or that there exists a post-startup-hook 
script that starts the SDM executor. The sample implementation of the 
power saving scripts does not describe this fact.

In your case, both the power saving service and the GE service 
autodiscovered the same resources. If a resource is owned by more than 
one service the resource provider treats the resource as ambiguous (flag A).
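
If you want to list the affected resources without reading the table by 
hand, a small Python sketch like the following could help. It assumes 
the "sdmadm sr" layout shown in your output (service name in the first 
column, left blank on continuation rows, resource id in the second 
column). The function names and the SAMPLE text are only illustrative.

```python
from collections import defaultdict

# Shortened sample in the layout of the "sdmadm sr" output quoted above.
SAMPLE = """\
service    id              state    type flags usage annotation
----------------------------------------------------------------
gesvc2     blade-0-1.local ASSIGNED host SA    1     Got execd update event
power      blade-0-0.local ASSIGNED host A     2     Resource is used by two or more services
           blade-0-1.local ASSIGNED host A     2     Resource is used by two or more services
spare_pool blade-0-0.local ASSIGNED host A     1     Resource is used by two or more services
"""

def owners_by_resource(sr_output):
    """Map each resource id to the list of services that own it."""
    owners = defaultdict(list)
    service = None
    for line in sr_output.splitlines():
        fields = line.split()
        if not fields:
            continue                      # blank line
        if fields[:2] == ["service", "id"]:
            continue                      # header row
        if set(line.strip()) == {"-"}:
            continue                      # separator row
        if line[0].isspace():
            resource = fields[0]          # continuation row, same service
        elif len(fields) >= 2:
            service, resource = fields[0], fields[1]
        else:
            continue                      # unexpected row, ignore it
        if service is not None:
            owners[resource].append(service)
    return dict(owners)

def ambiguous_resources(sr_output):
    """Return only the resources owned by more than one service."""
    return {res: svcs for res, svcs in owners_by_resource(sr_output).items()
            if len(svcs) > 1}
```

For the sample above, ambiguous_resources(SAMPLE) reports blade-0-1.local 
as owned by gesvc2 and power, and blade-0-0.local as owned by power and 
spare_pool.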

The recommended way to set up the power saving service is to not have 
any resources in the system when installing the power saving service.

To solve your problem do the following steps:

1. Please ensure that on all hosts the managed host installation has 
been executed (best with the -autostart flag set).

2. Start up the SDM executors on all managed hosts. This should remove 
the static flag from the resources in the GE service.

3. Remove the (still ambiguous) resources from the GE service with
     sdmadm rr -r <resource_name> -s gesvc2
    This should remove the ambiguous status from the remaining resource 
of this name.

4. Implement a post-startup-hook for the power saving scripts:
    a) if the executor is started at boot time, the post-startup-hook 
should block until the executor is up and running.
    b) otherwise the post-startup-hook must start up the SDM executor and 
wait until it is up and running.
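
The "block until the executor is up and running" part of step 4 could 
be sketched in Python as a simple polling loop. This is only a sketch 
under assumptions: how to check that the SDM executor is up is not 
described here, so the check is passed in as a command of your choice. 
The names wait_until and command_succeeds and the default timeout and 
interval values are made up for this example.

```python
import subprocess
import time

def wait_until(check, timeout=300.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout seconds have passed.

    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

def command_succeeds(argv):
    """Build a check that runs argv and reports whether it exited with 0."""
    def check():
        try:
            result = subprocess.run(argv,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        except OSError:
            # Command missing or not executable: treat as "not up yet".
            return False
        return result.returncode == 0
    return check

# Hypothetical usage inside a post-startup-hook; the script path below is a
# placeholder for whatever test tells you the executor is reachable:
#   ok = wait_until(command_succeeds(["/path/to/check_executor.sh"]))
```

For case b) the hook would first start the executor and then call the 
same loop. The hook should exit with a non-zero status if wait_until 
returns False, so that the failure becomes visible.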

Cheers,
Richard and Torsten

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=209141
