[GE users] Re: hedeby communication problem: sdmadm on master cannot find itself

reppep pepper at cbio.mskcc.org
Tue Mar 16 13:55:36 GMT 2010


Richard,

	Sorry, I meant the AWS SGE compute nodes; they are not really 'clients'. Port 6446 is open in our firewall, so I will put CS there and JMX on 6447 (which is not accessible from the compute nodes, because the firewall blocks it).
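
	Before reinstalling, it may be worth confirming that the chosen ports are actually free on the master, since the earlier failure was a BindException on 6446. A minimal sketch, assuming a Linux host where `ss` (from iproute2) is available; port_is_listening is a hypothetical helper, not part of sdmadm or SGE:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sdmadm or SGE): succeeds if PORT shows up
# as a local listening port in `ss -ltn`-style output read from stdin.
port_is_listening() {
    local port="$1"
    # Field 4 of `ss -ltn` output is "Local Address:Port"; compare the part
    # after the last colon, so IPv6 addresses such as [::]:22 also work.
    awk -v p="$port" '
        NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }
        END    { exit found ? 0 : 1 }
    '
}

# Usage (assumption: run on the master host before install_master_host):
#   for p in 6446 6447; do
#       if ss -ltn | port_is_listening "$p"; then
#           echo "port $p is already in use"
#       fi
#   done
```

	The helper reads from stdin rather than calling `ss` itself, so it can be tried against saved output from another host as well.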

Thanks again!

Chris

rhierlmeier wrote:
> Hi Chris,
> On 03/15/10 16:11, reppep wrote:
>> Richard,
>>
>> 	Thank you -- that's very helpful. Do the JMX and CS ports both need to be accessible from the clients?
> 
> What client?
> 
> The sdmadm command and the JVMs on the managed hosts need access to the CS port 
> of the hedeby master. The CS port is the only static port of the SDM system. CS 
> stands for Configuration Service. CS knows all components of the SDM system.
> 
> On the SDM side, only the Grid Engine Service Adapter uses the qmaster JMX port. 
> On the Grid Engine side, it is used by SGE inspect.
> 
> Richard
> 
>> Chris
>>
>> rhierlmeier wrote:
>>> Hi Chris
>>>
>>> it seems that you mixed up your ports. I bet that 6446 is the port of the JMX 
>>> server of your Grid Engine qmaster.
>>>
>>> SDM needs its own port. Please retry the SDM master installation with a unique 
>>> port (option -cs_port).
>>>
>>> You have to throw away the existing system. The fastest way is deleting the 
>>> directories
>>>
>>>      /etc/sdm/bootstrap/awssge
>>> and /var/sdm/awssge
>>>
>>> Richard
>>>
>>>
>>> On 03/15/10 15:13, reppep wrote:
>>>> Richard,
>>>>
>>>> 	awssge is the hedeby master host, and it can resolve itself.
>>>>
>>>> Thanks,
>>>>
>>>> Chris
>>>>
>>>>> [root@awssge ~]# time ~hedeby/bin/sdmadm -s awssge sj
>>>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMI
>>>>>
>>>>> real	0m0.553s
>>>>> user	0m0.802s
>>>>> sys	0m0.067s
>>>>> [root@awssge ~]# time ~hedeby/bin/sdmadm -s awssge suj
>>>>> jvm         host                  result message                       
>>>>> -----------------------------------------------------------------------
>>>>> cs_vm       awssge.cbio.mskcc.org ERROR  JVM: cs_vm died during startup.
>>>>> executor_vm awssge.cbio.mskcc.org ERROR  Timeout. Pid file: /var/spool/sdm/awssg
>>>>> rp_vm       awssge.cbio.mskcc.org ERROR  Timeout. Pid file: /var/spool/sdm/awssg
>>>>> Error: Command has generated error.
>>>>>
>>>>> real	2m4.736s
>>>>> user	0m3.755s
>>>>> sys	0m1.073s
>>>>> [root@awssge ~]# ps -ef | grep java
>>>>> hedeby    3927     1  0 Mar08 ?        00:00:01 /usr/java/jre1.6.0_18/bin/java -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy==/var/spool/sdm/awssge/security/java.policy -Djava.security.auth.login.config=/var/spool/sdm/awssge/security/jaas.config -Dcom.sun.grid.grm.bootstrap.systemname=a
>>>>> root     23610 22595  0 10:10 pts/0    00:00:00 grep java
>>>>> [root@awssge ~]# !cat
>>>>> cat /var/spool/sdm/awssge/log/cs_vm-0.log 
>>>>> 03/08/2010 11:47:38|10|m.bootstrap.JVMImpl$PrivilegedStartAction.run|I|startup jvm (pid=3927)
>>>>> 03/08/2010 11:47:39|11|.grm.bootstrap.JVMImpl$ComponentLifecycle.run|W|Error in lifecycle of component cs_vm: Cannot start component cs_vm: Can not create MBeanServer at port 6,446: Port already in use: 6446; nested exception is: 
>>>>>                                                                       |	java.net.BindException: Address already in use
>>>>> 03/08/2010 11:47:39|12|rid.grm.bootstrap.JVMImpl$ShutdownHandler.run|I|Got shutdown event
>>>>> [root@awssge ~]# ping -c2 awssge
>>>>> PING awssge.cbio.mskcc.org (140.163.254.41) 56(84) bytes of data.
>>>>> 64 bytes from awssge.cbio.mskcc.org (140.163.254.41): icmp_seq=1 ttl=64 time=0.037 ms
>>>>> 64 bytes from awssge.cbio.mskcc.org (140.163.254.41): icmp_seq=2 ttl=64 time=0.011 ms
>>>>>
>>>>> --- awssge.cbio.mskcc.org ping statistics ---
>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>>>> rtt min/avg/max/mdev = 0.011/0.024/0.037/0.013 ms
>>>> rhierlmeier wrote:
>>>>> Hi Chris,
>>>>>
>>>>> What was the output of the sdmadm suj command? How long did it run?
>>>>>
>>>>> Can you send me the contents of the JVM log file? It is stored in 
>>>>> <local_spool_dir>/log/cs_vm-0.log
>>>>>
>>>>> In your scenario the <local_spool_dir> is /var/sdm/awssge
>>>>>
>>>>> Please also check that the hostname awssge can be correctly resolved on the 
>>>>> hedeby master host.
>>>>>
>>>>>
>>>>>
>>>>> Richard
>>>>>
>>>>>
>>>>>> I am having trouble setting up a test Hedeby installation. sdmadm cannot communicate with the Java processes on the local system.
>>>>>>
>>>>>>     I installed SGE with JMX & cluster name 'awssge', then followed <http://wiki.gridengine.info/wiki/index.php/SGE-Hedeby-And-Amazon-EC2#HowTo:_Setup_the_Grid_Engine_6.2_Master> with 'hedeby1' as the SDM_SYSTEM name. "sdmadm suj" did *start* the Java processes, but was unable to report on their health.
>>>>>>
>>>>>>     I tried again, following <http://wikis.sun.com/display/gridengine62u3/SDM+Installation+Overview> with 'awssge' as the SDM_MASTER name to match the cluster name, but have the same problems.
>>>>>>
>>>>>>
>>>>>>     My SDM installation command was:
>>>>>>
>>>>>>> ~hedeby/bin/sdmadm -s awssge -p system install_master_host -ca_admin_mail '****' -ca_org "Memorial Sloan-Kettering Cancer Center" -ca_org_unit "Computational Biology" -ca_country US -au hedeby -sge_root /common/sge/ -ca_location "New York City" -cs_port 6446 -ca_state "New York"
>>>>>>     Its output (aside from the license text) was:
>>>>>>
>>>>>>> Do you agree with the terms of the license ? (Y/N)y
>>>>>>> The License has been accepted by the user.
>>>>>>> Install master host command is using default local spool dir: /var/spool/sdm/awssge
>>>>>>> A configuration for system "awssge" has been added.
>>>>>>     The processes started by 'sdmadm suj' are:
>>>>>>
>>>>>>> [root@awssge ~]# ps -ef|grep java
>>>>>>> root      3863  3855 30 11:47 pts/1    00:00:01 /usr/java/default/bin/java -Djava.library.path=/common/sdm/lib/lx-amd64 -Djava.endorsed.dirs=/common/sdm/lib/ext/endorsed -Dcom.sun.grid.grm.management.connectionTimeout=20 -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy=/common/sdm/util/sdmadm.policy -jar /common/sdm/lib/sdm-starter.jar com.sun.grid.grm.cli.SdmAdm suj
>>>>>>> hedeby    3927  3926 30 11:47 ?        00:00:01 /usr/java/jre1.6.0_18/bin/java -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy==/var/spool/sdm/awssge/security/java.policy -Djava.security.auth.login.config=/var/spool/sdm/awssge/security/jaas.config -Dcom.sun.grid.grm.bootstrap.systemname=awssge -Dcom.sun.grid.grm.bootstrap.jvmname=cs_vm -Dcom.sun.grid.grm.bootstrap.localspool=/var/spool/sdm/awssge -Dcom.sun.grid.grm.bootstrap.dist=/common/sdm -Dcom.sun.grid.grm.bootstrap.csInfo=awssge.cbio.mskcc.org:6446 -Dcom.sun.grid.grm.bootstrap.preferencesType=SYSTEM -Djava.util.logging.manager=com.sun.grid.grm.util.GrmLogManager -Djava.library.path=/common/sdm/lib/lx-amd64::/common/sdm/lib/lx-amd64: -Dcom.sun.grid.grm.bootstrap.isCS=true -cp /common/sdm/lib/sdm-security.jar:/common/sdm/lib/sdm-starter.jar:/common/sdm/lib/sdm-cloud-adapter.jar:/common/sdm/lib/sdm-upgrade.jar:/common/sdm/lib/sdm-common.jar:/common/sdm/lib/sdm-ge-adapter.jar:/common/sdm/lib/ext/jaxb-impl.jar:/common/sdm/lib/ext/activation.jar:/common/sdm/lib/ext/jsr173_1.0_api.jar -Djava.rmi.server.codebase=file:/common/sdm/lib/sdm-security.jar file:/common/sdm/lib/sdm-starter.jar file:/common/sdm/lib/sdm-cloud-adapter.jar file:/common/sdm/lib/sdm-upgrade.jar file:/common/sdm/lib/sdm-common.jar file:/common/sdm/lib/sdm-ge-adapter.jar file:/common/sdm/lib/ext/jaxb-impl.jar file:/common/sdm/lib/ext/activation.jar file:/common/sdm/lib/ext/jsr173_1.0_api.jar -Djava.endorsed.dirs=/common/sdm/lib/ext/endorsed -Djava.rmi.server.hostname=awssge.cbio.mskcc.org -Xmx128M -Dcom.sun.grid.grm.management.connectionTimeout=60 com.sun.grid.grm.bootstrap.JVMImpl
>>>>>>> root      3993  2439  0 11:47 pts/0    00:00:00 grep java
>>>>>>     But sdmadm cannot see them:
>>>>>>
>>>>>>> [root@awssge ~]# ~hedeby/bin/sdmadm -s awssge sj
>>>>>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: awssge
>>>>>>> [root@awssge ~]# ~hedeby/bin/sdmadm -s awssge sc
>>>>>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: awssge
>>>>>>> [root@awssge ~]# ~hedeby/bin/sdmadm -s awssge sbc
>>>>>>> system type   host                  port properties
>>>>>>> ---------------------------------------------------
>>>>>>> awssge SYSTEM awssge.cbio.mskcc.org 6446
>>>>>>> [root@awssge ~]# 
>>>>>>     What am I doing wrong?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Chris Pepper 


-- 
Chris Pepper:                <http://cbio.mskcc.org/>
                             <http://www.extrapepperoni.com/>
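
For anyone who lands on this thread with the same symptom: the root cause was visible in cs_vm-0.log as a "Port already in use" message. A rough sketch of pulling the conflicting port number(s) out of such a log, assuming the message text matches the excerpt quoted above; ports_in_use_from_log is a hypothetical helper, not an sdmadm feature:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sdmadm): print each distinct port that an
# SDM JVM log reports as already bound, one per line.
ports_in_use_from_log() {
    local logfile="$1"
    # Match the "Port already in use: NNNN" text seen in the quoted log
    # excerpt and keep only the number; sort -u drops repeated reports.
    sed -n 's/.*Port already in use: \([0-9][0-9]*\).*/\1/p' "$logfile" | sort -u
}

# Usage (the path is the local spool dir reported by install_master_host above):
#   ports_in_use_from_log /var/spool/sdm/awssge/log/cs_vm-0.log
```

If this prints a port, the port passed to -cs_port collides with something else on the host (here, the qmaster JMX server).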

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248961
