[GE users] Re: hedeby communication problem: sdmadm on master cannot find itself

reppep pepper at cbio.mskcc.org
Mon Mar 15 14:13:45 GMT 2010


Richard,

	awssge is the hedeby master host, and it can resolve itself.
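
	For anyone repeating the check, name resolution can also be confirmed outside of ping with a few lines of Python. This is only a sketch, not part of Hedeby or sdmadm; the helper name resolve_host is made up for illustration, and "localhost" stands in for "awssge" so that it runs anywhere.

```python
import socket


def resolve_host(name):
    """Return the sorted list of IP addresses that `name` resolves to.

    Hypothetical helper, not part of sdmadm. An empty list means the
    resolver could not find the name.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Replace "localhost" with "awssge" when running on the master host.
    print(resolve_host("localhost"))
```

	An empty list for the master's short name or fully qualified name would point at a resolver problem; a non-empty list only shows that the name resolves, not that the JVM is reachable.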

Thanks,

Chris

> [root@awssge ~]# time ~hedeby/bin/sdmadm -s awssge sj
> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMI
> 
> real	0m0.553s
> user	0m0.802s
> sys	0m0.067s
> [root@awssge ~]# time ~hedeby/bin/sdmadm -s awssge suj
> jvm         host                  result message                       
> -----------------------------------------------------------------------
> cs_vm       awssge.cbio.mskcc.org ERROR  JVM: cs_vm died during startup.
> executor_vm awssge.cbio.mskcc.org ERROR  Timeout. Pid file: /var/spool/sdm/awssg
> rp_vm       awssge.cbio.mskcc.org ERROR  Timeout. Pid file: /var/spool/sdm/awssg
> Error: Command has generated error.
> 
> real	2m4.736s
> user	0m3.755s
> sys	0m1.073s
> [root@awssge ~]# ps -ef | grep java
> hedeby    3927     1  0 Mar08 ?        00:00:01 /usr/java/jre1.6.0_18/bin/java -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy==/var/spool/sdm/awssge/security/java.policy -Djava.security.auth.login.config=/var/spool/sdm/awssge/security/jaas.config -Dcom.sun.grid.grm.bootstrap.systemname=a
> root     23610 22595  0 10:10 pts/0    00:00:00 grep java
> [root@awssge ~]# !cat
> cat /var/spool/sdm/awssge/log/cs_vm-0.log 
> 03/08/2010 11:47:38|10|m.bootstrap.JVMImpl$PrivilegedStartAction.run|I|startup jvm (pid=3927)
> 03/08/2010 11:47:39|11|.grm.bootstrap.JVMImpl$ComponentLifecycle.run|W|Error in lifecycle of component cs_vm: Cannot start component cs_vm: Can not create MBeanServer at port 6,446: Port already in use: 6446; nested exception is: 
>                                                                       |	java.net.BindException: Address already in use
> 03/08/2010 11:47:39|12|rid.grm.bootstrap.JVMImpl$ShutdownHandler.run|I|Got shutdown event
> [root@awssge ~]# ping -c2 awssge
> PING awssge.cbio.mskcc.org (140.163.254.41) 56(84) bytes of data.
> 64 bytes from awssge.cbio.mskcc.org (140.163.254.41): icmp_seq=1 ttl=64 time=0.037 ms
> 64 bytes from awssge.cbio.mskcc.org (140.163.254.41): icmp_seq=2 ttl=64 time=0.011 ms
> 
> --- awssge.cbio.mskcc.org ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
> rtt min/avg/max/mdev = 0.011/0.024/0.037/0.013 ms
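
The cs_vm log above reports "Port already in use: 6446" with a java.net.BindException. The same condition can be reproduced from outside the JVM by trying to bind a TCP socket to that port. The following is a minimal sketch, not anything shipped with Hedeby; port_is_free is a hypothetical helper name, and 6446 is simply the -cs_port value from the install command.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper for illustration. It returns False on
    "address already in use" and re-raises any other error
    (for example, missing permission for ports below 1024).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return False
            raise
        return True


if __name__ == "__main__":
    # 6446 is the cs_port passed to install_master_host in this thread.
    print(port_is_free(6446))
```

If this prints False on the master while no Hedeby JVM is supposed to be running, some other process holds the port; the sketch does not say which one, so a tool that lists listening sockets would still be needed to find it.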


rhierlmeier wrote:
> Hi Chris,
> 
> What was the output of the sdmadm suj command? How long did it run?
> 
> Can you send me the output of the log file of the JVM. It is stored in 
> <local_spool_dir>/log/cs_vm-0.log
> 
> In your scenario the <local_spool_dir> is /var/sdm/awssge
> 
> Please check also that the hostname awssge can be correctly resolved on the 
> hedeby master host.
> 
> 
> 
> Richard
> 
> 
>> I am having trouble setting up a test Hedeby installation. sdmadm cannot communicate with the Java processes on the local system.
>>
>>     I installed SGE with JMX & cluster name 'awssge', then followed <http://wiki.gridengine.info/wiki/index.php/SGE-Hedeby-And-Amazon-EC2#HowTo:_Setup_the_Grid_Engine_6.2_Master> with 'hedeby1' as the SDM_SYSTEM name. "sdmadm suj" did *start* the Java processes, but was unable to report on their health.
>>
>>     I tried again, following <http://wikis.sun.com/display/gridengine62u3/SDM+Installation+Overview> with 'awssge' as the SDM_MASTER name to match the cluster name, but have the same problems.
>>
>>
>>     My SDM installation command was:
>>
>>> ~hedeby/bin/sdmadm -s awssge -p system install_master_host -ca_admin_mail '****' -ca_org "Memorial Sloan-Kettering Cancer Center" -ca_org_unit "Computational Biology" -ca_country US -au hedeby -sge_root /common/sge/ -ca_location "New York City" -cs_port 6446 -ca_state "New York"
>>     Its output (aside from the license text) was:
>>
>>> Do you agree with the terms of the license ? (Y/N)y
>>> The License has been accepted by the user.
>>> Install master host command is using default local spool dir: /var/spool/sdm/awssge
>>> A configuration for system "awssge" has been added.
>>
>>     The processes started by 'sdmadm suj' are:
>>
>>> [root@awssge ~]# ps -ef|grep java
>>> root      3863  3855 30 11:47 pts/1    00:00:01 /usr/java/default/bin/java -Djava.library.path=/common/sdm/lib/lx-amd64 -Djava.endorsed.dirs=/common/sdm/lib/ext/endorsed -Dcom.sun.grid.grm.management.connectionTimeout=20 -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy=/common/sdm/util/sdmadm.policy -jar /common/sdm/lib/sdm-starter.jar com.sun.grid.grm.cli.SdmAdm suj
>>> hedeby    3927  3926 30 11:47 ?        00:00:01 /usr/java/jre1.6.0_18/bin/java -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy==/var/spool/sdm/awssge/security/java.policy -Djava.security.auth.login.config=/var/spool/sdm/awssge/security/jaas.config -Dcom.sun.grid.grm.bootstrap.systemname=awssge -Dcom.sun.grid.grm.bootstrap.jvmname=cs_vm -Dcom.sun.grid.grm.bootstrap.localspool=/var/spool/sdm/awssge -Dcom.sun.grid.grm.bootstrap.dist=/common/sdm -Dcom.sun.grid.grm.bootstrap.csInfo=awssge.cbio.mskcc.org:6446 -Dcom.sun.grid.grm.bootstrap.preferencesType=SYSTEM -Djava.util.logging.manager=com.sun.grid.grm.util.GrmLogManager -Djava.library.path=/common/sdm/lib/lx-amd64::/common/sdm/lib/lx-amd64: -Dcom.sun.grid.grm.bootstrap.isCS=true -cp /common/sdm/lib/sdm-security.jar:/common/sdm/lib/sdm-starter.jar:/common/sdm/lib/sdm-cloud-adapter.jar:/common/sdm/lib/sdm-upgrade.jar:/common/sdm/lib/sdm-common.jar:/common/sdm/lib/sdm-ge-adapter.jar:/common/sdm/lib/ext/jaxb-impl.jar:/common/sdm/lib/ext/activation.jar:/common/sdm/lib/ext/jsr173_1.0_api.jar -Djava.rmi.server.codebase=file:/common/sdm/lib/sdm-security.jar file:/common/sdm/lib/sdm-starter.jar file:/common/sdm/lib/sdm-cloud-adapter.jar file:/common/sdm/lib/sdm-upgrade.jar file:/common/sdm/lib/sdm-common.jar file:/common/sdm/lib/sdm-ge-adapter.jar file:/common/sdm/lib/ext/jaxb-impl.jar file:/common/sdm/lib/ext/activation.jar file:/common/sdm/lib/ext/jsr173_1.0_api.jar -Djava.endorsed.dirs=/common/sdm/lib/ext/endorsed -Djava.rmi.server.hostname=awssge.cbio.mskcc.org -Xmx128M -Dcom.sun.grid.grm.management.connectionTimeout=60 com.sun.grid.grm.bootstrap.JVMImpl
>>> root      3993  2439  0 11:47 pts/0    00:00:00 grep java
>>     But sdmadm cannot see them:
>>
>>> [root@awssge ~]# ~hedeby/bin/sdmadm -s awssge sj
>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: awssge
>>> [root@awssge ~]# ~hedeby/bin/sdmadm -s awssge sc
>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: awssge
>>> [root@awssge ~]# ~hedeby/bin/sdmadm -s awssge sbc
>>> system type   host                  port properties
>>> ---------------------------------------------------
>>> awssge SYSTEM awssge.cbio.mskcc.org 6446
>>> [root@awssge ~]#
>>     What am I doing wrong?
>>
>> Thanks,
>>
>> Chris Pepper 
> 
> 


-- 
Chris Pepper:                <http://cbio.mskcc.org/>
                             <http://www.extrapepperoni.com/>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248742
