[GE users] Re: hedeby communication problem: sdmadm on master cannot find itself

reppep pepper at cbio.mskcc.org
Mon Mar 15 15:11:56 GMT 2010


Richard,

	Thank you -- that's very helpful. Do the JMX and CS ports both need to be accessible from the clients?

Chris

rhierlmeier wrote:
> Hi Chris
> 
> It seems that you mixed up your ports. I bet that 6446 is the port of the JMX 
> server of your Grid Engine qmaster.
> 
> SDM needs its own port. Please retry the SDM master installation with a unique 
> port (option -cs_port).
> 
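
For anyone hitting the same "Address already in use" error: a rough sketch of checking whether a port is already taken before passing it to -cs_port. The function name port_in_use is made up for illustration, and it assumes listener lines in the ADDRESS:PORT form that `ss -ltn` or `netstat -ltn` print on Linux.

```shell
#!/bin/sh
# Hypothetical helper, not part of SDM or Grid Engine.
# Reads listener lines on stdin (e.g. from `ss -ltn` or `netstat -ltn`)
# and succeeds if the given port shows up as ADDRESS:PORT.
port_in_use() {
    port="$1"
    # Require whitespace after the port so that 644 does not match 6446.
    grep -Eq ":${port}[[:space:]]"
}

# Possible usage (not run here):
#   ss -ltn | port_in_use 6446 && echo "6446 is taken, pick another -cs_port"
```

This only reports listeners visible at the moment it runs, so a service that is stopped at the time (a qmaster JMX server, for example) would not show up.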
> You have to throw away the existing system. The fastest way is deleting the 
> directories
> 
>      /etc/sdm/bootstrap/awssge
> and /var/sdm/awssge
> 
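
A minimal sketch of that cleanup, with a dry run as the default so nothing is deleted by accident. The function name sdm_cleanup and the DRY_RUN variable are invented for this example. The spool directory is passed in rather than hard-coded, because the install output quoted further down reports /var/spool/sdm/awssge as the local spool dir, so the path is worth checking first.

```shell
#!/bin/sh
# Hypothetical helper, not part of SDM.
# Removes the bootstrap directory and the local spool directory of one
# SDM system. With DRY_RUN=1 (the default) it only prints the commands.
sdm_cleanup() {
    system="$1"
    spool_dir="$2"
    # Refuse to run with empty arguments, which would target parent dirs.
    if [ -z "$system" ] || [ -z "$spool_dir" ]; then
        echo "usage: sdm_cleanup SYSTEM_NAME LOCAL_SPOOL_DIR" >&2
        return 2
    fi
    for dir in "/etc/sdm/bootstrap/${system}" "${spool_dir}"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "rm -rf ${dir}"
        else
            rm -rf -- "${dir}"
        fi
    done
}

# Possible usage (not run here):
#   sdm_cleanup awssge /var/spool/sdm/awssge            # print only
#   DRY_RUN=0 sdm_cleanup awssge /var/spool/sdm/awssge  # really delete
```

After the cleanup, the original install_master_host command can be rerun with a different value for -cs_port.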
> Richard
> 
> 
> On 03/15/10 15:13, reppep wrote:
>> Richard,
>>
>> 	awssge is the hedeby master host, and it can resolve itself.
>>
>> Thanks,
>>
>> Chris
>>
>>> [root at awssge ~]# time ~hedeby/bin/sdmadm -s awssge sj
>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMI
>>>
>>> real	0m0.553s
>>> user	0m0.802s
>>> sys	0m0.067s
>>> [root at awssge ~]# time ~hedeby/bin/sdmadm -s awssge suj
>>> jvm         host                  result message                       
>>> -----------------------------------------------------------------------
>>> cs_vm       awssge.cbio.mskcc.org ERROR  JVM: cs_vm died during startup.
>>> executor_vm awssge.cbio.mskcc.org ERROR  Timeout. Pid file: /var/spool/sdm/awssg
>>> rp_vm       awssge.cbio.mskcc.org ERROR  Timeout. Pid file: /var/spool/sdm/awssg
>>> Error: Command has generated error.
>>>
>>> real	2m4.736s
>>> user	0m3.755s
>>> sys	0m1.073s
>>> [root at awssge ~]# ps -ef | grep java
>>> hedeby    3927     1  0 Mar08 ?        00:00:01 /usr/java/jre1.6.0_18/bin/java -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy==/var/spool/sdm/awssge/security/java.policy -Djava.security.auth.login.config=/var/spool/sdm/awssge/security/jaas.config -Dcom.sun.grid.grm.bootstrap.systemname=a
>>> root     23610 22595  0 10:10 pts/0    00:00:00 grep java
>>> [root at awssge ~]# !cat
>>> cat /var/spool/sdm/awssge/log/cs_vm-0.log 
>>> 03/08/2010 11:47:38|10|m.bootstrap.JVMImpl$PrivilegedStartAction.run|I|startup jvm (pid=3927)
>>> 03/08/2010 11:47:39|11|.grm.bootstrap.JVMImpl$ComponentLifecycle.run|W|Error in lifecycle of component cs_vm: Cannot start component cs_vm: Can not create MBeanServer at port 6,446: Port already in use: 6446; nested exception is: 
>>>                                                                       |	java.net.BindException: Address already in use
>>> 03/08/2010 11:47:39|12|rid.grm.bootstrap.JVMImpl$ShutdownHandler.run|I|Got shutdown event
>>> [root at awssge ~]# ping -c2 awssge
>>> PING awssge.cbio.mskcc.org (140.163.254.41) 56(84) bytes of data.
>>> 64 bytes from awssge.cbio.mskcc.org (140.163.254.41): icmp_seq=1 ttl=64 time=0.037 ms
>>> 64 bytes from awssge.cbio.mskcc.org (140.163.254.41): icmp_seq=2 ttl=64 time=0.011 ms
>>>
>>> --- awssge.cbio.mskcc.org ping statistics ---
>>> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>>> rtt min/avg/max/mdev = 0.011/0.024/0.037/0.013 ms
>>
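
The cs_vm-0.log excerpt above names the conflicting port in its "Port already in use" message. A small sketch for pulling that number out of a log, assuming the message keeps the wording shown in this thread (the function name ports_in_conflict is made up):

```shell
#!/bin/sh
# Hypothetical helper, not part of SDM.
# Reads a JVM log on stdin and prints each port that appears in a
# "Port already in use: N" message, one per line, without duplicates.
ports_in_conflict() {
    sed -n 's/.*Port already in use: \([0-9][0-9]*\).*/\1/p' | sort -u
}

# Possible usage (not run here):
#   ports_in_conflict < /var/spool/sdm/awssge/log/cs_vm-0.log
```
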
>> rhierlmeier wrote:
>>> Hi Chris,
>>>
>>> What was the output of the sdmadm suj command? How long did it run?
>>>
>>> Can you send me the log file of the JVM? It is stored in 
>>> <local_spool_dir>/log/cs_vm-0.log
>>>
>>> In your scenario the <local_spool_dir> is /var/sdm/awssge
>>>
>>> Please also check that the hostname awssge can be correctly resolved on the 
>>> hedeby master host.
>>>
>>>
>>>
>>> Richard
>>>
>>>
>>>> I am having trouble setting up a test Hedeby installation. sdmadm cannot communicate with the Java processes on the local system.
>>>>
>>>>     I installed SGE with JMX & cluster name 'awssge', then followed <http://wiki.gridengine.info/wiki/index.php/SGE-Hedeby-And-Amazon-EC2#HowTo:_Setup_the_Grid_Engine_6.2_Master> with 'hedeby1' as the SDM_SYSTEM name. "sdmadm suj" did *start* the Java processes, but was unable to report on their health.
>>>>
>>>>     I tried again, following <http://wikis.sun.com/display/gridengine62u3/SDM+Installation+Overview> with 'awssge' as the SDM_MASTER name to match the cluster name, but have the same problems.
>>>>
>>>>
>>>>     My SDM installation command was:
>>>>
>>>>> ~hedeby/bin/sdmadm -s awssge -p system install_master_host -ca_admin_mail '****' -ca_org "Memorial Sloan-Kettering Cancer Center" -ca_org_unit "Computational Biology" -ca_country US -au hedeby -sge_root /common/sge/ -ca_location "New York City" -cs_port 6446 -ca_state "New York"
>>>>     Its output (aside from the license text) was:
>>>>
>>>>> Do you agree with the terms of the license ? (Y/N)y
>>>>> The License has been accepted by the user.
>>>>> Install master host command is using default local spool dir: /var/spool/sdm/awssge
>>>>> A configuration for system "awssge" has been added.
>>>>     The processes started by 'sdmadm suj' are:
>>>>
>>>>> [root at awssge ~]# ps -ef|grep java
>>>>> root      3863  3855 30 11:47 pts/1    00:00:01 /usr/java/default/bin/java -Djava.library.path=/common/sdm/lib/lx-amd64 -Djava.endorsed.dirs=/common/sdm/lib/ext/endorsed -Dcom.sun.grid.grm.management.connectionTimeout=20 -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy=/common/sdm/util/sdmadm.policy -jar /common/sdm/lib/sdm-starter.jar com.sun.grid.grm.cli.SdmAdm suj
>>>>> hedeby    3927  3926 30 11:47 ?        00:00:01 /usr/java/jre1.6.0_18/bin/java -Djava.security.manager=java.rmi.RMISecurityManager -Djava.security.policy==/var/spool/sdm/awssge/security/java.policy -Djava.security.auth.login.config=/var/spool/sdm/awssge/security/jaas.config -Dcom.sun.grid.grm.bootstrap.systemname=awssge -Dcom.sun.grid.grm.bootstrap.jvmname=cs_vm -Dcom.sun.grid.grm.bootstrap.localspool=/var/spool/sdm/awssge -Dcom.sun.grid.grm.bootstrap.dist=/common/sdm -Dcom.sun.grid.grm.bootstrap.csInfo=awssge.cbio.mskcc.org:6446 -Dcom.sun.grid.grm.bootstrap.preferencesType=SYSTEM -Djava.util.logging.manager=com.sun.grid.grm.util.GrmLogManager -Djava.library.path=/common/sdm/lib/lx-amd64::/common/sdm/lib/lx-amd64: -Dcom.sun.grid.grm.bootstrap.isCS=true -cp /common/sdm/lib/sdm-security.jar:/common/sdm/lib/sdm-starter.jar:/common/sdm/lib/sdm-cloud-adapter.jar:/common/sdm/lib/sdm-upgrade.jar:/common/sdm/lib/sdm-common.jar:/common/sdm/lib/sdm-ge-adapter.jar:/common/sdm/lib/ext/jaxb-impl.jar:/common/sdm/lib/ext/activation.jar:/common/sdm/lib/ext/jsr173_1.0_api.jar -Djava.rmi.server.codebase=file:/common/sdm/lib/sdm-security.jar file:/common/sdm/lib/sdm-starter.jar file:/common/sdm/lib/sdm-cloud-adapter.jar file:/common/sdm/lib/sdm-upgrade.jar file:/common/sdm/lib/sdm-common.jar file:/common/sdm/lib/sdm-ge-adapter.jar file:/common/sdm/lib/ext/jaxb-impl.jar file:/common/sdm/lib/ext/activation.jar file:/common/sdm/lib/ext/jsr173_1.0_api.jar -Djava.endorsed.dirs=/common/sdm/lib/ext/endorsed -Djava.rmi.server.hostname=awssge.cbio.mskcc.org -Xmx128M -Dcom.sun.grid.grm.management.connectionTimeout=60 com.sun.grid.grm.bootstrap.JVMImpl
>>>>> root      3993  2439  0 11:47 pts/0    00:00:00 grep java
>>>>     But sdmadm cannot see them:
>>>>
>>>>> [root at awssge ~]# ~hedeby/bin/sdmadm -s awssge sj
>>>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: awssge
>>>>> [root at awssge ~]# ~hedeby/bin/sdmadm -s awssge sc
>>>>> Error: Cannot connect to JVM cs_vm at awssge_cbio_mskcc_org: Failed to retrieve RMIServer stub: javax.naming.NameNotFoundException: awssge
>>>>> [root at awssge ~]# ~hedeby/bin/sdmadm -s awssge sbc
>>>>> system type   host                  port properties
>>>>> ---------------------------------------------------
>>>>> awssge SYSTEM awssge.cbio.mskcc.org 6446
>>>>> [root at awssge ~]# 
>>>>     What am I doing wrong?
>>>>
>>>> Thanks,
>>>>
>>>> Chris Pepper 
>>
> 
> 


-- 
Chris Pepper:                <http://cbio.mskcc.org/>
                             <http://www.extrapepperoni.com/>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=248754
