[GE users] gamess info

Reuti reuti at staff.uni-marburg.de
Thu Aug 24 16:05:23 BST 2006


Hi,

it looks much better now, at least.

On 23.08.2006 at 21:17, lukacm at pdx.edu wrote:

> It is missing the hostname.
>
> qrsh is deployed now. I can (finally) capture the log of the
> ddikick process:
>
>  ddikick.x: finished with -ddi argument.
>  ddikick.x: finished with -dditree argument
>  ddikick.x: finished with -ppn argument
>  ddikick.x: finished with -scr argument.
>
>  Distributed Data Interface kickoff program.
>  Initiating 4 compute processes on 4 nodes to run the following  
> command:
>  /home/visible/apps/gamess/gamess.01.x exam20
>
>  ddikick.x: kickoff host = compute-0-5.local
>  Master Kickoff Host compute-0-5.local is accepting connections on  
> port 33170.
>  Awaiting connections from 8 GDDI processes.
>  ddikick.x : Thread created on compute-0-5.local:33170 to accept  
> connections.
>  ddikick.x: execvp command line: rsh compute-0-12.local
> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/ 
> gamess.01.x
> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> compute-0-12.lo
> cal:cpus=1 compute-0-9.local:cpus=1 -dditree compute-0-5.local  
> 33170 2 4 rsh
> -scr /tmp/3840.1.gamess.q
>  ddikick.x: execvp command line: rsh compute-0-4.local
> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/ 
> gamess.01.x
> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> compute-0-12.loc
> al:cpus=1 compute-0-9.local:cpus=1 -dditree compute-0-5.local 33170  
> 1 2 rsh -scr
> /tmp/3840.1.gamess.q
> Attemping to create DDI process 0 on local node 0.
> DDI Process 0 Command Line: /home/visible/apps/gamess/gamess.01.x  
> exam20 -ddi
> compute-0-5.local 33170 0 0 4 4 compute-0-5.local:cpus=1
> compute-0-4.local:cpus=1 compute-0-12.local:cpus=1  
> compute-0-9.local:cpus=1
> Attemping to create DDI process 4 on local node 0.
> DDI Process 4 Command Line: /home/visible/apps/gamess/gamess.01.x  
> exam20 -ddi
> compute-0-5.local 33170 0 4 4 4 compute-0-5.local:cpus=1
> compute-0-4.local:cpus=1 compute-0-12.local:cpus=1  
> compute-0-9.local:cpus=1
> /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-12.local
> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/ 
> gamess.01.x
> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1 comp
> ute-0-12.local:cpus=1 compute-0-9.local:cpus=1 -dditree  
> compute-0-5.local 33170
> 2 4 rsh -scr /tmp/3840.1.gamess.q
> /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-4.local
> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/ 
> gamess.01.x
> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1  
> compu
> te-0-12.local:cpus=1 compute-0-9.local:cpus=1 -dditree  
> compute-0-5.local 33170 1
> 2 rsh -scr /tmp/3840.1.gamess.q
>  ddikick.x: 4 bytes received; $lu remaining.
>  ddikick.x: 4 bytes received; $lu remaining.
>  ddikick.x : 0 checked in; receiving via port 33177 (Remaining=7).
>  ddikick.x: 4 bytes received; $lu remaining.
>  ddikick.x: 4 bytes received; $lu remaining.
>  ddikick.x : 4 checked in; receiving via port 33179 (Remaining=6).
>  ddikick.x: Sending kill signal to DDI processes.
>  ddikick.x: Sending kill signal to DDI process 0.
>  ddikick.x: Sending kill signal to DDI process 4.
>  DDI Process 0: terminated upon request.
>  DDI Process 4: terminated upon request.
>  ddikick.x: Execution terminated due to error(s).
>
> and in the error log I have the same as before:
>
>
> error: commlib error: access denied (client IP resolved to host  
> name "". This is
> not identical to clients host name "")

Okay, now we have to investigate this. Are all the hostnames known
on all machines, via /etc/hosts or e.g. NIS? Are the SGE tools
gethostbyname and gethostbyaddr working as expected and returning
reasonable results on every node, for each of the other nodes?
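
As a rough cross-check of the forward and reverse lookups, something like the following could be run on each node with the names of all nodes as arguments. It is only a sketch: it goes through the system resolver via Python's socket module, not through the SGE utilities themselves, so it will not reflect any SGE-specific host aliasing. The function name check_host and its injectable forward/reverse parameters are made up for illustration.

```python
import socket
import sys


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name to an IP and back; return (ok, detail).

    forward/reverse default to the system resolver and can be swapped
    for fakes when testing (hypothetical parameters, not an SGE API).
    """
    try:
        ip = forward(name)
    except OSError as exc:  # socket.gaierror/herror are OSError subclasses
        return False, f"forward lookup of {name} failed: {exc}"
    try:
        primary, aliases, _addresses = reverse(ip)
    except OSError as exc:
        return False, f"reverse lookup of {ip} failed: {exc}"
    # The name we started from should come back as primary name or alias.
    if name not in {primary, *aliases}:
        return False, f"{name} -> {ip} -> {primary!r} (mismatch)"
    return True, f"{name} -> {ip} -> {primary}"


if __name__ == "__main__":
    # Usage: python check_hosts.py compute-0-5.local compute-0-4.local ...
    for host in sys.argv[1:]:
        ok, detail = check_host(host)
        print("OK  " if ok else "FAIL", detail)
```

A node for which the reverse lookup returns an empty or different name would be a candidate for the "client IP resolved to host name" error quoted above.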

> error: executing task of job 3840 failed: failed sending task to
> execd at compute-0-12.local: can't find connection
> error: commlib error: access denied (client IP resolved to host  
> name "". This is
> not identical to clients host name "")
> error: executing task of job 3840 failed: failed sending task to
> execd at compute-0-4.local: can't find connection
>  ddikick.x: Timed out while waiting for DDI processes to check in.
>  ddikick.x: Fatal error detected.
>  The error is most likely to be in the application, so check for
>  input errors, disk space, memory needs, application bugs, etc.
>  ddikick.x will now clean up all processes, and exit...
> connect to address 10.5.255.249: Connection refused
> connect to address 10.5.255.249: Connection refused
> trying normal rsh (/usr/bin/rsh)
> compute-0-5.local: Connection refused
> connect to address 10.5.255.250: Connection refused
> connect to address 10.5.255.250: Connection refused
> trying normal rsh (/usr/bin/rsh)
> compute-0-4: Connection refused
> connect to address 10.5.255.242: Connection refused
> connect to address 10.5.255.242: Connection refused
> trying normal rsh (/usr/bin/rsh)
> compute-0-12: Connection refused
> connect to address 10.5.255.245: Connection refused
> connect to address 10.5.255.245: Connection refused
> trying normal rsh (/usr/bin/rsh)
> compute-0-9: Connection refused

I wonder why the hostname here again has no .local, and why the
full path /usr/bin/rsh is used. I agree that this will not work.

>
> However, there is a mix of rsh and qrsh. In the ddikick log both
> appear:
>
>  ddikick.x: execvp command line: rsh compute-0-12.local
> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/ 
> gamess.01.x
> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> compute-0-12.lo
> cal:cpus=1 compute-0-9.local:cpus=1 -dditree compute-0-5.local  
> 33170 2 4 rsh
> -scr /tmp/3840.1.gamess.q
>
> this is definitely not working

This rsh will be caught by the rsh-wrapper. This is as it should be.

> and later there is
>
> /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-12.local
> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/ 
> gamess.01.x
> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1 comp
> ute-0-12.local:cpus=1 compute-0-9.local:cpus=1 -dditree  
> compute-0-5.local 33170
> 2 4 rsh -scr /tmp/3840.1.gamess.q

This is the message from the wrapper; it's fine.
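
To illustrate the translation the wrapper performs, here is a minimal sketch of how the arguments of an "rsh <host> <command>" call could be turned into the qrsh line seen in the log. It is not the actual wrapper script; the function name qrsh_command is hypothetical, the defaults for sge_root and arch are simply taken from the paths in the log above, and any rsh options a real wrapper would have to handle are ignored.

```python
def qrsh_command(rsh_args, sge_root="/opt/gridengine", arch="lx26-amd64"):
    """Build a qrsh argument list from the arguments given to rsh.

    rsh_args is everything after the word "rsh": the target host
    followed by the command to run there. Illustration only.
    """
    if len(rsh_args) < 2:
        raise ValueError("expected: <host> <command> [args...]")
    host, *command = rsh_args
    qrsh = f"{sge_root}/bin/{arch}/qrsh"
    # -V exports the environment, -inherit starts a task inside the
    # already running parallel job instead of submitting a new one.
    return [qrsh, "-V", "-inherit", host, *command]


if __name__ == "__main__":
    import sys
    # Print the command line the wrapper would execute.
    print(" ".join(qrsh_command(sys.argv[1:])))
```

So ddikick.x still believes it is calling rsh, and the "rsh" that appears near the end of the qrsh line is just one of ddikick's own arguments passed through unchanged.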

-- Reuti

