[GE users] gamess info

Reuti reuti at staff.uni-marburg.de
Thu Aug 24 18:28:24 BST 2006


Hi,

On 24.08.2006 at 19:10, lukacm at pdx.edu wrote:

> Yes,
>
> all tools by SGE are working. The script is working with ssh
> (partially, because the exports are not working for some reason);
> i.e. the jobs are correctly started on all remote nodes and do
> check in.
>
> With correct parameters SGE tools are spiffy and no problem.
>
> The fallback to /usr/bin/rsh is a feature of ddikick: if it does not
> find rsh, it tries all other system default rsh scripts.

I grepped the GAMESS source for the word "trying" from the error
message and didn't find any hint. - Reuti


>
> martin
>
>
> Quoting Reuti <reuti at staff.uni-marburg.de>:
>
>> Hi,
>>
>> but it looks much better now at least.
>>
>> On 23.08.2006 at 21:17, lukacm at pdx.edu wrote:
>>
>>> It is missing the hostname.
>>>
>>> qrsh is deployed. I can now (finally) capture the log of the
>>> ddikick process:
>>>
>>>  ddikick.x: finished with -ddi argument.
>>>  ddikick.x: finished with -dditree argument
>>>  ddikick.x: finished with -ppn argument
>>>  ddikick.x: finished with -scr argument.
>>>
>>>  Distributed Data Interface kickoff program.
>>>  Initiating 4 compute processes on 4 nodes to run the following
>>> command:
>>>  /home/visible/apps/gamess/gamess.01.x exam20
>>>
>>>  ddikick.x: kickoff host = compute-0-5.local
>>>  Master Kickoff Host compute-0-5.local is accepting connections on
>>> port 33170.
>>>  Awaiting connections from 8 GDDI processes.
>>>  ddikick.x : Thread created on compute-0-5.local:33170 to accept
>>> connections.
>>>  ddikick.x: execvp command line: rsh compute-0-12.local
>>> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
>>> compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
>>> -dditree compute-0-5.local 33170 2 4 rsh -scr /tmp/3840.1.gamess.q
>>>  ddikick.x: execvp command line: rsh compute-0-4.local
>>> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
>>> compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
>>> -dditree compute-0-5.local 33170 1 2 rsh -scr /tmp/3840.1.gamess.q
>>> Attemping to create DDI process 0 on local node 0.
>>> DDI Process 0 Command Line: /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi
>>> compute-0-5.local 33170 0 0 4 4 compute-0-5.local:cpus=1
>>> compute-0-4.local:cpus=1 compute-0-12.local:cpus=1
>>> compute-0-9.local:cpus=1
>>> Attemping to create DDI process 4 on local node 0.
>>> DDI Process 4 Command Line: /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi
>>> compute-0-5.local 33170 0 4 4 4 compute-0-5.local:cpus=1
>>> compute-0-4.local:cpus=1 compute-0-12.local:cpus=1
>>> compute-0-9.local:cpus=1
>>> /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-12.local
>>> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
>>> compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
>>> -dditree compute-0-5.local 33170 2 4 rsh -scr /tmp/3840.1.gamess.q
>>> /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-4.local
>>> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
>>> compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
>>> -dditree compute-0-5.local 33170 1 2 rsh -scr /tmp/3840.1.gamess.q
>>>  ddikick.x: 4 bytes received; $lu remaining.
>>>  ddikick.x: 4 bytes received; $lu remaining.
>>>  ddikick.x : 0 checked in; receiving via port 33177 (Remaining=7).
>>>  ddikick.x: 4 bytes received; $lu remaining.
>>>  ddikick.x: 4 bytes received; $lu remaining.
>>>  ddikick.x : 4 checked in; receiving via port 33179 (Remaining=6).
>>>  ddikick.x: Sending kill signal to DDI processes.
>>>  ddikick.x: Sending kill signal to DDI process 0.
>>>  ddikick.x: Sending kill signal to DDI process 4.
>>>  DDI Process 0: terminated upon request.
>>>  DDI Process 4: terminated upon request.
>>>  ddikick.x: Execution terminated due to error(s).
>>>
>>> and in the error log I have the same as before:
>>>
>>>
>>> error: commlib error: access denied (client IP resolved to host
>>> name "". This is
>>> not identical to clients host name "")
>>
>> Okay, now we have to investigate this. Are all the hostnames known
>> on all machines via /etc/hosts or e.g. NIS? Are the SGE tools
>> gethostbyname and gethostbyaddr working as expected and giving
>> reasonable results on all nodes, and for all nodes on each one?
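
A quick way to run the same kind of check from each node is a round-trip test: resolve the name to an address, resolve the address back to a name, and compare. What follows is only a sketch under assumptions. The function name check_host and its injectable resolver arguments are made up for illustration, and it goes through the system resolver via Python's socket module rather than the SGE utilities, so it will not reflect anything SGE-specific such as host aliasing.

```python
import socket


def check_host(name, resolve=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name -> IP -> names and report whether the round trip matches.

    The resolver functions are parameters only so the check can be
    exercised without a network; by default the system resolver is used.
    """
    result = {"host": name, "ip": None, "names": [], "ok": False, "error": None}
    try:
        result["ip"] = resolve(name)
        primary, aliases, _addresses = reverse(result["ip"])
    except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
        result["error"] = str(exc)
        return result
    result["names"] = [primary, *aliases]
    # The name is considered consistent only if the reverse lookup
    # returns exactly the name we started from (as primary or alias).
    result["ok"] = name in result["names"]
    return result
```

Calling check_host("compute-0-12.local") and check_host("compute-0-12") on every node, for every node, would show whether the short and the .local forms map back to the same name everywhere. An empty or mismatched reverse name would be worth comparing against the commlib "access denied" message above.
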
>>
>>> error: executing task of job 3840 failed: failed sending task to
>>> execd at compute-0-12.local: can't find connection
>>> error: commlib error: access denied (client IP resolved to host
>>> name "". This is
>>> not identical to clients host name "")
>>> error: executing task of job 3840 failed: failed sending task to
>>> execd at compute-0-4.local: can't find connection
>>>  ddikick.x: Timed out while waiting for DDI processes to check in.
>>>  ddikick.x: Fatal error detected.
>>>  The error is most likely to be in the application, so check for
>>>  input errors, disk space, memory needs, application bugs, etc.
>>>  ddikick.x will now clean up all processes, and exit...
>>> connect to address 10.5.255.249: Connection refused
>>> connect to address 10.5.255.249: Connection refused
>>> trying normal rsh (/usr/bin/rsh)
>>> compute-0-5.local: Connection refused
>>> connect to address 10.5.255.250: Connection refused
>>> connect to address 10.5.255.250: Connection refused
>>> trying normal rsh (/usr/bin/rsh)
>>> compute-0-4: Connection refused
>>> connect to address 10.5.255.242: Connection refused
>>> connect to address 10.5.255.242: Connection refused
>>> trying normal rsh (/usr/bin/rsh)
>>> compute-0-12: Connection refused
>>> connect to address 10.5.255.245: Connection refused
>>> connect to address 10.5.255.245: Connection refused
>>> trying normal rsh (/usr/bin/rsh)
>>> compute-0-9: Connection refused
>>
>> I wonder why the hostname here again has no .local, and why the
>> full path to /usr/bin/rsh is being used. I agree that this will not
>> work.
>>
>>>
>>> However, there is a mix of rsh and qrsh. In the ddikick log there
>>> are both:
>>>
>>>  ddikick.x: execvp command line: rsh compute-0-12.local
>>> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
>>> compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
>>> -dditree compute-0-5.local 33170 2 4 rsh -scr /tmp/3840.1.gamess.q
>>>
>>> this is definitely not working
>>
>> This rsh will be caught by the rsh-wrapper. This is as it should be.
>>
>>> and later there is
>>>
>>> /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-12.local
>>> /home/visible/apps/gamess/ddikick.x /home/visible/apps/gamess/gamess.01.x
>>> exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
>>> compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
>>> -dditree compute-0-5.local 33170 2 4 rsh -scr /tmp/3840.1.gamess.q
>>
>> This is the message from the wrapper; it's fine.
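
To make the relationship between those two log lines concrete: the wrapper takes the rsh-style arguments that ddikick passes and re-issues them through qrsh -V -inherit, which is what the second log line shows. A minimal sketch of that translation, assuming the wrapper receives just a host followed by the command. The name rsh_to_qrsh is hypothetical, and rsh options such as -l or -n are deliberately not handled, so this is not the wrapper shipped with SGE.

```python
def rsh_to_qrsh(rsh_args, qrsh="qrsh"):
    """Translate rsh-style arguments (host, command...) into a qrsh call.

    rsh_args is everything after the program name, e.g.
    ["compute-0-12.local", "/path/to/ddikick.x", "exam20", ...].
    Returns the argument list for qrsh; nothing is executed here.
    """
    if not rsh_args:
        raise ValueError("expected: host command [args...]")
    host, *command = rsh_args
    # -inherit starts the task inside the already running parallel job;
    # -V passes the current environment on to the remote side.
    return [qrsh, "-V", "-inherit", host, *command]
```

With qrsh="/opt/gridengine/bin/lx26-amd64/qrsh" and the arguments from the execvp line, the result has the same shape as the command the wrapper printed into the log.
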
>>
>> -- Reuti
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>>
>
>
