[GE users] gamess info

lukacm at pdx.edu lukacm at pdx.edu
Thu Aug 24 18:10:04 BST 2006


Yes,

all the SGE tools are working. The script works with ssh (only partially,
because the exports are not working for some reason); i.e. the jobs are started
correctly on all remote nodes and do check in.
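
One way to confirm that every process really checked in is to summarize the
ddikick log. The following is only a minimal sketch: it assumes the log lines
look like the ones in the quoted output further down ("Awaiting connections
from N GDDI processes", "N checked in; ... (Remaining=M)"), and the function
name summarize_checkins is made up for this illustration.

```python
import re

# Assumed line formats, copied from the quoted ddikick output:
#   "Awaiting connections from 8 GDDI processes."
#   "ddikick.x : 0 checked in; receiving via port 33177 (Remaining=7)."
AWAITING = re.compile(r"Awaiting connections from (\d+) GDDI processes")
CHECKIN = re.compile(
    r"ddikick\.x\s*:\s*(\d+) checked in; receiving via port (\d+) "
    r"\(Remaining=(\d+)\)"
)


def summarize_checkins(log_text):
    """Return (expected, checked_in_ranks, remaining) from a ddikick log."""
    expected = None
    ranks = []
    remaining = None
    for line in log_text.splitlines():
        # Drop mail quote markers ("> > ") so a quoted log parses as well.
        line = line.lstrip("> ").strip()
        match = AWAITING.search(line)
        if match:
            expected = int(match.group(1))
            continue
        match = CHECKIN.search(line)
        if match:
            ranks.append(int(match.group(1)))
            remaining = int(match.group(3))
    return expected, ranks, remaining
```

If the last reported "Remaining" value is not 0, some processes never
connected back to the kickoff host.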

With the correct parameters, the SGE tools work fine without any problem.
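
As a cross-check that does not depend on the SGE utilities, the forward and
reverse lookups can also be compared with the system resolver. This is a
sketch under assumptions: it uses Python's socket module rather than SGE's own
gethostbyname/gethostbyaddr tools, so it will not reflect any SGE-specific
host aliasing, and check_host is a hypothetical helper name.

```python
import socket


def check_host(name, forward=socket.gethostbyname,
               reverse=socket.gethostbyaddr):
    """Resolve name -> IP -> name and report whether the round trip agrees.

    The forward/reverse parameters exist only so the lookups can be
    replaced in tests; by default the system resolver is used.
    """
    try:
        ip = forward(name)
        primary, aliases, _addresses = reverse(ip)
    except OSError as exc:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return {"host": name, "ok": False, "error": str(exc)}
    known_names = {primary, *aliases}
    return {"host": name, "ip": ip, "reverse": primary,
            "ok": name in known_names}
```

Calling check_host("compute-0-12.local") on each node, for each node name,
would show whether a name such as compute-0-12.local comes back as
compute-0-12 (without .local) on the reverse lookup.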

The fallback to /usr/bin/rsh is a feature of ddikick: if it does not find
rsh, it tries all the other system default rsh scripts.
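
To see which rsh would be picked up first inside a job, the search path can be
walked in order. This is a minimal sketch, assuming the rsh wrapper lives in a
directory that is placed ahead of /usr/bin in PATH (an assumption about the
setup, not something the log shows); all_on_path is a hypothetical helper.

```python
import os


def all_on_path(command, search_path):
    """List every executable named `command` on search_path, in PATH order."""
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            # Empty PATH entries are skipped in this sketch.
            continue
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits
```

Running all_on_path("rsh", os.environ.get("PATH", "")) from within the job
script lists the candidates; the first entry is the one a plain "rsh" call
resolves to.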


martin


Quoting Reuti <reuti at staff.uni-marburg.de>:

> Hi,
>
> but it looks much better now at least.
>
> On 23.08.2006 at 21:17, lukacm at pdx.edu wrote:
>
> > It is missing the hostname,
> >
> > qrsh is deployed. I can now (finally) capture the log of the
> > ddikick process:
> >
> >  ddikick.x: finished with -ddi argument.
> >  ddikick.x: finished with -dditree argument
> >  ddikick.x: finished with -ppn argument
> >  ddikick.x: finished with -scr argument.
> >
> >  Distributed Data Interface kickoff program.
> >  Initiating 4 compute processes on 4 nodes to run the following
> > command:
> >  /home/visible/apps/gamess/gamess.01.x exam20
> >
> >  ddikick.x: kickoff host = compute-0-5.local
> >  Master Kickoff Host compute-0-5.local is accepting connections on
> > port 33170.
> >  Awaiting connections from 8 GDDI processes.
> >  ddikick.x : Thread created on compute-0-5.local:33170 to accept
> > connections.
> >  ddikick.x: execvp command line: rsh compute-0-12.local
> > /home/visible/apps/gamess/ddikick.x
> > /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> > compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
> > -dditree compute-0-5.local 33170 2 4 rsh
> > -scr /tmp/3840.1.gamess.q
> >  ddikick.x: execvp command line: rsh compute-0-4.local
> > /home/visible/apps/gamess/ddikick.x
> > /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> > compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
> > -dditree compute-0-5.local 33170 1 2 rsh
> > -scr /tmp/3840.1.gamess.q
> > Attemping to create DDI process 0 on local node 0.
> > DDI Process 0 Command Line: /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi
> > compute-0-5.local 33170 0 0 4 4 compute-0-5.local:cpus=1
> > compute-0-4.local:cpus=1 compute-0-12.local:cpus=1
> > compute-0-9.local:cpus=1
> > Attemping to create DDI process 4 on local node 0.
> > DDI Process 4 Command Line: /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi
> > compute-0-5.local 33170 0 4 4 4 compute-0-5.local:cpus=1
> > compute-0-4.local:cpus=1 compute-0-12.local:cpus=1
> > compute-0-9.local:cpus=1
> > /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-12.local
> > /home/visible/apps/gamess/ddikick.x
> > /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> > compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
> > -dditree compute-0-5.local 33170 2 4 rsh
> > -scr /tmp/3840.1.gamess.q
> > /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-4.local
> > /home/visible/apps/gamess/ddikick.x
> > /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> > compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
> > -dditree compute-0-5.local 33170 1 2 rsh
> > -scr /tmp/3840.1.gamess.q
> >  ddikick.x: 4 bytes received; $lu remaining.
> >  ddikick.x: 4 bytes received; $lu remaining.
> >  ddikick.x : 0 checked in; receiving via port 33177 (Remaining=7).
> >  ddikick.x: 4 bytes received; $lu remaining.
> >  ddikick.x: 4 bytes received; $lu remaining.
> >  ddikick.x : 4 checked in; receiving via port 33179 (Remaining=6).
> >  ddikick.x: Sending kill signal to DDI processes.
> >  ddikick.x: Sending kill signal to DDI process 0.
> >  ddikick.x: Sending kill signal to DDI process 4.
> >  DDI Process 0: terminated upon request.
> >  DDI Process 4: terminated upon request.
> >  ddikick.x: Execution terminated due to error(s).
> >
> > and in the error log I have the same as before:
> >
> >
> > error: commlib error: access denied (client IP resolved to host
> > name "". This is
> > not identical to clients host name "")
>
> Okay, now we have to investigate this. Are the hostnames all known
> on all machines via /etc/hosts or e.g. NIS? Are the SGE tools
> gethostbyname and gethostbyaddr working as expected and providing
> reasonable results on all nodes, and for all nodes on each one?
>
> > error: executing task of job 3840 failed: failed sending task to
> > execd at compute-0-12.local: can't find connection
> > error: commlib error: access denied (client IP resolved to host
> > name "". This is
> > not identical to clients host name "")
> > error: executing task of job 3840 failed: failed sending task to
> > execd at compute-0-4.local: can't find connection
> >  ddikick.x: Timed out while waiting for DDI processes to check in.
> >  ddikick.x: Fatal error detected.
> >  The error is most likely to be in the application, so check for
> >  input errors, disk space, memory needs, application bugs, etc.
> >  ddikick.x will now clean up all processes, and exit...
> > connect to address 10.5.255.249: Connection refused
> > connect to address 10.5.255.249: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > compute-0-5.local: Connection refused
> > connect to address 10.5.255.250: Connection refused
> > connect to address 10.5.255.250: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > compute-0-4: Connection refused
> > connect to address 10.5.255.242: Connection refused
> > connect to address 10.5.255.242: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > compute-0-12: Connection refused
> > connect to address 10.5.255.245: Connection refused
> > connect to address 10.5.255.245: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > compute-0-9: Connection refused
>
> I wonder why the hostname here again has no .local, and why it is using
> the full path to /usr/bin/rsh. I agree that this will not work.
>
> >
> > However, there is a mix of rsh and qrsh. Both appear in the
> > ddikick log:
> >
> >  ddikick.x: execvp command line: rsh compute-0-12.local
> > /home/visible/apps/gamess/ddikick.x
> > /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> > compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
> > -dditree compute-0-5.local 33170 2 4 rsh
> > -scr /tmp/3840.1.gamess.q
> >
> > this is 100% not working
>
> This rsh will be caught by the rsh-wrapper. This is as it should be.
>
> > and later there is
> >
> > /opt/gridengine/bin/lx26-amd64/qrsh -V -inherit compute-0-12.local
> > /home/visible/apps/gamess/ddikick.x
> > /home/visible/apps/gamess/gamess.01.x
> > exam20 -ddi 4 4 compute-0-5.local:cpus=1 compute-0-4.local:cpus=1
> > compute-0-12.local:cpus=1 compute-0-9.local:cpus=1
> > -dditree compute-0-5.local 33170 2 4 rsh
> > -scr /tmp/3840.1.gamess.q
>
> This is the message from the wrapper; it's fine.
>
> -- Reuti
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>


More information about the gridengine-users mailing list