[GE users] Gaussian 03, Revision C.02 parallel + SGE

sangamesh forum.san at gmail.com
Sun Sep 27 08:15:48 BST 2009



Dear SGE users,

I'm facing two issues running a parallel Gaussian job under SGE using Linda (Gaussian-Linda, G03 revision C.02).

(1) How to integrate G03.C02 Linda with SGE

G03.D01 is the revision that follows G03.C02. G03.D01 provides %NProcShared and %LindaWorkers to run Linda jobs across multiple nodes under a shared + distributed memory model. %LindaWorkers can be used to specify the list of machines taken from SGE's $TMPDIR/machines file.

But G03.C02 doesn't have the %LindaWorkers directive. As far as I understand, it supports only %NProcShared and %NProcLinda. So, in the absence of %LindaWorkers, how can I integrate it with SGE, i.e. how do I tell Linda to run g03 on the hosts that SGE has scheduled?
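
One idea I have considered, as a rough sketch only: build the Tsnet.Appl.nodelist line from the hosts SGE assigned to the job instead of hard-coding it. This assumes the job runs in a parallel environment that writes $TMPDIR/machines with one hostname per line (hosts may repeat, once per slot). The function name make_nodelist is made up for illustration, and I have not verified that ntsnet in G03.C02 honours a nodelist rewritten this way.

```shell
#!/bin/bash
# Hypothetical helper (not part of Gaussian, Linda or SGE):
# print a Tsnet.Appl.nodelist line from a machines file that lists
# one hostname per line, possibly repeated once per slot.
make_nodelist() {
    local machines_file="$1"
    local hosts
    # Skip blank lines, keep the first occurrence of each host,
    # then join the remaining hostnames with single spaces.
    hosts=$(awk 'NF && !seen[$1]++ { print $1 }' "$machines_file" | paste -sd' ' -)
    echo "Tsnet.Appl.nodelist: $hosts"
}

# Possible use inside the job script (assumption: $TMPDIR/machines exists):
#   make_nodelist "$TMPDIR/machines"
```

The printed line would then have to replace the static nodelist entry in .tsnet.config before g03l is started, and %NProcLinda would need to match the number of hosts.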

(2) Error Issue: "l302.exel: error while loading shared libraries: util.so: cannot open shared object file: No such file or directory
died without ever signing in Sign in timed out after 0 worker connections.
Did not reach minimum (1), shutting down."

To run parallel Gaussian jobs in SGE, I have configured .tsnet.config as follows:
$ cat ~/.tsnet.config
Tsnet.Appl.nodelist: compute-0-0.local compute-0-1.local compute-0-2.local compute-0-3.local compute-0-4.local compute-0-5.local compute-0-6.local compute-0-7.local compute-0-8.local compute-0-9.local compute-0-10.local compute-0-11.local compute-0-12.local compute-0-13.local compute-0-14.local compute-0-15.local compute-0-16.local compute-0-17.local compute-0-18.local compute-0-19.local compute-0-20.local compute-0-21.local compute-0-22.local
Tsnet.Appl.verbose: True
Tsnet.Appl.veryverbose: True
Tsnet.Node.lindarsharg: ssh
Tsnet.Appl.useglobalconfig: false

The job submission script is as follows:

$ cat sgegausub_1.sh
#!/bin/bash

#$ -N go3linda
#$ -S /bin/bash
#$ -cwd
#$ -q all.q
#$ -e err.$JOB_ID.$JOB_NAME
#$ -o out.$JOB_ID.$JOB_NAME
g03root=/apps/gaussian-linda
GAUSS_EXEDIR=/apps/gaussian-linda/g03:/apps/gaussian-linda/g03/linda-exe
GAUSS_SCRDIR=$HOME/g03scrdir/$JOB_ID
LD_LIBRARY_PATH=/apps/gaussian-linda/g03:/apps/gaussian-linda/g03/linda-exe:$LD_LIBRARY_PATH
#$ -v GAUSS_SCRDIR=$HOME/g03scrdir/$JOB_ID
PATH=$GAUSS_EXEDIR:$PATH
#$ -v LD_LIBRARY_PATH=/apps/gaussian-linda/g03:/apps/gaussian-linda/g03/linda-exe:$LD_LIBRARY_PATH
export g03root GAUSS_EXEDIR PATH LD_LIBRARY_PATH GAUSS_SCRDIR
#$ -V
if [ ! -d "$GAUSS_SCRDIR" ]; then
    echo "Creating directory $GAUSS_SCRDIR"
    mkdir -p "$GAUSS_SCRDIR"
    if [ ! -d "$GAUSS_SCRDIR" ]; then
        echo "Failed to create $GAUSS_SCRDIR"
        exit 1
    fi
fi
source /apps/gaussian-linda/g03/bsd/g03.profile
file_orig=/home1/g03/apps_test/gaussian/testprl/test000.com
PAR_ENV=2
gjoutfile=
echo %NProcShared=4  > $file_orig.$JOB_ID
echo %NProcLinda=`echo $PAR_ENV`  >> $file_orig.$JOB_ID
cat $file_orig            >> $file_orig.$JOB_ID
/apps/gaussian-linda/g03/bsd/g03l $file_orig.$JOB_ID

The error it gives is:
ntsnet: starting master process on compute-0-11.local
/apps/gaussian-linda/g03/linda7.1/intel-linux2.4-rh8/bin/linda_sh /apps/gaussian-linda/g03/linda-exe/l302.exel 0 /home1/g03/g03scrdir/162/Gau-27587.chk 0 /home1/g03/g03scrdir/162/Gau-27587.int 0 /home1/g03/g03scrdir/162/Gau-27587.rwf 0 /home1/g03/g03scrdir/162/Gau-27587.d2e 0 /home1/g03/g03scrdir/162/Gau-27587.scr 0 /home1/g03/g03scrdir/162/Gau-27586.inp 0 junk.out 0 +LARGS 23 0 -kainterval 1 -master 17687 -tsnetport 46621 -maxworkers 1 -minworkers 1 -minwait 600 -maxwait 600 -nodename compute-0-11.local -kaon
ntsnet: starting 1 worker on compute-0-0.local
/apps/gaussian-linda/g03/linda7.1/intel-linux2.4-rh8/bin/linda_rsh compute-0-0.local -r ssh /apps/gaussian-linda/g03/linda-exe/l302.exel 0 /home1/g03/g03scrdir/162/Gau-27587.chk 0 /home1/g03/g03scrdir/162/Gau-27587.int 0 /home1/g03/g03scrdir/162/Gau-27587.rwf 0 /home1/g03/g03scrdir/162/Gau-27587.d2e 0 /home1/g03/g03scrdir/162/Gau-27587.scr 0 /home1/g03/g03scrdir/162/Gau-27586.inp 0 junk.out 0 +LARGS 23 1 -maxworkers 1 -chdir /home1/g03/apps_test/gaussian/testprl -worker compute-0-11.local:17687 -workerwait 900 -tsnetref 1 -nodename compute-0-0.local
ntsnet: exec'ing /apps/gaussian-linda/g03/linda7.1/intel-linux2.4-rh8/bin/LindaLauncher /tmp/162.1.all.q/viaExecDatatFndsC
/apps/gaussian-linda/g03/linda-exe/l302.exel: error while loading shared libraries: util.so: cannot open shared object file: No such file or directory
subprocess pid = 27613 has exited. status = 0x7f00, id = 0, state = 13. command was /apps/gaussian-linda/g03/linda7.1/intel-linux2.4-rh8/bin/linda_rsh compute-0-0.local -r ssh /apps/gaussian-linda/g03/linda-exe/l302.exel 0 /home1/g03/g03scrdir/162/Gau-27587.chk 0 /home1/g03/g03scrdir/162/Gau-27587.int 0 /home1/g03/g03scrdir/162/Gau-27587.rwf 0 /home1/g03/g03scrdir/162/Gau-27587.d2e 0 /home1/g03/g03scrdir/162/Gau-27587.scr 0 /home1/g03/g03scrdir/162/Gau-27586.inp 0 junk.out 0 +LARGS 0 compute-0-0.local 10.1.1.243 43461 1 1 /home1/g03/apps_test/gaussian/testprl
died without ever signing in
Sign in timed out after 0 worker connections.
Did not reach minimum (1), shutting down.

The error is clear: it cannot find the util.so shared object file. Even though LD_LIBRARY_PATH is set in the job script, the library is still not found. The /apps directory is on shared storage and is mounted on all the nodes.
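
One check I can think of, sketched below, is to ask the dynamic loader on the worker node which libraries it cannot resolve when the executable is reached through a plain non-interactive ssh session, since that appears to be how linda_rsh starts the worker. My guess (not confirmed) is that such a session does not inherit the LD_LIBRARY_PATH exported in the job script. The helper names missing_libs and check_remote_libs are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read `ldd` output on stdin and print the names
# of the libraries that the dynamic loader reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical helper: run ldd on a remote host through a non-interactive
# ssh session and list the unresolved libraries for the given executable.
check_remote_libs() {
    local host="$1" exe="$2"
    ssh "$host" "ldd '$exe'" | missing_libs
}

# Example call (not run here):
#   check_remote_libs compute-0-0.local /apps/gaussian-linda/g03/linda-exe/l302.exel
```

If util.so shows up in that list while the same ldd command run inside the job script resolves it, the environment seen by the ssh-started worker would be the likely difference.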

What could be the reason for this and how to resolve the issue?

Thanks,
Sangamesh


