[GE users] How to find out why the SGE job is not terminating.

Amit H Kumar AHKumar at odu.edu
Thu Aug 10 15:04:33 BST 2006


Hi SGE users,

I submitted a HelloWorld program a couple of times through SGE, and nearly
every time it ran successfully.
This one time, however, the job does not terminate, even though the .oJOBID
and .poJOBID files were created successfully.

The .oJOBID file contains the correct result from the MPICH2 job, so the
MPICH2 job itself is not the cause.
The only problem I know of is that the script I run after the MPICH2 job
looks for a file that is missing.
The .oJOBID file for this unfinished job seems to have stopped at that
point, because it does not report that the file is missing, although it did
report the missing file when I ran the job the 2nd, 3rd, ... nth time.
My script looks like this:

<snip> ======================

#!/bin/tcsh

#$ -N helloworld.exe
#$ -m ae
#$ -M me@odu.edu
#$ -cwd
#$ -j y
#$ -S /bin/tcsh

set NPROCPM=2
@ NPROCS = $NSLOTS * $NPROCPM

/usr/local/bin/mpiexec -machinefile $HPC_HOSTFILE -np $NPROCS ./helloworld.exe

# This script has a bug: it tries to read a file that is missing.
/usr/local/bin/HPC_unsetmpi.csh $MPI_TYPE

</snip> =======================
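
One way to make the .oJOBID file show which step a script like this stopped in is to print a marker before and after each command. The following is only a sketch under assumptions: it is written in plain sh rather than tcsh, so the job would need to be submitted with -S /bin/sh, and run_step is a hypothetical helper name, not part of SGE or MPICH2.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): run one step of a job script and
# log when it starts and ends, so the job output shows the last step reached.
run_step() {
    label=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] start: $label"
    status=0
    "$@" || status=$?          # capture the exit status without aborting
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] end: $label (exit status $status)"
    return $status
}

# Example use inside a job script (commands taken from the script above):
#   run_step "mpi job" /usr/local/bin/mpiexec -machinefile $HPC_HOSTFILE -np $NPROCS ./helloworld.exe
#   run_step "cleanup" /usr/local/bin/HPC_unsetmpi.csh $MPI_TYPE
```

If a step hangs, the output would contain its "start" line with no matching "end" line, which narrows down where the script is stuck.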


I have not changed any SGE settings between runs.  I still see the
process's "common" files in the $SGE_ROOT/default/spool/qmaster/jobs/......

My question is: how do I find out why and where the job is stuck? Should I
be looking in the spool directories on the head node or on the compute nodes?
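
As a starting point for looking through the spool directories, here is a rough sketch of a helper that searches the qmaster and execution-host messages files for a job ID. It assumes the default spool layout, with $SGE_ROOT/$SGE_CELL/spool/qmaster/messages on the head node and $SGE_ROOT/$SGE_CELL/spool/<host>/messages for each compute node; a site with local spooling would keep these files elsewhere. The function name sge_job_log_lines is hypothetical, and it is written in plain sh rather than tcsh.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that mention a job ID in the qmaster
# and execd "messages" files. Assumes the default spool layout under
# $SGE_ROOT/$SGE_CELL/spool; adjust the paths if spooling is configured locally.
sge_job_log_lines() {
    jobid=$1
    host=$2
    # Stops with an error message if SGE_ROOT is not set; cell defaults to "default".
    spool="${SGE_ROOT:?SGE_ROOT is not set}/${SGE_CELL:-default}/spool"
    for log in "$spool/qmaster/messages" "$spool/$host/messages"; do
        if [ -r "$log" ]; then
            echo "== $log"
            # -w matches the job ID as a whole word, so 1234 does not match 12345
            grep -w -- "$jobid" "$log" || echo "(no lines mention job $jobid)"
        else
            echo "== $log (not readable from this host)"
        fi
    done
}

# Example (job ID and host name are placeholders):
#   sge_job_log_lines 1234 node01
```

Alongside this, running qstat -j JOBID on the head node shows what the qmaster still knows about the job.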


Thank you for any feedback,
-AK

