[GE users] How to find out why the SGE job is not terminating.

Reuti reuti at staff.uni-marburg.de
Thu Aug 10 20:20:09 BST 2006


Hi,

On 10.08.2006 at 16:04, Amit H Kumar wrote:

>
> Hi SGE,
>
> I submitted a HelloWorld program a couple of times through SGE. 99.99% of
> the time it runs successfully.
> But this one time it does not terminate, even though the .oJOBID and
> .poJOBID files are created successfully.
>
> The .oJOBID file has the "correct result" from the MPICH2 job, so this is
> not due to the MPICH2 job.
> My script is shown below. The only problem is that the script I run
> after the MPICH2 job looks for a file, and that file is missing.
> The .oJOBID file for this unfinished job seems to be stuck at that
> point, because it does not report that the file is missing.
> It did report the missing file when I ran the job the 2nd, 3rd,
> ... nth time.
>
> <snip> ======================
>
> #!/bin/tcsh
>
> #$ -N helloworld.exe
> #$ -m ae
> #$ -M me at odu.edu
> #$ -cwd
> #$ -j y
> #$ -S /bin/tcsh
>
> set NPROCPM=2
> @ NPROCS=$NSLOTS * $NPROCPM
>
> /usr/local/bin/mpiexec -machinefile $HPC_HOSTFILE -np $NPROCS ./helloworld.exe
>
> /usr/local/bin/HPC_unsetmpi.csh $MPI_TYPE      #<====== This script has a
> #        bug: a missing file that it is trying to read.
>
> </snip> =======================
>

What is HPC_unsetmpi.csh doing? I found a similar procedure to set
these during the startup:

http://www.engres.odu.edu/Clusters/Options/mpi_tutorial.html

On what platform are you running your script? I'm not aware of the
set/unset scripts. Did you request a PE in your qsub command, and is
the MPICH2 integration set up properly?

Can you please post your PE and queue definitions and, if they are not
too long, also the set/unset scripts?
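
In case it helps, here is a minimal sketch of how those definitions could
be dumped for posting. The PE name mpich2 and the queue name all.q are
placeholders (assumptions, not taken from your post); the list commands
show which names are actually configured on your cluster.

```shell
#!/bin/sh
# Placeholder names (assumptions): substitute your own PE and queue.
PE_NAME="mpich2"
QUEUE_NAME="all.q"

if command -v qconf >/dev/null 2>&1; then
    qconf -spl                # list the names of all configured PEs
    qconf -sp "$PE_NAME"      # show the definition of one PE
    qconf -sql                # list the names of all configured queues
    qconf -sq "$QUEUE_NAME"   # show the definition of one queue
else
    # qconf is only on the PATH after the SGE settings file was sourced
    echo "qconf not found; source the SGE settings file first" >&2
fi
```

The guard around qconf only keeps the sketch from failing on a machine
where the SGE environment is not loaded.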

-- Reuti


>
> I have not changed any SGE settings between runs. I still see the
> process's "common" files in $SGE_ROOT/default/spool/qmaster/jobs/......
>
> My question is: how do I find out why and where it is stuck, perhaps by
> looking at these spool directories on the head node or the compute nodes?
>
>
> Thank you for any feedback,
> -AK
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net




