[GE users] How to find out why the SGE job is not terminating.

Amit H Kumar AHKumar at odu.edu
Thu Aug 10 21:21:13 BST 2006


Reuti <reuti at staff.uni-marburg.de> wrote on 08/10/2006 03:20:09 PM:

> Hi,
>
> On 10.08.2006 at 16:04, Amit H Kumar wrote:
>
> >
> > Hi SGE,
> >
> > I submitted a HelloWorld program a couple of times through SGE.
> > 99.99% of the time it runs successfully.
> > But this one time it did not terminate, even though the .oJOBID and
> > .poJOBID files were created successfully.
> >
> > The .oJOBID file has the "correct result" from the MPICH2 job, so this
> > is not due to the MPICH2 job itself.
> > My script is shown below. The only problem is that the script I run
> > after the MPICH2 job looks for a file that is missing.
> > The .oJOBID file for this unfinished job seems to have stuck at that
> > point, because it does not report that the file is missing,
> > though it does report the missing file when I run it for the 2nd, 3rd,
> > ... nth time.
> >
> > <snip> ======================
> >
> > #!/bin/tcsh
> >
> > #$ -N helloworld.exe
> > #$ -m ae
> > #$ -M me at odu.edu
> > #$ -cwd
> > #$ -j y
> > #$ -S /bin/tcsh
> >
> > set NPROCPM=2
> > @ NPROCS=$NSLOTS * $NPROCPM
> >
> > /usr/local/bin/mpiexec -machinefile $HPC_HOSTFILE -np $NPROCS ./helloworld.exe
> >
> > # This script has a bug: it tries to read a missing file.
> > /usr/local/bin/HPC_unsetmpi.csh $MPI_TYPE
> >
> > </snip> =======================
> >
>
> what is HPC_unsetmpi.csh doing? I found a similar procedure to set
> these during the startup:

Hi Reuti,

Well, what I posted here was basically a cut-and-paste from the link you
mention. I did this to make it look simple. Yes, it does set and unset.

The set procedure basically boots the MPD ring. The unset procedure exits
the MPD ring and then uses another script to clean up any temporary files
that were created.

Now I am starting to think: the cleanup script, whether or not it is
missing, lives in a temporary user home directory that is NFS auto-mounted.
If I remember right, the job hung the first time I ran it, but every run
after that had no problems. So do you think this could be the problem?
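
One way to test the NFS theory might be to guard the cleanup step with a
time limit, so that a stalled automount cannot hold the job open. The
following is only a minimal sketch under several assumptions: it is written
in Bourne shell rather than tcsh (so it would have to be called as a separate
helper script, or the job run with `#$ -S /bin/sh`), it relies on `timeout`
from GNU coreutils being installed on the execution hosts, and
`run_cleanup_guarded` and the example path are hypothetical names, not part
of HPC_unsetmpi.csh.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPC_unsetmpi.csh): run a cleanup script
# only if the filesystem holding it answers within a time limit.
# Assumes `timeout` from GNU coreutils is available.

run_cleanup_guarded() {
    script="$1"        # path to the cleanup script (possibly on NFS)
    limit="${2:-10}"   # seconds to wait before giving up

    # Probing the file forces the automounter to mount the directory.
    # timeout exits with status 124 if the probe exceeds the limit.
    rc=0
    timeout "$limit" test -r "$script" || rc=$?

    if [ "$rc" -eq 124 ]; then
        echo "cleanup: $script did not respond within ${limit}s" >&2
        return 124
    elif [ "$rc" -ne 0 ]; then
        echo "cleanup: $script is missing or unreadable" >&2
        return 1
    fi

    # Bound the cleanup run itself as well, then pass on its exit status.
    timeout "$limit" "$script"
}

# Example call (the path is an assumption for illustration):
# run_cleanup_guarded "$HOME/cleanup.csh" 10
```

If the "did not respond" message shows up only on the first run after the
directory has been idle, that would point toward the automounter. On a hard
NFS mount the probe itself may not be interruptible, so this is meant as a
diagnostic aid rather than a guaranteed fix.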

Thank you,
-AK



