[GE users] How to find out why the SGE job is not terminating.

Reuti reuti at staff.uni-marburg.de
Thu Aug 10 22:24:08 BST 2006


On 10.08.2006 at 22:21, Amit H Kumar wrote:

> Reuti <reuti at staff.uni-marburg.de> wrote on 08/10/2006 03:20:09 PM:
>
>> Hi,
>>
>> On 10.08.2006 at 16:04, Amit H Kumar wrote:
>>
>>>
>>> Hi SGE,
>>>
>>> I submitted a HelloWorld program a couple of times through SGE.
>>> 99.99% of the time it runs successfully.
>>> But this one time it did not terminate, though the .oJOBID and
>>> .poJOBID files were created successfully.
>>>
>>> The .oJOBID file has the correct result from the MPICH2 job, so this
>>> is not due to the MPICH2 job itself.
>>> My script looks like the one below. The only problem is that the
>>> script I run after the MPICH2 job looks for a file that is missing.
>>> The .oJOBID file for this unfinished job seems to be stuck at that
>>> point, because it does not report that the file is missing.
>>> It does report the missing file when I ran it for the 2nd, 3rd,
>>> ... nth time.
>>>
>>> <snip> ======================
>>>
>>> #!/bin/tcsh
>>>
>>> #$ -N helloworld.exe
>>> #$ -m ae
>>> #$ -M me at odu.edu
>>> #$ -cwd
>>> #$ -j y
>>> #$ -S /bin/tcsh
>>>
>>> set NPROCPM=2
>>> @ NPROCS=$NSLOTS * $NPROCPM
>>>
>>> /usr/local/bin/mpiexec -machinefile $HPC_HOSTFILE -np $NPROCS ./helloworld.exe
>>>
>>> /usr/local/bin/HPC_unsetmpi.csh $MPI_TYPE      # <====== This script has a bug: a missing file that it is trying to read.
>>>
>>> </snip> =======================
>>>
>>
>> What is HPC_unsetmpi.csh doing? I found a similar procedure to set
>> these during the startup:
> Hi Reuti,
>
> Well, what I posted here was basically a cut and paste from the link below.
> I did this to make it look simple. Yes, it does set and unset.
>
> The set procedure basically boots the MPD ring. The unset procedure
> exits the MPD ring,
> and then uses another script to clean up any temporary files created.

For a tightly integrated job, SGE will clean up the temporary files.

>
> Now I am starting to think: the cleanup script, regardless of whether it
> is missing or not, is in a temporary user home directory,
> which is NFS auto-mounted. And if I remember right, the first time I
> ran this job it hung there, but every run after that did not have
> any problems. So do you think this could be the problem?

Possibly, but:

Are you requesting an empty PE for this, then? I never tried to use it
(the mpd startup method), as there is no Tight Integration possible.
And two mpds on one node (this could happen if you have two
different jobs there) will also not be easy to implement, I think.

Did you try the options mentioned in the Howto:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

I can't say much, as I don't know what you are doing in your
start/stop scripts. But even if you want to stay with the mpd startup
method, I would suggest putting these scripts in the start/stop
procedures of the PE instead of putting them in an end-user script.
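
As a rough sketch of that last suggestion, a PE definition along the
following lines could call the set/unset scripts itself. Everything
specific here is an assumption for illustration: the PE name
"mpich2_mpd", the slot count, the allocation rule, and the name
HPC_setmpi.csh for the "set" script (only HPC_unsetmpi.csh appears in
the job script above), as well as whatever arguments the scripts
actually expect. Only the field names and the $pe_hostfile
pseudo-variable are standard PE configuration.

```
# Annotation, not part of the PE definition:
# hypothetical PE "mpich2_mpd", as it might appear in `qconf -sp mpich2_mpd`.
# HPC_setmpi.csh is an assumed name; the script arguments are guesses.
pe_name           mpich2_mpd
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/bin/HPC_setmpi.csh $pe_hostfile
stop_proc_args    /usr/local/bin/HPC_unsetmpi.csh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
```

The job script would then request the PE (for example with
"#$ -pe mpich2_mpd 4") and contain only the mpiexec line, so a failing
cleanup step is no longer part of the end-user script.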

-- Reuti
