[GE users] How to find out why the SGE job is not terminating.

Amit H Kumar AHKumar at odu.edu
Thu Aug 10 23:18:30 BST 2006



Reuti <reuti at staff.uni-marburg.de> wrote on 08/10/2006 05:24:08 PM:

> >>>
> >>
> >> what is HPC_unsetmpi.csh doing? I found a similar procedure to set
> >> these during the startup:
> > Hi Reuti,
> >
> > Well, what I posted here was basically a cut-and-paste from the link
> > below. I did this to make it look simple. Yes, it does both set and
> > unset.
> >
> > The set procedure boots the MPD ring. The unset procedure exits the
> > MPD ring and then uses another script to clean up any temporary
> > files created.
>
> For a tightly integrated job, SGE will clean up the temporary files.

No, I don't have tight integration on these machines. But the setup
scripts basically parse $PE_HOSTFILE and create a machinefile to be
used by mpdboot and mpiexec.
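For what it's worth, that machinefile step is essentially a one-liner
(a sketch, not the literal script; the first column of each
$PE_HOSTFILE line is the hostname):

    # sketch: build a machinefile from SGE's $PE_HOSTFILE
    awk '{ print $1 }' $PE_HOSTFILE > $TMPDIR/machines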

>
> >
> > Now I am starting to think: the cleanup script, whether it is
> > missing or not, lives in a temporary user home directory which is
> > NFS auto-mounted. And if I remember right, the first time I ran
> > this job it hung there, but every run after that had no problems.
> > So do you think this could be the problem?
>
> Possibly, but:
>
> are you requesting an empty PE for this then? I never tried using it
> (the mpd startup method), as no Tight Integration is possible there.
> And two mpds on one node (which could happen if you have two
> different jobs there) will also not be easy to handle, I think.
>
Okay.
I don't understand what an empty PE means.
I request a PE for the mpich2 jobs like this:
%> qsub -pe mpich2 5 ./sge_submit_script.sh
And then, since we want to clean up some non-SGE temporary files, we
use a script to do so. The cleanup script is called within
sge_submit_script.sh, after the mpich2 job is finished.
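Schematically, the script layout is something like this (a simplified
sketch; HPC_setmpi.csh here just stands in for our actual set script):

    #!/bin/sh
    # simplified layout of sge_submit_script.sh
    $HOME/HPC_setmpi.csh    # placeholder name: boots the MPD ring
    mpiexec -machinefile $TMPDIR/machines -n $NSLOTS ./my_app
    $HOME/HPC_unsetmpi.csh  # exits the ring, then calls the cleanup script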



Now I am a little confused:
I understand that mpiexec will run the job on the nodes selected and
specified in the -machinefile.
But what happens to the cleanup script within sge_submit_script.sh?
Does it run only on the submit host, since I am not launching it via
mpiexec?
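If it only runs locally, I suppose a loop like this could push it to
every granted node (untested; cleanup_tmpfiles.sh is just a placeholder
name):

    # run the cleanup on every node listed in the hostfile
    for host in `awk '{ print $1 }' $PE_HOSTFILE`; do
        rsh $host $HOME/cleanup_tmpfiles.sh  # or ssh, depending on setup
    done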


MY PE CONF:
============
qconf -sp mpich2
pe_name           mpich2
slots             64
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /opt/gridengine/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min



> Did you try the options mentioned in the Howto:
>
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

We are using the MPD-based startup:
If I understand right, tight integration for the MPD-based startup
method involves parsing $PE_HOSTFILE to prepare a machinefile, booting
the MPD ring with it, and then running your jobs on the same nodes via
mpiexec. In addition, the PE is set up to run startmpi.sh and
stopmpi.sh as above.
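As a command-level sketch of that flow (illustrative, not the literal
scripts):

    # boot one mpd per granted host, run the job, take the ring down
    mpdboot -n `wc -l < $TMPDIR/machines` -f $TMPDIR/machines
    mpiexec -machinefile $TMPDIR/machines -n $NSLOTS ./my_app
    mpdallexit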


>
> I can't say much, as I don't know what you are doing in your start/
> stop scripts. But even if you want to stay with the mpd startup
> method, I would suggest putting these scripts in the start/stop
> procedures of the PE, instead of putting them in an end-user script.

In short, should I make these setup and cleanup scripts part of
startmpi.sh and stopmpi.sh?
Maybe a stupid question: does stopmpi.sh run on all nodes?
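If they do belong there, I imagine a small wrapper could serve as the
stop_proc_args entry (untested; cleanup_tmpfiles.sh is again just a
placeholder name):

    #!/bin/sh
    # hypothetical wrapper used as stop_proc_args in place of stopmpi.sh
    /opt/gridengine/mpi/stopmpi.sh "$@"
    $HOME/cleanup_tmpfiles.sh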

The more I try to understand, the more lost I am.

Thank you Reuti for your feedback,
-AK

>
> -- Reuti
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



