[GE users] Long delay when submitting large jobs
reuti at staff.uni-marburg.de
Mon Feb 7 22:03:26 GMT 2005
Quoting Rayson Ho <raysonho at eseenet.com>:
> >you mean because it's just mentioned on the mpiexec page (version 0.77)?
> >The LAM/MPI patch is a little out of date and is for 6.5.8
> >(current is 7.1.1; is it still working?).
> No, the current LAM-PBS integration (for LAM 7.x) doesn't use mpiexec, but
> LAM directly calls the TM API.
Aha, if I understand correctly, you still call lamboot, which will use the TM,
then run mpirun as often as needed, and at the end a lamhalt shouldn't really
be necessary, as the daemons will be killed at the end of the job anyway.
> >And: you can compile MPICH2 startup in more than one way.
> >I'm not sure whether all of them are supported, as the supplied
> >version is only for a beta version of MPICH2.
> You can still use the TM mpiexec for MPICH2, see this mail from the author
> of the TM mpiexec --
> "mpich2 - will the real mpiexec please stand up"
With all types of startup methods in MPICH2? Not only are there many versions
of mpiexec in MPICH2, but the static libraries also differ, and if you only
have a binary... But I can get access to a PBS cluster, so I will look into it
out of curiosity.
> >But please keep the start_proc_args/stop_proc_args anyway, as we
> >need special directories to be created on the slave nodes. Maybe
> >it's personal taste, but I would still prefer setting up all the
> >daemons for LAM/MPICH2 in start_proc_args, and the end-user can
> >just use mpirun/mpich2-mpiexec, since everything is already set up.
> Agreed, or we can use the current prolog/epilog interface...
Okay, it depends. First we should see a working TM in SGE, and then decide
whether anything in the current PE support can be removed or changed. I'm also
thinking of other parallel environments like Linda and TCGMSG, which can already
be used with SGE today without any problems. And if you use the daemonless
version of MPICH2, it also works today.
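As an illustration of the current approach, where start_proc_args prepares
everything before the job script runs, here is a minimal Python sketch of one
such step: turning SGE's $PE_HOSTFILE (one line per host: host, slot count,
queue, processor range) into a LAM/MPI boot schema of "host cpu=N" lines that
lamboot can read. The helper names, printing to stdout, and summing slots per
host are assumptions of this sketch, not the PE templates shipped with SGE.

```python
import os


def parse_pe_hostfile(text):
    """Return a dict host -> total slots, in first-seen order.

    Each $PE_HOSTFILE line reads: host slots queue processor_range.
    A host may appear on several lines (e.g. slots from two queues),
    so its slot counts are summed.
    """
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        slots[host] = slots.get(host, 0) + count
    return slots


def lam_boot_schema(slots):
    """Render one 'host cpu=N' line per host, LAM boot schema style."""
    return "".join(f"{host} cpu={n}\n" for host, n in slots.items())


if __name__ == "__main__":
    # SGE sets PE_HOSTFILE inside a parallel job; do nothing elsewhere.
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as f:
            print(lam_boot_schema(parse_pe_hostfile(f.read())), end="")
```

A start_proc_args script could redirect this output to a file in $TMPDIR and
call lamboot on it, with the matching stop_proc_args running lamhalt, so the
end user only has to call mpirun.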
> IMO, SGE should only take care of allocating nodes, doing job accounting,
> and killing jobs... but launching the parallel jobs should be done by a
> parallel job launcher, such as mpiexec.
It's a point of discussion: what should a "parallel job launcher" do? Maybe
speed up the rsh/ssh calls by just mapping them to the TM (no rshd to start and
no subsequent rsh connection to establish); this could work with all parallel
environments. This way you don't need a PBS-mpiexec at all (what am I
overlooking here? Is it because PBS has no shepherd that PBS-mpiexec handles
the communication itself?), and this function could also be built into SGE.
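The rsh-mapping idea can be sketched in Python. The wrapper below is
hypothetical: it parses an rsh-style call (only the -n and -l options are
handled) and rewrites it for a launcher back-end, here illustrated with SGE's
existing "qrsh -inherit" mechanism, which starts the task via the execution
daemon on the target host so it is controlled and accounted as part of the
job. A TM-based back-end, as proposed above, would call the TM API's
tm_spawn() at that point instead.

```python
import os
import sys


def parse_rsh_args(argv):
    """Split rsh-style arguments into (host, user, no_stdin, command).

    Only -n and -l <user> are recognized, before or after the host name;
    the first other word after the host starts the remote command.
    """
    host, user, no_stdin = None, None, False
    rest = list(argv)
    while rest:
        arg = rest.pop(0)
        if arg == "-n":
            no_stdin = True
        elif arg == "-l":
            if not rest:
                raise ValueError("-l needs a user name")
            user = rest.pop(0)
        elif host is None:
            host = arg
        else:
            return host, user, no_stdin, [arg] + rest
    if host is None:
        raise ValueError("no host given")
    return host, user, no_stdin, []


def to_launcher(argv):
    """Rewrite an rsh call as `qrsh -inherit [-nostdin] host command...`."""
    host, _user, no_stdin, command = parse_rsh_args(argv)
    # The remote task runs as the job owner, so the -l user is dropped.
    if not command:
        raise ValueError("interactive rsh logins are not mapped here")
    launcher = ["qrsh", "-inherit"]
    if no_stdin:
        launcher.append("-nostdin")
    return launcher + [host] + command


if __name__ == "__main__" and len(sys.argv) > 1:
    cmd = to_launcher(sys.argv[1:])
    os.execvp(cmd[0], cmd)  # replace this process with the launcher
```

Installed under the name rsh in a directory placed first in the job's PATH, a
wrapper like this would be picked up by any parallel environment that starts
its remote processes via rsh, which is why the approach is not tied to one
MPI implementation.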
Cheers - Reuti