[GE users] Long delay when submitting large jobs

Reuti reuti at staff.uni-marburg.de
Tue Feb 8 00:04:29 GMT 2005


Hi Rayson,

Quoting Rayson Ho <raysonho at eseenet.com>:

> >Aha, if I got it correctly you will still call lamboot which will
> >use the TM. Then use some times mpirun, and at the end a lamhalt
> >shouldn't be really necessary, as the daemons will be killed at
> >the end of the job anyway.
> Need to read the code... but I can't do it now :(
> BTW, what does our current LAM 7.x implementation do??

You mean the SGE-LAM integration? I'm looking into it anyway, to prepare a new
Howto. I think it can be done without the supplied Perl scripts.

> >But I can get access to a PBS cluster, I will look 
> >into it for curiosity.
> Thx, IMO our tight PE integration is not good enough because it doesn't
> "work out of the box" -- you need to configure a PE, and you need to make
> sure the "parallel job launcher" (e.g. mpirun) calls the qrsh.
> And the PE for each MPI implementation is different, which is also a weak
> point.

Yes, I see the point. But then it would be useful to have a way to 'map'
mpirun to the correct mpiexec. Some programs like Turbomole come with scripts
which call mpirun many times during the iterations. In this case you would
have to adjust the program's scripts, so it would not work out of the box
either. (And Turbomole starts one process more than requested, which only
collects the data and does no calculation: with 4 slots granted it uses
mpirun -np 5, so you have to prepare a proper hostfile to get a proper
distribution of the calculating tasks.)
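
A minimal sketch of such a hostfile, in Python. It assumes SGE's $PE_HOSTFILE
format (one line per host: hostname, slots, queue, processor range), that the
MPI launcher places ranks in machinefile order, and that the collecting
process is rank 0 and should share the master host. The function names are
hypothetical.

```python
def parse_pe_hostfile(text):
    """Return [(host, slots), ...] from the content of $PE_HOSTFILE.

    Each line reads: <host> <slots> <queue> <processor-range>.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], int(fields[1])))
    return entries


def machinefile_with_collector(entries):
    """One machinefile line per granted slot, plus one for the collector.

    Assumption: the non-computing collector is rank 0 and runs on the first
    (master) host, so the computing ranks map one-to-one onto the slots.
    """
    if not entries:
        raise ValueError("empty PE hostfile")
    lines = [entries[0][0]]  # extra entry for the collecting process
    for host, slots in entries:
        lines.extend([host] * slots)  # one entry per calculating task
    return lines
```

With 4 granted slots this yields 5 entries, matching the mpirun -np 5 call;
the job script would write them one per line to a file and hand that file to
mpirun.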

> >First we should see a working TM in SGE, and then decide whether
> >anything can be removed
> We may not need to remove anything; maybe we can create a type of PE
> called the "tight TM PE", with which qmaster doesn't start rshds.
> >or changed in the current PE support. I think
> >also of other parallel environments like Linda and TCGMSG, which can be
> >used already today with SGE without any problems. And if you use the
> >daemonless version of MPICH2, it's also working today.
> Yes, but again, just offload the MPI implementation details to mpiexec. The
> TM interface is cleaner than what we have today.


> >PBS has no shepherd and therefore the handling of communication in
> >PBS-mpiexec?), and this function could be built also into SGE.
> Sorry, don't understand what you are talking about.
> PBS has MOM, and it is just like the execd.

Well, let me explain it this way: at the moment SGE catches the rsh call,
starts an rshd, and uses a 'real' rsh to start the communication. Instead of
starting the rshd, why not start the program directly on the node as a child
of the shepherd? No rshd in the way. Would this work with all parallel
programs out of the box?
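
For comparison, the 'catch the rsh' step that exists today boils down to an
argv translation: a wrapper found first in $PATH turns
`rsh [-n] [-l user] host command` into `qrsh -inherit host command`, so the
task is started under SGE's control (and, today, via an rshd). A minimal
Python sketch; the function name is hypothetical, only leading -n and -l are
handled, and the -l user is simply dropped in this sketch.

```python
def to_qrsh_argv(rsh_argv):
    """Translate rsh-style arguments (without argv[0]) into a qrsh call.

    Only options before the hostname are parsed; -n maps to -nostdin and
    '-l <user>' is discarded. A real wrapper would then exec the result,
    e.g. with os.execvp(result[0], result).
    """
    nostdin = False
    args = list(rsh_argv)
    while args and args[0].startswith("-"):
        opt = args.pop(0)
        if opt == "-n":
            nostdin = True
        elif opt == "-l":
            if not args:
                raise ValueError("-l needs a user name")
            args.pop(0)  # user name, not forwarded in this sketch
        else:
            raise ValueError(f"unsupported rsh option: {opt}")
    if not args:
        raise ValueError("no host given")
    host, command = args[0], args[1:]
    qrsh = ["qrsh", "-inherit"]
    if nostdin:
        qrsh.append("-nostdin")
    return qrsh + [host] + command
```

The proposal above would keep this interception but drop the rshd from the
path, starting the command directly as a child of the shepherd.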

I looked into the mpiexec source, and at first glance I saw support for all
the communication types like ch_p4 and gm. So my question was: is this
necessary because with PBS you have only PBS_MOM and no shepherd on the slave
nodes, so you must take care of the communication on your own?

My assumption is that mpiexec catches the communication calls directly and so
avoids rsh entirely, since rsh can't be handled by PBS in a proper way. I'm
trying to understand what mpiexec does to direct all the communication to the
TM (yes, I will look into the source).

And: what will mpiexec do with the process groups on the slaves? Do you also
have to make sure that they aren't created, by setting some environment
variables? And argh, will the TM also send the environment variables to the
slave nodes, like -V? Okay, I will find out; give me some days.

Cheers - Reuti
