[GE users] Long delay when submitting large jobs

Rayson Ho raysonho at eseenet.com
Mon Feb 7 20:24:04 GMT 2005



>My problem (I think) has nothing to do with the PE being
>tight or loose.  The problem is that when migrating between
>'qw' and 't', the server talks to every node (when control_slaves is
>TRUE).  During this time, the server cannot respond to other
>requests for information, like qsub.  The server shouldn't block.

But qmaster has to get the rshds started on the remote nodes, so if you
have a 1024-node MPI job, qmaster needs to iterate 1024 times to contact
the nodes.

With loose integration, control_slaves should be FALSE, and qmaster doesn't
need to do that. (From qmaster's point of view, the value of
"control_slaves" is the only difference between a tight and a loose PE.)
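
For reference, here is roughly what a tightly integrated PE looks like in
the output of "qconf -sp". This is only a sketch: the PE name "mpi_tight"
and all of the values are made up for illustration, and the exact set of
fields may differ between SGE versions. The line that matters here is
control_slaves.

```
pe_name           mpi_tight
slots             1024
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_in
control_slaves    TRUE
job_is_first_task FALSE
```

Changing control_slaves to FALSE (for example with "qconf -mp mpi_tight")
would make the same PE loose as far as qmaster is concerned.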

>Isn't the rshd process captured when there is a call to rsh
>or ssh?  That is when the job script is running.  The problem
>is prior to that.

But someone still has to start the rshd, and remember that these are not
the normal rshds; they are "SGE-enabled". That someone is the shepherd,
which is started by execd. execd itself doesn't know when to start the
shepherd/rshd, so qmaster needs to contact each execd, and that's why
qmaster more or less hangs when you have a large job.
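
To put rough numbers on this, here is a toy model in Python. It is not
qmaster code, just a back-of-the-envelope sketch. The function names and
the 0.05 s per-contact cost are hypothetical, and it assumes that a loose
PE needs only a single contact (the node that runs the job script) and
that the server does nothing else while it walks the node list.

```python
def dispatch_busy_seconds(num_nodes, per_contact_s, control_slaves):
    """Time the (toy) server spends contacting execds for one job."""
    # Assumption: a tight PE (control_slaves TRUE) contacts every node,
    # a loose PE contacts only the node that runs the job script.
    contacts = num_nodes if control_slaves else 1
    return contacts * per_contact_s


def qsub_wait_seconds(arrival_s, num_nodes, per_contact_s, control_slaves):
    """How long a qsub arriving arrival_s after dispatch began must wait."""
    # Assumption: the server is single-threaded and serves nothing else
    # until it has finished contacting all of the nodes.
    busy_until = dispatch_busy_seconds(num_nodes, per_contact_s, control_slaves)
    return max(0.0, busy_until - arrival_s)


if __name__ == "__main__":
    for tight in (True, False):
        wait = qsub_wait_seconds(1.0, 1024, 0.05, tight)
        print(f"control_slaves={tight}: qsub waits about {wait:.1f} s")
```

The real per-contact cost depends on your network and on the load on
qmaster, so treat the output as an illustration of the scaling only: the
wait grows linearly with the number of nodes in the tight case.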

Rayson


