[GE users] Long delay when submitting large jobs

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Feb 15 10:17:11 GMT 2005


On Mon, 14 Feb 2005, Rayson Ho wrote:

> And if we have the TM library available for SGE now, what kind of changes
> would be needed to do what you sent to the dev last year??

No change is needed. The main point was the ability to spawn tasks on
slave nodes without the rsh/rshd overhead and the use of random ports
and, as you already wrote, tm_spawn() can do this.

I would however like to emphasize one thing that is very important for
parallel jobs: tm_spawn() or some similar functionality should be
implemented efficiently, so that a parallel job can be started on
hundreds of nodes. A simple loop over blocking calls (e.g. each one
waiting for its start-up confirmation) should be available for those
who do not want a flurry of network activity on the master node, but a
smart routine that sends many start-up requests and only then waits
for the confirmations should be the default. This might require some
changes to the way the job information is kept in SGE; I haven't
looked at the source since March-April 2004...
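
To make the difference between the two start-up strategies concrete,
here is a small simulation. It is only a sketch: start_task() is a
hypothetical stand-in for whatever call would send a start-up request
to a slave node and block until the confirmation comes back (it is not
tm_spawn() and not any SGE function), and the delay and max_in_flight
values are made-up numbers. A thread pool is used just to keep many
requests in flight at once; a real implementation could equally well
use non-blocking I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def start_task(node, delay=0.02):
    """Hypothetical start-up request: blocks until the node 'confirms'."""
    time.sleep(delay)  # simulated network round trip to the slave node
    return node        # the confirmation is just the node name here


def start_sequential(nodes, delay=0.02):
    """Simple loop: send one request, wait for it, then go to the next."""
    return [start_task(node, delay) for node in nodes]


def start_batched(nodes, delay=0.02, max_in_flight=64):
    """Send many requests first, then wait for all the confirmations."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Submit every request before waiting on any of them.
        pending = [pool.submit(start_task, node, delay) for node in nodes]
        # Collect the confirmations in the order the requests were sent.
        return [request.result() for request in pending]


if __name__ == "__main__":
    nodes = ["node%03d" % i for i in range(50)]
    for starter in (start_sequential, start_batched):
        began = time.perf_counter()
        confirmed = starter(nodes)
        elapsed = time.perf_counter() - began
        print(starter.__name__, len(confirmed), "nodes,", round(elapsed, 2), "s")
```

With the sequential loop the total start-up time grows with the number
of nodes times the round-trip time, while the batched variant is
limited mostly by the slowest confirmation (and by max_in_flight, which
also caps the burst of network activity on the master node).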

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De


