[GE users] nodes overloaded: processes placed on already full nodes

steve_s elcortogm at googlemail.com
Tue Dec 21 19:09:27 GMT 2010

On Dec 21 19:31 +0100, reuti wrote:
> The sge_shepherd will be started on each slave node in case of a tight
> integration too. When you have a loose integration and no sge_shepherd
> on the slaves, then there may be processes which survive the crash of a
> job and hence result in the effect you observed, simply because SGE
> doesn't know anything about the processes started by a simple rsh/ssh
> outside of SGE's context.

OK, makes sense. I checked again, and yes: our job has an sge_shepherd
only on the master node. The sge_shepherd processes on the slaves belong
to different jobs.
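
For anyone wanting to repeat this check on a node, here is a rough
sketch (not something from the howtos) that lists processes of given
users which have no sge_shepherd among their ancestors. The function
names are made up for illustration, and it assumes a Linux procps-style
ps that accepts `-e -o pid=,ppid=,user=,comm=`. It is only a heuristic:
interactive logins and processes that were re-parented to init will
show up as well.

```python
import subprocess


def parse_ps(text):
    """Parse `ps -e -o pid=,ppid=,user=,comm=` output into {pid: (ppid, user, comm)}."""
    procs = {}
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        pid, ppid, user, comm = parts
        procs[int(pid)] = (int(ppid), user, comm.strip())
    return procs


def under_shepherd(pid, procs):
    """Return True if pid itself or one of its ancestors is an sge_shepherd."""
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)  # guard against loops in a racy snapshot
        ppid, _user, comm = procs[pid]
        if comm.startswith("sge_shepherd"):
            return True
        pid = ppid
    return False


def outside_sge(procs, users):
    """Return sorted PIDs owned by `users` that are not below any sge_shepherd."""
    return sorted(
        pid
        for pid, (_ppid, user, _comm) in procs.items()
        if user in users and not under_shepherd(pid, procs)
    )


def snapshot():
    """Take a process snapshot of the local node (assumes procps ps)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=,ppid=,user=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)
```

Running `outside_sge(snapshot(), {"someuser"})` on a slave node would
then give the PIDs to look at more closely. With a loose integration the
rsh/ssh-started MPI ranks should appear in that list; with a tight
integration they should not.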

> There is a Howto for the tight integration of MPICH2 prior to 1.3 and
> Intel MPI (which you are using) into SGE:
> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> http://gridengine.sunsource.net/howto/remove_orphaned_processes.html
> Intel MPICH2 will at some point in the future also use the Hydra
> startup manager.

We will have a look at these. Thanks very much indeed.


