[GE users] Large openmpi jobs hang when launched through SGE on 130 hosts or more

craffi dag at sonsorol.org
Fri Apr 30 13:04:12 BST 2010

Is openmpi using SSH under the hood? Are you possibly running into a 
limit on open filehandles or other shell/system limits?
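
If it helps to check that, here is a rough sketch (Python, standard library only) that prints the soft and hard limits for open files and user processes as seen by whatever process runs it. The function name `report_limits` is only illustrative and is not part of SGE or OpenMPI. It is probably most useful when run from inside an SGE job script on the master node, since limits inherited from the execution daemon may differ from those in an interactive login shell.

```python
import resource


def fmt(value):
    # RLIM_INFINITY is how the resource module represents "unlimited".
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    # Limits most likely to matter when one process starts many children
    # (assumption: mpirun starts roughly one qrsh per remote host).
    limits = {
        "open files": resource.RLIMIT_NOFILE,
        "max user processes": resource.RLIMIT_NPROC,
    }
    report = {}
    for label, res in limits.items():
        soft, hard = resource.getrlimit(res)
        report[label] = (soft, hard)
        print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
    return report


if __name__ == "__main__":
    report_limits()
```

Comparing the output from a job script against the output from a login shell on the same node should show whether the two environments really have different limits.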

fredlefebvre wrote:
> Hi All,
> I'm not sure if this is an SGE or OpenMPI issue but the fact is it
> works fine when I tested it outside of SGE... so here it is.
> We use SGE 6.2u3 on a cluster of 960 8 core nodes interconnected with
> QDR infiniband.  We run mostly mpi jobs with openmpi.  Most of our
> users use less than 250 cores at a time and it works fine for them.
> But larger users have reported that their mpi applications hang at
> start time from time to time.
> I first thought the problem was related to their use of a larger number
> of 'cores'...  But it turns out it is directly linked to the number of
> 'nodes/hosts' used by an application.  Basically, if a program runs on
> 129 hosts or fewer, it works as expected but if it runs on 130 hosts or
> more, it hangs at startup.  That could be a 1040 slot job on 130 hosts
> at 8 cores per host or a 130 slot job at 1 core per host.  I
> understand that 'hang' may be a bit vague.  Both 'qstat' and 'qhost'
> report processes on all requested hosts but 'ps' fails to show
> anything more than the 'qrsh' and 'orted' processes (as well as the
> mpirun on the master node) so it appears the error occurs before the
> program gets a chance to run. The same jobs work when launched
> manually with mpirun and a hostfile.
> Has anyone observed that issue before? Any hints?
> Thanks
> Frederick Lefebvre
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255536
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
