[GE users] Large openmpi jobs hang when launched through SGE on 130 hosts or more
frederick.lefebvre at clumeq.ca
Fri Apr 30 12:22:35 BST 2010
I'm not sure if this is an SGE or OpenMPI issue but the fact is it
works fine when I tested it outside of SGE... so here it is.
We use SGE 6.2u3 on a cluster of 960 8-core nodes interconnected with
QDR InfiniBand. We run mostly MPI jobs with Open MPI. Most of our
users use fewer than 250 cores at a time and it works fine for them.
But larger users have reported that their mpi applications hang at
start time from time to time.
I first thought the problem was related to their use of a larger number
of 'cores'... But it turns out it is directly linked to the number of
'nodes/hosts' used by an application. Basically, if a program runs on
129 hosts or less, it works as expected but if it runs on 130 hosts or
more, it hangs at startup. That could be a 1040 slot job on 130 hosts
at 8 cores per host or a 130 slot job at 1 core per host. I
understand that 'hang' may be a bit vague. Both 'qstat' and 'qhost'
report processes on all requested hosts but 'ps' fails to show
anything more than the 'qrsh' and 'orted' processes (as well as the
mpirun on the master node) so it appears the error occurs before the
program gets a chance to run. The same jobs work when launched
manually with mpirun and a hostfile.
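For reference, the manual launch that works looks roughly like the sketch below. The node names, slot counts, and application name are illustrative, not taken from our actual setup:

```shell
# Build a hostfile by hand: one line per host, 8 slots each
# (130 hosts x 8 slots = 1040 ranks, matching the failing SGE job size).
# Hostnames node001..node130 are made up for illustration.
for i in $(seq -w 1 130); do
  echo "node$i slots=8"
done > hostfile

# Launching directly, outside SGE, does not hang:
# mpirun --hostfile hostfile -np 1040 ./my_mpi_app
```

Under SGE the same job instead goes through the qrsh-based tight integration, which is where the startup appears to stall.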
Has anyone observed this issue before? Any hints?