[GE users] Large openmpi jobs hang when launched through SGE on 130 hosts or more

andy andy.schwierskott at sun.com
Fri Apr 30 13:59:29 BST 2010


Hi,

Did you run into an Open MPI/Sun HPC ClusterTools related limit?

% ompi_info -all | grep plm_rsh_num_concurrent
  MCA plm: parameter "plm_rsh_num_concurrent" (current value: "128", data source: default value)

Does it work if you raise that limit:

   mpirun -mca plm_rsh_num_concurrent 256 -np 2000
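
If raising the limit does help, a job script could derive the value from the hosts SGE actually granted instead of hard-coding 256. The following is only a rough sketch: it assumes the usual SGE `$PE_HOSTFILE` format (host name in the first column) and Open MPI's `OMPI_MCA_<name>` environment variable convention. The helper names `count_hosts` and `pick_num_concurrent` are made up for illustration, and using the host count as the new limit is a guess, not something confirmed to cure the hang.

```shell
#!/bin/sh
# Sketch only: size plm_rsh_num_concurrent from the hosts granted by SGE.

# Print the number of distinct hosts listed in a PE hostfile
# (first whitespace-separated column of each line is the host name).
count_hosts() {
    awk '{ print $1 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Print a value for plm_rsh_num_concurrent: keep the default of 128
# unless the job spans more hosts than that (illustrative policy).
pick_num_concurrent() {
    nhosts=$1
    default=128
    if [ "$nhosts" -gt "$default" ]; then
        echo "$nhosts"
    else
        echo "$default"
    fi
}

# Hypothetical use inside the job script (./my_app is a placeholder):
#   n=$(pick_num_concurrent "$(count_hosts "$PE_HOSTFILE")")
#   export OMPI_MCA_plm_rsh_num_concurrent="$n"
#   mpirun -np "$NSLOTS" ./my_app
```

Setting the parameter through the environment has the same effect as passing `-mca plm_rsh_num_concurrent <value>` on the mpirun command line, so users' existing mpirun invocations would not need to change.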

Regards,
Andy

On Fri, 30 Apr 2010, fredlefebvre wrote:

> Hi All,
>
> I'm not sure if this is an SGE or OpenMPI issue but the fact is it
> works fine when I tested it outside of SGE... so here it is.
>
> We use SGE 6.2u3 on a cluster of 960 8 core nodes interconnected with
> QDR infiniband.  We run mostly mpi jobs with openmpi.  Most of our
> users use less than 250 cores at a time and it works fine for them.
> But larger users have reported that their mpi applications hang at
> start time from time to time.
>
> I first thought the problem was related to their use of a larger number
> of 'cores'...  But it turns out it is directly linked to the number of
> 'nodes/hosts' used by an application.  Basically, if a program runs on
> 129 hosts or fewer, it works as expected but if it runs on 130 hosts or
> more, it hangs at startup.  That could be a 1040 slot job on 130 hosts
> at 8 cores per host or a 130 slot job at 1 core per host.  I
> understand that 'hang' may be a bit vague.  Both 'qstat' and 'qhost'
> report processes on all requested hosts but 'ps' fails to show
> anything more than the 'qrsh' and 'orted' processes (as well as the
> mpirun on the master node) so it appears the error occurs before the
> program gets a chance to run. The same jobs work when launched
> manually with mpirun and a hostfile.
>
> Has anyone observed this issue before? Any hints?
>
> Thanks,
>
> Frederick Lefebvre
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255536
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=255545



