[GE users] Large openmpi jobs hang when launched through SGE on 130 hosts or more

fredlefebvre frederick.lefebvre at clumeq.ca
Fri Apr 30 14:54:18 BST 2010



Setting 'plm_rsh_num_concurrent' worked.  Thanks to all of you!
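
For the record, the 129/130 host boundary presumably comes from mpirun
itself running on one of the allocated nodes, so the default of 128
concurrent qrsh launches covers 129 hosts but not 130.

In case it helps others, here is a rough sketch of the standard OpenMPI
ways to set the parameter persistently ('./your_app' is a placeholder,
and 256 is just the value we picked; choose something at least as large
as your biggest host count):

  # one-off, directly on the mpirun command line
  mpirun -mca plm_rsh_num_concurrent 256 -np 1040 ./your_app

  # or via the environment, e.g. exported from the SGE job script
  export OMPI_MCA_plm_rsh_num_concurrent=256

  # or per user, in $HOME/.openmpi/mca-params.conf
  plm_rsh_num_concurrent = 256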

Frederick Lefebvre


On Fri, Apr 30, 2010 at 8:59 AM, andy <andy.schwierskott at sun.com> wrote:
> Hi,
>
> did you run into an OpenMPI / Sun HPC ClusterTools related limit:
>
> % ompi_info -all | grep plm_rsh_num_concurrent
>  MCA plm: parameter "plm_rsh_num_concurrent" (current value: "128", data source: default value)
>
> does this work:
>
>   mpirun -mca plm_rsh_num_concurrent 256 -np 2000
>
> Regards,
> Andy
>
> On Fri, 30 Apr 2010, fredlefebvre wrote:
>
>> Hi All,
>>
>> I'm not sure if this is an SGE or OpenMPI issue, but it works fine
>> when I test it outside of SGE... so here it is.
>>
>> We use SGE 6.2u3 on a cluster of 960 8-core nodes interconnected with
>> QDR InfiniBand.  We run mostly MPI jobs with OpenMPI.  Most of our
>> users use fewer than 250 cores at a time and it works fine for them.
>> But users running larger jobs have reported that their MPI
>> applications hang intermittently at startup.
>>
>> I first thought the problem was related to their use of a larger number
>> of 'cores'...  But it turns out it is directly linked to the number of
>> 'nodes/hosts' used by an application.  Basically, if a program runs on
>> 129 hosts or fewer, it works as expected, but if it runs on 130 hosts or
>> more, it hangs at startup.  That could be a 1040-slot job on 130 hosts
>> at 8 cores per host or a 130-slot job at 1 core per host.  I
>> understand that 'hang' may be a bit vague.  Both 'qstat' and 'qhost'
>> report processes on all requested hosts, but 'ps' shows nothing more
>> than the 'qrsh' and 'orted' processes (as well as the mpirun on the
>> master node), so it appears the error occurs before the program gets
>> a chance to run.  The same jobs work when launched manually with
>> mpirun and a hostfile.
>>
>> Has anyone observed this issue before? Any hints?
>>
>> Thanks
>>
>> Frederick Lefebvre
>>
>
>



