[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

reuti reuti at staff.uni-marburg.de
Mon Dec 21 20:44:12 GMT 2009


Sure, it's working:

reuti at pc15370:~> qconf -sp openmpi
pe_name            openmpi
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
reuti at pc15370:~> qsub -pe openmpi 4 ./test_openmpi.sh
Your job 2870 ("test_openmpi.sh") has been submitted
reuti at pc15370:~> ps -e f
   PID TTY      STAT   TIME COMMAND
...
  4732 ?        Sl    10:27 /usr/sge/bin/lx24-x86/sge_execd
25882 ?        S      0:00  \_ sge_shepherd-2870 -bg
25884 ?        Ss     0:00      \_ /bin/sh /var/spool/sge/pc15370/job_scripts/2870
25885 ?        S      0:00          \_ mpirun -np 4 /home/reuti/mpihello
25886 ?        Sl     0:00              \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381.Chemie.Uni-Marburg.DE orted -mca
25889 ?        R      0:01              \_ /home/reuti/mpihello
25890 ?        R      0:00              \_ /home/reuti/mpihello
...
reuti at pc15370:~> ssh pc15381 ps -e f
   PID TTY      STAT   TIME COMMAND
...
15803 ?        Sl     6:45 /usr/sge/bin/lx24-x86/sge_execd
  5181 ?        Sl     0:00  \_ sge_shepherd-2870 -bg
  5182 ?        Ss     0:00      \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/2870.1/1.pc15381
  5189 ?        S      0:00          \_ orted -mca ess env -mca orte_ess_jobid 1891631104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1891631104.0;tcp://192.168.151.70:43188;tcp6://2002:89f8:9962:b:213:d4ff:fe16:34e4:46480
  5193 ?        R      0:04              \_ /home/reuti/mpihello
  5194 ?        R      0:04              \_ /home/reuti/mpihello
...
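For reference, the node list that mpirun acts on comes from the file SGE exports as $PE_HOSTFILE inside the job. The sketch below mocks up that file's format ("hostname slots queue processor_range") using the hostnames from the listings above; the queue name all.q and the file itself are illustrative assumptions, since a real $PE_HOSTFILE exists only inside a running job.

```shell
# Mock-up of a $PE_HOSTFILE as SGE might write it for the 4-slot job
# above (format: hostname slots queue processor_range). Illustrative
# only; the real file lives under the job's active_jobs directory.
cat > pe_hostfile.sample <<'EOF'
pc15370 2 all.q@pc15370 UNDEFINED
pc15381 2 all.q@pc15381 UNDEFINED
EOF

# With tight integration (control_slaves TRUE) an SGE-aware Open MPI
# reads this allocation by itself; converting it by hand is only
# needed for debugging or for loose integration:
awk '{ printf "%s slots=%s\n", $1, $2 }' pe_hostfile.sample
# -> pc15370 slots=2
#    pc15381 slots=2
```

Inspecting this file from inside a failing job is a quick way to see whether SGE granted one node or two before blaming mpirun.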

Still the question: is there any firewall, maybe in the switch, blocking certain ports?

-- Reuti


Am 21.12.2009 um 20:13 schrieb k_clevenger:

>>> SELINUX_INIT=YES
>>
>> Are you running SELinux? Can you turn it off? There are reported
>> problems with it and SGE.
>
> SELINUX=disabled is set on all nodes and the head. We were unable  
> to determine exactly where this was coming from. It is not coming  
> from any of the rc scripts.
>
>>>
>>> The PE definition:
>>> pe_name            ompi
>>> slots              2
>>
>> This is now a test configuration - it was 32 in your last mail?
>
> Yes, it is a test cluster that was built as a process/sanity check.
> We see exactly the same results on the test cluster as the
> production cluster.
>
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $round_robin # the default $pe_slots absolutely will not work
>>
>> Well, with only one slot per node it can't place both tasks, as
>> $pe_slots requires all slots to come from a single machine.
>
> Good to know
>
>>
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>
>> This was different the last time, it should be:
>>
>> control_slaves TRUE
>> job_is_first_task FALSE
>
> Fixed; no effect, jobs still do not run correctly
>
>>> slots                 2,[sgenode1.coh.org=1],[sgenode0.coh.org=1]
>>
>> This is now a test configuration with fewer slots?
>
> Yes
>
>>
>>> tmpdir                /tmp
>>> shell                 /bin/bash
>>> prolog                NONE
>>> epilog                NONE
>>> shell_start_mode      unix_behavior # I've tried both posix_behavior and unix_behavior
>>
>> Yes, unix_behavior is often better than the default.
>>
>> -- Reuti
>>
>
> Has anyone verified that the ge62u4_lx24-amd64.tar.gz binaries will
> actually run OpenMPI jobs correctly, i.e., on more than one cluster
> node? Having built two clusters that exhibit exactly the same
> behavior (MPI command line works, SGE job doesn't) leads me to
> believe that either A) we're making the same configuration mistake
> somewhere or B) the binary is broken.
>
> Given that I've posted the cluster, queue and PE configurations
> here and changed everything that was pointed out, I'm leaning toward B.
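On the firewall question raised above: with control_slaves TRUE, mpirun starts the remote orted via qrsh -inherit, and the orted and MPI processes then open dynamic TCP ports between the nodes; a filter in the switch can therefore leave qrsh working while the MPI wire-up fails. A rough single-port probe from the shell (the hostname and port below are placeholders, not values from this thread) can use bash's /dev/tcp pseudo-device:

```shell
# Hedged sketch: succeed if a TCP connection to host $1, port $2
# opens within 2 seconds. Requires bash for the /dev/tcp redirection.
port_open() {
  timeout 2 bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

# Example (placeholder hostname): probe sge_execd's IANA-registered
# port 6445 on a remote node.
# port_open sgenode1.coh.org 6445 && echo reachable || echo blocked
```

Since Open MPI's ports are dynamic, probing a handful of ports only gives hints; the conclusive test is to temporarily allow all TCP traffic between the cluster nodes and see whether the tasks then spread across both machines.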

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234510



