[GE users] SGE/OpenMPI - all MPI tasks run only on a single node
reuti at staff.uni-marburg.de
Mon Dec 21 20:44:12 GMT 2009
Sure, it's working:
reuti@pc15370:~> qconf -sp openmpi
reuti@pc15370:~> qsub -pe openmpi 4 ./test_openmpi.sh
Your job 2870 ("test_openmpi.sh") has been submitted
reuti@pc15370:~> ps -e f
  PID TTY      STAT   TIME COMMAND
 4732 ?        Sl    10:27 /usr/sge/bin/lx24-x86/sge_execd
25882 ?        S      0:00  \_ sge_shepherd-2870 -bg
25884 ?        Ss     0:00      \_ /bin/sh /var/spool/sge/pc15370/
25885 ?        S      0:00          \_ mpirun -np 4 /home/reuti/mpihello
25886 ?        Sl     0:00              \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381.Chemie.Uni-Marburg.DE orted -mca
25889 ?        R      0:01              \_ /home/reuti/mpihello
25890 ?        R      0:00              \_ /home/reuti/mpihello
reuti@pc15370:~> ssh pc15381 ps -e f
  PID TTY      STAT   TIME COMMAND
15803 ?        Sl     6:45 /usr/sge/bin/lx24-x86/sge_execd
 5181 ?        Sl     0:00  \_ sge_shepherd-2870 -bg
 5182 ?        Ss     0:00      \_ /usr/sge/utilbin/lx24-x86/
 5189 ?        S      0:00          \_ orted -mca ess env -mca orte_ess_jobid 1891631104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1891631104.0;tcp://
 5193 ?        R      0:04              \_ /home/reuti/mpihello
 5194 ?        R      0:04              \_ /home/reuti/mpihello
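One way to cross-check process listings like these is to look at what SGE actually granted the job. The following is a minimal sketch, assuming the usual PE hostfile layout of one line per host with host name, slot count, queue instance and processor range. The function name slots_per_host is made up for illustration and is not part of SGE or Open MPI.

```python
import os
from collections import OrderedDict


def slots_per_host(hostfile_text):
    """Return {hostname: slots} from the text of an SGE PE hostfile.

    Assumed line layout: hostname  slots  queue_instance  processor_range
    """
    granted = OrderedDict()
    for line in hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        # A host may be listed once per queue instance, so accumulate.
        granted[host] = granted.get(host, 0) + slots
    return granted


if __name__ == "__main__":
    # SGE sets PE_HOSTFILE inside a parallel job; outside a job it is unset.
    path = os.environ.get("PE_HOSTFILE")
    if path is None:
        print("PE_HOSTFILE is not set; run this inside a parallel SGE job")
    else:
        with open(path) as handle:
            for host, slots in slots_per_host(handle.read()).items():
                print(host, slots)
```

Run from the job script just before the mpirun line, this shows whether the job was given slots on more than one host before Open MPI starts anything.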
The question remains: is there a firewall, perhaps one in the switch that blocks certain ports?
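A quick way to test for blocked ports is to try a plain TCP connection from one node to the other. The sketch below is only an illustration: can_connect is a made-up helper, and because Open MPI picks its TCP ports dynamically unless told otherwise, one successful probe does not prove that every port is open.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

To use it, start any listener on a high port on one node and call can_connect with that node's name and port from the other node. A refusal or timeout on a port that is known to be listening points at filtering somewhere in between.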
On 21.12.2009 at 20:13, k_clevenger wrote:
>> Are you running SELinux? Can you turn it off? There are reported
>> problems with it and SGE.
> SELINUX=disabled is set on all nodes and the head. We were unable
> to determine exactly where this was coming from. It is not coming
> from any of the rc scripts.
>>> The PE definition:
>>> pe_name ompi
>>> slots 2
>> This is now a test-configuration - it was 32 in your last mail?
> Yes, it is a test cluster that was built as a process/sanity check.
> We see exactly the same results on the test cluster as the
> production cluster
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /bin/true
>>> stop_proc_args /bin/true
>>> allocation_rule $round_robin # the default $pe_hostfile
>>> absolutely will not work
>> Well, with one slot per node it can't find both slots, as $pe_slots
>> requires all slots to come from a single machine.
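The difference between the allocation rules can be sketched with a toy model. This is a simplified illustration, not the scheduler's real logic: it ignores load, queue ordering and resource requests, and the function name allocate is invented for this example.

```python
def allocate(rule, free_slots, requested):
    """Distribute `requested` slots over hosts; return {host: slots} or None.

    free_slots: list of (hostname, free_slot_count) in the order the
    scheduler would consider the hosts (an assumption of this toy model).
    """
    if sum(count for _, count in free_slots) < requested:
        return None  # not enough slots anywhere

    if rule == "$pe_slots":
        # All slots must come from one single host.
        for host, count in free_slots:
            if count >= requested:
                return {host: requested}
        return None

    granted = {}
    left = requested

    if rule == "$fill_up":
        # Take everything a host offers before moving to the next one.
        for host, count in free_slots:
            take = min(count, left)
            if take:
                granted[host] = take
                left -= take
            if left == 0:
                break
        return granted

    if rule == "$round_robin":
        # Take one slot per host per pass until the request is satisfied.
        remaining = dict(free_slots)
        while left > 0:
            for host, _ in free_slots:
                if left == 0:
                    break
                if remaining[host] > 0:
                    remaining[host] -= 1
                    granted[host] = granted.get(host, 0) + 1
                    left -= 1
        return granted

    raise ValueError("unsupported allocation rule: %s" % rule)
```

With the two test nodes offering one slot each and a request for two slots, the model returns None for $pe_slots and one slot on each node for $round_robin, which matches the behavior described above.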
> Good to know
>>> control_slaves FALSE
>>> job_is_first_task TRUE
>> This was different the last time, it should be:
>> control_slaves TRUE
>> job_is_first_task FALSE
> Fixed, no effect on jobs running correctly
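The two settings discussed here can be checked mechanically against saved `qconf -sp` output. This is a rough sketch: parse_pe and tight_integration_issues are hypothetical helper names, and the checks encode only the two recommendations made in this thread, not a complete validation of a PE.

```python
def parse_pe(text):
    """Parse `qconf -sp <pe>` style text into a dict of attribute -> value."""
    settings = {}
    for line in text.splitlines():
        # Drop trailing "# ..." annotations such as those added in the post.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def tight_integration_issues(settings):
    """Return a list of settings that differ from the values advised above."""
    issues = []
    if settings.get("control_slaves", "").upper() != "TRUE":
        issues.append("control_slaves should be TRUE")
    if settings.get("job_is_first_task", "").upper() != "FALSE":
        issues.append("job_is_first_task should be FALSE")
    return issues
```

Feeding it the PE definition as originally posted reports both settings. After the fix it returns an empty list, so any remaining problem has to be looked for elsewhere.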
>>> slots 2,[sgenode1.coh.org=1],[sgenode0.coh.org=1]
>> This is now a test-configuration with fewer slots?
>>> tmpdir /tmp
>>> shell /bin/bash
>>> prolog NONE
>>> epilog NONE
>>> shell_start_mode unix_behavior # I've tried both
>>> posix_behavior and unix_behavior
>> Yes, unix_behavior is often better than the default.
>> -- Reuti
> Has anyone verified that the ge62u4_lx24-amd64.tar.gz binaries will
> actually run OpenMPI jobs correctly, i.e., on more than one cluster
> node? Having built two clusters that exhibit exactly the same
> behavior (MPI cmdline works, SGE job doesn't) leads me to believe
> that either A) we're making the same configuration mistake
> somewhere or B) the binary is broken.
> Given that I've posted the cluster, queue and PE configurations
> here and changed everything that was pointed out, I'm leaning toward B.