[GE users] SGE/OpenMPI - all MPI tasks run only on a single node

k_clevenger kclevenger at coh.org
Mon Dec 21 19:13:30 GMT 2009


> > SELINUX_INIT=YES
> 
> Are you running SELinux? Can you turn it off, there are reported  
> problems with it and SGE.

SELINUX=disabled is set on all compute nodes and on the head node. We were unable to determine exactly where SELINUX_INIT=YES is coming from. It is not set by any of the rc scripts.
 
> >
> > The PE definition:
> > pe_name            ompi
> > slots              2
> 
> This is now a test-configuration - it was 32 in your last mail?

Yes, it is a test cluster that was built as a process/sanity check. We see exactly the same results on the test cluster as on the production cluster.

> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $round_robin # the default $pe_hostfile  
> > absolutely will not work
> 
> Well, with one slot per node it can't find both, as $pe_slots implies  
> to use only one machine.

Good to know

> 
> > control_slaves     FALSE
> > job_is_first_task  TRUE
> 
> This was different the last time, it should be:
> 
> control_slaves TRUE
> job_is_first_task FALSE

Fixed; it had no effect, and jobs still do not run correctly.

> > slots                 2,[sgenode1.coh.org=1],[sgenode0.coh.org=1]
> 
> This is now a test-configuration with less slots?

Yes

> 
> > tmpdir                /tmp
> > shell                 /bin/bash
> > prolog                NONE
> > epilog                NONE
> > shell_start_mode      unix_behavior # I've tried both  
> > posix_behavior and unix_behavior
> 
> Yes, unix_behavior is often better than the default.
> 
> -- Reuti
> 

Has anyone verified that the ge62u4_lx24-amd64.tar.gz binaries will actually run OpenMPI jobs correctly, i.e., on more than one cluster node? Having built two clusters that exhibit exactly the same behavior (MPI from the command line works, the SGE job doesn't) leads me to believe that either A) we're making the same configuration mistake somewhere or B) the binary is broken.

Given that I've posted the cluster, queue and PE configurations here and changed everything that was pointed out, I'm leaning toward B.
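
One check that might help separate A from B is to look at what SGE actually granted to the job before mpirun starts. The following is only a sketch, not a tested fix: it assumes the job runs under a parallel environment so that SGE sets PE_HOSTFILE, and the helper name summarize_pe_hostfile is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or Open MPI): summarize a PE hostfile.
# Each line of the file is expected to look like:
#   <hostname> <slots> <queue instance> <processor range>
summarize_pe_hostfile() {
    # Count the lines (one per granted host) and add up the slot column.
    awk '{ hosts++; slots += $2 }
         END { printf "hosts=%d slots=%d\n", hosts, slots }' "$1"
}

# Inside a job script, SGE sets PE_HOSTFILE for parallel jobs.
# Outside of SGE the variable is unset and nothing is printed.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    summarize_pe_hostfile "$PE_HOSTFILE"
fi
```

If a 2-slot job with one slot per node reports only one host here, the problem would seem to be in the allocation (scheduler or PE configuration). If it reports both hosts and all tasks still land on one node, the launch side (how mpirun picks up the allocation) looks like the more likely place to dig.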

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=234504
