[GE users] SGE 6.1u3 + OpenMPI 1.2.8 - what am I missing?

Alessio Terpin aterpin at aoes.com
Wed Dec 17 09:32:30 GMT 2008


Alex Chekholko wrote:

Hi Alex,

  I don't know if this is your problem, but when I integrated SGE with
OpenMPI I ran into a problem with ssh and OpenMPI.

  OpenMPI launches the orted daemon via ssh, but in a non-interactive
shell, so the shell does *not* read the profile.

  I passed the environment variables via ~/.ssh/environment.
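
  As a rough sketch of what I mean (not my exact setup): the file holds one
NAME=value line per variable, and sshd only reads it if PermitUserEnvironment
is enabled in sshd_config on the compute nodes (it is off by default). The
helper name write_ssh_environment and the variables in the example call are
only assumptions for illustration; use whatever your OpenMPI install needs.

```shell
#!/bin/sh
# Sketch: write selected variables in the NAME=value format that sshd
# reads from ~/.ssh/environment.
# write_ssh_environment is a hypothetical helper, not part of SGE or OpenMPI.
write_ssh_environment() {
    out_file=$1
    shift
    : > "$out_file"                      # create or truncate the target file
    for name in "$@"; do
        # Only write variables that are actually set in the current shell.
        if eval "[ -n \"\${$name+x}\" ]"; then
            eval "value=\${$name}"
            printf '%s=%s\n' "$name" "$value" >> "$out_file"
        fi
    done
    chmod 600 "$out_file"                # keep the file private to the user
}

# Example call (commented out so that sourcing this file changes nothing;
# the variable names are assumptions):
# write_ssh_environment "$HOME/.ssh/environment" PATH LD_LIBRARY_PATH
```

  Afterwards, running "ssh <node> env" from the head node should show the
same values that an interactive login shows.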

  I hope that helps.



> Hi,
>
> I'm running SGE 6.1u3 on x86_64 and I just installed OpenMPI 1.2.8 and I'm trying to get it working.
>
> I can run mpirun commands on the headnode, so that works.
>
> I can qsub a non-parallel job that runs mpirun, so that works as well, so all my env vars are OK, I think.
>
> I'm trying to run a parallel job now, after creating the PE and adding the PE to my queue.
>
> # qconf -sp OpenMPI
> pe_name           OpenMPI
> slots             256
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /bin/true
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> Trying to run a job like this:
> $ cat mpi/test_mpi.sh 
> #!/bin/bash
> /gpfs/fs0/share/bin/mpirun --mca pls_gridengine_verbose 1 --mca plm_rsh_agent ssh -np 4 a.out
>
> Where a.out is this code:
> http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
>
> via a command like this:
> qsub -V -pe OpenMPI 4 mpi/test_mpi.sh
>
> Get an error output like this:
> $ cat  test_mpi.sh.e1176114
> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> Starting server daemon at host "node-r1-u32-c5-p11-o22.local"
> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> Starting server daemon at host "node-r1-u30-c7-p11-o21.local"
> Starting server daemon at host "node-r4-u15-c24-p16-o16.local"
> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> Starting server daemon at host "node-r2-u34-c3-p14-o18.local"
> Server daemon successfully started with task id "1.node-r1-u32-c5-p11-o22"
> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u32-c5-p11-o22.local ...
> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r4-u15-c24-p16-o16"
> Server daemon successfully started with task id "1.node-r1-u30-c7-p11-o21"
> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u30-c7-p11-o21.local ...
> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> reading exit code from shepherd ... Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r4-u15-c24-p16-o16.local ...
> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r2-u34-c3-p14-o18"
> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r2-u34-c3-p14-o18.local ...
> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> reading exit code from shepherd ... timeout (60 s) expired while waiting on socket fd 5
>
> How do I diagnose this "signal 13 (PIPE)" message?  My qlogin/qrsh/qsh are configured per
> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
> except I also added the "-o StrictHostChecking=no"
>
> Also, I'm using LDAP for user accounts, does that matter?  One thread I found said I _must_ use local accounts?
> http://www.open-mpi.org/community/lists/users/2007/03/2826.php
>
> What am I missing?
>
> Thanks,
>   


-- 
AOES                    | Alessio Terpin : Unix System Administrator
Huygensstraat 34        | Tel : +31 (0) 71 579 55 519 
2201 DK Noordwijk (ZH)  | Fax : +31 (0) 71 572 12 77
The Netherlands         | WebSite www.aoes.com

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92924