[GE users] SGE 6.1u3 + OpenMPI 1.2.8 - what am I missing?

reuti reuti at staff.uni-marburg.de
Wed Dec 17 11:20:28 GMT 2008


Hi,

On 17.12.2008 at 10:32, Alessio Terpin wrote:

> Alex Chekholko wrote:
>
> Hi Alex,
>
>   I don't know if this is your problem, but I have integrated SGE
> with OpenMPI and ran into some problems with ssh and OpenMPI.

With a tight integration, mpiexec should call qrsh, which in turn uses
the remote startup method configured in SGE for qrsh, either rsh or ssh.
Did you set up SGE to use ssh according to:

http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html ?
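
To see what is actually configured, something along these lines might help. It is only a sketch: it filters the cluster configuration for the remote startup entries (qlogin/rlogin/rsh command and daemon) that the howto deals with. The helper name show_remote_startup is made up for this example, and the qconf call is skipped when qconf is not in the PATH.

```shell
# Hypothetical helper: keep only the remote startup entries of an
# SGE configuration read from stdin.
show_remote_startup() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Global configuration; "qconf -sconf <hostname>" would show the
# local configuration of one host instead.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | show_remote_startup
fi
```

Options given after the command end up on the ssh command line (the job output quoted below shows them), so their spelling is worth a second look: OpenSSH's option for host key checking is called StrictHostKeyChecking. I can't say whether that is related to the problem here.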

In a closed cluster on a private network, rsh is often sufficient as
well. As others stated, a plain mpiexec/mpirun should start the job
correctly, but it might be necessary to specify the number of slots
again:

$ mpirun -np $NSLOTS a.out

in your jobscript. I have never even seen "local configuration
node-r1-u32-c5-p11-o22.local not defined - using global configuration"
in a job's output. AFAIK it appears when sge_execd is started.
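
A job script along those lines could look like the following. Take it as a sketch only: the PE name OpenMPI and a.out come from the original mail, the rest is assumption. NSLOTS and PE_HOSTFILE are set by SGE for parallel jobs; the dry-run branch and the helper name mpirun_cmd are hypothetical additions so the script can be tried outside of SGE.

```shell
#!/bin/bash
#$ -pe OpenMPI 4
#$ -cwd

# NSLOTS holds the number of slots SGE granted; fall back to 1
# when the script is run outside of SGE.
slots="${NSLOTS:-1}"

# Hypothetical helper: print the launch command for a slot count
# and a program.
mpirun_cmd() {
    printf 'mpirun -np %s %s\n' "$1" "$2"
}

if [ -n "${PE_HOSTFILE:-}" ]; then
    # Inside a parallel job: start the program on all granted slots.
    mpirun -np "$slots" ./a.out
else
    # Not started by SGE: only print what would be run.
    echo "dry run: $(mpirun_cmd "$slots" ./a.out)"
fi
```

This way -np follows whatever slot count was granted, whether it was requested in the script or on the qsub command line.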

-- Reuti


>   OpenMPI launches the orted daemon via ssh, but in a non-interactive
> shell, so the shell does *not* read the profile.
>
>   I passed the environment variables via ~/.ssh/environment.
>
>   I hope this is useful.
>
>
>
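
To check this on a node, one can compare what a non-interactive ssh command sees with what the login shell has. A rough sketch, with made-up helper names (filter_mpi_env, remote_env) and the assumption that PATH and LD_LIBRARY_PATH are the variables Open MPI needs to find orted and its libraries:

```shell
# Hypothetical helper: keep only the variables of interest from
# "env" output read from stdin.
filter_mpi_env() {
    grep -E '^(PATH|LD_LIBRARY_PATH)='
}

# Hypothetical helper: print those variables as a non-interactive
# ssh command on host $1 sees them.
remote_env() {
    ssh -o BatchMode=yes "$1" env | filter_mpi_env
}

# Values of the current shell, for comparison.
env | filter_mpi_env
```

For example: remote_env node-r1-u32-c5-p11-o22.local. Note that sshd reads ~/.ssh/environment only when PermitUserEnvironment is enabled in sshd_config.
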
>> Hi,
>>
>> I'm running SGE 6.1u3 on x86_64 and I just installed OpenMPI 1.2.8  
>> and I'm trying to get it working.
>>
>> I can run mpirun commands on the headnode, so that works.
>>
>> I can qsub a non-parallel job that runs mpirun, so that works as  
>> well, so all my env vars are OK, I think.
>>
>> I'm trying to run a parallel job now, after creating the PE and  
>> adding the PE to my queue.
>>
>> # qconf -sp OpenMPI
>> pe_name           OpenMPI
>> slots             256
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /bin/true
>> stop_proc_args    /bin/true
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>> Trying to run a job like this:
>> $ cat mpi/test_mpi.sh
>> #!/bin/bash
>> /gpfs/fs0/share/bin/mpirun --mca pls_gridengine_verbose 1 --mca plm_rsh_agent ssh -np 4 a.out
>>
>> Where a.out is this code:
>> http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
>>
>> via a command like this:
>> qsub -V -pe OpenMPI 4 mpi/test_mpi.sh
>>
>> Get an error output like this:
>> $ cat  test_mpi.sh.e1176114
>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>> Starting server daemon at host "node-r1-u32-c5-p11-o22.local"
>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>> Starting server daemon at host "node-r1-u30-c7-p11-o21.local"
>> Starting server daemon at host "node-r4-u15-c24-p16-o16.local"
>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>> Starting server daemon at host "node-r2-u34-c3-p14-o18.local"
>> Server daemon successfully started with task id "1.node-r1-u32-c5-p11-o22"
>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u32-c5-p11-o22.local ...
>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r4-u15-c24-p16-o16"
>> Server daemon successfully started with task id "1.node-r1-u30-c7-p11-o21"
>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u30-c7-p11-o21.local ...
>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>> reading exit code from shepherd ... Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r4-u15-c24-p16-o16.local ...
>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r2-u34-c3-p14-o18"
>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r2-u34-c3-p14-o18.local ...
>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>> reading exit code from shepherd ... timeout (60 s) expired while waiting on socket fd 5
>>
>> How do I diagnose this "signal 13 (PIPE)" message?  My qlogin/qrsh/qsh
>> are configured per
>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>> except I also added the "-o StrictHostChecking=no"
>>
>> Also, I'm using LDAP for user accounts, does that matter?  One  
>> thread I found said I _must_ use local accounts?
>> http://www.open-mpi.org/community/lists/users/2007/03/2826.php
>>
>> What am I missing?
>>
>> Thanks,
>>
>
>
> -- 
> AOES                    | Alessio Terpin : Unix System Administrator
> Huygensstraat 34        | Tel : +31 (0) 71 579 55 519
> 2201 DK Noordwijk (ZH)  | Fax : +31 (0) 71 572 12 77
> The Netherlands         | WebSite www.aoes.com

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92938

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


