[GE users] SGE 6.1u3 + OpenMPI 1.2.8 - what am I missing?

reuti reuti at staff.uni-marburg.de
Wed Dec 17 21:28:17 GMT 2008


Hi,

On 17.12.2008 at 21:04, Alex Chekholko wrote:

> Hi all,
>
> Thanks for your responses.  I did read that FAQ.
>
> I tried Gerald's suggestion, and SGE submits the job correctly and
> I can see the four slots via qstat.
>
> $ qstat -t
> job-ID  prior   name       user         state submit/start at      
> queue                          master ja-task-ID task-ID state  
> cpu        mem     io      stat failed
> ---------------------------------------------------------------------- 
> ---------------------------------------------------------------------- 
> ---------------------------
> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50  
> all.q at node-r2-u17-c18-p12-o12. SLAVE
> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50  
> all.q at node-r2-u18-c17-p13-o12. SLAVE            1.node-r2-u18-c17- 
> p13-o12 r     00:00:00 0.00000 0.00000
> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50  
> all.q at node-r2-u32-c5-p13-o22.l MASTER                        r
>                                                                    
> all.q at node-r2-u32-c5-p13-o22.l SLAVE
> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50  
> all.q at node-r4-u26-c13-p15-o10. SLAVE
>
>
> However, it looks like sge_shepherdd crashes on each of the nodes  
> that gets the job:
> sge_shepherd[17462]: segfault at 0000000000000001 rip  
> 00000032350607a7 rsp 00007fffa3f2ac50 error 4

This is severe, of course. What OS (i.e. kernel version, etc.) are you
using? Does it also happen when you submit without the -V option? Did
you also try giving mpirun the number of slots to be used?

Judging by the job number in your output, I assume your other serial
and parallel jobs are running fine.
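For example, a minimal submit script along those lines, without -V and
passing SGE's slot count to mpirun explicitly, might look like the sketch
below (the mpirun path is the one from your message; $NSLOTS is set by
SGE to the number of slots granted by the parallel environment):

```
#!/bin/bash
#$ -cwd
#$ -pe OpenMPI 4
# $NSLOTS is provided by SGE from the PE allocation (4 here)
/gpfs/fs0/share/bin/mpirun -np $NSLOTS a.out
```

Submitted simply as "qsub submitfile.txt", so the job inherits a clean
environment instead of the one exported by -V.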

-- Reuti


> Odd.  Any suggestions?
>
> Regards,
> Alex
>
>
> On Wed, 17 Dec 2008 08:52:07 -0500
> Chansup Byun <chansup.byun at sun.com> wrote:
>
>> I'm not sure if you checked the following FAQ:
>>
>> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
>>
>> - Chansup
>>
>> On 12/16/08 17:59, Gerald Ragghianti wrote:
>>> OpenMPI can detect that you are running within SGE, and shouldn't
>>> require many of the options to mpirun that you are providing.  I
>>> recommend the following submit file:
>>>
>>> #$ -V
>>> #$ -pe OpenMPI 4
>>> /gpfs/fs0/share/bin/mpirun a.out
>>>
>>> Submit the job as follows:
>>>
>>> qsub submitfile.txt
>>>
>>> Also, make sure that mpirun is the one provided by openmpi 1.2.8.
>>>
>>> - Gerald
>>>
>>> Alex Chekholko wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm running SGE 6.1u3 on x86_64 and I just installed OpenMPI  
>>>> 1.2.8 and I'm trying to get it working.
>>>>
>>>> I can run mpirun commands on the headnode, so that works.
>>>>
>>>> I can qsub a non-parallel job that runs mpirun, so that works as  
>>>> well, so all my env vars are OK, I think.
>>>>
>>>> I'm trying to run a parallel job now, after creating the PE and  
>>>> adding the PE to my queue.
>>>>
>>>> # qconf -sp OpenMPI
>>>> pe_name           OpenMPI
>>>> slots             256
>>>> user_lists        NONE
>>>> xuser_lists       NONE
>>>> start_proc_args   /bin/true
>>>> stop_proc_args    /bin/true
>>>> allocation_rule   $round_robin
>>>> control_slaves    TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots     min
>>>>
>>>> Trying to run a job like this:
>>>> $ cat mpi/test_mpi.sh
>>>> #!/bin/bash
>>>> /gpfs/fs0/share/bin/mpirun --mca pls_gridengine_verbose 1 --mca  
>>>> plm_rsh_agent ssh -np 4 a.out
>>>>
>>>> Where a.out is this code:
>>>> http://en.wikipedia.org/wiki/ 
>>>> Message_Passing_Interface#Example_program
>>>>
>>>> via a command like this:
>>>> qsub -V -pe OpenMPI 4 mpi/test_mpi.sh
>>>>
>>>> Get an error output like this:
>>>> $ cat  test_mpi.sh.e1176114
>>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
>>>> using global configuration
>>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
>>>> using global configuration
>>>> Starting server daemon at host "node-r1-u32-c5-p11-o22.local"
>>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
>>>> using global configuration
>>>> Starting server daemon at host "node-r1-u30-c7-p11-o21.local"
>>>> Starting server daemon at host "node-r4-u15-c24-p16-o16.local"
>>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
>>>> using global configuration
>>>> Starting server daemon at host "node-r2-u34-c3-p14-o18.local"
>>>> Server daemon successfully started with task id "1.node-r1-u32- 
>>>> c5-p11-o22"
>>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to  
>>>> host node-r1-u32-c5-p11-o22.local ...
>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>> reading exit code from shepherd ... Server daemon successfully  
>>>> started with task id "1.node-r4-u15-c24-p16-o16"
>>>> Server daemon successfully started with task id "1.node-r1-u30- 
>>>> c7-p11-o21"
>>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to  
>>>> host node-r1-u30-c7-p11-o21.local ...
>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>> reading exit code from shepherd ... Establishing /usr/bin/ssh -o  
>>>> StrictHostChecking=no session to host node-r4-u15-c24-p16- 
>>>> o16.local ...
>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>> reading exit code from shepherd ... Server daemon successfully  
>>>> started with task id "1.node-r2-u34-c3-p14-o18"
>>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to  
>>>> host node-r2-u34-c3-p14-o18.local ...
>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>> reading exit code from shepherd ... timeout (60 s) expired while  
>>>> waiting on socket fd 5
>>>>
>>>> How do I diagnose this "signal 13 (PIPE)" message?  My qlogin/ 
>>>> qrsh/qsh are configured per
>>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>>>> except I also added the "-o StrictHostChecking=no"
>>>>
>>>> Also, I'm using LDAP for user accounts, does that matter?  One  
>>>> thread I found said I _must_ use local accounts?
>>>> http://www.open-mpi.org/community/lists/users/2007/03/2826.php
>>>>
>>>> What am I missing?
>>>>
>>>> Thanks,
>>>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=93041

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
