[GE users] SGE 6.1u3 + OpenMPI 1.2.8 - what am I missing?

Alex Chekholko chekh at pcbi.upenn.edu
Wed Dec 17 20:04:51 GMT 2008


Hi all,

Thanks for your responses.  I did read that FAQ.

I tried Gerald's suggestion, and SGE submits the job correctly; I can see the four slots via qstat:

$ qstat -t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q@node-r2-u17-c18-p12-o12. SLAVE         
1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q@node-r2-u18-c17-p13-o12. SLAVE            1.node-r2-u18-c17-p13-o12 r     00:00:00 0.00000 0.00000 
1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q@node-r2-u32-c5-p13-o22.l MASTER                        r                                
                                                                  all.q@node-r2-u32-c5-p13-o22.l SLAVE         
1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q@node-r4-u26-c13-p15-o10. SLAVE         


However, it looks like sge_shepherd crashes on each node that gets the job:
sge_shepherd[17462]: segfault at 0000000000000001 rip 00000032350607a7 rsp 00007fffa3f2ac50 error 4

Odd.  Any suggestions?
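A first pass at diagnosing the shepherd crash might look like the sketch below. The paths are assumptions (a default $SGE_ROOT and cell name; adjust to your install), and KEEP_ACTIVE is a standard execd_params setting, not anything specific to this cluster:

```shell
#!/bin/sh
# Hypothetical paths: adjust SGE_ROOT and the cell name to your installation.
SGE_ROOT=${SGE_ROOT:-/opt/sge}
CELL=${SGE_CELL:-default}

# The execd messages file on the affected node often records why the
# shepherd exited:
tail -n 50 "$SGE_ROOT/$CELL/spool/$(hostname -s)/messages"

# The kernel log identifies the faulting address/library for the segfault:
dmesg | grep sge_shepherd

# Setting "execd_params KEEP_ACTIVE=TRUE" (via `qconf -mconf`) preserves the
# job's active_jobs directory, including the shepherd's trace and error
# files, for post-mortem inspection after the job dies.
```

Running this on one of the nodes listed as SLAVE in the qstat output above would show whether the shepherd is dying before or after it tries to start the qrsh/ssh task.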

Regards,
Alex


On Wed, 17 Dec 2008 08:52:07 -0500
Chansup Byun <chansup.byun at sun.com> wrote:

> I'm not sure if you checked the following FAQ:
> 
> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
> 
> - Chansup
> 
> On 12/16/08 17:59, Gerald Ragghianti wrote:
> > OpenMPI can detect that you are running within SGE, and shouldn't 
> > require many of the options to mpirun that you are providing.  I 
> > recommend the following submit file:
> >
> > #$ -V
> > #$ -pe OpenMPI 4
> > /gpfs/fs0/share/bin/mpirun a.out
> >
> > Submit the job as follows:
> >
> > qsub submitfile.txt
> >
> > Also, make sure that mpirun is the one provided by openmpi 1.2.8.
> >
> > - Gerald
> >
> > Alex Chekholko wrote:
> >   
> >> Hi,
> >>
> >> I'm running SGE 6.1u3 on x86_64; I just installed OpenMPI 1.2.8 and am trying to get it working.
> >>
> >> I can run mpirun commands on the headnode, so that works.
> >>
> >> I can qsub a non-parallel job that runs mpirun, so that works as well; my env vars seem to be OK.
> >>
> >> I'm trying to run a parallel job now, after creating the PE and adding the PE to my queue.
> >>
> >> # qconf -sp OpenMPI
> >> pe_name           OpenMPI
> >> slots             256
> >> user_lists        NONE
> >> xuser_lists       NONE
> >> start_proc_args   /bin/true
> >> stop_proc_args    /bin/true
> >> allocation_rule   $round_robin
> >> control_slaves    TRUE
> >> job_is_first_task FALSE
> >> urgency_slots     min
> >>
> >> Trying to run a job like this:
> >> $ cat mpi/test_mpi.sh 
> >> #!/bin/bash
> >> /gpfs/fs0/share/bin/mpirun --mca pls_gridengine_verbose 1 --mca plm_rsh_agent ssh -np 4 a.out
> >>
> >> Where a.out is this code:
> >> http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
> >>
> >> via a command like this:
> >> qsub -V -pe OpenMPI 4 mpi/test_mpi.sh
> >>
> >> I get error output like this:
> >> $ cat  test_mpi.sh.e1176114
> >> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> >> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> >> Starting server daemon at host "node-r1-u32-c5-p11-o22.local"
> >> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> >> Starting server daemon at host "node-r1-u30-c7-p11-o21.local"
> >> Starting server daemon at host "node-r4-u15-c24-p16-o16.local"
> >> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
> >> Starting server daemon at host "node-r2-u34-c3-p14-o18.local"
> >> Server daemon successfully started with task id "1.node-r1-u32-c5-p11-o22"
> >> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u32-c5-p11-o22.local ...
> >> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r4-u15-c24-p16-o16"
> >> Server daemon successfully started with task id "1.node-r1-u30-c7-p11-o21"
> >> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u30-c7-p11-o21.local ...
> >> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >> reading exit code from shepherd ... Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r4-u15-c24-p16-o16.local ...
> >> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r2-u34-c3-p14-o18"
> >> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r2-u34-c3-p14-o18.local ...
> >> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >> reading exit code from shepherd ... timeout (60 s) expired while waiting on socket fd 5
> >>
> >> How do I diagnose this "signal 13 (PIPE)" message?  My qlogin/qrsh/qsh are configured per
> >> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
> >> except that I also added "-o StrictHostChecking=no".
> >>
> >> Also, I'm using LDAP for user accounts; does that matter?  One thread I found said I _must_ use local accounts:
> >> http://www.open-mpi.org/community/lists/users/2007/03/2826.php
> >>
> >> What am I missing?
> >>
> >> Thanks,
> >>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=93035
