[GE users] SGE 6.1u3 + OpenMPI 1.2.8 - what am I missing?

Alex Chekholko chekh at pcbi.upenn.edu
Wed Dec 17 22:07:21 GMT 2008


Hi Reuti, all,

This is on EL5.2, CentOS
Linux node-r1-u19-c16-p10-o14.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


I tried adding "-np 4" to the mpirun line in my mpi1.txt:
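
(For reference, my mpi1.txt at this point was roughly the following,
i.e. Gerald's suggested file from below with "-np 4" added:)

#$ -V
#$ -pe OpenMPI 4
/gpfs/fs0/share/bin/mpirun -np 4 a.out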


$ qstat -t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u14-c21-p11-o11. MASTER        
                                                                  all.q at node-r1-u14-c21-p11-o11. SLAVE         
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u17-c18-p10-o13. SLAVE         
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u21-c14-p10-o23. SLAVE         
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u31-c6-p10-o22.l SLAVE         
[chekh at beta.genomics.upenn.edu] ~/mpi [0] 
$ qstat -t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u14-c21-p11-o11. MASTER                        r     00:00:00 0.00170 0.00000 
                                                                  all.q at node-r1-u14-c21-p11-o11. SLAVE            1.node-r1-u14-c21-p11-o11 r     00:00:00 0.00000 0.00000 
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u17-c18-p10-o13. SLAVE            1.node-r1-u17-c18-p10-o13 r     00:00:00 0.00000 0.00000 
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u21-c14-p10-o23. SLAVE            1.node-r1-u21-c14-p10-o23 r     00:00:00 0.00000 0.00000 
1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u31-c6-p10-o22.l SLAVE            1.node-r1-u31-c6-p10-o22 r     00:00:00 0.00000 0.00000 


The results were better; the process tree on the master node looked like this:
root      3504  0.1  0.0  88420  4492 ?        S    Dec01  29:33 /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_execd
root     17630  0.0  0.0  32832  3332 ?        S    16:51   0:00  \_ sge_shepherd-1176139 -bg
chekh    17632  0.0  0.0  63844  1068 ?        Ss   16:51   0:00  |   \_ bash /gpfs/fs0/share/ge-6.1u3/PGFI3/spool/node-r1-u14-c21-p11-o11/job_scripts/1176139
chekh    17633  0.0  0.0  96864  4684 ?        S    16:51   0:00  |       \_ /gpfs/fs0/share/bin/mpirun -np 4 a.out
chekh    17634  0.0  0.0  32556  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u14-c21-p11-o11.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename node-r1-u14-c21-p11-o11.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
chekh    17635  0.0  0.0  32556  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u31-c6-p10-o22.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0 --nodename node-r1-u31-c6-p10-o22.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
chekh    17636  0.0  0.0  32556  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u21-c14-p10-o23.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0 --nodename node-r1-u21-c14-p10-o23.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
chekh    17637  0.0  0.0  32560  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u17-c18-p10-o13.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node-r1-u17-c18-p10-o13.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
root     17638  0.0  0.0  32832  3340 ?        S    16:51   0:00  \_ sge_shepherd-1176139 -bg
root     17639  0.0  0.0  33468  2924 ?        Ss   16:51   0:00      \_ sge_shepherd-1176139 -bg

but the job still just hung, this time without the shepherd crash, and the output looked like this:

$ ls -alh ~/*139
-rw-r--r-- 1 chekh pgfi 2.4K Dec 17 16:52 /gpfs/fs0/u/chekh/mpi1.txt.e1176139
-rw-r--r-- 1 chekh pgfi    0 Dec 17 16:51 /gpfs/fs0/u/chekh/mpi1.txt.o1176139
-rw-r--r-- 1 chekh pgfi    0 Dec 17 16:51 /gpfs/fs0/u/chekh/mpi1.txt.pe1176139
-rw-r--r-- 1 chekh pgfi    0 Dec 17 16:51 /gpfs/fs0/u/chekh/mpi1.txt.po1176139
[chekh at beta.genomics.upenn.edu] ~/mpi [0] 
$ cat ~/*139
error: error reading returncode of remote command
[node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u14-c21-p11-o11.local failed to start as expected.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
[node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
[node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
[node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.
error: error reading returncode of remote command
[node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u31-c6-p10-o22.local failed to start as expected.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
[node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
[node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
[node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.
error: error reading returncode of remote command
[node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u21-c14-p10-o23.local failed to start as expected.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
[node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
[node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
[node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.
error: error reading returncode of remote command
[node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u17-c18-p10-o13.local failed to start as expected.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
[node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
[node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
[node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.


Then I tried removing the -V from the job script:
$ cat mpi1.txt 
#$ -pe OpenMPI 4
/gpfs/fs0/share/bin/mpirun -np 4 a.out
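
(A variant I may try next, letting SGE supply the slot count via its
$NSLOTS environment variable instead of hard-coding it; untested sketch:)

#$ -pe OpenMPI 4
/gpfs/fs0/share/bin/mpirun -np $NSLOTS a.out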

$ qstat -t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r1-u19-c16-p10-o14. SLAVE            1.node-r1-u19-c16-p10-o14 r     00:00:00 0.00000 0.00000 
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r2-u34-c3-p14-o18.l SLAVE            1.node-r2-u34-c3-p14-o18 r     00:00:00 0.00000 0.00000 
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u29-c10-p16-o6.l MASTER                        r     00:00:00 0.00280 0.00000 
                                                                  all.q at node-r4-u29-c10-p16-o6.l SLAVE            1.node-r4-u29-c10-p16-o6 r     00:00:00 0.00000 0.00000 
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u3-c36-p16-o23.l SLAVE            1.node-r4-u3-c36-p16-o23 r     00:00:00 0.00000 0.00000 
[chekh at beta.genomics.upenn.edu] ~/mpi [0] 
$ qstat -t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r1-u19-c16-p10-o14. SLAVE         
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r2-u34-c3-p14-o18.l SLAVE         
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u29-c10-p16-o6.l MASTER                        r     00:00:00 0.00280 0.00000 
                                                                  all.q at node-r4-u29-c10-p16-o6.l SLAVE         
1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u3-c36-p16-o23.l SLAVE         
[chekh at beta.genomics.upenn.edu] ~/mpi [0] 
$ ls -alh ../*143
-rw-r--r-- 1 chekh pgfi 2.4K Dec 17 17:03 ../mpi1.txt.e1176143
-rw-r--r-- 1 chekh pgfi    0 Dec 17 17:02 ../mpi1.txt.o1176143
-rw-r--r-- 1 chekh pgfi    0 Dec 17 17:02 ../mpi1.txt.pe1176143
-rw-r--r-- 1 chekh pgfi    0 Dec 17 17:02 ../mpi1.txt.po1176143
[chekh at beta.genomics.upenn.edu] ~/mpi [0] 
$ cat ../mpi1.txt.e1176143 
error: error reading returncode of remote command
error: error reading returncode of remote command
[node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r4-u29-c10-p16-o6.local failed to start as expected.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
[node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
[node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
[node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r2-u34-c3-p14-o18.local failed to start as expected.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
[node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
[node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
[node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.
error: error reading returncode of remote command
[node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r4-u3-c36-p16-o23.local failed to start as expected.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
[node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
[node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
[node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.
error: error reading returncode of remote command
[node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r1-u19-c16-p10-o14.local failed to start as expected.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
[node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
[node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
[node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.

It is not clear to me what I _should_ be seeing.  Where can I look up more details on what qrsh does exactly?
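
One thing I plan to try next (an untested sketch, based on the qrsh
command line visible in the ps output above): reproduce the qrsh call
by hand from inside a running job script, where SGE has already set
JOB_ID, against one of the slave hosts:

  qrsh -inherit -noshell -nostdin -V <slave-hostname> hostname

If that fails the same way, the problem is presumably in the
qrsh/shepherd layer rather than in orted itself.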

Regards,
Alex

On Wed, 17 Dec 2008 22:28:17 +0100
reuti <reuti at staff.uni-marburg.de> wrote:

> Hi,
> 
> Am 17.12.2008 um 21:04 schrieb Alex Chekholko:
> 
> > Hi all,
> >
> > Thanks for your responses.  I did read that FAQ.
> >
> > I tried Gerald's suggestion, and SGE submits the job correctly and  
> > I can see the four slots via qstat.
> >
> > $ qstat -t
> > job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r2-u17-c18-p12-o12. SLAVE
> > 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r2-u18-c17-p13-o12. SLAVE            1.node-r2-u18-c17-p13-o12 r     00:00:00 0.00000 0.00000
> > 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r2-u32-c5-p13-o22.l MASTER                        r
> >                                                                   all.q at node-r2-u32-c5-p13-o22.l SLAVE
> > 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r4-u26-c13-p15-o10. SLAVE
> >
> >
> > However, it looks like sge_shepherd crashes on each of the nodes
> > that gets the job:
> > sge_shepherd[17462]: segfault at 0000000000000001 rip 00000032350607a7 rsp 00007fffa3f2ac50 error 4
> 
> this is severe of course. What OS, i,e, kernel version..., are you  
> using? Does it also happen when you submit without the -V option? You  
> tried also to give the mpirun the number of to be used slots?
> 
> Other serial and parallel jobs are running fine I assume, when I look  
> at the job number in the output.
> 
> -- Reuti
> 
> 
> > Odd.  Any suggestions?
> >
> > Regards,
> > Alex
> >
> >
> > On Wed, 17 Dec 2008 08:52:07 -0500
> > Chansup Byun <chansup.byun at sun.com> wrote:
> >
> >> I'm not sure if you checked the following FAQ:
> >>
> >> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
> >>
> >> - Chansup
> >>
> >> On 12/16/08 17:59, Gerald Ragghianti wrote:
> >>> OpenMPI can detect that you are running within SGE, and shouldn't
> >>> require many of the options to mpirun that you are providing.  I
> >>> recommend the following submit file:
> >>>
> >>> #$ -V
> >>> #$ -pe OpenMPI 4
> >>> /gpfs/fs0/share/bin/mpirun a.out
> >>>
> >>> Submit the job as follows:
> >>>
> >>> qsub submitfile.txt
> >>>
> >>> Also, make sure that mpirun is the one provided by openmpi 1.2.8.
> >>>
> >>> - Gerald
> >>>
> >>> Alex Chekholko wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm running SGE 6.1u3 on x86_64 and I just installed OpenMPI  
> >>>> 1.2.8 and I'm trying to get it working.
> >>>>
> >>>> I can run mpirun commands on the headnode, so that works.
> >>>>
> >>>> I can qsub a non-parallel job that runs mpirun, so that works as  
> >>>> well, so all my env vars are OK, I think.
> >>>>
> >>>> I'm trying to run a parallel job now, after creating the PE and  
> >>>> adding the PE to my queue.
> >>>>
> >>>> # qconf -sp OpenMPI
> >>>> pe_name           OpenMPI
> >>>> slots             256
> >>>> user_lists        NONE
> >>>> xuser_lists       NONE
> >>>> start_proc_args   /bin/true
> >>>> stop_proc_args    /bin/true
> >>>> allocation_rule   $round_robin
> >>>> control_slaves    TRUE
> >>>> job_is_first_task FALSE
> >>>> urgency_slots     min
> >>>>
> >>>> Trying to run a job like this:
> >>>> $ cat mpi/test_mpi.sh
> >>>> #!/bin/bash
> >>>> /gpfs/fs0/share/bin/mpirun --mca pls_gridengine_verbose 1 --mca  
> >>>> plm_rsh_agent ssh -np 4 a.out
> >>>>
> >>>> Where a.out is this code:
> >>>> http://en.wikipedia.org/wiki/ 
> >>>> Message_Passing_Interface#Example_program
> >>>>
> >>>> via a command like this:
> >>>> qsub -V -pe OpenMPI 4 mpi/test_mpi.sh
> >>>>
> >>>> Get an error output like this:
> >>>> $ cat  test_mpi.sh.e1176114
> >>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
> >>>> using global configuration
> >>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
> >>>> using global configuration
> >>>> Starting server daemon at host "node-r1-u32-c5-p11-o22.local"
> >>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
> >>>> using global configuration
> >>>> Starting server daemon at host "node-r1-u30-c7-p11-o21.local"
> >>>> Starting server daemon at host "node-r4-u15-c24-p16-o16.local"
> >>>> local configuration node-r1-u32-c5-p11-o22.local not defined -  
> >>>> using global configuration
> >>>> Starting server daemon at host "node-r2-u34-c3-p14-o18.local"
> >>>> Server daemon successfully started with task id "1.node-r1-u32- 
> >>>> c5-p11-o22"
> >>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to  
> >>>> host node-r1-u32-c5-p11-o22.local ...
> >>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >>>> reading exit code from shepherd ... Server daemon successfully  
> >>>> started with task id "1.node-r4-u15-c24-p16-o16"
> >>>> Server daemon successfully started with task id "1.node-r1-u30- 
> >>>> c7-p11-o21"
> >>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to  
> >>>> host node-r1-u30-c7-p11-o21.local ...
> >>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >>>> reading exit code from shepherd ... Establishing /usr/bin/ssh -o  
> >>>> StrictHostChecking=no session to host node-r4-u15-c24-p16- 
> >>>> o16.local ...
> >>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >>>> reading exit code from shepherd ... Server daemon successfully  
> >>>> started with task id "1.node-r2-u34-c3-p14-o18"
> >>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to  
> >>>> host node-r2-u34-c3-p14-o18.local ...
> >>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
> >>>> reading exit code from shepherd ... timeout (60 s) expired while  
> >>>> waiting on socket fd 5
> >>>>
> >>>> How do I diagnose this "signal 13 (PIPE)" message?  My qlogin/ 
> >>>> qrsh/qsh are configured per
> >>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
> >>>> except I also added the "-o StrictHostChecking=no"
> >>>>
> >>>> Also, I'm using LDAP for user accounts, does that matter?  One  
> >>>> thread I found said I _must_ use local accounts?
> >>>> http://www.open-mpi.org/community/lists/users/2007/03/2826.php
> >>>>
> >>>> What am I missing?
> >>>>
> >>>> Thanks,
> >>>>
> >
> 


-- 
Alex Chekholko  office: 215-573-4523 cell: 347-401-4860 chekh at pcbi.upenn.edu
