[GE users] SGE 6.1u3 + OpenMPI 1.2.8 - what am I missing?

reuti reuti at staff.uni-marburg.de
Wed Dec 17 23:04:05 GMT 2008


Hi,

the best way to begin a new integration of a parallel startup is with
a small mpihello program, which will show a process distribution like
the one below on the master (2+2 slots):

$ ps -e f
...
12250 ?        Sl     1:17 /usr/sge/bin/lx24-x86/sge_execd
 1147 ?        S      0:00  \_ sge_shepherd-848 -bg
 1149 ?        Ss     0:00  |   \_ /bin/sh /var/spool/sge/pc15381/job_scripts/848
 1150 ?        R      0:00  |       \_ mpirun -np 4 /home/reuti/mpihello     <=== main job starts two qrsh, one to each node
 1151 ?        Sl     0:00  |           \_ qrsh -inherit -noshell -nostdin -V pc15381 /home/reuti/local/openmpi-1.2.8/bin/orted --no-
 1152 ?        Sl     0:00  |           \_ qrsh -inherit -noshell -nostdin -V pc15370 /home/reuti/local/openmpi-1.2.8/bin/orted --no-
 1153 ?        Sl     0:00  \_ sge_shepherd-848 -bg
 1154 ?        Ss     0:00      \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/848.1/1.pc15381 noshell
 1170 ?        S      0:00          \_ /home/reuti/local/openmpi-1.2.8/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_pro
 1171 ?        R      0:02              \_ /home/reuti/mpihello
 1172 ?        R      0:02              \_ /home/reuti/mpihello

and on the slave:

28591 ?        Sl     1:45 /usr/sge/bin/lx24-x86/sge_execd
31462 ?        Sl     0:00  \_ sge_shepherd-848 -bg
31463 ?        Ss     0:00      \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15370/active_jobs/848.1/1.pc15370 noshell
31470 ?        S      0:00          \_ /home/reuti/local/openmpi-1.2.8/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_pro
31472 ?        R      0:26              \_ /home/reuti/mpihello
31473 ?        R      0:26              \_ /home/reuti/mpihello
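
The job script behind such a run can stay minimal. A sketch of a
matching submit file (the file name mpihello.sh is hypothetical; the
PE name "OpenMPI", the slot count and the binary path follow this
thread and must be adjusted to your site):

```shell
#!/bin/sh
# Minimal test job for a tight Open MPI/SGE integration.
# (Sketch only: PE name "OpenMPI", slot count and mpihello path are
# assumptions taken from this thread - adjust them to your site.)
#$ -pe OpenMPI 4
#$ -cwd
# With control_slaves TRUE in the PE, mpirun starts one
# "qrsh -inherit" per remote node, giving the process trees above.
mpirun -np 4 ./mpihello
```

Submitted with e.g. "qsub -V mpihello.sh"; the -np value should match
the number of granted slots.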

Instead of -V, it might also be possible to put the variable
definitions in .bashrc in your case. I also wonder about the
"--num_procs 5" in your output - AFAICS it should be 3 for 4 job slots.
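
For the .bashrc route, a fragment along these lines should work (a
sketch; the openmpi-1.2.8 prefix is taken from the paths in this
thread and must be adjusted to your installation):

```shell
# ~/.bashrc fragment on every execution host (sketch; adjust prefix).
# If the job is not submitted with -V, PATH and LD_LIBRARY_PATH for
# the remote orted must come from the shell startup files instead.
MPI_PREFIX=$HOME/local/openmpi-1.2.8
export PATH=$MPI_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$MPI_PREFIX/lib:$LD_LIBRARY_PATH
```

Whether a non-interactive remote shell reads .bashrc depends on the
shell and distribution, so this needs testing on your nodes.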

-- Reuti



On 17.12.2008, at 23:07, Alex Chekholko wrote:

> Hi Reuti, all,
>
> This is on EL5.2, CentOS
> Linux node-r1-u19-c16-p10-o14.local 2.6.18-92.1.13.el5 #1 SMP Wed  
> Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
>
> I tried adding "-np 4" to my mpi.txt:
>
>
> $ qstat -t
> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u14-c21-p11-o11. MASTER
>                                               all.q at node-r1-u14-c21-p11-o11. SLAVE
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u17-c18-p10-o13. SLAVE
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u21-c14-p10-o23. SLAVE
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u31-c6-p10-o22.l SLAVE
> [chekh at beta.genomics.upenn.edu] ~/mpi [0]
> $ qstat -t
> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u14-c21-p11-o11. MASTER                        r     00:00:00 0.00170 0.00000
>                                               all.q at node-r1-u14-c21-p11-o11. SLAVE            1.node-r1-u14-c21-p11-o11 r     00:00:00 0.00000 0.00000
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u17-c18-p10-o13. SLAVE            1.node-r1-u17-c18-p10-o13 r     00:00:00 0.00000 0.00000
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u21-c14-p10-o23. SLAVE            1.node-r1-u21-c14-p10-o23 r     00:00:00 0.00000 0.00000
> 1176139 0.60500 mpi1.txt   chekh        r     12/17/2008 16:51:06 all.q at node-r1-u31-c6-p10-o22.l SLAVE            1.node-r1-u31-c6-p10-o22 r     00:00:00 0.00000 0.00000
>
>
> The results were better:
> root      3504  0.1  0.0  88420  4492 ?        S    Dec01  29:33 /gpfs/fs0/share/ge-6.1u3/bin/lx24-amd64/sge_execd
> root     17630  0.0  0.0  32832  3332 ?        S    16:51   0:00  \_ sge_shepherd-1176139 -bg
> chekh    17632  0.0  0.0  63844  1068 ?        Ss   16:51   0:00  |   \_ bash /gpfs/fs0/share/ge-6.1u3/PGFI3/spool/node-r1-u14-c21-p11-o11/job_scripts/1176139
> chekh    17633  0.0  0.0  96864  4684 ?        S    16:51   0:00  |       \_ /gpfs/fs0/share/bin/mpirun -np 4 a.out
> chekh    17634  0.0  0.0  32556  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u14-c21-p11-o11.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename node-r1-u14-c21-p11-o11.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
> chekh    17635  0.0  0.0  32556  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u31-c6-p10-o22.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0 --nodename node-r1-u31-c6-p10-o22.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
> chekh    17636  0.0  0.0  32556  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u21-c14-p10-o23.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0 --nodename node-r1-u21-c14-p10-o23.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
> chekh    17637  0.0  0.0  32560  3896 ?        S    16:51   0:00  |           \_ qrsh -inherit -noshell -nostdin -V node-r1-u17-c18-p10-o13.local /gpfs/fs0/share/bin/orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node-r1-u17-c18-p10-o13.local --universe chekh at node-r1-u14-c21-p11-o11.local:default-universe-17633 --nsreplica "0.0.0;tcp://10.10.73.49:33330" --gprreplica "0.0.0;tcp://10.10.73.49:33330"
> root     17638  0.0  0.0  32832  3340 ?        S    16:51   0:00  \_ sge_shepherd-1176139 -bg
> root     17639  0.0  0.0  33468  2924 ?        Ss   16:51   0:00      \_ sge_shepherd-1176139 -bg
>
> but the job still just hung, without the shepherd crash, and the
> output was like this:
>
> $ ls -alh ~/*139
> -rw-r--r-- 1 chekh pgfi 2.4K Dec 17 16:52 /gpfs/fs0/u/chekh/mpi1.txt.e1176139
> -rw-r--r-- 1 chekh pgfi    0 Dec 17 16:51 /gpfs/fs0/u/chekh/mpi1.txt.o1176139
> -rw-r--r-- 1 chekh pgfi    0 Dec 17 16:51 /gpfs/fs0/u/chekh/mpi1.txt.pe1176139
> -rw-r--r-- 1 chekh pgfi    0 Dec 17 16:51 /gpfs/fs0/u/chekh/mpi1.txt.po1176139
> [chekh at beta.genomics.upenn.edu] ~/mpi [0]
> $ cat ~/*139
> error: error reading returncode of remote command
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u14-c21-p11-o11.local failed to start as expected.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.
> error: error reading returncode of remote command
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u31-c6-p10-o22.local failed to start as expected.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.
> error: error reading returncode of remote command
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u21-c14-p10-o23.local failed to start as expected.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.
> error: error reading returncode of remote command
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: A daemon on node node-r1-u17-c18-p10-o13.local failed to start as expected.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: There may be more information available from
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: If the problem persists, please restart the
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: Grid Engine PE job
> [node-r1-u14-c21-p11-o11.local:17633] ERROR: The daemon exited unexpectedly with status 255.
>
>
> Then I tried removing the -V
> $ cat mpi1.txt
> #$ -pe OpenMPI 4
> /gpfs/fs0/share/bin/mpirun -np 4 a.out
>
> $ qstat -t
> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r1-u19-c16-p10-o14. SLAVE            1.node-r1-u19-c16-p10-o14 r     00:00:00 0.00000 0.00000
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r2-u34-c3-p14-o18.l SLAVE            1.node-r2-u34-c3-p14-o18 r     00:00:00 0.00000 0.00000
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u29-c10-p16-o6.l MASTER                        r     00:00:00 0.00280 0.00000
>                                               all.q at node-r4-u29-c10-p16-o6.l SLAVE            1.node-r4-u29-c10-p16-o6 r     00:00:00 0.00000 0.00000
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u3-c36-p16-o23.l SLAVE            1.node-r4-u3-c36-p16-o23 r     00:00:00 0.00000 0.00000
> [chekh at beta.genomics.upenn.edu] ~/mpi [0]
> $ qstat -t
> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r1-u19-c16-p10-o14. SLAVE
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r2-u34-c3-p14-o18.l SLAVE
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u29-c10-p16-o6.l MASTER                        r     00:00:00 0.00280 0.00000
>                                               all.q at node-r4-u29-c10-p16-o6.l SLAVE
> 1176143 0.60500 mpi1.txt   chekh        r     12/17/2008 17:02:05 all.q at node-r4-u3-c36-p16-o23.l SLAVE
> [chekh at beta.genomics.upenn.edu] ~/mpi [0]
> $ ls -alh ../*143
> -rw-r--r-- 1 chekh pgfi 2.4K Dec 17 17:03 ../mpi1.txt.e1176143
> -rw-r--r-- 1 chekh pgfi    0 Dec 17 17:02 ../mpi1.txt.o1176143
> -rw-r--r-- 1 chekh pgfi    0 Dec 17 17:02 ../mpi1.txt.pe1176143
> -rw-r--r-- 1 chekh pgfi    0 Dec 17 17:02 ../mpi1.txt.po1176143
> [chekh at beta.genomics.upenn.edu] ~/mpi [0]
> $ cat ../mpi1.txt.e1176143
> error: error reading returncode of remote command
> error: error reading returncode of remote command
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r4-u29-c10-p16-o6.local failed to start as expected.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r2-u34-c3-p14-o18.local failed to start as expected.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.
> error: error reading returncode of remote command
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r4-u3-c36-p16-o23.local failed to start as expected.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.
> error: error reading returncode of remote command
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: A daemon on node node-r1-u19-c16-p10-o14.local failed to start as expected.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: There may be more information available from
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: If the problem persists, please restart the
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: Grid Engine PE job
> [node-r4-u29-c10-p16-o6.local:26812] ERROR: The daemon exited unexpectedly with status 255.
>
> It is not clear to me what I _should_ be seeing.  Where can I look  
> up more details on what qrsh does exactly?
>
> Regards,
> Alex
>
> On Wed, 17 Dec 2008 22:28:17 +0100
> reuti <reuti at staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> On 17.12.2008, at 21:04, Alex Chekholko wrote:
>>
>>> Hi all,
>>>
>>> Thanks for your responses.  I did read that FAQ.
>>>
>>> I tried Gerald's suggestion, and SGE submits the job correctly and
>>> I can see the four slots via qstat.
>>>
>>> $ qstat -t
>>> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
>>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r2-u17-c18-p12-o12. SLAVE
>>> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r2-u18-c17-p13-o12. SLAVE            1.node-r2-u18-c17-p13-o12 r     00:00:00 0.00000 0.00000
>>> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r2-u32-c5-p13-o22.l MASTER                        r
>>>                                               all.q at node-r2-u32-c5-p13-o22.l SLAVE
>>> 1176128 0.60500 mpi1.txt   chekh        r     12/17/2008 15:02:50 all.q at node-r4-u26-c13-p15-o10. SLAVE
>>>
>>>
>>> However, it looks like sge_shepherd crashes on each of the nodes
>>> that gets the job:
>>> sge_shepherd[17462]: segfault at 0000000000000001 rip 00000032350607a7 rsp 00007fffa3f2ac50 error 4
>>
>> this is severe, of course. What OS, i.e. kernel version, are you
>> using? Does it also happen when you submit without the -V option?
>> Have you also tried giving mpirun the number of slots to use?
>>
>> Other serial and parallel jobs are running fine, I assume, judging
>> by the job number in your output.
>>
>> -- Reuti
>>
>>
>>> Odd.  Any suggestions?
>>>
>>> Regards,
>>> Alex
>>>
>>>
>>> On Wed, 17 Dec 2008 08:52:07 -0500
>>> Chansup Byun <chansup.byun at sun.com> wrote:
>>>
>>>> I'm not sure if you checked the following FAQ:
>>>>
>>>> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
>>>>
>>>> - Chansup
>>>>
>>>> On 12/16/08 17:59, Gerald Ragghianti wrote:
>>>>> OpenMPI can detect that you are running within SGE, and shouldn't
>>>>> require many of the options to mpirun that you are providing.  I
>>>>> recommend the following submit file:
>>>>>
>>>>> #$ -V
>>>>> #$ -pe OpenMPI 4
>>>>> /gpfs/fs0/share/bin/mpirun a.out
>>>>>
>>>>> Submit the job as follows:
>>>>>
>>>>> qsub submitfile.txt
>>>>>
>>>>> Also, make sure that mpirun is the one provided by openmpi 1.2.8.
>>>>>
>>>>> - Gerald
>>>>>
>>>>> Alex Chekholko wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm running SGE 6.1u3 on x86_64 and I just installed OpenMPI
>>>>>> 1.2.8 and I'm trying to get it working.
>>>>>>
>>>>>> I can run mpirun commands on the headnode, so that works.
>>>>>>
>>>>>> I can qsub a non-parallel job that runs mpirun, so that works as
>>>>>> well, so all my env vars are OK, I think.
>>>>>>
>>>>>> I'm trying to run a parallel job now, after creating the PE and
>>>>>> adding the PE to my queue.
>>>>>>
>>>>>> # qconf -sp OpenMPI
>>>>>> pe_name           OpenMPI
>>>>>> slots             256
>>>>>> user_lists        NONE
>>>>>> xuser_lists       NONE
>>>>>> start_proc_args   /bin/true
>>>>>> stop_proc_args    /bin/true
>>>>>> allocation_rule   $round_robin
>>>>>> control_slaves    TRUE
>>>>>> job_is_first_task FALSE
>>>>>> urgency_slots     min
>>>>>>
>>>>>> Trying to run a job like this:
>>>>>> $ cat mpi/test_mpi.sh
>>>>>> #!/bin/bash
>>>>>> /gpfs/fs0/share/bin/mpirun --mca pls_gridengine_verbose 1 --mca plm_rsh_agent ssh -np 4 a.out
>>>>>>
>>>>>> Where a.out is this code:
>>>>>> http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
>>>>>>
>>>>>> via a command like this:
>>>>>> qsub -V -pe OpenMPI 4 mpi/test_mpi.sh
>>>>>>
>>>>>> Get an error output like this:
>>>>>> $ cat  test_mpi.sh.e1176114
>>>>>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>>>>>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>>>>>> Starting server daemon at host "node-r1-u32-c5-p11-o22.local"
>>>>>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>>>>>> Starting server daemon at host "node-r1-u30-c7-p11-o21.local"
>>>>>> Starting server daemon at host "node-r4-u15-c24-p16-o16.local"
>>>>>> local configuration node-r1-u32-c5-p11-o22.local not defined - using global configuration
>>>>>> Starting server daemon at host "node-r2-u34-c3-p14-o18.local"
>>>>>> Server daemon successfully started with task id "1.node-r1-u32-c5-p11-o22"
>>>>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u32-c5-p11-o22.local ...
>>>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>>>> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r4-u15-c24-p16-o16"
>>>>>> Server daemon successfully started with task id "1.node-r1-u30-c7-p11-o21"
>>>>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r1-u30-c7-p11-o21.local ...
>>>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>>>> reading exit code from shepherd ... Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r4-u15-c24-p16-o16.local ...
>>>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>>>> reading exit code from shepherd ... Server daemon successfully started with task id "1.node-r2-u34-c3-p14-o18"
>>>>>> Establishing /usr/bin/ssh -o StrictHostChecking=no session to host node-r2-u34-c3-p14-o18.local ...
>>>>>> /usr/bin/ssh -o StrictHostChecking=no exited on signal 13 (PIPE)
>>>>>> reading exit code from shepherd ... timeout (60 s) expired while waiting on socket fd 5
>>>>>>
>>>>>> How do I diagnose this "signal 13 (PIPE)" message?  My
>>>>>> qlogin/qrsh/qsh are configured per
>>>>>> http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>>>>>> except I also added the "-o StrictHostChecking=no"
>>>>>>
>>>>>> Also, I'm using LDAP for user accounts, does that matter?  One
>>>>>> thread I found said I _must_ use local accounts?
>>>>>> http://www.open-mpi.org/community/lists/users/2007/03/2826.php
>>>>>>
>>>>>> What am I missing?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=93035
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
>
>
> -- 
> Alex Chekholko  office: 215-573-4523 cell: 347-401-4860  
> chekh at pcbi.upenn.edu
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=93054

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list