[GE users] Open MPI and SGE

Bernard Li bli@bcgsc.ca
Thu Apr 27 05:53:08 BST 2006


Hi Rayson:

This is the output I got when I ran with 'mpirun -d':

[node01:29742] [0,0,0] setting up session dir with
[node01:29742]  universe default-universe
[node01:29742]  user bli
[node01:29742]  host node01
[node01:29742]  jobid 0
[node01:29742]  procid 0
[node01:29742] procdir: /tmp/47.1.all.q/openmpi-sessions-bli@node01_0/default-universe/0/0
[node01:29742] jobdir: /tmp/47.1.all.q/openmpi-sessions-bli@node01_0/default-universe/0
[node01:29742] unidir: /tmp/47.1.all.q/openmpi-sessions-bli@node01_0/default-universe
[node01:29742] top: openmpi-sessions-bli@node01_0
[node01:29742] tmp: /tmp/47.1.all.q
[node01:29742] [0,0,0] contact_file /tmp/47.1.all.q/openmpi-sessions-bli@node01_0/default-universe/universe-setup.txt
[node01:29742] [0,0,0] wrote setup file
[node01:29742] spawn: in job_state_callback(jobid = 1, state = 0x1)
[node01:29742] pls:rsh: local csh: 0, local bash: 1
[node01:29742] pls:rsh: assuming same remote shell as local shell
[node01:29742] pls:rsh: remote csh: 0, remote bash: 1
[node01:29742] pls:rsh: final template argv:
[node01:29742] pls:rsh:     /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe bli@node01:default-universe --nsreplica "0.0.0;tcp://192.168.22.2:33016" --gprreplica "0.0.0;tcp://192.168.22.2:33016" --mpi-call-yield 0
[node01:29742] pls:rsh: launching on node node01
[node01:29742] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 4)
[node01:29742] pls:rsh: node01 is a LOCAL node
[node01:29742] pls:rsh: reset PATH: /opt/openmpi-1.0.2/bin:/tmp/47.1.all.q:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/kernel_picker/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX64:/opt/pvm3/bin/LINUX64:/opt/openmpi/1.0.2-1/bin/:/opt/sge/bin/lx26-amd64:/opt/env-switcher/bin:/opt/lam-7.1.2/bin:/opt/c3-4/:/home/bli/bin
[node01:29742] pls:rsh: reset LD_LIBRARY_PATH: /opt/openmpi-1.0.2/lib
[node01:29742] pls:rsh: changing to directory /home/bli
[node01:29742] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node01 --universe bli@node01:default-universe --nsreplica "0.0.0;tcp://192.168.22.2:33016" --gprreplica "0.0.0;tcp://192.168.22.2:33016" --mpi-call-yield 1
[node01:29742] pls:rsh: execv failed with errno=2
[node01:29742] sess_dir_finalize: proc session dir not empty - leaving
[node01:29742] spawn: in job_state_callback(jobid = 1, state = 0xa)
[node01:29742] sess_dir_finalize: proc session dir not empty - leaving
[node01:29742] sess_dir_finalize: proc session dir not empty - leaving
[node01:29742] sess_dir_finalize: proc session dir not empty - leaving
[node01:29742] spawn: in job_state_callback(jobid = 1, state = 0x9)
[node01:29742] ERROR: A daemon on node node01 failed to start as expected.
[node01:29742] ERROR: There may be more information available from
[node01:29742] ERROR: the remote shell (see above).
[node01:29742] ERROR: The daemon exited unexpectedly with status 255.
[node01:29742] sess_dir_finalize: found proc session dir empty - deleting
[node01:29742] sess_dir_finalize: found job session dir empty - deleting
[node01:29742] sess_dir_finalize: found univ session dir empty - deleting
[node01:29742] sess_dir_finalize: found top session dir empty - deleting
rm: cannot remove `/tmp/47.1.all.q/rsh': No such file or directory
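
Since errno=2 is ENOENT, a couple of quick checks might narrow down what
is actually missing (just a sketch; paths are taken from the log above):

  # Is orted really where the reset PATH says it should be, on node01?
  ssh node01 ls -l /opt/openmpi-1.0.2/bin/orted

  # While the job is still running: did startmpi.sh create the rsh
  # wrapper in the job's TMPDIR? (The failed rm at the end suggests it
  # may never have been created.)
  ssh node01 ls -l /tmp/47.1.all.q/rsh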

Thanks,

Bernard

> -----Original Message-----
> From: Rayson Ho [mailto:rayrayson@gmail.com]
> Sent: Wednesday, April 26, 2006 21:20
> To: users@gridengine.sunsource.net
> Subject: Re: [GE users] Open MPI and SGE
> 
> Since it was execv(2) that failed, it seems to me that it was trying
> to execute something which was not there...
> 
> Maybe you can add some debug echoes in the scripts to see what gets
> passed to rsh/qrsh.
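> 
> For example, something like this near the top of startmpi.sh (just a
> sketch):
> 
>   echo "startmpi.sh: PATH=$PATH" >&2
>   echo "startmpi.sh: TMPDIR=$TMPDIR" >&2
>   set -x   # trace every command, including the rsh wrapper setup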
> 
> Rayson
> 
> 
> 
> On 4/27/06, Bernard Li <bli@bcgsc.ca> wrote:
> > Thanks Rayson.
> >
> > But I wonder what it could be...  the machine file is there, the
> > scripts are there, the pe_hostfile is there...  sometimes I wish
> > these error messages were more descriptive :-)
> >
> > Cheers,
> >
> > Bernard
> >
> > > -----Original Message-----
> > > From: Rayson Ho [mailto:rayrayson@gmail.com]
> > > Sent: Wednesday, April 26, 2006 20:24
> > > To: users@gridengine.sunsource.net
> > > Subject: Re: [GE users] Open MPI and SGE
> > >
> > > Hmm... errno = 2:
> > >
> > > /usr/include/asm-generic/errno-base.h
> > > #define ENOENT           2      /* No such file or directory */
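> > >
> > > (As a side note, a quick way to turn an errno number into its
> > > message from the shell:
> > >
> > >   perl -e '$! = 2; print "$!\n"'
> > >
> > > prints "No such file or directory".)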
> > >
> > > Rayson
> > >
> > >
> > > On 4/26/06, Bernard Li <bli@bcgsc.ca> wrote:
> > > > Has anybody been successful in getting Open MPI integrated with
> > > > SGE? (I think Reuti has ;-) )
> > > >
> > > > Anyway, I think I'm pretty close, but I'm stuck on this issue:
> > > >
> > > > [node1:28248] pls:rsh: execv failed with errno=2
> > > >
> > > > Does anybody know what it means?
> > > >
> > > > I basically set it up like Reuti recommended in the following
> > > > email:
> > > >
> > > > http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=15176
> > > >
> > > > My template looks like this:
> > > >
> > > > pe_name           openmpi
> > > > slots             999
> > > > user_lists        NONE
> > > > xuser_lists       NONE
> > > > start_proc_args   /opt/sge/mpi/startmpi.sh $pe_hostfile
> > > > stop_proc_args    /opt/sge/mpi/stopmpi.sh
> > > > allocation_rule   $round_robin
> > > > control_slaves    FALSE
> > > > job_is_first_task TRUE
> > > > urgency_slots     min
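> > > >
> > > > With a PE like this, a job would be submitted along these lines
> > > > (just a sketch; the script name and slot count are made up):
> > > >
> > > >   qsub -pe openmpi 4 myjob.sh
> > > >
> > > > where myjob.sh runs mpirun -np $NSLOTS against the machine file
> > > > that startmpi.sh generates in $TMPDIR.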
> > > >
> > > > It might be helpful to post a working integration on the SGE
> > > > website.
> > > >
> > > > Thanks!
> > > >
> > > > Bernard