[GE users] SGE+OpenMPI: ERROR: A daemon on node xyz failed to start as expected

Reuti reuti at staff.uni-marburg.de
Tue Jul 1 12:27:23 BST 2008


All looks perfect. The only thing I can imagine is that a firewall is
blocking the communication to the other nodes. Do you normally use ssh
to reach the nodes instead?
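
A quick way to check (just a sketch -- the address and port below are taken
from the nsreplica string in your orted lines and change with every job, so
substitute the current ones): log in to one of the slave nodes while the job
is still listed by qstat and test whether TCP back to the head node works:

   # e.g. on aragorn; 128.243.24.110:35258 is the head node contact from the listing
   nc -z -w 3 128.243.24.110 35258 && echo open || echo blocked
   # or, if nc is not installed, with plain bash:
   (echo > /dev/tcp/128.243.24.110/35258) && echo open || echo blocked

If this reports "blocked", the orted on the slave cannot report back to
mpirun, which would fit the startup failure you see.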

-- Reuti


On 01.07.2008, at 11:09, Azhar Ali Shah wrote:

> --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> >You also used the mpicc from Open MPI?
> Yes! And, as I mentioned earlier, if I run the job on a single node it  
> runs well!
>
> >Do you see more in the process listing below when you append
> >--cols=500 to see the full orted line? Any probs with the nodenames?
> No problems with nodenames!
>
> 13699     1 13699 /usr/SGE6/bin/lx24-x86/sge_execd
> 18259 13699 18259  \_ sge_shepherd-200 -bg
> 18265 18259 18265  |   \_ bash /usr/SGE6/default/spool/justice/job_scripts/200
> 18269 18265 18265  |       \_ mpirun -n 9 /home/aas/mpihello
> 18270 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V justice /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18279 18270 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35272 justice exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice' noshell
> 18271 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V aragorn /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 7 --vpid_start 0 --nodename aragorn --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18285 18271 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 60950 aragorn exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/aragorn/active_jobs/200.1/1.aragorn' noshell
> 18272 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V smeg.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 7 --vpid_start 0 --nodename smeg --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18281 18272 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34978 smeg.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/smeg/active_jobs/200.1/1.smeg' noshell
> 18273 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V taramel.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 7 --vpid_start 0 --nodename taramel --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18278 18273 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48076 taramel exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/taramel/active_jobs/200.1/1.taramel' noshell
> 18274 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V legolas /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 7 --vpid_start 0 --nodename legolas.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18284 18274 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39295 legolas.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/legolas/active_jobs/200.1/1.legolas' noshell
> 18275 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V eomer /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.6 --num_procs 7 --vpid_start 0 --nodename eomer.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18283 18275 18265  |               \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36236 eomer exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/eomer/active_jobs/200.1/1.eomer' noshell
> 18276 13699 18276  \_ sge_shepherd-200 -bg
> 18277 18276 18277      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> 18280 18277 18280          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice noshell
> 18282 18280 18282              \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18286 18282 18282                  \_ /home/aas/mpihello
>
> >Any errors on the slave nodes - like firewall or similar in the
> >tcp-wrapper? Is something in the messages on the nodes in
> >$SGE_ROOT/default/spool/comp1/messages et al.?
> No message on any node!
>
> -- Azhar
>
>
>
>
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz  
> failed to start as expected
> To: users at gridengine.sunsource.net
> Date: Monday, June 30, 2008, 10:24 PM
>
> On 30.06.2008, at 20:03, Azhar Ali Shah wrote:
>
> >
> > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > >Can you try a simple mpihello?
> > It also gives following error:
> >
> > [taramel:05999] [NO-NAME] ORTE_ERROR_LOG: Not found in file
> > runtime/orte_init_stage1.c at line 214
> > [taramel:06000] [NO-NAME] ORTE_ERROR_LOG: Not found in file
> > runtime/orte_init_stage1.c at line 214
> > --------------------------------------------------------------------------
> > Sorry! You were supposed to get help about:
> >     orte_init:startup:internal-failure
> > from the file: help-orte-runtime
> > But I couldn't find any file matching that name. Sorry!
>
> Completely strange :-?!? You also used the mpicc from Open MPI? Do
> you see more in the process listing below when you append --cols=500
> to see the full orted line? Any probs with the nodenames?
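> (For example, something like "ps -e f --cols=500" instead of a plain
> "ps -e f" should keep ps from truncating the command lines.)
>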
>
> Any errors on the slave nodes - like firewall or similar in the
> tcp-wrapper? Is something in the messages on the nodes in
> $SGE_ROOT/default/spool/comp1/messages et al.?
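> (Roughly, on each execution host something like
> "tail -20 $SGE_ROOT/default/spool/<nodename>/messages", where <nodename>
> is that host's spool directory, comp1 in the example above.)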
>
> -- Reuti
>
>
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** before MPI was initialized
> > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [taramel:5999] Abort before MPI_INIT completed successfully; not able
> > to guarantee that all other processes were killed!
> >
> >
> > >Are the processes allocated correctly on the granted nodes?
> > Well, the ps -e f gives:
> >
> > 9197 9982 9197 \_ sge_shepherd-199 -bg
> > 9202 9197 9202 | \_ bash /usr/SGE6/default/spool/smeg/job_scripts/199
> > 9206 9202 9202 | \_ mpirun -n 9 /home/aas/mpihello
> > 9207 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp3 /home/aas/local/openmpi/bi
> > 9216 9207 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34959 comp3 exec '/usr/SGE
> > 9208 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> > 9219 9208 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35247 comp4 exec '/usr/
> > 9209 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp1 /home/aas/local/openmpi
> > 9214 9209 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39905 comp1 exec '/usr/
> > 9210 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp6 /home/aas/local/openmpi
> > 9222 9210 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 41378 comp6 exec '/usr/
> > 9211 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> > 9221 9211 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48105 comp4 exec '/usr/
> > 9212 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp2 /home/aas/local/openmpi/b
> > 9220 9212 9202 | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36224 comp2 exec '/usr/SG
> > 9213 9982 9213 \_ sge_shepherd-199 -bg
> > 9215 9213 9215 \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> > 9217 9215 9217 \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/comp3/active_jobs/199
> > 9218 9217 9218 \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --nu
> > 9223 9218 9218 \_ /home/aas/mpihello
> >
> > Which to me seems correct.
> >
> > >Are you using special MPI-2 techniques like spawning additional
> > >processes to the slave-nodes?
> > No.
> >
> > thanks for your time.
> > Azhar
> >
> >
> >
> >
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > failed to start as expected
> > To: users at gridengine.sunsource.net
> > Date: Monday, June 30, 2008, 6:17 PM
> >
> > On 30.06.2008, at 18:50, Azhar Ali Shah wrote:
> >
> > >
> > > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > >What did your jobscript look like?
> > >
> > > The job script is:
> > > #$ -S /bin/bash
> > > #$ -M aas at xxx
> > > #$ -m be
> > > #$ -N fast-ds250-9p-openmpi
> > > #
> > >
> > >
> > > export PATH=/home/aas/local/openmpi/bin:$PATH
> > > echo "Got $NSLOTS slots."
> > > echo Running on host `hostname`
> > > echo Time is `date`
> > > echo Directory is `pwd`
> > > echo This job runs on the following processors:
> > > # cat $TMPDIR/machines
> > > echo This job has allocated $NSLOTS processors
> > >
> > > mpirun -n $NSLOTS ~/par_procksi_Alone
> >
> > Mmh - all looks fine. Can you try a simple mpihello like the one inside
> > http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz
> > with this setup please (which needs to be qdel'ed intentionally)? Are the
> > processes allocated correctly on the granted nodes?
> >
> > Are you using special MPI-2 techniques like spawning additional
> > processes to the slave-nodes?
> >
> > -- Reuti
> >
> >
> > > exit 0
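> > > Side note: there is no #$ -pe line in the script, so the PE and slot
> > > count have to come from the submit command, and SGE then fills in
> > > $NSLOTS. Something along the lines of (script name just a placeholder):
> > >
> > >     qsub -pe openmpi 9 jobscript.sh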
> > >
> > > Further, I have passwordless ssh/rsh on all nodes.
> > > Please let me know if any other information would be useful to
> > > track down the cause.
> > >
> > > Thanks,
> > > Azhar
> > >
> > >
> > > From: Reuti <reuti at staff.uni-marburg.de>
> > > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > > failed to start as expected
> > > To: users at gridengine.sunsource.net
> > > Date: Monday, June 30, 2008, 5:40 PM
> > >
> > > Hi,
> > >
> > > On 30.06.2008, at 17:52, Azhar Ali Shah wrote:
> > >
> > > > Having installed OpenMPI 1.2.6 on each node of a Linux cluster, I
> > > > get the following error from SGE 6.1u3 when executing a test
> > > > parallel job:
> > > > error: executing task of job 198 failed:
> > > > [taramel:04947] ERROR: A daemon on node xyz failed to start as expected.
> > > > [taramel:04947] ERROR: There may be more information available from
> > > > [taramel:04947] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> > > > [taramel:04947] ERROR: If the problem persists, please restart the
> > > > [taramel:04947] ERROR: Grid Engine PE job
> > > > [taramel:04947] ERROR: The daemon exited unexpectedly with status 1.
> > > >
> > > > The message log for the node in the subject says:
> > > > 06/30/2008 16:24:09|execd|xyz|E|no free queue for job 198 of user aas at abc.uk (localhost = xyz)
> > >
> > > strange - if there is no free slot, the job shouldn't get scheduled
> > > at all. IIRC this message appears only for a wrong setting of
> > > "job_is_first_task", but your setting of "false" is fine.
> > >
> > > What did your jobscript look like?
> > >
> > > -- Reuti
> > >
> > >
> > > > To my surprise all the nodes are free and qstat -f doesn't display
> > > > them in error/unreachable/running state etc.
> > > > Also, when I submit the job requesting only one node, it runs
> > > > without any problem on that node. This is true for all nodes except
> > > > the master (which gives the same problem).
> > > >
> > > > I am using the following PE configuration for OpenMPI:
> > > > pe_name openmpi
> > > > slots 999
> > > > user_lists NONE
> > > > xuser_lists NONE
> > > > start_proc_args /bin/true
> > > > stop_proc_args /bin/true
> > > > allocation_rule $round_robin
> > > > control_slaves TRUE
> > > > job_is_first_task FALSE
> > > > urgency_slots min
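> > > >
> > > > As a sanity check (queue name here is only a placeholder), the PE
> > > > also has to appear in the pe_list of the cluster queue the job runs
> > > > in, e.g.:
> > > >
> > > >     qconf -sq all.q | grep pe_list    # should list openmpi
> > > >     qconf -mq all.q                   # add it to pe_list if missing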
> > > >
> > > > Any pointers on how to correct the startup of daemons please?
> > > >
> > > > thanks
> > > > Azhar


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



