[GE users] SGE+OpenMPI: ERROR: A daemon on node xyz failed to start as expected

Reuti reuti at staff.uni-marburg.de
Tue Jul 1 16:54:28 BST 2008


Hi,

Am 01.07.2008 um 15:32 schrieb Azhar Ali Shah:

>
> --- On Tue, 7/1/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> >All looks perfect. The thing I could imagine is that a firewall is
> >blocking the communication to other nodes. Are you using ssh to the
> >nodes normally instead?
>
> Double checked for passwordless ssh which seems OK for all nodes.

Open MPI will use qrsh, and there was no SSH involved in your output
- I only see rsh and rshd calls. If you must use SSH, you have to
change the rsh_command/rsh_daemon (and rlogin_*) settings in the
cluster configuration.
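
For reference, that switch to SSH is made in the cluster configuration
(qconf -mconf); a minimal sketch - the sshd path is an assumption and
may differ on your distribution:

    rsh_command      /usr/bin/ssh
    rsh_daemon       /usr/sbin/sshd -i
    rlogin_command   /usr/bin/ssh
    rlogin_daemon    /usr/sbin/sshd -i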

> Each node runs the job perfectly except the headnode!
> Keeping the headnode queue suspended, the 'mpihello' program runs
> well on 5 nodes (7 slots), giving the following output:
>
> Hello World from Node 2.
> Hello World from Node 1.
> Hello World from Node 0.
> Hello World from Node 5.
> Hello World from Node 4.
> Hello World from Node 6.
> Hello World from Node 3.

Maybe inside the cluster plain rsh is working, but the firewall on
the headnode is active on all interfaces (supposing you have more
than one) and doesn't leave the internal interface to the nodes open?
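
A quick way to check this on the headnode, assuming an iptables-based
firewall (eth1 is only a placeholder for the internal interface):

    # list the active rules and the interfaces they apply to
    iptables -L -n -v
    # sketch: accept everything arriving on the internal NIC
    iptables -I INPUT -i eth1 -j ACCEPT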

BTW: is the headnode also a file server and/or a login server? I  
wouldn't include it as an execution host at all.
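
If it stays installed for now, you can at least keep jobs off it; a
sketch, where all.q and taramel are only assumed names for your queue
and headnode:

    qmod -d all.q@taramel                             # disable just this queue instance
    qconf -dattr hostgroup hostlist taramel @allhosts # or drop the host from the host group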

-- Reuti


> But once the headnode is resumed, it gives this error:
>
> [taramel:04878] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> [taramel:4878] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>
> Have tried to reinstall OpenMPI on the headnode, but it doesn't work!
> Have also checked <ssh headnode date>, which works fine.
> Also tried to restart the headnode, but to no avail!
>
> Any idea what else could be done?
>
> -- Azhar
>
>
> Double-checked with <which mpicc> and <which mpirun>, which look OK
> for Open MPI.
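
A couple of quick extra checks that the Open MPI found first in the
PATH is really the one under /home/aas/local/openmpi (ompi_info and
the wrappers' --showme flag are standard Open MPI tools):

    which mpicc mpirun
    mpicc --showme     # shows the compiler and flags the wrapper would use
    ompi_info | head   # version and prefix of the Open MPI in use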
>
> Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz  
> failed to start as expected
> To: users at gridengine.sunsource.net
> Date: Tuesday, July 1, 2008, 12:27 PM
>
> All looks perfect. The thing I could imagine is that a firewall is
> blocking the communication to other nodes. Are you using ssh to the
> nodes normally instead?
>
> -- Reuti
>
>
> Am 01.07.2008 um 11:09 schrieb Azhar Ali Shah:
>
> > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > >You also used the mpicc from Open MPI?
> > Yes! And as I mentioned earlier, if I run the job on a single node
> > it runs well!
> >
> > >Do you see more in the process listing below when you append
> > >--cols=500 to see the full orted line? Any probs with the nodenames?
> > No problems with nodenames!
> >
> > 13699     1 13699 /usr/SGE6/bin/lx24-x86/sge_execd
> > 18259 13699 18259  \_ sge_shepherd-200 -bg
> > 18265 18259 18265  |   \_ bash /usr/SGE6/default/spool/justice/job_scripts/200
> > 18269 18265 18265  |       \_ mpirun -n 9 /home/aas/mpihello
> > 18270 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V justice /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> > 18279 18270 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35272 justice exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice' noshell
> > 18271 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V aragorn /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 7 --vpid_start 0 --nodename aragorn --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> > 18285 18271 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 60950 aragorn exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/aragorn/active_jobs/200.1/1.aragorn' noshell
> > 18272 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V smeg.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 7 --vpid_start 0 --nodename smeg --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> > 18281 18272 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34978 smeg.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/smeg/active_jobs/200.1/1.smeg' noshell
> > 18273 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V taramel.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 7 --vpid_start 0 --nodename taramel --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> > 18278 18273 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48076 taramel exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/taramel/active_jobs/200.1/1.taramel' noshell
> > 18274 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V legolas /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 7 --vpid_start 0 --nodename legolas.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> > 18284 18274 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39295 legolas.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/legolas/active_jobs/200.1/1.legolas' noshell
> > 18275 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V eomer /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.6 --num_procs 7 --vpid_start 0 --nodename eomer.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> > 18283 18275 18265  |               \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36236 eomer exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/eomer/active_jobs/200.1/1.eomer' noshell
> > 18276 13699 18276  \_ sge_shepherd-200 -bg
> > 18277 18276 18277      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> > 18280 18277 18280          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice noshell
> > 18282 18280 18282              \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> > 18286 18282 18282                  \_ /home/aas/mpihello
> >
> > >Any errors on the slave nodes - like firewall or similar in the
> > >tcp-wrapper? Is something in the messages on the nodes in
> > >$SGE_ROOT/default/spool/comp1/messages et al.?
> > No message on any node!
> >
> > -- Azhar
> >
> >
> >
> >
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > failed to start as expected
> > To: users at gridengine.sunsource.net
> > Date: Monday, June 30, 2008, 10:24 PM
> >
> > Am 30.06.2008 um 20:03 schrieb Azhar Ali Shah:
> >
> > >
> > > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > >Can you try a simple mpihello?
> > > It also gives following error:
> > >
> > > [taramel:05999] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
> > > [taramel:06000] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
> > >
> > > --------------------------------------------------------------------------
> > > Sorry! You were supposed to get help about:
> > >     orte_init:startup:internal-failure
> > > from the file: help-orte-runtime
> > > But I couldn't find any file matching that name. Sorry!
> >
> > Completely strange :-?!? You also used the mpicc from Open MPI? Do
> > you see more in the process listing below when you append --cols=500
> > to see the full orted line? Any probs with the nodenames?
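
I.e. something along these lines on the node where mpirun runs (the
grep pattern just picks out the interesting processes):

    ps -e f --cols=500 | egrep 'sge_execd|sge_shepherd|qrsh|rshd|orted|mpihello'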
> >
> > Any errors on the slave nodes - like firewall or similar in the
> > tcp-wrapper? Is something in the messages on the nodes in
> > $SGE_ROOT/default/spool/comp1/messages et al.?
> >
> > -- Reuti
> >
> >
> > >
> > > --------------------------------------------------------------------------
> > > *** An error occurred in MPI_Init
> > > *** before MPI was initialized
> > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > > [taramel:5999] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> > >
> > >
> > > >Are the processes allocated correctly on the granted nodes?
> > > Well, the ps -e f gives:
> > >
> > > 9197 9982 9197 \_ sge_shepherd-199 -bg
> > > 9202 9197 9202 |   \_ bash /usr/SGE6/default/spool/smeg/job_scripts/199
> > > 9206 9202 9202 |       \_ mpirun -n 9 /home/aas/mpihello
> > > 9207 9206 9202 |           \_ qrsh -inherit -noshell -nostdin -V comp3 /home/aas/local/openmpi/bi
> > > 9216 9207 9202 |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34959 comp3 exec '/usr/SGE
> > > 9208 9206 9202 |           \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> > > 9219 9208 9202 |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35247 comp4 exec '/usr/
> > > 9209 9206 9202 |           \_ qrsh -inherit -noshell -nostdin -V comp1 /home/aas/local/openmpi
> > > 9214 9209 9202 |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39905 comp1 exec '/usr/
> > > 9210 9206 9202 |           \_ qrsh -inherit -noshell -nostdin -V comp6 /home/aas/local/openmpi
> > > 9222 9210 9202 |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 41378 comp6 exec '/usr/
> > > 9211 9206 9202 |           \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> > > 9221 9211 9202 |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48105 comp4 exec '/usr/
> > > 9212 9206 9202 |           \_ qrsh -inherit -noshell -nostdin -V comp2 /home/aas/local/openmpi/b
> > > 9220 9212 9202 |               \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36224 comp2 exec '/usr/SG
> > > 9213 9982 9213 \_ sge_shepherd-199 -bg
> > > 9215 9213 9215     \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> > > 9217 9215 9217         \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/comp3/active_jobs/199
> > > 9218 9217 9218             \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --nu
> > > 9223 9218 9218                 \_ /home/aas/mpihello
> > >
> > > Which to me seems correct.
> > >
> > > >Are you using special MPI-2 techniques like spawning additional
> > > >processes to the slave-nodes?
> > > No.
> > >
> > > thanks for your time.
> > > Azhar
> > >
> > >
> > >
> > >
> > > From: Reuti <reuti at staff.uni-marburg.de>
> > > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > > failed to start as expected
> > > To: users at gridengine.sunsource.net
> > > Date: Monday, June 30, 2008, 6:17 PM
> > >
> > > Am 30.06.2008 um 18:50 schrieb Azhar Ali Shah:
> > >
> > > >
> > > > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > > >What did your jobscript look like?
> > > >
> > > > The job script is:
> > > > #$ -S /bin/bash
> > > > #$ -M aas at xxx
> > > > #$ -m be
> > > > #$ -N fast-ds250-9p-openmpi
> > > > #
> > > >
> > > >
> > > > export PATH=/home/aas/local/openmpi/bin:$PATH
> > > > echo "Got $NSLOTS slots."
> > > > echo Running on host `hostname`
> > > > echo Time is `date`
> > > > echo Directory is `pwd`
> > > > echo This job runs on the following processors:
> > > > # cat $TMPDIR/machines
> > > > echo This job has allocated $NSLOTS processors
> > > >
> > > > mpirun -n $NSLOTS ~/par_procksi_Alone
> > >
> > > Mmh - all looks fine. Can you try a simple mpihello like the one inside
> > > http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz
> > > with this setup please (which needs to be qdel'ed by intention)? Are the
> > > processes allocated correctly on the granted nodes?
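
Roughly like this - the PE name matches your setup, the script name
mpihello.sh is just an assumption:

    qsub -pe openmpi 9 mpihello.sh   # submit the test to the PE
    qstat -t                         # see where master and slave tasks land
    qdel <jobid>                     # the test runs until killed, hence the qdel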
> > >
> > > Are you using special MPI-2 techniques like spawning additional
> > > processes to the slave-nodes?
> > >
> > > -- Reuti
> > >
> > >
> > > > exit 0
> > > >
> > > > Further, I have passwordless ssh/rsh on all nodes.
> > > > Please let me know if any other information would be useful to
> > > > identify the cause.
> > > >
> > > > Thanks,
> > > > Azhar
> > > >
> > > >
> > > > From: Reuti <reuti at staff.uni-marburg.de>
> > > > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > > > failed to start as expected
> > > > To: users at gridengine.sunsource.net
> > > > Date: Monday, June 30, 2008, 5:40 PM
> > > >
> > > > Hi,
> > > >
> > > > Am 30.06.2008 um 17:52 schrieb Azhar Ali Shah:
> > > >
> > > > > Having installed OpenMPI 1.2.6 on each node of a Linux cluster, SGE
> > > > > 6.1u3 gives the following error when executing a test parallel job:
> > > > > error: executing task of job 198 failed:
> > > > > [taramel:04947] ERROR: A daemon on node xyz failed to start as expected.
> > > > > [taramel:04947] ERROR: There may be more information available from
> > > > > [taramel:04947] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> > > > > [taramel:04947] ERROR: If the problem persists, please restart the
> > > > > [taramel:04947] ERROR: Grid Engine PE job
> > > > > [taramel:04947] ERROR: The daemon exited unexpectedly with status 1.
> > > > >
> > > > > The message log for the node in subject says:
> > > > > 06/30/2008 16:24:09|execd|xyz|E|no free queue for job 198 of user aas at abc.uk (localhost = xyz)
> > > >
> > > > strange - if there is no free slot, the job shouldn't get scheduled
> > > > at all. IIRC this message appears only for a wrong setting of
> > > > "job_is_first_task", but your setting of "false" is fine.
> > > >
> > > > What did your jobscript look like?
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > To my surprise all the nodes are free and qstat -f doesn't display
> > > > > them in error/unreachable/running etc.
> > > > > Also, when I submit the job requesting only one node, it runs
> > > > > without any problem on that node. This is true for all nodes except
> > > > > the master (which gives the same problem).
> > > > >
> > > > > I am using the following configuration for OpenMPI:
> > > > > pe_name openmpi
> > > > > slots 999
> > > > > user_lists NONE
> > > > > xuser_lists NONE
> > > > > start_proc_args /bin/true
> > > > > stop_proc_args /bin/true
> > > > > allocation_rule $round_robin
> > > > > control_slaves TRUE
> > > > > job_is_first_task FALSE
> > > > > urgency_slots min
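
As a side note, the PE also has to appear in the pe_list of a queue
and be requested at submission time; a sketch with all.q as an assumed
queue name:

    qconf -aattr queue pe_list openmpi all.q
    qsub -pe openmpi 9 jobscript.sh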
> > > > >
> > > > > Any pointers on how to correct the startup of the daemons, please?
> > > > >
> > > > > thanks
> > > > > Azhar


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



