[GE users] SGE+OpenMPI: ERROR: A daemon on node xyz failed to start as expected

Azhar Ali Shah aas_lakyari at yahoo.com
Tue Jul 1 14:32:10 BST 2008


--- On Tue, 7/1/08, Reuti <reuti at staff.uni-marburg.de> wrote:
>All looks perfect. The thing I could imagine is that a firewall is  
>blocking the communication to other nodes. Are you using ssh to the  
>nodes normally instead?

Double-checked passwordless ssh, which seems OK for all nodes.
Each node runs the job perfectly except the head node.
Keeping the head node's queue suspended, the 'mpihello' program runs well on the other 5 nodes (7 slots), giving the following output:

Hello World from Node 2.
Hello World from Node 1.
Hello World from Node 0.
Hello World from Node 5.
Hello World from Node 4.
Hello World from Node 6.
Hello World from Node 3.
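
(For reference, the test is driven roughly like this - the queue instance name 'all.q@taramel' and the wrapper script name 'mpihello.sh' are just placeholders here, the PE name and slot count are as above:)

qmod -s all.q@taramel             # keep the head node's queue instance suspended
qsub -pe openmpi 7 mpihello.sh    # the script essentially runs: mpirun -n $NSLOTS ~/mpihello
qmod -us all.q@taramel            # unsuspend it again to reproduce the failure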

But once the head node's queue is resumed, the job fails with:

[taramel:04878] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
[taramel:4878] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

Have tried to reinstall OpenMPI on the head node, but it doesn't help.
Have also checked that <ssh headnode date> works fine.
Also tried to restart the head node, but to no avail!

No idea what else to try - any suggestions?

-- Azhar


P.S. Also double-checked with <which mpicc> and <which mpirun>; both point to the Open MPI installation.
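
For completeness, the sort of sanity checks meant above (a rough sketch - 'taramel' is the head node, the Open MPI prefix is the one from the process listing, and the last two commands are just extra checks rather than anything definitive):

which mpicc mpirun               # both should resolve to /home/aas/local/openmpi/bin
ompi_info | grep "Open MPI:"     # version of the installation actually being picked up
ssh taramel date                 # passwordless ssh to the head node itself
ssh taramel 'which orted'        # what a non-interactive shell on the head node finds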









From: Reuti <reuti at staff.uni-marburg.de>
Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz failed to start as expected
To: users at gridengine.sunsource.net
Date: Tuesday, July 1, 2008, 12:27 PM

All looks perfect. The thing I could imagine is that a firewall is  
blocking the communication to other nodes. Are you using ssh to the  
nodes normally instead?

-- Reuti


On 01.07.2008 at 11:09, Azhar Ali Shah wrote:

> --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> >You also used the mpicc from Open MPI?
> Yes! And as I mentioned earlier, if I run the job on a single node it  
> runs well!
>
> >Do you see more in the process listing below when you append --cols=500
> >to see the full orted line? Any probs with the nodenames?
> No problems with nodenames!
>
> 13699     1 13699 /usr/SGE6/bin/lx24-x86/sge_execd
> 18259 13699 18259  \_ sge_shepherd-200 -bg
> 18265 18259 18265  |   \_ bash /usr/SGE6/default/spool/justice/job_scripts/200
> 18269 18265 18265  |       \_ mpirun -n 9 /home/aas/mpihello
> 18270 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V justice /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18279 18270 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35272 justice exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice' noshell
> 18271 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V aragorn /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 7 --vpid_start 0 --nodename aragorn --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18285 18271 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 60950 aragorn exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/aragorn/active_jobs/200.1/1.aragorn' noshell
> 18272 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V smeg.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 7 --vpid_start 0 --nodename smeg --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18281 18272 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34978 smeg.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/smeg/active_jobs/200.1/1.smeg' noshell
> 18273 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V taramel.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 7 --vpid_start 0 --nodename taramel --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18278 18273 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48076 taramel exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/taramel/active_jobs/200.1/1.taramel' noshell
> 18274 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V legolas /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 7 --vpid_start 0 --nodename legolas.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18284 18274 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39295 legolas.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/legolas/active_jobs/200.1/1.legolas' noshell
> 18275 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V eomer /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.6 --num_procs 7 --vpid_start 0 --nodename eomer.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18283 18275 18265  |               \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36236 eomer exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/eomer/active_jobs/200.1/1.eomer' noshell
> 18276 13699 18276  \_ sge_shepherd-200 -bg
> 18277 18276 18277      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> 18280 18277 18280          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice noshell
> 18282 18280 18282              \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
> 18286 18282 18282                  \_ /home/aas/mpihello
>
> >Any errors on the slave nodes - like firewall or similar in the tcp-
> >wrapper? Is something in the messages on the nodes in $SGE_ROOT/
> >default/spool/comp1/messages et al.?
> No message on any node!
>
> -- Azhar
>
>
>
>
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz  
> failed to start as expected
> To: users at gridengine.sunsource.net
> Date: Monday, June 30, 2008, 10:24 PM
>
> On 30.06.2008 at 20:03, Azhar Ali Shah wrote:
>
> >
> > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > >Can you try a simple mpihello?
> > It also gives following error:
> >
> > [taramel:05999] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
> > [taramel:06000] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
> > --------------------------------------------------------------------------
> > Sorry! You were supposed to get help about:
> > orte_init:startup:internal-failure
> > from the file: help-orte-runtime
> > But I couldn't find any file matching that name. Sorry!
>
> Completely strange :-?!? You also used the mpicc from Open MPI? Do
> you see more in the process listing below when you append --cols=500
> to see the full orted line? Any probs with the nodenames?
>
> Any errors on the slave nodes - like firewall or similar in the tcp-
> wrapper? Is something in the messages on the nodes in $SGE_ROOT/
> default/spool/comp1/messages et al.?
>
> -- Reuti
>
>
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** before MPI was initialized
> > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [taramel:5999] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> >
> >
> > >Are the processes allocated correctly on the granted nodes?
> > Well, the ps -e f gives:
> >
> > 9197 9982 9197 \_ sge_shepherd-199 -bg
> > 9202 9197 9202 | \_ bash /usr/SGE6/default/spool/smeg/job_scripts/199
> > 9206 9202 9202 | \_ mpirun -n 9 /home/aas/mpihello
> > 9207 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp3 /home/aas/local/openmpi/bi
> > 9216 9207 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34959 comp3 exec '/usr/SGE
> > 9208 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> > 9219 9208 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35247 comp4 exec '/usr/
> > 9209 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp1 /home/aas/local/openmpi
> > 9214 9209 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39905 comp1 exec '/usr/
> > 9210 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp6 /home/aas/local/openmpi
> > 9222 9210 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 41378 comp6 exec '/usr/
> > 9211 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> > 9221 9211 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48105 comp4 exec '/usr/
> > 9212 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp2 /home/aas/local/openmpi/b
> > 9220 9212 9202 | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36224 comp2 exec '/usr/SG
> > 9213 9982 9213 \_ sge_shepherd-199 -bg
> > 9215 9213 9215 \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> > 9217 9215 9217 \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/comp3/active_jobs/199
> > 9218 9217 9218 \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --nu
> > 9223 9218 9218 \_ /home/aas/mpihello
> >
> > Which to me seems correct.
> >
> > >Are you using special MPI-2 techniques like spawning additional
> > >processes to the slave-nodes?
> > No.
> >
> > thanks for your time.
> > Azhar
> >
> >
> >
> >
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > failed to start as expected
> > To: users at gridengine.sunsource.net
> > Date: Monday, June 30, 2008, 6:17 PM
> >
> > On 30.06.2008 at 18:50, Azhar Ali Shah wrote:
> >
> > >
> > > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > >What did your jobscript look like?
> > >
> > > The job script is:
> > > #$ -S /bin/bash
> > > #$ -M aas at xxx
> > > #$ -m be
> > > #$ -N fast-ds250-9p-openmpi
> > > #
> > >
> > >
> > > export PATH=/home/aas/local/openmpi/bin:$PATH
> > > echo "Got $NSLOTS slots."
> > > echo Running on host `hostname`
> > > echo Time is `date`
> > > echo Directory is `pwd`
> > > echo This job runs on the following processors:
> > > # cat $TMPDIR/machines
> > > echo This job has allocated $NSLOTS processors
> > >
> > > mpirun -n $NSLOTS ~/par_procksi_Alone
> >
> > Mmh - all looks fine. Can you try a simple mpihello like inside
> > http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz with
> > this setup please (which needs to be qdel'ed by intention)? Are the
> > processes allocated correctly on the granted nodes?
> >
> > Are you using special MPI-2 techniques like spawning additional
> > processes to the slave-nodes?
> >
> > -- Reuti
> >
> >
> > > exit 0
> > >
> > > Further, I have passwordless ssh/rsh on all nodes.
> > > Please let me know if any other information would be useful to
> > > rectify the cause?
> > >
> > > Thanks,
> > > Azhar
> > >
> > >
> > > From: Reuti <reuti at staff.uni-marburg.de>
> > > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > > failed to start as expected
> > > To: users at gridengine.sunsource.net
> > > Date: Monday, June 30, 2008, 5:40 PM
> > >
> > > Hi,
> > >
> > > On 30.06.2008 at 17:52, Azhar Ali Shah wrote:
> > >
> > > > Having installed OpenMPI 1.2.6 on each node of a Linux cluster, SGE
> > > > 6.1u3 gives the following error when executing a test parallel job:
> > > > error: executing task of job 198 failed:
> > > > [taramel:04947] ERROR: A daemon on node xyz failed to start as expected.
> > > > [taramel:04947] ERROR: There may be more information available from
> > > > [taramel:04947] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> > > > [taramel:04947] ERROR: If the problem persists, please restart the
> > > > [taramel:04947] ERROR: Grid Engine PE job
> > > > [taramel:04947] ERROR: The daemon exited unexpectedly with status 1.
> > > >
> > > > The message log for the node in the subject says:
> > > > 06/30/2008 16:24:09|execd|xyz|E|no free queue for job 198 of user aas at abc.uk (localhost = xyz)
> > >
> > > strange - if there is no free slot, the job shouldn't get scheduled
> > > at all. IIRC this message appears only for a wrong setting of
> > > "job_is_first_task", but your setting of "false" is fine.
> > >
> > > What did your jobscript look like?
> > >
> > > -- Reuti
> > >
> > >
> > > > To my surprise all the nodes are free and qstat -f doesn't display
> > > > them in error/unreachable/running etc.
> > > > Also, when I submit the job requesting only one node it runs
> > > > without any problem on that node. This is true for all nodes except
> > > > the master (which gives the same problem).
> > > >
> > > > I am using following configuration for OpenMPI:
> > > > pe_name openmpi
> > > > slots 999
> > > > user_lists NONE
> > > > xuser_lists NONE
> > > > start_proc_args /bin/true
> > > > stop_proc_args /bin/true
> > > > allocation_rule $round_robin
> > > > control_slaves TRUE
> > > > job_is_first_task FALSE
> > > > urgency_slots min
> > > >
> > > > Any pointers on how to correct the startup of the daemons, please?
> > > >
> > > > thanks
> > > > Azhar
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net


      


