[GE users] SGE+OpenMPI: ERROR: A daemon on node xyz failed to start as expected

Joe Landman landman at scalableinformatics.com
Tue Jul 1 14:53:52 BST 2008



Azhar Ali Shah wrote:
> 
> --- On Tue, 7/1/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> 
>>All looks perfect. The thing I could imagine is that a firewall is  
>>blocking the communication to other nodes. Are you using ssh to the  
>>nodes normally instead?
> 
> Double-checked for passwordless ssh, which seems OK for all nodes.
> Every node runs the job perfectly except the head node.
> Keeping the head node's queue suspended, the 'mpihello' program runs well on 5 nodes (7 slots), giving the following output:
> 
> Hello World from Node 2.
> Hello World from Node 1.
> Hello World from Node 0.
> Hello World from Node 5.
> Hello World from Node 4.
> Hello World from Node 6.
> Hello World from Node 3.
> 
> But once the head node is resumed, it gives this error:
> 
> [taramel:04878] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214

Could it be that the paths (the OpenMPI installation path, or the path 
to the binary) are somehow different between the head node and the 
compute nodes?  The error is saying that the OpenMPI Run Time 
Environment (ORTE) cannot find something it needs at startup.

A few things can cause this: incorrect naming (the DNS name not matching 
the system host name), or paths that diverge between hosts (OpenMPI 
assumes that the paths are the same on all machines unless you tell it 
explicitly that they are not).
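
A quick way to check the naming side is something like the loop below 
(just a sketch; the host names are the ones that appear in your process 
listing further down, so substitute your own list). It compares what each 
node thinks its name is with what the resolver returns:

	# sketch: compare the system host name with what DNS/the resolver says,
	# on every node (host list copied from the listing below - adjust it)
	for host in justice aragorn smeg taramel legolas eomer; do
	    echo "=== $host ==="
	    ssh "$host" 'hostname; hostname -f; getent hosts $(hostname)'
	done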

On each machine (the head node and one of the compute nodes), could you 
run

	which mpirun

and see whether we are looking at the same installation and version?
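
For example (again only a sketch; I'm assuming here that mpirun answers 
--version in your 1.2.x install, and the host list is again taken from the 
names in your listing):

	# sketch: show which mpirun is picked up, and its reported version, per node
	for host in taramel justice aragorn; do
	    echo "=== $host ==="
	    ssh "$host" 'which mpirun; mpirun --version 2>&1 | head -1'
	done

If the head node reports a different path or version than the compute 
nodes, that would fit ORTE failing only there.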

> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> [taramel:4878] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
> 
> Have tried to reinstall OpenMPI on the head node, but it doesn't help.
> Have also checked <ssh headnode date>, which works fine.
> Also tried to restart the head node, but to no avail!
> 
> Any idea what should be done?
> 
> -- Azhar
> 
> 
> Double-checked with <which mpicc> and <which mpirun>, which both seem to point at the OpenMPI install.
> 
>     Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
>     failed to start as expected
>     To: users at gridengine.sunsource.net
>     Date: Tuesday, July 1, 2008, 12:27 PM
> 
>     All looks perfect. The thing I could imagine is that a firewall is  
>     blocking the communication to other nodes. Are you using ssh to the  
>     nodes normally instead?
> 
>     -- Reuti
> 
> 
>     On 01.07.2008 at 11:09, Azhar Ali Shah wrote:
> 
>     > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
>     > >You also used the mpicc from Open MPI?
>     > Yes! And as I mentioned earlier, if I run the job on a single node it
>     > runs well!
>     >
>     > >Do you see more in the process listing below when you append --cols=500
>     > >to see the full orted line? Any probs with the nodenames?
>     > No problems with nodenames!
>     >
>     > 13699     1 13699 /usr/SGE6/bin/lx24-x86/sge_execd
>     > 18259 13699 18259  \_ sge_shepherd-200 -bg
>     > 18265 18259 18265  |   \_ bash /usr/SGE6/default/spool/justice/job_scripts/200
>     > 18269 18265 18265  |       \_ mpirun -n 9 /home/aas/mpihello
>     > 18270 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V justice /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
>     > 18279 18270 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35272 justice exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice' noshell
>     > 18271 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V aragorn /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 7 --vpid_start 0 --nodename aragorn --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
>     > 18285 18271 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 60950 aragorn exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/aragorn/active_jobs/200.1/1.aragorn' noshell
>     > 18272 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V smeg.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 7 --vpid_start 0 --nodename smeg --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
>     > 18281 18272 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34978 smeg.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/smeg/active_jobs/200.1/1.smeg' noshell
>     > 18273 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V taramel.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 7 --vpid_start 0 --nodename taramel --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
>     > 18278 18273 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48076 taramel exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/taramel/active_jobs/200.1/1.taramel' noshell
>     > 18274 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V legolas /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 7 --vpid_start 0 --nodename legolas.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
>     > 18284 18274 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39295 legolas.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/legolas/active_jobs/200.1/1.legolas' noshell
>     > 18275 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V eomer /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.6 --num_procs 7 --vpid_start 0 --nodename eomer.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
>     > 18283 18275 18265  |               \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36236 eomer exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/eomer/active_jobs/200.1/1.eomer' noshell
>     > 18276 13699 18276  \_ sge_shepherd-200 -bg
>     > 18277 18276 18277      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
>     > 18280 18277 18280          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice noshell
>     > 18282 18280 18282              \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
>     > 18286 18282 18282                  \_ /home/aas/mpihello
>     >
>     > >Any errors on the slave nodes - like firewall or similar in the tcp-
>     > >wrapper? Is something in the messages on the nodes in $SGE_ROOT/
>     > >default/spool/comp1/messages et al.?
>     > No message on any node!
>     >
>     > -- Azhar
>     >
>     >
>     >
>     >
>     > From: Reuti <reuti at staff.uni-marburg.de>
>     > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz  
>     > failed to start as expected
>     > To: users at gridengine.sunsource.net
>     > Date: Monday, June 30, 2008, 10:24 PM
>     >
>     > On 30.06.2008 at 20:03, Azhar Ali Shah wrote:
>     >
>     > >
>     > > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
>     > > >Can you try a simple mpihello?
>     > > It also gives following error:
>     > >
>     > > [taramel:05999] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
>     > > [taramel:06000] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
>     > > --------------------------------------------------------------------------
>     > > Sorry! You were supposed to get help about:
>     > > orte_init:startup:internal-failure
>     > > from the file: help-orte-runtime
>     > > But I couldn't find any file matching that name. Sorry!
>     >
>     > Completely strange :-?!? You also used the mpicc from Open MPI? Do
>     > you see more in the process listing below when you append --cols=500
>     > to see the full orted line? Any probs with the nodenames?
>     >
>     > Any errors on the slave nodes - like firewall or similar in the tcp-
>     > wrapper? Is something in the messages on the nodes in $SGE_ROOT/
>     > default/spool/comp1/messages et al.?
>     >
>     > -- Reuti
>     >
>     >
>     > > --------------------------------------------------------------------------
>     > > *** An error occurred in MPI_Init
>     > > *** before MPI was initialized
>     > > *** MPI_ERRORS_ARE_FATAL (goodbye)
>     > > [taramel:5999] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>     > >
>     > >
>     > > >Are the processes allocated correctly on the granted nodes?
>     > > Well, the ps -e f gives:
>     > >
>     > > 9197 9982 9197 \_ sge_shepherd-199 -bg
>     > > 9202 9197 9202 | \_ bash /usr/SGE6/default/spool/smeg/job_scripts/199
>     > > 9206 9202 9202 | \_ mpirun -n 9 /home/aas/mpihello
>     > > 9207 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp3 /home/aas/local/openmpi/bi
>     > > 9216 9207 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34959 comp3 exec '/usr/SGE
>     > > 9208 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
>     > > 9219 9208 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35247 comp4 exec '/usr/
>     > > 9209 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp1 /home/aas/local/openmpi
>     > > 9214 9209 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39905 comp1 exec '/usr/
>     > > 9210 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp6 /home/aas/local/openmpi
>     > > 9222 9210 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 41378 comp6 exec '/usr/
>     > > 9211 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
>     > > 9221 9211 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48105 comp4 exec '/usr/
>     > > 9212 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp2 /home/aas/local/openmpi/b
>     > > 9220 9212 9202 | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36224 comp2 exec '/usr/SG
>     > > 9213 9982 9213 \_ sge_shepherd-199 -bg
>     > > 9215 9213 9215 \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
>     > > 9217 9215 9217 \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/comp3/active_jobs/199
>     > > 9218 9217 9218 \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --nu
>     > > 9223 9218 9218 \_ /home/aas/mpihello
>     > >
>     > > Which to me seems correct.
>     > >
>     > > >Are you using special MPI-2 techniques like spawning additional
>     > > >processes to the slave-nodes?
>     > > No.
>     > >
>     > > thanks for your time.
>     > > Azhar
>     > >
>     > >
>     > >
>     > >
>     > > From: Reuti <reuti at staff.uni-marburg.de>
>     > > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
>     > > failed to start as expected
>     > > To: users at gridengine.sunsource.net
>     > > Date: Monday, June 30, 2008, 6:17 PM
>     > >
>     > > On 30.06.2008 at 18:50, Azhar Ali Shah wrote:
>     > >
>     > > >
>     > > > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
>     > > > >What did your jobscript look like?
>     > > >
>     > > > The job script is:
>     > > > #$ -S /bin/bash
>     > > > #$ -M aas at xxx
>     > > > #$ -m be
>     > > > #$ -N fast-ds250-9p-openmpi
>     > > > #
>     > > >
>     > > >
>     > > > export PATH=/home/aas/local/openmpi/bin:$PATH
>     > > > echo "Got $NSLOTS slots."
>     > > > echo Running on host `hostname`
>     > > > echo Time is `date`
>     > > > echo Directory is `pwd`
>     > > > echo This job runs on the following processors:
>     > > > # cat $TMPDIR/machines
>     > > > echo This job has allocated $NSLOTS processors
>     > > >
>     > > > mpirun -n $NSLOTS ~/par_procksi_Alone
>     > >
>     > > Mmh - all looks fine. Can you try a simple mpihello like inside
>     > > http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz
>     > > with this setup please (which needs to be qdel'ed by intention)? Are the
>     > > processes allocated correctly on the granted nodes?
>     > >
>     > > Are you using special MPI-2 techniques like spawning additional
>     > > processes to the slave-nodes?
>     > >
>     > > -- Reuti
>     > >
>     > >
>     > > > exit 0
>     > > >
>     > > > Further, I have passwordless ssh/rsh on all nodes.
>     > > > Please let me know if any other information would be useful to
>     > > > rectify the cause?
>     > >
>     > > > Thanks,
>     > > > Azhar
>     > > >
>     > > >
>     > > > From: Reuti <reuti at staff.uni-marburg.de>
>     > > > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
>     > > > failed to start as expected
>     > > > To: users at gridengine.sunsource.net
>     > > > Date: Monday, June 30, 2008, 5:40 PM
>     > > >
>     > > > Hi,
>     > > >
>     > > > On 30.06.2008 at 17:52, Azhar Ali Shah wrote:
>     > > >
>     > > > > Having installed OpenMPI 1.2.6 on each node of a Linux cluster, SGE
>     > > > > 6.1u3 gives the following error when executing a test parallel job:
>     > > > > error: executing task of job 198 failed:
>     > > > > [taramel:04947] ERROR: A daemon on node xyz failed to start as expected.
>     > > > > [taramel:04947] ERROR: There may be more information available from
>     > > > > [taramel:04947] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>     > > > > [taramel:04947] ERROR: If the problem persists, please restart the
>     > > > > [taramel:04947] ERROR: Grid Engine PE job
>     > > > > [taramel:04947] ERROR: The daemon exited unexpectedly with status 1.
>     > > > >
>     > > > > The message log for the node in the subject says:
>     > > > > 06/30/2008 16:24:09|execd|xyz|E|no free queue for job 198 of user aas at abc.uk (localhost = xyz)
>     > > >
>     > > > strange - if there is no free slot, the job shouldn't get scheduled
>     > > > at all. IIRC this message appears only for a wrong setting of
>     > > > "job_is_first_task", but your setting of "false" is fine.
>     > > >
>     > > > What did your jobscript look like?
>     > > >
>     > > > -- Reuti
>     > > >
>     > > >
>     > > > > To my surprise all the nodes are free and qstat -f doesn't display
>     > > > > them as in error/unreachable/running etc.
>     > > > > Also, when I submit the job requesting only one node, it runs
>     > > > > without any problem on that node. This is true for all nodes except
>     > > > > the master (which gives the same problem).
>     > > > >
>     > > > > I am using the following configuration for the OpenMPI PE:
>     > > > > pe_name openmpi
>     > > > > slots 999
>     > > > > user_lists NONE
>     > > > > xuser_lists NONE
>     > > > > start_proc_args /bin/true
>     > > > > stop_proc_args /bin/true
>     > > > > allocation_rule $round_robin
>     > > > > control_slaves TRUE
>     > > > > job_is_first_task FALSE
>     > > > > urgency_slots min
>     > > > >
>     > > > > Any pointers on how to correct the startup of daemons please?
>     > > > >
>     > > > > thanks
>     > > > > Azhar


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



