[GE users] SGE+OpenMPI: ERROR: A daemon on node xyz failed to start as expected

Azhar Ali Shah aas_lakyari at yahoo.com
Tue Jul 1 10:09:31 BST 2008



--- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
>You also used the mpicc from Open MPI? 
Yes! And, as I mentioned earlier, if I run the job on a single node it runs fine!

>Do you see more in the process listing below when you append --cols=500  
>to see the full orted line? Any probs with the nodenames?
No problems with nodenames!

13699     1 13699 /usr/SGE6/bin/lx24-x86/sge_execd
18259 13699 18259  \_ sge_shepherd-200 -bg
18265 18259 18265  |   \_ bash /usr/SGE6/default/spool/justice/job_scripts/200
18269 18265 18265  |       \_ mpirun -n 9 /home/aas/mpihello
18270 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V justice /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
18279 18270 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35272 justice exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice' noshell
18271 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V aragorn /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 7 --vpid_start 0 --nodename aragorn --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
18285 18271 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 60950 aragorn exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/aragorn/active_jobs/200.1/1.aragorn' noshell
18272 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V smeg.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 7 --vpid_start 0 --nodename smeg --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
18281 18272 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34978 smeg.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/smeg/active_jobs/200.1/1.smeg' noshell
18273 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V taramel.cs.nott.ac.uk /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 7 --vpid_start 0 --nodename taramel --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
18278 18273 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48076 taramel exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/taramel/active_jobs/200.1/1.taramel' noshell
18274 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V legolas /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 7 --vpid_start 0 --nodename legolas.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
18284 18274 18265  |           |   \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39295 legolas.cs.nott.ac.uk exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/legolas/active_jobs/200.1/1.legolas' noshell
18275 18269 18265  |           \_ qrsh -inherit -noshell -nostdin -V eomer /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.6 --num_procs 7 --vpid_start 0 --nodename eomer.cs.nott.ac.uk --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
18283 18275 18265  |               \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36236 eomer exec '/usr/SGE6/utilbin/lx24-x86/qrsh_starter' '/usr/SGE6/default/spool/eomer/active_jobs/200.1/1.eomer' noshell
18276 13699 18276  \_ sge_shepherd-200 -bg
18277 18276 18277      \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
18280 18277 18280          \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/justice/active_jobs/200.1/1.justice noshell
18282 18280 18282              \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 7 --vpid_start 0 --nodename justice --universe aas at justice:default-universe-18269 --nsreplica "0.0.0;tcp://128.243.24.110:35258" --gprreplica "0.0.0;tcp://128.243.24.110:35258"
18286 18282 18282                  \_ /home/aas/mpihello
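
For what it's worth, the listing above can be reproduced (and trimmed to the interesting lines) with something like the following; the egrep filter is my own addition, not a command from the thread:

    # Full command lines need a wide output, as Reuti suggested with --cols=500;
    # the filter just picks out the SGE shepherd, qrsh/rsh and Open MPI processes.
    ps -e f --cols=500 | egrep 'sge_shepherd|qrsh|rsh|orted|mpirun|mpihello'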

>Any errors on the slave nodes - like a firewall or similar in the
>tcp-wrapper? Is there something in the messages files on the nodes in
>$SGE_ROOT/default/spool/comp1/messages et al.?
No messages on any node!
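
In case it helps anyone reading along, a quick way to check those execd messages files on every node from the listing above is a loop like this (my own sketch, not a command from the thread; it relies on the passwordless ssh mentioned below and on $SGE_ROOT being /usr/SGE6 as in the listing):

    # Hypothetical check: show the tail of each node's execd messages file.
    # Host names and spool paths are taken from the process listing above.
    for host in justice aragorn smeg taramel legolas eomer; do
        echo "== $host =="
        ssh $host tail -n 20 /usr/SGE6/default/spool/$host/messages
    done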

-- Azhar





From: Reuti <reuti at staff.uni-marburg.de>
Subject: Re: [GE users] SGE+OpenMPI:  ERROR: A daemon on node xyz failed to start as expected
To: users at gridengine.sunsource.net
Date: Monday, June 30, 2008, 10:24 PM

On 30.06.2008 at 20:03, Azhar Ali Shah wrote:

>
> --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> >Can you try a simple mpihello?
> It also gives following error:
>
> [taramel:05999] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
> [taramel:06000] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 214
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about: orte_init:startup:internal-failure
> from the file: help-orte-runtime
> But I couldn't find any file matching that name. Sorry!

Completely strange :-?!? You also used the mpicc from Open MPI? Do  
you see more in the process listing below when you append --cols=500  
to see the full orted line? Any probs with the nodenames?

Any errors on the slave nodes - like a firewall or similar in the
tcp-wrapper? Is there something in the messages files on the nodes in
$SGE_ROOT/default/spool/comp1/messages et al.?

-- Reuti


> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> [taramel:5999] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>
>
> >Are the processes allocated correctly on the granted nodes?
> Well, the ps -e f gives:
>
> 9197 9982 9197 \_ sge_shepherd-199 -bg
> 9202 9197 9202 | \_ bash /usr/SGE6/default/spool/smeg/job_scripts/199
> 9206 9202 9202 | \_ mpirun -n 9 /home/aas/mpihello
> 9207 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp3 /home/aas/local/openmpi/bi
> 9216 9207 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 34959 comp3 exec '/usr/SGE
> 9208 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> 9219 9208 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 35247 comp4 exec '/usr/
> 9209 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp1 /home/aas/local/openmpi
> 9214 9209 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 39905 comp1 exec '/usr/
> 9210 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp6 /home/aas/local/openmpi
> 9222 9210 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 41378 comp6 exec '/usr/
> 9211 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp4 /home/aas/local/openmpi
> 9221 9211 9202 | | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 48105 comp4 exec '/usr/
> 9212 9206 9202 | \_ qrsh -inherit -noshell -nostdin -V comp2 /home/aas/local/openmpi/b
> 9220 9212 9202 | \_ /usr/SGE6/utilbin/lx24-x86/rsh -n -p 36224 comp2 exec '/usr/SG
> 9213 9982 9213 \_ sge_shepherd-199 -bg
> 9215 9213 9215 \_ /usr/SGE6/utilbin/lx24-x86/rshd -l
> 9217 9215 9217 \_ /usr/SGE6/utilbin/lx24-x86/qrsh_starter /usr/SGE6/default/spool/comp3/active_jobs/199
> 9218 9217 9218 \_ /home/aas/local/openmpi/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --nu
> 9223 9218 9218 \_ /home/aas/mpihello
>
> Which to me seems correct.
>
> >Are you using special MPI-2 techniques like spawning additional
> >processes to the slave-nodes?
> No.
>
> thanks for your time.
> Azhar
>
>
>
>
> From: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz  
> failed to start as expected
> To: users at gridengine.sunsource.net
> Date: Monday, June 30, 2008, 6:17 PM
>
> On 30.06.2008 at 18:50, Azhar Ali Shah wrote:
>
> >
> > --- On Mon, 6/30/08, Reuti <reuti at staff.uni-marburg.de> wrote:
> > >What did your jobscript look like?
> >
> > The job script is:
> > #$ -S /bin/bash
> > #$ -M aas at xxx
> > #$ -m be
> > #$ -N fast-ds250-9p-openmpi
> > #
> >
> >
> > export PATH=/home/aas/local/openmpi/bin:$PATH
> > echo "Got $NSLOTS slots."
> > echo Running on host `hostname`
> > echo Time is `date`
> > echo Directory is `pwd`
> > echo This job runs on the following processors:
> > # cat $TMPDIR/machines
> > echo This job has allocated $NSLOTS processors
> >
> > mpirun -n $NSLOTS ~/par_procksi_Alone
>
> Mmh - all looks fine. Can you try a simple mpihello like the one inside
> http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz with
> this setup please (which needs to be qdel'ed intentionally)? Are the
> processes allocated correctly on the granted nodes?
>
> Are you using special MPI-2 techniques like spawning additional
> processes to the slave-nodes?
>
> -- Reuti
>
>
> > exit 0
> >
> > Further, I have passwordless ssh/rsh on all nodes.
> > Please let me know if any other information would be useful to
> > rectify the cause?
> >
> > Thanks,
> > Azhar
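
As a side note, a minimal tight-integration test along the lines Reuti suggests might look roughly like this; the PE name "openmpi" and the slot count are taken from the PE configuration quoted further below, while the file names mpihello.c and mpihello.sh are placeholders of my own:

    # mpihello.sh - a small test job script (a sketch, not the poster's script)
    #$ -S /bin/bash
    #$ -N mpihello-test
    #$ -pe openmpi 9
    export PATH=/home/aas/local/openmpi/bin:$PATH
    mpirun -n $NSLOTS ~/mpihello

    # build with Open MPI's wrapper compiler and submit:
    export PATH=/home/aas/local/openmpi/bin:$PATH
    mpicc -o ~/mpihello mpihello.c    # e.g. the source from Reuti's mpihello.tgz
    qsub mpihello.sh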
> >
> >
> > From: Reuti <reuti at staff.uni-marburg.de>
> > Subject: Re: [GE users] SGE+OpenMPI: ERROR: A daemon on node xyz
> > failed to start as expected
> > To: users at gridengine.sunsource.net
> > Date: Monday, June 30, 2008, 5:40 PM
> >
> > Hi,
> >
> > On 30.06.2008 at 17:52, Azhar Ali Shah wrote:
> >
> > > Having installed OpenMPI 1.2.6 on each node of a Linux cluster, SGE
> > > 6.1u3 gives the following error when executing a test parallel job:
> > > error: executing task of job 198 failed:
> > > [taramel:04947] ERROR: A daemon on node xyz failed to start as expected.
> > > [taramel:04947] ERROR: There may be more information available from
> > > [taramel:04947] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> > > [taramel:04947] ERROR: If the problem persists, please restart the
> > > [taramel:04947] ERROR: Grid Engine PE job
> > > [taramel:04947] ERROR: The daemon exited unexpectedly with status 1.
> > >
> > > The message log for the node in the subject says:
> > > 06/30/2008 16:24:09|execd|xyz|E|no free queue for job 198 of user aas at abc.uk (localhost = xyz)
> >
> > strange - if there is no free slot, the job shouldn't get scheduled
> > at all. IIRC this message appears only for a wrong setting of
> > "job_is_first_task", but your setting of "false" is fine.
> >
> > What did your jobscript look like?
> >
> > -- Reuti
> >
> >
> > > To my surprise all the nodes are free and qstat -f doesn't display
> > > them as in error/unreachable/running etc.
> > > Also, when I submit the job requesting only one node, it runs
> > > without any problem on that node. This is true for all nodes except
> > > the master (which gives the same problem).
> > >
> > > I am using the following PE configuration for Open MPI:
> > > pe_name openmpi
> > > slots 999
> > > user_lists NONE
> > > xuser_lists NONE
> > > start_proc_args /bin/true
> > > stop_proc_args /bin/true
> > > allocation_rule $round_robin
> > > control_slaves TRUE
> > > job_is_first_task FALSE
> > > urgency_slots min
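
For readers who want to reproduce this setup: a PE definition like the one above is normally registered and attached to a cluster queue with qconf and then requested at submit time. The queue name all.q below is an assumption of mine, not something stated in the thread:

    qconf -sp openmpi                  # show the PE definition quoted above
    qconf -mq all.q                    # add "openmpi" to the queue's pe_list
    qsub -pe openmpi 9 job_script.sh   # request 9 slots from that PE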
> > >
> > > Any pointers on how to correct the startup of daemons please?
> > >
> > > thanks
> > > Azhar
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




