[GE users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads

flengyel flengyel at gc.cuny.edu
Tue Jul 7 19:36:20 BST 2009



OpenMPI+SGE tight integration works on the E6600 Core 2 Duo systems but fails on the Q9550 quad-core systems.
I could use some troubleshooting assistance. Thanks.

I'm running SGE 6.0u10 on a Linux cluster running openSUSE 11.

OpenMPI was compiled with SGE support, and the required gridengine components are present:

[flengyel at nept OPENMPI]$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
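
The same check can be scripted so it is easy to repeat on each execution host. This is only a sketch: the function name has_gridengine_support is made up for illustration, and it assumes the Open MPI 1.2-style component names (ras and pls) that appear in the output above.

```shell
#!/bin/bash
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# (exit status 0) only if both gridengine components are listed.
has_gridengine_support() {
    awk '
        /MCA ras: gridengine/ { ras = 1 }   # resource allocation component
        /MCA pls: gridengine/ { pls = 1 }   # process launch component
        END { exit !(ras && pls) }          # 0 if both seen, 1 otherwise
    '
}

# Usage (needs ompi_info on the PATH):
#   ompi_info | has_gridengine_support && echo "gridengine support present"
```

Running this on m09 and m15 as well as on a working node would show whether the hosts in quad.q see the same Open MPI build, assuming the install path is visible there.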


The parallel execution environment for OpenMPI is as follows:

[flengyel at nept OPENMPI]$ qconf -sp ompi
pe_name           ompi
slots             999
user_lists        Research
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
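
Since the same PE serves both queues, whatever differs is presumably in the queue or host configuration rather than in the PE itself. A minimal sketch for comparing two saved configuration dumps follows. diff_queue_conf is a hypothetical helper, and it assumes one "name value" pair per line with no backslash continuation lines.

```shell
#!/bin/bash
# Hypothetical helper: print every setting in the second file whose value
# differs from the first file (or is absent from it).
# Output format: "name: value-in-first | value-in-second".
# Settings that exist only in the first file are not reported.
diff_queue_conf() {
    awk '
        # First file: remember each value under its setting name.
        NR == FNR { key = $1; $1 = ""; a[key] = $0; next }
        # Second file: report settings whose value is different.
        { key = $1; $1 = ""; if (a[key] != $0) printf "%s:%s |%s\n", key, a[key], $0 }
    ' "$1" "$2"
}

# Usage, with files saved beforehand:
#   qconf -sq x86_64.q > good.conf
#   qconf -sq quad.q   > bad.conf
#   diff_queue_conf good.conf bad.conf
```

The inputs could equally be the output of qconf -sconf for a working host and a failing host, each redirected to a file.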

A trivial OpenMPI job using this PE runs correctly on a queue of Intel E6600 Core 2 Duo machines:

[flengyel at nept OPENMPI]$ cat sum2.sh

#!/bin/bash
#$ -S /bin/bash
#$ -q x86_64.q
#$ -N sum
#$ -pe ompi 4

#$ -cwd

export PATH=/home/nept/apps64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
. /usr/local/sge/default/common/settings.sh
mpirun --mca pls_gridengine_verbose 2  --prefix /home/nept/apps64/openmpi -v  ./sum

Here are the results:

[flengyel at nept OPENMPI]$ qsub sum2.sh
Your job 23194 ("sum") has been submitted

[flengyel at nept OPENMPI]$ qstat -r -u flengyel

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  23194 0.25007 sum        flengyel     r     07/07/2009 14:14:40 x86_64.q at m49.gc.cuny.edu           4
       Full jobname:     sum
       Master queue:     x86_64.q at m49.gc.cuny.edu
       Requested PE:     ompi 4
       Granted PE:       ompi 4
       Hard Resources:
       Soft Resources:
       Hard requested queues: x86_64.q


[flengyel at nept OPENMPI]$ more sum.o23194

The sum from 1 to 1000 is: 500500
[flengyel at nept OPENMPI]$ more sum.e23194
Starting server daemon at host "m49.gc.cuny.edu"
Starting server daemon at host "m33.gc.cuny.edu"
Server daemon successfully started with task id "1.m49"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
Server daemon successfully started with task id "1.m33"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ...

But the same job submitted to quad.q, the queue for the Q9550 quad-core machines
(this run requested 2 slots), fails to start its daemons:


[flengyel at nept OPENMPI]$ !qstat
qstat -r -u flengyel
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  23196 0.25000 sum        flengyel     r     07/07/2009 14:26:21 quad.q at m09.gc.cuny.edu             2
       Full jobname:     sum
       Master queue:     quad.q at m09.gc.cuny.edu
       Requested PE:     ompi 2
       Granted PE:       ompi 2
       Hard Resources:
       Soft Resources:
       Hard requested queues: quad.q
[flengyel at nept OPENMPI]$ more sum.e23196
Starting server daemon at host "m15.gc.cuny.edu"
Starting server daemon at host "m09.gc.cuny.edu"
Server daemon successfully started with task id "1.m15"
Server daemon successfully started with task id "1.m09"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
[flengyel at nept OPENMPI]$
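
If the status 129 follows the usual shell convention of 128 plus a signal number, it would correspond to signal 1 (SIGHUP), while the rsh helper itself died from signal 13 (SIGPIPE). That reading is an assumption, since qrsh may assign its own meaning to the code. A small sketch for decoding such values, with decode_status as an illustrative name:

```shell
#!/bin/bash
# Hypothetical helper: interpret a shell-style exit status.
# By convention, a status above 128 means "killed by signal (status - 128)".
decode_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        local sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "signal $sig ($(kill -l "$sig"))"
    else
        echo "exit code $status"
    fi
}

decode_status 129   # prints: signal 1 (HUP)
decode_status 0     # prints: exit code 0
```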


-FL


