[GE users] Open MPI tight integration in HOWTO page

Heywood, Todd heywood at cshl.edu
Thu Feb 1 21:57:33 GMT 2007


Hi Andreas,

 

I set up a simple test parallel environment (adapting
$SGE_ROOT/mpi/startmpi.sh to simply create a machines file under $TMPDIR),
and ran the script you sent me, which is:

 

 

#!/bin/sh
#$ -S /bin/sh

echo "starting $NSLOTS at host `hostname`"

# Launch one background task per line of the machines file.
I=1
while [ $I -le $NSLOTS ]; do
    host=`sed -n "${I}p" $TMPDIR/machines`
    cmd="$SGE_ROOT/bin/$ARC/qrsh -nostdin -noshell -inherit $host sleep 2"
    echo $cmd
    $cmd &
    I=`expr $I + 1`
done
# Block until every background qrsh has returned.
wait

exit 0
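
For reference, the machines file that the loop reads could be produced by a start procedure along the following lines. This is only a sketch, not the modified startmpi.sh itself: it assumes the usual pe_hostfile layout in which each line begins with a host name followed by its slot count, and the function name make_machines is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper (not part of SGE): expand a pe_hostfile into a
# machines file, writing each host name once per granted slot.
# Assumes every pe_hostfile line starts with "<host> <slots> ...".
make_machines() {
    pe_hostfile=$1
    machines=$2
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$pe_hostfile" > "$machines"
}
```

A start procedure might then call it as make_machines "$pe_hostfile" "$TMPDIR/machines", so that line number I of the file names the host for task I.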

 

 

I get the same "GMSH" error messages. Here's the output for a 16-way
job:

 

[heywood@bhmnode2 test]$ ls *82*
-rw-r--r--  1 heywood itstaff  963 Feb  1 16:20 testsge.sh.e8292
-rw-r--r--  1 heywood itstaff 1229 Feb  1 16:20 testsge.sh.o8292
-rw-r--r--  1 heywood itstaff    0 Feb  1 16:20 testsge.sh.pe8292
-rw-r--r--  1 heywood itstaff  210 Feb  1 16:20 testsge.sh.po8292
[heywood@bhmnode2 test]$ more testsge.sh.e8292
error: commlib error: can't read general message size header (GMSH) (closing "blade193.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade211.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade212.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade197.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade203.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade202.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade201.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade206.bluehelix.cshl.edu/execd/1")
TERM environment variable not set.

[heywood@bhmnode2 test]$ more testsge.sh.o8292
starting 16 at host blade192
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade192 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade193 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade194 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade197 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade198 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade200 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade201 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade202 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade203 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade205 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade206 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade207 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade209 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade211 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade212 sleep 2
/opt/n1ge6/bin/lx24-amd64/qrsh -nostdin -noshell -inherit blade213 sleep 2

[heywood@bhmnode2 test]$ more testsge.sh.po8292
-catch_rsh /var/spool/sge/blade192/active_jobs/8292.1/pe_hostfile
blade192
blade193
blade194
blade197
blade198
blade200
blade201
blade202
blade203
blade205
blade206
blade207
blade209
blade211
blade212
blade213
[heywood@bhmnode2 test]$

 

 

That is one of my original errors. At least you get a successful run
regardless of the GMSH error messages (for this test case, and also for
the OpenMPI integration, and also for MPICH2 tight integration).

 

I also submitted the script asking for 300 slots (the parallel
environments have 532 slots total). The new thing was about 10-20
occurrences of this error mixed in with a multitude of SMSH errors in the
*.e<jobid> file:

 

error: executing task of job 8293 failed: failed receiving message from execd: can't find connection 1018
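
To see how the two kinds of message are mixed in a large error file, the lines can be tallied rather than counted by eye. The following is a minimal sketch; the function name count_errors is invented here, and the search patterns are simply taken from the messages quoted in this post.

```sh
#!/bin/sh
# Hypothetical helper: count commlib errors and failed-task errors
# in a job's error file (for example testsge.sh.e8293).
count_errors() {
    errfile=$1
    # grep -c prints the number of matching lines (0 if none).
    commlib=$(grep -c 'commlib error' "$errfile")
    failed=$(grep -c 'executing task of job' "$errfile")
    echo "commlib=$commlib failed_tasks=$failed"
}
```

It would be run as count_errors testsge.sh.e8293 from the directory holding the job output.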

 

I then ran on 400 slots, and the error file now gives this (although
output is produced in the *.o<jobid> and *.po<jobid> files):

 

[heywood@bhmnode2 test]$ more testsge.sh.e8295
rcmd: socket: All ports in use
TERM environment variable not set.
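
One possible reading of "All ports in use" is that rcmd(3) takes its local ports from the reserved range below 1024, and several hundred simultaneous connections from one host can exhaust that range. A way to probe this would be a variant of the test loop that keeps only a limited number of qrsh calls in flight. The sketch below is untested against SGE; BATCH, launch_task and run_batched are illustrative names, not SGE settings, and whether batching actually avoids the error is an assumption.

```sh
#!/bin/sh
# Variant of the test loop that starts tasks in batches.
# BATCH is an illustrative knob, not an SGE parameter.
BATCH=${BATCH:-32}

launch_task() {
    # Same command line as the test script uses for each host.
    "$SGE_ROOT/bin/$ARC/qrsh" -nostdin -noshell -inherit "$1" sleep 2
}

run_batched() {
    machines=$1
    n=0
    {
        while read -r host; do
            launch_task "$host" &
            n=$((n + 1))
            # Once a full batch is in flight, wait for it to drain.
            if [ $((n % BATCH)) -eq 0 ]; then
                wait
            fi
        done
        # Wait for the final, possibly partial, batch.
        wait
    } < "$machines"
}
```

Inside the job script it would be called as run_batched "$TMPDIR/machines". Waiting for a whole batch is cruder than a sliding window, but it keeps the script within plain sh.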

 

 

I've also installed MPICH2 (SMPD daemon-based), and found some issues
with the .smpd configuration file being corrupted when too many tasks
access it at once. Having solved that to run large jobs successfully
outside of SGE, I tried Reuti's tight integration, and found that smpd
daemons hang when started with the "-d 0" option (hard coded into the
start_mpich2 program). But that's another story.

 

Todd

 

 

 

 

 

-----Original Message-----
From: Andreas.Haas at Sun.COM [mailto:Andreas.Haas at Sun.COM] 
Sent: Tuesday, January 30, 2007 8:08 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] Open MPI tight integration in HOWTO page

 

Hi Todd,

 

On Mon, 29 Jan 2007, Heywood, Todd wrote:

 

> 

> Hi Andreas,

> 

> Thanks for the tip. My qmaster is starting up with a hard/soft

> descriptor limit of 8192. Also "ulimit -n" gives 1024. I tried

> increasing the ulimit and restarting qmaster, and got this...

> 

> 01/29/2007 10:00:51|qmaster|bhmnode2|I|qmaster hard descriptor limit is set to 8192
> 01/29/2007 10:00:51|qmaster|bhmnode2|I|qmaster soft descriptor limit is set to 8192
> 01/29/2007 10:00:51|qmaster|bhmnode2|I|qmaster will use max. 8172 file descriptors for communication
> 01/29/2007 10:00:51|qmaster|bhmnode2|I|qmaster will accept max. 99 dynamic event clients
> 01/29/2007 10:00:51|qmaster|bhmnode2|I|starting up 6.0u8
> 01/29/2007 10:00:51|qmaster|bhmnode2|W|FD_SETSIZE is limited to 8192 file descriptors on this system.
> 01/29/2007 10:00:51|qmaster|bhmnode2|W|If you want to support more than 8172 qmaster clients you have to
> 01/29/2007 10:00:51|qmaster|bhmnode2|W|recompile the source code with a higher FD_SETSIZE setting.
> 01/29/2007 10:00:51|qmaster|bhmnode2|W|Bug Link: http://gridengine.sunsource.net/issues/show_bug.cgi?id=1502

> 

> 

> First, how does qmaster come up with the number 8192 if "ulimit -n"
> gives 1024? Second, the bug link given in the qmaster messages file
> implies that recompiling with a higher FD_SETSIZE is only necessary
> if you have more than 1004 exec daemons. I don't.

 

The difference could be an outcome of hard vs. soft. To check you

might try

 

    # ulimit -H -n

 

Note that Grid Engine raises the fd limit to the maximum allowed by
the hard limit, but 8192 really should suffice in any case.
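
Both limits can be printed side by side to see whether they differ. This is a small sketch, assuming a shell whose ulimit builtin accepts the -S, -H and -n options (bash and dash do); it reports the limits of the shell that runs it, which need not match those of an already running qmaster.

```sh
#!/bin/sh
# Print the soft and hard open-file-descriptor limits of this shell.
# -S selects the soft limit, -H the hard limit, -n the fd count.
soft=$(ulimit -S -n)
hard=$(ulimit -H -n)
echo "soft fd limit: $soft"
echo "hard fd limit: $hard"
```

A soft limit lower than the hard limit would be consistent with a daemon raising its own soft limit up to the hard one at startup.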

 

> I have noticed that "ulimit -n" needs to be increased for large

> standalone (not using SGE) OpenMPI jobs, but I am getting errors with

> the OpenMPI/SGE integration for much smaller numbers of MPI tasks, for

> which the MPI program runs fine outside of SGE.

 

OK. I understand these are the same errors as those in your recent
post.

 

> 

> I understand you were only suggesting to check the descriptor limits
> before digging further... but I'd like to understand why qmaster uses a
> limit of 8192 when ulimit -n = 1024. Also, why do you have a limit
> of 65536 for your cluster (how large is it!)?

 

It is actually a tiny cluster,

 

    # qhost | wc -l

       18

 

but the master runs on a Solaris machine.

 

As for this OpenMPI job launching problem, my recommendation would
be to further separate the SGE part from the OpenMPI part. You
could achieve this by setting up a second parallel environment
with the sole purpose of having a test case that mimics the
SGE OpenMPI tight integration as closely as possible, however
without(!) actually using OpenMPI. This test case would then allow
you to track down any possible SGE problem in isolation,
if there is one. If you fail to reproduce the error behaviour with
this test case, we would need to think further.

 

If you agree that such an approach could be worthwhile, I
would send you a small job script that I use in similar cases.
All this job does is loop over the hostfile and run
'qrsh -inherit ... &' in order to launch a task for each entry.
After the loop is through it does a 'wait' for synchronization.

 

Best regards,

Andreas

 
