[GE users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads

flengyel flengyel at gc.cuny.edu
Wed Jul 8 22:13:47 BST 2009





-----Original Message-----
From: fx [mailto:d.love at liverpool.ac.uk]
Sent: Wed 7/8/2009 12:19 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads

flengyel <flengyel at gc.cuny.edu> writes:

> OpenMPI+SGE tight integration works on E6600 core duo systems but not
> on Q9550 quads.

Presumably it's nothing to do with the hardware, just your SGE
configuration somehow.  Check the logs for more information.

It doesn't say which version of OMPI, but note that 1.3.0 and 1.3.1 have
broken SGE integration -- see a post of mine in the archive for a
workaround if you can't use 1.3.2.
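
As a rough illustration of the version caveat above, a job script could check for the affected releases before deciding whether a workaround is needed. This is only a sketch: the function name `needs_sge_workaround' is hypothetical, the list of versions is taken solely from this message, and treating a bare "1.3" as 1.3.0 is an assumption.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the given Open MPI version string is one
# of the releases named in this thread as having broken SGE integration.
# $1 = version string, e.g. "1.3.1"
needs_sge_workaround() {
    case "$1" in
        1.3|1.3.0|1.3.1) return 0 ;;   # affected; "1.3" == 1.3.0 is an assumption
        *)               return 1 ;;   # anything else, e.g. 1.2.7 or 1.3.2
    esac
}

# Example use:
#   if needs_sge_workaround 1.3.1; then echo "apply workaround"; fi
```

How the version string is obtained is left open here; it has to come from the local Open MPI installation.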



Hi,

I assume you mean the message quoted below. My original post mentioned
Open MPI 1.2.7; the suggestion not to daemonize the orteds (I almost wrote
orcas) did not work for me:

[flengyel at nept OPENMPI]$ tail -f sum.e23307
Starting server daemon at host "m18.gc.cuny.edu"
Starting server daemon at host "m19.gc.cuny.edu"
Server daemon successfully started with task id "1.m18"
Server daemon successfully started with task id "1.m19"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m19.gc.cuny.edu ...
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m18.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... /usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m19.gc.cuny.edu:05603] ERROR: A daemon on node m19.gc.cuny.edu failed to start as expected.
[m19.gc.cuny.edu:05603] ERROR: There may be more information available from
[m19.gc.cuny.edu:05603] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m19.gc.cuny.edu:05603] ERROR: If the problem persists, please restart the
[m19.gc.cuny.edu:05603] ERROR: Grid Engine PE job
[m19.gc.cuny.edu:05603] ERROR: The daemon exited unexpectedly with status 129.
129
[m19.gc.cuny.edu:05603] ERROR: A daemon on node m18.gc.cuny.edu failed to start as expected.
[m19.gc.cuny.edu:05603] ERROR: There may be more information available from
[m19.gc.cuny.edu:05603] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m19.gc.cuny.edu:05603] ERROR: If the problem persists, please restart the
[m19.gc.cuny.edu:05603] ERROR: Grid Engine PE job
[m19.gc.cuny.edu:05603] ERROR: The daemon exited unexpectedly with status 129.



[GE users] workaround for Open MPI 1.3

Author:    fx
Full name: fx
Date:      2009-03-30 08:24:25 PDT
Message:

This was brought up on the Open MPI list, but I don't think it's been
noted here:

There's a bug in Open MPI 1.3.0 and 1.3.1 (to be fixed in 1.3.2) which
breaks tight integration in SGE. The workaround is to give `mpirun'
arguments `--mca orte_leave_session_attached 1' or
`--leave-session-attached'. Alternatively, configure the default;
adding this to your <openmpi home>/etc/openmpi-mca-params.conf does the
trick:

  # Workaround for pre-1.3.2 bug -- avoid daemonizing orteds
  orte_leave_session_attached = 1
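
For illustration, the two forms of the workaround described above might be applied roughly as follows. This is a sketch under stated assumptions, not a tested recipe: the helper name `enable_leave_session_attached', the prefix /opt/openmpi, and the program name ./my_mpi_program are hypothetical; $NSLOTS is the slot count SGE sets for a parallel job.

```shell
#!/bin/sh
# Sketch: apply the orte_leave_session_attached workaround for
# Open MPI 1.3.0/1.3.1 under SGE tight integration.

# Append the setting to an MCA parameter file unless it is already there,
# so that running the helper twice does not duplicate the entry.
# $1 = path to openmpi-mca-params.conf
enable_leave_session_attached() {
    conf=$1
    if ! grep -q '^[[:space:]]*orte_leave_session_attached' "$conf" 2>/dev/null; then
        printf '%s\n' \
            '# Workaround for pre-1.3.2 bug -- avoid daemonizing orteds' \
            'orte_leave_session_attached = 1' >> "$conf"
    fi
}

# Site-wide default (hypothetical prefix):
#   enable_leave_session_attached /opt/openmpi/etc/openmpi-mca-params.conf
#
# Per-job alternative inside an SGE job script, leaving the config untouched:
#   mpirun -np "$NSLOTS" --mca orte_leave_session_attached 1 ./my_mpi_program
```

The per-job form affects only jobs that pass the flag; the config-file form changes the default for every job using that Open MPI installation.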




------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206188
