[GE users] Open MPI tight integration in HOWTO page

Heywood, Todd heywood at cshl.edu
Tue Feb 6 22:56:52 GMT 2007


Hi Andreas,

SGE and OpenMPI now work fine on large jobs, as long as user
authentication is done locally on the nodes rather than through LDAP.
Presumably SGE just assumes that user authentication won't fail or time
out once jobs are running?

Even the GMSH error disappeared for small jobs (even when using LDAP).
Go figure.

It seems LDAP's slapd daemon gets overwhelmed by the large number of
SGE+MPI authentication requests. When running MPI jobs standalone, once
there are more than X established connections to slapd, the MPI job
hangs. X seems to be a function of the speed of slapd (not the number of
descriptors), i.e. it is lower when slapd logging is enabled: the faster
slapd is, the larger the number of MPI tasks that can run.
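
One way to watch that connection count while a job starts up is to parse
netstat output on the LDAP server. This is only a rough sketch: it assumes
slapd listens on the default LDAP port 389 and that the text has the usual
Linux `netstat -tn` column layout. The function names are made up for
illustration.

```python
import subprocess


def count_established(netstat_output, port=389):
    """Count ESTABLISHED TCP connections whose local port equals `port`.

    Assumes `netstat -tn` style rows:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        if state == "ESTABLISHED" and local.rsplit(":", 1)[-1] == str(port):
            count += 1
    return count


def live_count(port=389):
    """Run netstat on this host (assumed to be the slapd server) and count."""
    out = subprocess.run(["netstat", "-tn"], capture_output=True, text=True).stdout
    return count_established(out, port)
```

Calling `live_count()` every second or so on the slapd host while the MPI
job launches would show roughly where X lies for a given slapd
configuration.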

It seems SGE doesn't know how to handle this situation, since qstat
gives the "fatal error... aborting" message.

I would have thought other clusters would have been using LDAP for
authentication. The OpenMPI people say they don't test with LDAP, but
assume authentication is done locally, per node.

Todd

-----Original Message-----
From: Andreas.Haas at Sun.COM [mailto:Andreas.Haas at Sun.COM] 
Sent: Monday, February 05, 2007 8:06 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] Open MPI tight integration in HOWTO page

On Fri, 2 Feb 2007, Heywood, Todd wrote:

> Hi,
>
> If you recall, I had 2 classes of errors: (1) the GMSH error while jobs
> give output, and (2) complete failure for a large enough number of MPI
> tasks, sometimes giving a grab bag of error messages (see first post on
> this thread), and sometimes giving no output, but with qstat saying
> "critical error: unrecoverable error - contact systems manager.
> Aborted". This second case might be related to LDAP, as I found the
> following messages in /var/log/messages of the job nodes:
>
> Feb  2 14:06:16 blade183 sge_execd: nss_ldap: reconnecting to LDAP server...
> Feb  2 14:06:16 blade183 sge_execd: nss_ldap: reconnected to LDAP server after 1 attempt(s)
> Feb  2 14:06:16 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
> Feb  2 14:06:16 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
> Feb  2 14:06:17 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
> Feb  2 14:06:17 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
> Feb  2 14:07:19 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
> Feb  2 14:07:19 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
>
> Googling on LDAP plus various cluster/MPI/scalability topics shows up
> nothing.

Well, there is something wrong with your set-up in general, but I
can't do more than guess. Try searching for "reconnecting LDAP
nss_ldap". This gets you a number of hits.
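
To see which daemons trigger the reconnects and how often, the nss_ldap
messages quoted from /var/log/messages could be tallied per host and
process. The following is a minimal sketch, assuming one unwrapped syslog
entry per line in the format quoted in this thread. The function name and
the regular expression are illustrative, not part of SGE or nss_ldap.

```python
import re
from collections import Counter

# Matches e.g.:
# "Feb  2 14:06:16 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server..."
RECONNECT = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+([^:\s]+):\s+nss_ldap: reconnecting to LDAP"
)


def tally_reconnects(lines):
    """Return a Counter keyed by (host, process name) of reconnect attempts."""
    counts = Counter()
    for line in lines:
        match = RECONNECT.match(line)
        if match:
            host, proc = match.groups()
            # Drop a trailing "-<pid>" so all shepherds are counted together.
            proc = proc.split("-")[0]
            counts[(host, proc)] += 1
    return counts
```

Feeding it the lines of /var/log/messages from each job node gives a
per-node count, which makes it easier to compare runs with and without
LDAP authentication.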

Note, in an earlier mail Reuti already asked this:

    "Are you using any special communication lib? Myrinet, Infiniband,... ?"

If that is the case, there is a chance it is somehow related.

Regards,
Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
