[GE users] Open MPI tight integration in HOWTO page

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Wed Feb 7 13:35:05 GMT 2007


Glad to hear Open MPI is now working for you, and many thanks
for the explanation!

First of all, I honestly fail to understand why the Grid Engine
communication layer throws these error conditions at all. I know
that regular Grid Engine application code performs host name
resolution via gethostbyname(3) through our communication layer,
but I am still curious how this can happen. If you get a chance,
it would be kind of you to enable slapd daemon logging again,
just for testing purposes, and do a

    # setenv SGE_COMMLIB_DEBUG 3

before you run the qstat command. This makes Grid Engine write a
number of log messages to stderr, which will help us analyze the
phenomenon as we did with

    http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=16841
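
If it is more convenient to capture that stderr output from a script, a
minimal Python sketch could look like the following. Only the
SGE_COMMLIB_DEBUG variable, the level 3 and the qstat command come from
this mail; the helper name run_with_commlib_debug is hypothetical, and
the sketch assumes the command is on your PATH.

```python
import os
import subprocess


def run_with_commlib_debug(command, level=3):
    """Run `command` with SGE_COMMLIB_DEBUG set; return (exit code, stderr text).

    Hypothetical helper for illustration, not part of Grid Engine.
    """
    # Copy the current environment so only the child process sees the variable.
    env = dict(os.environ)
    env["SGE_COMMLIB_DEBUG"] = str(level)

    # The debug messages go to stderr, so that is the only stream captured.
    result = subprocess.run(
        command,
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=False,  # a failing qstat is exactly the case of interest
    )
    return result.returncode, result.stderr
```

Calling run_with_commlib_debug(["qstat"]) returns the exit code together
with whatever qstat wrote to stderr, which can then be saved to a file and
posted to the list.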

Secondly, I must say it is hard to understand how something as harmless
as enabling logging can have such a significant impact on slapd
performance. Any idea about the root cause of this?

Kind regards,
Andreas


On Tue, 6 Feb 2007, Heywood, Todd wrote:

> Hi Andreas,
>
> SGE and OpenMPI work fine on large jobs now, as long as the user
> authentication is done locally on nodes, not through LDAP. SGE must just
> assume that user authentication won't fail or time out once jobs are
> running?
>
> Even the GMSH error disappeared for small jobs (even when using LDAP).
> Go figure.
>
> It seems LDAP's slapd daemon gets overwhelmed by the large number of
> SGE+MPI authentication requests. Running MPI jobs standalone, once there
> are more than X established connections to slapd, the MPI job hangs. X
> seems to be a function of the speed of slapd (not the number of
> descriptors), i.e. it is lower when slapd logging is enabled... the
> faster slapd is, the larger the number of MPI tasks that can run.
>
> It seems SGE doesn't know how to handle this situation, since qstat
> gives the "fatal error... aborting" message.
>
> I would have thought other clusters would have been using LDAP for
> authentication. The OpenMPI people say they don't test with LDAP, but
> assume authentication is done locally, per node.
>
> Todd
>
> -----Original Message-----
> From: Andreas.Haas at Sun.COM [mailto:Andreas.Haas at Sun.COM]
> Sent: Monday, February 05, 2007 8:06 AM
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] Open MPI tight integration in HOWTO page
>
> On Fri, 2 Feb 2007, Heywood, Todd wrote:
>
>> Hi,
>>
>> If you recall, I had 2 classes of errors: (1) the GMSH error while jobs
>> give output, and (2) complete failure for a large enough number of MPI
>> tasks, sometimes giving a grab bag of error messages (see first post on
>> this thread), and sometimes giving no output, but with qstat saying
>> "critical error: unrecoverable error - contact systems manager.
>> Aborted". This second case might be related to LDAP, as I found the
>> following messages in /var/log/messages of the job nodes:
>>
>> Feb  2 14:06:16 blade183 sge_execd: nss_ldap: reconnecting to LDAP server...
>> Feb  2 14:06:16 blade183 sge_execd: nss_ldap: reconnected to LDAP server after 1 attempt(s)
>> Feb  2 14:06:16 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
>> Feb  2 14:06:16 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
>> Feb  2 14:06:17 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
>> Feb  2 14:06:17 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
>> Feb  2 14:07:19 blade183 sge_shepherd-9194: nss_ldap: reconnecting to LDAP server...
>> Feb  2 14:07:19 blade183 sge_shepherd-9194: nss_ldap: reconnected to LDAP server after 1 attempt(s)
>>
>> Googling on LDAP plus various cluster/MPI/scalability topics shows up
>> nothing.
>
> Well, there is something wrong with your set-up in general, but I
> can't do more than guess. Try searching for "reconnecting LDAP
> nss_ldap". This gets you a number of hits.
>
> Note, in an earlier mail Reuti already asked this:
>
>    "Are you using any special communication lib? Myrinet, Infiniband,... ?"
>
> If that were the case, there is a chance it is somehow related.
>
> Regards,
> Andreas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




