Opened 11 years ago

Last modified 9 years ago

#537 new defect

IZ2636: Execution hosts enter Error state: admin_user does not exist

Reported by: levesque Owned by:
Priority: normal Milestone:
Component: sge Version: 6.2beta
Severity: Keywords: Macintosh Mac execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2636]

Issue #: 2636
Platform: Macintosh
Reporter: levesque (levesque)
Component: gridengine
OS: Mac OS X
Subcomponent: execution
Version: 6.2beta
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: pollinger (pollinger)
QA Contact: pollinger
URL:
Summary: Execution hosts enter Error state: admin_user does not exist
Status whiteboard:
Attachments:
  Date/filename: Mon Jul 7 07:46:00 -0700 2008: sge-debug-output.txt
  Description: Debug Level 4 Output from failing "qrsh hostname" command (text/plain)
  Submitted by: craffi

     Issue 2636 blocks:
   Votes for issue 2636:


   Opened: Wed Jul 2 07:37:00 -0700 2008 
------------------------


I have OS X Server 10.5 nodes running SGE 6.2b2. The nodes are using Open
Directory for authentication. The user "sgeadmin" is an OD account that is able
to log in on all nodes. The user "admin" is a local account on all nodes. Intel
10.5 execution nodes are entering an error state (often after running jobs
successfully several times) with qmaster reporting:

07/01/2008 12:45:22|worker|starbuck|W|job 37.1 failed on host
gaeta.mcb.harvard.edu general before prolog because: 07/01/2008 12:45:21
[501:33477]: admin_user "admin" does not exist

The error above appeared after reinstalling SGE and configuring it to use the
local "admin" account. The same error occurs with the network "sgeadmin" account.

Ian

   ------- Additional comments from craffi Wed Jul 2 07:40:49 -0700 2008 -------
I've got a cluster of Mac OS X server machines running 10.5.3 and can recreate
this problem.

For some reason, neither local user accounts nor accounts defined within Open
Directory / LDAP are being recognized by Grid Engine.


   ------- Additional comments from craffi Mon Jul 7 06:14:44 -0700 2008 -------
I've been able to replicate this problem on a clean Mac Mini (x86) running the
latest version of OS X Server (10.5.4).

***
The system is accessible via the internet and I'd be happy to grant remote
access to any interested party who may help in debugging.
***

The interesting thing is that the problem also shows up on a single-node
system with no NFS and no LDAP. With 10.5.4 I reliably get "admin user does not
exist" or "can't get password entry for user X" even with locally defined user
accounts.

-Chris
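
For anyone trying to reproduce this, a quick check of whether the accounts resolve through the system account database at all, outside of Grid Engine, may help narrow things down. The sketch below is only an illustration and is not taken from the SGE sources. It assumes Python is available on the node, and the helper names `account_resolves` and `report` are made up for this example. It looks up each name with the standard `pwd` module and reports whether an entry comes back.

```python
import pwd


def account_resolves(name):
    """Return True if the system account database has an entry for `name`."""
    try:
        pwd.getpwnam(name)
    except KeyError:
        # getpwnam raises KeyError when no entry is found for the name.
        return False
    return True


def report(names):
    """Map each account name to whether it resolves on this host right now."""
    return {name: account_resolves(name) for name in names}


if __name__ == "__main__":
    # "admin" and "sgeadmin" are the account names mentioned in this ticket.
    for name, found in report(["admin", "sgeadmin"]).items():
        print(f"{name}: {'found' if found else 'NOT FOUND'}")
```

If the names resolve here from a login shell while the daemons still report the error, that would suggest the difference lies in the environment the daemons are started in rather than in the accounts themselves. That is a guess, not a confirmed diagnosis.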



   ------- Additional comments from craffi Mon Jul 7 07:46:18 -0700 2008 -------
Created an attachment (id=178)
Debug Level 4 Output from failing "qrsh hostname" command

   ------- Additional comments from craffi Tue Jul 8 16:13:00 -0700 2008 -------
I have found a workaround for this issue. No root cause yet.

Hypothesis: The update to 10.5.4 Server has made both the SystemStarter
framework and the actual 'sgemaster' and 'sgeexecd' scripts unreliable for some
reason.

This problem so far has only appeared on 10.5.4 Server versions of OS X.

The workaround: Integrate Grid Engine start/stop procedures via the new OS X
'launchd' framework.

The launchd scripts I used are published here:
http://blog.bioteam.net/2008/03/04/apple-os-x-105-launchd-scripts-for-grid-engine/

I've also updated the launchd page on the SGE wiki:
http://wiki.gridengine.info/wiki/index.php/GridEngine_launchd


Every problem described in this issue went away once we abandoned the SGE
start/stop scripts and moved everything into launchd. I have no idea why this is
so :)

-Chris
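
To give a rough idea of what a launchd job definition looks like, here is a minimal sketch that writes one out with Python's standard `plistlib`. It is not the set of scripts linked above. The label `net.example.sgeexecd` and the path `/path/to/sge/bin/sge_execd` are placeholders invented for this example, and the helper name `build_launchd_job` is hypothetical. Only the three launchd keys `Label`, `ProgramArguments` and `RunAtLoad` are used. Any environment settings the daemon may need to run correctly under launchd are not covered here.

```python
import plistlib


def build_launchd_job(label, program_args):
    """Return an XML property list (bytes) describing a minimal launchd job."""
    job = {
        "Label": label,                          # unique job identifier
        "ProgramArguments": list(program_args),  # argv of the program to start
        "RunAtLoad": True,                       # start when the job is loaded
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    # Placeholder label and path; substitute the values for the real install.
    data = build_launchd_job(
        "net.example.sgeexecd",
        ["/path/to/sge/bin/sge_execd"],
    )
    print(data.decode("utf-8"))
```

The output can be saved as a .plist file and inspected before it is installed. The published scripts linked above remain the reference for the actual workaround.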


   ------- Additional comments from levesque Wed Jul 16 11:08:52 -0700 2008 -------
Chris' posted workaround did the trick for me on my OS X 10.5.2 Server cluster
running 6.2b2.

Attachments (1)

178 (196.4 KB) - added by dlove 9 years ago.


Change History (1)

Changed 9 years ago by dlove
