[GE users] SGE on latest Mac OS X Server 10.5.4 - help with non-root users

Chris Dagdigian dag at sonsorol.org
Tue Jul 8 15:30:04 BST 2008


Hi Ian,

Already saw the bug report -- the comments and debug output attached
to that issue were added by me ('craff' on sunsource). I was able to
reproduce the problem on a clean Mac OS X 10.5.4 Server over the
weekend without OpenDirectory/LDAP and without NFS, so the problem is
strictly with the OS and almost certainly with something that changed
in a recent OS X update. We can't reproduce the issue on our 10.5.4
laptops running OS X Client, which is a bit strange.

For a while my hypothesis was that accounts created using Workgroup
Manager (in either LDAP or local mode) were somehow broken, but I was
able to create user and sgeadmin accounts using nothing but the
command line 'dscl' program and those accounts also cannot be
resolved by SGE.
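
For anyone who wants to test account resolution outside of SGE, here
is a minimal sketch. It assumes the SGE error corresponds to a failed
standard password-database lookup (getpwnam-style); that is an
assumption, not something confirmed from the SGE source. The function
name lookup_user is made up for illustration.

```python
import pwd
import sys


def lookup_user(name):
    """Return basic passwd fields for `name`, or None if the lookup fails.

    pwd.getpwnam() raises KeyError when the system cannot resolve the
    account, which is the condition we want to detect here.
    """
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        return None
    return {
        "name": entry.pw_name,
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "home": entry.pw_dir,
        "shell": entry.pw_shell,
    }


if __name__ == "__main__":
    # Usage: pass one or more account names on the command line.
    for user in sys.argv[1:]:
        info = lookup_user(user)
        if info is None:
            print("%s: no password entry found" % user)
        else:
            print("%s: uid=%d gid=%d home=%s shell=%s" % (
                user, info["uid"], info["gid"], info["home"], info["shell"]))
```

Running this once as root and once as the affected user, on both a
working and a broken machine, may show whether plain lookups behave
differently from the lookups SGE performs.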

The issue is pretty consistent once you can get it to break (I have a
different Mac mini running OS X Client 10.5.4 where SGE works
perfectly) and it boils down to this:

- When installed as root, all user jobs fail with the 'can't get
password entry for user <username>' error
- When installed with a non-root admin user, all jobs fail with 'admin
user <username> does not exist'

The good news is that people far smarter than me are taking a look at  
it and I've made my server system accessible to a few people who are  
looking into things.

The bad news is I may have to migrate a new client cluster to Platform  
LSF as not being able to get SGE to run for more than a week is pretty  
embarrassing.

-Chris


On Jul 8, 2008, at 9:18 AM, Ian Levesque wrote:

> Hi Chris,
>
> I posted to the list about this problem recently, you should see the  
> thread in the archives. I created a bug report on sunsource if you'd  
> like to add your observations: http://gridengine.sunsource.net/issues/show_bug.cgi?id=2636
>
> Cheers,
> Ian
>
>
> On Jul 3, 2008, at 5:45 PM, Chris Dagdigian wrote:
>
>> Hi folks,
>>
>> Skip this message if you don't want to be overwhelmed with SGE  
>> debug output ...
>>
>> I've got a brand new OS X Apple cluster running the 10.5.4 server  
>> release that only came out a few days ago.
>>
>> Right from the beginning I had "can't get password entry for  
>> user..." errors so I stripped the system down to the bare essentials:
>>
>> - No open directory / LDAP
>> - No NFS
>> - All user accounts local
>> - All user accounts using UIDs less than 1024
>>
>> My test account 'dag' is local and all system commands like 'id',  
>> 'finger' and even the OS X command line commands like 'dscl' all  
>> resolve the account info perfectly fine. The system search path is  
>> correct as well - pointing at /Local/Default and no LDAP servers.
>>
>> Even in a single-node, no-NFS, no-LDAP environment I still can't  
>> get SGE 6.0, 6.1 or 6.2beta2 to function for non-root users.
>>
>> With courtesy binaries, "qrsh hostname" will hang forever and the  
>> qmaster logs will simply show the same old "can't get password  
>> entry for user "dag". Either the user does not exist or NIS error!"  
>> error.
>>
>> If I take the SGE 6.1 source code and patch it according to the  
>> blog article here:
>> http://gridengine.info/articles/2008/03/03/building-6-1u3-on-mac-osx-10-5-2-leopard-server
>>
>> ... then it still does not work but at least I get the "can't get
>> password entry" error coming to STDOUT instead of hanging the qrsh
>> process.
>>
>> What is pretty interesting though is if I run "qrsh hostname" with  
>> debug mode turned on, using the patched binaries.
>>
>> It seems that some parts of SGE are able to resolve my username and
>> UID just fine and other parts (qrsh starter perhaps) are not able to.
>>
>> Cutting from the verbose output, this is the interesting bit:
>>
>>>  163   8332 -1602449504     qlogin_starter sent: 1:can't get  
>>> password entry for user "dag". Either the user does not exist or  
>>> NIS error!
>>>  164   8332 -1602449504     ../clients/qsh/qsh.c 890 1: can't get  
>>> password entry for user "dag". Either the user does not exist or  
>>> NIS error!
>>>
>>>  165   8332 -1602449504     sge_set_auth_info: username(uid) =  
>>> dag(511), groupname = staff(20)
>>
>>
>> So sge_set_auth_info correctly resolves my non-root user and treats  
>> it as if it exists, yet right above that line is the "you don't  
>> exist" error message ...
>>
>>
>> I'm going to attach a text file with the full debug output from a  
>> "qrsh hostname" command below, I'm hoping someone will have some  
>> pointers or insights as to how to keep on troubleshooting this ...
>>
>> Regards,
>> Chris
>>
>> <sge-error.txt>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>

