[GE users] Re: User does not exist problems on Leopard Was: [GE users] Job does not exist

Chris Dagdigian dag at sonsorol.org
Wed Sep 3 14:05:00 BST 2008


Jonny --

Glad the launchd stuff worked for you. We have no idea why SGE fails  
to resolve users on some 10.5 Apple systems when started with the  
standard SGE scripts, yet functions perfectly when the daemons are  
under control of the launchd framework. I'm sure back in the mailing  
list archives are records of me whining about this for about a week  
and a half until we realized that the problem went away with launchd.

We (bioteam) even burned some of our ADC developer tokens to ask Apple  
engineers "what is different about launchd after 10.5.4?" and didn't  
really get a satisfactory answer.

To give you a data point -- all of the 10.5 SGE clusters we have  
converted over to launchd have been functioning without problems since  
the conversion.
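
For anyone hitting the same symptom: the "can't get password entry for
user" message in the qmaster log suggests the account could not be
looked up on the exec host at the moment the job started. A rough way
to test that outside of SGE is sketched below. It assumes that `id`
goes through the same user lookup as the execd, which may not hold on
every system, and the function names are made up for this example
(they are not part of SGE). The check is only meaningful when run from
the same context the daemon is started in, not from an interactive
login session.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of SGE.

# Succeeds (exit 0) only if the named account can be resolved.
# `id -u NAME` exits non-zero when NAME cannot be looked up.
user_resolves() {
    id -u "$1" >/dev/null 2>&1
}

# Prints a one-line verdict for the named account.
report_user() {
    if user_resolves "$1"; then
        echo "$1: resolves"
    else
        echo "$1: does NOT resolve"
    fi
}
```

Calling `report_user jhunt` from the daemon's startup context, and again
from a login shell, shows whether the two contexts disagree about the
user.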

Regards,
Chris



On Sep 3, 2008, at 3:12 AM, Jonathan Hunt wrote:

> Hi all,
>
> I have just found the answer to my problem here:
> http://blog.bioteam.net/2008/07/15/sge-launchd-script-maker-for-apple-os-x-105-leopard/
>
> Thanks for everyone's help. I'm happy now - I NEEDed this cluster!
>
> Cheers,
> Jonny
>
> On Wed, Sep 3, 2008 at 4:57 PM, Jonathan Hunt <jjh at 42quarks.com> wrote:
>> Hi,
>>
>> Just to recap: I am trying to set up SGE on Leopard 10.5.4 with NFS
>> shares and OpenDirectory users. The nodes work if I log in (as user
>> jhunt) and run
>> sudo killall sge_execd
>> sudo $SGEROOT/default/common/sgeexecd
>>
>> As soon as I log out, the nodes fail with errors that the jobs do
>> not exist.
>>
>> I found a corresponding error in the qmaster log files which I think
>> helps me understand a bit of what's going on. It says:
>>
>> 09/03/2008 16:34:01|worker|qbi-xgrid-01|W|job 71.1 failed on host qbi-xgrid-02.qbi.uq.edu.au general before job because: 09/03/2008 16:34:00 [0:64986]: can't get password entry for user "jhunt"
>> 09/03/2008 16:34:01|worker|qbi-xgrid-01|W|rescheduling job 71.1
>> 09/03/2008 16:34:01|worker|qbi-xgrid-01|E|queue all.q marked QERROR as result of job 71's failure at host qbi-xgrid-02.qbi.uq.edu.au
>>
>> The user jhunt is an OpenDirectory user. I can ssh into the box as
>> that user with no problems. So somehow logging out is causing
>> problems finding my user password etc. It appears from Googling that
>> this problem was encountered when first porting SGE to Leopard. Does
>> anyone know how to fix it now? If anyone knows of binaries posted
>> online for SGE 6.2 that might work better than mine please let me
>> know.
>>
>> Any help appreciated.
>> Jonny
>>
>>
>> On Tue, Sep 2, 2008 at 7:48 PM, Jonathan Hunt <jjh at 42quarks.com> wrote:
>>> On Tue, Sep 2, 2008 at 7:44 PM, Ravi Chandra Nallan
>>> <Ravichandra.Nallan at sun.com> wrote:
>>>> Jonathan Hunt wrote:
>>>>>
>>>>> On Tue, Sep 2, 2008 at 2:17 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>>>
>>>>>>
>>>>>> Do you have the spool directory of the nodes local or also on NFS?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>
>>>>> I have tried both local and on NFS and get the same problem.
>>>>>
>>>>> Thanks,
>>>>> Jonny
>>>>>
>>>>>
>>>>
>>>> Can you check the permissions of the local spool directory on the
>>>> exec node?
>>>>
>>>> --
>>>> regards,
>>>> ~Ravi
>>>
>>> qbi-xgrid-02:qbi-xgrid-02 jhunt$ pwd
>>> /sge/default/spool/qbi-xgrid-02
>>> qbi-xgrid-02:qbi-xgrid-02 jhunt$ ls -le
>>> total 8
>>> drwxr-xr-x  2 nobody  nobody    68 Sep  1 23:46 active_jobs
>>> -rw-r--r--  1 nobody  nobody     6 Sep  1 23:39 execd.pid
>>> drwxr-xr-x  2 nobody  nobody    68 Sep  1 23:46 job_scripts
>>> drwxr-xr-x  2 nobody  nobody    68 Sep  1 23:46 jobs
>>> -rw-r--r--  1 nobody  nobody  1930 Sep  1 23:46 messages
>>> qbi-xgrid-02:qbi-xgrid-02 jhunt$
>>>
>>>
>>> Thanks for trying to help. Any conclusions much appreciated,
>>> Jonny
>>>
>>> --
>>> Jonathan J Hunt <jjh at 42quarks.com>
>>> Homepage: http://www.42quarks.net.nz/wiki/JJH
>>> (Further contact details there)
>>> "Physics isn't the most important thing. Love is." Richard Feynman
>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

