[GE users] SGE 6.2u2 Install Fails "Admin User Missing" on all hosts

mhanby mhanby at uab.edu
Thu Sep 17 19:01:16 BST 2009


Yes, this is the hosts file:

127.0.0.1    localhost.localdomain localhost
172.99.99.1  rockstest.local rockstest # originally frontend-0-0
192.168.2.10 rockstest.uabgrid.uab.edu

This is on the head node, so it's multihomed: eth1 (192.168.2.10) connects to the public network and eth0 (172.99.99.1) to the private cluster network.

It would seem that all of the processes should either use the FQDN or the short name, rockstest, not both. 
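
A rough way to see which name the hosts file treats as canonical for a given name or address is sketched below. It assumes the usual hosts-format convention that the first name after the address is the canonical one and later names are aliases. The function name hosts_canonical is made up for illustration and is not part of SGE or Rocks.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): print the canonical (first) name
# that a hosts-format file assigns to a given name or address.
# Usage: hosts_canonical HOSTS_FILE NAME_OR_ADDRESS
hosts_canonical() {
    awk -v want="$2" '
        { sub(/#.*/, "") }      # drop comments; awk re-splits the fields
        NF < 2 { next }         # skip blank or incomplete lines
        {
            for (i = 1; i <= NF; i++) {
                if (tolower($i) == tolower(want)) {
                    print $2    # field 2 is the canonical name
                    exit        # first matching line wins
                }
            }
        }
    ' "$1"
}
```

Run as hosts_canonical /etc/hosts rockstest against the file above, this should print rockstest.local, because the short name only appears as an alias on the private-network line. That would be consistent with the observation further down the thread that swapping the order of the two names on that line changes what ends up in act_qmaster.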


-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Thursday, September 17, 2009 12:36 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SGE 6.2u2 Install Fails "Admin User Missing" on all hosts

Am 17.09.2009 um 19:24 schrieb mhanby:

> Some debugging, this code is in the start script:
>
> HOST=`$utilbin_dir/gethostname -aname`
>
> echo $HOST
> rockstest.uabgrid.uab.edu
>
> I can't find where in the start script it comes up with  
> rockstest.local, so I'm thinking this must be happening in one of  
> the executable binaries, perhaps sge_qmaster?

Is it in /etc/hosts?

-- Reuti
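
To make the disagreement visible, the name stored in act_qmaster can be compared with the name the start script computes. The sketch below is only an illustration: compare_qmaster_name is a made-up helper, and the case-insensitive comparison is an assumption of this sketch, not necessarily what the sgemaster script does internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): compare the host name recorded
# in an act_qmaster file with a name supplied by the caller.
# Usage: compare_qmaster_name ACT_QMASTER_FILE RESOLVED_NAME
compare_qmaster_name() {
    act_file=$1
    resolved=$2
    # act_qmaster holds one host name; read the first line, strip whitespace.
    recorded=$(head -n 1 "$act_file" | tr -d '[:space:]')
    # Assumption of this sketch: compare in lower case, since host names
    # are case-insensitive.
    rec_lc=$(printf '%s' "$recorded" | tr '[:upper:]' '[:lower:]')
    res_lc=$(printf '%s' "$resolved" | tr '[:upper:]' '[:lower:]')
    if [ "$rec_lc" = "$res_lc" ]; then
        echo "match: $recorded"
        return 0
    fi
    echo "mismatch: act_qmaster=$recorded resolved=$resolved"
    return 1
}
```

For the situation in this thread, the first argument would be $SGE_ROOT/default/common/act_qmaster and the second the output of the gethostname -aname call quoted from the start script. With the values reported here it would print a mismatch between rockstest.local and rockstest.uabgrid.uab.edu.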


> -----Original Message-----
> From: mhanby [mailto:mhanby at uab.edu]
> Sent: Thursday, September 17, 2009 12:18 PM
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] SGE 6.2u2 Install Fails "Admin User  
> Missing" on all hosts
>
> Now on to the second problem, the hostname of the qmaster.
>
> rockstest.uabgrid.uab.edu
> rockstest.local
>
> The sgemaster script and the $SGE_ROOT/default/common/act_qmaster  
> file can't seem to agree on what the host name should be.
>
> I notice that the installer started the sge_qmaster process. I then  
> decide to stop it:
>
> /etc/init.d/sgemaster.p536 stop
>
> no output
>
> /etc/init.d/sgemaster.p536 start
> sge_master didn't start!
> This is not a qmaster host!
> Please, check your act_qmaster file!
>
> cat $(find $SGE_ROOT -name act_qmaster)
> rockstest.local
>
> If I edit the act_qmaster file and replace "rockstest.local" with  
> "rockstest.uabgrid.uab.edu" and then run the stop command:
>
> /etc/init.d/sgemaster.p536 stop
>   shutting down Grid Engine qmaster
>
> And ps confirms that no sge processes are running.
>
> Now, if I try to start it again (remember that act_qmaster has the  
> FQDN in it)
> /etc/init.d/sgemaster.p536 start
>
>   starting sge_qmaster
> sge_qmaster is running on another host (rockstest.uabgrid.uab.edu)
>
> The ps command now shows sge_qmaster is running
>
> If I cat act_qmaster again, the hostname is rockstest.local
>
> cat $(find $SGE_ROOT -name act_qmaster)
> rockstest.local
>
> The /etc/hosts file has these entries:
>
> 127.0.0.1    localhost.localdomain localhost
> 172.99.99.1  rockstest.local rockstest # originally frontend-0-0
> 192.168.2.10 rockstest.uabgrid.uab.edu
>
> If I swap "rockstest.local rockstest" to "rockstest  
> rockstest.local" then rockstest will end up in the act_qmaster file.
>
> Any ideas why the host names are getting swapped around, bungled,  
> etc...?
>
> Thanks,
>
> Mike
>
> -----Original Message-----
> From: mhanby [mailto:mhanby at uab.edu]
> Sent: Thursday, September 17, 2009 12:03 PM
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] SGE 6.2u2 Install Fails "Admin User  
> Missing" on all hosts
>
> I used the GE 6.2u3 installer this time and encountered the same  
> issue where the installer reports that the Admin user doesn't exist.
>
> I found a workaround.
>
> This is on a Rocks 5.1 test cluster that has Gridengine 6.1u5  
> installed. The scripts /etc/profile.d/sge-binaries.{sh,csh} were  
> causing problems with the installer. Those scripts are essentially  
> the settings.{sh,csh} from the 6.1u5 install, and apparently  
> SGE_ROOT and the other vars set in those scripts were the cause.
>
> I was able to work around it by either temporarily removing those  
> from the qmaster node and exec nodes, or by unsetting the variables  
> in my /root/.bash_profile on all of the nodes.
>
> I would have thought the installer would override those variables  
> or unset them since you can install new versions while other  
> versions are running.
>
> Anywho, figured I'd report the info.
>
> Mike
>
> -----Original Message-----
> From: mhanby [mailto:mhanby at uab.edu]
> Sent: Friday, March 27, 2009 9:29 AM
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] SGE 6.2u2 Install Fails "Admin User  
> Missing" on all hosts
>
> Thanks, that's what I figured. I've restored the virtual machine to a
> snapshot prior to the install of SGE 6.2u2 so I can try again.
>
> I'll download the latest binaries and give it another go.
>
> Mike
> -----Original Message-----
> From: Lubomir.Petrik at sun.com [mailto:Lubomir.Petrik at sun.com]
> Sent: Thursday, March 26, 2009 12:35 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] SGE 6.2u2 Install Fails "Admin User  
> Missing" on all hosts
>
> mhanby wrote:
>> And on the compute node
>> $ ssh compute-0-0 cat /tmp/check_test
>> /share/sge/utilbin/lx24-amd64/adminrun sge test -w /share/sge
>> exit_code=0
>>
>> Odd thing, this time the user lookup succeeded and both the qmaster
>> and exec host installed without error.
>>
>> I realized just now, I hadn't changed the ownership of /share/sge
>> from root to sge prior to running the install the first time. I did
>> chown -R sge:sge /share/sge after the first install, so maybe that
>> had something to do with it?
>>
>>
> That is strange. adminrun is executable by everyone, so it doesn't
> matter who owns it. Actually, doing chown -R sge:sge /share/sge is not
> a very good idea. You may now want to run
> $SGE_ROOT/util/setfileperms.sh as root, since some files must be
> owned by root (e.g. utilbin/$ARCH/authuser).
>
> Lubos.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=144176
>
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].
>
>
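
The workaround described earlier in the thread (unsetting the variables inherited from the old 6.1u5 settings scripts before running the installer) could look roughly like the sketch below. The thread only names SGE_ROOT explicitly. That the other variables also start with SGE_ is an assumption of this sketch, and clear_sge_env is a made-up function name.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): unset every exported variable
# whose name starts with SGE_ in the current shell, so that an installer
# started from this shell does not inherit settings from an older install.
# Assumption: the offending variables all share the SGE_ prefix.
clear_sge_env() {
    for name in $(env | sed -n 's/^\(SGE_[A-Za-z0-9_]*\)=.*/\1/p'); do
        unset "$name"
    done
}
```

The function has to be defined and called in the same shell that later starts the installer, since unsetting variables in a child process does not affect the parent. It does not touch PATH or similar variables, which the old settings scripts may also have modified.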

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=217677

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



