[GE users] SGE 6.2u2 Install Fails "Admin User Missing" on all hosts

mhanby mhanby at uab.edu
Thu Sep 17 18:24:37 BST 2009


Some debugging, this code is in the start script:

HOST=`$utilbin_dir/gethostname -aname`

echo $HOST
rockstest.uabgrid.uab.edu

I can't find where in the start script it comes up with rockstest.local, so I'm thinking this must be happening in one of the executable binaries, perhaps sge_qmaster?

-----Original Message-----
From: mhanby [mailto:mhanby at uab.edu] 
Sent: Thursday, September 17, 2009 12:18 PM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] SGE 6.2u2 Install Fails "Admin User Missing" on all hosts

Now on to the second problem, the hostname of the qmaster.

rockstest.uabgrid.uab.edu
rockstest.local

The sgemaster script and the $SGE_ROOT/default/common/act_qmaster file can't seem to agree on what the host name should be.

I noticed that the installer had started the sge_qmaster process, so I decided to stop it:

/etc/init.d/sgemaster.p536 stop

no output

/etc/init.d/sgemaster.p536 start
sge_master didn't start!
This is not a qmaster host!
Please, check your act_qmaster file!

cat $(find $SGE_ROOT -name act_qmaster)
rockstest.local

If I edit the act_qmaster file and replace "rockstest.local" with "rockstest.uabgrid.uab.edu" and then run the stop command:

/etc/init.d/sgemaster.p536 stop
  shutting down Grid Engine qmaster

And ps confirms that no sge processes are running.

Now, if I try to start it again (remember that act_qmaster has the FQDN in it):
/etc/init.d/sgemaster.p536 start

  starting sge_qmaster
sge_qmaster is running on another host (rockstest.uabgrid.uab.edu)

The ps command now shows sge_qmaster is running.

If I cat act_qmaster again, the hostname is rockstest.local

cat $(find $SGE_ROOT -name act_qmaster)
rockstest.local
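
For reference, the mismatch can be checked without starting or stopping any daemon. The sketch below is only an assumption about how one might compare the two names; check_act_qmaster is a hypothetical helper, not part of the SGE scripts, and it takes the file path and the expected host name as arguments instead of calling gethostname itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): compare the name stored in an
# act_qmaster file with the name this host is expected to report.
# Usage: check_act_qmaster <path-to-act_qmaster> <expected-hostname>
check_act_qmaster() {
    file=$1
    expected=$2
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 2
    fi
    # Use the first line only, lowercased, since host names are
    # case-insensitive; also drop stray whitespace and carriage returns.
    recorded=$(head -n 1 "$file" | tr '[:upper:]' '[:lower:]' | tr -d ' \t\r')
    wanted=$(printf '%s' "$expected" | tr '[:upper:]' '[:lower:]')
    if [ "$recorded" = "$wanted" ]; then
        echo "match: $recorded"
        return 0
    fi
    echo "mismatch: act_qmaster has '$recorded', host reports '$wanted'"
    return 1
}
```

It could be called with the act_qmaster path and the output of `$utilbin_dir/gethostname -aname` as the two arguments; a non-zero return means the names disagree.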

The /etc/hosts file has these entries:

127.0.0.1    localhost.localdomain localhost
172.99.99.1  rockstest.local rockstest # originally frontend-0-0
192.168.2.10 rockstest.uabgrid.uab.edu

If I swap "rockstest.local rockstest" to "rockstest rockstest.local" then rockstest will end up in the act_qmaster file.
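
In a hosts file the first name after the address is the canonical name and the rest are aliases, so a lookup of the address returns whichever name is listed first. That would be consistent with the swap changing what lands in act_qmaster, though it does not by itself show which binary does the lookup. A minimal sketch to print the canonical name for an address; canonical_name is a hypothetical helper and the file path is passed in:

```shell
#!/bin/sh
# Hypothetical helper: print the canonical (first) name listed for an
# address in a hosts-format file. Text after '#' is ignored.
# Usage: canonical_name <hosts-file> <ip-address>
canonical_name() {
    awk -v ip="$2" '
        { sub(/#.*/, "") }                       # strip trailing comments
        $1 == ip && NF >= 2 { print $2; exit }   # first name after the address
    ' "$1"
}
```

With the entries above, `canonical_name /etc/hosts 172.99.99.1` would print rockstest.local, and after the swap it would print rockstest.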

Any ideas why the host names are getting swapped around, bungled, etc...?

Thanks,

Mike

-----Original Message-----
From: mhanby [mailto:mhanby at uab.edu] 
Sent: Thursday, September 17, 2009 12:03 PM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] SGE 6.2u2 Install Fails "Admin User Missing" on all hosts

I used the GE 6.2u3 installer this time and encountered the same issue where the installer reports that the Admin user doesn't exist.

I found a work around.

This is on a Rocks 5.1 test cluster that has Gridengine 6.1u5 installed. The scripts /etc/profile.d/sge-binaries.{sh,csh} were causing problems with the installer. Those scripts are essentially the settings.{sh,csh} from the 6.1u5 install, and apparently the SGE_ROOT and other vars set in those scripts were confusing the installer.

I was able to work around it by either temporarily removing those from the qmaster node and exec nodes, or by unsetting the variables in my /root/.bash_profile on all of the nodes.
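
A minimal sketch of the second option, clearing the inherited variables in the shell that will run the installer. The variable list is an assumption based on what a typical settings.sh exports (only SGE_ROOT is confirmed above), and clear_sge_env is a hypothetical name; PATH and library path entries added by the old install are not touched here.

```shell
#!/bin/sh
# Sketch: clear Grid Engine variables inherited from an older install
# before running the new installer. The list below is an assumption;
# adjust it to whatever your old settings.sh actually exports.
clear_sge_env() {
    for var in SGE_ROOT SGE_CELL SGE_CLUSTER_NAME SGE_QMASTER_PORT SGE_EXECD_PORT; do
        unset "$var"
    done
}
```

The function has to run in the same shell as the installer (define or source it there, then call clear_sge_env), since unsetting variables in a child process has no effect on the parent.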

I would have thought the installer would override those variables or unset them since you can install new versions while other versions are running.

Anyway, figured I'd report the info.

Mike

-----Original Message-----
From: mhanby [mailto:mhanby at uab.edu] 
Sent: Friday, March 27, 2009 9:29 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] SGE 6.2u2 Install Fails "Admin User Missing" on all hosts

Thanks, that's what I figured. I've restored the virtual machine to a
snapshot prior to the install of SGE 6.2u2 so I can try again.

I'll download the latest binaries and give it another go.

Mike
-----Original Message-----
From: Lubomir.Petrik at sun.com [mailto:Lubomir.Petrik at sun.com] 
Sent: Thursday, March 26, 2009 12:35 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SGE 6.2u2 Install Fails "Admin User Missing" on all hosts

mhanby wrote:
> And on the compute node
> $ ssh compute-0-0 cat /tmp/check_test 
> /share/sge/utilbin/lx24-amd64/adminrun sge test -w /share/sge
> exit_code=0
>
> Odd thing, this time the user lookup succeeded and both the qmaster and
> exec host installed without error.
>
> I realized just now, I hadn't changed the permissions on the /share/sge
> from root to sge prior to running the install the first time. I did
> chown -R sge:sge /share/sge after the first install, so maybe that had
> something to do with it?
>   
That is strange. The adminrun binary is executable by everyone, so it doesn't
matter who owns it. Actually, doing chown -R sge:sge /share/sge is not a
very good idea. You may now want to run $SGE_ROOT/util/setfileperms.sh as
root, since some files must be owned by root (e.g.
utilbin/$ARCH/authuser).

Lubos.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=144176

To unsubscribe from this discussion, e-mail:
[users-unsubscribe at gridengine.sunsource.net].
