[GE users] How to clear internal hostname cache?

Joe Landman landman at scalableinformatics.com
Wed Mar 22 05:01:41 GMT 2006



Ok.

You have a host named compute-0-7, created when Rocks was installed on 
this compute node.

At some point, something injected a network-0-0.local name for that 
node.

What we know:

1) it is not in DNS
2) it is not in /etc/hosts
3) the compute nodes and head nodes resolve correctly.
4) grid engine doesn't really know about it, that is, the gethostbyname 
failed
5) compute-0-7.q does not show up on the full list of queues (qstat -f) 
when run from one of the other compute nodes (and I presume the head 
node as well)
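
Points 1 and 2 are easy to re-check from the shell on any node. A minimal sketch, assuming a Linux box with awk and getent available; the helper name in_hosts_file is made up for illustration and is not part of SGE or Rocks:

```shell
#!/bin/sh
# in_hosts_file NAME [FILE]: succeed if NAME appears as a hostname or alias
# on a non-comment line of FILE (default /etc/hosts).
in_hosts_file() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# Usage on the node (the stray name is the one from this thread):
#   in_hosts_file network-0-0.local && echo "found in /etc/hosts"
#   getent hosts network-0-0.local  || echo "resolver does not know it"
```

The comparison is an exact, case-sensitive match on whole fields, so a name that only appears as part of a longer one is not reported.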

What this means

Simple version:  compute-0-7 SGE install is "broken" on the head node 
and on the compute node.

Complex version:  Still don't have all the details.

	uname -a

should tell you the name the machine thinks it is.  Somehow, when you 
run qstat on the broken compute node, it picks up the wrong name.  It 
doesn't look like it is coming from DNS though.  Must be config or 
config files.

So when you run qstat -f, it does many things.  Runs the uname.  Runs 
some other stuff.  Then it starts poring over config files.


stat64("/opt/sge", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/opt/sge/default", {st_mode=S_IFDIR|0755, st_size=31, ...}) = 0
stat64("/opt/sge/default/common", {st_mode=S_IFDIR|0755, st_size=4096, 
...}) = 0
open("/opt/sge/default/common/product_mode", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=4, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0) = 0xb7252000
read(3, "sge\n", 4096)                  = 4
close(3)                                = 0
munmap(0xb7252000, 4096)                = 0
open("/opt/sge/default/common/configuration", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=1598, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0) = 0xb7252000
read(3, "# Version: 5.3p5\n# \n# DO NOT MOD"..., 4096) = 1598
close(3)                                = 0
munmap(0xb7252000, 4096)                = 0
open("/opt/sge/default/common/act_qmaster", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=11, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0) = 0xb7252000
read(3, "qqqq.local\n", 4096)           = 11
read(3, "", 4096)                       = 0
close(3)                                = 0

Then it does a little magic, including DNS lookups, and then it tries to 
connect to the sge_commd daemon:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
fcntl64(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(535), 
sin_addr=inet_addr("10.1.0.2")}, 16) = -1 EINPROGRESS (Operation now 
in progress)
select(4, NULL, [3], NULL, {15, 0})     = 1 (out [3], left {15, 0})
read(3, 0xbfffb8db, 1)                  = -1 EAGAIN (Resource 
temporarily unavailable)
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
select(1024, NULL, [3], NULL, {60, 0})  = 1 (out [3], left {60, 0})
write(3, "\0\0\0\4\0\36\0\0\0\0p=6\256qstat\0\0\0\0\0\0\0\0\0\0\0"..., 
44) = 44
select(1024, [3], NULL, NULL, {60, 0})  = 1 (in [3], left {60, 0})


After this, it grabs the data (lots of reads from that socket handle 
"3") and pulls it back.

You can see all this if you run this on compute node 0-7

	strace qstat -f >& /tmp/m

(or if you are a bash person

	strace qstat -f > /tmp/m 2>&1

)

then run

	less /tmp/m
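
Most of the trace is noise. A rough sketch for pulling out just the files qstat opened and the addresses it tried to connect to, assuming the trace was saved to /tmp/m as above and that each syscall is on one line in the file (the wrapping in this mail is from the mailer). The function name trace_summary is made up, and on newer systems strace shows openat() instead of open(), which this pattern would not catch:

```shell
#!/bin/sh
# trace_summary FILE: from strace output in FILE, list the files that were
# opened successfully and the targets of connect() calls.
trace_summary() {
    # open("path", ...) = N, where N is a file descriptor: the open worked
    sed -n 's/^open("\([^"]*\)".*= [0-9][0-9]*$/open    \1/p' "$1"
    # connect(...) lines carry the peer port and address
    sed -n 's/^connect(.*htons(\([0-9]*\)).*inet_addr("\([^"]*\)").*/connect \2:\1/p' "$1"
}

# Usage:
#   trace_summary /tmp/m
```

Failed opens (those returning -1) are left out on purpose, so what remains is the list of config files that were actually read.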

Note:  strace is your friend.

Ok, so where is the problem?

Somehow, the sge_execd or the sge_commd must think that this node is 
named network-0-0.local.


Alright.  How do you fix it?

Let's start with the simple versions and work to the more complex:

1) restart sge_execd and sge_commd.

	/etc/init.d/rcsge stop

on the compute-0-7

Let it sit for a few seconds, and do an

	ps -ealf | grep -i sge

to see if it finished dying off.  If not, help it along.  If it doesn't 
die when asked nicely, try using kill -9 pid_of_sge_commd .

Make sure all the processes have exited.

now turn SGE back on

	/etc/init.d/rcsge start

Now retry the qstat -f.  I presume it will still not work.  If it 
doesn't, a reboot won't help either.
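
The "let it sit and check" part of step 1 can be scripted instead of eyeballing the ps output. A minimal sketch, assuming a ps that understands -e and -o comm= (procps on Linux does); the helper names still_running and wait_gone are made up for illustration:

```shell
#!/bin/sh
# still_running NAME: succeed if a process with exactly this command name exists.
still_running() {
    ps -e -o comm= | grep -qxF "$1"
}

# wait_gone NAME SECONDS: poll once a second until NAME has exited.
# Returns 0 once it is gone, 1 if it is still there after SECONDS.
wait_gone() {
    left=$2
    while still_running "$1"; do
        [ "$left" -le 0 ] && return 1
        sleep 1
        left=$((left - 1))
    done
    return 0
}

# Usage on compute-0-7, as root:
#   /etc/init.d/rcsge stop
#   for d in sge_execd sge_commd; do
#       wait_gone "$d" 10 || echo "$d is still running, help it along"
#   done
#   /etc/init.d/rcsge start
```

Matching on the exact command name avoids the usual problem of grep finding itself in the process list.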

2) turn off sge on compute node as above, and remove compute-0-7 from 
the head node sge.  This is more involved, but the idea is that you need 
to wipe all traces of compute-0-7 from the SGE directories.

Once this is done, you will have to manually add compute-0-7 back in. 
On the head node, do an

	qconf -ah compute-0-7

then on the compute-0-7, get into /opt/gridengine and run ./install_execd

Answer all the questions.  Don't use the old defaults.

This should work.
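
For the "wipe all traces" part of option 2, the usual qconf deletions would be along these lines. This is only a sketch: it prints the commands rather than running them, so you can review them first. The queue name compute-0-7.q follows the per-host queue naming seen earlier in this thread, the node may not be a submit host at all (in which case the last command simply complains), and removal_plan is a made-up helper name. The printed commands would be run on the head node as the SGE admin user, before the qconf -ah above.

```shell
#!/bin/sh
# removal_plan HOST: print (do not run) the qconf commands that remove HOST.
# The queue goes first, because an exec host that a queue still refers to
# cannot be deleted.
removal_plan() {
    host=$1
    echo "qconf -dq ${host}.q"   # delete the host's queue
    echo "qconf -de ${host}"     # delete the execution host entry
    echo "qconf -dh ${host}"     # remove it from the admin host list
    echo "qconf -ds ${host}"     # remove it from the submit host list, if present
}

removal_plan compute-0-7
```

Afterwards, the find on the head node from earlier in the thread is a reasonable way to see what is still left behind.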

3) if the above doesn't work, then you need to go to the tactical 
battlefield weaponry and re-install that compute node.  Remove all 
traces of compute-0-7 from the sge directories on the head node using 
the qconf tools or the qmon tool, and restart sge on the head node.

Note:  before you do the following, THIS WILL DESTROY DATA AND 
CONFIGURATION FILES ON THE COMPUTE NODE.   Don't do this unless you have 
no other choice.  Rocks will automatically set most everything up for 
you, but if you have made changes, you are going to need to re-replicate 
those changes.

Do this only if you are really sure you want to.  Remember, data and 
other bits will be lost forever from compute-0-7 if you follow these steps.

Once you are really, really sure you want to do this, log onto the 
compute-0-7 (DO NOT TYPE THIS ON THE HEAD NODE!!!!!!!) and type

	/boot/kickstart/cluster-kickstart

and then step away.

The node will be rebuilt.  SGE will be re-installed.  Everything should 
work.

Unless the Rocks database is toasted.  Or the distribution has been 
damaged.  But you have bigger worries if this is the case.

Joe


Kim Leng Goh wrote:
> On 3/22/06, Joe Landman <landman at scalableinformatics.com> wrote:
> 
>>Got it.
> 
> [...]
> 
>>Notice that compute-0-7 isn't in the list.
>>
>>On the head node, could you do a
>>
>>        find /opt/sge | grep -i "compute-0-7"
>>
>>and lets see what comes up.
>>
>>Solution in sight BTW.
> 
> 
> [root at frontend root]# find /opt/gridengine | grep -i "compute-0-7"
> /opt/gridengine/default/common/history/exechosts/compute-0-7.local
> /opt/gridengine/default/common/history/exechosts/compute-0-7.local/20050504_150604
> /opt/gridengine/default/common/history/exechosts/compute-0-7.local/20050504_150606
> /opt/gridengine/default/common/history/exechosts/compute-0-7.local/20050512_135828
> /opt/gridengine/default/common/history/exechosts/compute-0-7.local/20050513_133638
> /opt/gridengine/default/common/history/exechosts/compute-0-7.local/20060316_203236
> /opt/gridengine/default/common/history/exechosts/compute-0-7.local/20060316_203238
> /opt/gridengine/default/common/history/queues/compute-0-7.q
> /opt/gridengine/default/common/history/queues/compute-0-7.q/20050504_150605
> /opt/gridengine/default/common/local_conf/compute-0-7.local
> /opt/gridengine/default/spool/qmaster/admin_hosts/compute-0-7.local
> 
> 
> Thanks,
> KL
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
