[GE users] cant delete host from SGE

brs brs at usf.edu
Wed Dec 17 16:39:32 GMT 2008


FYI, I was actually trying to de-reference the nodes from a particular 
queue and re-provision them, not remove them.  But it's the same bug nonetheless.

-Brian

brs wrote:
> I had exactly the same problem with two execution hosts. I ended up 
> dumping the bdb from the spooling server with 'db_dump -p sge', removing 
> the references, db_load'ing the edited file, and restarting the spooling 
> server and qmaster. Definitely a bug of some kind.
>
> -Brian
>
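
For reference, the dump / edit / reload steps described above look roughly like the sketch below. It is not a tested procedure. The function names, the `run` helper and the `DRY_RUN` switch are made up for illustration, and the spool directory and dump file are whatever you pass in. It assumes the Berkeley DB `db_dump` and `db_load` utilities are on the PATH and that the database file is named `sge`, as in the command quoted above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: dump the spooling database to text, edit it by hand, reload it.
# Commands are printed instead of executed unless DRY_RUN=0 is set.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# dump_spooldb DIR DUMPFILE
# Back up DIR/sge, then write a printable (-p) dump to DUMPFILE.
dump_spooldb() {
    run cp -p "$1/sge" "$1/sge.bak"
    run db_dump -p -f "$2" "$1/sge"
}

# load_spooldb DIR DUMPFILE
# Load the hand-edited dump into a fresh file, DIR/sge.new.
# db_load adds records to an existing database and does not delete keys
# that are missing from the dump, hence the separate target file.
load_spooldb() {
    run db_load -f "$2" "$1/sge.new"
}
```

Typical use would be `dump_spooldb <spooldb dir> /tmp/sge.dump`, then removing the stale references from the dump in an editor, then `load_spooldb <spooldb dir> /tmp/sge.dump`. Swapping `sge.new` into place is left as a manual step. Doing that only while the spooling server and qmaster are stopped is my assumption, not something verified here.
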
> Alex Chekholko wrote:
>   
>> Hi,
>>
>> I had the same problem a while back, also running 6.1u3.
>>
>> I believe I did '$SGE_ROOT/inst_sge -ux nodename' when the manual qconf -de didn't work.
>>
>> And then maybe I restarted sgemaster, if the above didn't work?
>>
>> Regards,
>> Alex
>>
>> On Wed, 17 Dec 2008 03:39:37 -0800 (PST)
>> adary at marvell.com wrote:
>>
>>> Hi Andy,
>>>
>>> I actually checked each and every hostgroup, and lnx400 is not referenced in any of them.
>>>
>>> I also ran the three commands you listed just to be sure, and lnx400 is nowhere to be found.
>>>
>>> The SGE version is 6.1u3
>>>
>>> I'm pretty sure that this is a bug, since this is not the only host that behaves like this. I have at least 5 more (out of 300+ hosts in the grid).
>>>
>>>> Yuval,
>>>>
>>>> that might indicate there's a bug if lnx400 is also not referenced directly
>>>> in the "bulk" queue. Are you using any host aliasing via the "host_aliases" file?
>>>>
>>>> Can you do a check as follows:
>>>>
>>>>   qconf -sq bulk|grep "@"
>>>>      -> should show all hostgroups used by "bulk" queue
>>>>
>>>>   qconf -shgrp_tree <hostgroups_referenced_by_bulk_queue>
>>>>
>>>>   qconf -shgrp_resolved <hostgroups_referenced_by_bulk_queue>
>>>>
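
The checks above can be rolled into one small helper. This is a sketch under some assumptions: `find_host_refs` is a made-up name, and it relies on a `grep` that supports `-o`, `-w` and `-F`. Besides resolving every hostgroup the queue uses, it also greps the whole queue configuration for the host name, since a per-host override such as `[lnx400=...]` on some attribute would count as a reference too.

```shell
#!/bin/sh
# Sketch: report where a host is referenced by a cluster queue.
# Usage (hypothetical helper): find_host_refs QUEUE HOST

find_host_refs() {
    queue=$1
    host=$2

    # Lines of the queue configuration that mention the host directly,
    # e.g. in hostlist or in a per-host override like [lnx400=4].
    qconf -sq "$queue" | grep -wF -- "$host" | sed "s/^/queue $queue: /"

    # Hostgroup names (@name) that appear anywhere in the queue configuration.
    hgroups=$(qconf -sq "$queue" |
        grep -o '@[A-Za-z0-9_.-][A-Za-z0-9_.-]*' | sort -u)

    for grp in $hgroups; do
        # -shgrp_resolved prints the fully expanded host list of a hostgroup.
        # -w keeps lnx400 from matching lnx4001 but still matches an FQDN.
        if qconf -shgrp_resolved "$grp" | tr ' ' '\n' | grep -qwF -- "$host"; then
            echo "hostgroup $grp resolves to $host"
        fi
    done
}
```

For the case in this thread the call would be `find_host_refs bulk lnx400`. No output at all means neither the queue configuration nor any hostgroup it uses mentions the host.
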
>>>> Does the problem also occur after a qmaster restart?
>>>>
>>>> Which version/patch level are you using?
>>>>
>>>> Andy
>>>>
>>>> On Wed, 17 Dec 2008, Yuval Adar wrote:
>>>>
>>>>> In certain rare cases I'm not able to remove a host completely from SGE
>>>>>
>>>>> [117] root at sge_master ==>qconf -de lnx400
>>>>> Host object "lnx400" is still referenced in cluster queue "bulk".
>>>>>
>>>>> When I look at the bulk queue, it doesn't reference that host at all, and the host is not included in any hostgroup used by that queue. In fact, the host is not listed in any hostgroup at all:
>>>>>
>>>>> bash-3.00# for i in `qconf -shgrpl`; do qconf -shgrp $i | grep lnx400; done
>>>>> bash-3.00#
>>>>>
>>>>> Has anyone ever experienced something similar?
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=93002
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


-- 
Brian Smith
HPC Systems Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. LIB618
Office Phone: +1 813 974-1467
Organization URL: http://rc.usf.edu

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=93004
