[GE users] sge stopped: error: getting configuration:

Patrice Hamelin phamelin at clumeq.mcgill.ca
Tue May 3 15:13:35 BST 2005


It seems that something in my job database is corrupted, right?
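
One cheap thing to check with classic spooling is whether any job spool file ended up empty. This is only a sketch and does not assume that empty files are the cause here. The function name `list_empty_job_files` is hypothetical, and the default path is an assumption about where the qmaster spool lives (`/opt/sge` and the `default` cell are guesses used only when `SGE_ROOT` and `SGE_CELL` are unset).

```sh
#!/bin/sh
# Hypothetical helper: list zero-length regular files under a job spool directory.
# The default path is an assumption about a classic-spooling layout; pass the
# real jobs directory as the first argument if yours differs.
list_empty_job_files() {
    jobs_dir=${1:-"${SGE_ROOT:-/opt/sge}/${SGE_CELL:-default}/spool/qmaster/jobs"}
    if [ ! -d "$jobs_dir" ]; then
        echo "no such directory: $jobs_dir" >&2
        return 1
    fi
    # -size 0 matches files with no data at all; sort keeps the output stable.
    find "$jobs_dir" -type f -size 0 -print | sort
}
```

Calling `list_empty_job_files` with no argument scans the assumed default location. Moving a suspect file aside rather than deleting it keeps the option of putting it back.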

Patrice Hamelin wrote:
> I use classic spooling.
> 
> 
> [root@stokes util]# /opt/sge/bin/lx24-x86/sge_qmaster
> Reading in complex attributes.
> Reading in execution hosts.
> Reading in administrative hosts.
> Reading in submit hosts.
> Reading in host group entries:
>         Host group entries for group "@allhosts".
>         Host group entries for group "@single".
>         Host group entries for group "@multi".
>         Host group entries for group "@gal".
>         Host group entries for group "@bigmem".
>         Host group entries for group "@multi2".
> Reading in usersets:
>         Userset "gal".
>         Userset "defaultdepartment".
>         Userset "deadlineusers".
>         Userset "admin".
> Reading in queues:
>         Queue "batch".
>         Queue "multi".
>         Queue "single".
>         Queue "bigmem".
>         Queue "multi2".
> Reading in parallel environments:
>         PE "mpich_1".
>         PE "mpich_2".
> Reading in ckpt interface definitions:
>         CKPT "blcr".
> Reading in Master_Job_List.
> .
> 
> read job database with 15 entries in 1 seconds
> Segmentation fault
> [root@stokes util]#
> 
> 
> Joachim Gabler wrote:
> 
>> Patrice,
>>
>> what spooling method are you using (classic / berkeleydb)?
>>
>> Please try to startup qmaster in debug mode:
>> In a shell as user root:
>> source $SGE_ROOT/util/dl.(c)sh
>> dl 1
>> $SGE_ROOT/bin/<arch>/sge_qmaster
>>
>> This might show some error messages, e.g. when reading jobs from disk.
>>
>>   Joachim
>>
>> Daniel Templeton wrote:
>>
>>> Looks to me like running out of memory caused the qmaster to leave 
>>> the cluster in a broken state.  You'll need to clean up whatever the 
>>> qmaster left behind.  That may involve using utilbin/spooledit or 
>>> deleting jobs from the spool directory.  Unfortunately, you'll need 
>>> the advice of someone who actually recovers broken clusters instead 
>>> of just reinstalling them like I do.  Joachim?  Stephan?  Omar?
>>>
>>> Daniel
>>>
>>> Patrice Hamelin wrote:
>>>
>>>> After running sgemaster, qmaster is NOT running.  I run SGE 6.0u1 on 
>>>> Red Hat Linux 7.3.  See my other message: I had a memory problem 
>>>> yesterday, which I think caused this.
>>>>
>>>> Thanks for the help, guys!
>>>>
>>>> Daniel Templeton wrote:
>>>>
>>>>> After running sgemaster, is your qmaster running?  What platform 
>>>>> and SGE version?  Was there an event which caused the qmaster to stop?
>>>>>
>>>>> Daniel
>>>>>
>>>>> Patrice Hamelin wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>   My qmaster stopped a couple of hours ago and I cannot restart it.
>>>>>> I always get:
>>>>>>
>>>>>> [root@stokes common]#  /etc/init.d/sgemaster start
>>>>>>    starting sge_qmaster
>>>>>>    starting sge_schedd
>>>>>> error: getting configuration: unable to contact qmaster using port 536 on host "stokes.clumeq.mcgill.ca"
>>>>>> can't get configuration from qmaster -- waiting ...
>>>>>>
>>>>>>
>>>>>>   Thanks for the help!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
> 
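
For reference, the debug-mode startup Joachim describes in the quoted thread could be wrapped roughly as follows. This is a sketch, not a tested procedure. It assumes the stock `$SGE_ROOT/util/arch` script is available to fill in `<arch>` and uses the sh variant `dl.sh`. The function names `qmaster_path` and `run_qmaster_debug` and the log location are hypothetical.

```sh
#!/bin/sh
# Resolve the qmaster binary for this host, assuming $SGE_ROOT/util/arch
# exists and prints the architecture string (e.g. lx24-x86).
qmaster_path() {
    sge_root=${1:-$SGE_ROOT}
    if [ ! -x "$sge_root/util/arch" ]; then
        echo "cannot run $sge_root/util/arch" >&2
        return 1
    fi
    arch=$("$sge_root/util/arch") || return 1
    printf '%s/bin/%s/sge_qmaster\n' "$sge_root" "$arch"
}

# Hypothetical wrapper: start qmaster at debug level 1 and keep a copy of
# everything it prints. The default log path is an arbitrary choice.
run_qmaster_debug() {
    log=${1:-/tmp/qmaster-debug.log}
    bin=$(qmaster_path) || return 1
    . "$SGE_ROOT/util/dl.sh"    # defines the dl function (sh variant)
    dl 1
    "$bin" 2>&1 | tee "$log"
}
```

Run as root, `run_qmaster_debug` keeps the last lines before a crash in the log file, which are the ones worth posting back to the list.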

-- 
Patrice Hamelin ing, M.Sc.A, CCNA
Systems Administrator
CLUMEQ Supercomputer Centre
McGill University
688 Sherbrooke Street West, Suite 710
Montreal, QC, Canada H3A 2S6
Tel: 514-398-3344
Fax: 514-398-2203
http://www.clumeq.mcgill.ca
