[GE users] sge stopped: error: getting configuration:

Patrice Hamelin phamelin at clumeq.mcgill.ca
Tue May 3 15:10:00 BST 2005


I am using classic spooling. Starting qmaster by hand ends in a segmentation fault:


[root@stokes util]# /opt/sge/bin/lx24-x86/sge_qmaster
Reading in complex attributes.
Reading in execution hosts.
Reading in administrative hosts.
Reading in submit hosts.
Reading in host group entries:
         Host group entries for group "@allhosts".
         Host group entries for group "@single".
         Host group entries for group "@multi".
         Host group entries for group "@gal".
         Host group entries for group "@bigmem".
         Host group entries for group "@multi2".
Reading in usersets:
         Userset "gal".
         Userset "defaultdepartment".
         Userset "deadlineusers".
         Userset "admin".
Reading in queues:
         Queue "batch".
         Queue "multi".
         Queue "single".
         Queue "bigmem".
         Queue "multi2".
Reading in parallel environments:
         PE "mpich_1".
         PE "mpich_2".
Reading in ckpt interface definitions:
         CKPT "blcr".
Reading in Master_Job_List.
.

read job database with 15 entries in 1 seconds
Segmentation fault
[root@stokes util]#
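
For reference, the debug-mode start that Joachim describes in the quoted message below could be wrapped up roughly as follows for a Bourne-type shell. This is only a sketch under assumptions: the /opt/sge fallback and the lx24-x86 architecture string are taken from the transcript above, SGE_ARCH is a hypothetical override variable, and start_qmaster_debug is an illustrative name, not part of SGE.

```shell
#!/bin/sh
# Sketch: start sge_qmaster with debug output, following the steps quoted below.
# Assumption: SGE_ROOT is normally set by the site; /opt/sge is the path
# seen in the transcript and is used only as a fallback.
SGE_ROOT="${SGE_ROOT:-/opt/sge}"
# Assumption: SGE_ARCH is a hypothetical override; lx24-x86 is from the transcript.
ARCH="${SGE_ARCH:-lx24-x86}"

start_qmaster_debug() {
    # Refuse to continue if the debug-level helper script is missing.
    if [ ! -r "$SGE_ROOT/util/dl.sh" ]; then
        echo "cannot read $SGE_ROOT/util/dl.sh" >&2
        return 1
    fi
    # Sourcing dl.sh defines the "dl" shell function (csh users source dl.csh).
    . "$SGE_ROOT/util/dl.sh"
    # Select debug level 1, as suggested in the quoted message.
    dl 1
    # Start the master; any debug messages should appear on the terminal.
    "$SGE_ROOT/bin/$ARCH/sge_qmaster"
}
```

Run as root, this would be used by calling start_qmaster_debug and watching the output for errors while jobs are read from disk.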


Joachim Gabler wrote:
> Patrice,
> 
> what spooling method are you using (classic / berkeleydb)?
> 
> Please try to startup qmaster in debug mode:
> In a shell as user root:
> source $SGE_ROOT/util/dl.(c)sh
> dl 1
> $SGE_ROOT/bin/<arch>/sge_qmaster
> 
> This might show some error messages, e.g. when reading jobs from disk.
> 
>   Joachim
> 
> Daniel Templeton wrote:
> 
>> Looks to me like running out of memory caused the qmaster to leave the 
>> cluster in a broken state.  You'll need to clean up whatever the 
>> qmaster left behind.  That may involve using utilbin/spooledit or 
>> deleting jobs from the spool directory.  Unfortunately, you'll need 
>> the advice of someone who actually recovers broken clusters instead of 
>> just reinstalling them like I do.  Joachim?  Stephan?  Omar?
>>
>> Daniel
>>
>> Patrice Hamelin wrote:
>>
>>> After running sgemaster, qmaster is NOT running.  I run SGE 6.0u1 on 
>>> Red Hat Linux 7.3.  See my other message: I had a memory problem 
>>> yesterday, which I think caused this.
>>>
>>> Thanks guys for help!
>>>
>>> Daniel Templeton wrote:
>>>
>>>> After running sgemaster, is your qmaster running?  What platform and 
>>>> SGE version?  Was there an event which caused the qmaster to stop?
>>>>
>>>> Daniel
>>>>
>>>> Patrice Hamelin wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>   My qmaster stopped a couple of hours ago and I cannot restart it.
>>>>> I always get:
>>>>>
>>>>> [root@stokes common]#  /etc/init.d/sgemaster start
>>>>>    starting sge_qmaster
>>>>>    starting sge_schedd
>>>>> error: getting configuration: unable to contact qmaster using port 536
>>>>> on host "stokes.clumeq.mcgill.ca"
>>>>> can't get configuration from qmaster -- waiting ...
>>>>>
>>>>>
>>>>>   thanks for help!
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>
>>>
>>

-- 
Patrice Hamelin ing, M.Sc.A, CCNA
Systems Administrator
CLUMEQ Supercomputer Centre
McGill University
688 Sherbrooke Street West, Suite 710
Montreal, QC, Canada H3A 2S6
Tel: 514-398-3344
Fax: 514-398-2203
http://www.clumeq.mcgill.ca
