[GE users] sge stopped: error: getting configuration:

Daniel Templeton Dan.Templeton at Sun.COM
Tue May 3 15:50:44 BST 2005


The cause is that the qmaster was in the process of spooling some job 
information when it ran out of memory.  The partial job record that got 
written was confusing enough to the next qmaster that it segfaulted when 
trying to read it.
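
For the cleanup itself, here is a minimal sketch of moving the spooled job 
data aside rather than deleting it outright, along the lines of what Joachim 
suggested and what Patrice did by keeping a copy.  It assumes classic 
spooling and a stopped qmaster.  The function name quarantine_spooled_jobs 
and the backup location are made up for illustration, and recreating empty 
jobs and job_scripts directories is an assumption, not something confirmed 
in this thread.  Ownership of the recreated directories may need to be 
adjusted to match the SGE admin user.

```shell
# Hypothetical helper (not part of SGE): move <spooldir>/jobs and
# <spooldir>/job_scripts into a backup directory and leave empty
# directories in their place. Run only while qmaster is stopped.
quarantine_spooled_jobs() {
    spooldir=$1
    backup=$2

    if [ ! -d "$spooldir" ]; then
        echo "no such spool directory: $spooldir" >&2
        return 1
    fi

    mkdir -p "$backup" || return 1

    for d in jobs job_scripts; do
        # Refuse to overwrite or nest into an earlier backup.
        if [ -e "$backup/$d" ]; then
            echo "backup already contains $d: $backup/$d" >&2
            return 1
        fi
        if [ -d "$spooldir/$d" ]; then
            mv "$spooldir/$d" "$backup/$d" || return 1
            # Assumption: an empty directory is left behind for qmaster.
            mkdir "$spooldir/$d" || return 1
        fi
    done
}
```

It would be called with the qmaster spool directory and a backup directory 
of your choosing, e.g. quarantine_spooled_jobs <spooldir> /some/backup/dir, 
before starting qmaster again.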

Daniel

Patrice Hamelin wrote:

> Joachim,
> 
>   It works fine now but I have to clear all the running processes on the 
> nodes.
> 
>   What could have caused the corruption?  I kept a copy of the jobs and 
> job_scripts dirs.
> 
> Thanks to all (Daniel, Mac, Joachim)
> 
> Joachim Gabler wrote:
> 
>> Did qmaster write a core file?
>> Could you please try to produce a stack trace (using gdb)?
>>
>> Except for the jobs, all spooled files are ASCII files.
>> I would suggest deleting the spooled jobs: everything under 
>> <spooldir>/jobs and <spooldir>/job_scripts.
>>
>> If qmaster still doesn't start up, have a look at the other spooled 
>> objects:
>> <spooldir>/admin_hosts,
>> <spooldir>/calendars,
>> ....
>>
>> Just cat each of these files and check that it looks OK.
>>
>>  Joachim
>>
>>
>> Patrice Hamelin schrieb:
>>
>>> I am using classic spooling.
>>>
>>>
>>> [root at stokes util]# /opt/sge/bin/lx24-x86/sge_qmaster
>>> Reading in complex attributes.
>>> Reading in execution hosts.
>>> Reading in administrative hosts.
>>> Reading in submit hosts.
>>> Reading in host group entries:
>>>         Host group entries for group "@allhosts".
>>>         Host group entries for group "@single".
>>>         Host group entries for group "@multi".
>>>         Host group entries for group "@gal".
>>>         Host group entries for group "@bigmem".
>>>         Host group entries for group "@multi2".
>>> Reading in usersets:
>>>         Userset "gal".
>>>         Userset "defaultdepartment".
>>>         Userset "deadlineusers".
>>>         Userset "admin".
>>> Reading in queues:
>>>         Queue "batch".
>>>         Queue "multi".
>>>         Queue "single".
>>>         Queue "bigmem".
>>>         Queue "multi2".
>>> Reading in parallel environments:
>>>         PE "mpich_1".
>>>         PE "mpich_2".
>>> Reading in ckpt interface definitions:
>>>         CKPT "blcr".
>>> Reading in Master_Job_List.
>>> .
>>>
>>> read job database with 15 entries in 1 seconds
>>> Segmentation fault
>>> [root at stokes util]#
>>>
>>>
>>> Joachim Gabler wrote:
>>>
>>>> Patrice,
>>>>
>>>> What spooling method are you using (classic / berkeleydb)?
>>>>
>>>> Please try to start up qmaster in debug mode:
>>>> In a shell as user root:
>>>> source $SGE_ROOT/util/dl.(c)sh
>>>> dl 1
>>>> $SGE_ROOT/bin/<arch>/sge_qmaster
>>>>
>>>> This might show some error messages, e.g. when reading jobs from disk.
>>>>
>>>>   Joachim
>>>>
>>>> Daniel Templeton schrieb:
>>>>
>>>>> Looks to me like running out of memory caused the qmaster to leave 
>>>>> the cluster in a broken state.  You'll need to clean up whatever 
>>>>> the qmaster left behind.  That may involve using utilbin/spooledit 
>>>>> or deleting jobs from the spool directory.  Unfortunately, you'll 
>>>>> need the advice of someone who actually recovers broken clusters 
>>>>> instead of just reinstalling them like I do.  Joachim?  Stephan?  
>>>>> Omar?
>>>>>
>>>>> Daniel
>>>>>
>>>>> Patrice Hamelin wrote:
>>>>>
>>>>>> After running sgemaster, qmaster is NOT running.  I run SGE 6.0u1 
>>>>>> on RedHat Linux 7.3.  See my other message: I had a memory problem 
>>>>>> yesterday, which I think caused the problem.
>>>>>>
>>>>>> Thanks guys for help!
>>>>>>
>>>>>> Daniel Templeton wrote:
>>>>>>
>>>>>>> After running sgemaster, is your qmaster running?  What platform 
>>>>>>> and SGE version?  Was there an event which caused the qmaster to 
>>>>>>> stop?
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> Patrice Hamelin wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>   My qmaster stopped a couple of hours ago and I cannot 
>>>>>>>> restart it.
>>>>>>>> I always get:
>>>>>>>>
>>>>>>>> [root at stokes common]#  /etc/init.d/sgemaster start
>>>>>>>>    starting sge_qmaster
>>>>>>>>    starting sge_schedd
>>>>>>>> error: getting configuration: unable to contact qmaster using 
>>>>>>>> port 536
>>>>>>>> on host "stokes.clumeq.mcgill.ca"
>>>>>>>> can't get configuration from qmaster -- waiting ...
>>>>>>>>
>>>>>>>>
>>>>>>>>   thanks for help!
>>>>>>>
>>>>>>> --------------------------------------------------------------------- 
>>>>>>>
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
> 

