[GE users] sge stopped: error: getting configuration:

Joachim Gabler Joachim.Gabler at Sun.COM
Tue May 3 16:06:39 BST 2005


This is a problem that can occur with classic job spooling, as there is 
no transaction handling in classic spooling.
Jobs are spooled as multiple files. If not all files can be written (out 
of memory in your case, filesystem full is a more frequent situation), 
the job information can become inconsistent.
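
To make the failure mode concrete, here is a minimal Python sketch. It is 
not Grid Engine's spooling code: the function name, the file names and the 
simulated error are assumptions made up for illustration. A job is written 
as several files, one after another, and an error partway through leaves 
some files on disk while others are missing.

```python
import os
import tempfile

def spool_job_classic(spool_dir, job_id, parts, fail_after=None):
    """Write each part of a job to its own file, one after another.

    Hypothetical illustration only: there is no transaction, so an error
    after `fail_after` files leaves the earlier files on disk.
    """
    job_dir = os.path.join(spool_dir, str(job_id))
    os.makedirs(job_dir, exist_ok=True)
    for count, (name, text) in enumerate(parts.items()):
        if fail_after is not None and count >= fail_after:
            # Stand-in for "out of memory" or "filesystem full".
            raise OSError("simulated failure while spooling")
        with open(os.path.join(job_dir, name), "w") as handle:
            handle.write(text)

# Made-up job content; real spool files look different.
parts = {
    "common": "owner=alice\n",
    "task_1": "state=running\n",
    "script": "#!/bin/sh\n",
}
spool_dir = tempfile.mkdtemp()
try:
    spool_job_classic(spool_dir, 15, parts, fail_after=2)
except OSError:
    pass  # the writer died, but its partial output stays behind

written = sorted(os.listdir(os.path.join(spool_dir, "15")))
missing = sorted(set(parts) - set(written))
print("written:", written)
print("missing:", missing)
```

Whatever reads this directory next finds a job that is only partly there, 
which is the kind of inconsistency described above.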

With Berkeley DB, spooling a job is protected by a transaction, and will 
always succeed or fail as a whole, leaving behind a consistent database 
in any case.
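
The all-or-nothing behaviour can be sketched in a few lines of Python. 
Note the assumptions: this uses the sqlite3 module from the standard 
library as a stand-in for Berkeley DB, and the table layout, function name 
and simulated error are invented for the example. It only shows the 
transaction idea, not how qmaster actually spools.

```python
import sqlite3

def spool_job_transactional(conn, job_id, parts, fail_after=None):
    """Store all parts of a job inside one transaction (all or nothing)."""
    # Used as a context manager, the connection commits if the block
    # finishes and rolls back if an exception escapes it.
    with conn:
        for count, (name, text) in enumerate(parts.items()):
            if fail_after is not None and count >= fail_after:
                # Stand-in for "out of memory" or "filesystem full".
                raise OSError("simulated failure while spooling")
            conn.execute(
                "INSERT INTO job_parts (job_id, name, body) VALUES (?, ?, ?)",
                (job_id, name, text),
            )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_parts (job_id INTEGER, name TEXT, body TEXT)")
conn.commit()

# Made-up job content, for illustration only.
parts = {
    "common": "owner=alice\n",
    "task_1": "state=running\n",
    "script": "#!/bin/sh\n",
}

try:
    spool_job_transactional(conn, 15, parts, fail_after=2)
except OSError:
    pass  # the two rows inserted before the error were rolled back
rows_after_failure = conn.execute("SELECT COUNT(*) FROM job_parts").fetchone()[0]

spool_job_transactional(conn, 16, parts)
rows_after_success = conn.execute("SELECT COUNT(*) FROM job_parts").fetchone()[0]
print(rows_after_failure, rows_after_success)
```

After the failed attempt the table holds no trace of job 15, and the 
second job is stored completely, so a reader never sees half a job.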

What was your motivation to use classic spooling?

   Joachim


Daniel Templeton wrote:

> The cause is that the qmaster was in the process of spooling some job 
> information when it ran out of memory.  The partial job record that 
> got written was confusing enough to the next qmaster that it 
> segfaulted when trying to read it.
>
> Daniel
>
> Patrice Hamelin wrote:
>
>> Joachim,
>>
>>   It works fine now but I have to clear all the running processes on 
>> the nodes.
>>
>>   What could have caused the corruption?  I kept a copy of the jobs and 
>> job_scripts directories.
>>
>> Thanks to all (Daniel, Mac, Joachim)
>>
>> Joachim Gabler wrote:
>>
>>> Did qmaster write a core file?
>>> Could you please try to produce a stack trace (using gdb)?
>>>
>>> Except for the jobs, all spooled files are ASCII files.
>>> I would suggest deleting the spooled jobs: everything under 
>>> <spooldir>/jobs and <spooldir>/job_scripts.
>>>
>>> If qmaster still doesn't start up, have a look at the other spooled 
>>> objects:
>>> <spooldir>/admin_hosts,
>>> <spooldir>/calendars,
>>> ....
>>>
>>> Just cat each file and check that it looks OK.
>>>
>>>  Joachim
>>>
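
For a spool directory with many files, the manual check quoted above can 
be approximated with a small Python sketch. The function name and the two 
checks (empty file, bytes outside printable ASCII) are assumptions about 
what "looks OK" means, and the demo files are made up. It does not 
understand the Grid Engine file formats.

```python
import os
import string
import tempfile

# Byte values that count as printable ASCII, including newline and tab.
PRINTABLE = set(string.printable.encode("ascii"))

def suspicious_spool_files(directory):
    """Return (path, reason) pairs for files that do not look like plain text."""
    findings = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as handle:
                data = handle.read()
            if not data:
                findings.append((path, "empty"))
            elif any(byte not in PRINTABLE for byte in data):
                findings.append((path, "non-printable bytes"))
    return findings

# Demo on a throwaway directory with hypothetical file contents.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "good"), "w") as handle:
    handle.write("hostname node01\n")
with open(os.path.join(demo, "truncated"), "w"):
    pass  # zero-length file, as left by an interrupted write
with open(os.path.join(demo, "garbled"), "wb") as handle:
    handle.write(b"hostname \x00\x01\n")

reasons = {os.path.basename(p): why for p, why in suspicious_spool_files(demo)}
print(reasons)
```

Since the spooled jobs are not ASCII, this would be pointed at the other 
object directories only. A flagged file still needs a look by hand, and a 
file that passes can still be wrong.
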
>>>
>>> Patrice Hamelin wrote:
>>>
>>>> I am running classic spooling.
>>>>
>>>>
>>>> [root@stokes util]# /opt/sge/bin/lx24-x86/sge_qmaster
>>>> Reading in complex attributes.
>>>> Reading in execution hosts.
>>>> Reading in administrative hosts.
>>>> Reading in submit hosts.
>>>> Reading in host group entries:
>>>>         Host group entries for group "@allhosts".
>>>>         Host group entries for group "@single".
>>>>         Host group entries for group "@multi".
>>>>         Host group entries for group "@gal".
>>>>         Host group entries for group "@bigmem".
>>>>         Host group entries for group "@multi2".
>>>> Reading in usersets:
>>>>         Userset "gal".
>>>>         Userset "defaultdepartment".
>>>>         Userset "deadlineusers".
>>>>         Userset "admin".
>>>> Reading in queues:
>>>>         Queue "batch".
>>>>         Queue "multi".
>>>>         Queue "single".
>>>>         Queue "bigmem".
>>>>         Queue "multi2".
>>>> Reading in parallel environments:
>>>>         PE "mpich_1".
>>>>         PE "mpich_2".
>>>> Reading in ckpt interface definitions:
>>>>         CKPT "blcr".
>>>> Reading in Master_Job_List.
>>>> .
>>>>
>>>> read job database with 15 entries in 1 seconds
>>>> Segmentation fault
>>>> [root@stokes util]#
>>>>
>>>>
>>>> Joachim Gabler wrote:
>>>>
>>>>> Patrice,
>>>>>
>>>>> What spooling method are you using (classic / berkeleydb)?
>>>>>
>>>>> Please try to start up qmaster in debug mode.
>>>>> In a shell, as user root:
>>>>> source $SGE_ROOT/util/dl.(c)sh
>>>>> dl 1
>>>>> $SGE_ROOT/bin/<arch>/sge_qmaster
>>>>>
>>>>> This might show some error messages, e.g. when reading jobs from 
>>>>> disk.
>>>>>
>>>>>   Joachim
>>>>>
>>>>> Daniel Templeton wrote:
>>>>>
>>>>>> Looks to me like running out of memory caused the qmaster to 
>>>>>> leave the cluster in a broken state.  You'll need to clean up 
>>>>>> whatever the qmaster left behind.  That may involve using 
>>>>>> utilbin/spooledit or deleting jobs from the spool directory.  
>>>>>> Unfortunately, you'll need the advice of someone who actually 
>>>>>> recovers broken clusters instead of just reinstalling them like I 
>>>>>> do.  Joachim?  Stephan?  Omar?
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> Patrice Hamelin wrote:
>>>>>>
>>>>>>> After running sgemaster, qmaster is NOT running.  I run SGE 
>>>>>>> 6.0u1 on Red Hat Linux 7.3.  See my other message: I had a memory 
>>>>>>> problem yesterday, which I think caused this.
>>>>>>>
>>>>>>> Thanks guys for help!
>>>>>>>
>>>>>>> Daniel Templeton wrote:
>>>>>>>
>>>>>>>> After running sgemaster, is your qmaster running?  What 
>>>>>>>> platform and SGE version?  Was there an event which caused the 
>>>>>>>> qmaster to stop?
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> Patrice Hamelin wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>   My qmaster stopped a couple of hours ago and I cannot 
>>>>>>>>> restart it.
>>>>>>>>> I always get:
>>>>>>>>>
>>>>>>>>> [root@stokes common]#  /etc/init.d/sgemaster start
>>>>>>>>>    starting sge_qmaster
>>>>>>>>>    starting sge_schedd
>>>>>>>>> error: getting configuration: unable to contact qmaster using 
>>>>>>>>> port 536
>>>>>>>>> on host "stokes.clumeq.mcgill.ca"
>>>>>>>>> can't get configuration from qmaster -- waiting ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   thanks for help!
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------- 
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>>> For additional commands, e-mail: 
>>>>>>>> users-help at gridengine.sunsource.net



