[GE users] sge stopped: error: getting configuration:

Patrice Hamelin phamelin at clumeq.mcgill.ca
Tue May 3 19:12:14 BST 2005



Joachim,

   At first, classic spooling was easier to install, as I was an SGE
rookie.  After this experience, I will install my u3 update with
Berkeley DB spooling.

Thanks again to all who helped me!

Joachim Gabler wrote:
> This is a problem that can occur with classic job spooling, as there is 
> no transaction handling in classic spooling.
> Jobs are spooled as multiple files. If not all files can be written (out 
> of memory in your case; a full filesystem is the more frequent cause), 
> the job information can become inconsistent.
> 
> With Berkeley DB, spooling a job is protected by a transaction, and will 
> always succeed or fail as a whole, leaving behind a consistent database 
> in any case.
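Joachim's point can be illustrated with a toy sketch (this is not SGE code; the directory layout and file names below are fabricated stand-ins): classic spooling writes one job as several plain files, so a crash between writes leaves a partial, inconsistent record on disk.

```shell
#!/bin/sh
# Toy illustration of a partial "classic" spool write. A real spool
# would live under <spooldir>/jobs and <spooldir>/job_scripts; here we
# fabricate one under mktemp so this is safe to run anywhere.
SPOOL=$(mktemp -d)
mkdir -p "$SPOOL/jobs" "$SPOOL/job_scripts"

# Spooling one job means writing several files...
echo "job_number 42" > "$SPOOL/jobs/42"
# ...and if the process dies here (out of memory, disk full), the
# matching job_scripts/42 is never written: the record is inconsistent.

# A later reader sees the job file but no script file.
test -f "$SPOOL/jobs/42" && echo "job record present"
test -f "$SPOOL/job_scripts/42" || echo "job script missing"
```

With a transactional store such as Berkeley DB, both pieces are written inside one transaction, so a half-written state like this cannot be observed.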
> 
> What was your motivation to use classic spooling?
> 
>   Joachim
> 
> 
> Daniel Templeton schrieb:
> 
>> The cause is that the qmaster was in the process of spooling some job 
>> information when it ran out of memory.  The partial job record that 
>> got written was confusing enough to the next qmaster that it 
>> segfaulted when trying to read it.
>>
>> Daniel
>>
>> Patrice Hamelin wrote:
>>
>>> Joachim,
>>>
>>>   It works fine now but I have to clear all the running processes on 
>>> the nodes.
>>>
>>>   What can be the cause of corruption?  I kept a copy of the jobs and 
>>> job_scripts dir.
>>>
>>> Thanks to all (Daniel, Mac, Joachim)
>>>
>>> Joachim Gabler wrote:
>>>
>>>> Did qmaster write a core file?
>>>> Could you please try to produce a stack trace (using gdb)?
>>>>
>>>> Except for the jobs, all spooled files are ASCII files.
>>>> I would suggest deleting the spooled jobs: everything under 
>>>> <spooldir>/jobs and <spooldir>/job_scripts.
>>>>
>>>> If qmaster still doesn't start up, have a look at the other spooled 
>>>> objects:
>>>> <spooldir>/admin_hosts,
>>>> <spooldir>/calendars,
>>>> ....
>>>>
>>>> Just cat all the files and verify that they look ok.
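The steps above can be sketched as a small script. The version here runs on a fabricated, throwaway copy of the spool layout so it is safe to execute as-is; against a real cluster you would point SPOOL at your actual <spooldir> with qmaster stopped, and keep the backup tarball in case the jobs are needed later.

```shell
#!/bin/sh
# Sketch of the cleanup steps above on a fabricated spool directory.
SPOOL=$(mktemp -d)/qmaster
mkdir -p "$SPOOL/jobs" "$SPOOL/job_scripts" "$SPOOL/admin_hosts"
touch "$SPOOL/jobs/42" "$SPOOL/job_scripts/42"
echo "stokes" > "$SPOOL/admin_hosts/stokes"

# 1. Back up the spooled jobs before deleting anything.
tar cf "$SPOOL/../jobs-backup.tar" -C "$SPOOL" jobs job_scripts
# 2. Delete the spooled jobs (the likely source of the corruption).
rm -rf "$SPOOL/jobs" "$SPOOL/job_scripts"
# 3. The other spooled objects are ASCII; cat them and check they look ok.
cat "$SPOOL"/admin_hosts/*
```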
>>>>
>>>>  Joachim
>>>>
>>>>
>>>> Patrice Hamelin schrieb:
>>>>
>>>>> I run the classic spooling
>>>>>
>>>>>
>>>>> [root at stokes util]# /opt/sge/bin/lx24-x86/sge_qmaster
>>>>> Reading in complex attributes.
>>>>> Reading in execution hosts.
>>>>> Reading in administrative hosts.
>>>>> Reading in submit hosts.
>>>>> Reading in host group entries:
>>>>>         Host group entries for group "@allhosts".
>>>>>         Host group entries for group "@single".
>>>>>         Host group entries for group "@multi".
>>>>>         Host group entries for group "@gal".
>>>>>         Host group entries for group "@bigmem".
>>>>>         Host group entries for group "@multi2".
>>>>> Reading in usersets:
>>>>>         Userset "gal".
>>>>>         Userset "defaultdepartment".
>>>>>         Userset "deadlineusers".
>>>>>         Userset "admin".
>>>>> Reading in queues:
>>>>>         Queue "batch".
>>>>>         Queue "multi".
>>>>>         Queue "single".
>>>>>         Queue "bigmem".
>>>>>         Queue "multi2".
>>>>> Reading in parallel environments:
>>>>>         PE "mpich_1".
>>>>>         PE "mpich_2".
>>>>> Reading in ckpt interface definitions:
>>>>>         CKPT "blcr".
>>>>> Reading in Master_Job_List.
>>>>> .
>>>>>
>>>>> read job database with 15 entries in 1 seconds
>>>>> Segmentation fault
>>>>> [root at stokes util]#
>>>>>
>>>>>
>>>>> Joachim Gabler wrote:
>>>>>
>>>>>> Patrice,
>>>>>>
>>>>>> what spooling method are you using (classic / berkeleydb)?
>>>>>>
>>>>>> Please try to startup qmaster in debug mode:
>>>>>> In a shell as user root:
>>>>>> source $SGE_ROOT/util/dl.(c)sh
>>>>>> dl 1
>>>>>> $SGE_ROOT/bin/<arch>/sge_qmaster
>>>>>>
>>>>>> This might show some error messages, e.g. when reading jobs from 
>>>>>> disk.
>>>>>>
>>>>>>   Joachim
>>>>>>
>>>>>> Daniel Templeton schrieb:
>>>>>>
>>>>>>> Looks to me like running out of memory caused the qmaster to 
>>>>>>> leave the cluster in a broken state.  You'll need to clean up 
>>>>>>> whatever the qmaster left behind.  That may involve using 
>>>>>>> utilbin/spooledit or deleting jobs from the spool directory.  
>>>>>>> Unfortunately, you'll need the advice of someone who actually 
>>>>>>> recovers broken clusters instead of just reinstalling them like I 
>>>>>>> do.  Joachim?  Stephan?  Omar?
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> Patrice Hamelin wrote:
>>>>>>>
>>>>>>>> After running sgemaster, qmaster is NOT running.  I run SGE 
>>>>>>>> 6.0u1 on Red Hat Linux 7.3.  See my other message: I had a memory 
>>>>>>>> problem yesterday which I think caused this problem.
>>>>>>>>
>>>>>>>> Thanks guys for help!
>>>>>>>>
>>>>>>>> Daniel Templeton wrote:
>>>>>>>>
>>>>>>>>> After running sgemaster, is your qmaster running?  What 
>>>>>>>>> platform and SGE version?  Was there an event which caused the 
>>>>>>>>> qmaster to stop?
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>>>>>> Patrice Hamelin wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>   My qmaster stopped a couple of hours ago and I cannot 
>>>>>>>>>> restart it.
>>>>>>>>>> I always get:
>>>>>>>>>>
>>>>>>>>>> [root at stokes common]#  /etc/init.d/sgemaster start
>>>>>>>>>>    starting sge_qmaster
>>>>>>>>>>    starting sge_schedd
>>>>>>>>>> error: getting configuration: unable to contact qmaster using 
>>>>>>>>>> port 536
>>>>>>>>>> on host "stokes.clumeq.mcgill.ca"
>>>>>>>>>> can't get configuration from qmaster -- waiting ...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   thanks for help!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------- 
>>>>>>>>>
>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>>>> For additional commands, e-mail: 
>>>>>>>>> users-help at gridengine.sunsource.net
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
> 
> 

-- 
Patrice Hamelin ing, M.Sc.A, CCNA
Systems Administrator
CLUMEQ Supercomputer Centre
McGill University
688 Sherbrooke Street West, Suite 710
Montreal, QC, Canada H3A 2S6
Tel: 514-398-3344
Fax: 514-398-2203
http://www.clumeq.mcgill.ca




