[GE users] sge stopped: error: getting configuration:

Patrice Hamelin phamelin at clumeq.mcgill.ca
Tue May 3 15:53:28 BST 2005


Joachim,

   It works fine now, but I had to clear all the running processes on 
the nodes.

   What could have caused the corruption?  I kept a copy of the jobs and 
job_scripts directories.

Thanks to all (Daniel, Mac, Joachim)
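
For anyone who lands here with the same symptom, the "keep a copy, then clear the spooled jobs" step could be scripted roughly as below. This is only a sketch under assumptions: qmaster is stopped while files are moved, classic spooling is in use, and the spool directory contains jobs and job_scripts as Joachim describes. The function name set_aside_spooled_jobs and the backup_root parameter are made up for illustration and are not part of Grid Engine.

```python
import shutil
import time
from pathlib import Path


def set_aside_spooled_jobs(spool_dir, backup_root):
    """Move <spool_dir>/jobs and <spool_dir>/job_scripts into a new
    timestamped directory under backup_root, leaving empty directories
    behind. Returns the path of the backup directory.

    Assumption: qmaster is not running while this executes.
    """
    spool = Path(spool_dir)
    backup = Path(backup_root) / time.strftime("spool-backup-%Y%m%d-%H%M%S")
    backup.mkdir(parents=True)

    for name in ("jobs", "job_scripts"):
        src = spool / name
        if src.is_dir():
            # Moving instead of deleting keeps the data for later inspection.
            shutil.move(str(src), str(backup / name))
        # Recreate an empty directory. Ownership and permissions are NOT
        # copied from the original; adjust them by hand if needed.
        src.mkdir()
    return backup
```

Call it with your own qmaster spool directory and a backup location of your choice. Moving the directories aside rather than deleting them means the spooled jobs stay available if someone wants to look into what was corrupted.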

Joachim Gabler wrote:
> Did qmaster write a core file?
> Could you please try to produce a stack trace (using gdb)?
> 
> Except for the jobs, all spooled files are ASCII files.
> I would suggest deleting the spooled jobs: everything under 
> <spooldir>/jobs and <spooldir>/job_scripts.
> 
> If qmaster still doesn't start up, have a look at the other spooled objects:
> <spooldir>/admin_hosts,
> <spooldir>/calendars,
> ....
> 
> Just run cat on each file and check that it looks OK.
> 
>  Joachim
> 
> 
> Patrice Hamelin wrote:
> 
>> I use classic spooling.
>>
>>
>> [root@stokes util]# /opt/sge/bin/lx24-x86/sge_qmaster
>> Reading in complex attributes.
>> Reading in execution hosts.
>> Reading in administrative hosts.
>> Reading in submit hosts.
>> Reading in host group entries:
>>         Host group entries for group "@allhosts".
>>         Host group entries for group "@single".
>>         Host group entries for group "@multi".
>>         Host group entries for group "@gal".
>>         Host group entries for group "@bigmem".
>>         Host group entries for group "@multi2".
>> Reading in usersets:
>>         Userset "gal".
>>         Userset "defaultdepartment".
>>         Userset "deadlineusers".
>>         Userset "admin".
>> Reading in queues:
>>         Queue "batch".
>>         Queue "multi".
>>         Queue "single".
>>         Queue "bigmem".
>>         Queue "multi2".
>> Reading in parallel environments:
>>         PE "mpich_1".
>>         PE "mpich_2".
>> Reading in ckpt interface definitions:
>>         CKPT "blcr".
>> Reading in Master_Job_List.
>> .
>>
>> read job database with 15 entries in 1 seconds
>> Segmentation fault
>> [root@stokes util]#
>>
>>
>> Joachim Gabler wrote:
>>
>>> Patrice,
>>>
>>> what spooling method are you using (classic / berkeleydb)?
>>>
>>> Please try to start up qmaster in debug mode.
>>> In a shell as user root:
>>> source $SGE_ROOT/util/dl.sh    (or dl.csh under csh/tcsh)
>>> dl 1
>>> $SGE_ROOT/bin/<arch>/sge_qmaster
>>>
>>> This might show some error messages, e.g. when reading jobs from disk.
>>>
>>>   Joachim
>>>
>>> Daniel Templeton wrote:
>>>
>>>> Looks to me like running out of memory caused the qmaster to leave 
>>>> the cluster in a broken state.  You'll need to clean up whatever the 
>>>> qmaster left behind.  That may involve using utilbin/spooledit or 
>>>> deleting jobs from the spool directory.  Unfortunately, you'll need 
>>>> the advice of someone who actually recovers broken clusters instead 
>>>> of just reinstalling them like I do.  Joachim?  Stephan?  Omar?
>>>>
>>>> Daniel
>>>>
>>>> Patrice Hamelin wrote:
>>>>
>>>>> After running sgemaster, qmaster is NOT running.  I run SGE 6.0u1 
>>>>> on Red Hat Linux 7.3.  See my other message: I had a memory problem 
>>>>> yesterday, which I think caused this.
>>>>>
>>>>> Thanks guys for help!
>>>>>
>>>>> Daniel Templeton wrote:
>>>>>
>>>>>> After running sgemaster, is your qmaster running?  What platform 
>>>>>> and SGE version?  Was there an event which caused the qmaster to 
>>>>>> stop?
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> Patrice Hamelin wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>   My qmaster stopped a couple of hours ago and I cannot restart 
>>>>>>> it.
>>>>>>> I always get:
>>>>>>>
>>>>>>> [root@stokes common]#  /etc/init.d/sgemaster start
>>>>>>>    starting sge_qmaster
>>>>>>>    starting sge_schedd
>>>>>>> error: getting configuration: unable to contact qmaster using port 536 on host "stokes.clumeq.mcgill.ca"
>>>>>>> can't get configuration from qmaster -- waiting ...
>>>>>>>
>>>>>>>
>>>>>>>   thanks for help!
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>
>>>>>
>>>>
>>>
>>
> 
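
Joachim's advice in the quoted thread to look through the other spooled objects by hand can be partly automated. The sketch below walks a classic spool directory and lists files that are empty or contain NUL or non-ASCII bytes, relying only on his statement that everything except the jobs is plain ASCII. The name suspicious_spool_files is hypothetical, skipping job_scripts along with jobs is my own assumption (user scripts may legitimately contain non-ASCII text), and an empty file is merely a candidate for a closer look, not proof of corruption.

```python
from pathlib import Path

# Job spool files are not ASCII according to the thread; job_scripts is
# skipped as well (an assumption, since user scripts may not be ASCII).
SKIP_DIRS = {"jobs", "job_scripts"}


def suspicious_spool_files(spool_dir):
    """Return a list of (relative_path, reason) pairs for spooled files
    that are empty or contain NUL / non-ASCII bytes."""
    spool = Path(spool_dir)
    findings = []
    for path in sorted(spool.rglob("*")):
        rel = path.relative_to(spool)
        if not path.is_file() or rel.parts[0] in SKIP_DIRS:
            continue
        data = path.read_bytes()
        if not data:
            findings.append((rel.as_posix(), "empty file"))
        elif b"\x00" in data:
            findings.append((rel.as_posix(), "contains NUL bytes"))
        else:
            try:
                data.decode("ascii")
            except UnicodeDecodeError:
                findings.append((rel.as_posix(), "non-ASCII bytes"))
    return findings
```

Anything it reports still needs a human to read the file, as suggested in the thread. The check only narrows down where to look first.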

-- 
Patrice Hamelin ing, M.Sc.A, CCNA
Systems Administrator
CLUMEQ Supercomputer Centre
McGill University
688 Sherbrooke Street West, Suite 710
Montreal, QC, Canada H3A 2S6
Tel: 514-398-3344
Fax: 514-398-2203
http://www.clumeq.mcgill.ca
