[GE users] sge_qmaster crashing with segmentation fault

rayson rayrayson at gmail.com
Wed Nov 4 15:58:30 GMT 2009


Did you try running the qmaster under a debugger?

If you can get a stack trace when it crashes, we would at least have a
better understanding of the problem...
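
One way to capture such a trace is sketched below. This is only a rough outline, not a tested procedure: `run_under_gdb` is a hypothetical helper name, the binary path (including the `lx24-amd64` architecture directory and the `/opt/sge` fallback) is an assumption about the installation, and gdb must be installed on the qmaster host. It relies on SGE_ND=true, already used in the original report, to keep the daemon in the foreground so that gdb stays attached to the process that crashes.

```shell
#!/bin/sh
# Hypothetical helper: run a daemon binary under gdb in batch mode and
# print a backtrace of every thread once it stops (e.g. on SIGSEGV).
run_under_gdb() {
    bin=$1
    if [ ! -x "$bin" ]; then
        echo "not executable: $bin" >&2
        return 2
    fi
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found in PATH" >&2
        return 3
    fi
    # -batch: gdb exits after the -ex commands have run.
    # "run" starts the program; once it stops on the fault,
    # "thread apply all bt" dumps the stack of every thread.
    gdb -batch -ex run -ex "thread apply all bt" --args "$bin"
}

# Keep sge_qmaster in the foreground, as in the original report.
SGE_ND=true; export SGE_ND

# Assumed location of the binary; adjust to your installation.
QMASTER_BIN="${QMASTER_BIN:-${SGE_ROOT:-/opt/sge}/bin/lx24-amd64/sge_qmaster}"
```

With the function defined, something like `run_under_gdb "$QMASTER_BIN" 2>&1 | tee qmaster-gdb.log` would keep the output in a file that can be posted to the list. The courtesy binaries may lack debug symbols, in which case the trace shows fewer function names but can still narrow down where the crash happens.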

Rayson



On Wed, Nov 4, 2009 at 5:41 AM, sgexav <xaviercouvelard at gmail.com> wrote:
> Hi, we had the same sort of problem: it was working and one day it
> stopped. We did not manage to find a definitive solution, so as a
> temporary workaround we went for an hourly restart script. Now it seems
> that everything works well :-).
> In case it helps.
> X.
>
> eimamagi wrote:
>> Hello to all,
>>
>> we have a problem with our SGE installation. Our environment is the following:
>> - frontend is VMWare virtual machine on ESX server (Infrastructure 3.5)
>> - kernel:
>>    # uname -a
>>    Linux sge.srce.hr 2.6.18-128.1.14.el5 #1 SMP Wed Jun 17 06:38:05 EDT
>> 2009 x86_64 x86_64 x86_64 GNU/Linux
>> - SGE version:
>>    # qstat --version
>>    SGE 6.2u3
>> - we have a Beowulf cluster whose nodes are on a private network without
>> any firewall.
>> - we installed the courtesy binaries and everything has been working fine
>> since the end of August.
>>
>> On August 29th sge_qmaster simply stopped working without leaving
>> anything in the message logs. Afterwards we restarted it several times,
>> and each time it died a few minutes after the restart.
>>
>> Then we tried running it in debug mode:
>>    SGE_DEBUG_LEVEL="2 2 0 0 0 0 2 0"; export SGE_DEBUG_LEVEL
>>    SGE_ND="true"; export SGE_ND
>> The messages seemed reasonable and sge_qmaster worked fine, but after a
>> random number of hours it died with the following messages:
>> 1.
>> 1522805  16032 scheduler000     ================[SCHEDULING-EPOCH 200911021352.48]==================
>> 1522806  16032 scheduler000     RAW CQ:2, J:57, H:9, C:49, A:103, D:1, P:101, CKPT:0, US:211, PR:101, RQS:0, AR:0, S:nd:384/lf:282
>> /etc/init.d/sgemaster.isabella: line 652: 16032 Segmentation fault $bin_dir/sge_qmaster
>> 2.
>> 18941855  31366 scheduler000     ================[SCHEDULING-EPOCH 200911031641.16]==================
>> 18941856  31366 scheduler000     RAW CQ:2, J:67, H:9, C:49, A:103, D:1, P:101, CKPT:0, US:211, PR:101, RQS:0, AR:0, S:nd:384/lf:282
>> 18941857  31366 event_master     processing event master request:
>> /etc/init.d/sgemaster.isabella: line 652: 31366 Segmentation fault $bin_dir/sge_qmaster
>>
>>
>> The segfault doesn't seem to follow any pattern, but this report
>> might be useful for others. Could this be a problem with the VMware
>> virtual machine?
>>
>> Thanks a lot in advance,
>> emir
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=224904
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=224983
>
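
On the hourly restart mentioned in the quoted reply: a watchdog that restarts the qmaster only when the process is actually gone avoids interrupting a healthy daemon, and its log shows when each crash happened. A minimal sketch follows, with several assumptions: `qmaster_running` and `check_qmaster` are hypothetical helper names, the init script path is the one that appears in the crash output above, the script is assumed to accept a `start` argument, and `pgrep` is assumed to be available.

```shell
#!/bin/sh
# Hypothetical watchdog: restart sge_qmaster only when it is not running.
# INIT_SCRIPT defaults to the init script named in the crash output;
# adjust it to your installation.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/sgemaster.isabella}"

qmaster_running() {
    # pgrep -x matches the exact process name.
    pgrep -x sge_qmaster >/dev/null 2>&1
}

check_qmaster() {
    if qmaster_running; then
        return 0
    fi
    # Timestamp each restart so crash times can be correlated later.
    echo "$(date '+%Y-%m-%d %H:%M:%S') sge_qmaster not running, restarting"
    "$INIT_SCRIPT" start
}

# Only act when invoked with --run, e.g. from cron.
if [ "${1:-}" = "--run" ]; then
    check_qmaster
fi
```

Run from cron every few minutes with `--run` and with output appended to a log file, this would restart the daemon shortly after a crash instead of once an hour. It is still only a workaround; the stack trace is what would point to the actual cause.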

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=225037


