[GE users] sge_qmaster crashing with segmentation fault

sgexav xaviercouvelard at gmail.com
Wed Nov 4 10:41:42 GMT 2009



Hi, we had the same sort of problem: it was working and one day it
stopped. We did not manage to find a definitive solution.
As a temporary workaround we went for an hourly restart script, and now
everything seems to work well :-) .
Hope this helps.
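
An hourly restart of this kind could be sketched roughly as below. This is
only a minimal sketch under assumptions, not the script actually used: the
init script path /etc/init.d/sgemaster, the RESTART_DELAY variable and the
--run argument are all placeholders chosen for illustration, so adjust them
to your installation.

```shell
#!/bin/sh
# Hypothetical restart wrapper for sge_qmaster (illustrative sketch only).
# INIT_SCRIPT is an assumed path; use the init script of your own site.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/sgemaster}"

restart_qmaster() {
    # Stop the master; ignore a failure here, since the daemon may
    # already have died on its own.
    "$INIT_SCRIPT" stop || true
    # Short pause before starting again (RESTART_DELAY is an assumed knob).
    sleep "${RESTART_DELAY:-5}"
    "$INIT_SCRIPT" start
}

# Only act when explicitly asked to, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    restart_qmaster
fi
```

Saved under a name of your choice (here the hypothetical
/usr/local/sbin/restart_qmaster.sh), it could be run from root's crontab
once an hour with a line such as:

    0 * * * * /usr/local/sbin/restart_qmaster.sh --run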
X.

eimamagi wrote:
> Hello to all,
>
> we have a problem with our SGE installation. Our environment is the following:
> - the frontend is a VMware virtual machine on an ESX server (Infrastructure 3.5)
> - kernel:
>    # uname -a
>    Linux sge.srce.hr 2.6.18-128.1.14.el5 #1 SMP Wed Jun 17 06:38:05 EDT
> 2009 x86_64 x86_64 x86_64 GNU/Linux
> - SGE version:
>    # qstat --version
>    SGE 6.2u3
> - we have a Beowulf cluster whose nodes are on a private network without
> any firewall.
> - we installed the courtesy binaries and everything has been working fine
> since the end of August.
>
> On August 29th sge_qmaster simply stopped working without anything in
> the message logs. Afterwards we restarted it several times and each time
> it died a few minutes after the restart.
>
> Then we tried running it in debug mode:
>    SGE_DEBUG_LEVEL="2 2 0 0 0 0 2 0"; export SGE_DEBUG_LEVEL
>    SGE_ND="true"; export SGE_ND
> The messages seemed reasonable and sge_qmaster worked fine, but after a
> random number of hours it died with the following messages:
> 1.
> 1522805  16032 scheduler000     ================[SCHEDULING-EPOCH 
> 200911021352.48]==================
> 1522806  16032 scheduler000     RAW CQ:2, J:57, H:9, C:49, A:103, D:1, 
> P:101, CKPT:0, US:211, PR:101, RQS:0, AR:0, S:nd:384/lf:282
> /etc/init.d/sgemaster.isabella: line 652: 16032 Segmentation fault 
> $bin_dir/sge_qmaster
> 2.
> 18941855  31366 scheduler000     ================[SCHEDULING-EPOCH 
> 200911031641.16]==================
> 18941856  31366 scheduler000     RAW CQ:2, J:67, H:9, C:49, A:103, D:1, 
> P:101, CKPT:0, US:211, PR:101, RQS:0, AR:0, S:nd:384/lf:282
> 18941857  31366 event_master     processing event master request: 
> /etc/init.d/sgemaster.isabella: line 652: 31366 Segmentation fault
> $bin_dir/sge_qmaster
>
>
> This segfault doesn't seem to follow any pattern, but the output above
> might be useful for others. Could this be a problem with the VMware
> virtual machine?
>
> Thanks a lot in advance,
> emir
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=224904
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=224983



