[GE users] sge_sched getting killed by oom

Reuti reuti at staff.uni-marburg.de
Thu Dec 27 20:17:55 GMT 2007


Hi,

On 27.12.2007 at 16:52, Chris Harwell wrote:

> I'm having some trouble with sge. I am using GE 6.1u2 on an IBM  
> x346 with 2 physical CPUs and HT on with 4GB of physical RAM and  
> 18GB of swap.

Wow, that is a huge swap space. Nowadays I configure only around 1 GB
of swap as a last resort, since everything should fit into real RAM.
What else is running on this machine? Could you add 4 GB of RAM?

>  uname -a
> Linux lcsge3.na.novartis.net 2.6.9-42.ELsmp #1 SMP Wed Jul 12  
> 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> rpm -q redhat-release
> redhat-release-4AS-5.5.x86_64
>
> and three times now sge_sched has been killed by the Linux oom  
> killer. One of the frustrating things is that it seems in this  
> scenario

Did you check beforehand whether this process was really the one
eating up the memory? We had this once, but the cause was the NFS
server, and the OOM killer took out other tasks like SGE, postfix,
cron, itsm... one after the other. So restarting NFS was the solution.

> that the shadow masters don't notice this as a problem and so they  
> do not take over. This leaves jobs in the pending queue that ought  
> to be scheduled.

Well, the qmaster is still running, I guess. The shadow masters only
watch the qmaster, not the scheduler, so they see no reason to take
over.
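
To confirm after such an incident which of the two daemons is still alive, something like the sketch below could be run on the master host. It assumes the daemons show up under the process names sge_qmaster and sge_schedd and that /proc is available; the helper names are hypothetical.

```python
import os


def running_daemons(process_names, wanted=("sge_qmaster", "sge_schedd")):
    """Map each wanted daemon name to True if it occurs in process_names."""
    seen = set(process_names)
    return {name: name in seen for name in wanted}


def list_process_names(proc_root="/proc"):
    """Collect the Name: field of every process found under proc_root."""
    names = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                first = fh.readline()
        except OSError:
            continue  # the process exited while we were scanning
        if first.startswith("Name:"):
            names.append(first.split(":", 1)[1].strip())
    return names


if __name__ == "__main__":
    for name, alive in running_daemons(list_process_names()).items():
        print(f"{name}: {'running' if alive else 'NOT running'}")
```

If this reports the qmaster as running and the scheduler as missing, that matches the situation described: jobs stay pending although the cluster looks healthy to the shadow hosts.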

-- Reuti


>
> So, are there any known memory leaks with the above configuration  
> that could be fixed by upgrading to the latest ( 6.1u3 last I  
> checked )?  What are the ballpark ranges for the amount of memory  
> sge sched should consume? Should I just add some RAM to the box or  
> move sched to another box? Presumably this is based on the number  
> of jobs and hosts and queues? I have about 223 hosts.
>
>          running #jobs/#slots
> Owner       serial  parallel     total
> --------------------------------------
> user1        6/  6     0/  0     6/  6
> user2       17/ 17     0/  0    17/ 17
> user3        1/  1     0/  0     1/  1
> user4      113/113     0/  0   113/113
> user5        1/  1     0/  0     1/  1
> user6        1/  1     0/  0     1/  1
> --------------------------------------
> Sum        139/139     0/  0   139/139
>
>
>          waiting #jobs/#slots
> Owner       serial  parallel     total
> --------------------------------------
> user7        0/  0     1/ 20     1/ 20
> user4      596/596     0/  0   596/596
> --------------------------------------
> Sum        596/596     1/ 20   597/616
>
>
> Should I provide more details of my specific SGE configuration?  Or  
> could someone point me to some documentation on reducing the memory  
> usage of sge_sched?
>
>
> Thanks for any pointers,
> Chris
>
>
>



