[GE users] GE Issue when handling a lot of jobs

craffi dag at sonsorol.org
Tue May 26 12:05:06 BST 2009

More memory would certainly help.

SGE has a design goal of 500,000 concurrent jobs (I think) and has  
been known to run parallel jobs across 63,000 cores (RANGER cluster in  

Spooling type (Berkeley DB vs. classic) matters in the "rapidly
submitting jobs" context; you should be using Berkeley DB spooling. I
also suspect that 2GB of RAM is not much for a qmaster tracking 80K
jobs, so that could be an issue as well. Modern versions of SGE have a
decent set of tracing and profiling tools that should let you know
whether the hardware is a bottleneck.
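
If you are not sure which spooling method your cell was installed
with, it is normally recorded in the cell's bootstrap file. The
following is only a rough sketch: it assumes the usual location
<SGE_ROOT>/<SGE_CELL>/common/bootstrap and a "spooling_method" key in
that file, and the helper names are made up for illustration. Check
the path and key against your own install.

```python
import os


def spooling_method(bootstrap_text):
    """Return the spooling_method value found in bootstrap file text, or None."""
    for line in bootstrap_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "spooling_method":
            return parts[1].strip()
    return None


def bootstrap_path():
    """Assumed location of the bootstrap file; the cell defaults to 'default'."""
    root = os.environ.get("SGE_ROOT", "")
    cell = os.environ.get("SGE_CELL", "default")
    return os.path.join(root, cell, "common", "bootstrap")
```

To use it, read the file at bootstrap_path() and pass its contents to
spooling_method(); a result of "classic" would mean the cell is not
using Berkeley DB spooling.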

For 80,000 jobs it is also worth taking a serious look at your
workflow. Is there any chance you can compress the 80K tasks into a
job array, or a group of dependent job arrays? Job arrays with
thousands of independent tasks put far less load on the system than
thousands of independent jobs.
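
To make the job array idea concrete, here is a minimal sketch that
splits a large task count into a handful of array submissions using
qsub's -t option. The script name "task.sh" and the chunk size are
placeholders I picked for illustration, and the function only builds
the command lines; it does not submit anything.

```python
def array_job_commands(total_tasks, tasks_per_array, script="task.sh"):
    """Build qsub argument lists covering tasks 1..total_tasks as job arrays.

    'script' is a hypothetical job script name; each array task can read
    its own index from the SGE_TASK_ID environment variable.
    """
    if total_tasks < 1 or tasks_per_array < 1:
        raise ValueError("total_tasks and tasks_per_array must be positive")
    commands = []
    for first in range(1, total_tasks + 1, tasks_per_array):
        # The last chunk may be shorter than tasks_per_array.
        last = min(first + tasks_per_array - 1, total_tasks)
        commands.append(["qsub", "-t", f"{first}-{last}", script])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for cmd in array_job_commands(80000, 10000):
        print(" ".join(cmd))
```

With these numbers the qmaster tracks 8 array jobs rather than 80,000
separate ones. The job script would pick its input based on
SGE_TASK_ID rather than on a per-job argument.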


On May 25, 2009, at 3:55 AM, ggeca wrote:

> Dear all,
> We are running a Grid Engine system (6.1u4) with 4 execute hosts
> (SLES 10 SP2) and one of the execute hosts acts as a master host.
> Recently we had to submit more than 80 000 jobs to be processed over a
> long period of time. Unfortunately the master host got unresponsive
> (qstat and qdel returning "failed receiving gdi request" messages).
> After stopping the scheduling daemon (sge_schedd) it failed to start
> again and produced a timeout message. After deleting all the files in
> <SGE_ROOT>/default/spool/qmaster/job_scripts we were able to start the
> daemon again but we are afraid the problem may appear again.
> We were wondering if there is a limit to the jobs that can be handled
> by the Grid Engine simultaneously. We are also wondering if this issue
> appears because of insufficient hardware resources (we are using
> Athlon 64 X2 3800+ with 2GB RAM) or failing file system (reiserfs).
> Any help, ideas or suggestions will be greatly appreciated.
> Best regards,
> Georgi Gecov
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=198787
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

