[GE users] GE Issue when handling a lot of jobs

dangruhn Dan.Gruhn at groupw.com
Tue May 26 14:30:51 BST 2009


As SGE master we were using a dual-core Athlon 64 6000+ with 2GB RAM and 17 quad-core execution hosts (~61 effective CPUs).  We were having the problems you mentioned when we tried to run 20K jobs. We have since upgraded the SGE master to 8GB RAM and are able to run 30K jobs without problems. All of our systems run Fedora 9 and we used Berkeley DB.

We recently ran 55K jobs and again had the same problems.  The jobs themselves seemed to run okay, so the scheduler was keeping up with them, but qstat was getting GDI errors. I will be looking at job arrays (there were some problems with them in the past, but those have since been fixed), which should help us.
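
As a rough sketch of what moving to job arrays could look like: instead of one qsub per job, a handful of array submissions with "qsub -t" cover the same work, and each task picks its input from $SGE_TASK_ID. The script name job.sh, the chunk size of 20000 and the helper name below are assumptions for illustration, not something we run today. The code only builds the command lines; it does not call qsub.

```python
def array_job_commands(total_tasks, chunk_size, script):
    """Build qsub command lines that cover tasks 1..total_tasks as array jobs.

    Each command submits one array job with a contiguous task range
    ("-t first-last"). The job script (hypothetical here) is expected to
    read $SGE_TASK_ID to decide which piece of work it handles.
    """
    if total_tasks < 1 or chunk_size < 1:
        raise ValueError("total_tasks and chunk_size must be positive")
    commands = []
    first = 1
    while first <= total_tasks:
        last = min(first + chunk_size - 1, total_tasks)
        commands.append(["qsub", "-t", f"{first}-{last}", script])
        first = last + 1
    return commands


if __name__ == "__main__":
    # 55K tasks in chunks of 20000 (both numbers are illustrative).
    for cmd in array_job_commands(55000, 20000, "job.sh"):
        print(" ".join(cmd))
```

With these numbers, qmaster would see three array jobs rather than 55K individual ones.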


adary wrote:
> From personal experience, sending too many jobs kills the scheduler. In our grid, which currently has 450 execution hosts (~2000 CPUs), the scheduler is limited to 10000 total jobs and a maximum of 1000 jobs per user. Our SGE master is a dedicated Sun Fire V240 with 2 CPUs and 16GB RAM, and it's running at maximum.
> Right now we are in the process of upgrading the master host to a T5120 server with a 64-thread CPU and 16GB RAM, just to give ourselves more headroom for additional execution hosts.
> On the other hand, we have over 130000 daily jobs, and this is handled by a custom job feeder that lets each user hold approx. 100 waiting jobs. The feeder checks for finished jobs, and each time 10-ish jobs finish for the user, it sends an additional 10 jobs to the grid.
> Just a note, if you are on SGE 6.1, running frequent qstat will add a huge amount of GDI requests to the master host. In 6.2 this felt a lot less.
> -----Original Message-----
> From: ggeca [mailto:ggeca at bas.bg]
> Sent: Monday, May 25, 2009 10:55 AM
> To: users at gridengine.sunsource.net
> Subject: [GE users] GE Issue when handling a lot of jobs
> Dear all,
> We are running a Grid Engine system (6.1u4) with 4 execute hosts (SLES 10
> SP2) and one of the execute hosts acts as a master host.
> Recently we had to submit more than 80 000 jobs to be processed over a
> long period of time. Unfortunately the master host got unresponsive (qstat
> and qdel returning "failed receiving gdi request" messages). After
> stopping the scheduling daemon (sge_schedd) it failed to start again and
> produced a timeout message. After deleting all the files in
> <SGE_ROOT>/default/spool/qmaster/job_scripts we were able to start the
> daemon again but we are afraid the problem may appear again.
> We were wondering if there is a limit to the jobs that can be handled by
> the Grid Engine simultaneously. We are also wondering if this issue
> appears because of insufficient hardware resources (we are using Athlon 64
> X2 3800+ with 2GB RAM) or failing file system (reiserfs).
> Any help, ideas or suggestions will be greatly appreciated.
> Best regards,
> Georgi Gecov
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=198787
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=198946
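
The feeder adary describes above can be sketched roughly as follows. This is only a guess at the logic from his description (hold about 100 jobs per user, top up once about 10 have finished), not his actual tool. The function name, the submit callback and the way in_grid is obtained are all assumptions; in practice the count would have to come from somewhere that does not itself flood the master with GDI requests.

```python
from collections import deque


def feed_once(backlog, in_grid, submit, target=100, batch=10):
    """Top up one user's jobs in the grid; return how many were submitted.

    backlog -- deque of that user's jobs not yet sent to the grid
    in_grid -- number of that user's jobs currently held by the grid
               (how this is counted is left to the caller; an assumption)
    submit  -- hypothetical callback that hands one job to the grid
    target  -- jobs the user may hold in the grid at once (~100)
    batch   -- only top up once at least this many slots are free (~10)
    """
    free_slots = target - in_grid
    if free_slots < batch:
        return 0  # not enough jobs have finished yet; leave the master alone
    count = min(free_slots, len(backlog))
    for _ in range(count):
        submit(backlog.popleft())
    return count


if __name__ == "__main__":
    sent = []
    backlog = deque(f"job{i}" for i in range(250))
    print(feed_once(backlog, 0, sent.append))   # initial fill
    print(feed_once(backlog, 95, sent.append))  # only 5 finished: wait
    print(feed_once(backlog, 90, sent.append))  # 10 finished: top up
```

The point of the batch threshold is that the master only sees a submission burst every 10 or so completions per user, while the bulk of the backlog stays outside qmaster entirely.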

Dan Gruhn
Group W Inc.
8315 Lee Hwy, Suite 303
Fairfax, VA, 22031
PH: (703) 752-5831
FX: (703) 752-5851

