No subject

Wed Jan 12 20:38:46 GMT 2011

Right now we are in the process of upgrading the master host to a T5120 server with a 64-thread CPU and 16 GB of RAM, just to give ourselves more headroom for additional execution hosts.

On the other hand, we have over 130,000 daily jobs, handled by a custom job feeder that lets each user hold approx. 100 waiting jobs. The feeder checks for finished jobs, and each time 10 or so of a user's jobs finish, it sends another 10 jobs to the grid.
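
For illustration only, here is a rough Python sketch of the feeder logic described above. It is not our actual feeder: the class name UserFeeder, the submit callback (which would wrap whatever submission command you use), and the exact counting are assumptions made up for this example. It also treats "waiting" as "submitted and not yet finished", which is a simplification.

```python
from collections import deque


class UserFeeder:
    """Per-user throttle: keep roughly max_waiting jobs in the grid and
    top up in batches as jobs finish. Hypothetical sketch, not the real feeder."""

    def __init__(self, backlog, submit, max_waiting=100, batch=10):
        self.backlog = deque(backlog)    # jobs not yet sent to the grid
        self.submit = submit             # callback that actually submits one job
        self.max_waiting = max_waiting   # approx. cap per user
        self.batch = batch               # refill size
        self.outstanding = 0             # submitted and not yet reported finished
        self.finished_since_refill = 0   # finished jobs not yet "paid back"

    def _send(self, n):
        """Submit up to n jobs without exceeding the per-user cap."""
        sent = 0
        while sent < n and self.backlog and self.outstanding < self.max_waiting:
            self.submit(self.backlog.popleft())
            self.outstanding += 1
            sent += 1
        return sent

    def prime(self):
        """Initial fill up to the cap."""
        return self._send(self.max_waiting)

    def on_finished(self, count):
        """Record finished jobs; send one batch per full batch of finishes."""
        self.outstanding -= count
        self.finished_since_refill += count
        sent = 0
        while self.finished_since_refill >= self.batch:
            self.finished_since_refill -= self.batch
            sent += self._send(self.batch)
        return sent
```

The point of refilling in batches rather than one job at a time is that the master sees a small number of submission bursts instead of a constant trickle of requests. The feeder would call on_finished() from whatever periodic check it uses to detect completed jobs.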

Just a note: if you are on SGE 6.1, running qstat frequently will add a huge number of GDI requests to the master host. In 6.2 this seemed to be much less of a problem.

-----Original Message-----
From: ggeca [mailto:ggeca at]
Sent: Monday, May 25, 2009 10:55 AM
To: users at
Subject: [GE users] GE Issue when handling a lot of jobs

Dear all,

We are running a Grid Engine system (6.1u4) with 4 execute hosts (SLES 10
SP2) and one of the execute hosts acts as a master host.

Recently we had to submit more than 80 000 jobs to be processed over a
long period of time. Unfortunately the master host got unresponsive (qstat
and qdel returning "failed receiving gdi request" messages). After
stopping the scheduling daemon (sge_schedd) it failed to start again and
produced a timeout message. After deleting all the files in
<SGE_ROOT>/default/spool/qmaster/job_scripts we were able to start the
daemon again but we are afraid the problem may appear again.

We were wondering if there is a limit to the number of jobs that Grid
Engine can handle simultaneously. We are also wondering if this issue
appears because of insufficient hardware resources (we are using an Athlon 64
X2 3800+ with 2GB RAM) or a failing file system (reiserfs).

Any help, ideas or suggestions will be greatly appreciated.

Best regards,
Georgi Gecov


To unsubscribe from this discussion, e-mail: [users-unsubscribe at].

