[GE users] GE Issue when handling a lot of jobs

ggeca ggeca at bas.bg
Mon May 25 08:55:15 BST 2009

Dear all,

We are running a Grid Engine system (6.1u4) with 4 execution hosts (SLES 10
SP2), one of which also acts as the master host.

Recently we had to submit more than 80 000 jobs to be processed over a
long period of time. Unfortunately the master host became unresponsive
(qstat and qdel returned "failed receiving gdi request" messages). After
we stopped the scheduling daemon (sge_schedd), it failed to start again
and produced a timeout message. After deleting all the files in
<SGE_ROOT>/default/spool/qmaster/job_scripts we were able to start the
daemon again, but we are afraid the problem may reappear.

We are wondering whether there is a limit on the number of jobs that Grid
Engine can handle simultaneously. We are also wondering whether this issue
is caused by insufficient hardware resources (we are using an Athlon 64
X2 3800+ with 2GB RAM) or a failing file system (reiserfs).

Any help, ideas or suggestions will be greatly appreciated.

Best regards,
Georgi Gecov

