[GE users] Timeout with client requests

chambon chambon at cc.in2p3.fr
Thu Oct 21 10:44:48 BST 2010


I started a GE stress test, mainly by submitting huge numbers of jobs without running them (all queues closed).
(The master has a lot of memory (70 GB), uses BerkeleyDB spooling on a local disk, and has DBwriter running (for ARCo).)

The good news is that GE behaves very well.
qsub is very fast, and the submission rate does not decrease too much as the number of pending jobs increases
(for example: 80 jobs/s with 10,000 pending jobs, 61 jobs/s with 100,000 pending jobs).
I also took other measurements for the qstat and qdel commands.
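
As a rough illustration of how rates like the ones above can be measured and compared, here is a minimal Python sketch. The `submit` callable is a hypothetical stand-in for whatever actually submits one job (for instance a wrapper around qsub); it is not part of GE, and this is not the script used for the test.

```python
import time
from typing import Callable


def submission_rate(submit: Callable[[], None], n_jobs: int) -> float:
    """Call submit() n_jobs times and return the achieved jobs per second.

    `submit` is a hypothetical placeholder for a single job submission.
    """
    start = time.perf_counter()
    for _ in range(n_jobs):
        submit()
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast no-op submitters.
    return n_jobs / elapsed if elapsed > 0 else float("inf")


def relative_drop(rate_before: float, rate_after: float) -> float:
    """Fraction by which the rate decreased between two measurements."""
    return (rate_before - rate_after) / rate_before


if __name__ == "__main__":
    # Rates quoted in the message: 80 jobs/s and 61 jobs/s.
    print(f"slowdown: {relative_drop(80, 61):.1%}")
```

With the two figures quoted above, `relative_drop(80, 61)` is 19/80, so about a 24% slowdown for ten times as many pending jobs.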

As a last test, I tried to submit more than 1,000,000 (one million) jobs.
The bad news is that, on reaching 984279 pending jobs, the qsub and qstat commands started timing out
with the message:
 error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).

The qmaster is still running, but I can't issue requests on jobs (qstat, qdel), even after restarting the qmaster,
although I can still issue other requests (qconf, qping).

My questions are:
 - Is it only a timeout problem? If so, is it possible to change the client timeout?
 - Has anyone submitted this many jobs (or more)?
   Do the GE limits depend only on the hardware of the master?

 - In the bootstrap file, what is the meaning of the listener and worker threads (listener_thread, worker_threads)?
   Is it useful to set values other than the default (2)?


