[GE users] Timeout with client requests
chambon at cc.in2p3.fr
Thu Oct 21 10:44:48 BST 2010
I started a GE stress test, mainly by submitting huge numbers of jobs without running them (all queues closed).
(The master has a lot of memory (70 GB), uses BerkeleyDB spooling on a local disk, and has DBwriter running (for ARCo).)
The good news is that GE shows very good behaviour.
qsub is very fast, and the submission rate does not drop too much as the number of pending jobs increases.
(For example: 80 jobs/s with 10,000 pending jobs, 61 jobs/s with 100,000 pending jobs.)
I also took other measurements for the qstat and qdel commands.
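
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of how a submission rate could be timed. It is not the script used for the numbers above: the function names are made up for illustration, and the qsub command line shown in the comment is only an assumed example.

```python
import subprocess
import sys
import time


def submission_rate(cmd, count):
    """Run `cmd` sequentially `count` times and return completed runs per second."""
    start = time.perf_counter()
    for _ in range(count):
        # check=True raises CalledProcessError as soon as one invocation fails,
        # e.g. when the client gets a timeout from the qmaster.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return count / elapsed


def relative_drop(rate_before, rate_after):
    """Fraction by which the rate decreased, e.g. 80 -> 61 jobs/s is about 24%."""
    return (rate_before - rate_after) / rate_before


# Assumed usage on a submit host (hypothetical job command):
#   rate = submission_rate(["qsub", "-b", "y", "/bin/true"], 1000)
#   print(f"{rate:.1f} jobs/s")
```

The same helper can time qstat or qdel by passing a different command list.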
As a last test, I tried to submit more than 1,000,000 jobs (one million jobs).
The bad news is that, on reaching 984279 pending jobs, I got timeouts with the qsub and qstat commands,
with the message:
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
The qmaster is still running, but I can't issue requests on jobs (qstat, qdel), even after restarting the qmaster.
I can still issue other requests (qconf, qping).
My questions are:
- Is it only a timeout problem? If yes, is it possible to change the client timeout?
- Has anyone submitted this many (or more) jobs?
Do the GE limits depend only on the hardware of the master?
- In the bootstrap file, what is the meaning of the listener and worker threads (listener_threads, worker_threads)?
Is it useful to set values other than the default (2)?