[GE users] Scalability issues when running MPI jobs

fialia theveracious at yahoo.com
Thu Feb 12 15:09:10 GMT 2009


I have some questions about running jobs through GE on large clusters
that I hope someone can shed some light on.

Do I need to configure Grid Engine in any special way for it to handle
MPI jobs using more than 1200 cores?
What kinds of problems have you encountered when running large jobs?

During my test runs I got the following errors:

"error: getting configuration: failed receiving gdi request response for mid=1 (got syncron message receive timeout error)."
"Cannot get configuration from qmaster."

Does that mean the qmaster daemon is overloaded? The messages appear
both in the job's error file and on the screen when I run qstat
afterwards. With HPMPI 2.3 they appeared (for my particular test job)
when I attempted to allocate more than 837 cores. With HPMPI 2.2.7 they
appeared when I attempted to allocate more than 1021 cores.

My tests use GE 6.2u1, and I get the same behavior with GE 6.1u4. Does
the 63000-core support in 6.2 apply only to batch jobs, or how is it
defined more precisely?


