[GE users] non-advancing jobs in gridengine
landman at scalableinformatics.com
Mon Aug 24 04:15:56 BST 2009
Dan suggested I post this here.
We have been running into a problem with OpenMPI 1.3.2 jobs running in
GridEngine that I have not been able to resolve.
The problem: For jobs of a certain number of cores or larger, the job
gets "stuck" (i.e., it stops advancing) after a certain number of time steps.
Details on the problem: We see this for job sizes of more than 32
cores. 32 cores works great in SGE. I have verified that the jobs work
fine outside of GridEngine up to and including 128 cores (the current
cluster size). Only when running within SGE is there a problem.
ulimit on the nodes looks like this:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 71680
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 6956145
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 71680
virtual memory (kbytes, -v) 7388720
file locks (-x) unlimited
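Those are the limits from an interactive shell; the processes sge_execd
spawns for a job do not necessarily inherit the same values. A quick way
to check is to submit a tiny probe script (hypothetical name and PE slot
count) through qsub and compare its output against the list above:

```shell
#!/bin/bash
# probe_limits.sh -- hypothetical probe; submit with e.g.
#   qsub -pe openmpi 64 probe_limits.sh
# and compare against the interactive `ulimit` output above.
echo "stack size : $(ulimit -s)"
echo "open files : $(ulimit -n)"
echo "locked mem : $(ulimit -l)"
echo "user procs : $(ulimit -u)"
```

If the numbers differ (locked memory in particular matters for IB),
the limits to fix are the ones the execd environment imposes, not the
login-shell ones.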
OpenMPI was built with Intel 11.1 compiler with the --with-sge switch
(among others). Fabric is a Voltaire/Mellanox SDR IB system.
An openmpi parallel environment has been set up in the usual manner.
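For concreteness, "the usual manner" here means a tight-integration PE
along these lines (values illustrative; `control_slaves TRUE` is what
lets Open MPI start remote ranks via qrsh, so those ranks inherit their
limits from sge_execd rather than from a login shell):

```
pe_name            openmpi
slots              128
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
```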
We can reliably replicate the failure above 32 cores.
So ... Dan suggested it could be stack size. As you can see, stack is
set to 8M (8192k).
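If the 8 MB stack is the culprit, one low-risk experiment is to raise
the soft limit inside the job script before mpirun launches the ranks
(a sketch; `my_app` and the slot count are placeholders):

```shell
#!/bin/bash
#$ -pe openmpi 64
#$ -cwd
# Raise the soft stack limit for this shell and everything it forks.
# This only works if the hard limit permits it; otherwise the node's
# limits.conf (or the execd startup environment) has to be bumped.
ulimit -s unlimited
mpirun -np $NSLOTS ./my_app
```

One caveat: with tight integration, ranks on remote nodes are started
by sge_execd via qrsh, so they inherit execd's limits rather than this
script's; the limit may need to be raised in the execd environment as
well for the change to reach all ranks.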
Has anyone run into anything like this?
We are going to test torque and possibly slurm on this machine as well
to see if they work with these test cases.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615