[GE users] MPICH h_vmem questions
dev_hyd2001 at yahoo.com
Fri Aug 4 10:04:28 BST 2006
I had two versions of my test program: one where I allocate unequal amounts of memory in each process, and another where I allocate the same amount in each process. To keep things simpler, today I tried the program that allocates the same amount of memory in each process.
Request: 5 slots with h_vmem=2g; the nodes show h_vmem=6.519G.
* In all the runs, 3 slots get allotted on one node and 2 slots on the other node.
* The maximum amount of memory that could successfully be allocated per process was 1.81G.
* When trying to allocate 1.82G per process, weird things happened: a few of the 5 MPICH processes linger on the nodes with 1.82G allocated, but GE reports that the job was killed, and I also see a net_recv error reported by the MPICH process(es) in the job's console output file.
* When trying to allocate 2G per process, I get "p4_error: interrupt SIGSEGV: 11" and the job is killed by GE.
Such detailed probing probably doesn't make sense for real-life applications, though. But can I conclude that, when running an application using MPICH processes and requesting h_vmem, I should request sufficiently more memory (perhaps 0.5G more) than what I expect the job to use?
Reuti <reuti at staff.uni-marburg.de> wrote: Hi,
Am 02.08.2006 um 22:59 schrieb Dev:
> Yes I'm on a 64 bit platform ( Opteron and SLES 9.0 ).
> The qmaster messages file says
> "tightly integrated parallel task 10419.1 task 3.node28 failed -
> killing job"
> and the node(s) messages files say
> reaping job 10419 ptf complains: Job does not exist
> I tried running the same program with just mpirun and a hostfile,
> the same command line arguments, and the same nodes, and it seems
> to work: I get the right amount of memory malloced in every MPICH
> process. Could it be that I'm doing something wrong with the complex
> configs and requests? I have the complex config for h_vmem set to 2g
> as the default value.
if you set h_vmem, SGE limits several ulimit settings:
data seg size (kbytes, -d)
stack size (kbytes, -s)
virtual memory (kbytes, -v)
But AFAIK they are multiplied on each node by the number of slots
granted on that machine. If you have a forked application this is okay,
but if you log in for another process via qrsh, this might be too high
a limit. This I put here:
So with 5 processes you would request 1.3 to get 6.5 in total, which
is lower than 2G. What happens if you request only half of it in
total, i.e. 3.25G? Is that working? I'm still puzzled that you get a
SIGSEGV, not a SIGKILL. At first glance this looks more like a problem
in your program, but as it works without SGE, it's strange. Did you
compile MPICH to use shared memory?