[GE users] MPICH h_vmem questions

Dev dev_hyd2001 at yahoo.com
Fri Aug 4 10:04:28 BST 2006



   I had two versions of my test program: one that allocates unequal amounts of memory for each process, and one that allocates the same amount for each process. To keep things simple, today I tried the version that allocates the same amount in each process.


Request for 5 slots and h_vmem=2g

nodes showing h_vmem=6.519G

The results:

* In all runs, 3 slots were allotted on one node and 2 slots on the other node.

* The maximum amount of memory that was successfully allocated per process was 1.81G.

* When trying to allocate 1.82G per process, strange things happened: a few of the 5 MPICH processes remained on the nodes holding 1.82G of memory each, but GE reported that the job was killed, and the job's console output file showed a net_recv error from the MPICH process(es).

* When trying to allocate 2G per process, I got "p4_error: interrupt SIGSEGV: 11" and the job was killed by GE.

Such detailed testing probably doesn't make sense for real-life applications. But can I conclude the following: when running an application using MPICH processes and requesting h_vmem, I should request sufficiently more memory (perhaps 0.5G more) than I expect the job to use?
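For example, a submission along these lines (a sketch only; the parallel environment name "mpich" and the script name are placeholders from my setup) would give each process roughly 0.5G of headroom over an expected 2G of use:

```sh
# Hypothetical request: 5 slots with 2.5G h_vmem per slot, for a job
# whose processes are expected to use about 2G each
qsub -pe mpich 5 -l h_vmem=2.5G myjob.sh
```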


Reuti <reuti at staff.uni-marburg.de> wrote: Hi,

On 02.08.2006 at 22:59, Dev wrote:

> hi,
> Yes I'm on a 64 bit platform ( Opteron and SLES 9.0 ).
> The qmaster messages file says
> "tightly integrated parallel task 10419.1 task 3.node28 failed -  
> killing job"
> and the node(s) messages files say
> reaping job  10419 ptf complains: Job does not exist
> I tried running the same program with just mpirun and a hostfile  
> and the same command line arguments and the same nodes and it seems  
> to work. I get the right amount of memory malloc'ed on every MPICH  
> process.
> Could it be that I'm doing something wrong with the complex configs  
> and requests? I have the complex config for h_vmem set to 2g as the  
> default value.

if you set h_vmem, SGE limits several process settings:

data seg size         (kbytes, -d)
stack size            (kbytes, -s)
virtual memory        (kbytes, -v)

But AFAIK they are multiplied on each node by the number of granted  
slots on that machine. For a forked application this is okay, but if  
you log in for another process via qrsh, this might be too high a  
limit. This I put here:


So with 5 processes you would request 1.3G per slot to get 6.5G in  
total, which is lower than 2G. What happens if you request only half  
of that in total, i.e. 3.25G; does that work? I'm still puzzled that  
you get a SIGSEGV rather than a SIGKILL. At first glance this looks  
more like a problem in your program, but since it works without SGE,  
it's strange. Did you compile MPICH to use shared memory?

-- Reuti

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net

