[GE users] MPICH h_vmem questions

Reuti reuti at staff.uni-marburg.de
Fri Aug 4 14:13:16 BST 2006


Am 04.08.2006 um 11:04 schrieb Dev:

> Hi,
>        I had two versions of my test program: one where I try  
> allocating unequal amounts of memory for each process, and the other  
> where I allocate the same amount of memory for each process. To  
> keep things simpler, today I tried the program where I  
> allocate the same amount of memory in each process.
> Setup
> Request for 5 slots and h_vmem=2g
> nodes showing h_vmem=6.519G
> The results
> * In all the runs 3 slots get allotted on one node and 2 slots on  
> the other node.
> * The maximum amount of memory that was successfully allocated for  
> each
>    process was 1.81G.
> * When trying to allocate 1.82G per process, weird things happened:  
> a few of the 5 MPICH processes hung around on the nodes allocating  
> 1.82G of memory, but GE reported that the job was killed, and I also  
> saw a net_recv error reported by the MPICH process(es) in the job's  
> console output file.

Thanks for the detailed info.

> When trying to allocate 2G per process, I get "p4_error:  
> interrupt SIGSEGV: 11" and the job is killed by GE.
> Probably it doesn't make sense to have such detailed tests in real-  
> life applications, though.
> But can I conclude that, when running an application with MPICH  
> processes and requesting h_vmem, I should request sufficiently more  
> memory, perhaps 0.5G more, than what I expect the job to use?

Yes, this makes sense. I tested this with our Gaussian03 application  
(serial only) some time ago. As the program code itself also counts  
against the requested h_vmem, h_vmem always needed to be set somewhat  
higher than the memory specified in the Gaussian03 input file. As  
this was cumbersome for the users, we didn't use h_vmem for this.  
Instead, I made the virtual_free complex a consumable and set it on  
each machine to the installed memory minus 100M for the operating  
system. This is only guidance for SGE: if users specify wrong amounts  
of memory, their jobs might start earlier, but they will be hurt  
afterwards when the system starts to swap.
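For reference, a sketch of that consumable setup (host name and memory  
value here are just examples; adjust to your cluster):

```shell
# Make virtual_free a consumable: qconf -mc opens the complex
# configuration in an editor; on the virtual_free line set the
# "consumable" column to YES and give a sensible default, e.g.:
#   virtual_free  vf  MEMORY  <=  YES  YES  0  0
qconf -mc

# Then set the per-host capacity to installed RAM minus ~100M
# reserved for the operating system (example host "node01" with
# 6.5G installed):
qconf -aattr exechost complex_values virtual_free=6.4G node01
```

With this in place, SGE subtracts each job's virtual_free request  
from the host's remaining capacity when scheduling, but does not  
enforce it as a hard limit the way h_vmem does.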

For us, this is a better option than getting a job killed only  
because it needs one byte more.
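To make the two submission styles concrete, a sketch (the PE name  
"mpich" and the job script name are assumptions based on this thread,  
and the headroom figure is the ~0.5G discussed above):

```shell
# With h_vmem (hard per-slot limit; the job is killed on excess),
# add headroom beyond what the processes themselves allocate:
qsub -pe mpich 5 -l h_vmem=2.5G mpich_job.sh

# With a consumable virtual_free (scheduling guidance only, no kill),
# request what the job is actually expected to use:
qsub -pe mpich 5 -l virtual_free=2G mpich_job.sh
```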

Cheers - Reuti

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
