[GE users] MPICH h_vmem questions

Reuti reuti at staff.uni-marburg.de
Thu Aug 3 20:34:50 BST 2006


Hi,

Am 02.08.2006 um 22:59 schrieb Dev:

> hi,
>
> Yes I'm on a 64 bit platform ( Opteron and SLES 9.0 ).
>
> The qmaster messages file says
>
> "tightly integerated parallel task 10419.1 task 3.node28 failed-  
> killing job"
>
> and the node(s) messages files say
>
> reaping job  10419 ptf complains: Job does not exist
>
> I tried running the same program with just mpirun and a hostfile
> and the same command line arguments and the same nodes, and it seems
> to work. I get the right amount of memory malloc'ed on every MPICH
> process.
>
> Could it be that I'm doing something wrong with the complex configs
> and requests? I have the complex config for h_vmem set to 2g as the
> default value.

if you set h_vmem, SGE limits several resources for the job by setting
the corresponding rlimits (the sketch below the list shows one way to
check them from inside a job):

data seg size         (kbytes, -d)
stack size            (kbytes, -s)
virtual memory        (kbytes, -v)
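
To see what a task really got, something like this little check program
could be started as the job script or via qrsh on a node (just a sketch
from my side, the name check_limits.c is made up; compile it with
"gcc -o check_limits check_limits.c"):

/* check_limits.c - print the rlimits which SGE set for this task */
#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) == 0)
        printf("%-13s soft=%llu hard=%llu\n", name,
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
}

int main(void)
{
    show("RLIMIT_DATA",  RLIMIT_DATA);   /* data seg size  (-d) */
    show("RLIMIT_STACK", RLIMIT_STACK);  /* stack size     (-s) */
    show("RLIMIT_AS",    RLIMIT_AS);     /* virtual memory (-v) */
    return 0;
}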

But AFAIK these limits are multiplied on each node by the number of
slots granted on that machine. For a forked application this is okay,
but if you start the additional processes by logging in with qrsh, the
resulting limit might be too high. I filed this here:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

So with 5 processes you would request 1.3G per slot to get 6.5G in
total, which is lower than the 2G default. What happens if you request
only half of it in total, i.e. 3.25G - does that work? I'm still
puzzled that you get a SIGSEGV and not a SIGKILL. At first glance this
looks more like a problem in your program, but as it works without SGE
it's strange. Did you compile MPICH to use shared memory?
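
Just a guess regarding the SIGSEGV: when the virtual memory rlimit is
reached, malloc() will return NULL, and writing through an unchecked
NULL pointer gives a SIGSEGV rather than a SIGKILL. A small sketch
(the 1.3 GB size is only an example) to make a hit limit visible as a
clean error message instead:

/* alloc_test.c - allocate and touch a block, checking malloc's result */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t bytes = (size_t)1300 * 1024 * 1024;   /* example: ~1.3 GB */
    char *buf = malloc(bytes);

    if (buf == NULL) {                  /* most likely the rlimit was hit */
        fprintf(stderr, "malloc of %zu bytes failed: %s\n",
                bytes, strerror(errno));
        return 1;
    }
    memset(buf, 0, bytes);              /* touch the pages so they count */
    printf("allocated and touched %zu bytes\n", bytes);
    free(buf);
    return 0;
}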

-- Reuti




