[GE users] MPICH h_vmem questions

Dev dev_hyd2001 at yahoo.com
Fri Aug 4 10:04:28 BST 2006



Hi,

       I had two versions of my test program: one where I allocate unequal amounts of memory in each process, and one where I allocate the same amount of memory in each process. To keep things simple, today I ran the version that allocates the same amount of memory in each process.
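
For reference, the gist of such a test program is along these lines (a
simplified sketch, not my exact code; each rank mallocs the requested
number of GB and memsets the buffer so the pages are really committed):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank = 0;
    /* amount to allocate per process, in GB (may be fractional, e.g. 1.81) */
    double gb = (argc > 1) ? atof(argv[1]) : 1.0;
    size_t bytes = (size_t)(gb * 1024.0 * 1024.0 * 1024.0);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    if (buf == NULL) {
        fprintf(stderr, "rank %d: malloc of %zu bytes failed\n", rank, bytes);
    } else {
        memset(buf, 1, bytes);  /* touch every page so it is really used */
        printf("rank %d: allocated %zu bytes\n", rank, bytes);
        free(buf);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}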



Setup

* Request for 5 slots and h_vmem=2g

* Nodes showing h_vmem=6.519G

The results 

* In all the runs 3 slots get allotted on one node and 2 slots on the other node.

* The maximum amount of memory that was successfully allocated for each   
   process was 1.81G.

* When trying to allocate 1.82G per process, weird things happen:

   A few of the 5 MPICH processes linger on the nodes allocating 1.82G of memory, but GE reports that the job was killed, and I also see a net_recv error reported by the MPICH process(es) in the job's console output file.


* When trying to allocate 2G per process, I get "p4_error: interrupt SIGSEGV: 11" and the job is killed by GE.


Such detailed limit probing probably doesn't make sense for real-life applications, though.

But can I conclude the following:

When running an application using MPICH processes and requesting h_vmem, I should request sufficiently more memory (perhaps 0.5G more) than what I expect the job to actually use?




/Dev











Reuti <reuti at staff.uni-marburg.de> wrote: Hi,

Am 02.08.2006 um 22:59 schrieb Dev:

> hi,
>
> Yes I'm on a 64 bit platform ( Opteron and SLES 9.0 ).
>
> The qmaster messages file says
>
> "tightly integerated parallel task 10419.1 task 3.node28 failed-  
> killing job"
>
> and the node(s) messages files say
>
> reaping job  10419 ptf complains: Job does not exist
>
> I tried running the same program with just mpirun and a hostfile,  
> the same command line arguments and the same nodes, and it seems  
> to work. I get the right amount of memory malloced in every MPICH  
> process.
>
> Could it be that I'm doing something wrong with the complex configs  
> and requests? I have the complex config for h_vmem set to 2g as the  
> default value.

if you set h_vmem, various settings are limited by SGE by setting  
the corresponding rlimits:

data seg size         (kbytes, -d)
stack size            (kbytes, -s)
virtual memory        (kbytes, -v)

But AFAIK they are multiplied on each node by the number of slots  
granted on that machine. If you have a forked application this is  
okay, but if you log in for another process by qrsh, this might be  
too high a limit. I put this here:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
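
(As an illustration - a minimal sketch, not part of the original mail -
each task could print these limits with POSIX getrlimit to verify what
SGE actually set on its node:)

#include <stdio.h>
#include <sys/resource.h>

/* print the soft and hard values of one resource limit */
static void show(const char *name, int which)
{
    struct rlimit rl;
    if (getrlimit(which, &rl) == 0)
        printf("%-12s soft=%lu hard=%lu\n", name,
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
}

int main(void)
{
    show("RLIMIT_DATA",  RLIMIT_DATA);   /* data seg size  (ulimit -d) */
    show("RLIMIT_STACK", RLIMIT_STACK);  /* stack size     (ulimit -s) */
    show("RLIMIT_AS",    RLIMIT_AS);     /* virtual memory (ulimit -v) */
    return 0;
}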

So with 5 processes you would request 1.3G to get 6.5G in total,  
which is lower than 2G. What happens if you request only half of it  
in total, i.e. 3.25G - does that work? I'm still puzzled that you  
get a SIGSEGV, not a SIGKILL. At first glance this looks more like  
a problem in your program, but as it's working without SGE it's  
strange. Did you compile MPICH to use shared memory?

-- Reuti




 		


