[GE users] MPICH h_vmem questions

Dev dev_hyd2001 at yahoo.com
Wed Aug 2 21:59:15 BST 2006



hi,

Yes, I'm on a 64-bit platform (Opteron and SLES 9.0).

The qmaster messages file says

"tightly integerated parallel task 10419.1 task 3.node28 failed- killing job"

and the node(s) messages files say

reaping job  10419 ptf complains: Job does not exist

I tried running the same program with just mpirun and a hostfile, with the same command line arguments and the same nodes, and it seems to work: the right amount of memory gets malloced in every MPICH process.
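
For reference, the standalone run looks roughly like this (the program name
and the per-rank allocation argument are just placeholders):

  mpirun -np 8 -machinefile hostfile ./my_mpi_malloc 800   # placeholder program and per-rank allocation argument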

Could it be that I'm doing something wrong with the complex configuration and the resource requests? I have the complex configuration for h_vmem set to 2g as the default value.
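
In case it helps, this is how I'm checking the complex definition and the
per-host setting (node28 is just the node from the qmaster message above):

  qconf -sc | grep h_vmem                  # complex definition: requestable/consumable flags and default
  qconf -se node28 | grep complex_values   # per-host h_vmem value, if set there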


/Dev



Reuti <reuti at staff.uni-marburg.de> wrote: On 02.08.2006 at 19:03, Dev wrote:

> ""Which rank is allocating how much memory? Is it proper  
> distributed?""
>
> I tried it with the same amount of memory allocated on all the
> MPICH processes (that's my understanding, at least: doing a top on the
> nodes shows four processes with the same name as my program, each with
> the same amount of memory), and it still behaves the same way.
>
>
>
> Are you using MPICH with qrsh or shared memory with forks?
>
> As far as I know I'm using MPICH with qrsh, and I also have the -V
> option in $SGE_ROOT/mpi/rsh for tight integration.
>
> BTW, when this happens the MPICH process gets a SIGSEGV (signal 11).

This is not necessarily coming from SGE. Can you please check in the
"messages" files on the nodes whether it detected that the h_vmem limit
was exceeded? If the process were killed by the kernel, I would expect a
SIGKILL, not a SIGSEGV. Likewise, if SGE detects that the limit was
exceeded, it will send a SIGKILL.
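
Something like this should show it (assuming the default cell name and the
classic spool layout; node28 is taken from the qmaster message, adjust the
path and host for your setup):

  grep 10419 $SGE_ROOT/default/spool/node28/messages       # execd log entries for the job
  grep -i h_vmem $SGE_ROOT/default/spool/node28/messages   # any limit-related entries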

You are on a 64-bit platform and running a 64-bit program which runs  
fine without SGE?

-- Reuti

> /Dev
>
>
> Reuti wrote: On 02.08.2006 at 15:17, Dev wrote:
>
> > Hi,
> >
> > A few questions about h_vmem and MPICH.
> >
> > Setup
> >
> > Two nodes, each with h_vmem set to requestable and consumable; the
> > value shows as 6.519G when doing a qhost -F | grep h_vmem.
> >
> > Both the nodes have 4 slots each.
> >
> >
> > I have a very simple MPICH program which doesn't do anything useful
> > except malloc memory in each MPICH process, based on a command
> > line parameter.
> >
> > Each MPICH process mallocs different amounts of memory based on a
> > command line parameter specified by the user.
> >
> > I start the MPI program by doing a qsub, requesting h_vmem
> > and a parallel environment which includes both the nodes mentioned
> > above.
> >
> > The whole purpose is just to increase my understanding of how
> > h_vmem behaves when launching MPICH programs; I'm not yet an
> > expert MPICH programmer.
> >
> >
> >
> > I do
> >
> > qsub -pe mp_pe 8 -l h_vmem=1g submit_mympi.sh
> >
> > In submit_mympi.sh I ask the MPI program to allocate a total of
> > approximately 6G (or a bit more, closer to 6.5), and everything
> > works fine, with the MPICH processes distributed between both
> > nodes.
> >
> > Then I do
> >
> > qsub -pe mp_pe 5 -l h_vmem=2g submit_mympi.sh
>
> Which rank is allocating how much memory? Is it properly distributed?
> Are you using MPICH with qrsh or shared memory with forks? - Reuti
>
>
> > In submit_mympi.sh I ask the MPI program to allocate a total of
> > approximately 6.5G, and the MPICH processes get killed.
> >
> > What I expected in the above case was for the program to run, since
> > the total amount of h_vmem requested was more than what the
> > program tried to allocate with 4 MPICH processes on each node.
> > (The total mallocing attempted on a single node would certainly
> > have been less than 6.5 GB, which is the h_vmem setting on each
> > node.)
> >
> > Is it the intended behavior that, even if the total h_vmem
> > requested is more than the total memory the job would use, the
> > job gets killed if the memory allocated is more than the value of
> > h_vmem on the head node?
> >
> >
> > cheers
> >
> > /Dev
> >
> >
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



 		