[GE users] MPICH 1.2.5.2 and Signals

Brian R. Smith brian at cypher.acomp.usf.edu
Wed Oct 27 21:44:13 BST 2004


To all:

Thanks for the help.  I figured out the problem and fixed it.  No more
runaway processes.

Exporting P4_RSHCOMMAND=rsh in my 'mpirun' scripts (as Reuti suggested) seems
to have fixed the problem.
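
For reference, the relevant part of the script now looks roughly like this
(the paths are from my setup, so adjust as needed):

   # MPICH's P4 device calls plain "rsh"; SGE puts $TMPDIR (which holds the
   # link to its rsh wrapper) first in $PATH, so this resolves to the wrapper
   # and ends up as "qrsh -inherit" on the nodes
   export P4_RSHCOMMAND=rsh
   $MPIR_HOME/mpirun -np $NSLOTS -machinefile $TMPDIR/machines $RUN_HOME/bbmark01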

Thanks

Brian

On Wed, 2004-10-27 at 16:37 -0400, Brian R. Smith wrote:
> Reuti,
> 
> I changed my P4_RSHCOMMAND to rsh and removed the -nolocal from
> 'mpirun'.  I'm still encountering the same problem, BUT I noticed that some
> zombie rsh processes no longer show up.  The job's "head node" still
> runs the MPI process after I tell it to die.
> 
> 
> Brian
> 
> On Wed, 2004-10-27 at 22:24 +0200, Reuti wrote:
> > Hi there,
> > 
> > > I just joined the list and have my first question to shoot: has the
> > > problem with MPICH tight integration been resolved yet?  I am running
> > > SGE 6.0u1 with MPICH 1.2.5.2 and have tight integration set up.  My
> > > mpirun scripts all point to "/usr/local/sge/mpi/rsh" (it's NFS-mounted).
> > > I have exported the MPICH_PROCESS_GROUP=no variable and have modified
> > > "/usr/local/sge/mpi/rsh" to include the -V option on all the "qrsh"
> > > lines.
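> > > For completeness, the qrsh lines in my copy of the wrapper now read roughly
> > > like this (the exact line differs between SGE versions, so take it only as
> > > an illustration):
> > > 
> > >    exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V -nostdin $rhost $cmd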
> > 
> > What exactly do you mean by "mpirun scripts all point to"? The idea behind 
> > SGE's rsh wrapper is to extend the $PATH on the execution node (master) to 
> > the form $TMPDIR:$PATH, and startmpi.sh then creates a symbolic link in this 
> > directory that points to /usr/local/sge/mpi/rsh. The rsh wrapper itself later 
> > removes $TMPDIR again to get at the final rsh. So exporting something like 
> > P4_RSHCOMMAND=/usr/local/sge/mpi/rsh would lead to some weird effects (is this 
> > what you meant?). Just set it to plain rsh. What is compiled into your program 
> > as the rsh command?
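> > 
> > Roughly, the mechanics look like this (the paths are taken from your setup, 
> > so treat it only as an illustration):
> > 
> >    # startmpi.sh -catch_rsh puts a link to the wrapper into $TMPDIR
> >    ln -s /usr/local/sge/mpi/rsh $TMPDIR/rsh
> >    # the job runs with the wrapper first in the search path
> >    PATH=$TMPDIR:$PATH
> >    # a plain "rsh" therefore resolves to $TMPDIR/rsh, which in turn calls
> >    # "qrsh -inherit" and gives you the tight integration
> >    export P4_RSHCOMMAND=rsh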
> >  
> > > Here is my Parallel environment configuration:
> > > 
> > > pe_name           mpich
> > > slots             6
> > > user_lists        NONE
> > > xuser_lists       NONE
> > > start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > > stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
> > > allocation_rule   $round_robin
> > > control_slaves    TRUE
> > > job_is_first_task FALSE
> > > urgency_slots     min
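> > > 
> > > # For reference (assuming the PE really is named mpich, as above):
> > > #   qconf -sp mpich    # prints the PE as shown here
> > > #   qconf -mp mpich    # edits it, e.g. to change job_is_first_task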
> > > 
> > > Here's an example of a submit script I am using:
> > > 
> > > #!/bin/bash
> > > #$ -v MPIR_HOME=/usr/local/mpich-intel/bin
> > > #$ -N rhog
> > > #$ -pe mpich 2
> > > #$ -S /bin/bash
> > > #$ -q all.q
> > > #$ -e /home/student/b/brs/bbmark/stderr
> > > #$ -o /home/student/b/brs/bbmark/stdout
> > > ##############
> > > export LD_LIBRARY_PATH=/usr/local/intel/cc/lib
> > > export MPICH_PROCESS_GROUP=no
> > > RUN_HOME=/home/student/b/brs/bbmark
> > > 
> > > cd $RUN_HOME
> > > 
> > > # Single processor
> > > #./bbmark01
> > > 
> > > # Multi-processor
> > > $MPIR_HOME/mpirun -nolocal -np $NSLOTS -machinefile $TMPDIR/machines \
> > >     $RUN_HOME/bbmark01
> > 
> > Well, with -nolocal the master node will not be used at all; this option 
> > should be removed (and the PE should then get job_is_first_task TRUE). 
> > Otherwise the entry for this machine in the machines file is simply ignored. 
> > Can you please check how the processes are distributed to the nodes (are you 
> > using two network cards)?
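> > 
> > Roughly, the call would then become (only a sketch, with your paths):
> > 
> >    # PE with job_is_first_task TRUE, mpirun without -nolocal:
> >    $MPIR_HOME/mpirun -np $NSLOTS -machinefile $TMPDIR/machines $RUN_HOME/bbmark01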
> >  
> > > I end up with one process still running on the first node of the job's node
> > > group while all of the other processes are killed.  How do I correct this?
> > 
> > Can you submit a simple job like:
> > 
> > #include <stdio.h>
> > #include <mpi.h>
> > 
> > int main(int argc, char **argv)
> > {
> >    int node;
> >    int i;
> >    float f;
> > 
> >    MPI_Init(&argc, &argv);
> >    MPI_Comm_rank(MPI_COMM_WORLD, &node);
> > 
> >    printf("Hello World from Node %d.\n", node);
> > 
> >    /* busy loop forever on purpose, so the processes stay around and you
> >       can watch what happens to them when the job is removed */
> >    for (;;)
> >       for (i = 0; i <= 100000; i++)
> >          f = i*2.718281828*i + i + i*3.141592654;
> > 
> >    /* never reached because of the loop above */
> >    MPI_Finalize();
> >    return 0;
> > }
> > 
> > and check the output of 'ps' to see whether everything is started in the correct way?
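> > 
> > Something along these lines (the file name is just an example, and the paths
> > are taken from your submit script):
> > 
> >    # compile with MPICH's wrapper compiler
> >    /usr/local/mpich-intel/bin/mpicc -o mpihello mpihello.c
> >    # submit it through the same PE, then check on the nodes:
> >    ps -ef | grep mpihello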
> > 
> > Cheers - Reuti
> > 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



