[GE users] MPICH 1.2.5.2 and Signals

Brian R. Smith brian at cypher.acomp.usf.edu
Wed Oct 27 21:37:08 BST 2004


Reuti,

I changed my P4_RSHCOMMAND to rsh and removed the -nolocal flag from
'mpirun'.  I'm still encountering the same problem, but I noticed that some
of the zombie rsh processes no longer show up.  The job's "head node" still
runs the MPI process after I tell the job to die.
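
A quick way to check where that leftover process sits is something like the
sketch below (an editorial sketch, not from the thread itself; the binary name
is taken from the submit script further down and the job id is made up):

    # On the job's "head node": show the surviving MPI process together with
    # its process group and session, and compare with the job's sge_shepherd.
    # MPICH_PROCESS_GROUP=no affects whether the process stays in the group
    # that SGE kills on qdel.
    ps -eo pid,ppid,pgid,sess,args | grep bbmark01
    ps -eo pid,pgid,args | grep sge_shepherd-4711   # 4711 = made-up job id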


Brian

On Wed, 2004-10-27 at 22:24 +0200, Reuti wrote:
> Hi there,
> 
> > I just joined the list and have my first question to shoot: Has the
> > problem with MPICH tight-integration been resolved yet?  I am running
> > SGE 6.0u1 with MPICH 1.2.5.2.  I have tight integration set up.  My
> > mpirun scripts all point to "/usr/local/sge/mpi/rsh" (it's NFS-mounted).
> > I have exported the MPICH_PROCESS_GROUP=no variable and have modified
> > the "/usr/local/sge/mpi/rsh" to include the -V option on all the "qrsh"
> > lines.
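
For reference, the -V change described here boils down to something like the
line below inside /usr/local/sge/mpi/rsh (a sketch only; the wrapper text
differs between SGE releases, and rhost/cmd stand in for whatever variable
names the script really uses):

    # qrsh replaces the plain rsh call; -V exports the job's environment
    # (e.g. MPICH_PROCESS_GROUP=no) to the remote task as well.
    exec qrsh -V -inherit -nostdin $rhost $cmd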
> 
> what do you mean exactly by "mpirun scripts all point to"? The idea behind
> the rsh wrapper of SGE is to extend the $PATH on the execution node (master)
> in the form $TMPDIR:$PATH, and startmpi.sh creates a symbolic link in this
> directory to /usr/local/sge/mpi/rsh. The rsh wrapper will then remove
> $TMPDIR from the $PATH again to reach the real rsh. So exporting something
> like P4_RSHCOMMAND=/usr/local/sge/mpi/rsh would lead to some weird effects
> (did you mean this?). Just set it to rsh. What is compiled into the program
> as the rsh command?
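
A minimal sketch of that mechanism, with the paths from this thread (an
illustration of the idea only, not the actual SGE scripts):

    # SGE puts the job's private $TMPDIR at the front of $PATH on the master
    # node, and startmpi.sh -catch_rsh drops a symlink named "rsh" in it
    # that points at the wrapper:
    export PATH=$TMPDIR:$PATH
    ln -s /usr/local/sge/mpi/rsh $TMPDIR/rsh
    # When mpirun later calls plain "rsh", it hits the wrapper first; the
    # wrapper strips $TMPDIR from $PATH again and turns the call into a
    # "qrsh -inherit ..." under SGE's control.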
>  
> > Here is my Parallel environment configuration:
> > 
> > pe_name           mpich
> > slots             6
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
> > allocation_rule   $round_robin
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> > 
> > Here's an example of a submit script I am using:
> > 
> > #!/bin/bash
> > #$ -v MPIR_HOME=/usr/local/mpich-intel/bin
> > #$ -N rhog
> > #$ -pe mpich 2
> > #$ -S /bin/bash
> > #$ -q all.q
> > #$ -e /home/student/b/brs/bbmark/stderr
> > #$ -o /home/student/b/brs/bbmark/stdout
> > ##############
> > export LD_LIBRARY_PATH=/usr/local/intel/cc/lib
> > export MPICH_PROCESS_GROUP=no
> > RUN_HOME=/home/student/b/brs/bbmark
> > 
> > cd $RUN_HOME
> > 
> > # Single
> > #./bbmark01
> > 
> > # Multi-processor
> > $MPIR_HOME/mpirun -nolocal -np $NSLOTS -machinefile $TMPDIR/machines $RUN_HOME/bbmark01
> 
> Well, using -nolocal means the master node will not be used at all, so it
> should be removed (and the PE should then get job_is_first_task TRUE).
> Otherwise the entry for this machine in the machines file is simply ignored.
> Can you please check the distribution of the processes across the nodes (are
> you using two network cards)?
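
Put concretely, that suggestion amounts to something like this (a sketch;
'mpich' is the PE name from the configuration above):

    # Let the job script's own slot count as the first MPI task:
    #   qconf -mp mpich        ->  job_is_first_task  TRUE
    # ...and drop -nolocal, so one process runs on the master node too:
    $MPIR_HOME/mpirun -np $NSLOTS -machinefile $TMPDIR/machines $RUN_HOME/bbmark01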
>  
> > I end up with a process still running on the first node of the job node
> > group with all of the other processes killed.  How do I correct this?
> 
> Can you submit a simple job like:
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>    int node;
>    int i;
>    float f;
> 
>    MPI_Init(&argc, &argv);
>    MPI_Comm_rank(MPI_COMM_WORLD, &node);
> 
>    printf("Hello World from Node %d.\n", node);
> 
>    /* Busy-loop forever so the processes stay alive and their placement
>       can be inspected with ps on each node. */
>    for (;;)
>       for (i = 0; i <= 100000; i++)
>          f = i*2.718281828*i + i + i*3.141592654;
> 
>    /* Not reached; remove the job with qdel when done. */
>    MPI_Finalize();
>    return 0;
> }
> 
> and look at the output of 'ps' to check whether everything is started in the
> correct way.
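
(To try that: a minimal compile-and-submit sketch, assuming MPICH's mpicc sits
next to mpirun and reusing the PE and queue from the submit script above; the
file names are made up.)

    /usr/local/mpich-intel/bin/mpicc -o mpihello mpihello.c
    qsub -pe mpich 2 -q all.q -S /bin/bash ./run_mpihello.sh
    # On the nodes, the slave processes should then hang below sge_shepherd
    # (tight integration) rather than below rshd:
    ps -eo pid,ppid,args | egrep 'sge_shepherd|mpihello'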
> 
> Cheers - Reuti
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



