[GE users] MPICH 1.2.5.2 and Signals

Reuti reuti at staff.uni-marburg.de
Wed Oct 27 21:24:07 BST 2004



Hi there,

> I just joined the list and have my first question to shoot: Has the
> problem with MPICH tight-integration been resolved yet?  I am running
> SGE 6.0u1 with MPICH 1.2.5.2.  I have tight integration set up.  My
> mpirun scripts all point to "/usr/local/sge/mpi/rsh" (it's NFS-mounted).
> I have exported the MPICH_PROCESS_GROUP=no variable and have modified
> the "/usr/local/sge/mpi/rsh" to include the -V option on all the "qrsh"
> lines.

what exactly do you mean by "mpirun scripts all point to"? The idea behind
SGE's rsh wrapper is to extend the $PATH on the execution node (the master node
of the job) to something like $TMPDIR:$PATH, and startmpi.sh creates a symbolic
link in this directory pointing to /usr/local/sge/mpi/rsh. The rsh wrapper will
then remove $TMPDIR from the $PATH again to reach the real rsh. So, exporting
something like P4_RSHCOMMAND=/usr/local/sge/mpi/rsh would lead to some weird
effects (did you mean this?). Just set it to rsh. What is compiled into the
program as the rsh command?
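
For illustration, a simplified sketch of what happens under tight integration
(the real startmpi.sh and the rsh wrapper in $SGE_ROOT/mpi do more checking,
and the exact qrsh call below is shortened):

# startmpi.sh -catch_rsh: put a fake "rsh" first in the PATH of the job
ln -s /usr/local/sge/mpi/rsh $TMPDIR/rsh
# inside the job the PATH looks like $TMPDIR:$PATH, so mpirun picks up
# $TMPDIR/rsh instead of the real rsh

# mpi/rsh wrapper: take $TMPDIR out of the PATH again (so the real rsh
# could still be reached) and, with control_slaves TRUE, hand the slave
# task to SGE via qrsh, so it stays under SGE's control
PATH=`echo $PATH | sed "s,$TMPDIR:,,"`
exec qrsh -inherit -V "$@"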
 
> Here is my Parallel environment configuration:
> 
> pe_name           mpich
> slots             6
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
> 
> Here's an example of a submit script i am using:
> 
> #!/bin/bash
> #$ -v MPIR_HOME=/usr/local/mpich-intel/bin
> #$ -N rhog
> #$ -pe mpich 2
> #$ -S /bin/bash
> #$ -q all.q
> #$ -e /home/student/b/brs/bbmark/stderr
> #$ -o /home/student/b/brs/bbmark/stdout
> ##############
> export LD_LIBRARY_PATH=/usr/local/intel/cc/lib
> export MPICH_PROCESS_GROUP=no
> RUN_HOME=/home/student/b/brs/bbmark
>
> cd $RUN_HOME
>
> # Single
> #./bbmark01
>
> # Multi-processor
> $MPIR_HOME/mpirun -no-local -np $NSLOTS -machinefile $TMPDIR/machines \
>    $RUN_HOME/bbmark01

Well, with -no-local the master node will not be used at all; this option 
should be removed (and the PE should get job_is_first_task TRUE). Otherwise 
the entry for this machine in the machines file is simply ignored. Can you 
please also check the distribution of the processes to the nodes (are you 
using two network cards)?
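
Untested, but the mpirun line in your submit script would then simply be
(with job_is_first_task set to TRUE in the PE):

# Multi-processor: no -no-local, so the master node is used as well
$MPIR_HOME/mpirun -np $NSLOTS -machinefile $TMPDIR/machines $RUN_HOME/bbmark01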
 
> I end up with a process still running on the first node of the job node
> group with all of the other processes killed.  How do I correct this?

Can you submit a simple job like the following:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int node;
   int i;
   float f;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &node);

   printf("Hello World from Node %d.\n", node);

   /* busy loop forever, so the processes stay alive and can be
      inspected with ps on the nodes; kill the job with qdel */
   for (;;)
      for (i = 0; i <= 100000; i++)
         f = i * 2.718281828 * i + i + i * 3.141592654;

   /* never reached */
   MPI_Finalize();
   return 0;
}

and look at the output of 'ps' on the nodes to check whether everything is 
started in the correct way?
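
For example (the file name mpihello.c and the mpicc location are just
assumptions, adjust them to your installation):

# compile the test program and submit it with the same PE as above
/usr/local/mpich-intel/bin/mpicc -o mpihello mpihello.c
qsub -pe mpich 2 your_submit_script

# then log in to the master node of the job and check the process tree
ps -e f
# all processes of the job should hang below the sge_shepherd and be
# started via "qrsh -inherit", not via a plain rsh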

Cheers - Reuti
