[GE users] MPICH and SGE: not getting correct number of PEs in program

m. r. schaferkotter schaferk at bellsouth.net
Thu Jul 29 20:14:41 BST 2004


greetings:

i'm trying to run an ocean model with MPICH and SGE.
the model runs fine without SGE with:
mpirun -np 4 -machinefile machinehosts ${EXE}
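
(machinehosts is just a hand-written MPICH machinefile, one host per line, roughly:

mach1
mach2
mach3
mach4

for the four nodes i want to use.)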

i read the README about loose integration and probably don't have
things set up correctly.

the problem is that although qstat shows four nodes running, a sanity check
in the program reports NPES=1 and the program quits.
i use a wrapper script which submits the job script.

things get going with:

qsub -V -j y -S /bin/sh job.intel_mpich.sh

more job.intel_mpich.sh

#!/bin/sh
#$ -pe mpich -4
#$ -v MPIR_HOME=/common/mpich/mpich-1.2.5-pgi5
#$ -v PWD=/net/machine/export/disk1/me/mpi/expt_1.8.2
#$ -l arch=glinux
#$ -o /net/machine/export/disk1/me/mpi/expt_1.8.2
#$ -e /net/machine/export/disk1/me/mpi/expt_1.8.2
echo "NHOSTS     : "$NHOSTS
echo "NSLOTS     : "$NSLOTS
echo "NQUEUES    : "$NQUEUES
echo "PE         : "$PE
echo "PE_HOSTFILE: "$PE_HOSTFILE

all environment variables are exported, and then the executable is invoked directly:
./${EXE}
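
for comparison, the mpi README's template seems to launch through mpirun, using
the machines file that startmpi.sh writes into $TMPDIR. something like this, if
i read it right (the path to the machines file is my guess from the template):

# loose-integration launch as i understand the README template (not what my script does yet)
$MPIR_HOME/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./${EXE}

whereas my script just starts ./${EXE} on its own.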

the run (a fortran program) produces:

[me at machine expt_1.8.2]$ more job.o237

NHOSTS     : 4
NSLOTS     : 4
NQUEUES    : 4
PE         : mpich
PE_HOSTFILE: /u/gridware/default/spool/mach1/active_jobs/237.1/pe_hostfile
mach1 1 mach1.q UNDEFINED
mach2 1 mach2.q UNDEFINED
mach3 1 mach3.q UNDEFINED
mach4 1 mach4.q UNDEFINED
datestart  Thu Jul 29 13:13:21 CDT 2004
IN   /net/machine/export/disk1/me/mpi/expt_1.8.2/input
OUT  /net/machine/export/disk1/me/mpi/expt_1.8.2/output

SPMD processor layout:
    ipr     =    2
    jpr     =    2
    jqr     =    4
    iprsum  =    4
    jprsum  =    4


  ***** ERROR - WRONG MPI SIZE *****

  NPES    =             1
      JQR =             4
  IPR,JPR =             2            2

stop
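
the pe_hostfile shown above has one slot on each of the four hosts. as far as
i can tell, startmpi.sh is supposed to boil that down to a plain machines file
for mpirun, roughly this (my paraphrase, not the actual script):

awk '{ for (i = 0; i < $2; i++) print $1 }' $PE_HOSTFILE > $TMPDIR/machines

so the hosts do seem to be handed to the PE correctly.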

the sanity check in the fortran program:

       call mpi_init(mpierr)
       call mpi_comm_rank(mpi_comm_world, mype, mpierr)
       call mpi_comm_size(mpi_comm_world, npes, mpierr)
c
c     do we have the right number of pes?
c
       if     (npes.ne.jqr) then
         if     (mnproc.eq.1) then
           write(6,*)
           write(6,*) '***** ERROR - WRONG MPI SIZE *****'
           write(6,*)
           write(6,*) 'NPES    = ',npes
           write(6,*) '    JQR = ',    jqr
           write(6,*) 'IPR,JPR = ',ipr,jpr
           write(6,*)
           call zhflsh(6)
         endif
         call xcstop('Error in xcspmd')
         stop
       endif

----

during the run, qstat shows the four nodes:

[me at machine util]$ qstat -u me
job-ID  prior name       user         state submit/start at     queue    master  ja-task-ID
---------------------------------------------------------------------------------------------
     237     0 job      me     t     07/29/2004 13:18:09 mach4.q  SLAVE
     237     0 job      me     t     07/29/2004 13:18:09 mach1.q  MASTER
     237     0 job      me     t     07/29/2004 13:18:09 mach1.q  SLAVE
     237     0 job      me     t     07/29/2004 13:18:09 mach2.q  SLAVE
     237     0 job      me     t     07/29/2004 13:18:09 mach3.q  SLAVE

finally the mpich pe is defined as:

[me at machine expt_1.8.2]$ qconf -sp mpich
pe_name           mpich
queue_list        all
slots             20
user_lists        NONE
xuser_lists       NONE
start_proc_args   /u/gridware/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /u/gridware/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
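
to see whether startmpi.sh is actually doing its part, i'm planning to add a
few debug lines to the job script before the executable is started (i'm
assuming -catch_rsh leaves a machines file and an rsh wrapper in $TMPDIR; i
haven't verified the names):

# debug: check what the PE start script left in TMPDIR
echo "TMPDIR     : "$TMPDIR
ls -l $TMPDIR
cat $TMPDIR/machines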


this is my first attempt with MPICH and SGE. it looks like there is a problem
in the MPI environment under SGE, since the same program runs fine without SGE
with mpirun -np 4 -machinefile machinehosts ${EXE}.

what have i left out?

michael

