[GE users] mpi problems

Roberta Gigon RGigon at slb.com
Mon Apr 28 16:23:49 BST 2008


Hi there,
I'm having a few issues getting MPICH2 to work under SGE. I have an MPI job that works just fine under PBS and outside of SGE, so I'm fairly confident that MPI itself is working.

Some background:
I have a PE (parallel environment) called mpi with these characteristics:


[root@bear ~]$ qconf -sp mpi
pe_name           mpi
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /opt/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

I have a queue called mpi.q with 6 dual-processor nodes (12 slots).

I submit the job like this:  qsub -q mpi.q -pe mpi 6 -cwd ./sbt034.csh

sbt034.csh:
#! /bin/tcsh

#$ -q mpi.q
#$ -j y
#$ -o testSGE2.out
#$ -N testSGE2
#$ -cwd
#$ -pe mpi 6

echo running...
echo $TMPDIR
/usr/local/mpich2-1.0.4p1-pgi-k8-64/bin/mpirun -np 6 /people8/tzhou/mcnprun/SUN/bin/mcnp 5j.mpi i=sbt034 wwinp=sbwwmx05 eol
echo done!

qstat says:

tzhou@bear[162] qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   6862 0.56000 testSGE2   tzhou        r     04/28/2008 10:52:48 mpi.q@bear72.cl.slb.com            6

error file says:
master starting       5 tasks with       1 threads each  **/**/08 **:**:10
 master sending static commons...
 master sending dynamic commons...
 master sending cross section data...
PGFIO/stdio: No such file or directory
PGFIO-F-/OPEN/unit=32/error code returned by host stdio - 2.
 In source file msgtsk.f90, at line number 116
PGFIO/stdio: No such file or directory
PGFIO-F-/OPEN/unit=32/error code returned by host stdio - 2.
 In source file msgtsk.f90, at line number 116
rank 4 in job 4  bear75.cl.slb.com_47485   caused collective abort of all ranks
  exit status of rank 4: killed by signal 9
done!
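
Since the PGFIO message is error code 2 ("No such file or directory"), one thing that could be checked is whether the input files named on the mcnp command line are visible from the directory the job actually runs in. The following is only a sketch in plain sh (not tcsh): the helper name missing_files is made up for illustration, and it is an assumption that unit 32 refers to one of these two files at all.

```shell
#!/bin/sh
# Hypothetical helper: print every named file that is missing or
# unreadable in the current working directory.
missing_files() {
    for f in "$@"; do
        [ -r "$f" ] || echo "$f"
    done
}

# Show where the check ran, then list any missing input files.
# sbt034 and sbwwmx05 are the i= and wwinp= names from the mcnp command line.
echo "host: $(uname -n)  cwd: $(pwd)"
missing_files sbt034 sbwwmx05
```

If this prints nothing after the host line, both files are readable on that host; running it on each node in mpi.q would show whether the working directory is shared.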

The $TMPDIR gets set properly...
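
In the same way, the hosts and slots that SGE granted can be inspected from inside the job. This is a minimal sketch in plain sh (not tcsh), relying on the PE_HOSTFILE and NSLOTS variables that SGE sets for parallel jobs and on the hostfile layout "host slots queue processor-range"; the function names pe_hosts and pe_total_slots are made up for illustration.

```shell
#!/bin/sh
# Diagnostic sketch: summarize what SGE granted to this parallel job.

# Print "host slots" for every line of a PE hostfile.
pe_hosts() {
    awk '{ print $1, $2 }' "$1"
}

# Print the total number of slots listed in a PE hostfile.
pe_total_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Only report when running inside an SGE parallel job.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "NSLOTS=${NSLOTS:-unset}"
    pe_hosts "$PE_HOSTFILE"
    echo "total slots: $(pe_total_slots "$PE_HOSTFILE")"
fi
```

The total could then be compared with the -np 6 passed to mpirun, and the host list with the hosts the ranks actually start on.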

Any thoughts on what might be happening here?

Many thanks,
Roberta

---------------------------------------------------------------------------------------------
Roberta M. Gigon
Schlumberger-Doll Research
One Hampshire Street, MD-B253
Cambridge, MA 02139
617.768.2099 - phone
617.768.2381 - fax

This message is considered Schlumberger CONFIDENTIAL.  Please treat the information contained herein accordingly.


