[GE users] mpi problems

Reuti reuti at staff.uni-marburg.de
Mon Apr 28 17:06:35 BST 2008



Hi,

On 28.04.2008 at 17:23, Roberta Gigon wrote:

> I'm having a few issues with getting MPICH-2 to work under SGE. I
> have an mpi job that works just fine with PBS and outside of SGE,
> so I'm pretty confident in saying that MPI itself is working.

The included $SGE_ROOT/mpi is only for MPICH(1). There is a Howto for
MPICH2:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

Note that MPICH2 can be compiled in at least 4 different ways, and an
application compiled against one of them must be started with the
matching mpirun and SGE PE. Which type of startup do you want to use?

Anyway: you have no -machinefile or similar in your mpirun call,
hence everything will run locally. And how is it getting from bear72
to bear75? Do you have a predefined mpd.hosts which could trigger this?
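
To illustrate the -machinefile point, here is a minimal sketch (not
taken from the Howto) that turns the host list SGE grants to the job
into a machinefile. It assumes the usual $PE_HOSTFILE layout of
"host slots queue processor-range" per line. The function name
pe_to_machinefile is made up for this example, and the "host:slots"
output form is what I would expect an mpd-based mpirun to accept, so
please check it against your MPICH2 build. It is written for sh, so it
would need adapting for a tcsh job script.

```shell
#!/bin/sh
# Sketch: convert an SGE PE hostfile into a machinefile for mpirun.
# Assumed input format, one line per granted host:
#   <hostname> <slots> <queue instance> <processor range>

# pe_to_machinefile is a hypothetical helper, not part of SGE or MPICH2.
# $1 = path to the PE hostfile; prints one "host:slots" line per host.
pe_to_machinefile() {
    # Skip blank or malformed lines that lack a slot count.
    awk 'NF >= 2 { print $1 ":" $2 }' "$1"
}

# Possible use inside a job script (assumes the mpd ring already runs
# on the granted hosts; the program name is a placeholder):
#   pe_to_machinefile "$PE_HOSTFILE" > "$TMPDIR/machines"
#   mpirun -machinefile "$TMPDIR/machines" -np "$NSLOTS" ./my_mpi_program
```

Running "mpirun -np 6 hostname" with and without the machinefile is a
quick way to see on which nodes the ranks really end up.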

-- Reuti

PS: Please try the latest MPICH2 release, 1.0.7 (although your 1.0.4p1
should be fine); 1.0.6p1 at least is broken.


>
> Some background:
> I have a pe called mpi with these characteristics:
>
> [root@bear ~]$ qconf -sp mpi
> pe_name           mpi
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /opt/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min
>
> I have a queue called mpi.q with 6 dual processor nodes (12 slots)
>
> I submit the job like this:  qsub -q mpi.q -pe mpi 6 -cwd ./sbt034.csh
>
> sbt034.csh:
> #! /bin/tcsh
>
> #$ -q mpi.q
> #$ -j y
> #$ -o testSGE2.out
> #$ -N testSGE2
> #$ -cwd
> #$ -pe mpi 6
>
> echo running...
> echo $TMPDIR
> /usr/local/mpich2-1.0.4p1-pgi-k8-64/bin/mpirun -np 6 /people8/tzhou/mcnprun/SUN/bin/mcnp 5j.mpi i=sbt034 wwinp=sbwwmx05 eol
> echo done!
>
> qstat says:
>
> tzhou@bear[162] qstat
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>    6862 0.56000 testSGE2   tzhou        r     04/28/2008 10:52:48 mpi.q@bear72.cl.slb.com            6
>
> error file says:
> master starting       5 tasks with       1 threads each  **/**/08  **:**:10
>  master sending static commons...
>  master sending dynamic commons...
>  master sending cross section data...
> PGFIO/stdio: No such file or directory
> PGFIO-F-/OPEN/unit=32/error code returned by host stdio - 2.
>  In source file msgtsk.f90, at line number 116
> PGFIO/stdio: No such file or directory
> PGFIO-F-/OPEN/unit=32/error code returned by host stdio - 2.
>  In source file msgtsk.f90, at line number 116
> rank 4 in job 4  bear75.cl.slb.com_47485   caused collective abort of all ranks
>   exit status of rank 4: killed by signal 9
> done!
>
> The $TMPDIR gets set properly?
>
> Any thoughts on what might be happening here?
>
> Many thanks,
> Roberta
>
> ----------------------------------------------------------------------
> Roberta M. Gigon
> Schlumberger-Doll Research
> One Hampshire Street, MD-B253
> Cambridge, MA 02139
> 617.768.2099 - phone
> 617.768.2381 - fax
>
> This message is considered Schlumberger CONFIDENTIAL.  Please treat  
> the information contained herein accordingly.
>

