[GE users] mpi problems

Roberta Gigon RGigon at slb.com
Mon Apr 28 18:16:55 BST 2008


Hi Reuti,

I have also tried using -machinefile $TMPDIR/machines in the script file and get the same result.

We have been using the mpd method: the master mpd runs on the head node of the cluster and the nodes all run mpd "slaves".  I didn't see instructions for the mpd method in the how-to.  The program we are using with MPICH2 is MCNP from Los Alamos National Laboratory; I'm not sure whether it works with any of the other startup methods.
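For reference, a minimal sketch of the job-script fragment with the machine file passed explicitly (assuming a persistent mpd ring is already running on the nodes; the paths are the ones from the original sbt034.csh, and $TMPDIR/machines is the file written by the MPICH(1) startmpi.sh in the current PE):

#! /bin/tcsh
#$ -q mpi.q
#$ -pe mpi 6
#$ -cwd

# Sketch only: relies on mpd daemons already running outside of SGE.
# $TMPDIR/machines is created by the PE's start_proc_args script.
echo $TMPDIR
cat $TMPDIR/machines

# Hand the SGE-granted host list to mpirun so the ranks are not all local.
/usr/local/mpich2-1.0.4p1-pgi-k8-64/bin/mpirun -machinefile $TMPDIR/machines -np 6 \
    /people8/tzhou/mcnprun/SUN/bin/mcnp 5j.mpi i=sbt034 wwinp=sbwwmx05 eol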

Thanks!
Roberta

P.S.  Regarding the bear72/bear75 confusion... I cut and pasted the wrong error file entry... in reality, it is consistent.

---------------------------------------------------------------------------------------------
Roberta M. Gigon
Schlumberger-Doll Research
One Hampshire Street, MD-B253
Cambridge, MA 02139
617.768.2099 - phone
617.768.2381 - fax

This message is considered Schlumberger CONFIDENTIAL.  Please treat the information contained herein accordingly.


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Monday, April 28, 2008 12:07 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] mpi problems

Hi,

On 28.04.2008, at 17:23, Roberta Gigon wrote:

> I'm having a few issues with getting MPICH-2  to work under SGE. I
> have an mpi job that works just fine with PBS and outside of SGE,
> so I'm pretty confident in saying that MPI itself is working.

The included $SGE_ROOT/mpi setup is only for MPICH(1). There is a Howto for MPICH2:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

Just note that MPICH2 can be compiled in at least four different ways, and your application's compilation must match the appropriate mpirun and SGE PE. Which type of startup do you want to use?
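If it is unclear which startup method a given MPICH2 install was built for, one way to check (assuming the install provides the standard mpich2version utility; the prefix below is the one from the job script) might be:

# Print version, device and configure options of this MPICH2 install;
# the configure arguments normally show the process manager chosen
# at build time (e.g. --with-pm=mpd).
/usr/local/mpich2-1.0.4p1-pgi-k8-64/bin/mpich2version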

Anyway: you have no -machinefile or similar in your mpirun call, so all processes will run locally. And how does it get from bear72 to bear75 - do you have a predefined mpd.hosts which could trigger this?
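A quick way to see which hosts the mpd ring can place ranks on (run as the job user; mpdtrace is part of the mpd tools, and ~/mpd.hosts is only one possible location for a predefined hosts file) might be:

# List the hosts currently in the mpd ring; if bear75 appears here,
# mpiexec/mpirun can start ranks there regardless of what SGE granted.
/usr/local/mpich2-1.0.4p1-pgi-k8-64/bin/mpdtrace -l

# Check for a leftover hosts file that mpdboot may have been fed.
ls -l ~/mpd.hosts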

-- Reuti

PS: Please try the latest 1.0.7 of MPICH2 (although your 1.0.4p1 should be fine); 1.0.6p1, at least, is broken.
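For comparison with the PE quoted below, which still points at the MPICH(1) startmpi.sh/stopmpi.sh scripts and has control_slaves FALSE, a tightly integrated MPICH2 mpd PE along the lines of the Howto would look roughly like this (the PE name, script locations and exact start/stop arguments are placeholders and should be taken from the Howto itself):

pe_name           mpich2_mpd
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile
stop_proc_args    /opt/sge/mpich2_mpd/stopmpich2.sh -catch_rsh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min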


>
> Some background:
> I have a pe called mpi with these characteristics:
>
> [root at bear ~]$ qconf -sp mpi
> pe_name           mpi
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /opt/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min
>
> I have a queue called mpi.q with 6 dual processor nodes (12 slots)
>
> I submit the job like this:  qsub -q mpi.q -pe mpi 6 -cwd ./sbt034.csh
>
> sbt034.csh:
> #! /bin/tcsh
>
> #$ -q mpi.q
> #$ -j y
> #$ -o testSGE2.out
> #$ -N testSGE2
> #$ -cwd
> #$ -pe mpi 6
>
> echo running...
> echo $TMPDIR
> /usr/local/mpich2-1.0.4p1-pgi-k8-64/bin/mpirun -np 6 /people8/tzhou/mcnprun/SUN/bin/mcnp 5j.mpi i=sbt034 wwinp=sbwwmx05 eol
> echo done!
>
> qstat says:
>
> tzhou at bear[162] qstat
> job-ID  prior   name       user         state submit/start at
> queue                          slots ja-task-ID
> ----------------------------------------------------------------------
> -------------------------------------------
>    6862 0.56000 testSGE2   tzhou        r     04/28/2008 10:52:48
> mpi.q at bear72.cl.slb.com            6
>
> error file says:
> master starting       5 tasks with       1 threads each  **/**/08
> **:**:10
>  master sending static commons...
>  master sending dynamic commons...
>  master sending cross section data...
> PGFIO/stdio: No such file or directory
> PGFIO-F-/OPEN/unit=32/error code returned by host stdio - 2.
>  In source file msgtsk.f90, at line number 116
> PGFIO/stdio: No such file or directory
> PGFIO-F-/OPEN/unit=32/error code returned by host stdio - 2.
>  In source file msgtsk.f90, at line number 116
> rank 4 in job 4  bear75.cl.slb.com_47485   caused collective abort
> of all ranks
>   exit status of rank 4: killed by signal 9
> done!
>
> The $TMPDIR gets set properly...
>
> Any thoughts on what might be happening here?
>
> Many thanks,
> Roberta
>
> ----------------------------------------------------------------------
> -----------------------
> Roberta M. Gigon
> Schlumberger-Doll Research
> One Hampshire Street, MD-B253
> Cambridge, MA 02139
> 617.768.2099 - phone
> 617.768.2381 - fax
>
> This message is considered Schlumberger CONFIDENTIAL.  Please treat
> the information contained herein accordingly.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net

