[GE users] MPI Job Execution

Reuti reuti at staff.uni-marburg.de
Mon Sep 29 13:20:43 BST 2008


Hi,

On 29.09.2008 at 11:37, rajesh britto wrote:
> When I submit an MPI job to SGE, my output file shows the following
> error:
>
> p0_7279: p4_error: Child process exited while making connection to remote process on 172.16.3.100: 0
>
> This is my script file:
>
> #!/bin/csh -f
> #
> #
> # (c) 2008 Sun Microsystems, Inc. All rights reserved. Use is
> # subject to license terms.
>
> # ---------------------------
> # our name
> #$ -N a.out
> #
> #$ -o output
> #
> # pe request
> #$ -pe make 2
>
Did you follow the instructions in $SGE_ROOT/mpi? The default "make"
parallel environment isn't suitable for MPICH(1) jobs, because MPICH(1)
needs a special "machines" file to be generated for each job. The
template "mpich.template" in $SGE_ROOT/mpi shows how to set up the PE.
Also, please include the two export (or setenv) commands I posted
before.
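
To illustrate what the PE's start script has to produce: as far as I
know, startmpi.sh in $SGE_ROOT/mpi builds $TMPDIR/machines from the
hostfile SGE passes to the PE by listing each host once per granted
slot. Here is a minimal sketch of that conversion, not the shipped
script. The function name pe_hostfile_to_machines is made up for this
example, and the hostfile layout "host slots queue processors" per line
is an assumption.

```shell
#!/bin/sh
# Sketch only: expand a PE hostfile into an MPICH(1)-style machines list.
# Assumed input format, one line per host: <host> <slots> <queue> <processors>
# pe_hostfile_to_machines is a hypothetical helper name, not part of SGE.
pe_hostfile_to_machines() {
    while read -r host nslots _rest; do
        # skip empty lines
        [ -n "$host" ] || continue
        i=0
        # print the host name once for every slot granted on it
        while [ "$i" -lt "$nslots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}
```

In a start script this would be called with the hostfile path, e.g.
pe_hostfile_to_machines "$1" > "$TMPDIR/machines".
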
> # MPIR_HOME from submitting environment
> #$ -v MPIR_HOME=/usr/local/mpich-1.2.6
> # ---------------------------
>
> #
> # needs in
> # $NSLOTS
> # the number of tasks to be used
> # $TMPDIR/machines
> # a valid machine file to be passed to mpirun
>
> echo "Got $NSLOTS slots."
>
> $MPIR_HOME/bin/mpirun -np $NSLOTS -machinefile _machines a.out
>
The machine file generated by SGE's PE ($TMPDIR/machines) must be used,
because you can't know in advance which nodes SGE will grant to your
job, or with which slot allocation.
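
A minimal sketch of a job script body that relies on the generated file
instead of a hand-written one. It assumes the job runs in a PE set up
from mpich.template, so that $TMPDIR/machines exists. The helper name
run_with_sge_machines and the MPIRUN override variable are only for
illustration.

```shell
#!/bin/sh
# Sketch only: start mpirun with the machines file written by the PE.
# run_with_sge_machines and MPIRUN are hypothetical names for this example.
run_with_sge_machines() {
    machines="$TMPDIR/machines"
    # fail early with a clear message if the PE did not write the file
    if [ ! -r "$machines" ]; then
        echo "Cannot read $machines; was the job submitted with the MPICH PE?" >&2
        return 1
    fi
    echo "Got $NSLOTS slots."
    # MPIRUN defaults to the mpirun below MPIR_HOME, as in the script above
    "${MPIRUN:-$MPIR_HOME/bin/mpirun}" -np "$NSLOTS" -machinefile "$machines" "$@"
}
```

The last line of the job script would then be something like
run_with_sge_machines ./a.out.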

> this is my _machines file:
>
> 172.16.3.100:1
> 172.16.3.2:1
>
> Note: when I run it from the terminal as mpirun -np 2 -machinefile
> _machines a.out, the job runs successfully and prints the output.
>
This might indicate that a connection from the submit node to the
exec nodes is possible. But under SGE the job already starts on an exec
node, and a login from one node to another might not work. Are any
/etc/hosts.{allow,deny} entries or firewalls in effect?
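
To check this from an exec node rather than from the submit node, a
small test job can help. A minimal sketch, assuming rsh is the remote
shell MPICH(1) uses (P4_RSHCOMMAND=rsh, as quoted further down). The
function names and the RSH override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: helpers to run as a job on an exec node, not on the submit node.
# Both function names and the RSH override are hypothetical.

# Print active (non-comment, non-empty) TCP wrapper rules; $1 defaults to /etc.
active_wrapper_rules() {
    dir="${1:-/etc}"
    for f in "$dir/hosts.allow" "$dir/hosts.deny"; do
        [ -r "$f" ] || continue
        # prefix each rule with the file it came from
        grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$f" |
            while IFS= read -r rule; do
                echo "$f: $rule"
            done
    done
    return 0
}

# Try a login from this node to another one, the way MPICH(1) would.
can_reach() {
    "${RSH:-rsh}" "$1" hostname >/dev/null 2>&1
}
```

Submitted as a job pinned to one node (for example with a
"-l hostname=..." resource request), active_wrapper_rules shows which
rules apply there, and can_reach followed by the other node's name
tells you whether the node-to-node login works.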

-- Reuti


> On Sun, Sep 28, 2008 at 9:12 PM, Mag Gam <magawake at gmail.com> wrote:
> Reuti:
>
> thanks! Very good thread.
>
> Keep up the great work.
>
>
> On Sat, Sep 27, 2008 at 8:42 AM, Reuti <reuti at staff.uni-marburg.de>  
> wrote:
> > Hi,
> >
> > On 27.09.2008 at 09:01, rajesh britto wrote:
> >
> >> After exporting the variables I still get the same error in my job
> >> output file. When I check the error file, it is empty.
> >>
> >> [sgeadmin at slaserver ~]$ cat cpi.o30
> >>
> >> Got 2 slots.
> >>
> >> Cannot read /tmp/30.1.all.q/machines.
> >>
> > It seems that the file "machines" isn't written or isn't readable. You can
> > investigate this by putting a "sleep 300" or so in the job script before the
> > "mpirun ...". While the job is sleeping, you can go to the node and check the
> > content of the job's $TMPDIR, a directory in /tmp on the node whose name
> > starts with the job number.
> >>
> >> Looked for files with extension LINUX in
> >> directory /usr/local/mpich-1.2.6/util/machines .
> >>
> > -- Reuti
> >
> >> ---RB
> >>
> >>
> >> On Fri, Sep 26, 2008 at 3:53 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> >> On 26.09.2008 at 11:55, rajesh britto wrote:
> >>
> >> Thanks, Chris and Reuti, for your information. Now I am able to run an
> >> MPI program. My MPI job was submitted successfully, but when I check my
> >> job output it shows the following error.
> >>
> >> [sgeadmin at slaserver ~]$ cat mpi.o126
> >> Warning: no access to tty (Bad file descriptor).
> >> Thus no job control in this shell.
> >>
> >> This is the output from the C shell. If you prefer the shell mentioned in
> >> the job script, the queue definition has to be changed to read
> >> "shell_start_mode      unix_behavior".
> >>
> >> http://gridengine.sunsource.net/howto/commonproblems.html
> >>
> >> Got 2 slots.
> >> Cannot read /usr/local/mpich-1.2.6/util/machines/machine.LINUX.
> >> Looked for files with extension LINUX in
> >> directory /usr/local/mpich-1.2.6/util/machines .
> >>
> >> #!/bin/sh
> >> echo "Got $NSLOTS slots."
> >> export MPICH_PROCESS_GROUP=no
> >> export P4_RSHCOMMAND=rsh
> >> mpirun -np $NSLOTS -machinefile $TMPDIR/machines ...
> >>
> >> -- Reuti
> >>
> >>
> >> regards,
> >>
> >> RB
> >>
> >> On Fri, Sep 26, 2008 at 3:15 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> >> Hi,
> >>
> >> On 26.09.2008 at 07:03, rajesh britto wrote:
> >>
> >>
> >> Hi,
> >>
> >> Thanks for the information. I installed the MPICH implementation. Can
> >> anyone tell me how to submit an MPI job using MPICH and SGE?
> >>
> >> What do you mean in detail: have you already set up a parallel environment,
> >> or just a plain MPICH(1) installation? As Chris pointed out, there are
> >> documents in $SGE_ROOT/mpi/ to get started.
> >>
> >> You will find additional hints for a Tight Integration here:
> >>
> >> http://gridengine.sunsource.net/howto/mpich-integration.html
> >>
> >> -- Reuti
> >>
> >>
> >>
> >> On Thu, Sep 25, 2008 at 7:59 PM, Chris Dagdigian <dag at sonsorol.org> wrote:
> >> Hello,
> >>
> >> This is what I'd recommend:
> >>
> >> (1) Determine what sort of MPI environment you have or need to install
> >> (there are many implementations of the MPI standard)
> >> (1.5) If you don't know what MPI to start with, start with OpenMPI from
> >> openmpi.org as that works beautifully with SGE in tight integration mode
> >> (2) Set up and install MPI
> >> (3) Compile the example cpi.c program using mpicc
> >> (4) Run your MPI job outside of Grid Engine using passwordless SSH to the
> >> nodes
> >>
> >> The basic idea here is that integrating MPI with Grid Engine is far, far
> >> easier if you are first able to validate for yourself that MPI works on its
> >> own. I've seen many "SGE can't handle MPI" trouble tickets where the actual
> >> problem was with the MPI install and not Grid Engine.
> >>
> >> Once you have MPI working outside of SGE, ideally with a real-world
> >> application, you can move on to SGE work ...
> >>
> >> Resources:
> >> (1) Example scripts can be found in $SGE_ROOT/mpi/
> >> (2) The documentation on wikis.sun.com for Grid Engine covers PEs and
> >> parallel environment setup well
> >> (3) A few Google searches on "tight mpi integration with SGE" or similar
> >> will show you other HOWTO methods
> >>
> >> If you use OpenMPI and compile it with the "--with-sge" option, then
> >> OpenMPI will automatically detect that it is running under Grid Engine and
> >> will do the right thing. This is currently (in my opinion) the fastest and
> >> easiest way to get a working tight MPI integration into SGE.
> >>
> >> For the difference between "tight" and "loose" PE integration, and why
> >> tight is better if you can achieve it, this link may help:
> >>
> >> http://gridengine.info/2005/09/19/parallel-environments-pes-loose-vs-tight-integration
> >>
> >> -Chris
> >>
> >>
> >>
> >>
> >> On Sep 25, 2008, at 9:18 AM, rajesh britto wrote:
> >>
> >> Hi,
> >> I have installed SGE 6.2 on a cluster and it works fine for sequential
> >> jobs. When I submit an MPI job, the job gets submitted and goes to the
> >> waiting state (qw state).
> >> I need to know how to set up a parallel environment to run an MPI job. If
> >> anyone has any documentation, it would be useful.
> >> Thanks in advance.
> >>
> >> with regards,
> >> RB
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



