[GE users] MPI Job Execution

rajesh britto britto.gridlab at gmail.com
Mon Sep 29 10:37:43 BST 2008


When I submit an MPI job to SGE, the job's output file shows the following
error:

p0_7279: p4_error: Child process exited while making connection to remote process on 172.16.3.100: 0

This is my script file:

#!/bin/csh -f
#
#
# (c) 2008 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.

# ---------------------------
# our name
#$ -N a.out
#
#$ -o output
#
# pe request
#$ -pe make 2
#
# MPIR_HOME from submitting environment
#$ -v MPIR_HOME=/usr/local/mpich-1.2.6
# ---------------------------

#
# needs in
#   $NSLOTS
#       the number of tasks to be used
#   $TMPDIR/machines
#       a valid machine file to be passed to mpirun

echo "Got $NSLOTS slots."

$MPIR_HOME/bin/mpirun -np $NSLOTS -machinefile _machines a.out

This is my _machines file:

172.16.3.100:1
172.16.3.2:1

Note: when I run it directly from the terminal as
mpirun -np 2 -machinefile _machines a.out
the job runs successfully and prints the output.
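
For reference, here is a minimal sketch of how a job script could choose its
machine file and sanity-check it before calling mpirun. It assumes that the
parallel environment's start procedure writes $TMPDIR/machines, as the
comments in the script above describe, and that lines have the host or host:n
form used in the _machines file above. The function names pick_machinefile
and count_slots are hypothetical helpers, not part of SGE or MPICH. The
sketch is written for /bin/sh rather than the csh used above, so the job
would have to run under that shell (for example via qsub's -S option).

```sh
#!/bin/sh
# Sketch only: both helper names are made up for illustration.

# Print the machine file to use. Prefer $TMPDIR/machines (assumed to be
# written by the PE start procedure); otherwise fall back to the path
# given as $1. Returns non-zero if neither file is readable.
pick_machinefile() {
    fallback=$1
    if [ -n "$TMPDIR" ] && [ -r "$TMPDIR/machines" ]; then
        echo "$TMPDIR/machines"
    elif [ -n "$fallback" ] && [ -r "$fallback" ]; then
        echo "$fallback"
    else
        echo "no readable machine file found" >&2
        return 1
    fi
}

# Sum the slots in a machine file whose lines are "host" or "host:n".
# Blank lines and lines starting with "#" are skipped.
count_slots() {
    awk -F: '$0 !~ /^[ \t]*(#|$)/ { n += ($2 == "" ? 1 : $2) }
             END { print n + 0 }' "$1"
}

# Possible use inside the job script (not executed here):
#   machines=$(pick_machinefile "$HOME/_machines") || exit 1
#   echo "Got $NSLOTS slots, machine file lists $(count_slots "$machines")."
#   $MPIR_HOME/bin/mpirun -np $NSLOTS -machinefile "$machines" a.out
```

Printing both numbers in the job output makes it easy to see whether the
machine file the job actually used matches the slots that SGE granted.
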
On Sun, Sep 28, 2008 at 9:12 PM, Mag Gam <magawake at gmail.com> wrote:

> Reuti:
>
> thanks! Very good thread.
>
> Keep up the great work.
>
>
> On Sat, Sep 27, 2008 at 8:42 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > On 27.09.2008 at 09:01, rajesh britto wrote:
> >
> >> After exporting the variables I still get the same error in my job output
> >> file. When I check the error file, it is empty.
> >>
> >> [sgeadmin@slaserver ~]$ cat cpi.o30
> >>
> >> Got 2 slots.
> >>
> >> Cannot read /tmp/30.1.all.q/machines.
> >>
> > It seems that the file "machines" isn't written or isn't readable. You can
> > investigate this by putting a "sleep 300" or so in the job script before the
> > "mpirun ...". While the job is sleeping, you can go to the node and check the
> > contents of the job's $TMPDIR directory, which is located in /tmp on the
> > node and whose name starts with the job number.
> >>
> >> Looked for files with extension LINUX in
> >> directory /usr/local/mpich-1.2.6/util/machines .
> >>
> > -- Reuti
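
The check described above could also be scripted, so that the job itself
records what is in its $TMPDIR instead of someone having to log in to the
node while it sleeps. This is a rough sketch only: report_tmpdir is a
hypothetical helper name, and the only assumption about SGE is that it sets
$TMPDIR for the job and that the machine file, if present, is called
"machines" as in the error message above.

```sh
#!/bin/sh
# Hypothetical helper: print the state of the job's scratch directory and
# machine file to stdout, which ends up in the job's output file.
# Takes an optional directory argument; defaults to $TMPDIR.
report_tmpdir() {
    dir=${1:-$TMPDIR}
    echo "node: $(uname -n)"
    echo "TMPDIR: $dir"
    if [ ! -d "$dir" ]; then
        echo "TMPDIR does not exist"
        return 1
    fi
    ls -la "$dir"
    if [ -r "$dir/machines" ]; then
        echo "machines file:"
        cat "$dir/machines"
    else
        echo "machines file missing or unreadable"
        return 1
    fi
}
```

Calling report_tmpdir just before the mpirun line (with or without the
"sleep 300") would show in the job output whether the machine file was
written and what it contains.
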
> >
> >> ---RB
> >>
> >>
> >> On Fri, Sep 26, 2008 at 3:53 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> >> On 26.09.2008 at 11:55, rajesh britto wrote:
> >>
> >> Thanks, Chris and Reuti, for your information. I am now able to run an
> >> MPI program. My MPI job was submitted successfully, but when I check the
> >> job output it shows the following error.
> >>
> >> [sgeadmin@slaserver ~]$ cat mpi.o126
> >> Warning: no access to tty (Bad file descriptor).
> >> Thus no job control in this shell.
> >>
> >> This is the output from the C shell. If you prefer the shell mentioned in
> >> the job script, the queue definition has to be changed to read
> >> "shell_start_mode      unix_behavior".
> >>
> >> http://gridengine.sunsource.net/howto/commonproblems.html
> >>
> >> Got 2 slots.
> >> Cannot read /usr/local/mpich-1.2.6/util/machines/machine.LINUX.
> >> Looked for files with extension LINUX in
> >> directory /usr/local/mpich-1.2.6/util/machines .
> >>
> >> #!/bin/sh
> >> echo "Got $NSLOTS slots."
> >> export MPICH_PROCESS_GROUP=no
> >> export P4_RSHCOMMAND=rsh
> >> mpirun -np $NSLOTS -machinefile $TMPDIR/machines ...
> >>
> >> -- Reuti
> >>
> >>
> >> regards,
> >>
> >> RB
> >>
> >> On Fri, Sep 26, 2008 at 3:15 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> >> Hi,
> >>
> >> On 26.09.2008 at 07:03, rajesh britto wrote:
> >>
> >>
> >> Hi,
> >>
> >> Thanks for the information. I have installed the MPICH implementation. Can
> >> anyone tell me how to submit an MPI job using MPICH and SGE?
> >>
> >> What do you mean in detail: have you already set up a parallel environment,
> >> or do you just have a plain mpich(1) installation? As Chris pointed out,
> >> there are documents in $SGE_ROOT/mpi/ to get started.
> >>
> >> You will find additional hints for a Tight Integration here:
> >>
> >> http://gridengine.sunsource.net/howto/mpich-integration.html
> >>
> >> -- Reuti
> >>
> >>
> >>
> >> On Thu, Sep 25, 2008 at 7:59 PM, Chris Dagdigian <dag at sonsorol.org> wrote:
> >> Hello,
> >>
> >> This is what I'd recommend:
> >>
> >> (1) Determine what sort of MPI environment you have or need to install
> >> (there are many implementations of the MPI standard)
> >> (1.5) If you don't know what MPI to start with, start with OpenMPI from
> >> openmpi.org as that works beautifully with SGE in tight integration mode
> >> (2) Set up and install MPI
> >> (3) Compile the example cpi.c program using MPICC
> >> (4) Run your MPI job outside of Grid Engine using passwordless SSH to the
> >> nodes
> >>
> >> The basic idea here is that integrating MPI with Grid Engine is far, far
> >> easier if you are first able to validate for yourself that MPI works on its
> >> own. I've seen many "SGE can't handle MPI" trouble tickets where the actual
> >> problem was with the MPI install and not Grid Engine.
> >>
> >> Once you have MPI working outside of SGE and ideally with a real world
> >> application then you can move on to SGE work ...
> >>
> >> Resources:
> >> (1) example scripts can be found in $SGE_ROOT/mpi/
> >> (2) The documentation on wikis.sun.com for Grid Engine covers PE and
> >> parallel environment stuff well
> >> (3) A few google searches on "tight mpi integration with SGE" or similar
> >> will show you other HOWTO methods
> >>
> >> If you use openmpi and compile it with the "--with-sge" option then
> >> OpenMPI will automatically detect that it is running under Grid Engine and
> >> will do the right thing. This is currently (in my opinion) the fastest and
> >> easiest way to get a working tight MPI integration into SGE.
> >>
> >> For the difference between "tight" and "loose" PE integration and why
> >> tight is better if you can achieve it, these links may help:
> >>
> >>
> >>
> >> http://gridengine.info/2005/09/19/parallel-environments-pes-loose-vs-tight-integration
> >>
> >> -Chris
> >>
> >>
> >>
> >>
> >> On Sep 25, 2008, at 9:18 AM, rajesh britto wrote:
> >>
> >> Hi,
> >> I have installed SGE 6.2 on a cluster and it works fine for sequential
> >> jobs. When I submit an MPI job, the job is accepted but stays in the
> >> waiting state (qw).
> >> I need to know how to set up a parallel environment to run an MPI job. If
> >> anyone has any documentation, it would be useful.
> >> Thanks in advance.
> >>
> >> with regards,
> >> RB
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net


