[GE users] File staging problem / Persistent $TMPDIR on all slave nodes

Gerhard Venter gventer at sun.ac.za
Tue Jul 8 20:09:51 BST 2008


Reuti,

Thanks - yes, that is a good suggestion.  It is so obvious I can't
believe I missed it.  For some reason I was so fixated on getting the
input file into the local directory that I did not step back to look at
the bigger picture.  I thought there would be some performance gain from
copying the input files over, but now that I think about it, I can't see
why there would be - either way each node has to access the file over
the network.
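
Something along these lines should then be enough (assuming mpijob is
changed to take the input file path as a command-line argument - that
part is specific to my code):

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe openmpi_rr 4
#
# keep the scratch-heavy work on the node-local disk, but read the
# input file straight from the shared submit directory instead of
# copying it around
cd $TMPDIR
mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob $SGE_O_WORKDIR/input.dat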

Thanks again, at least I gained some more insight into SGE!

Gerhard

On Tue, 2008-07-08 at 18:16 +0200, Reuti wrote:
> Am 08.07.2008 um 17:56 schrieb Gerhard Venter:
> 
> > Reuti,
> >
> > Yes you are correct,
> >
> > mpirun -np ${NSLOTS} ./mpijob
> >
> > also works.
> >
> > Thanks for your input.  For now I have a working solution based on the
> > input I got from Kevin, using a prolog script.  The reason I want the
> > files copied to the work directory of each worker thread is that each
> > worker thread, among other things, starts a numerical simulation that
> > runs serially.  Each of these simulations generates large scratch
> 
> I see. But isn't it possible to specify a full pathname for the
> input file instead of the default "input.dat" in the cwd? That way
> the cwd for each worker could still be local on the node, but the
> input files would not need to be copied to the same location on all
> nodes to be accessible.
> 
> -- Reuti
> 
> 
> > files, and as a result I want them to run on the local disk.  For now
> > I think the best solution is a combination of what Kevin suggested and
> > some manual file manipulation from within my worker threads.
> >
> > Thanks again for your input,
> > Gerhard
> >
> > On Tue, 2008-07-08 at 13:55 +0200, Reuti wrote:
> >> Hi,
> >>
> >> Am 08.07.2008 um 09:32 schrieb Gerhard Venter:
> >>
> >>> I am using SGE 6.0 and Open MPI.  I am distributing my job to multiple
> >>> slots using a round-robin MPI parallel environment - the result is that
> >>> each slot is on a different compute node, each with its own local disk
> >>> storage.  I would like to stage files from my home directory to the
> >>> $TMPDIR on each compute node that MPI will use.  My submit script looks
> >>> something like this:
> >>>
> >>> #!/bin/bash
> >>> #$ -cwd
> >>> #$ -j y
> >>> #$ -pe openmpi_rr 4
> >>> #
> >>> cp input.dat $TMPDIR
> >>> cd $TMPDIR
> >>> mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob
> >>
> >> mpirun -np ${NSLOTS} ./mpijob
> >>
> >> might also work.
> >>
> >>>
> >>> The cd $TMPDIR does change the pwd on each node to $TMPDIR (I print the
> >>> value for all slots from my MPI program); however, the files are only
> >>> copied to the first MPI slot (the master slot) and not to the other
> >>> slots (the worker slots).  I have also implemented the prolog and
> >>> epilog scripts from
> >>>
> >>> http://gridengine.sunsource.net/project/gridengine/howto/filestaging/
> >>>
> >>> but with the same result.
> >>>
> >>> Am I missing something?  Is there a way that I can stage input files
> >>> for all the slots that will be used in an MPI job?
> >>
> >> This is by design. The jobscript and the prolog/epilog run only on
> >> the master node of the parallel job. Often this is sufficient, as
> >> only rank 0 reads the file(s) and all further communication is done
> >> via MPI itself. Is there an advantage to copying the data to all
> >> nodes rather than reading it from the original location?
> >>
> >> If in your case all nodes need the input file to be local, you have
> >> to copy it to all nodes yourself. Note that $TMPDIR might differ
> >> between nodes if you get slots from several queues to fulfill your
> >> request, as the queue name is part of the $TMPDIR path. You will need
> >> a loop across all nodes, copying the data to some kind of persistent
> >> directory: the $TMPDIR created on a slave node for a qrsh call is
> >> removed again as soon as that call finishes, so any file copied there
> >> would not survive.
> >>
> >> Although you don't need a start_proc_args for Open MPI, you could
> >> define one with a loop that creates the directories on all nodes
> >> (this is a digest of our MPICH startup for MOLCAS, which needs
> >> scratch directories that persist between subsequent qrsh calls):
> >>
> >> ========================START=========================
> >> #!/bin/sh
> >>
> >> PeHostfile2MachineFile()
> >> {
> >>     cat $1 | while read line; do
> >>        # echo $line
> >>        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
> >>        nslots=`echo $line|cut -f2 -d" "`
> >>        i=1
> >>        while [ $i -le $nslots ]; do
> >>           # add here code to map regular hostnames into ATM hostnames
> >>           echo $host
> >>           i=`expr $i + 1`
> >>        done
> >>     done
> >> }
> >>
> >> # when called via start_proc_args as "... pestart.sh $pe_hostfile",
> >> # the path to the PE hostfile arrives as the first argument
> >> pe_hostfile=$1
> >>
> >> # the machines file lives in the job's $TMPDIR on the master node
> >> machines=$TMPDIR/machines
> >> PeHostfile2MachineFile $pe_hostfile >> $machines
> >>
> >> # create the persistent scratch directory locally and on every slave
> >> myhostname=`hostname`
> >> mkdir ${TMPDIR}_persistent
> >> for HOST in `grep -v $myhostname $machines | uniq`; do
> >>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST mkdir ${TMPDIR}_persistent
> >> done
> >> exit 0
> >> ========================END=========================
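> >>
> >> For completeness, such scripts are hooked into the PE definition via
> >> start_proc_args/stop_proc_args. The relevant lines of "qconf -sp"
> >> might look roughly like this (the script names and paths are only
> >> placeholders for your own setup):
> >>
> >> ========================START=========================
> >> pe_name           openmpi_rr
> >> slots             999
> >> start_proc_args   /usr/sge/cluster/molcas/pestart.sh $pe_hostfile
> >> stop_proc_args    /usr/sge/cluster/molcas/pestop.sh
> >> allocation_rule   $round_robin
> >> control_slaves    TRUE
> >> job_is_first_task FALSE
> >> ========================END=========================
> >>
> >> control_slaves TRUE is required anyway, otherwise the qrsh -inherit
> >> calls are not allowed.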
> >>
> >> Of course, you can't delete the scratch directories the same way: once
> >> a job has been removed with qdel, it is no longer allowed to use qrsh
> >> -inherit. So I set up a special "cleaner.q" queue, to which a small
> >> cleanup job per node is submitted in a loop in the stop_proc_args
> >> procedure:
> >>
> >> =======================START==========================
> >> #!/bin/sh
> >>
> >> # submit one cleanup job per node in the job's machines file
> >> for HOST in `uniq $TMPDIR/machines`; do
> >>      $SGE_ROOT/bin/$ARC/qsub -l hostname=$HOST,virtual_free=0,cleaner \
> >>          $SGE_ROOT/cluster/molcas/cleaner.sh ${TMPDIR}_persistent
> >> done
> >>
> >> rm $TMPDIR/machines
> >>
> >> exit 0
> >> ========================END=========================
> >>
> >> Depending on your setup, you might need a different resource request
> >> and a different path to the cleaner script. The "cleaner.q" has only
> >> one slot, is always allowed to run without any restrictions, and uses
> >> a FORCED BOOL complex attribute "cleaner" to prevent normal jobs from
> >> ending up in it. The cleaner script itself just removes the directory
> >> it is given as its argument:
> >>
> >> =======================START==========================
> >> #!/bin/sh
> >>
> >> #
> >> # Remove the persistent directory
> >> #
> >>
> >> rm -rf $1
> >>
> >> exit 0
> >> ========================END=========================
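> >>
> >> As an illustration, the "cleaner" complex (qconf -mc) and the matching
> >> entries in the cleaner.q definition could look roughly like this - the
> >> shortcut and the exact values are only an example:
> >>
> >> ========================START=========================
> >> #name     shortcut  type  relop  requestable  consumable  default  urgency
> >> cleaner   cln       BOOL  ==     FORCED       NO          0        0
> >>
> >> # and in "qconf -mq cleaner.q":
> >> complex_values        cleaner=TRUE
> >> slots                 1
> >> ========================END=========================
> >>
> >> Because the attribute is FORCED, only jobs requesting it explicitly
> >> (as the qsub above does with "-l ...,cleaner") can end up in this
> >> queue.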
> >>
> >> Note that for this to work the execution nodes must also be configured
> >> as submit hosts, because the stop_proc_args runs qsub on the master
> >> node of the job. Now you can copy the input file into the persistent
> >> directory on every node from within your jobscript, while keeping the
> >> Tight Integration:
> >>
> >>
> >> # copy the input file into the persistent directory of every slave;
> >> # use a full path for the source file so it is found regardless of
> >> # the cwd of the remote shell
> >> myhostname=`hostname`
> >> for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
> >>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST \
> >>          cp $SGE_O_WORKDIR/input.dat ${TMPDIR}_persistent
> >> done
> >>
> >> Your program has of course to use ${TMPDIR}_persistent as its working
> >> location then, instead of plain ${TMPDIR}.
> >>
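> >> Put together in the jobscript, this could look roughly like the
> >> following sketch (the qrsh loop is the one shown above):
> >>
> >> ========================START=========================
> >> # master node: copy the input file directly; the slaves get it via
> >> # the qrsh loop shown above
> >> cp input.dat ${TMPDIR}_persistent
> >>
> >> # then run from the persistent scratch directory
> >> cd ${TMPDIR}_persistent
> >> mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob
> >> ========================END=========================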
> >>
> >> HTH - Reuti
> >>

-- 
+------------------------------------------------------------------+
|| Prof. Gerhard Venter
||
|| Departement Meganiese en        |  Department of Mechanical and
||   Megatroniese Ingenieurswese   |    Mechatronic Engineering
|| Universiteit Stellenbosch       |  Stellenbosch  University
|| Privaat Sak X1 Matieland 7602   |  Private Bag X1 Matieland 7602  
|| Suid-Afrika                     |  South Africa
||
|| Tel: +27 21 808 3560
|| E-Mail: gventer at sun.ac.za          Web: www.eng.sun.ac.za
+------------------------------------------------------------------+


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



