[GE users] File staging problem / Persistent $TMPDIR on all slave nodes

Gerhard Venter gventer at sun.ac.za
Tue Jul 8 16:56:37 BST 2008


Reuti,

Yes, you are correct,

mpirun -np ${NSLOTS} ./mpijob

also works.

Thanks for your input.  For now I have a working solution based on the
input I got from Kevin, using a prolog script.  The reason I want files
copied to the work directory of each worker thread is that each worker
thread, among other things, starts a numerical simulation that runs
serially.  Each of these simulations generates large scratch files, so
I want them to run on the local disk.  For now I think the best
solution is a combination of what Kevin suggested and some manual file
manipulation from within my worker threads, as sketched below.
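
To illustrate the kind of manual file manipulation I mean (just a
sketch - the wrapper name and the argument passing are placeholders,
not part of Kevin's setup or my current scripts): each MPI rank would
run a small wrapper that stages the input into its local $TMPDIR
before starting the real solver:

#!/bin/sh
# stage_and_run.sh (hypothetical) - launched once per rank, e.g.
#   mpirun -np ${NSLOTS} $SGE_O_WORKDIR/stage_and_run.sh $SGE_O_WORKDIR
# $1 is the shared submit directory holding input.dat and the binary.
# Assumes $TMPDIR is also set for the remote tasks, which should be the
# case with a tight integration.
cp "$1"/input.dat "$TMPDIR"/ || exit 1
cd "$TMPDIR"
exec "$1"/mpijob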

Thanks again for your input,
Gerhard

On Tue, 2008-07-08 at 13:55 +0200, Reuti wrote:
> Hi,
> 
> Am 08.07.2008 um 09:32 schrieb Gerhard Venter:
> 
> > I am using SGE 6.0 and Open MPI.  I am distributing my job to multiple
> > slots using a round robin MPI parallel environment - the result is that
> > each slot is on a different compute node, each with its own local disk
> > storage.  I would like to stage files from my home directory to the
> > TMPDIR on each compute node that MPI will use.  My submit script looks
> > something like this:
> >
> > #!/bin/bash
> > #$ -cwd
> > #$ -j y
> > #$ -pe openmpi_rr 4
> > #
> > cp input.dat $TMPDIR
> > cd $TMPDIR
> > mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob
> 
> mpirun -np ${NSLOTS} ./mpijob
> 
> might also work.
> 
> >
> > The cd $TMPDIR does change the pwd on each node to $TMPDIR (I print
> > the value for all slots from my MPI program); however, the files are
> > only copied to the first MPI slot (the master slot) and not the other slots
> > (the worker slots).  I have also implemented the prolog and epilog
> > scripts from
> >
> > http://gridengine.sunsource.net/project/gridengine/howto/filestaging/
> >
> > but with the same result.
> >
> > Am I missing something?  Is there a way that I can stage input files
> > for all the slots that will be used in an MPI job?
> 
> This is by design. The jobscript and the prolog/epilog run only on
> the master node of the parallel job. Often this is sufficient, as
> only rank 0 reads the file(s) and all further communication is done
> via MPI itself. Is there an advantage to copying the data to all
> nodes rather than reading it from the original location?
> 
> If in your case all nodes need the input file locally, you have to
> copy it by hand to every node. The $TMPDIR might differ between the
> nodes if you get slots from several queues to fulfill your request,
> as the queue name is part of the $TMPDIR path. You will need a loop
> across all nodes, copying the data to some kind of persistent
> directory, because the $TMPDIR created on a slave node for a qrsh
> call is removed again as soon as that call returns - hence anything
> copied there this way would be lost.
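> 
> As an illustration of the naming: with the default tmpdir of /tmp the
> per-job directory is /tmp/<job_id>.<task_id>.<queue name>, e.g.
> 
>    /tmp/4711.1.all.q      on a node giving slots from all.q
>    /tmp/4711.1.extra.q    on a node giving slots from extra.q
> 
> (the job id and queue names here are made up), and on a slave node
> this directory only exists for the duration of a single qrsh call,
> which is why the copy has to go into a directory you create yourself.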
> 
> Although you won't need a start_proc_args for Open MPI, you could
> define one with a loop to copy the data (a digest from our MPICH
> startup for MOLCAS, which needs persistent scratch directories
> between subsequent qrsh calls):
> 
> ========================START=========================
> #!/bin/sh
> 
> PeHostfile2MachineFile()
> {
>     cat $1 | while read line; do
>        # echo $line
>        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>        nslots=`echo $line|cut -f2 -d" "`
>        i=1
>        while [ $i -le $nslots ]; do
>           # add here code to map regular hostnames into ATM hostnames
>           echo $host
>           i=`expr $i + 1`
>        done
>     done
> }
> 
> PeHostfile2MachineFile $pe_hostfile >> $machines
> 
> myhostname=`hostname`
> mkdir ${TMPDIR}_persistent
> for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST mkdir ${TMPDIR}_persistent
> done
> exit 0
> ========================END=========================
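> 
> (The digest above leaves out how $pe_hostfile and $machines get set;
> presumably, as in the stock startmpi.sh, the hostfile path is handed
> over on the start_proc_args line and machines is $TMPDIR/machines.)
> Wiring such start/stop scripts into the PE could then look roughly
> like this - the script paths are placeholders, not real ones:
> 
> $ qconf -sp openmpi_rr
> pe_name            openmpi_rr
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /usr/sge/cluster/startpe.sh $pe_hostfile
> stop_proc_args     /usr/sge/cluster/stoppe.sh
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min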
> 
> Of course, you can't delete the scratch directories the same way - in
> case of a qdel it is no longer possible to use qrsh -inherit for the
> job. So I set up a special "cleaner.q", to which cleanup jobs are
> submitted in a loop in the stop_proc_args procedure:
> 
> =======================START==========================
> #!/bin/sh
> 
> for HOST in `uniq $TMPDIR/machines`; do
>      $SGE_ROOT/bin/$ARC/qsub -l hostname=$HOST,virtual_free=0,cleaner \
>          $SGE_ROOT/cluster/molcas/cleaner.sh ${TMPDIR}_persistent
> done
> 
> rm $TMPDIR/machines
> 
> exit 0
> ========================END=========================
> 
> Depending on your setup, you might need a different resource request
> and path to the cleaner script. The "cleaner.q" has only one slot, is
> always allowed to run without any restrictions, and uses a BOOL
> FORCED complex attribute "cleaner" to prevent normal jobs from ending
> up there.
> 
> =======================START==========================
> #!/bin/sh
> 
> #
> # Remove the persistent directory
> #
> 
> rm -rf $1
> 
> exit 0
> ========================END=========================
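> 
> The pieces behind that queue, as a rough example (the names and
> values are illustrative, not a prescription): the "cleaner" complex
> is a FORCED BOOL, and cleaner.q is the only queue offering it in its
> complex_values:
> 
> # relevant line of qconf -sc
> #name     shortcut  type  relop  requestable  consumable  default  urgency
> cleaner   cl        BOOL  ==     FORCED       NO          0        0
> 
> # relevant lines of qconf -sq cleaner.q
> qname            cleaner.q
> slots            1
> complex_values   cleaner=TRUE
> 
> Because the complex is FORCED, only jobs that explicitly request
> -l cleaner can be scheduled into this queue, and the cleaner jobs in
> turn only run where cleaner=TRUE is offered.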
> 
> So, the execution nodes must also be submit nodes. Now you can copy
> the input file to the persistent directories on all nodes, and you
> still have a Tight Integration in your jobscript:
> 
> 
> myhostname=`hostname`
> for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST cp input.dat ${TMPDIR}_persistent
> done
> 
> Of course, you then have to use ${TMPDIR}_persistent as the location
> for the input and scratch files in your job.
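> 
> Put together, a sketch of the jobscript part could look like this (it
> assumes input.dat sits in the shared submit directory, as with your
> -cwd setting above):
> 
> cp input.dat ${TMPDIR}_persistent
> myhostname=`hostname`
> for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST cp input.dat ${TMPDIR}_persistent
> done
> cd ${TMPDIR}_persistent
> mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob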
> 
> 
> HTH - Reuti
> 

-- 
+------------------------------------------------------------------+
|| Prof. Gerhard Venter
||
|| Departement Meganiese en        |  Department of Mechanical and
||   Megatroniese Ingenieurswese   |    Mechatronic Engineering
|| Universiteit Stellenbosch       |  Stellenbosch  University
|| Privaat Sak X1 Matieland 7602   |  Private Bag X1 Matieland 7602  
|| Suid-Afrika                     |  South Africa
||
|| Tel: +27 21 808 3560
|| E-Mail: gventer at sun.ac.za          Web: www.eng.sun.ac.za
+------------------------------------------------------------------+


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



