[GE users] File staging problem / Persistent $TMPDIR on all slave nodes

Reuti reuti at staff.uni-marburg.de
Tue Jul 8 12:55:06 BST 2008


Am 08.07.2008 um 09:32 schrieb Gerhard Venter:

> I am using SGE 6.0 and OpenMPI.  I am distributing my job to multiple
> slots using a round robin mpi parallel environment - the result is that
> each slot is on a different compute node, each with its own local disk
> storage.  I would like to stage files from my home directory to the
> TMPDIR on each compute node that MPI will use.  My submit script looks
> something like this:
> #!/bin/bash
> #$ -cwd
> #$ -j y
> #$ -pe openmpi_rr 4
> #
> cp input.dat $TMPDIR
> cd $TMPDIR
> mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob

mpirun -np ${NSLOTS} ./mpijob

might also work.

> The cd $TMPDIR does change the pwd on each node to $TMPDIR (I print the
> value for all slots from my mpi program), however, the files are only
> copied to the first MPI slot (the master slot) and not the other slots
> (the worker slots).  I have also implemented the prolog and epilog
> scripts from
> http://gridengine.sunsource.net/project/gridengine/howto/filestaging/
> but with the same result.
> Am I missing something?  Is there a way that I can stage input files
> for all the slots that will be used in a MPI job?

This is by design. The job script and the prolog/epilog run only on
the master node of the parallel job. Often this is sufficient, as only
rank 0 reads the file(s) and all further communication is done via MPI
itself. Is there an advantage to copying the data to all nodes rather
than reading it from the original location?

If in your case all nodes need the input file locally, you have to
copy it to each node by hand. The $TMPDIR may differ between nodes if
you get slots from several queues to fulfill your request, as the
queue name is part of the $TMPDIR path. You will need a loop across
all nodes that copies the data into some kind of persistent directory,
because the $TMPDIR on a slave node is removed again as soon as a qrsh
call finishes - any file copied there would be lost before the next
call could use it.
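Such a loop can be sketched as below, assuming the usual $PE_HOSTFILE line format ("hostname slots queue processor-range") and the ${TMPDIR}_persistent naming used later in this mail; the node names, the input file name, and the fallback hostfile are fabricated for illustration:

```shell
#!/bin/sh
# Sketch: collect the unique hostnames of a parallel job and stage a
# file to a persistent directory on each. Outside a real SGE job
# $PE_HOSTFILE is unset, so fabricate one here for illustration.
if [ -z "$PE_HOSTFILE" ]; then
    PE_HOSTFILE=/tmp/pe_hostfile.$$
    printf '%s\n' \
        'node01 2 all.q@node01 UNDEFINED' \
        'node02 2 all.q@node02 UNDEFINED' \
        'node01 1 extra.q@node01 UNDEFINED' > "$PE_HOSTFILE"
fi

# Column 1 is the hostname; a node appears once per queue it
# contributes slots from, so uniquify the list.
hosts=`cut -f1 -d' ' "$PE_HOSTFILE" | sort -u`

for host in $hosts; do
    echo "staging to $host"
    # Inside a real, tightly integrated job this would be e.g.:
    #   $SGE_ROOT/bin/$ARC/qrsh -inherit $host mkdir ${TMPDIR}_persistent
    #   $SGE_ROOT/bin/$ARC/qrsh -inherit $host \
    #       cp $SGE_O_WORKDIR/input.dat ${TMPDIR}_persistent
done
```

The qrsh -inherit calls are commented out so the sketch runs standalone; inside a job they give you the Tight Integration accounting mentioned below.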

Although you won't need a start_proc_args for Open MPI, you could
define one with a loop to copy the data. Below is a digest from our
MPICH startup for MOLCAS, which needs persistent scratch directories
between subsequent qrsh calls:


    PeHostfile2MachineFile()
    {
       cat $1 | while read line; do
          host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
          nslots=`echo $line|cut -f2 -d" "`
          i=1
          while [ $i -le $nslots ]; do
             # add here code to map regular hostnames into ATM hostnames
             echo $host
             i=`expr $i + 1`
          done
       done
    }

# $pe_hostfile is handed over via the PE's start_proc_args definition
machines=$TMPDIR/machines
myhostname=`hostname`

PeHostfile2MachineFile $pe_hostfile >> $machines

mkdir ${TMPDIR}_persistent
for HOST in `grep -v $myhostname $machines | uniq`; do
     $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST mkdir ${TMPDIR}_persistent
done

exit 0

Of course, you can't delete the scratch directories this way - in
case of a qdel it's no longer allowed to use qrsh. So I set up a
special "cleaner.q", to which a job is submitted per node in a loop in
the stop_proc_args procedure:


for HOST in `uniq $TMPDIR/machines`; do
     $SGE_ROOT/bin/$ARC/qsub -l hostname=$HOST,virtual_free=0,cleaner \
          $SGE_ROOT/cluster/molcas/cleaner.sh ${TMPDIR}_persistent
done

rm $TMPDIR/machines

exit 0

Depending on your setup, you might need a different resource request
and path to the cleaner script. The "cleaner.q" has only one slot, is
always allowed to run without any restrictions, and uses a FORCED BOOL
complex attribute "cleaner" so that normal jobs can't end up there.
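Such a setup might look like the following excerpts (the exact names and values here are assumptions, shown in the layout qconf uses):

```
# qconf -sc excerpt: a FORCED BOOL complex, so only jobs that
# explicitly request "cleaner" may run in cleaner.q
#name     shortcut   type   relop   requestable   consumable   default   urgency
cleaner   cleaner    BOOL   ==      FORCED        NO           FALSE     0

# qconf -sq cleaner.q excerpt: one slot, complex attached to the queue
qname            cleaner.q
slots            1
complex_values   cleaner=TRUE
```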


The cleaner.sh script itself just removes the directory it is given:

# Remove the persistent directory handed over as the first argument

rm -rf "$1"

exit 0

So, the execution nodes must also be submit nodes. Now you can copy
the input file into the persistent directories on all nodes from your
jobscript, and you still have a Tight Integration:

myhostname=`hostname`
for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
     $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST cp input.dat ${TMPDIR}_persistent
done

Note that the target has to be ${TMPDIR}_persistent and not the plain
$TMPDIR, which is removed again when each qrsh call finishes.

HTH - Reuti

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net