[GE users] File staging problem / Persistent $TMPDIR on all slave nodes

Reuti reuti at staff.uni-marburg.de
Tue Jul 8 17:16:57 BST 2008


On 08.07.2008, at 17:56, Gerhard Venter wrote:

> Reuti,
>
> Yes, you are correct,
>
> mpirun -np ${NSLOTS} ./mpijob
>
> also works.
>
> Thanks for your input.  For now I have a working solution based on the
> input I got from Kevin, using a prolog script.  The reason I want the
> files copied to the work directory of each worker thread is that each
> worker thread, among other things, starts a numerical simulation that
> runs serially.  Each of these simulations generates large scratch
I see. But isn't it possible to specify a full pathname for the input
file instead of the default "input.dat" in the cwd? That way the cwd
for each worker could still be local on the node, but the input files
wouldn't need to be copied to the same location on all nodes to be
accessible.
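
For example (just a sketch - this assumes your mpijob binary accepts
the path to the input file on its command line, which it may not):

cd $TMPDIR
mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob $SGE_O_WORKDIR/input.dat

Each rank would then open the input via the absolute path, while any
scratch files it writes still land in the node-local cwd.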

-- Reuti


> files, and as a result I want them to run on the local disk.  For now
> I think the best solution is a combination of what Kevin suggested and
> some manual file manipulation from within my worker threads.
>
> Thanks again for your input,
> Gerhard
>
> On Tue, 2008-07-08 at 13:55 +0200, Reuti wrote:
>> Hi,
>>
>> On 08.07.2008, at 09:32, Gerhard Venter wrote:
>>
>>> I am using SGE 6.0 and Open MPI.  I am distributing my job to
>>> multiple slots using a round-robin MPI parallel environment - the
>>> result is that each slot is on a different compute node, each with
>>> its own local disk storage.  I would like to stage files from my
>>> home directory to the $TMPDIR on each compute node that MPI will
>>> use.  My submit script looks something like this:
>>>
>>> #!/bin/bash
>>> #$ -cwd
>>> #$ -j y
>>> #$ -pe openmpi_rr 4
>>> #
>>> cp input.dat $TMPDIR
>>> cd $TMPDIR
>>> mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob
>>
>> mpirun -np ${NSLOTS} ./mpijob
>>
>> might also work.
>>
>>>
>>> The cd $TMPDIR does change the pwd on each node to $TMPDIR (I print
>>> the value for all slots from my MPI program); however, the files are
>>> only copied to the first MPI slot (the master slot) and not to the
>>> other slots (the worker slots).  I have also implemented the prolog
>>> and epilog scripts from
>>>
>>> http://gridengine.sunsource.net/project/gridengine/howto/filestaging/
>>>
>>> but with the same result.
>>>
>>> Am I missing something?  Is there a way that I can stage input files
>>> for all the slots that will be used in an MPI job?
>>
>> This is by design. The jobscript and the prolog/epilog run only on
>> the master node of the parallel job. Often this is sufficient, as
>> only rank 0 reads the file(s) and all further communication is done
>> via MPI itself. Is there an advantage in copying the data to all
>> nodes rather than reading it from the original location?
>>
>> If in your case all nodes need the input file to be local, you have
>> to copy it to all nodes by hand. The $TMPDIR might be different on
>> each node if you get slots from several queues to fulfill your
>> request, as the queue name is part of the $TMPDIR path. You will need
>> a loop across all nodes that copies the data to some kind of
>> persistent directory, because the $TMPDIR created on a slave node for
>> a qrsh call is removed again as soon as that qrsh exits - hence a
>> copy made into it this way won't survive.
>>
>> Although you won't need a start_proc_args for Open MPI, you could
>> define one with a loop to copy the data (a digest of our MPICH
>> startup for MOLCAS, which needs persistent scratch directories
>> between subsequent qrsh calls):
>>
>> ========================START=========================
>> #!/bin/sh
>>
>> # Expand the PE hostfile (one line per host with its slot count) into
>> # a machine file with one line per slot.
>> PeHostfile2MachineFile()
>> {
>>     cat $1 | while read line; do
>>        # echo $line
>>        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
>>        nslots=`echo $line|cut -f2 -d" "`
>>        i=1
>>        while [ $i -le $nslots ]; do
>>           # add here code to map regular hostnames into ATM hostnames
>>           echo $host
>>           i=`expr $i + 1`
>>        done
>>     done
>> }
>>
>> # The PE hostfile is handed over as the first argument (via
>> # start_proc_args); the machine file is kept in $TMPDIR so the
>> # jobscript and the stop_proc_args can reuse it.
>> pe_hostfile=$1
>> machines=$TMPDIR/machines
>> PeHostfile2MachineFile $pe_hostfile >> $machines
>>
>> # Create the persistent scratch directory locally and, via
>> # qrsh -inherit, on every slave node.
>> myhostname=`hostname`
>> mkdir ${TMPDIR}_persistent
>> for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
>>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST mkdir ${TMPDIR}_persistent
>> done
>> exit 0
>> ========================END=========================
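>>
>> For reference, hooking these scripts in is just a matter of pointing
>> the PE at them - only a sketch, the script names/paths here are
>> placeholders and the slot count depends on your cluster:
>>
>> pe_name           openmpi_rr
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /usr/sge/cluster/start_pe.sh $pe_hostfile
>> stop_proc_args    /usr/sge/cluster/stop_pe.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min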
>>
>> Of course, you can't delete the scratch directories the same way - in
>> case of a qdel the job is no longer allowed to use qrsh. So I set up
>> a special "cleaner.q", to which a cleanup job per node is submitted
>> in a loop in the stop_proc_args procedure:
>>
>> =======================START==========================
>> #!/bin/sh
>>
>> # Submit one cleanup job per node; each removes that node's
>> # persistent scratch directory.
>> for HOST in `uniq $TMPDIR/machines`; do
>>      $SGE_ROOT/bin/$ARC/qsub -l hostname=$HOST,virtual_free=0,cleaner \
>>          $SGE_ROOT/cluster/molcas/cleaner.sh ${TMPDIR}_persistent
>> done
>>
>> rm $TMPDIR/machines
>>
>> exit 0
>> ========================END=========================
>>
>> Depending on your setup, you might need a different resource request
>> and a different path to the cleaner script. The "cleaner.q" has only
>> one slot, is always allowed to run without any restrictions, and uses
>> a BOOL FORCED complex attribute "cleaner" to avoid normal jobs ending
>> up there. The cleaner script itself is trivial:
>>
>> =======================START==========================
>> #!/bin/sh
>>
>> #
>> # Remove the persistent directory
>> #
>>
>> rm -rf $1
>>
>> exit 0
>> ========================END=========================
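>>
>> Just as a sketch of the pieces behind that (your own definitions may
>> of course differ): the "cleaner" complex in qconf -mc would look
>> roughly like
>>
>> #name     shortcut  type  relop  requestable  consumable  default  urgency
>> cleaner   cl        BOOL  ==     FORCED       NO          0        0
>>
>> and cleaner.q would get "slots 1" and "complex_values cleaner=TRUE"
>> in its queue definition, so that only jobs explicitly requesting
>> "-l cleaner" can end up in it.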
>>
>> So, the execution nodes must also be submit nodes. Now you can copy
>> the input file to the persistent directories on all nodes and still
>> have a Tight Integration in your jobscript:
>>
>>
>> myhostname=`hostname`
>> for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
>>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST cp input.dat \
>>          ${TMPDIR}_persistent
>> done
>>
>> You will of course have to use ${TMPDIR}_persistent as the location
>> in your job then, instead of the plain $TMPDIR.
>>
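>> Putting it together, the jobscript could then look roughly like this
>> (only a sketch - it assumes the start/stop scripts above are
>> configured for the PE, a home directory shared across the nodes, and
>> that mpijob writes its scratch files into the cwd):
>>
>> #!/bin/bash
>> #$ -cwd
>> #$ -j y
>> #$ -pe openmpi_rr 4
>> #
>> # copy the input to the persistent directory on every slave node ...
>> myhostname=`hostname`
>> for HOST in `grep -v $myhostname $TMPDIR/machines | uniq`; do
>>      $SGE_ROOT/bin/$ARC/qrsh -inherit $HOST cp $SGE_O_WORKDIR/input.dat \
>>          ${TMPDIR}_persistent
>> done
>> # ... and on the master node, then run from there
>> cp $SGE_O_WORKDIR/input.dat ${TMPDIR}_persistent
>> cd ${TMPDIR}_persistent
>> mpirun -np ${NSLOTS} $SGE_O_WORKDIR/mpijob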
>>
>> HTH - Reuti
>>
>
> -- 
> +------------------------------------------------------------------+
> || Prof. Gerhard Venter
> ||
> || Departement Meganiese en        |  Department of Mechanical and
> ||   Megatroniese Ingenieurswese   |    Mechatronic Engineering
> || Universiteit Stellenbosch       |  Stellenbosch  University
> || Privaat Sak X1 Matieland 7602   |  Private Bag X1 Matieland 7602
> || Suid-Afrika                     |  South Africa
> ||
> || Tel: +27 21 808 3560
> || E-Mail: gventer at sun.ac.za          Web: www.eng.sun.ac.za
> +------------------------------------------------------------------+
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



