[GE users] Checkpointing and using local hard disk of execution host
dan.templeton at sun.com
Tue Aug 25 15:16:39 BST 2009
Grid Engine doesn't provide you with any magic to move your files
around. If the files are on the execution node's local disk, then you
either have to access them via that execution host or transfer them
somewhere else manually. Grid Engine does provide you with hooks,
however, that you can use to automate some of your file transfers. For
example, you could write an epilog that moves the input.raw file from
the local disk into some kind of shared repository. Similarly, you could
set up the checkpoint object with a restart_command that you've written
to transfer the checkpointed state data from the job's previous
execution host to the current one.
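
As a rough illustration of the epilog idea, here is a minimal sketch in Python. It assumes the job writes input.raw into the directory named by TMPDIR and that the results should land in the submission directory named by SGE_O_WORKDIR. Both of those choices, and the stage_out helper itself, are assumptions made for the example, not something Grid Engine prescribes, so adjust the paths to your site.

```python
#!/usr/bin/env python3
"""Hypothetical epilog sketch: move a job's output off the local disk."""
import os
import shutil
from pathlib import Path


def stage_out(local_dir, shared_dir, filename="input.raw"):
    """Move filename from local_dir to shared_dir.

    Returns the destination path, or None if the file does not exist.
    """
    src = Path(local_dir) / filename
    if not src.is_file():
        return None
    dest_dir = Path(shared_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / filename
    # shutil.move copies and then deletes when crossing file systems,
    # which is the usual case for local disk -> NFS.
    shutil.move(str(src), str(dest))
    return dest


if __name__ == "__main__":
    # Assumption: the job wrote to $TMPDIR and the results belong in the
    # directory the job was submitted from ($SGE_O_WORKDIR).
    local = os.environ.get("TMPDIR")
    shared = os.environ.get("SGE_O_WORKDIR")
    if local and shared:
        stage_out(local, shared)
```

A script like this would be set as the queue's epilog. A restart_command for the checkpoint object could follow the same pattern in the other direction, copying the saved state to the new execution host before the job resumes.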
> Hi All,
> I'm using Sun Grid Engine on Linux with 10 systems to run Spectre (From
> We have configured two queues: 1. long.q 2. short.q (the long queue is for
> jobs that run for 1 hour, the short queue is for jobs that run for 5 minutes).
> (In the long queue, a job that runs for more than one hour is
> checkpointed and rescheduled. In the short queue, a job that runs for
> more than 5 minutes is rescheduled, and jobs in the queue wait
> state go to the execute state.)
> We were able to write a co-scheduler that implements the
> above requirement. However, I have one query.
> Engineers submit Spectre jobs from an NFS mounted directory structure.
> (Ex: /project/chip1/username-X). Within this directory the input file
> to Spectre, input.scs, is stored. This file is then submitted as input to
> the grid with: qsub input.scs -q short.q. The input.scs file
> has all the required pointers to the library. When a job is submitted to
> the grid using qsub, the output file, input.raw, is stored in the same
> We do not have a high-end NAS; as a result we observe performance
> issues with the NAS. Can we achieve the following?
> * Can we use the local hard disk of the execution host to dump the
> input.raw file?
> * Since we use checkpoint restart to reschedule the jobs in both
> long.q and short.q, is it advisable to use the local hard disk? (When
> the grid reschedules the job, it could resume on another
> execution host, right? It may then not find the data, as it is in
> the local directory of the other execution host.)
> Please clarify and provide input on how to achieve this.
> Thanks and Regards