[GE users] SGE support for cycle scavenging and Job checkpointing/migration
reuti at staff.uni-marburg.de
Fri Sep 19 13:04:01 BST 2008
On 19.09.2008 at 11:06, Atle Rudshaug wrote:
> Is anyone using SGE for cycle scavenging? We have a dedicated
> cluster for MPI jobs using SGE and a lot of spare resources on
> workstations all over the office. We would like to use the
> workstations mostly for serial jobs but threaded ones as well,
You could define a calendar and start only short jobs at night on
these machines. But this would mean giving the workstations r/w
access to the NFS server, or defining some kind of file staging to
transfer input and output files to and from the nodes.
Running only serial and threaded (or even local-MPI) jobs removes the
need to allow qrsh between the workstation nodes.
> however we really need a checkpoint and migration option so the
> jobs will be preempted and paused/moved when the user returns to
> his/her workstation.
> I have read about including Condor's libraries, but it has too many
> limitations (gfortran not supported and max 2GB input files
> supported, etc.).
I'm not sure whether this is really the limit. It's the limit when
you recompile your application for a full Condor integration, as
then some system calls will be replaced by Condor calls, e.g. to
redirect "local I/O on an execution node" so that the files are
ultimately read from the "submission host" (and this call replacement
might have the 2 GB limitation). But you are right, even then there
are some limitations on which types of applications can be recompiled
at all. In a loose Condor integration you can run any job, but then
it's like
> I have seen something about BLCR (http://www.escience.cam.ac.uk/
> projects/camgrid/blcr.html). How well does that work for SGE?
For general checkpointing:
> Or do we need to include manual checkpointing into our
> applications? That will include a LOT of work which would be nice
> to avoid.
Correct, but Linux isn't an operating system like NEC's Super-UX,
where such things are included by default for all applications.
> The main question is, how well does SGE handle cycle scavenging/job
It will support checkpointing when it's already built into the
application, but it will not add such a facility to applications on
its own.
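To illustrate what "built into the application" can mean, here is a
minimal sketch in Python, assuming a job whose whole state fits in a
small dictionary. The job periodically writes its state to a file and
picks up from that file when it is started again. The function names,
the state layout and the `stop_after` parameter (which only simulates
the job being killed) are made up for this example; how SGE triggers
the checkpoint or restarts the job is not shown.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_i": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, n_steps, every=10, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing every `every` steps.

    `stop_after` (hypothetical, for demonstration only) aborts this run
    after that many steps, as if the job had been killed.
    """
    state = load_state(path)
    done = 0
    while state["next_i"] < n_steps:
        if stop_after is not None and done >= stop_after:
            return None  # interrupted; work since the last save is lost
        state["total"] += state["next_i"]
        state["next_i"] += 1
        done += 1
        if state["next_i"] % every == 0:
            save_state(path, state)
    save_state(path, state)
    return state["total"]
```

A run that is interrupted loses only the steps since the last saved
checkpoint; calling `run()` again with the same file continues from
there instead of starting over. A real application would have to save
whatever it needs to resume (arrays, iteration counters, random
number state), which is where the work mentioned above comes from.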
For low-priority jobs you can of course run them on the workstations
with a nice value of 19 (which can be defined in the queue
definition). Then it's just a matter of whether there is enough
memory available to run the user's tasks in addition.