[GE users] SGE support for cycle scavenging and Job checkpointing/migration

Reuti reuti at staff.uni-marburg.de
Fri Sep 19 13:04:01 BST 2008

Hi Atle,

On 19.09.2008 at 11:06, Atle Rudshaug wrote:

> Is anyone using SGE for cycle scavenging? We have a dedicated  
> cluster for MPI jobs using SGE and a lot of spare resources on  
> workstations all over the office. We would like to use the  
> workstations mostly for serial jobs but threaded ones as well,

You could define a calendar and start only short jobs on these
machines during the night. But this would mean either giving the
workstations r/w access to the NFS server or defining some kind of
file staging to transfer input and output files to and from the nodes.

Running only serial and threaded (or even local-MPI) jobs removes the
need to allow qrsh between the workstation nodes.

> however we really need a checkpoint and migration option so the
> jobs will be preempted and paused/moved when the user returns to
> his/her workstation.
> I have read about including Condor's libraries, but it has too many  
> limitations (gfortran not supported and max 2GB input files  
> supported, etc.).

I'm not sure whether this is really the limit in general. It is the
limit when you recompile your application for a full Condor
integration, as then some system calls are replaced by Condor calls,
e.g. to redirect local I/O on an execution node so that the files are
ultimately read from the submission host (and this call replacement
might be where the 2 GB limitation comes from). But you are right,
even then there are some limitations on which types of applications
can be recompiled at all. In a loose Condor integration you can run
any job, but then it's like using SGE.

> I have seen something about BLCR
> (http://www.escience.cam.ac.uk/projects/camgrid/blcr.html).
> How well does that work for SGE?


For general checkpointing:


> Or do we need to include manual checkpointing into our  
> applications? That will include a LOT of work which would be nice  
> to avoid.

Correct, but Linux isn't an operating system like NEC's Super-UX,
where such things are included by default for all applications.

> The main question is, how well does SGE handle cycle scavenging/job  
> migration?

SGE will support checkpointing when it's already built into the
application, but it will not add such a facility to an application on
its own. For low-priority jobs you can of course run them on the
workstations with a nice value of 19 (which can be set in the queue
definition). Then it's just a question of whether there is enough
memory available to run the user's tasks in addition.
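
To give an idea of what "built into the application" can mean, here
is a minimal Python sketch of application-level checkpointing. It is
only an illustration under assumptions: the file name state.json, the
helper names, the toy computation (summing step numbers) and the use
of a signal as the "please stop" notice are all made up for this
example and are not taken from SGE or from any particular application.

```python
import json
import os
import tempfile

# Set by the signal handler; the main loop checks it between steps.
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: ask the main loop to checkpoint and exit."""
    global stop_requested
    stop_requested = True


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_state(state, path):
    """Write the checkpoint to a temp file, then rename it into place,
    so an interrupted write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, n_steps, checkpoint_every=100):
    """Toy job: sum 0..n_steps-1, resuming from the checkpoint at `path`.
    Returns the total when finished, or None if stopped early."""
    state = load_state(path)
    while state["next_step"] < n_steps:
        state["total"] += state["next_step"]   # the "real work"
        state["next_step"] += 1
        if stop_requested or state["next_step"] % checkpoint_every == 0:
            save_state(state, path)
            if stop_requested:
                return None
    save_state(state, path)
    return state["total"]

# Hypothetical usage inside the job script (signal choice is an assumption):
#   import signal
#   signal.signal(signal.SIGUSR1, request_stop)
#   print(run("state.json", 1_000_000))
```

When the job is restarted, on the same or another machine that can
see the checkpoint file, it continues from the last saved step
instead of starting over. Which signal the job actually receives,
and when, depends on how the queue and the checkpointing environment
are set up.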

-- Reuti
