[GE users] Workload management and virtualization

daireb Daire.Byrne at framestore.com
Sat Nov 8 18:21:44 GMT 2008


Seeing as I have recently been giving this some thought in relation to our future needs (virtualisation is going to be pretty big) I can contribute a few thoughts.

> On 07.11.2008 at 14:33, adoerr wrote:
> > there is currently a very interesting discussion ongoing concerning
> > GridEngine and virtualization.
> >
> > I want to invite you to a little 'Thought Experiment'. From your point
> > of view, how do you think should an ideal integrated solution for
> > workload management *and* management of virtual resources should look
> > like? Do you think this would be a good idea at all? Don't be shy and
> > feel free to come up with a long wish list, Christmas is coming ;-)

The main question that interested me was whether organisations with compute farms can use the same system (e.g. SGE) and the same resources to manage a VM server cluster. Applications like OpenNebula and oVirt are designed to manage such clusters, but they work independently of the job scheduler, so I feel that the actual compute resources cannot be fully utilised automatically. There are two main groups of VM: short-lived execution environments in which you run specific jobs (these can run an execd, but as part of a special queue), and long-running "server" VMs, which may be virtual appliances or execution hosts whose OS can change dynamically depending on the pending requirements. Rebooting, reinstalling or partitioning the farm by OS is usually a manual, slow and inefficient operation.

Some example cases I can think of -
  * Your compute farm mostly runs normal jobs (through SGE), but when the resources are underutilised, other organisations with their own custom OS images (and even schedulers) can use up the free CPU time. Essentially you could rent out your unused capacity in a similar way to Amazon's EC2, billing them for what they use. Things like Globus spend much of their time ensuring that remote execution environments are similar to yours - VMs are much easier IMHO.
  * Desktop machines can become part of the compute resources during periods of idleness without having to guarantee that their OS image is the same as that running on the permanent compute farm. The VMs running on them can be migrated or suspended when the user returns.
  * Long running simulations can be periodically checkpointed (e.g. in case of a power outage) without having to program in specialised checkpointing interfaces. Third party commercial applications can also be checkpointed easily.
  * Under extreme server cluster load the VMs can spill over onto the compute farm if required (server VMs would be very high priority jobs).
  * Rolling out a new OS image across the compute farm hardly ever happens in one go, as some departments have not yet migrated their code. So roll out the new VM-capable image, but automatically spawn VMs of the older OS image when those departments need to run their older software. Perhaps something like Hedeby could start and stop these VM jobs when it detects jobs that require them. You could automatically reboot the machines between OS images too, but that seems more wasteful - what if you only need a single 2-slot job to run on the old OS but your hosts are 8-slot machines? It would be better to keep the other 6 slots available for the newer OS jobs.
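
To put rough numbers on the last case, here is a minimal Python sketch of the slot arithmetic. The function name and the assumption that a rebooted host is entirely lost to new-OS jobs are mine, purely for illustration; nothing here is an SGE or Hedeby interface.

```python
def free_slots_for_new_os(host_slots, old_os_job_slots, use_vm):
    """Slots on one host still usable by new-OS jobs while an old-OS job runs.

    Assumption (illustrative only): rebooting the host into the old OS makes
    every slot unavailable to new-OS jobs, whereas an old-OS VM only occupies
    the slots that the old-OS job itself requested.
    """
    if old_os_job_slots > host_slots:
        raise ValueError("job does not fit on this host")
    if old_os_job_slots == 0:
        # No old-OS work at all, so the whole host stays on the new OS.
        return host_slots
    if use_vm:
        # The VM takes only the slots of the old-OS job.
        return host_slots - old_os_job_slots
    # Whole host rebooted into the old OS image.
    return 0


# The example from the text: a 2-slot old-OS job on an 8-slot host.
vm_case = free_slots_for_new_os(8, 2, use_vm=True)
reboot_case = free_slots_for_new_os(8, 2, use_vm=False)
```

With the VM approach 6 slots stay available for new-OS jobs; with a reboot, under the assumption above, none do.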

Obviously the big thing missing from all of this is a nice GUI to manage VMs like OpenNebula but if something like SGE can provide all the required functionality then creating custom GUIs (like many already do for things like "qstat" and "qalter") is fairly trivial. Or perhaps, like Hedeby, it is something that Sun would be interested in developing at some point.

> What we would like to have, is a checkpointing (& migration) facility  
> for long running applications - even for applications where only the
> binaries are available.

The checkpointing suspend_method in SGE should be easy enough to configure to suspend and resume jobs/VMs, but obviously there is currently no built-in way of migrating jobs. As I mentioned in the other email thread, there are probably ways to devise such a feature: launch a new job that, once running, pulls the VM over from the other machine, which in turn causes the original job to quit.
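
As a rough idea of what such a suspend/resume hook might call, here is a small Python sketch that builds virsh command lines. It assumes the VMs are managed through libvirt and that the hook is given the libvirt domain name somehow; the helper names and that wiring are hypothetical, and how the script is actually attached to a queue or checkpointing environment is left out.

```python
import subprocess


def virsh_command(action, domain=None, state_file=None):
    """Build the virsh command line for a VM action.

    Assumptions (not from SGE itself): the hypervisor is driven via libvirt's
    virsh tool, and the caller knows which libvirt domain backs the job.
    """
    if action in ("suspend", "resume"):
        if not domain:
            raise ValueError("suspend/resume need a domain name")
        # Pause or unpause the VM in memory.
        return ["virsh", action, domain]
    if action == "save":
        if not domain or not state_file:
            raise ValueError("save needs a domain name and a state file")
        # Write the VM state to disk and stop it (a crude checkpoint).
        return ["virsh", "save", domain, state_file]
    if action == "restore":
        if not state_file:
            raise ValueError("restore needs a state file")
        # Bring a previously saved VM back from its state file.
        return ["virsh", "restore", state_file]
    raise ValueError("unknown action: %s" % action)


def run_vm_action(action, domain=None, state_file=None):
    """Run the action and return virsh's exit status."""
    return subprocess.call(virsh_command(action, domain, state_file))
```

A suspend hook would then call something like run_vm_action("suspend", "vm-host1"), and a checkpoint hook run_vm_action("save", "vm-host1", "/some/shared/path/vm-host1.state"), where both the domain name and the path are placeholders.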

> Maybe it's not necessary to run a complete virtual machine for each  
> slot (and having one execd for the virtual machine and an additional
> one inside the virtual machine), but to emulate only some layer of a
> virtual machine. The sge_shepherd becomes a sge_virtualizer with a  
> tighter integration of the outer and inner world. This would allow,  
> also to send e.g. signals from the outer machine to the program  
> running in the sge_virtualizer. As VirtualBox is not only open  
> source, but also now owned by SUN, maybe there are good options to  
> combine it.

I think that the VM would, in most cases, be a "parallel" job using a fair proportion of the actual CPUs - single slot/CPU VMs are probably not the most efficient use of RAM, for example. That said, until virtualised drivers (network/storage) can match bare-metal performance, it may be better to have multiple VMs per physical host. Running an execd inside the VM and using queue configurations and job resource requests should be good enough to manage jobs within the VM, assuming you use a fully bridged network. If you can't or don't want to run an execd within the VM, then an interface to ssh into the VM and run jobs may be useful. Perhaps you could have an extra accounting dependency that knows that a hostname is actually a VM subset of another hostname. Qstat could then be made to report all jobs running on "host1" even if some of them are running within a locally networked VM "vm-host1" - it gets complicated quickly!
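
The hostname bookkeeping could be sketched like this in Python. The mapping table, the function names and the (job_id, exec_host) tuples are all made up for illustration; qstat and the SGE accounting do not provide anything like this today as far as I know.

```python
from collections import defaultdict

# Hypothetical table recording which physical host each VM hostname lives on.
VM_PARENT = {"vm-host1": "host1"}


def physical_host(hostname, vm_parent=VM_PARENT):
    """Return the physical host for a hostname (itself if it is not a VM)."""
    return vm_parent.get(hostname, hostname)


def jobs_by_physical_host(jobs, vm_parent=VM_PARENT):
    """Group (job_id, exec_host) pairs under the physical host they run on."""
    grouped = defaultdict(list)
    for job_id, exec_host in jobs:
        grouped[physical_host(exec_host, vm_parent)].append((job_id, exec_host))
    return dict(grouped)


# Jobs 101 and 102 both end up listed under "host1", although 102 runs in a VM.
report = jobs_by_physical_host([(101, "host1"), (102, "vm-host1"), (103, "host2")])
```

A real version would also have to cope with VMs that migrate between hosts, which is where it starts to get complicated.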

VirtualBox is the Sun favorite but libvirt is becoming a pretty good hypervisor interface standard on Linux which hopefully Sun will support at some stage. I'm more of a KVM fan at the moment - Xen was just too complicated to package and maintain.


