[GE users] Floating node-locked license management and qlicserver
Daire.Byrne at framestore.com
Thu Jul 30 09:43:56 BST 2009
This is a problem I have come across before (you can probably search the mailing list for my posts on it). I think the simplest solution in the end is to set aside some compute nodes for jobs that need these licenses: with 40 licenses you would set aside 40 hosts. Obviously this is very wasteful if you have a lot of licenses, but the licenses may well cost much more than a compute node anyway. Perhaps you could also tell jobs that request the licenses to "prefer" these machines, and all other jobs to "prefer to avoid" them, but that sounds like what you already have with your load sensors.
Since I last looked at this, SGE (v6.2) may have gained some new tricks for handling the "single license consumed for entire host" licensing scheme.
----- "blair" <blair.bethwaite at infotech.monash.edu.au> wrote:
> Hi all,
> Apologies for the long post but I'm trying to share some experience as
> well as ask a couple of questions...
> Recently I spent a good portion of a week struggling with the
> following resource management issue in SGE (currently using 6.1u4)...
> We're making available a software package, let's call it TheSoftware,
> which uses the FlexLM license manager to dish out licenses on a rather
> odd basis - they are floating, but once assigned, node-locked (at
> least this seems to be the terminology used in lmstat).
> What this actually means is that one license for TheSoftware is
> consumed per user per node. We're running a commodity high-throughput
> cluster so naturally all the nodes have many cpus/cores so it is
> possible for a user to have several instances of TheSoftware running
> on a single node, which serendipitously means they can get extra bang
> for a single license. E.g. say there are five 8-core nodes (with 8
> slots each) free, 20 licenses for TheSoftware available, and a user
> submits 40 jobs using TheSoftware - all 40 jobs can run in parallel
> and only consume 5 TheSoftware licenses. However, making the scheduler
> manage this seems to be far from trivial - even when simplified for
> the case of a single user running TheSoftware...
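The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the "one license per user per node" rule as described; the `Job` record, the user name and the host names are made up for the example and do not come from lmstat or SGE.

```python
from collections import namedtuple

# Hypothetical record of one running instance of TheSoftware.
Job = namedtuple("Job", ["user", "host"])


def licenses_consumed(jobs):
    """One license per distinct (user, host) pair, however many jobs share it."""
    return len({(job.user, job.host) for job in jobs})


# Five 8-slot nodes, one user, 40 jobs filling every slot.
hosts = ["node%d" % i for i in range(1, 6)]
jobs = [Job("alice", host) for host in hosts for _slot in range(8)]

print(len(jobs))                # 40
print(licenses_consumed(jobs))  # 5
```

A second user landing on any of those nodes would add one more (user, host) pair and so one more license, which is why the multi-user case is harder to manage than the single-user one.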
> From the start I adopted a hybrid consumable + load sensor approach
> that turns out to be very similar to that documented here
> which I found later - the main difference being that mine is just
> shell scripts with no caching and is specific to TheSoftware licenses.
> It took me a while to figure out how to get SGE to maximise use of
> licenses by preferring to schedule jobs requesting TheSoftware
> licenses/consumable to a machine already running TheSoftware. My
> initial thought was to create a queue for each host with increasing
> sequence numbers but that seemed like a very roundabout (and painful
> to maintain) hack. After being frustrated by the documentation
> regarding load sensors in the admin guide I finally stumbled onto the
> sge_conf man page and realised this could all be configured on a per
> host basis, and also that the load_sensor field was actually a list of
> paths. So I added a new sensor to the global configuration that simply
> sets a boolean resource value on each host indicating whether
> TheSoftware is running there; jobs then make a soft resource request
> for this resource to be true. This approach works to a certain extent.
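For readers who have not written one, a per-host boolean sensor of the kind described might look roughly like the sketch below. It follows the usual load sensor exchange (one begin/end report each time a line arrives on stdin, stop on "quit"). The resource name `thesoftware_running`, the process name `thesoftware`, the use of `pgrep` and the 1/0 encoding of the boolean are all assumptions for illustration, not the scripts from the original post (those were shell).

```python
import socket
import subprocess

RESOURCE = "thesoftware_running"   # hypothetical boolean complex name
PROCESS = "thesoftware"            # hypothetical executable name


def is_running(process=PROCESS):
    """True if pgrep finds a process with exactly this name."""
    result = subprocess.run(["pgrep", "-x", process],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def respond(requests, probe=is_running, host=None):
    """Yield one begin/end report per request line; stop on 'quit'."""
    host = host or socket.gethostname()
    for line in requests:
        if line.strip() == "quit":
            return
        yield "begin"
        yield "%s:%s:%d" % (host, RESOURCE, 1 if probe() else 0)
        yield "end"
```

In an actual sensor script this would be driven by something like `for out in respond(sys.stdin): print(out, flush=True)`, and the host name reported would need to match the one the execution daemon knows the host by.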
> A couple of issues remain though (any suggestions would be welcomed):
> - This doesn't work with qlicserver because the internal consumable
> accounting is done per job, so if 8 instances of TheSoftware are
> running on an 8-core node, SGE and qlicserver incorrectly think 8
> licenses are being consumed (even though lmstat reports otherwise). In
> essence, qlicserver would require modification to handle the floating
> dynamically node-locked licenses.
> - Using a soft resource request makes it possible to open new nodes to
> TheSoftware jobs when e.g. a license is available but all existing
> nodes running TheSoftware are full or there are none currently
> running. However, particularly in the latter case, when SGE schedules
> a batch of new jobs in a single interval there is no time to update
> the load sensor that indicates TheSoftware is running. This means the
> new jobs are distributed according to the usual scheduling heuristics
> which tend to be worst case for conserving licenses.
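The second issue can be shown with a toy model. This is not how the SGE scheduler is implemented; it assumes a single user, that the "usual heuristics" amount to picking the least-loaded host, and that the sensor value stays stale for the whole batch. The function names and host names are hypothetical.

```python
def place(num_jobs, hosts, slots, prefer_used):
    """Assign jobs one at a time; return the number of jobs per host.

    prefer_used=False: the 'running' flag is stale for the whole batch, so
        every host looks the same and the emptiest host wins each time.
    prefer_used=True: placements are visible immediately, so a host that
        already runs TheSoftware (and has a free slot) wins.
    """
    load = {h: 0 for h in hosts}
    for _ in range(num_jobs):
        free = [h for h in hosts if load[h] < slots]
        if not free:
            break
        if prefer_used:
            used = [h for h in free if load[h] > 0]
            target = used[0] if used else free[0]
        else:
            target = min(free, key=lambda h: load[h])
        load[target] += 1
    return load


def licenses(load):
    """Single user: one license per host running at least one job."""
    return sum(1 for n in load.values() if n > 0)


hosts = ["node%d" % i for i in range(1, 6)]
spread = place(8, hosts, slots=8, prefer_used=False)
packed = place(8, hosts, slots=8, prefer_used=True)
print(licenses(spread))  # 5
print(licenses(packed))  # 1
```

With eight jobs and five empty 8-slot nodes, the stale-flag case touches every node and takes five licenses, while packing takes one. Closing that gap would need the scheduler to see its own placements within the interval, which a load sensor cannot provide.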
> If you read this far, thanks!
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].