AW: AW: [GE users] resource allocation and race condition

Rod.Rebello at Microchip.com Rod.Rebello at Microchip.com
Mon Oct 18 16:52:31 BST 2004


Mark,

Can you use scripts like we do to submit placeholder jobs to SGE for your 
interactive runs?  The user runs a script that submits a job to decrement 
the consumable.  The script waits for the job to be running in SGE (safe 
to take a license) then runs the interactive job.  The SGE job then 
periodically checks to see if the interactive job is done via a lock file 
mechanism then exits to free up the license (increment the consumable).
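
A minimal sketch of the SGE-side half of this scheme, written in Python. It
assumes (this is not spelled out above) that the user's wrapper script creates
the lock file before submitting the placeholder job and deletes it when the
interactive program exits. The function name, the polling interval and the
optional timeout are all hypothetical.

```python
import os
import time


def hold_license(lock_path, poll_seconds=30.0, timeout=None):
    """Block while lock_path exists; return True once it is gone.

    Intended as the body of the placeholder job: while this function
    blocks, the job keeps running and the consumable stays decremented.
    Assumption: the wrapper script owns the lock file and removes it
    when the interactive run is finished.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while os.path.exists(lock_path):
        if deadline is not None and time.monotonic() >= deadline:
            return False  # gave up waiting; the lock file is still there
        time.sleep(poll_seconds)
    return True
```

When the function returns, the placeholder job simply exits and SGE increments
the consumable again. The timeout is only a safety net for a wrapper that dies
without cleaning up its lock file.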

We've set up a special "interactive" queue on a non-critical system to 
take these requests.

----------------------------------------
Rod Rebello
Microchip Technology Inc.





"Olesen, Mark" <Mark.Olesen at arvinmeritor.com>
10/18/2004 03:05 AM
Please respond to users

 
        To:     "'users at gridengine.sunsource.net'" <users at gridengine.sunsource.net>
        cc: 
        Subject:        AW: AW: [GE users] resource allocation and race condition



Thanks to all of you for the various answers.
Unfortunately, I don't seem to have found the magic formula yet.

As an example of the problem, I track 2 independent licenses:
  'shpc'  (8 licenses, exclusively for parallel calculations)
  'stars' (4 licenses, for serial/parallel/interactive use)

Since the 'stars' licenses can also be used for parallel, when the 'shpc'
licenses are exhausted, I build a composite resource 'shpc+' that can be
requested from the SGE.

The 'stars' licenses can be used interactively (pre/post-processing), or
for dry runs.  As such, they are often called directly from the user's
workstation and thus bypass the SGE. The 'shpc' licenses are mostly, but
not exclusively, used via the SGE. Thus, a load sensor to account for
non-SGE usage would appear to be unavoidable.
 
The exclusive reliance on the load sensor, however, gives rise to a
possible race condition not only between SGE and non-SGE jobs, but also
among SGE-queued jobs themselves.
While a race condition between SGE and non-SGE jobs can be tolerated
somewhat, the race condition between SGE-queued jobs needs to be
eliminated.

As many of you have pointed out, combining SGE consumables with a load
report will prevent oversubscription of a resource.  It seems to me,
however, that UNDERsubscription will be the problem with this approach.

Given that the load sensor currently reports the number of licenses available,
e.g.:
    Users of hpcdomains: \ 
        (Total of 8 licenses issued; Total of 6 licenses in use)
yields,
    begin
    global:shpc:2
    end
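
The conversion from the license status line to the report block can be
sketched as follows. This is only an illustration in Python: the regular
expression is fitted to the one sample line quoted here, and the function
names are hypothetical. How the status text is obtained, and the loop a real
load sensor runs in, are left out.

```python
import re

# Fitted to the sample line in this message; other license managers or
# versions may word it differently (assumption).
USAGE_RE = re.compile(
    r"Total of (\d+) licenses? issued;\s+Total of (\d+) licenses? in use"
)


def free_licenses(status_text):
    """Return issued minus in-use, parsed from the license status text."""
    match = USAGE_RE.search(status_text)
    if match is None:
        raise ValueError("no license usage line found")
    issued, in_use = map(int, match.groups())
    return issued - in_use


def load_report(host, resource, value):
    """Format one begin/end report block in the layout shown above."""
    return "begin\n{}:{}:{}\nend\n".format(host, resource, value)
```

For the sample line, `free_licenses` gives 2 and
`load_report("global", "shpc", 2)` reproduces the three-line block.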

Then, AFAICS, the license would be tracked as follows (here I've invented
a new 'count' attribute for 'qconf -se global' to show the internal
consumption):

0) Start - no licenses used:
    complex_values shpc=8
    load_values    shpc=8
    count          shpc=0

1) Start 6 license job via SGE:
    complex_values shpc=8
    load_values    shpc=2
    count          shpc=6

2) Start 2 license job via SGE ?

Following the logic of host_conf(5), the quota definition (shpc=8) will be
replaced by the current load value (shpc=2).  From the 2 licenses that are
now registered as being available, *none* can be granted, since the
internal count is already at 6!
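
The arithmetic behind this worry can be written out as a small Python model.
Both functions are hypothetical bookkeeping models, not SGE code: the first
follows the reading of host_conf(5) described in this message, the second
follows the "lower of the two values" behaviour from the quoted reply below.
Neither is confirmed here to match what the scheduler actually does.

```python
def grantable_as_described(quota, load_value, internal_count):
    """Free licenses if the load value replaces the quota and the
    internal count is then subtracted from it (the reading above)."""
    capacity = quota if load_value is None else load_value
    return max(capacity - internal_count, 0)


def grantable_lower_of_two(quota, load_value, internal_count):
    """Free licenses if the lower of the consumable remainder and the
    load value is used (the behaviour described in the quoted reply)."""
    remaining = quota - internal_count
    return remaining if load_value is None else min(remaining, load_value)


# State after step 1: quota 8, load value 2, internal count 6.
print(grantable_as_described(8, 2, 6))   # 0: the 2-license job must wait
print(grantable_lower_of_two(8, 2, 6))   # 2: the 2-license job can start
```

Under the first model the SGE-held licenses are subtracted twice, once by the
load sensor and once by the internal count, which is exactly the
undersubscription in question. Under the second model they are counted once
on each side and the smaller figure wins.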

There must either be something severely wrong with my reasoning, or I'm
taking the wrong approach to mixing internal consumables and load reports.

/mark


Dr. Mark Olesen
Thermofluid Dynamics Analyst
ArvinMeritor Light Vehicle Systems
ArvinMeritor Emissions Technologies GmbH
Biberbachstr. 9
D-86154 Augsburg, GERMANY
tel: +49 (821) 4103 - 862
fax: +49 (821) 4103 - 7862
Mark.Olesen at ArvinMeritor.com

> Please see this HOWTO on tracking licenses with GE:
> http://bioteam.net/dag/sge-flexlm-integration/
> 
> The HOWTO has all the details, but basically, you track a license with
> *both* a load sensor *and* a consumable resource simultaneously.  The
> GE master will then use whichever is the lower of the two values in
> order to avoid oversubscribing a license.  The HOWTO talks about how
> there's still the possibility of a race condition, and ways to deal
> with it.
> 
> Regards,
>                Charu
> 
> On Oct 15, 2004, at 8:19 AM, Olesen, Mark wrote:
> 
> >>> Assuming that I only have a single float license 'foo', I can
> >>> 'qsub -l foo=1' a job.  After a while I submit two (2) new jobs with
> >>> the same resource requirement(s). Both these jobs wait politely in
> >>> the queue, since the resource 'foo' is unavailable.  After the first
> >>> job finishes, and the load reports get correctly updated, *both* of
> >>> the jobs in the queue try to grab the 'foo' resource (almost)
> >>> simultaneously.
> >>> How can I circumvent such a race condition?
> >>
> >> Could you use a SGE consumable in addition to your load sensor? -
> >> Reuti
> >
> >
> > Based on what I can read from host_conf(5) about 'complex_values', I'd
> > have to alter the load sensor so that it only tracks non-SGE license
> > use rather than reporting the number of licenses currently available
> > for use.
> >
> > This means that the load sensor needs to distinguish between
> > applications that were started with/without SGE. If accomplished, this
> > would make the load sensor anything other than lightweight.
> >
> > Is there a direct way, or a backdoor, to determine how many resources
> > SGE believes are still free and/or have been allocated?  Perhaps this
> > could be a means of adjusting the load sensor values.
> >
> > /mark
> >
> > Dr. Mark Olesen
> > Thermofluid Dynamics Analyst
> > ArvinMeritor Light Vehicle Systems
> > ArvinMeritor Emissions Technologies GmbH
> > Biberbachstr. 9
> > D-86154 Augsburg, GERMANY
> > tel: +49 (821) 4103 - 862
> > fax: +49 (821) 4103 - 7862
> > Mark.Olesen at ArvinMeritor.com
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> >> Sent: Friday, 15 October 2004 11:00
> >> To: users at gridengine.sunsource.net
> >> Subject: Re: [GE users] resource allocation and race condition
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> ###############################################################
> # Charu V. Chaubal              # Phone: (650) 786-7672 (x87672)
> # Grid Computing Technologist   # Fax:   (650) 786-4591
> # Sun Microsystems, Inc.        # Email: charu.chaubal at sun.com
> ###############################################################
> 
> 