[GE users] SGE6.0u3 global consumable resource - applies to all queues

Walt Minkel wminkel at latticesemi.com
Thu Feb 3 17:00:37 GMT 2005


Mark,

Nice write up.  It's all becoming clear now...
 
    Thks,  Walt

Olesen, Mark wrote:

>>The load sensor however seemed to be too slow.  For example, with five
>>licenses, if three jobs were running and four more were queued in, often
>>all four jobs would be submitted before the load sensor (using FlexLM's
>>lmstat) could report back the load.  If I delayed the starting of jobs,
>>~10 seconds, the load sensor could keep up.  Maybe there is a way to only
>>queue in after an event like a load refresh??
>>    
>>
>
>Walt,
>
>The problem that you describe here is inevitable with the pure load sensor
>approach. I'm working on a How-To that describes the problem and possible
>solutions. The preliminary version of the How-To, including some sample
>code, is attached. Run it thru pod2html / pod2text to obtain a nicer format.
>
>
>/mark
>
>  
>
>------------------------------------------------------------------------
>
>This is a short summary of integrating floating licenses (e.g. FlexLM) into
>the GridEngine. The text here is written in 'pod' format -- so conversion
>into 'html', '*roff', 'text', etc. is quite easy.  With some luck, this
>information will find its way into a 'proper' HowTo.
>
>=pod
>
>=head1 BACKGROUND
>
>The issue at hand is the correct bookkeeping of floating licenses when the
>GridEngine may or may not share them with other (non-GridEngine)
>applications. If the floating licenses are to be used exclusively on the
>GridEngine, with no means of external access, you need to read no further.
>The Administration Guide provides an adequate example of this task. For the
>balance of the installations (arguably the vast majority), you will need a
>B<combination> of internal and external license tracking to accomplish the
>task correctly.
>
>The load sensor provides the principal means of passing information from the
>external world into the GridEngine. For tracking floating licenses, a load
>sensor queries the license server, determines the number of licenses
>currently available, and reports this information back to the GridEngine in
>the expected format [see sge_execd(8)].  For anything but the simplest
>cases, munging the data via perl is recommended.  Before discussing the
>advantages/disadvantages of the various approaches, a small example of
>parsing FlexLM via perl is given.
>
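For readers new to load sensors, here is a rough sketch of the report
protocol mentioned above, written in plain sh and not taken from Mark's
attachment. It assumes the usual contract: sge_execd writes a line to the
sensor's stdin whenever it wants a report, and the word "quit" asks it to
stop. free_licenses is a made-up placeholder that returns a fixed number; a
real sensor would query the license server there.

```sh
#!/bin/sh
# Sketch of the load sensor report protocol.

# Hypothetical placeholder: a real sensor would parse the license
# server output here instead of returning a constant.
free_licenses() {
    echo 4
}

# Print one load report for the global host.
report() {
    echo "begin"
    echo "global:license:$(free_licenses)"
    echo "end"
}

# Answer every line received on stdin with one report;
# stop when "quit" is received or stdin is closed.
sensor_loop() {
    while read -r request; do
        if [ "$request" = "quit" ]; then
            return 0
        fi
        report
    done
}

# Simulate two report requests followed by "quit".
printf '\n\nquit\n' | sensor_loop
```

The last line only simulates two requests so the script can be tried from a
terminal; a sensor installed under execd would call sensor_loop directly.
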
>=head2 Text Processing Example
>
>This example of text processing munges the output of a FlexLM license server
>query into a load report. There are a few notable points incorporated in
>this simple example:
>
>=over 5
>
>=item *
>
>Only names explicitly listed in the lookup table will be reported.
>This prevents flooding the GridEngine with extraneous information.
>
>=item *
>
>The lookup table is used to map the names reported by the license server to
>those used by our complexes.  This ensures license version independence and
>consistent orthography.
>
>=item *
>
>Since the licenses may be split across several servers, the results must be
>accumulated before being reported.
>
>=item *
>
>Some normal licenses are bundled into new pseudo-licenses. This is useful
>when some licenses have mixed usage (i.e., interactive and non-interactive).
>
>=back
>
>For the example, we'll report the information for two software packages:
>Nastran (a FEA program) and Star-CD (a CFD program).
>
>Since Nastran tends to be quite I/O intensive in its current incarnation,
>the C<hog> approach will be used, in which each execution host is equipped
>with 100 hog units and particular jobs can use C<-l hog=100> to grab the
>entire machine.
>
>The current Star-CD licensing scheme has two licenses: C<hpcdomains>, which
>can only be used for parallel calculations, and C<starsuite>, which can be
>used for pre/post-processing, or serial calculation, or parallel
>calculation.  The pseudo-license C<shpc+> represents the combination of both
>licenses.
>
>Here is an example of the complexes defined:
>
>  #name     shortcut  type  relop requestable consumable default  urgency
>  #----------------------------------------------------------------------
>  hog       hog       INT   <=    YES         YES        1        1000
>  nastran   nas       INT   <=    YES         YES        0        1000
>  shpc      hpc       INT   <=    YES         YES        0        1000
>  shpc+     hpc+      INT   <=    YES         YES        0        1000
>  stars     star      INT   <=    YES         YES        0        1000
>
>Here is a perl program for parsing the FlexLM output:
>
>    #!/usr/bin/perl -w
>    use strict;
>
>    my %lut = (
>	NASTRAN    => "nastran",
>	hpcdomains => "shpc",
>	starsuite  => "stars",
>    );
>
>    my %bundle = ( "shpc+" => [qw( shpc stars )], );
>
>    # parse output that looks like this
>    #  Users of NASTRAN: (Total of 2 licenses issued; Total of 2 licenses in use)
>    #  Users of NASTRAN: (Total of 1 license issued; Total of 0 licenses in use)
>    #
>    # error checking is left as an exercise for the reader
>
>    sub lmstat {
>	my %hash = map { $_ => [0, 0] } values %lut;
>
>	local @ARGV = "lmstat -a|";
>	while (<>) {
>	    my ( $name, $total, $used ) =
>	      /^Users\s+of\s+(\S+):.*of\s+(\d+)\s+.*of\s+(\d+)\s*/
>	      or next;
>	    my $alias = $lut{$name} or next;
>	    $hash{$alias}[0] += $total;
>	    $hash{$alias}[1] += ( $total - $used );
>	}
>
>	# bundle licenses
>	for my $alias ( keys %bundle ) {
>	    my ( $total, $free ) = ( 0, 0 );
>	    for my $name ( @{ $bundle{$alias} } ) {
>		$total += $hash{$name}[0] || 0;
>		$free  += $hash{$name}[1] || 0;
>	    }
>	    $hash{$alias} = [$total, $free];
>	}
>
>	return %hash;
>    }
>
>    my %license = lmstat();
>
>    print "begin\n";
>    print "global:$_:$license{$_}[1]\n" for keys %license;
>    print "end\n";
>    exit 0;
>
>=head1 PROBLEMS WITH THE LOAD SENSOR APPROACH
>
>An obvious problem with the load sensor approach is the delay associated with
>the load reports, as mentioned in the online documentation
>(see
>L<http://gridengine.sunsource.net/project/gridengine/howto/resource_management.html>):
>
>  Unfortunately, due to the loadsensor's delay, it can't be 100% excluded
>  that batch jobs are dispatched and started although the license has been
>  aquired by an interactive job.
>
>The problem is actually B<much> more serious than suggested by this warning!
>A race condition between a GridEngine job and an interactive job is less
>problematic than what actually occurs.
>
>In the following examples, we'll examine how the licenses are managed with
>different approaches.  For the sake of clarity, two new pseudo-variables,
>C<internal_count> and C<available>, have been introduced to reflect the
>current internal GridEngine state. The other variables - C<complex_values>
>and C<load_values> - are retrieved via C<qconf -se global>.
>
>=head2 The Pure Load Sensor Approach
>
>Here the C<complex_values> are left as C<NONE>. The license availability is
>managed exclusively over the load sensor.  This combination has the
>interesting side-effect that the internal bookkeeping is not used.
>
>=over 5
>
>=item Start:
>
>all licenses are available
>
>  load_values      license=4
>  complex_values   NONE
>  (internal_count) NONE
>  (available)      license=4
>
>=item launch X jobs, each with C<-l license=4>:
>
>Since all nodes provide resource C<license=4> and there is no internal
>bookkeeping to track the consumption of the resource, all jobs attempt to
>start in the same scheduling interval. Only one job wins the race and the
>others fail with licensing problems.
>
>=back
>
>=head2 A Combined Internal and Load Sensor Approach
>
>Here the C<complex_values> are set to the number of licenses available. The
>GridEngine derives the availability from two numbers: C<complex_values>
>minus C<(internal_count)>, and the C<load_values>. The I<lower> of the two
>dictates the availability, as mentioned in the online documentation (see
>L<http://gridengine.sunsource.net/project/gridengine/howto/loadsensor.html>):
>
>  The lesser of the Consumable Resources or the load sensor
>  value will be used to prevent license oversubscription.
>
>=over 5
>
>=item Start:
>
>all licenses are available
>
>  load_values      license=4
>  complex_values   license=4
>  (internal_count) NONE
>  (available)      license=4
>
>=item launch two jobs, each with C<-l license=4>:
>
>Since C<complex_values> exist, the internal bookkeeping is used to track
>license availability and only one job is dispatched:
>
>  load_values      license=4
>  complex_values   license=4
>  (internal_count) license=4
>  (available)      license=0
>
>After some delay, the load sensor will catch up to the current status.
>
>  load_values      license=0
>  complex_values   license=4
>  (internal_count) license=4
>  (available)      license=0
>
>When the first job finishes, its licenses are released and the internal
>count drops back to zero.
>
>  load_values      license=0
>  complex_values   license=4
>  (internal_count) license=0
>  (available)      license=0
>
>After some delay, the load sensor will catch up to the current status and
>the second job can start.
>
>  load_values      license=4
>  complex_values   license=4
>  (internal_count) license=0
>  (available)      license=4
>
>=back
>
>Despite some delays associated with the load sensor, only a single job runs
>at a time and this approach I<seems> to be behaving as expected. However, the
>bookkeeping becomes less robust when non-GridEngine usage is tracked too!
>
>=over 5
>
>=item Start:
>
>all licenses are available
>
>  load_values      license=4
>  complex_values   license=4
>  (internal_count) NONE
>  (available)      license=4
>
>=item start a non-GridEngine job that occupies 2 licenses:
>
>After a delay, the load sensor reports that only two licenses are available.
>
>  load_values      license=2
>  complex_values   license=4
>  (internal_count) NONE
>  (available)      license=2
>
>=item launch two jobs via the GridEngine, each with C<-l license=2>:
>
>Since there are only 2 licenses available, and internal bookkeeping
>tracks the resource consumption, only one job is started at the first
>scheduling interval.  The internal count is incremented accordingly:
>
>  load_values      license=2
>  complex_values   license=4
>  (internal_count) license=2
>  (available)      license=2
>
>At the next scheduling interval, there are still 2 licenses available (the
>lower limit of the internal bookkeeping and the external load report) and
>the second job will be started. This job will fail with licensing problems.
>
>=back
>
>It is obvious from the above examples that these approaches B<cannot> work
>correctly with mixed license usage.
>
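The bookkeeping in the walk-throughs above boils down to a single min(). A
small sh sketch reproduces the numbers; the function name "available" is
invented for illustration, while the rule itself is the one quoted from the
loadsensor HowTo.

```sh
#!/bin/sh
# available COMPLEX INTERNAL LOAD
# Prints the lesser of (complex_values - internal_count) and load_values.
available() {
    remaining=$(( $1 - $2 ))
    if [ "$remaining" -lt "$3" ]; then
        echo "$remaining"
    else
        echo "$3"
    fi
}

# GridEngine-only usage: one job holds all 4 licenses
echo "GridEngine only: $(available 4 4 4)"

# mixed usage: 2 licenses held externally, 2 by a GridEngine job,
# and the load sensor has not yet noticed the GridEngine job
echo "mixed usage:     $(available 4 2 2)"
```

The second result is 2 although no license is actually free, which is why
the second job gets dispatched and then fails.
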
>=head1 A PROPOSED SOLUTION
>
>The only obvious solution to the problem is to change the load sensor so
>that it does not report I<any> values at all, but instead adjusts the
>C<complex_values> directly.
>
>=over 5
>
>=item Start:
>
>all licenses are available
>
>  load_values      NONE
>  complex_values   license=4
>  (internal_count) NONE
>  (available)      license=4
>
>=item start a non-GridEngine job that occupies 2 licenses:
>
>After a delay, the load sensor adjusts the number of
>licenses available for the GridEngine.
>
>  load_values      NONE
>  complex_values   license=2
>  (internal_count) NONE
>  (available)      license=2
>
>=back
>
>Apart from the delay inherent in the load sensor approach, there is no
>internal race condition and we've thus eliminated the significant failings
>of the previous approaches.  Small problems still exist, but at least the
>worst problems have been addressed. The practical aspects of implementing
>this solution are given below.
>
>=head2 Determine the Internal Count
>
>Although it is not currently possible to query the GridEngine directly about
>its internal count, we can parse the C<qstat> output to determine which
>requests were made. The corresponding perl program is relatively compact:
>
>    #!/usr/bin/perl -w
>    use strict;
>
>    sub qstat {
>	my $lines = qx{qstat -r -s rs -xml};
>	my %hash;
>
>	for ( grep { defined } split m{</job_list>}, $lines ) {
>	    my ($slots) = m{<slots>(\d+)</slots>} or last;
>	    while (
>		s{<(\S*hard_request).*?\s+name=\"(\S+)\".*?>(\d[\.\d]*)</\1>}{})
>	    {
>		$hash{$2} += ( $3 * $slots );
>	    }
>	}
>
>	return %hash;
>    }
>
>    my %internal = qstat();
>    print "internal_count\t",
>      join( ',' => map { "$_=$internal{$_}" } sort keys %internal ), "\n";
>
>    exit 0;
>
>=head2 Determine the Complex Values
>
>Determining the current C<complex_values> via C<qconf -se global> is easy
>enough that a simple shell script suffices:
>
>  #!/bin/sh
>  SGE_SINGLE_LINE=1 qconf -se global | sed -ne 's/^complex_values *//p'
>
>However, a perl program is preferable for integration:
>
>    #!/usr/bin/perl -w
>    use strict;
>
>    sub qconf_se {
>	$ENV{SGE_SINGLE_LINE} = 1;    # no backslash continuations
>
>	return map {
>	    s/,/ /g;
>	    map { /^(.+)=(.+)\s*$/ } split;
>	} grep { s/^complex_values\s+// } qx{qconf -se global};
>    }
>
>    my %complex_values = qconf_se();
>    print "complex_values\t",
>      join( ',' => map { "$_=$complex_values{$_}" } sort keys %complex_values ),
>      "\n";
>
>    exit 0;
>
>=head2 Determine the Available Complex Values
>
>With these building blocks in place, we can now determine how to adjust the
>C<complex_values> to reflect the number of licenses available for the
>GridEngine to administer.
>
>Only C<complex_values> that also exist in the license server query are
>eligible to be adjusted.
>
>    #!/usr/bin/perl -w
>    use strict;
>
>    # ##############################
>    # re-use code from previous examples
>    # ##############################
>
>    my %license        = lmstat();
>    my %internal       = qstat();
>    my %complex_values = qconf_se();
>
>    # determine what changes may be required
>    sub changes_required {
>	return join ',' => sort map {
>	    my ( $total, $free ) = @{ $license{$_} };
>	    $free += $internal{$_} || 0;
>	    $free <= $total or $free = $total;
>	    $free != $complex_values{$_} ? "$_=$free" : ();
>	  }
>	  grep { $license{$_} } keys %complex_values;
>    }
>
>    my $changes = changes_required();
>    system "qconf -mattr exechost complex_values $changes global" if $changes;
>    exit 0;
>
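To make the arithmetic in changes_required() concrete, here is a worked
example in sh. The function name managed_licenses is invented for this
note; the rule (free licenses reported by the server plus GridEngine's own
consumption, capped at the total) is the one in the perl code above.

```sh
#!/bin/sh
# managed_licenses TOTAL FREE INTERNAL
# Prints the value complex_values should be set to.
managed_licenses() {
    value=$(( $2 + $3 ))
    if [ "$value" -gt "$1" ]; then
        value=$1
    fi
    echo "$value"
}

# 4 licenses in total: 2 held by a non-GridEngine job, 2 by a GridEngine job
# -> the license server reports 0 free, GridEngine counts 2 internally
echo "complex_values license=$(managed_licenses 4 0 2)"
```

With complex_values at 2 and an internal count of 2, GridEngine sees nothing
left to hand out, so the second job from the earlier example stays queued.
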
>=head2 Alleviate Race Condition
>
>While the solution presented addresses most of the significant problems
>associated with mixed GridEngine and non-GridEngine usage, a few (hopefully
>minor) race conditions remain:
>
>=over 5
>
>=item 1
>
>A race condition can occur when the GridEngine job is slower to occupy the
>licenses than a non-GridEngine job that starts afterwards. The GridEngine
>job could, for example, first decompose the geometry and compile/link user
>subroutines before actually starting the calculation and occupying licenses.
>
>There is no simple method of preventing the non-GridEngine job from grabbing
>the licenses.  The only possibility is for the GridEngine job script to
>catch the license failure return code and then exit with C<99> to trigger
>rescheduling. Determining the exit code for the various software simulation
>packages is left as an exercise for the reader.
>
>=item 2
>
>A delay exists between when a non-GridEngine job starts and its existence is
>registered via the C<qconf -mattr> procedure outlined in the previous
>section(s).  During this time, a race condition exists if a GridEngine job
>is slated to start.
>
>Double-checking the license availability within a prolog script can help
>here. Using the exit code C<99> will signal the GridEngine to reschedule the
>job for the next interval. This extra safety is no longer needed after the
>next load report interval, at which point the C<complex_values> will have
>been updated to reflect the non-GridEngine usage.
>
>The example prolog script:
>
>    #!/bin/sh
>    # prolog
>
>    # <settings>
>    : ${SGE_ROOT:=/opt/n1ge6}
>    : ${SGE_CELL:=default}
>    for i in $SGE_ROOT/$SGE_CELL/site/environ; do [ -f $i ] && . $i; done
>    # </settings>
>
>    # the (hard) requested resources
>    rclist=`qstat -r -j $JOB_ID | sed -ne 's/^.*hard *resource_list: *//p'`
>
>    # <resource_check>
>    # verify that the expected resources actually exist
>    # this should prevent the race condition that occurs between SGE jobs
>    # before the load report (available licenses) gets updated
>    #
>    query="$SGE_ROOT/$SGE_CELL/site/qlicserver"
>
>    if [ -n "$rclist" -a -x "$query" ]; then
>       echo "query resources   $rclist,slots=$NSLOTS"
>       available=`$query $rclist,slots=$NSLOTS`
>
>       exitcode=$?
>       if [ $exitcode -eq 99 ]; then
>          echo "re-queue job      $available"
>          echo "-------------------------"
>          exit 99
>       fi
>
>       if [ $exitcode -ne 0 ]; then
>          echo "error with license query $exitcode"
>          exit $exitcode
>       fi
>    fi
>    # </resource_check>
>
>where the corresponding perl program would resemble the following:
>
>    #!/usr/bin/perl -w
>    use strict;
>
>    # ##############################
>    # re-use code from previous examples
>    # ##############################
>
>    my %license = lmstat();
>
>    sub get_requirements {
>	my ($slots) = map { /(?:^|,)slots=(\d+)(?:,|$)/ } @_;
>	$slots ||= 1;
>
>	my @requirement =
>	  map {
>	    my ( $rc, $limit ) = split /=/;
>	    [$rc => ( $limit * $slots )];
>	  }
>	  sort
>	  grep { /^([^=]+)=(\d[\.\d]*)$/ and $license{$1} }
>	  map { s/,+/ /g; split } @_;
>    }
>
>    my @required = get_requirements @ARGV;
>
>    my $failed;
>    for (@required) {
>	my ( $rc, $limit ) = @$_;
>	my $free = $license{$rc}[1] || 0;
>	if ( $limit > $free ) {
>	    $limit = $free;
>	    $failed++;
>	}
>	$_ = "$rc=$limit";
>    }
>
>    print "have ", join( "," => @required ), "\n";
>
>    exit( $failed ? 99 : 0 );
>
>B<NB:> For this example to work, the execution hosts must also be registered
>as submission hosts - otherwise C<qstat> does not work.
>
>This does not really prevent a race condition, but can at least signal the
>GridEngine job if it has already lost the race!
>
>=back
>
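For item 1, the job script side could look roughly like the sh sketch
below. The function name run_or_requeue, the failure code 3 and the stand-in
solver command are all assumptions made for this note; the real license
failure code depends on the simulation package. Only the exit status 99 for
rescheduling comes from the text above.

```sh
#!/bin/sh
# run_or_requeue LICENSE_FAILURE_CODE COMMAND [ARGS...]
# Runs the command; if it ends with the given license failure code,
# returns 99 so that the job script can ask to be rescheduled.
run_or_requeue() {
    license_failure=$1
    shift
    "$@"
    status=$?
    if [ "$status" -eq "$license_failure" ]; then
        echo "license checkout failed, requesting reschedule" >&2
        return 99
    fi
    return "$status"
}

# Stand-in for a solver that fails with the (assumed) license code 3.
run_or_requeue 3 sh -c 'exit 3' || echo "exit code: $?"
```

In a real job script the last line would be the solver call followed by
"exit $?", so that the 99 reaches the GridEngine.
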
>=head1 CLOSURE
>
>Since the presented solution only uses the load sensor to invoke
>C<qconf -mattr>, but not to return any values, it could also
>be replaced by an independent daemon:
>
>    #!/usr/bin/perl -w
>    use strict;
>    use POSIX;
>
>    my $delay = 60;
>    ( my $Script = $0 ) =~ s{^.*/}{};
>
>    sub kill_daemon {
>	my @list =
>	  grep { $_ != $$ }
>	  map  { /^\s*(\d+)\s*$/ } qx{ps -C $Script -o pid= 2>/dev/null};
>	kill 9 => @list if @list;
>    }
>
>    my $daemon = $delay;
>
>    if ($daemon) {    # daemonize
>	kill_daemon();
>
>	my $pid = fork;
>	exit if $pid;    # let parent exit
>	defined $pid    or die "Couldn't fork: $!";
>	POSIX::setsid() or die "Can't start a new session: $!";
>
>	# Trap fatal signals, exit gracefully
>	$SIG{INT} = $SIG{TERM} = $SIG{HUP} = sub { undef $daemon };
>	$SIG{PIPE} = "IGNORE";
>    }
>
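>    # declared at file scope, ahead of the re-used subroutines, so that
>    # changes_required() sees the values refreshed in each iteration
>    my ( %license, %internal, %complex_values );
>
>    # ##############################
>    # re-use code from previous examples
>    # ##############################
>
>    do {
>	%license        = lmstat();
>	%internal       = qstat();
>	%complex_values = qconf_se();
>	my $changes = changes_required();
>
>	system "qconf -mattr exechost complex_values $changes global"
>	  if $changes;
>	sleep $delay;
>    } while $daemon;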
>
>The choice between a daemon or a load sensor is largely a matter of personal
>preference. In either case, it is however necessary that the program monitor
>the contents of C<$SGE_ROOT/$SGE_CELL/common/act_qmaster> to react to
>changes in the qmaster.
>
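One way to watch act_qmaster, sketched in sh. The helper names read_qmaster
and qmaster_changed are invented here, and the demo uses a temporary file
rather than the real C<$SGE_ROOT/$SGE_CELL/common/act_qmaster>; what the
program should do once a change is detected is left open, as in the text.

```sh
#!/bin/sh
# Print the qmaster host recorded in the given act_qmaster file.
read_qmaster() {
    head -n 1 "$1" 2>/dev/null
}

# Succeed when the recorded qmaster differs from the one seen before.
# usage: qmaster_changed FILE PREVIOUS_HOST
qmaster_changed() {
    [ "$(read_qmaster "$1")" != "$2" ]
}

# Demo with a temporary file standing in for act_qmaster.
demo=$(mktemp)
echo "master1" > "$demo"
seen=$(read_qmaster "$demo")
echo "master2" > "$demo"      # simulate a qmaster failover
if qmaster_changed "$demo" "$seen"; then
    echo "qmaster changed from $seen to $(read_qmaster "$demo")"
fi
rm -f "$demo"
```

In the daemon or load sensor, the check would sit at the top of each polling
cycle, with the last seen host kept in a variable between cycles.
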
>=head2 Remaining Issues
>
>
>This text is not yet finished ...
>
>Mark Olesen (2005-02-03)
>
>=cut
>  
>
>------------------------------------------------------------------------



