[GE users] help with loadsensor complexes

hawson beckerjes at mail.nih.gov
Mon Nov 15 15:39:03 GMT 2010


On Fri, Nov 12, 2010 at 07:54:43AM -0500, sgenedharvey wrote:
>I want to prevent jobs from running on machines, if there isn't enough disk
>space in a particular directory.  I wrote a simple loadsensor script, which
>works fine.  Each machine now has a property, hl:scratchfree=whatever,
>which indicates the amount of free disk space.
>
>Problem is, I can't seem to figure out how to use it.  I created the
>complex:
>scratchfree         scratchfree       INT         <=    YES         YES        0        0
>
>I tried setting various settings, >=, Yes, No, some number for default, set
>priority to 1000.  I tried requesting the resource at the qsub prompt, but I
>can't seem to figure out the right way to use the information to prevent job
>distribution to machines without enough disk space.
>
>scratchfree is configured as a "reporting variable" in each host.
>The loadsensor is set on global, and it is running correctly for each
>machine.
>
>For example:
>qconf -se dell0307s-02 | grep report
>report_variables      scratchfree
>
>qconf -sconf global | grep sens
>load_sensor                  /path/to/scratch_loadsensor
>
>I think the problem is the fact that the scratchfree is a reporting
>variable, instead of a consumable, or a resource limit.  Should I make it a
>load_scaling?  Or a complex_value, or something else?

I've run into this problem, and specifically with /scratch disk space.
The good news is that there is a solution; the bad news is that it is
slightly complicated to describe (and further, my explanation may not be
correct).  So take this with a shaker or two of salt:

A very poorly documented "quirk" is that a consumable must have some
sort of "starting" value against which it can compare.  Thus, if you want
to use scratchfree as a consumable, you must give each exec host a
"starting" value, and the resource must be attached to that
host.  Assuming you have the "relop" attribute set correctly, SGE will
use the *LOWER* of the two values:  either the "starting" value, or the
one reported by the load_sensor script.
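For reference, the load sensor side of this can be quite small.  Here's
a minimal sketch (the /scratch path, the SCRATCH_DIR override, and the
'freescratchspace' complex name are my assumptions -- adjust them for
your site).  It follows the usual sensor protocol: print one
begin/end-delimited report each time the daemon sends a newline, and
stop on "quit":

```shell
#!/bin/sh
# Sketch of a minimal load sensor; paths and names are assumptions.

SCRATCH_DIR=${SCRATCH_DIR:-/scratch}

# Emit one report in the begin/end-delimited format qmaster expects:
#   begin
#   <host>:<complex>:<value>
#   end
report_load() {
    host=`hostname`
    # Free megabytes on $SCRATCH_DIR, from df's POSIX "Available" column.
    free=`df -Pk "$SCRATCH_DIR" | awk 'NR==2 {print int($4/1024)}'`
    echo begin
    echo "${host}:freescratchspace:${free}M"
    echo end
}

# sge_execd sends a newline each time it wants a report, and "quit"
# when shutting the sensor down.
while read input; do
    case "$input" in quit) break ;; esac
    report_load
done
```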


As I said, I track scratch space at the host level as well, and here
are the bits of configuration that I have that apply:

First, the definition of the resource (I used a MEMORY type, not INT,
but I don't know if that matters or not):

   $ qconf -sc|grep scratch
   #name                shortcut        type        relop requestable consumable default  urgency
   #----------------------------------------------------------------------------------------------
   freescratchspace     scratch         MEMORY      <=    YES         YES        0        0


Next, I "attach" the resource to *each* exec host, giving it a
hard-coded "starting" value.  I use the maximum amount of disk space
possibly available to users on /scratch, as reported by 'df'.  This can
be a *real* pain the first time you set it up, and I suggest spending
some quality time scripting things with qconf.
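For what it's worth, the scripting can be as simple as a loop over
qconf.  This is only a sketch -- the 409645M figure is a placeholder,
and in real life you'd compute the per-host number from 'df' on each
machine:

```shell
#!/bin/sh
# Attach a "starting" freescratchspace value to every exec host.
# 409645M is a placeholder; compute the real per-host figure from 'df'.
for host in `qconf -sel`; do
    qconf -aattr exechost complex_values freescratchspace=409645M "$host"
done
```

(-aattr appends to the host's existing complex_values list; if the
entry is already there, modify it with -mattr instead.)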

   $ qconf -se z012  | egrep '(values|report_vari)'
   complex_values    scratchsync=TRUE,mem_free=127G,freeshm=64423M,freescratchspace=409645M
   load_values       arch=lx24-amd64,num_proc=8,mem_total=128845.203125M,swap_total=32765.375000M,virtual_total=161610.578125M
   report_variables  NONE

Note that I have 'freescratchspace' set for this host, and that I don't
have it set in the load_values, nor in the report_variables.
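Once the resource is defined, attached to each host, and tracked by the
sensor, jobs just request it at submission time (myjob.sh is a
placeholder).  With relop '<=', a job asking for 20G will only be
dispatched to hosts where the lower of the starting value and the
sensor-reported value is at least 20G:

```shell
# Request 20 GB of free scratch space, using either the full complex
# name or the 'scratch' shortcut from the definition above:
qsub -l freescratchspace=20G myjob.sh
qsub -l scratch=20G myjob.sh
```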





-- 
Jesse Becker
NHGRI Linux support (Digicon Contractor)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=295890
