[GE users] job submission verifier

Reuti reuti at staff.uni-marburg.de
Tue Sep 23 13:20:10 BST 2008


Hi,

On 23.09.2008 at 13:45, Ernst Bablick wrote:

> in the past some users expressed their need for some kind of  
> presubmission procedure which is executed whenever a job enters the  
> GE system (see also issue #2621).
>
> Find attached a draft for a corresponding GE enhancement. Please  
> give feedback by Friday.
>
> Regards,
>
> Ernst
>       Functional Specification: Job Submission Verifier
>       =================================================
>
>       Version  Comment                                Date      Author
>       -------  -------------------------------------  --------  -------------
>       0.1      Initial version                        ?         Andreas Haas
>       0.5      Describe changes so that enhancement   17-09-08  Ernst Bablick
>                can be implemented for Urubu with
>                less performance loss
>       0.6      added missing parts according to       22-09-08  Ernst Bablick
>                discussion with RD and AS
>
> 1     INTRODUCTION
>       ============
>
>       In the past some of our users expressed their need for some kind of
>       presubmission procedure which is executed whenever a job enters the GE
>       system (see also issue #2621). Here are some examples of what should be
>       done in such a procedure:
>
>       -  Check accounting DB to make sure the user has enough wall clock
>          hours in their account to run the requested job on the requested
>          slots for the requested time.
>
>       -  Guarantee that the number of slots requested is a multiple of 16 for
>          parallel jobs.
>
>       -  Verify that the user can write to various shared filesystems.
>
>       -  Make sure that the user does not request certain -l resources that
>          might not behave the way the user expects them to (h_vmem, h_data,
>          etc).
>
>       -  Add required resource requests that users don't know are mandatory.
>
>       -  Add a project request of the form -P queue_name where queue_name is
>          the queue used with the -q option.
>
>       -  Make sure that the user hasn't messed up their ssh keys so badly
>          that they cannot ssh into compute nodes w/o a passphrase.
>
>       -  Print out status messages and errors about the above as well as
>          printing out the queue, allocation account name, PE,
>          total number of tasks requested, and number of tasks per node
>          requested.
>
>       -  Print out an motd-like message at the top of qsub output
>
>> qsub job.sge
>          Welcome
>          -------
>          Please note that we strongly advise using the mvapich-devel MPI
>          stack for running jobs with more than 2048 MPI tasks.
>          ---------------------------------------------------------------
>          --> Submitting 16 tasks...
>          --> Submitting 16 tasks/host...
>          --> Submitting exclusive job to 1 hosts...
>          ...
>
>
> 2     PROJECT OVERVIEW
>       ================
>
> 2.1   Project Aim
>
>       The aim of the project is to provide an interface enhancement for GE
>       that makes it possible to define job verification/modification
>       routines which will be executed on the client side, within the qmaster
>       process, or both, when a job enters the system.
>
> 2.2   Project Benefit
>
>       The administrator of a GE cluster can define additional policies needed.
>
>       The GE cluster will not be loaded with jobs which would break a defined
>       policy if a job verification/modification routine is defined.
>
> 2.3   Project Duration
>
> 2.4   Project Dependencies
>
>       There are no known dependencies with other projects
>
>
> 3     SYSTEM ARCHITECTURE
>       ===================
>
> 3.1   Enhancement Functions
>
>       Here is the summary of the customer needs:
>
>       (N1)  The administrator gets the possibility to define a job verification
>             procedure which will be executed in qsub, qrsh, qsh, qlogin, qmon
>             and applications using DRMAA, to evaluate a job before it is sent
>             to qmaster.
>
>       (N2)  The administrator gets the possibility to define a job verification
>             procedure which will be executed on qmaster side before a job
>             is finally added to the qmaster data store or before the
>             modification of a job is finally accepted.
>
>       (N3)  It will be possible to define under which user account the
>             verification procedure within the master is executed. By default
>             the script is executed as sgeadmin user. Within the client context
>             the script is executed as submit user.
>
>       (N4)  Data defining the job will be provided to the verification
>             procedure.
>
>       (N5)  After evaluating a job the verification result might either be:
>                *  accept job
>                *  correct parameters that are part of the job specification
>                *  reject job
>                *  temporarily reject job (it might be accepted later)
>
>       (N6)  Nearly all parameters which define a job can be changed by the
>             verification procedure but there are some exceptions. The
>             following things are only available as read-only parameters:
>                * type (qsub job => qlogin ...)
>                * script file to be executed
>                * arguments passed to the job
>                * user who submitted the job
>             The job script content itself is not available in the job
>             submission verification script.
>
>
>       (N7)  As a minimum requirement, at least the following parameters have
>             to be changeable by the job verification procedure in a first
>             implementation:
>                * pe request
>                * resource requests (hard and soft)
>                * queue and host requests
>                * project request
>
>       Implementation notes and necessary steps:
>
>       (I1)  (N1) and (N2) will be realized as scripts. The script language can
>             be chosen by the administrator.
>
>       (I2)  The script has to be written in a way so that it can be executed
>             like a loadsensor script. It has to accept commands and
>             parameters from stdin and return results via stdout.
>             It should not terminate until it gets a corresponding command.
>
>       (I3)  A file named "client_jsv" and located in $SGE_ROOT/$SGE_CELL/common
>             will be started by the clients qsub, qrsh, qsh, qlogin and qmon and
>             the DRMAA library (N1) before a new job is sent to qmaster. This
>             script will be started under the user account of the user who
>             tries to start a new job.
>
>       (I4)  The script to be evaluated in qmaster (N2) has to be configured
>             in the cluster configuration. The parameter will be named
>             "server_jsv" and, similar to "prolog" and "epilog", it will
>             allow specifying under which user privileges this procedure will
>             be started. (N3)
>
>       (I5)  One instance of server_jsv will be started during startup of
>             qmaster for each worker thread or whenever the cluster
>             configuration parameter changes or whenever the timestamp of the
>             script file changes.
>
>       (I6)  The server side instances of the verification scripts are connected
>             to the worker threads via pipes. Parameters and commands will
>             be sent to the scripts and the response is read from the script
>             output.
>
>       (I7)  After the script has been started it has to be responsive and
>             execute the following commands. Please note that each command
>             might print ERROR=<message> to stdout to indicate an error.
>
>             command  action
>             -------  ---------------------------------------------------------
>             START    Trashes cached data and starts a verification for a
>                      new job.
>
>                      Prints STARTED to stdout
>
>                      After that the script accepts only a BEGIN or one or
>                      multiple PARAM_<name>=<value> commands
>
>             BEGIN    This command triggers the verification of provided
>                      parameters set by PARAM_<name>=<value>
>
>                      Prints RESULT=<result> and optionally
>                      RESULT_MSG=<message> or RESULT_MSG_LOG=<message>
>
>                      <result> might be:
>                         ACCEPT
>                            job is accepted without changes
>                         CORRECT
>                            job is accepted but all PARAM_<name>... which have
>                            been sent between the initial BEGIN and the final
>                            RESULT have to be evaluated and applied to the job
>                            before it is accepted.
>                         REJECT
>                            job is rejected
>                         REJECT_WAIT
>                            job is rejected but might be accepted later
>
>                      <message> is a user readable message
>                         which will be sent to the client to be printed as
>                         GDI answer (RESULT_MSG) or it will be printed to
>                         stdout of the client command (RESULT_MSG_LOG on
>                         client side) or it will be printed to the master
>                         messages file (RESULT_MSG_LOG on master side)
>
>             PARAM_<name>=<value>    <name> and <value> are parameter names
>                      and corresponding values as documented in submit(1) e.g.
>
>                      <name>      <value>
>                      ----------- ---------------------
>                      a           <date_time>
>                      ac          <variable>[=<value>],...
>                      b           "y" | "n"
>                      ...
>
>                      additionally following names are supported
>
>                      CLIENT      "qsub" | "qsh" | "qlogin" | "qmon" | "qalter"
>                      CONTEXT     "client" | "server"
>                                  explains if the script is executed in a client
>                                  (N1) or in the master (N2)
>                      JOB_ID      <job_id>
>                                  (only available on server side)
>                      SCRIPT      <path_of_job_script>
>                      SCRIPT_ARGS <arguments_for_job_script>
>                      USER        <submit_user_name>
>
>             QUIT     Terminates the job submission verification script
>
>             Example: Find below the data which is sent to the job submission
>                      verification script when the following job is submitted:
>
>> qsub -pe pe1 3 -hard -l lic=1 -soft -l q=all.q troete.sh
>
>                     Please note that parameters that are not explicitly
>                     requested by the submitter of a job are not passed
>                     to the script. This means that e.g. "-b n" of qsub won't be
>                     passed to the script because this is the default
>                     when nothing else is specified.
>
>                 Input                Output
>
>             01) "START\n"
>             02)                      "STARTED\n"
>             03) "PARAM_CLIENT=qsub\n"
>             04) "PARAM_USER=ernst\n"
>             05) "PARAM_pe=pe1 3\n"
>             06) "PARAM_hard=\n"
>             07) "PARAM_l=lic=1\n"
>             08) "PARAM_soft=\n"
>             09) "PARAM_l=q=all.q\n"
>             10) "PARAM_SCRIPT=troete.sh\n"
>             11) "BEGIN\n"
>             12)                      "PARAM_pe=pe1 4\n"
>             13)                      "RESULT_MSG=no multiple of 4\n"
>             14)                      "RESULT=CORRECT\n"
>
>             15) "START\n"
>             16)                      "STARTED\n"
>             17) ...
>
>             99) "QUIT\n"

Looks feasible. Questions:

- are all options from "sge_request" already included here?

- will -soft and -hard be grouped (maybe they should be mentioned per  
parameter for easier parsing)?

- how are multiple resource requests encoded? I mean "-l type1=5,type2=8"

will it be "PARAM_type1=5\n" plus "PARAM_type2=8\n", or just one
statement?
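
If the whole request arrives in one statement, the script has to split it itself. A rough sketch of that in plain sh; the function name split_requests is made up for illustration, and it assumes that a single request never contains a comma:

```sh
# Split one "-l" value such as "type1=5,type2=8" into name/value pairs.
# Hypothetical helper, not part of the draft.
# Assumption: an individual request never contains a literal comma.
split_requests() {
    old_ifs=$IFS
    IFS=,                        # split the unquoted value on commas only
    set -f                       # no filename globbing while unquoted
    for req in $1; do
        # name = text before the first "=", value = text after it
        printf '%s %s\n' "${req%%=*}" "${req#*=}"
    done
    set +f
    IFS=$old_ifs
}
```

Called as split_requests 'type1=5,type2=8', this prints one "name value" pair per line.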

Somehow the PARAM_<name>=<value> form means implementing a parser in the
script to look for "=" and strip off the "PARAM_". Maybe it would be easier
to send these items as a line like:

"PARAM" "CLIENT" "qsub"\n

Then the script could simply use (note the use of ' and " for
demonstration purposes):

$ line='"PARAM" "CLIENT" "qsub"'
$ eval set $line
$ echo $1
PARAM
$ echo $2
CLIENT
$ echo $3
qsub

even this works:

$ line='"PARAM" "l" "type" "with some blanks"'
$ eval set $line
$ echo $4
with some blanks
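
For comparison, here is a rough sketch of what the script-side parser for the draft's PARAM_<name>=<value> form could look like in plain sh. The function names jsv_loop and round_up_4 are my own, the slot policy (round the pe slots up to a multiple of 4) is only borrowed from the example trace in the draft, and it assumes the pe request carries a single integer slot count rather than a range:

```sh
# Round a slot count up to the next multiple of 4 (policy taken from the
# draft's example trace; the helper itself is hypothetical).
round_up_4() {
    echo $(( ($1 + 3) / 4 * 4 ))
}

# Read draft-protocol commands from stdin and answer on stdout.
# Assumption: only the pe request is inspected; other parameters are ignored.
jsv_loop() {
    pe_name= pe_slots=
    while IFS= read -r line; do
        case $line in
            START)
                pe_name= pe_slots=      # trash cached data
                echo STARTED ;;
            PARAM_*=*)
                rest=${line#PARAM_}     # strip the "PARAM_" prefix
                name=${rest%%=*}        # text before the first "="
                value=${rest#*=}        # text after the first "="
                if [ "$name" = pe ]; then
                    pe_name=${value% *}     # e.g. "pe1"
                    pe_slots=${value##* }   # e.g. "3"
                fi ;;
            BEGIN)
                if [ -n "$pe_slots" ] && [ $((pe_slots % 4)) -ne 0 ]; then
                    echo "PARAM_pe=$pe_name $(round_up_4 "$pe_slots")"
                    echo "RESULT_MSG=no multiple of 4"
                    echo "RESULT=CORRECT"
                else
                    echo "RESULT=ACCEPT"
                fi ;;
            QUIT)
                return 0 ;;
        esac
    done
}
```

A real script would end with a call to jsv_loop so that it keeps reading from stdin until QUIT arrives. Splitting on the first "=" only keeps values such as "lic=1" intact.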


-- Reuti
