[GE users] job submission verifier

Andy Schwierskott andy.schwierskott at sun.com
Tue Sep 23 14:39:54 BST 2008


Ernst,

the way we pass the job details to the script will probably go through
several iterations over time. Perhaps we should include a version string
(easy to parse and compare as digits, e.g. "100" for version 1.0), especially
since in the future we might support a mix of older and newer clients.
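
To illustrate why a digits-only version string is convenient, here is a
minimal sketch in Python. The names parse_version, is_supported and
NEWEST_SUPPORTED are made up for this example, and nothing about where the
version would appear in the protocol is decided yet.

```python
# Hypothetical: the newest protocol version this script understands ("100" = 1.0).
NEWEST_SUPPORTED = 100


def parse_version(token: str) -> int:
    """Turn a digits-only version string such as "100" into an integer."""
    if not (token.isascii() and token.isdigit()):
        raise ValueError(f"not a numeric version string: {token!r}")
    return int(token)


def is_supported(token: str, newest: int = NEWEST_SUPPORTED) -> bool:
    """A plain integer comparison is all that is needed to compare versions."""
    return parse_version(token) <= newest
```

With this, a script could decline a newer format it does not understand
instead of misparsing it.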

Regarding the cluster config parameter "server_jsv", I'd suggest supporting
the syntax:

   type[:[user@]/path]

where for now we'd support only

  script:[user@]/path/to/script

In the future there might be e.g.

  builtin   (if the verification is done in qmaster)
  shared_lib:/path/to/shared_lib.so
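
A rough sketch of how a value in the suggested type[:[user@]/path] syntax
could be split up, written in Python purely for illustration. JsvSpec and
parse_jsv_spec are hypothetical names, and requiring an absolute path is an
assumption read off the leading "/" in the syntax above.

```python
from typing import NamedTuple, Optional


class JsvSpec(NamedTuple):
    kind: str            # e.g. "script", "builtin", "shared_lib"
    user: Optional[str]  # optional account the verifier should run as
    path: Optional[str]  # absolute path, absent for types such as "builtin"


def parse_jsv_spec(value: str) -> JsvSpec:
    """Split 'type[:[user@]/path]' into its parts (hypothetical helper)."""
    kind, sep, rest = value.strip().partition(":")
    if not kind:
        raise ValueError("missing type")
    if not sep:
        return JsvSpec(kind, None, None)      # bare type, e.g. "builtin"
    user = None
    if not rest.startswith("/"):
        user, at, rest = rest.partition("@")  # text before "@" is the user
        if not at or not user:
            raise ValueError(f"expected [user@]/path after ':' in {value!r}")
    if not rest.startswith("/"):
        raise ValueError(f"path must be absolute in {value!r}")
    return JsvSpec(kind, user, rest)
```

For example, parse_jsv_spec("script:ernst@/path/to/script") yields the type
"script", the user "ernst" and the path "/path/to/script".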

I was thinking about the client-side "client_jsv" script: if it were
requested through the "sge_request" file by means of a new submission
argument, we would save one stat() system call on NFS. E.g. it could be
called "-jsv":

   -jsv /path/to/script

For this first version only the system-wide sge_request file could support
this. In the longer term, however, a user could supply their own JSV script,
which could be executed in addition to the system-wide script.
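
As a sketch only: picking a "-jsv" argument out of the text of a request file
could look like the Python below. It assumes the file holds qsub-style options
with "#" starting a comment, and that a later "-jsv" overrides an earlier one.
find_jsv_option is a hypothetical helper, and "-jsv" itself is just the
proposal above.

```python
import shlex
from typing import Optional


def find_jsv_option(request_text: str) -> Optional[str]:
    """Return the path given to the last "-jsv" option, or None if absent."""
    path = None
    for line in request_text.splitlines():
        # comments=True makes shlex drop everything from "#" onwards
        tokens = shlex.split(line, comments=True)
        for i, token in enumerate(tokens[:-1]):
            if token == "-jsv":
                path = tokens[i + 1]
    return path
```

If no "-jsv" is present, the client knows without touching the file system
that there is no script to run.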

Andy

> Hi Reuti,
>
> please find my comments inline...
>
> Reuti wrote:
> > Hi,
> >
> > On 23.09.2008 at 13:45, Ernst Bablick wrote:
> >
> > > in the past some users expressed their need for some kind of presubmission
> > > procedure which is executed whenever a job enters the GE system (see also
> > > issue #2621).
> > >
> > > Find attached a draft for a corresponding GE enhancement. Please give
> > > feedback by Friday.
> > >
> > > Regards,
> > >
> > > Ernst
> > >       Functional Specification: Job Submission Verifier
> > >       =================================================
> > >
> > >       Version  Comment                                Date      Author
> > >       -------  -------------------------------------  --------  -------------
> > >       0.1      Initial version                        ?         Andreas Haas
> > >       0.5      Describe changes so that enhancement   17-09-08  Ernst Bablick
> > >                can be implemented for Urubu with
> > >                less performance loss
> > >       0.6      added missing parts according to       22-09-08  Ernst Bablick
> > >                discussion with RD and AS
> > >
> > > 1     INTRODUCTION
> > >       ============
> > >
> > >       In the past some of our users expressed their need for some kind of
> > >       presubmission procedure which is executed whenever a job enters the GE
> > >       system (see also issue #2621). Here are some examples of what should be
> > >       done in such a procedure:
> > >
> > >       -  Check accounting DB to make sure the user has enough wall clock
> > >          hours in their account to run the requested job on the requested
> > >          slots for the requested time.
> > >
> > >       -  Guarantee that the number of slots requested is a multiple of 16 for
> > >          parallel jobs.
> > >
> > >       -  Verify that the user can write to various shared filesystems.
> > >
> > >       -  Make sure that the user does not request certain -l resources that
> > >          might not behave the way the user expects them to (h_vmem, h_data,
> > >          etc).
> > >
> > >       -  Add required resource requests that users don't know are mandatory.
> > >
> > >       -  Add a project request of the form -P queue_name where queue_name is
> > >          the queue used with the -q option.
> > >
> > >       -  Make sure that the user hasn't messed up their ssh keys so badly
> > >          that they cannot ssh into compute nodes w/o a passphrase.
> > >
> > >       -  Print out status messages and errors about the above as well as
> > >          printing out the queue, allocation account name, PE,
> > >          total number of tasks requested, and number of tasks per node
> > >          requested.
> > >
> > >       -  Print out an motd-like message at the top of qsub output
> > >
> > >          > qsub job.sge
> > >          Welcome
> > >          -------
> > >          Please note that we strongly advise using the mvapich-devel MPI
> > >          stack for running jobs with more than 2048 MPI tasks.
> > >          ---------------------------------------------------------------
> > >          --> Submitting 16 tasks...
> > >          --> Submitting 16 tasks/host...
> > >          --> Submitting exclusive job to 1 hosts...
> > >          ...
> > >
> > >
> > > 2     PROJECT OVERVIEW
> > >       ================
> > >
> > > 2.1   Project Aim
> > >
> > >       The aim of the project is to provide an interface enhancement for GE
> > >       that allows job verification/modification routines to be defined.
> > >       These will be executed on the client side, within the qmaster process,
> > >       or both, when a job enters the system.
> > >
> > > 2.2   Project Benefit
> > >
> > >       The administrator of a GE cluster can define additional policies needed.
> > >
> > >       The GE cluster will not be loaded with jobs which would break a defined
> > >       policy if a job verification/modification routine is defined.
> > >
> > > 2.3   Project Duration
> > >
> > > 2.4   Project Dependencies
> > >
> > >       There are no known dependencies with other projects
> > >
> > >
> > > 3     SYSTEM ARCHITECTURE
> > >       ===================
> > >
> > > 3.1   Enhancement Functions
> > >
> > >       Here is the summary of the customer needs:
> > >
> > >       (N1)  The administrator gets the possibility to define a job verification
> > >             procedure which will be executed in qsub, qrsh, qsh, qlogin, qmon
> > >             and applications using DRMAA, to evaluate a job before it is sent
> > >             to qmaster.
> > >
> > >       (N2)  The administrator gets the possibility to define a job verification
> > >             procedure which will be executed on qmaster side before a job
> > >             is finally added to the qmaster data store or before the
> > >             modification of a job is finally accepted.
> > >
> > >       (N3)  It will be possible to define under which user account the
> > >             verification procedure within the master is executed. By default
> > >             the script is executed as sgeadmin user. Within the client context
> > >             the script is executed as submit user.
> > >
> > >       (N4)  Data defining the job will be provided to the verification
> > >             procedure.
> > >
> > >       (N5)  After evaluating a job the verification result might either be:
> > >                *  accept job
> > >                *  correct parameters part of the job specification
> > >                *  reject job
> > >                *  temporarily reject job (it might be accepted later)
> > >
> > >       (N6)  Nearly all parameters which define a job can be changed by the
> > >             verification procedure but there are some exceptions. The
> > >             following things are only available as read-only parameters:
> > >                * type (qsub job => qlogin ...)
> > >                * script file to be executed
> > >                * arguments passed to the job
> > >                * user who submitted the job
> > >             The job script content itself is not available in the job
> > >             submission verification script.
> > >
> > >
> > >       (N7)  As a minimum requirement, at least the following parameters have
> > >             to be changeable by the job verification procedure in a first
> > >             implementation
> > >                * pe request
> > >                * resource requests (hard and soft)
> > >                * queue and host requests
> > >                * project request
> > >
> > >       Implementation notes and necessary steps:
> > >
> > >       (I1)  (N1) and (N2) will be realized as a script. The script language can
> > >             be chosen by the administrator.
> > >
> > >       (I2)  The script has to be written in a way so that it can be executed
> > >             like a loadsensor script. It has to accept commands and
> > >             parameters from stdin and return results via stdout.
> > >             It should not terminate until it gets a corresponding command.
> > >
> > >       (I3)  A file named "client_jsv" and located in $SGE_ROOT/$SGE_CELL/common
> > >             will be started by the clients qsub, qrsh, qsh, qlogin and qmon and
> > >             the DRMAA library (N1) before a new job is sent to qmaster. This
> > >             script will be started under the account of the user who
> > >             tries to start a new job.
> > >
> > >       (I4)  The script to be evaluated in qmaster (N2) has to be configured
> > >             in the cluster configuration. The parameter will be named
> > >             "server_jsv" and, similar to "prolog" and "epilog", it will
> > >             allow specifying under which user privileges this procedure will
> > >             be started. (N3)
> > >
> > >       (I5)  One instance of server_jsv will be started during startup of
> > >             qmaster for each worker thread or whenever the cluster
> > >             configuration parameter changes or whenever the timestamp of the
> > >             script file changes.
> > >
> > >       (I6)  The server side instances of the verification scripts are connected
> > >             to the worker threads via pipes. Parameters and commands will
> > >             be sent to the scripts and the response is read from the script
> > >             output.
> > >
> > >       (I7)  After the script has been started it has to respond to the
> > >             following commands. Please note that each command
> > >             might print ERROR=<message> to stdout to indicate an error.
> > >
> > >             command  action
> > >             -------  ---------------------------------------------------------
> > >             START    Trashes cached data and starts a verification for a
> > >                      new job.
> > >
> > >                      Prints STARTED to stdout
> > >
> > >                      After that the script accepts only a BEGIN or one or
> > >                      multiple PARAM_<name>=<value> commands
> > >
> > >             BEGIN    This command triggers the verification of provided
> > >                      parameters set by PARAM_<name>=<value>
> > >
> > >                      Prints RESULT=<result> and optionally
> > >                      RESULT_MSG=<message> or RESULT_MSG_LOG=<message>
> > >
> > >                      <result> might be:
> > >                         ACCEPT
> > >                            job is accepted without changes
> > >                         CORRECT
> > >                            job is accepted but all PARAM_<name>... which have
> > >                            been sent between the initial BEGIN and the final
> > >                            RESULT have to be evaluated and applied to the job
> > >                            before it is accepted.
> > >                         REJECT
> > >                            job is rejected
> > >                         REJECT_WAIT
> > >                            job is rejected but might be accepted later
> > >
> > >                      <message> is a user readable message
> > >                         which will be sent to the client to be printed as
> > >                         GDI answer (RESULT_MSG) or it will be printed to
> > >                         stdout of the client command (RESULT_MSG_LOG on
> > >                         client side) or it will be printed to the master
> > >                         messages file (RESULT_MSG_LOG on master side)
> > >
> > >             PARAM_<name>=<value>    <name> and <value> are parameter names
> > >                      and corresponding values as documented in submit(1) e.g.
> > >
> > >                      <name>      <value>
> > >                      ----------- ---------------------
> > >                      a           <date_time>
> > >                      ac          <variable>[=<value>],...
> > >                      b           "y" | "n"
> > >                      ...
> > >
> > >                      additionally the following names are supported
> > >
> > >                      CLIENT      "qsub" | "qsh" | "qlogin" | "qmon" | "qalter"
> > >                      CONTEXT     "client" | "server"
> > >                                  explains if the script is executed in a client
> > >                                  (N1) or in the master (N2)
> > >                      JOB_ID      <job_id>
> > >                                  (only available on server side)
> > >                      SCRIPT      <path_of_job_script>
> > >                      SCRIPT_ARGS <arguments_for_job_script>
> > >                      USER        <submit_user_name>
> > >
> > >             QUIT     Terminates the job submission verification script
> > >
> > >             Example: Find below the data which is sent to the job submission
> > >                      verification script when the following job is submitted:
> > >
> > >                      > qsub -pe pe1 3 -hard -l lic=1 -soft -l q=all.q troete.sh
> > >
> > >                      Please note that parameters that are not explicitly
> > >                      requested by the submitter of a job are not passed
> > >                      to the script. This means that e.g. "-b n" of qsub won't be
> > >                      passed to the script because this is the default
> > >                      when nothing else is specified.
> > >
> > >                 Input                Output
> > >
> > >             01) "START\n"
> > >             02)                      "STARTED\n"
> > >             03) "PARAM_CLIENT=qsub\n"
> > >             04) "PARAM_USER=ernst\n"
> > >             05) "PARAM_pe=pe1 3\n"
> > >             06) "PARAM_hard=\n"
> > >             07) "PARAM_l=lic=1\n"
> > >             08) "PARAM_soft=\n"
> > >             09) "PARAM_l=q=all.q\n"
> > >             10) "PARAM_SCRIPT=troete.sh\n"
> > >             11) "BEGIN\n"
> > >             12)                      "PARAM_pe=pe1 4\n"
> > >             13)                      "RESULT_MSG=no multiple of 4\n"
> > >             14)                      "RESULT=CORRECT\n"
> > >
> > >             15) "START\n"
> > >             16)                      "STARTED\n"
> > >             17) ...
> > >
> > >             99) "QUIT\n"
> >
> > looks feasible. Questions:
> >
> > - are all options from "sge_request" already included here?
> Yes
> >
> > - will -soft and -hard be grouped (maybe they should be mentioned per
> > parameter for easier parsing)?
> > - how are multiple resource requests encoded? I mean "-l type1=5,type2=8"
> >
> > will it be "PARAM_type1=5\n" plus "PARAM_type2=8\n" or just one
> > statement?
> I would send one statement. Otherwise we would need to extend the protocol with
> commands that address individual elements in lists such as -l or -v, so that
> JSV scripts can add or remove elements.
> >
> > Somehow this means implementing a parser in the script to look for "=" and
> > strip off the "PARAM_". Maybe it would be easier to send these items by
> > sending a line with:
> >
> > "PARAM" "CLIENT" "qsub"\n
> >
> > Then the script could simply use (note the use of ' and " for demonstration
> > purposes):
> >
> > $ line='"PARAM" "CLIENT" "qsub"'
> > $ eval set $line
> > $ echo $1
> > PARAM
> > $ echo $2
> > CLIENT
> > $ echo $3
> > qsub
> >
> > even this works:
> >
> > $ line='"PARAM" "l" "type" "with some blanks"'
> > $ eval set $line
> > $ echo $4
> > with some blanks
> You are right. At least here we can save some parsing effort. I will change
> that...
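
To make the draft protocol quoted above more concrete, here is a minimal
sketch of a verification script loop in Python. It follows the
PARAM_<name>=<value> format of the draft, not the quoted-token format
discussed afterwards. serve and verify are hypothetical names, and the policy
(rounding the slot count up to a multiple of 4) is only modelled on the
example transcript, not taken from any implementation.

```python
import io


def verify(params, multiple):
    """Hypothetical policy: round the "pe" slot count up to a multiple."""
    pe_values = [value for name, value in params if name == "pe"]
    if pe_values:
        pe_name, _, slots = pe_values[-1].rpartition(" ")
        if pe_name and slots.isdigit() and int(slots) % multiple:
            fixed = -(-int(slots) // multiple) * multiple  # ceiling division
            return [f"PARAM_pe={pe_name} {fixed}",
                    f"RESULT_MSG=no multiple of {multiple}",
                    "RESULT=CORRECT"]
    return ["RESULT=ACCEPT"]


def serve(inp, out, multiple=4):
    """Read commands line by line and answer until QUIT or end of input."""
    params = []  # (name, value) pairs of the job currently being verified
    for raw in inp:
        line = raw.rstrip("\n")
        if line == "START":
            params = []  # trash cached data from the previous job
            out.write("STARTED\n")
        elif line.startswith("PARAM_"):
            name, _, value = line[len("PARAM_"):].partition("=")
            params.append((name, value))
        elif line == "BEGIN":
            for reply in verify(params, multiple):
                out.write(reply + "\n")
        elif line == "QUIT":
            break
        else:
            out.write("ERROR=unknown command: " + line + "\n")
        out.flush()  # the peer waits on the pipe, so do not buffer replies


# Replay the input side of the example from the specification.
transcript = io.StringIO(
    "START\nPARAM_CLIENT=qsub\nPARAM_USER=ernst\nPARAM_pe=pe1 3\n"
    "PARAM_hard=\nPARAM_l=lic=1\nPARAM_soft=\nPARAM_l=q=all.q\n"
    "PARAM_SCRIPT=troete.sh\nBEGIN\nQUIT\n"
)
replies = io.StringIO()
serve(transcript, replies)
print(replies.getvalue(), end="")
```

A real script would be driven with serve(sys.stdin, sys.stdout). Keeping the
parameters as an ordered list of pairs preserves repeated names such as the
two "l" entries that follow "hard" and "soft".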

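The quoted-token line format suggested in the thread is also easy to consume
outside the shell. As an illustration, assuming the lines follow POSIX shell
quoting rules, Python's standard shlex module splits them the same way as
"eval set" does. parse_tokens is a hypothetical helper name.

```python
import shlex


def parse_tokens(line: str) -> list:
    """Split a line such as '"PARAM" "CLIENT" "qsub"' into its tokens."""
    # shlex.split only tokenizes; unlike "eval" it never executes anything
    return shlex.split(line)


print(parse_tokens('"PARAM" "CLIENT" "qsub"'))
print(parse_tokens('"PARAM" "l" "type" "with some blanks"')[3])
```

One point to weigh for shell scripts: "eval" also performs command
substitution inside double quotes, so a line carrying user-supplied values
would need to be trusted or escaped before being passed to it.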