[GE users] job submission verifier (v0.7)

Ernst Bablick Ernst.Bablick at Sun.COM
Fri Sep 26 14:46:11 BST 2008

    [ The following text is in the "ISO-8859-1" character set. ]
    [ Your display is set for the "ISO-8859-10" character set.  ]
    [ Some special characters may be displayed incorrectly. ]


find attached v0.7 of the functional specification.

      0.7      Changes according to discussion on     26-09-08  Ernst 
               users mailing list. Additional
               section 4. Defined work packages



Ernst Bablick wrote:
> Hi,
> in the past some users expressed their need for some kind of 
> presubmission procedure which is executed whenever a job enters the GE 
> system (see also issue #2621).
> Find attached a draft for a corresponding GE enhancement. Please give 
> feedback by Friday.
> Regards,
> Ernst
> ------------------------------------------------------------------------
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net

    [ Part 2: "Attached Text" ]

???      Functional Specification: Job Submission Verifier (JSV) 

      Version  Comment                                Date      Author
      -------  -------------------------------------  --------  -------------
      0.1      Initial version                        ?         Andreas Haas
      0.5      Describe changes so that enhancement   17-09-08  Ernst Bablick
               can be implemented for Urubu with
               less performance loss
      0.6      added missing parts according to       22-09-08  Ernst Bablick
               discussion with RD and AS
      0.7      Changes according to discussion on     26-09-08  Ernst Bablick
               users mailing list. Additional
               section 4. Defined work packages.


      In the past some of our users expressed their need for some kind of 
      presubmission procedure which is executed whenever a job enters the GE
      system. (see also issue #2621). Here are some examples what should be 
      done in such a procedure:
      -  Check accounting DB to make sure the user has enough wall clock 
         hours in their account to run the requested job on the requested 
         slots for the requested time. 

      -  Guarantee that the number of slots requested is a multiple of 16 for
         parallel jobs.
      -  Verify that the user can write to various shared filesystems.

      -  Make sure that the user does not request certain -l resources that 
         might not behave the way the user expects them to (h_vmem, h_data, 

      -  Add required resource requests that users don't now are mandatory. 

      -  Add a project request of the form -P queue_name where queue_name is 
         the queue used with the -q option.

      -  Make sure that the user hasn't messed up their ssh keys so badly 
         that they cannot ssh into compute nodes w/o a pass phrase.

      -  Print out status messages and errors about the above as well as 
         printing out the queue, allocation account name, PE, 
         total number of tasks requested, and number of tasks per node 

      -  Print out an motd-like message at the top of qsub output

         > qsub job.sge
         Please note that we strongly advise using the mvapich-devel MPI
         stack for running jobs with more than 2048 MPI tasks.
         --> Submitting 16 tasks...
         --> Submitting 16 tasks/host...
         --> Submitting exclusive job to 1 hosts...


2.1   Project Aim

      Aim of the project is it to provide a interface enhancement for GE that
      allows it to define job verification/modification routines which will 
      either be executed on client side or within qmaster process when a 
      job enters the system or both.

2.2   Project Benefit

      The administrator of a GE cluster can define additional policies needed.

      The GE cluster will not be loaded with jobs which would break a defined
      policy if a job verification/modification routine is defined. 

2.3   Project Duration


2.4   Project Dependencies

      There are no known dependencies with other projects


3.1   Enhancement Functions

      Here is the summary of the customer needs:
      (N1)  The administrator gets the possibility to define job verification
            procedure which will be executed in qsub, qrsh, qsh, qlogin, qmon 
            and applications using DRMAA, to evaluate a job before it is send 
            to qmaster

      (N2)  The administrator gets the possibility to define a job verification
            procedure which will be executed on qmaster side before a job 
            is finally added to the qmaster data store or before the 
            modification of a job is finally accepted.

      (N3)  It will be possible to define under which user account the
            verification procedure within the master is executed. By default 
            the script is executed as sgeadmin user.

      (N4)  Data defining the job will be provided to the verification 

      (N5)  After evaluating a job the verification result might either be:
               *  accept job
               *  correct parameters part of the job specification
               *  reject job 
               *  temporarily reject job (it might be accepted later)

      (N6)  Nearly all parameters which define a job can be changed by the 
            verification procedure but there are some exceptions. Following
            things are only available as read only parameter:
               * type (qsub job => qlogin ...)
               * script file to be executed
               * arguments passed to the job  
               * user who submitted the job
            The job script contend itself is not available in the job
            submission verification script.
      (N7)  As a minimum requirement at least following parameters have to be
            changeable by the job verification procedure in a first 
               * pe request 
               * resource requests (hard and soft)
               * queue and host requests 
               * project request

      Implementation notes and necessary steps:

      (I1)  (N1) and (N2) will be realized as script. The script language can 
            be chosen by the administrator. 

      (I2)  The script has to be written in a way so that it can be executed
            like a load sensor script. It has to accept commands and 
            parameters from stdin and return results via stdout. 
            It should not terminate until it gets a corresponding command.

      (I3)  qsub, qrsh, qsh, qlogin and qmon and the DRMAA library (N1) 
            will start a presubmission script which can be configured by using 
            "-jsv <jsv_url>" in the cluster wide sge_request file. The script
            specified with <jsv_url> will be started under the user account 
            of the user which tries to submit a job.

      (I4)  The script to be evaluated in qmaster (N2) has to be configured
            in the cluster configuration. The parameter will be named
            "server_jsv". The value is also a <jsi_url>

      (I4.1) <jsi_url> will have following format

               <jsi_url> := [ <type> ":" ] [ <user> "@" ] <path> 

            <user> is the username under which the JSV code will be executed.
            <type> might be the string "script". (In future we might support
            shared libraries or master plugins which might be written in java.)

      (I5)  One instance of server_jsv will be started during startup of 
            qmaster for each worker thread or whenever the cluster
            configuration parameter changes or whenever the time stamp of the
            script file changes. 
      (I6)  The server side instances of the verification scripts are connected
            to the worker threads via pipes. Parameters and commands will
            be send to the scripts and the response is read from the script 

      (I7)  After the script has been started it has to be responsive to 
            execute following commands. Please note that each command 
            might print "ERROR <message>" to stdout to indicate an error
            or "LOG <level> <message>" to print additional information
            to the message file of the master or to stdout for command line
            clients. <level> might be "INFO", "WARNING", "ERROR" 

            command  action
            -------  --------------------------------------------------------- 
            START    Trash cached data and start a verification for a 
                     new job. 
                     Prints "STARTED" to stdout

                     After that the script accepts only a "BEGIN" or one or 
                     multiple "PARAM <name> <value>" commands 

            BEGIN    This command triggers the verification of provided 
                     parameters set by "PARAM <name> <value>"

                     Prints "RESULT STATE <result> [<message>]" as last line 
                     in the output. 

                     <result> might be:
                           job is accepted without changes
                           job is accepted but all PARAM... which have 
                           been sent between the initial BEGIN and the final 
                           RESULT have to be evaluated and applied to the job
                           before it is accepted.
                           job is rejected
                           job is rejected but might be accepted later 

                     <message> (optional) is a user readable message
                        which will be sent to the client to be printed as
                        GDI answer.

                     Before the "RESULT STATE <result>" line the string
                     "SEND <arg>" can be send to request information
                     which is not send per default. <arg> might be
                           to trigger the client/master to send the
                           job environment

            PARAM <name> <value>    
                     <name> and <value> are parameter names 
                     and corresponding values as documented in submit(1) e.g.

                     <name>      <value>
                     ----------- ---------------------
                     a           <date_time> in the format CCYYMMDDhhmmSS
                     ac          <variable>[=<value>],...
                     ar          <ar_id>
                     A           <account_string>
                     b           "y" | "n"

                     switches without additional arguments like -cwd or -notify
                     will be handled like arguments with yes/no argument

                     cwd         "y" | "n"
                     notify      "y" | "n"
                     for soft/hard resource requests the word "_soft", "_hard"
                     is appended to the switch name. E.g.
                     l_soft      ...
                     l_hard      ...

                     Client parameter which were not specified by the 
                     submitter of a job and which have a default value
                     during the submission won't be passed to the JSV script. 

                     Additionally to the client switches following names 
                     are supported. In contrast to the client switches
                     they are read-only parameters in Urubu. Some of them 
                     might be changeable in future releases.

                     CLIENT      "qsub" | "qsh" | "qlogin" | "qmon" | 
                                 "qalter" | "qmake" | "drmaa"
                     CONTEXT     "client" | "server"
                                 explains if the script is executed in a 
                                 client (N1) or in the master (N2)
                     ENV         <name>=<value>,<name>=<value>,...
                                 (not send per default)
                     JOB_ID      <job_id> (only within master available)
                     SCRIPT      <path_of_job_script>
                     SCRIPT_ARGS <arguments_for_job_script>
                     USER        <submit_user_name>
                     VERSION     <major>.<minor> 
                                 e.g "1.0"

            QUIT     Terminates the job submission verification script

            Exampe: Find below the data which is sent to the job submission
                    verification script, when following job is submitted:

                    > qsub -pe p 3 -hard -l a=1,b=5 -soft -l q=all.q troete.sh 

                    Please note that parameters that are not explicitly 
                    requested by the submitter of a job are not passed
                    to the script. This means that e.g "-b n" of qsub won't be
                    passed to the script because this is the default
                    when nothing else is specified.

                Input                Output

            01) "START" 
            02)                      "STARTED"
            03) "PARAM CLIENT qsub"
            04) "PARAM USER ernst"
            05) "PARAM pe p 3"
            07) "PARAM l_hard a=1,b=5"
            09) "PARAM l_soft q=all.q" 
            10) "PARAM SCRIPT troete.sh"
            11) "BEGIN"
            12)                      "PARAM pe p 4"
            13)                      "LOG WARNING jsv pe slots correction"
            14)                      "SEND ENV"
            15) "ENV HOME2=/troete/rabarber"
            16)                      "RESULT CORRECT no multiple of 4"

            17) "START"
            18)                      "STARTED"
            19) ...

            99) "QUIT"

      (I8)  At least a bourne shell JSV script will be part of Urubu which 
            can be used as template for a GE administrator.

3.2   Overall Block Diagram



4.1   Performance

      GE clusters which do not use JSV functionality will have the same 
      performance as before. 

      The use of the JSV (master) will decrease the overall performance of a 
      GE cluster. Especially the submit rate in highly loaded cluster will 
      decrease. We will see how big the performance decrease will be when 
      first performance tests can be done.

      We might increase the number of worker threads to compensate performance
      decrease but this will only have a positive effect if the global
      lock is not hold, when the job verification script is triggered in 
      the master.

4.2   Reliability, Availability, Serviceability


4.3   Diagnostics

      Job Submission Verification scripts can be started stand-alone. The
      protocol to trigger the verification of a job is simple so that a
      GE administrator has the possibility to test the script intensively
      before it is integrated in a running cluster.

      To find errors in scripts there is the possibility to return
      logging messages which will be printed either to stdout (N2)
      or which will be added to the master message file (N1).

4.4   User Experience

      see section 3       

4.5   Manufacturing

4.6   Quality Assurance

      Several Testsuite scenarios have to be realized (see 2.3)

4.7   Security & Privacy

      Data is passed to JSV scripts which initially was provided from users
      which submitted a job. Depending on the data and depending how this
      data is interpreted in the JSV script this might be a security risk.

      Special notes have to be added to the JSV script templates, man pages
      and documentation. 

4.8   Mitigation Path


4.9   Documentation

      Several man pages have to be changed: submit.1, sge_conf.5, sge_request.5

      A new man page has to be created which defines the protocol between
      client/master and JSV script. All provided parameters for a JSV script 
      have to be documented.

      An additional section has to be written which should be added to the
      admin guide and notes might be added to the users guide which refer to
      the new section in the admin guide.

4.10  Installation

      There are no changes needed for the installation process

      Special care has to be taken for the cluster configuration change.
      The master process of a Urubu cluster has to deal with cluster
      configurations where the JSV script parameter does not exist. We won't  
      have an stand-alone update procedure. Therefore the master of an
      6.2 Urubu cluster has to read the 6.2 cluster configuration and 
      write a Urubu cluster configuration during the startup phase.

4.11  Packing

      The JSV template scripts and an additional man page will be added to the 

4.12  Issues/Risks and Purposed Mitigation

      see section 4.1

    [ Part 3: "Attached Text" ]

    [ The following text is in the "iso-8859-1" character set. ]
    [ Your display is set for the "ISO-8859-10" character set.  ]
    [ Some special characters may be displayed incorrectly. ]

To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net

More information about the gridengine-users mailing list