[GE users] Prioritised VIP jobs and master queue setup

erilon78se erik.lonroth at scania.com
Fri Jul 23 11:42:56 BST 2010



Hello!

We have run SGE (5.X) for a good four years, with very good results.
We have invested in a new cluster and are now trying to improve our queueing setup.

Let me describe our environment briefly:

* SGE v6.2u5

* We have a number of parallel applications (running on multiple hosts), let's say 10 different ones.

* We have a number of serial applications (running on a single node), let's say 5 different ones.

* Our hardware is homogeneous, with 8 cores/host.

* Some of the applications require a "master process" that does I/O and job control, and needs a node exclusively to itself.
A "master process" cannot run alongside any other application on the same host (i.e. no over-subscription of the master host).

We have previously solved this with a "master.q" containing dedicated hosts with slots=1, combined with a PE where job_is_first_task is set.
This results in a correct slot allocation, as shown in Fig:A below; a rough sketch of the kind of configuration behind it follows the figure.


Fig:A - "master.q solution"

job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
   1031 0.95734 test_check konrns       r     07/19/2010 10:53:22 mastr.q@ts201-c-1-1.sss.se.sca MASTER
   1031 0.95734 test_check konrns       r     07/19/2010 10:53:22 the.q@ts201-c-1-0.sss.se.scani SLAVE
                                                                  the.q@ts201-c-1-0.sss.se.scani SLAVE
                                                                  the.q@ts201-c-1-0.sss.se.scani SLAVE
                                                                  the.q@ts201-c-1-0.sss.se.scani SLAVE
                                                                  the.q@ts201-c-1-0.sss.se.scani SLAVE
                                                                  the.q@ts201-c-1-0.sss.se.scani SLAVE
                                                                  the.q@ts201-c-1-0.sss.se.scani SLAVE
                                                                  the.q@ts201-c-1-0.sss.se.scani SLAVE

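For completeness, here is a rough sketch of the kind of configuration that produces the allocation in Fig:A. The PE name, host group and script name are only illustrative (not our exact setup), and I have trimmed the qconf output to the relevant attributes:

    # Relevant parts of the PE used by the parallel applications
    # (job_is_first_task makes the master task consume a slot of its own):
    $ qconf -sp smp_pe
    pe_name            smp_pe
    slots              9999
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  TRUE

    # master.q holds dedicated hosts and offers a single slot per host:
    $ qconf -sq master.q | grep -E 'hostlist|slots'
    hostlist              @master_hosts
    slots                 1

    # Submitting with -masterq forces the master task into master.q;
    # the 8 slave tasks then land in the.q, as in Fig:A:
    $ qsub -pe smp_pe 9 -masterq master.q test_check.sh
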


Now, I have two core problems to solve, and I need your help with both:

Problem A - Using any host for any purpose.
-------------------------------------------
The problem with the "master.q" strategy is that the hosts inside "mastr.q" (above) are not available for other jobs
that do NOT require a "master process". They sit idle until such a job enters the system, which is not what we want.

We want to be able to use every host as a potential "master host", lock a host down once a "master process" lands on it,
AND never let a "master process" onto a node that is already running anything at all.

If we can solve this, we can use all hosts in a "parallel queue" for any application,
without risking failure from over-subscribing a host that runs a master process.
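
To make the intent concrete, here is what we would *like* to be able to express. The resource name "exclusive_master" is purely hypothetical - nothing like it exists in our configuration today:

    # Hypothetical submission of a parallel job whose master task needs a
    # host of its own:
    $ qsub -pe smp_pe 9 -l exclusive_master=true test_check.sh

    # Desired scheduler behaviour:
    #  - the master task may be placed on ANY currently idle host, which is
    #    then locked against all other jobs while the master process runs
    #  - the 8 slave tasks are scheduled into the.q as usual
    #  - a host that already runs anything is never chosen as master host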



Problem B - Making way for VIP jobs.
------------------------------------
Once "the.q" becomes full enough, we have not found a good way to "suspend/checkpoint" one - or a few - jobs,
in order to free up "just enough" resources for a "VIP-job" entering the system.

A VIP job can be any of the applications, submitted at any time, with any resource requests and any time limit.

We need a way to let OGE automatically choose enough "normal" jobs, selected based on the VIP job's requirements,
suspend them, and make sure the VIP job is started before any other waiting job.

If more than one VIP job is submitted, VIP jobs that are already running should be protected from suspension.
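
And similarly for Problem B, just to illustrate the behaviour we are after. The project name "vip" and the script are hypothetical:

    # A VIP job is submitted like any other job, only marked as VIP:
    $ qsub -pe smp_pe 17 -P vip urgent_analysis.sh

    # Desired scheduler behaviour when the.q is full:
    #  - choose just enough running "normal" jobs, based on the VIP job's
    #    resource request, and suspend/checkpoint them
    #  - start the VIP job ahead of all other pending jobs
    #  - never pick an already running VIP job as a suspension victim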


I have tried to formulate my problems as well as I can, and I greatly appreciate any ideas for a setup.

Kind regards
/Erik Lönroth, Technical Responsible for High Performance Computing, Scania Infomate AB - Sweden.
