[GE users] Prioritised VIP jobs and master queue setup

erilon78se erik.lonroth at scania.com
Mon Jul 26 09:04:52 BST 2010



Hello again!

First, I would say that the functionality "run this job now, and suspend whatever is necessary to get it working" would be a very good enhancement.

We are currently trying to build around this, by having external logic (aka a "qsub-wrapper") that addresses this specific issue. Would you say that the "JSV" (job submission verifier) would be a suitable target for this purpose?
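
To make the question concrete, here is a minimal sketch of the selection step such external logic might perform, whether it lives in a qsub-wrapper or elsewhere. It is only an illustration: the RunningJob record, the function name and the slot counts are hypothetical, the greedy largest-first choice is a simple heuristic rather than anything SGE provides, and it counts slots cluster-wide without looking at which hosts they are on (which matters for parallel jobs).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunningJob:
    """Hypothetical record of a running job, e.g. parsed from qstat output."""
    job_id: int
    slots: int
    is_vip: bool = False


def pick_jobs_to_suspend(running: List[RunningJob], free_slots: int,
                         needed_slots: int) -> Optional[List[RunningJob]]:
    """Choose normal jobs to suspend so that needed_slots become free.

    Returns [] if enough slots are already free, and None if suspending
    every normal job would still not be enough. VIP jobs are never chosen.
    """
    shortfall = needed_slots - free_slots
    if shortfall <= 0:
        return []
    # Greedy heuristic: take the largest normal jobs first, so that few
    # jobs are suspended. This may free more slots than strictly needed.
    candidates = sorted((j for j in running if not j.is_vip),
                        key=lambda j: j.slots, reverse=True)
    chosen = []
    for job in candidates:
        if shortfall <= 0:
            break
        chosen.append(job)
        shortfall -= job.slots
    return chosen if shortfall <= 0 else None


if __name__ == "__main__":
    running = [RunningJob(1, 8), RunningJob(2, 4),
               RunningJob(3, 16, is_vip=True), RunningJob(4, 2)]
    picked = pick_jobs_to_suspend(running, free_slots=2, needed_slots=12)
    print([j.job_id for j in picked])
```

A wrapper built on this would presumably suspend each chosen job (for instance with "qmod -sj <job_id>") before submitting the VIP job. That part is an assumption about our own tooling, not something the sketch does.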

Secondly, regarding bug 2603 (http://gridengine.sunsource.net/issues/show_bug.cgi?id=2603) (Status: REOPENED): do you think the developers will address the issue soon, or would you suggest we patch the SGE software ourselves? In that case, who should we talk to in order to get the right functionality in place?

Lastly, I have to thank you, Reuti, for your excellent help with the various questions we have had before. It has been of outstanding value to us, and I'm sure many others share my opinion.

Regards
/Erik Lönroth

-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de]
Sent: den 23 juli 2010 16:49
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Prioritised VIP jobs and master queue setup


Hi,

Am 23.07.2010 um 12:42 schrieb erilon78se:

> We have run SGE (5.X) for a good four years, with very good
> results. We've invested in a new cluster and are now trying to improve
> our queueing setup.
>
> Let me describe our environment briefly:
>
> * SGE v6.2u5
>
> * We have a number of parallel (running on multiple hosts)
> applications, let's say 10 different ones.
>
> * We have a number of serial (running on a single node) applications,
> let's say 5 different ones.
>
> * Our hardware is homogeneous, with 8 cores/host.
>
> * Some of the applications require a "master process" that does I/O
> and control and needs to be exclusively allocated to a single node. A
> "master process" cannot be run alongside any other application on the
> same node (that would be "over-subscription").
>
> We have previously solved this problem with a "master.q"
> containing dedicated hosts with slots=1 and a PE with
> job_is_first_task set. This results in a correct slot allocation,
> as shown below.
>
>
> Fig:A - "master.q solution"
>
> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>   1031 0.95734 test_check konrns       r     07/19/2010 10:53:22 mastr.q at ts201-c-1-1.sss.se.sca MASTER
>   1031 0.95734 test_check konrns       r     07/19/2010 10:53:22 the.q at ts201-c-1-0.sss.se.scani SLAVE
>                                                                  the.q at ts201-c-1-0.sss.se.scani SLAVE
>                                                                  the.q at ts201-c-1-0.sss.se.scani SLAVE
>                                                                  the.q at ts201-c-1-0.sss.se.scani SLAVE
>                                                                  the.q at ts201-c-1-0.sss.se.scani SLAVE
>                                                                  the.q at ts201-c-1-0.sss.se.scani SLAVE
>                                                                  the.q at ts201-c-1-0.sss.se.scani SLAVE
>                                                                  the.q at ts201-c-1-0.sss.se.scani SLAVE
>
>
>
> Now, I have two core problems to solve on which I need your help:
>
> Problem A - Using any host for any purpose.
> -------------------------------------------
> The problem with the "master.q" strategy is that hosts inside
> "mastr.q" (above) are not available for other jobs NOT requiring a
> "master process". They idle until such a job enters the system, which
> is not what we want.
>
> We want to be able to use any host as a potential "master host" and
> lock it down once a "master process" enters it, AND never let a
> "master process" onto a node that is running anything at all.
>
> If we can solve this, we can use all hosts in a "parallel queue" for
> any application, but won't risk failure from over-subscribing a host
> running a master process.

There is a possible setup, but as long as this issue remains reopened, it won't work:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=2603

The trick is to use a host consumable to trigger an alarm in parallel.q (i.e. getting it disabled) as soon as at least one slot in serial.q is used. serial.q, on the other hand, is subordinated to parallel.q to make it exclusive the other way round as well.
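
For reference, the behaviour asked for in Problem A can be written down as a small placement rule. The sketch below only models the desired policy (any idle host may take a master process, a host running a master process is locked, and a master process never lands on a busy host). It is not SGE configuration and does not implement the consumable/subordination trick. The Host class, the function names and the host names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical model of one execution host."""
    name: str
    total_slots: int = 8       # 8 cores/host, as in the environment described
    used_slots: int = 0
    has_master: bool = False


def can_place(host: Host, is_master: bool, slots: int = 1) -> bool:
    """Apply the exclusivity rule to a single placement request."""
    if host.has_master:
        return False                      # host is locked by a master process
    if is_master:
        return host.used_slots == 0       # a master needs a completely idle host
    return host.used_slots + slots <= host.total_slots


def place(host: Host, is_master: bool, slots: int = 1) -> bool:
    """Place the task if the rule allows it and update the host state."""
    if not can_place(host, is_master, slots):
        return False
    if is_master:
        host.has_master = True
        host.used_slots = host.total_slots   # lock the whole host
    else:
        host.used_slots += slots
    return True


if __name__ == "__main__":
    busy = Host("node-a")
    place(busy, is_master=False, slots=3)
    print(can_place(busy, is_master=True))     # busy host: no master allowed

    idle = Host("node-b")
    print(place(idle, is_master=True))         # idle host accepts a master
    print(can_place(idle, is_master=False))    # and is then locked for others
```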


> Problem B - Make way for VIP-jobs.
> ----------------------------------
> Once "the.q" becomes full enough, we have not found a good way to
> "suspend/checkpoint" one - or a few - jobs, in order to free up "just
> enough" resources for a "VIP-job" entering the system.
>
> A VIP-job can be any application, submitted at any time, with any
> resource requests and with any time limit or whatever.
>
> We need a way to let OGE automatically "choose" enough "normal jobs",
> selected based on the VIP-job requirements, suspend those, and make
> sure the VIP-job will be started before any other job in a waiting
> state.
>
> If more than one VIP-job is submitted,

Unfortunately, there is no such feature as "run this job now, and suspend whatever is necessary to get it working":

http://gridengine.sunsource.net/ds/viewMessage.do?dsMessageId=228184&dsForumId=38


> VIP-jobs currently running should be protected from suspension.

Would need:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3162

This would need a complete rewrite of the scheduler, which could then also include more real-time and cron-like features. Several such requests have come up on the list.

-- Reuti


>
>
> I have tried to formulate my problems as well as I can, and I
> greatly appreciate any ideas for a setup.
>
> Kind regards
> /Erik Lönroth, Technical Responsible on High Performance Computing,
> Scania Infomate AB - Sweden.
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=269892
>
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=269956


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=270429



