[GE users] Resource reservation fails for large job

andreas andreas.haas at sun.com
Wed Jun 3 13:55:40 BST 2009



Hi Sabine,

According to the last few lines of your schedule file, job 1029 gets a backfill assignment before job 953:

    953:1:RESERVING:1244742740:70:Q:par.q@lcc09-be1t12:slots:8.000000
    953:1:RESERVING:1244742740:70:L:max_slots_per_user:c706119/////:8.000000
    1029:1:STARTING:1244002506:70:P:openmp:slots:8.000000
    1029:1:STARTING:1244002506:70:Q:par.q@lcc09-be1t12:slots:8.000000

What I find suspicious is that your jobs have a duration of no more than 70 seconds.

As long as job 1029 finishes within 70 seconds, the reservation should be stable.
But if job 1029 needs more than 70 seconds, the reservation can get
(repeatedly) delayed.
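
To make that check concrete, here is a minimal Python sketch that splits schedule lines into their fields and compares the assumed end time of a STARTING job with the start time of a RESERVING entry. The field order (job id, task id, state, start time, duration, level, object, resource, utilization) is my reading of the schedule file description in sched_conf(5). The names ScheduleRecord, parse_schedule_line and backfill_slack are hypothetical helpers for illustration, not part of Grid Engine. The sketch only encodes the condition described above; it does not reproduce how the scheduler itself decides.

```python
from dataclasses import dataclass


@dataclass
class ScheduleRecord:
    """One line of the schedule file (field names are illustrative)."""
    job_id: int
    task_id: int
    state: str          # e.g. RESERVING or STARTING
    start_time: int     # seconds since the epoch
    duration: int       # runtime the scheduler assumes, in seconds
    level: str          # e.g. P (parallel environment), Q (queue), L (quota)
    object_name: str
    resource: str
    utilization: float


def parse_schedule_line(line: str) -> ScheduleRecord:
    """Split one colon-separated schedule line into its nine fields."""
    fields = line.strip().split(":")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    job, task, state, start, duration, level, obj, resource, util = fields
    return ScheduleRecord(int(job), int(task), state, int(start),
                          int(duration), level, obj, resource, float(util))


def backfill_slack(backfill: ScheduleRecord, reservation: ScheduleRecord) -> int:
    """Seconds between the backfilled job's assumed end and the reservation start.

    A negative value means the backfilled job is assumed to still be running
    when the reservation is supposed to begin.
    """
    assumed_end = backfill.start_time + backfill.duration
    return reservation.start_time - assumed_end


if __name__ == "__main__":
    reserving = parse_schedule_line(
        "953:1:RESERVING:1244742740:70:Q:par.q@lcc09-be1t12:slots:8.000000")
    starting = parse_schedule_line(
        "1029:1:STARTING:1244002506:70:Q:par.q@lcc09-be1t12:slots:8.000000")
    print("assumed end of job 1029:", starting.start_time + starting.duration)
    print("slack before reservation:", backfill_slack(starting, reserving))
```

The slack is only as trustworthy as the duration field: if a job really runs longer than the 70 seconds the scheduler assumes, the computed end time is wrong and the reservation can be pushed back on the next scheduling run.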

Are you using any hard limit (-l ..) when you submit these jobs?

Regards,
Andreas

On Wed, 3 Jun 2009, s_kreidl wrote:

> Hello,
>
> we are currently running SGE 6.2u1 on a homogeneous 200-core cluster with 8-core nodes. At the moment, one user is flooding the cluster with 4- and 8-core OpenMP jobs, while a 100-core job that has a guaranteed higher priority and was submitted with -R y is starving.
>
> I have excerpted a passage from $SGE_ROOT/$SGE_CELL/common/schedule (MONITOR=1) that clearly shows where things go wrong; see the attachment.
> Only the last six lines are relevant. Although 8 slots are reserved on the queue instance par.q@lcc09-be1t12, the small job is nevertheless started on that queue instance within the same scheduling interval.
>
> I have checked the mailing list as much as I could. I can exclude issue #2896, as the relevant queue has the same runtime limits as the global configuration, namely 10/14 days. I can also exclude issue #2344, as all jobs are running in the same queue "par.q", which is not subordinate to any other.
>
> I have set "Maximum Reservation" to 200 just in case (though I don't really know what it means: the number of jobs to make reservations for, the number of "RESERVING" lines in the schedule file per scheduler run, ...?), and I have double-checked that the small jobs at no time have a higher priority than the large job.
>
> What additionally strikes me is that "qstat -g c" never shows any reserved slots.
>
> What am I doing wrong? Or can anyone give me a clear description of how to reliably implement resource reservation?
>
> I'd be really grateful for any advice.
>
> Thanks in advance,
> Sabine
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=200636
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

http://gridengine.info/


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=200641



