[GE users] Advance reservation strange behavior

Andreas.Haas@Sun.COM
Tue Jun 27 09:40:43 BST 2006



Most probably your suspicion is right. The parallel job's reservation
indeed becomes worthless if the sequential jobs do not finish at the
time foreseen by the scheduler, since default_duration is not
enforced by Grid Engine. Have you considered putting something like

    -l h_rt=:10:

into the cluster-wide sge_request(5) file?
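
For reference, with the default cell that file is
$SGE_ROOT/default/common/sge_request, so the entry could look as follows
(a sketch; the exact path depends on your cell name):

    # $SGE_ROOT/default/common/sge_request
    # Default hard runtime limit of 10 minutes (h_rt format is h:m:s)
    # for every job that does not request h_rt itself.
    -l h_rt=:10:

Users whose jobs run longer can still override this with their own
-l h_rt at submission time, so the scheduler gets a realistic duration
for every job when it plans reservations.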

Regards,
Andreas

On Mon, 26 Jun 2006, Sili (wesley) Huang wrote:

> 
> Hi Andreas,
> 
> 
> I tried to observe what is going on with this strange behavior. It seems to me that the reservation is
> tied to the specified run length of a job. For example, in this record of monitoring (385889 is a parallel
> job with reservation enabled and a high priority, and 385865 is a serial job):
> 
> 
> [root common]#  cat schedule | egrep "385865|385889|::::::::"
> ::::::::
> 385889:1:RESERVING:1151341235:3660:P:mpich:slots:12.000000
> 385889:1:RESERVING:1151341235:3660:G:global:ncpus_agerber:12.000000
> 385889:1:RESERVING:1151341235:3660:H:v60-n28:singular:2.000000
> 385889:1:RESERVING:1151341235:3660:H:v60-n65:singular:2.000000
> 385889:1:RESERVING:1151341235:3660:H:v60-n75:singular:1.000000
> 385889:1:RESERVING:1151341235:3660:H:v60-n66:singular:2.000000
> 385889:1:RESERVING:1151341235:3660:H:v60-n31:singular:2.000000
> 385889:1:RESERVING:1151341235:3660:H:v60-n62:singular:1.000000
> 385889:1:RESERVING:1151341235:3660:H:v60-n15:singular:2.000000
> 385889:1:RESERVING:1151341235:3660:Q:all.q@v60-n28:slots:2.000000
> 385889:1:RESERVING:1151341235:3660:Q:all.q@v60-n65:slots:2.000000
> 385889:1:RESERVING:1151341235:3660:Q:all.q@v60-n75:slots:1.000000
> 385889:1:RESERVING:1151341235:3660:Q:all.q@v60-n31:slots:2.000000
> 385889:1:RESERVING:1151341235:3660:Q:all.q@v60-n66:slots:2.000000
> 385889:1:RESERVING:1151341235:3660:Q:all.q@v60-n62:slots:1.000000
> 385889:1:RESERVING:1151341235:3660:Q:all.q@v60-n15:slots:2.000000
> ::::::::
> 385889:1:RESERVING:1151341250:3660:P:mpich:slots:12.000000
> 385889:1:RESERVING:1151341250:3660:G:global:ncpus_agerber:12.000000
> 385889:1:RESERVING:1151341250:3660:H:v60-n28:singular:2.000000
> 385889:1:RESERVING:1151341250:3660:H:v60-n65:singular:2.000000
> 385889:1:RESERVING:1151341250:3660:H:v60-n75:singular:1.000000
> 385889:1:RESERVING:1151341250:3660:H:v60-n62:singular:1.000000
> 385889:1:RESERVING:1151341250:3660:H:v60-n73:singular:2.000000
> 385889:1:RESERVING:1151341250:3660:H:v60-n52:singular:2.000000
> 385889:1:RESERVING:1151341250:3660:H:v60-n66:singular:2.000000
> 385889:1:RESERVING:1151341250:3660:Q:all.q@v60-n28:slots:2.000000
> 385889:1:RESERVING:1151341250:3660:Q:all.q@v60-n65:slots:2.000000
> 385889:1:RESERVING:1151341250:3660:Q:all.q@v60-n75:slots:1.000000
> 385889:1:RESERVING:1151341250:3660:Q:all.q@v60-n62:slots:1.000000
> 385889:1:RESERVING:1151341250:3660:Q:all.q@v60-n73:slots:2.000000
> 385889:1:RESERVING:1151341250:3660:Q:all.q@v60-n52:slots:2.000000
> 385889:1:RESERVING:1151341250:3660:Q:all.q@v60-n66:slots:2.000000
> ::::::::
> 385889:1:RESERVING:1151341265:3660:P:mpich:slots:12.000000
> 385889:1:RESERVING:1151341265:3660:G:global:ncpus_agerber:12.000000
> 385889:1:RESERVING:1151341265:3660:H:v60-n28:singular:2.000000
> 385889:1:RESERVING:1151341265:3660:H:v60-n65:singular:2.000000
> 385889:1:RESERVING:1151341265:3660:H:v60-n75:singular:1.000000
> 385889:1:RESERVING:1151341265:3660:H:v60-n62:singular:1.000000
> 385889:1:RESERVING:1151341265:3660:H:v60-n73:singular:2.000000
> 385889:1:RESERVING:1151341265:3660:H:v60-n52:singular:2.000000
> 385889:1:RESERVING:1151341265:3660:H:v60-n66:singular:2.000000
> 385889:1:RESERVING:1151341265:3660:Q:all.q@v60-n28:slots:2.000000
> 385889:1:RESERVING:1151341265:3660:Q:all.q@v60-n65:slots:2.000000
> 385889:1:RESERVING:1151341265:3660:Q:all.q@v60-n75:slots:1.000000
> 385889:1:RESERVING:1151341265:3660:Q:all.q@v60-n62:slots:1.000000
> 385889:1:RESERVING:1151341265:3660:Q:all.q@v60-n73:slots:2.000000
> 385889:1:RESERVING:1151341265:3660:Q:all.q@v60-n52:slots:2.000000
> 385889:1:RESERVING:1151341265:3660:Q:all.q@v60-n66:slots:2.000000
> 385865:1:STARTING:1151341250:3660:H:v60-n47:singular:1.000000
> 385865:1:STARTING:1151341250:3660:Q:all.q@v60-n47:slots:1.000000
> ::::::::
> 385865:1:RUNNING:1151341251:3660:H:v60-n47:singular:1.000000
> 385865:1:RUNNING:1151341251:3660:Q:all.q@v60-n47:slots:1.000000
> ::::::::
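>
> (A note on reading this file, as far as I understand the monitoring
> format: the schedule file is only written when MONITOR=1 is set in the
> "params" entry of the scheduler configuration, and each line has the
> layout
>
>     job:task:state:start_time:duration:level:object:resource:utilization
>
> with start_time in seconds since the epoch and duration in seconds. The
> 3660 above would then be our one-hour default_duration plus the
> scheduler's 60-second duration_offset.)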
> 
> 
> I suspect that this SGE behavior comes about as follows:
>
> SGE appears to reserve the processor resources that it expects to be released soonest, and it
> judges when a CPU will be released from the job's h_rt or s_rt, falling back on default_duration.
> However, in our cluster we do not require users to specify h_rt or s_rt, so the default_duration
> of one hour is used for every job. Consequently, if a serial job finishes early, e.g. after 10
> minutes, SGE has not reserved that CPU for the parallel job, and other serial jobs fill it the
> moment it is released. Conversely, when a long job occupies a CPU, e.g. for 2 days, SGE keeps
> expecting that CPU to be released within the hour and reserves it for the parallel job, so the
> reservation sits on a CPU that will not actually become free.
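>
> (To cross-check, the scheduler settings involved can be listed with
> qconf -ssconf and changed with qconf -msconf. The values below are only
> what I would expect for our setup:
>
>     $ qconf -ssconf | egrep "default_duration|max_reservation"
>     default_duration                  1:00:00
>     max_reservation                   1000
>
> so every job that requests no h_rt or s_rt is planned as a one-hour job.)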
> 
> 
> My suspicions may be wrong; it would be great if someone having the same problem could check this
> in their own SGE installation. If my suspicions are correct, I think this is an odd implementation
> of reservation, since a reservation should not be based only on the specified runtime.
> 
> 
> Cheers. 
> 
> 
> Best regards,
> 
> Sili(wesley) Huang
> 
> 
> Monday, June 26, 2006, 5:41:25 AM, you wrote:
> 
> 
> Andreas> Have you observed reservation behaviour via the 'schedule' file?
> 
> 
> Andreas> Andreas
> 
> 
> Andreas> On Fri, 23 Jun 2006, Brady Catherman wrote:
> 
> 
> >> Yes. If there is space they start fine. If they have reservation enabled, and
> >> they have a much higher priority than every other single-process job, they
> >> just sit at the top of the queue as if the reservation is not doing anything
> >> (max_reservations is currently set at 1000).
> 
> 
> 
> >> On Jun 23, 2006, at 2:07 PM, Reuti wrote:
> 
> 
> >>> On 23.06.2006 at 22:45, Brady Catherman wrote:
> 
> 
> >>>> I have done both of these and yet my clusters still hate parallel jobs.
> >>>> Does anybody have this working? Everything I have seen suggests that
> >>>> parallel jobs are always shunned by Grid Engine. I would appreciate any
> >>>> solutions to this being passed my way! =) I have been working on this on
> >>>> and off since January.
> 
> 
> >>> But if the cluster is empty, they are starting? - Reuti
> 
> 
> 
> 
> >>>> On Jun 23, 2006, at 11:46 AM, Reuti wrote:
> 
> 
> >>>>> Hi,
> 
> 
> >>>>> you submitted with "-R y" and adjusted the scheduler to "max_reservation
> >>>>> 20" or an appropriate value?
> 
> 
> >>>>> -- Reuti
> 
> 
> 
> >>>>> On 23.06.2006 at 18:31, Sili (wesley) Huang wrote:
> 
> 
> >>>>>> Hi Jean-Paul,
> 
> 
> 
> 
> >>>>>> I have a similar problem to yours in our cluster: the low-priority
> >>>>>> serial jobs still get loaded into the running state while the
> >>>>>> high-priority parallel jobs keep waiting. Did you figure out a
> >>>>>> solution to this problem? Does the upgrade help?
> 
> 
> 
> 
> >>>>>> Cheers.
> 
> 
> 
> 
> >>>>>> Best regards,
> 
> 
> >>>>>> Sili(wesley) Huang
> 
> 
> 
> 
> >>>>>> --
> >>>>>> mailto:shuang@unb.ca
> >>>>>> Scientific Computing Support
> >>>>>> Advanced Computational Research Laboratory
> >>>>>> University of New Brunswick
> >>>>>> Tel(office):  (506) 452-6348
> 
> 
> --
> mailto:shuang@unb.ca
> Scientific Computing Support
> Advanced Computational Research Laboratory
> University of New Brunswick
> Tel(office):  (506) 452-6348
> 
> 
>




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@gridengine.sunsource.net
For additional commands, e-mail: users-help@gridengine.sunsource.net


