[GE users] Advance reservation strange behavior

Sili (wesley) Huang shuang at unb.ca
Mon Jun 26 18:38:08 BST 2006




Hi Andreas,


I tried to observe what was going on with this strange behavior. It seems to me
that the reservation is tied to the specified run length of a job. For example,
in this monitoring record (385889 is a parallel job with reservation enabled
and high priority, and 385865 is a serial job):


[root common]#  cat schedule | egrep "385865|385889|::::::::"

::::::::

385889:1:RESERVING:1151341235:3660:P:mpich:slots:12.000000

385889:1:RESERVING:1151341235:3660:G:global:ncpus_agerber:12.000000

385889:1:RESERVING:1151341235:3660:H:v60-n28:singular:2.000000

385889:1:RESERVING:1151341235:3660:H:v60-n65:singular:2.000000

385889:1:RESERVING:1151341235:3660:H:v60-n75:singular:1.000000

385889:1:RESERVING:1151341235:3660:H:v60-n66:singular:2.000000

385889:1:RESERVING:1151341235:3660:H:v60-n31:singular:2.000000

385889:1:RESERVING:1151341235:3660:H:v60-n62:singular:1.000000

385889:1:RESERVING:1151341235:3660:H:v60-n15:singular:2.000000

385889:1:RESERVING:1151341235:3660:Q:all.q at v60-n28:slots:2.000000

385889:1:RESERVING:1151341235:3660:Q:all.q at v60-n65:slots:2.000000

385889:1:RESERVING:1151341235:3660:Q:all.q at v60-n75:slots:1.000000

385889:1:RESERVING:1151341235:3660:Q:all.q at v60-n31:slots:2.000000

385889:1:RESERVING:1151341235:3660:Q:all.q at v60-n66:slots:2.000000

385889:1:RESERVING:1151341235:3660:Q:all.q at v60-n62:slots:1.000000

385889:1:RESERVING:1151341235:3660:Q:all.q at v60-n15:slots:2.000000

::::::::

385889:1:RESERVING:1151341250:3660:P:mpich:slots:12.000000

385889:1:RESERVING:1151341250:3660:G:global:ncpus_agerber:12.000000

385889:1:RESERVING:1151341250:3660:H:v60-n28:singular:2.000000

385889:1:RESERVING:1151341250:3660:H:v60-n65:singular:2.000000

385889:1:RESERVING:1151341250:3660:H:v60-n75:singular:1.000000

385889:1:RESERVING:1151341250:3660:H:v60-n62:singular:1.000000

385889:1:RESERVING:1151341250:3660:H:v60-n73:singular:2.000000

385889:1:RESERVING:1151341250:3660:H:v60-n52:singular:2.000000

385889:1:RESERVING:1151341250:3660:H:v60-n66:singular:2.000000

385889:1:RESERVING:1151341250:3660:Q:all.q at v60-n28:slots:2.000000

385889:1:RESERVING:1151341250:3660:Q:all.q at v60-n65:slots:2.000000

385889:1:RESERVING:1151341250:3660:Q:all.q at v60-n75:slots:1.000000

385889:1:RESERVING:1151341250:3660:Q:all.q at v60-n62:slots:1.000000

385889:1:RESERVING:1151341250:3660:Q:all.q at v60-n73:slots:2.000000

385889:1:RESERVING:1151341250:3660:Q:all.q at v60-n52:slots:2.000000

385889:1:RESERVING:1151341250:3660:Q:all.q at v60-n66:slots:2.000000

::::::::

385889:1:RESERVING:1151341265:3660:P:mpich:slots:12.000000

385889:1:RESERVING:1151341265:3660:G:global:ncpus_agerber:12.000000

385889:1:RESERVING:1151341265:3660:H:v60-n28:singular:2.000000

385889:1:RESERVING:1151341265:3660:H:v60-n65:singular:2.000000

385889:1:RESERVING:1151341265:3660:H:v60-n75:singular:1.000000

385889:1:RESERVING:1151341265:3660:H:v60-n62:singular:1.000000

385889:1:RESERVING:1151341265:3660:H:v60-n73:singular:2.000000

385889:1:RESERVING:1151341265:3660:H:v60-n52:singular:2.000000

385889:1:RESERVING:1151341265:3660:H:v60-n66:singular:2.000000

385889:1:RESERVING:1151341265:3660:Q:all.q at v60-n28:slots:2.000000

385889:1:RESERVING:1151341265:3660:Q:all.q at v60-n65:slots:2.000000

385889:1:RESERVING:1151341265:3660:Q:all.q at v60-n75:slots:1.000000

385889:1:RESERVING:1151341265:3660:Q:all.q at v60-n62:slots:1.000000

385889:1:RESERVING:1151341265:3660:Q:all.q at v60-n73:slots:2.000000

385889:1:RESERVING:1151341265:3660:Q:all.q at v60-n52:slots:2.000000

385889:1:RESERVING:1151341265:3660:Q:all.q at v60-n66:slots:2.000000

385865:1:STARTING:1151341250:3660:H:v60-n47:singular:1.000000

385865:1:STARTING:1151341250:3660:Q:all.q at v60-n47:slots:1.000000

::::::::

385865:1:RUNNING:1151341251:3660:H:v60-n47:singular:1.000000

385865:1:RUNNING:1151341251:3660:Q:all.q at v60-n47:slots:1.000000

::::::::
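For reference, each line of the schedule file appears to be a colon-separated
record; the field meanings below are inferred from this excerpt (job id, task
id, state, start time as a Unix epoch, assumed duration in seconds, level
G/H/P/Q, object, resource, amount), not taken from official documentation. A
minimal bash sketch that decodes one such line:

```shell
# Decode one line of the scheduler's "schedule" file.
# Field layout inferred from the excerpt above (an assumption, not official):
#   job:task:state:start_epoch:duration_s:level:object:resource:amount
line="385889:1:RESERVING:1151341235:3660:P:mpich:slots:12.000000"

IFS=: read -r job task state start dur level object resource amount <<< "$line"

echo "job $job (task $task) is $state"
echo "start=$start (epoch), assumed duration=${dur}s"
echo "level=$level object=$object resource=$resource amount=$amount"
```

Note that the 3660-second duration in every record is a one-hour
default_duration plus a small margin, which fits the suspicion described
further on.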


I suspect the SGE behavior arises for the following reason:


It seems to me that SGE tries to reserve the processor resources that it
expects to be released soonest. SGE decides which CPUs to reserve based on
h_rt, s_rt, or, failing those, default_duration. However, in our cluster we do
not require users to specify h_rt or s_rt, so the default_duration of one hour
is used for every job. Therefore, if a serial job finishes very quickly, e.g.
in 10 minutes, SGE has not reserved that CPU for the pending reservation, and
new serial jobs fill the CPU the moment it is released. The same applies when
a long job occupies a CPU for, e.g., 2 days: SGE keeps expecting that CPU to
be released within the hour and reserves it for the reservation.


My suspicion may be wrong. It would be great if someone having the same problem
could check this in their own SGE installation. If my suspicion is correct, I
think this is an odd implementation of reservation, since reservations should
not be based only on the specified runtime.
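If that is what is happening, the obvious mitigation is to give the scheduler
accurate run lengths. The sketch below is illustrative only (the duration
values and job script name are assumptions, and it needs a live SGE cell, so
it is a config/CLI fragment rather than something runnable here):

```shell
# Illustrative sketch; requires a live SGE installation.
# Scheduler side (edit with: qconf -msconf):
#   max_reservation    20        # allow resource reservation at all
#   default_duration   1:00:00   # assumed runtime when no h_rt/s_rt is given
#
# Submission side: reserve resources AND state the expected run length,
# and have serial users pass h_rt too, so the scheduler's release-time
# estimates match reality:
qsub -R y -l h_rt=1:00:00 -pe mpich 12 job.sh
```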


Cheers. 


Best regards,

Sili(wesley) Huang


Monday, June 26, 2006, 5:41:25 AM, you wrote:


Andreas> Have you observed reservation behaviour via the 'schedule' file?


Andreas> Andreas


Andreas> On Fri, 23 Jun 2006, Brady Catherman wrote:


>> Yes. If there is space they start fine. If they have reservation enabled, and

>> they have a much higher priority than every other single-process job, they

>> just sit at the top of the queue as if the reservation is not doing anything

>> (max_reservations is currently set at 1000).



>> On Jun 23, 2006, at 2:07 PM, Reuti wrote:


>>> Am 23.06.2006 um 22:45 schrieb Brady Catherman:


>>>> I have done both of these, and yet my clusters still hate parallel jobs.

>>>> Does anybody have this working? Everything I have seen suggests that

>>>> parallel jobs are always shunned by Grid Engine. I would appreciate any

>>>> solutions being passed my way! =) I have been working on this on and off

>>>> since January.


>>> But if the cluster is empty, they are starting? - Reuti




>>>> On Jun 23, 2006, at 11:46 AM, Reuti wrote:


>>>>> Hi,


>>>>> you submitted with "-R y" and adjusted the scheduler to "max_reservation 20"

>>>>> or an appropriate value?


>>>>> -- Reuti



>>>>> Am 23.06.2006 um 18:31 schrieb Sili (wesley) Huang:


>>>>>> Hi Jean-Paul,




>>>>>> I have a similar problem in our cluster: the low-priority

>>>>>> serial jobs still get loaded into the run state and the high-priority

>>>>>> parallel jobs keep waiting. Did you figure out a solution to this

>>>>>> problem? Does the upgrade help?




>>>>>> Cheers.




>>>>>> Best regards,


>>>>>> Sili(wesley) Huang




>>>>>> --


>>>>>> mailto:shuang at unb.ca


>>>>>> Scientific Computing Support


>>>>>> Advanced Computational Research Laboratory


>>>>>> University of New Brunswick


>>>>>> Tel(office):  (506) 452-6348


>>>>>> ---------------------------------------------------------------------

>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net

>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
















--

mailto:shuang at unb.ca

Scientific Computing Support

Advanced Computational Research Laboratory

University of New Brunswick

Tel(office):  (506) 452-6348
