[GE users] Problems with Advanced Reservations

reuti reuti at staff.uni-marburg.de
Mon Oct 11 14:42:22 BST 2010


Am 11.10.2010 um 13:33 schrieb pablorey:

>     Hi Reuti,
> 
>     This is the definition of s_rt in the queue and complex definition:
> 
> prey at fs001:~> qconf -sq small_queue| grep s_rt
> s_rt                  1500:00:00
> prey at fs001:~> qconf -sq medium_queue| grep s_rt
> s_rt                  1500:00:00
> prey at fs001:~> qconf -sc | grep s_rt
> s_rt                s_rt       TIME        <=    FORCED      NO         0:0:0    -10
> 
>     We have done several tests, changing the requested queue, the requested num_proc, and the parallel environment used, and we always run into problems when we reserve more than one node (independently of the number of requested slots). For example:
> 
> * qrsub -l num_proc=1,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 10 -a 10112015 -d 11:00:00: One node is reserved (10 mpi slots on the same node) and I can submit jobs against it without problems.
> 
> * qrsub -l num_proc=1,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi_1p 10 -a 10112015 -d 11:00:00: Ten nodes are reserved (1 mpi slot per node), but we can request only 1 mpi slot. We can submit as many 1-slot mpi jobs as we want, but only one job runs at a time, and always on the same node (the first reserved node).

Do you also request "mpi_1p" for your job submissions? I don't know whether this is related to the s_rt problem, but:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3249
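As a side note (not from the thread itself), when debugging how an AR distributed its slots it can help to tally the granted_slots_list from `qrstat -ar` per host. A minimal sketch, assuming the comma-separated "queue@host=slots" format shown in the output quoted below (the helper name is hypothetical, not an SGE command):

```shell
#!/bin/sh
# Hypothetical helper, not part of SGE: sum the reserved slots per host
# from a granted_slots_list value such as
#   small_queue@cn008.null=1,medium_queue@cn014.null=1,...
count_slots_per_host() {
    tr ',' '\n' |
        awk -F'[@=]' '{ sum[$2] += $3 } END { for (h in sum) print h, sum[h] }' |
        sort
}

# Example, feeding a list by hand; normally you would take the line
# following "granted_slots_list" in the `qrstat -ar <id>` output:
printf '%s' 'small_queue@cn008.null=1,medium_queue@cn014.null=1' |
    count_slots_per_host
# prints:
#   cn008.null 1
#   cn014.null 1
```

This makes it easy to see at a glance whether a PE with a fixed allocation rule (like mpi_1p) really spread the reservation across hosts as intended.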

-- Reuti


>     So there seems to be some kind of problem that doesn't let us use more than one node.
> 
>     Regards,
>     Pablo
> 
> 
> 
> On 11/10/2010 13:19, reuti wrote:
>> Hi,
>> 
>> Am 11.10.2010 um 12:53 schrieb pablorey:
>> 
>> 
>>>     We are doing some tests with Advanced Reservations and have run into some problems that we need to solve before we can start using them.
>>> 
>>>     We submitted the AR without problems and the nodes were reserved properly for the required time:
>>> 
>>> prey at fs001:~> qrsub -q small_queue*,medium_queue*,large_queue*,superdome* -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 10 -a 10111115 -d 3:00:00
>>> 
>>> prey at fs001:~> qrstat -ar 83
>>> --------------------------------------------------------------------------------
>>> id                             83
>>> name
>>> owner                          prey
>>> state                          r
>>> start_time                     10/11/2010 11:15:00
>>> end_time                       10/11/2010 14:15:00
>>> duration                       03:00:00
>>> submission_time                10/11/2010 11:12:38
>>> group                          root
>>> account                        sge
>>> resource_list                  num_proc=16, s_rt=3600, s_vmem=10G, h_fsize=20G
>>> granted_slots_list             
>>> small_queue at cn008.null=1,medium_queue at cn014.null=1,medium_queue at cn015.null=1,medium_queue at cn026.null=1,medium_queue at cn027.null=1,medium_queue at cn028.null=1,medium_queue at cn029.null=1,medium_queue at cn030.null=1,medium_queue at cn032.null=1,medium_queue at cn033.null=1
>>> 
>>> granted_parallel_environment   mpi slots 10
>>> 
>>>     The problem appears when we try to submit a job associated with this AR:
>>> 
>>> prey at fs001:~> qsub.orig -w v -ar 83  -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 2 test2.sh
>>> Unable to run job: Job 2407116 cannot run in queue instance "all.q" because it was not reserved by advance reservation 83
>>> Job 2407116 cannot run in queue instance "meteogalicia_HP" because it was not reserved by advance reservation 83
>>> .....
>>> Job 2407116 cannot run in queue instance "failed_nodes" because it was not reserved by advance reservation 83
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn014.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn015.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn026.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn027.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn028.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn029.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn030.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn032.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue 
>>> "medium_queue at cn033.null"
>>>  because it offers only qf:s_rt=00:00:00
>>> Job 2407116 cannot run in PE "mpi" because it only offers 1 slots
>>> verification: no suitable queues.
>>> Exiting.
>>> 
>>>     As you can see, the problem seems to affect 9 of the 10 reserved nodes (all of them in the same queue). We have tested requesting different s_rt values without success. We have also tested requesting different numbers of mpi slots. It only works when we request "-pe mpi 1", because one node (small_queue at cn008.null) seems to be reserved properly.
>>> 
>>>     Any ideas? What should we check?
>>> 
>> What is the definition of s_rt in the queue configuration?
>> 
>> -- Reuti
>> 
>> 
>> 
>>>     I am very sorry if this is a known issue. Advance reservations are new to me.
>>> 
>>>     Thanks for reading.
>>> 
>>> -- 
>>> Pablo Rey Mayo
>>> Tecnico de Sistemas
>>> Centro de Supercomputacion de Galicia (CESGA)
>>> Avda. de Vigo s/n (Campus Sur)
>>> 15705 Santiago de Compostela (Spain)
>>> Tel: +34 981 56 98 10 ext. 233; Fax: +34 981 59 46 16
>>> email: prey at cesga.es; http://www.cesga.es/
>>> 
>>> ------------------------------------------------
>>> NOTE: This message has been intentionally written without accents
>>> or special characters, so that it displays correctly in any mail
>>> client and system.
>>> ------------------------------------------------
>>> 
>> ------------------------------------------------------
>> 
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286485
>> 
>> 
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>> 
>> 
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286489

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list