[GE users] Problems with Advanced Reservations

reuti reuti at staff.uni-marburg.de
Mon Oct 11 18:32:39 BST 2010


On 11.10.2010 at 16:23, pablorey wrote:

>     Hi Reuti,
> 
>     Yes, when I submit jobs I always request the same parallel environment that was used to submit the AR (mpi_1p or mpi). The first test job always requests the same resources as the qrsub command. As that doesn't work, I then change the requirements (num_proc, s_rt, s_vmem, ...) or the number of slots, but I always use the PE requested in the qrsub command.

And the "mpi_1p" has a fixed allocation rule of 1 then?

For now I can't reproduce this. Can you force the execution with "-w n" instead of "-w v"?

Do you request any queues by default via an .sge_request file?
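
One way to check this is to look at the default request files. A minimal sketch, assuming the three locations described in sge_request(5) (cell-wide, home directory, current working directory) and assuming that SGE_CELL falls back to "default"; the function name show_default_requests is made up for illustration:

```shell
#!/bin/sh
# Print the non-comment, non-empty lines of every default request file found.
# File locations follow sge_request(5); the SGE_CELL fallback to "default"
# is an assumption about the local installation.
show_default_requests() {
    for f in "${SGE_ROOT:-/nonexistent}/${SGE_CELL:-default}/common/sge_request" \
             "$HOME/.sge_request" \
             "$PWD/.sge_request"; do
        if [ -r "$f" ]; then
            echo "== $f"
            # grep exits non-zero when nothing matches; that is not an error here
            grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$f" || true
        fi
    done
}
```

Run it as the submitting user from the directory where qsub is called; any -q line in the output would be a default queue request.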

-- Reuti


>     It is strange. The reservation seems to be made properly (the reserved nodes have no jobs on them), yet I cannot use them. I can only use the first reserved node, regardless of how many nodes are reserved.
> 
>     Regards,
>     Pablo
> 
> 
> 
> On 11/10/2010 15:42, reuti wrote:
>> On 11.10.2010 at 13:33, pablorey wrote:
>> 
>> 
>>>     Hi Reuti,
>>> 
>>>     These are the definitions of s_rt in the queue configuration and in the complex configuration:
>>> 
>>> prey at fs001:~> qconf -sq small_queue| grep s_rt
>>> s_rt                  1500:00:00
>>> prey at fs001:~> qconf -sq medium_queue| grep s_rt
>>> s_rt                  1500:00:00
>>> prey at fs001:~> qconf -sc | grep s_rt
>>> s_rt                s_rt       TIME        <=    FORCED      NO         0:0:0    -10
>>> 
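
With relop "<=", a request is satisfied when the requested value does not exceed what the queue offers. A rough sketch of that comparison for TIME values such as s_rt, assuming only the h:m:s and plain-seconds notations; hms_to_seconds and request_fits are hypothetical helper names, not SGE commands:

```shell
#!/bin/sh
# Convert an SGE TIME value (h:m:s, or plain seconds) to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ if (NF == 3) print $1 * 3600 + $2 * 60 + $3; else print $1 + 0 }'
}

# Succeed if a requested limit satisfies a "<=" complex, i.e. request <= offered.
request_fits() {
    [ "$(hms_to_seconds "$1")" -le "$(hms_to_seconds "$2")" ]
}
```

For example, request_fits 01:00:00 1500:00:00 succeeds, while request_fits 01:00:00 00:00:00 fails.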
>>>     We have done several tests changing the requested queue, the requested num_proc or the parallel environment, and we always have problems when we reserve more than one node (regardless of the number of requested slots). For example:
>>> 
>>>     * qrsub -l num_proc=1,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 10 -a 10112015 -d 11:00:00: One node is reserved (10 mpi slots on the same node) and I can submit any job.
>>> 
>>>     * qrsub -l num_proc=1,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi_1p 10 -a 10112015 -d 11:00:00: Ten nodes are reserved (1 mpi slot per node) and we can request only 1 mpi slot. We can submit as many jobs requesting 1 mpi slot as we want, but only 1 job runs at a time, and always on the same node (the first reserved node).
>>> 
>> Do you also request "mpi_1p" for your job submissions? I don't know whether this is related to the s_rt problem, but:
>> 
>> 
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=3249
>> 
>> 
>> -- Reuti
>> 
>> 
>> 
>>>     So there seems to be some kind of problem that doesn't let us use more than 1 node.
>>> 
>>>     Regards,
>>>     Pablo
>>> 
>>> 
>>> 
>>> On 11/10/2010 13:19, reuti wrote:
>>> 
>>>> Hi,
>>>> 
>>>> On 11.10.2010 at 12:53, pablorey wrote:
>>>> 
>>>> 
>>>> 
>>>>>     We are doing some tests with Advance Reservations (AR) and have run into some problems that we need to solve before we can start using them.
>>>>> 
>>>>>     We submitted the AR without problems and the nodes were reserved properly for the required time:
>>>>> 
>>>>> prey at fs001:~> qrsub -q small_queue*,medium_queue*,large_queue*,superdome* -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 10 -a 10111115 -d 3:00:00
>>>>> 
>>>>> prey at fs001:~> qrstat -ar 83
>>>>> --------------------------------------------------------------------------------
>>>>> id                             83
>>>>> name
>>>>> owner                          prey
>>>>> state                          r
>>>>> start_time                     10/11/2010 11:15:00
>>>>> end_time                       10/11/2010 14:15:00
>>>>> duration                       03:00:00
>>>>> submission_time                10/11/2010 11:12:38
>>>>> group                          root
>>>>> account                        sge
>>>>> resource_list                  num_proc=16, s_rt=3600, s_vmem=10G, h_fsize=20G
>>>>> granted_slots_list             small_queue at cn008.null=1,medium_queue at cn014.null=1,medium_queue at cn015.null=1,medium_queue at cn026.null=1,medium_queue at cn027.null=1,medium_queue at cn028.null=1,medium_queue at cn029.null=1,medium_queue at cn030.null=1,medium_queue at cn032.null=1,medium_queue at cn033.null=1
>>>>> granted_parallel_environment   mpi slots 10
>>>>> 
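
To see at a glance how many slots an AR was granted and where, the granted_slots_list value can be split on commas. A small sketch; summarize_granted_slots is a made-up helper name, and the list is expected on standard input as a single line:

```shell
#!/bin/sh
# Summarise a granted_slots_list value: one "queue_instance slots" line per
# entry, followed by a total. Entries are separated by commas and each entry
# has the form <queue instance>=<slots>.
summarize_granted_slots() {
    tr ',' '\n' | awk -F= '
        NF == 2 { print $1, $2; total += $2; n++ }
        END     { printf "total: %d slots in %d queue instances\n", total, n }'
}
```

Feeding it the list shown above gives one line per queue instance, which makes it easy to compare the reservation with what the scheduler later reports as usable.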
>>>>>     The problem appears when we submit a job associated with this AR:
>>>>> 
>>>>> prey at fs001:~> qsub.orig -w v -ar 83  -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 2 test2.sh
>>>>> Unable to run job: Job 2407116 cannot run in queue instance "all.q" because it was not reserved by advance reservation 83
>>>>> Job 2407116 cannot run in queue instance "meteogalicia_HP" because it was not reserved by advance reservation 83
>>>>> .....
>>>>> Job 2407116 cannot run in queue instance "failed_nodes" because it was not reserved by advance reservation 83
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn014.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn015.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn026.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn027.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn028.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn029.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn030.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn032.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue at cn033.null" because it offers only qf:s_rt=00:00:00
>>>>> Job 2407116 cannot run in PE "mpi" because it only offers 1 slots
>>>>> verification: no suitable queues.
>>>>> Exiting.
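
When the verification output is long, it can help to reduce it to the queue instances that were rejected because of an offered resource value. A minimal sketch, assuming each message arrives on a single line in the wording shown above; rejected_queues is a hypothetical helper name:

```shell
#!/bin/sh
# Read "qsub -w v" style messages on stdin and print, for every queue
# rejected because of an offered resource value, the queue instance and
# the value it offers. Messages with any other wording are ignored.
rejected_queues() {
    sed -n 's/.*cannot run in queue "\([^"]*\)" because it offers only \(.*\)$/\1 \2/p'
}

# Possible usage (not run here; stderr is merged in case the messages go there):
#   qsub -w v -ar 83 ... test2.sh 2>&1 | rejected_queues
```

Lines about queue instances that were simply not part of the reservation do not match the pattern and are dropped.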
>>>>> 
>>>>>     As you can see, the problem seems to be with 9 of the 10 reserved nodes (all of them in the same queue). We have tested requesting different s_rt values without success. We have also tested requesting different numbers of mpi slots. It only works when we request "-pe mpi 1", because one node (small_queue at cn008.null) seems to be reserved properly.
>>>>> 
>>>>>     Any ideas? What should we check?
>>>>> 
>>>>> 
>>>> What is the definition of s_rt in the queue configuration?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>> 
>>>> 
>>>>>     I am very sorry if this is a known issue. AR is something new for me.
>>>>> 
>>>>>     Thanks for reading.
>>>>> 
>>>>> -- 
>>>>> Pablo Rey Mayo
>>>>> Systems Technician
>>>>> Centro de Supercomputacion de Galicia (CESGA)
>>>>> Avda. de Vigo s/n (Campus Sur)
>>>>> 15705 Santiago de Compostela (Spain)
>>>>> Tel: +34 981 56 98 10 ext. 233; Fax: +34 981 59 46 16
>>>>> email: prey at cesga.es; http://www.cesga.es/
>>>>> 
>>>>> ------------------------------------------------
>>>>> NOTE: This message has been intentionally written without
>>>>> accents or special characters so that it can be displayed
>>>>> correctly in any mail client and on any system.
>>>>> ------------------------------------------------
>>>>> 
>>>>> 
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286485
>>>> 
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>> 
>> ------------------------------------------------------
>> 
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286489
>> 
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=286497

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


