[GE users] Problems with Advanced Reservations

pablorey prey at cesga.es
Mon Oct 11 12:33:47 BST 2010



    Hi Reuti,

    These are the definitions of s_rt in the queue configurations and in the complex configuration:

prey@fs001:~> qconf -sq small_queue | grep s_rt
s_rt                  1500:00:00
prey@fs001:~> qconf -sq medium_queue | grep s_rt
s_rt                  1500:00:00
prey@fs001:~> qconf -sc | grep s_rt
s_rt                s_rt       TIME        <=    FORCED      NO         0:0:0    -10
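
    For readers less familiar with the complex configuration, the line above can be read field by field. The following is only a minimal sketch, assuming the usual qconf -sc column order (name, shortcut, type, relop, requestable, consumable, default, urgency); the class and function names are hypothetical and not part of Grid Engine.

```python
from typing import NamedTuple


class ComplexEntry(NamedTuple):
    """One line of `qconf -sc` output (column order is an assumption)."""
    name: str
    shortcut: str
    type: str
    relop: str
    requestable: str
    consumable: str
    default: str
    urgency: str


def parse_complex_line(line: str) -> ComplexEntry:
    """Split a whitespace-separated complex definition into named fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    return ComplexEntry(*fields)


# The s_rt line exactly as printed by `qconf -sc | grep s_rt` above.
entry = parse_complex_line(
    "s_rt                s_rt       TIME        <=    FORCED      NO         0:0:0    -10"
)
print(entry.requestable, entry.default)
```

    Read this way, s_rt is a FORCED requestable here, with a default of 0:0:0.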

    We have run several tests, changing the requested queue, the requested num_proc, and the parallel environment, and we always have problems when we reserve more than one node (regardless of the number of requested slots). For example:

    * qrsub -l num_proc=1,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 10 -a 10112015 -d 11:00:00: One node is reserved (10 mpi slots on the same node) and I can submit any job.

    * qrsub -l num_proc=1,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi_1p 10 -a 10112015 -d 11:00:00: Ten nodes are reserved (1 mpi slot per node) and we can request only 1 mpi slot. We can submit as many jobs requesting 1 mpi slot as we want, but only one job runs at a time, always on the same node (the first reserved node).

    So there seems to be some kind of problem that prevents us from using more than one node.

    Regards,
    Pablo



On 11/10/2010 13:19, reuti wrote:

Hi,

On 11.10.2010 at 12:53, pablorey wrote:



    We are testing Advance Reservations (AR) and have run into some problems that we need to solve before we can start using them.

    We submitted the AR without problems and the nodes were reserved properly for the requested time:

prey@fs001:~> qrsub -q small_queue*,medium_queue*,large_queue*,superdome* -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 10 -a 10111115 -d 3:00:00

prey@fs001:~> qrstat -ar 83
--------------------------------------------------------------------------------
id                             83
name
owner                          prey
state                          r
start_time                     10/11/2010 11:15:00
end_time                       10/11/2010 14:15:00
duration                       03:00:00
submission_time                10/11/2010 11:12:38
group                          root
account                        sge
resource_list                  num_proc=16, s_rt=3600, s_vmem=10G, h_fsize=20G
granted_slots_list             small_queue@cn008.null=1,medium_queue@cn014.null=1,medium_queue@cn015.null=1,medium_queue@cn026.null=1,medium_queue@cn027.null=1,medium_queue@cn028.null=1,medium_queue@cn029.null=1,medium_queue@cn030.null=1,medium_queue@cn032.null=1,medium_queue@cn033.null=1
granted_parallel_environment   mpi slots 10
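
    As a cross-check of the output above, the -a and -d values given to qrsub can be mapped to the reported start_time and end_time, and the requested s_rt=01:00:00 to the s_rt=3600 shown in resource_list. This is only a rough sketch, assuming -a is read as MMDDhhmm in the current year and -d as h:m:s; the helper names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def hms_to_seconds(value: str) -> int:
    """Convert an h:m:s string such as '01:00:00' to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def ar_window(start: str, duration: str, year: int) -> Tuple[datetime, datetime]:
    """Return (start, end) for an MMDDhhmm start and an h:m:s duration.

    The year is passed explicitly because the short form omits it.
    """
    begin = datetime.strptime(f"{year}{start}", "%Y%m%d%H%M")
    return begin, begin + timedelta(seconds=hms_to_seconds(duration))


# Values taken from the qrsub call above: -a 10111115 -d 3:00:00
begin, end = ar_window("10111115", "3:00:00", year=2010)
print(begin, end)
print(hms_to_seconds("01:00:00"))
```

    For these inputs the window is 11:15 to 14:15 on 11 October 2010 and the s_rt request is 3600 seconds, which matches the qrstat output.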

    The problem appears when we try to submit a job to this AR:

prey@fs001:~> qsub.orig -w v -ar 83 -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 2 test2.sh
Unable to run job: Job 2407116 cannot run in queue instance "all.q" because it was not reserved by advance reservation 83
Job 2407116 cannot run in queue instance "meteogalicia_HP" because it was not reserved by advance reservation 83
.....
Job 2407116 cannot run in queue instance "failed_nodes" because it was not reserved by advance reservation 83
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn014.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn015.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn026.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn027.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn028.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn029.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn030.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn032.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn033.null" because it offers only qf:s_rt=00:00:00
Job 2407116 cannot run in PE "mpi" because it only offers 1 slots
verification: no suitable queues.
Exiting.
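
    With this many messages it can help to group the rejected queues by the reason given. The following is a small sketch, assuming the message wording shown above; the sample is a shortened copy of that output (queue instances written with "@") and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Shortened copy of the `qsub -w v` verification output shown above.
SAMPLE = '''\
Job 2407116 cannot run in queue instance "all.q" because it was not reserved by advance reservation 83
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn014.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn015.null" because it offers only qf:s_rt=00:00:00
Job 2407116 cannot run in PE "mpi" because it only offers 1 slots
'''

# Captures the quoted object name and the reason after "because".
PATTERN = re.compile(r'cannot run in (?:queue instance|queue|PE) "([^"]+)" because (.+)$')


def group_by_reason(text):
    """Map each rejection reason to the list of queues/PEs it was reported for."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            name, reason = match.groups()
            groups[reason].append(name)
    return dict(groups)


groups = group_by_reason(SAMPLE)
for reason, names in groups.items():
    print(f"{len(names):2d} x {reason}: {', '.join(names)}")
```

    Applied to the full output, this would put the nine medium_queue instances together under the "offers only qf:s_rt=00:00:00" reason.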

    As you can see, the problem seems to be with 9 of the 10 reserved nodes (all of them in the same queue). We have tried requesting different s_rt values without success. We have also tried requesting different numbers of mpi slots. It only works when we request "-pe mpi 1", because one node (small_queue@cn008.null) seems to be reserved properly.

    Any ideas? What should we check?



What is the definition of s_rt in the queue configuration?

-- Reuti





    I am very sorry if this is a known issue. ARs are new to me.

    Thanks for reading.

--
Pablo Rey Mayo
Tecnico de Sistemas
Centro de Supercomputacion de Galicia (CESGA)
Avda. de Vigo s/n (Campus Sur)
15705 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ext. 233; Fax: +34 981 59 46 16
email: prey at cesga.es<mailto:prey at cesga.es>; http://www.cesga.es/
------------------------------------------------
NOTE: This message has been intentionally written without
accents or special characters, so that it can be displayed
correctly in any mail client and on any system.
------------------------------------------------





