[GE users] Problems with Advanced Reservations

pablorey prey at cesga.es
Mon Oct 11 11:53:14 BST 2010


    Hi,

    We are running some tests with Advance Reservations (AR) and have hit a problem that we need to solve before we can start using them.

    We submitted the AR without problems and the nodes were reserved properly for the required time:

prey@fs001:~> qrsub -q small_queue*,medium_queue*,large_queue*,superdome* -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 10 -a 10111115 -d 3:00:00

prey@fs001:~> qrstat -ar 83
--------------------------------------------------------------------------------
id                             83
name
owner                          prey
state                          r
start_time                     10/11/2010 11:15:00
end_time                       10/11/2010 14:15:00
duration                       03:00:00
submission_time                10/11/2010 11:12:38
group                          root
account                        sge
resource_list                  num_proc=16, s_rt=3600, s_vmem=10G, h_fsize=20G
granted_slots_list             small_queue@cn008.null=1,medium_queue@cn014.null=1,medium_queue@cn015.null=1,medium_queue@cn026.null=1,medium_queue@cn027.null=1,medium_queue@cn028.null=1,medium_queue@cn029.null=1,medium_queue@cn030.null=1,medium_queue@cn032.null=1,medium_queue@cn033.null=1
granted_parallel_environment   mpi slots 10
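
    For reference, the -a argument of qrsub is read as [[CC]YY]MMDDhhmm[.SS], so 10111115 means October 11 at 11:15, which matches the start_time shown above. A minimal sketch for producing such a value, assuming GNU date is available (ar_start is a hypothetical helper name, not part of Grid Engine):

```shell
# Hypothetical helper: convert a human-readable start time into the
# MMDDhhmm form accepted by "qrsub -a".
# Assumption: GNU date (the "-d" option is not portable to BSD date).
ar_start() {
    date -d "$1" +%m%d%H%M
}

# Start time used for AR 83 in this message.
ar_start '2010-10-11 11:15'
```

    The result could then be passed as "qrsub ... -a $(ar_start '2010-10-11 11:15') -d 3:00:00".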

    The problem appears when we try to submit a job associated with this AR:

prey@fs001:~> qsub.orig -w v -ar 83  -l num_proc=16,s_rt=01:00:00,s_vmem=10G,h_fsize=20G -pe mpi 2 test2.sh
Unable to run job: Job 2407116 cannot run in queue instance "all.q" because it was not reserved by advance reservation 83
Job 2407116 cannot run in queue instance "meteogalicia_HP" because it was not reserved by advance reservation 83
.....
Job 2407116 cannot run in queue instance "failed_nodes" because it was not reserved by advance reservation 83
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn014.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn015.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn026.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn027.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn028.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn029.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn030.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn032.null" because it offers only qf:s_rt=00:00:00
Job 2407116 (-l h_fsize=20G,num_proc=16,s_rt=3600,s_vmem=10G) cannot run in queue "medium_queue@cn033.null" because it offers only qf:s_rt=00:00:00
Job 2407116 cannot run in PE "mpi" because it only offers 1 slots
verification: no suitable queues.
Exiting.

    As you can see, the problem seems to affect 9 of the 10 reserved nodes (all of them in the same queue, medium_queue): each one is rejected because it offers only s_rt=00:00:00. We have tried requesting different s_rt values without success. We have also tried requesting different numbers of mpi slots. It only works when we request "-pe mpi 1", because one node (small_queue@cn008.null) seems to be reserved properly.
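
    Since the verification output is long, the rejection reasons can be grouped and counted to see the pattern at a glance. This is only a sketch using standard awk and sort; summarize_rejections and the file name verify.log are hypothetical, and it assumes the output of "qsub -w v" has been saved to that file:

```shell
# Hypothetical helper: count how many queue instances were rejected for
# each distinct reason in saved "qsub -w v" output.
# Only lines containing "cannot run in queue" are considered; the text
# after " because " is used as the reason.
summarize_rejections() {
    awk -F' because ' '
        /cannot run in queue/ { count[$2]++ }
        END { for (r in count) printf "%d\t%s\n", count[r], r }
    ' "$1" | sort -rn
}

# Usage (assuming the output was saved first):
#   qsub.orig -w v -ar 83 ... test2.sh > verify.log 2>&1
#   summarize_rejections verify.log
```

    Each output line is a count followed by a reason, most frequent first, which separates the queues that were simply not part of the reservation from the reserved ones rejected on s_rt.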

    Any ideas? What should we check?

    I am very sorry if this is a known issue. AR is new to me.

    Thanks for reading.

--
Pablo Rey Mayo
Tecnico de Sistemas
Centro de Supercomputacion de Galicia (CESGA)
Avda. de Vigo s/n (Campus Sur)
15705 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ext. 233; Fax: +34 981 59 46 16
email: prey at cesga.es; http://www.cesga.es/
------------------------------------------------
NOTE: This message has been intentionally written without
accents or special characters, so that it can be displayed
correctly in any mail client and system.
------------------------------------------------

More information about the gridengine-users mailing list