[GE users] master node selection and $fill_up behaviour

andy andreas.schwierskott at oracle.com
Thu Jul 22 08:50:48 BST 2010



Hi,

> On Tue, Jul 20, 2010 at 04:06:55PM +0200, Michael Weiser wrote:
>
>> With SGE 6.2 this seems to have changed: As long as the cluster is
>> completely empty, the old behaviour is still followed. But if there are
>> jobs already, slot allocation becomes erratic. SGE seems to prefer
>> filling up already used machines before re-using free machines.
>
> Just now, I was able to produce the behaviour with only three jobs on
> completely unloaded machines. It needed two tries, though.
>
> If it's relevant: queue_sort_method is seq_no, but the seq_no of all
> queue instances is 0.

If queue_sort_method is seqno, the second sort criterion is the load
(as computed by the load_formula). The load adjustments
(job_load_adjustments) still apply on top of that.
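
As a rough illustration of that two-level ordering (a sketch only, not
SGE's actual code), here is how it could be modelled in Python. The host
names and load values are taken from the qhost output quoted below; the
per-job adjustment of 0.5, the recent_jobs field and the adjusted_load
helper are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class QueueInstance:
    host: str
    seq_no: int
    load: float           # load value reported for the host
    recent_jobs: int = 0  # hypothetical: jobs started recently on the host


# Assumed value for illustration; the real one comes from the scheduler
# configuration.
ASSUMED_ADJUSTMENT = 0.5


def adjusted_load(qi: QueueInstance) -> float:
    """Reported load plus an assumed adjustment per recently started job."""
    return qi.load + ASSUMED_ADJUSTMENT * qi.recent_jobs


def sort_queue_instances(instances):
    """Sort by seq_no first, then by adjusted load as the tie-breaker."""
    return sorted(instances, key=lambda qi: (qi.seq_no, adjusted_load(qi)))


instances = [
    QueueInstance("l5-node01", 0, 0.01, recent_jobs=1),
    QueueInstance("l5-node02", 0, 0.03),
    QueueInstance("l5-node04", 0, 0.00, recent_jobs=1),
    QueueInstance("l5-node05", 0, 0.00),
]

order = [qi.host for qi in sort_queue_instances(instances)]
print(order)
```

With all seq_no values equal, the order is decided entirely by the
adjusted load, so a host with a freshly started job can rank behind an
idle host even though both show nearly the same load in qhost.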

Without looking at the complete picture it may indeed look erratic.

SGE internally uses load values with a few more digits than you see in
qstat/qhost output (do a qconf -se <hostname> to see them). That is
another factor that may make the scheduling decisions look erratic.
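
To make the precision point concrete, here is a small Python sketch. The
full-precision load values are invented for the example (they are not
taken from a real qconf -se run); only the host names come from the
output quoted below.

```python
# Hypothetical full-precision load values, as the scheduler might see them.
full_precision = {
    "l5-node04": 0.004,
    "l5-node05": 0.001,
    "l5-node07": 0.0049,
}

# What a two-decimal display would show for each host.
displayed = {host: f"{value:.2f}" for host, value in full_precision.items()}

# Ranking by the full-precision values, lowest load first.
ranked = sorted(full_precision, key=full_precision.get)

print(displayed)
print(ranked)
```

All three hosts are displayed as 0.00, yet the hidden digits still give
them a definite order, so the choice between them is not random even
though it looks that way from the rounded output.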


Andy



>
> scmic at l5-auto-du ~ $ qconf -ssconf | grep queue_sort
> queue_sort_method                 seqno
> scmic at l5-auto-du ~ $ qstat -F | grep seq_no | sort -u
>          qf:seq_no=0
>
> Submitting three jobs, which get their own machines to run:
>
> scmic at l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
> Your job 154 ("STDIN") has been submitted
> scmic at l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
> Your job 155 ("STDIN") has been submitted
> scmic at l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
> Your job 156 ("STDIN") has been submitted
> scmic at l5-auto-du ~ $ qhost -j
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> l5-auto-du              -               -     -       -       -       -       -
> l5-node01               lx26-amd64      4  0.02    7.9G  380.7M    9.8G     0.0
>     job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
>     ----------------------------------------------------------------------------------------------
>         156 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
> l5-node02               lx26-amd64      4  0.03    7.9G  377.2M    9.8G     0.0
> l5-node03               lx26-amd64      4  0.01    7.9G  385.0M    9.8G     0.0
> l5-node04               lx26-amd64      4  0.00    7.9G  386.7M    9.8G     0.0
>         155 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
> l5-node05               lx26-amd64      4  0.00    7.9G  384.5M    9.8G     0.0
> l5-node06               lx26-amd64      4  0.01    7.9G  390.6M    9.8G     0.0
> l5-node07               lx26-amd64      4  0.00    7.9G  387.8M    9.8G     0.0
> l5-node08               lx26-amd64      4  0.01    7.9G  373.0M    9.8G     0.0
>         154 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
>
> Deleting one and submitting a new one, which still gets its own machine:
>
> scmic at l5-auto-du ~ $ qdel "154"
> scmic has registered the job 154 for deletion
> scmic at l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
> Your job 157 ("STDIN") has been submitted
> scmic at l5-auto-du ~ $ qhost -j
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> l5-auto-du              -               -     -       -       -       -       -
> l5-node01               lx26-amd64      4  0.01    7.9G  383.5M    9.8G     0.0
>     job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
>     ----------------------------------------------------------------------------------------------
>         156 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
> l5-node02               lx26-amd64      4  0.03    7.9G  377.2M    9.8G     0.0
> l5-node03               lx26-amd64      4  0.05    7.9G  383.0M    9.8G     0.0
> l5-node04               lx26-amd64      4  0.00    7.9G  386.7M    9.8G     0.0
>         155 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
> l5-node05               lx26-amd64      4  0.00    7.9G  384.5M    9.8G     0.0
> l5-node06               lx26-amd64      4  0.04    7.9G  388.7M    9.8G     0.0
> l5-node07               lx26-amd64      4  0.00    7.9G  387.8M    9.8G     0.0
>         157 0.55500 STDIN      scmic        r     07/21/2010 09:26:11 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
> l5-node08               lx26-amd64      4  0.02    7.9G  370.7M    9.8G     0.0
>
> Same thing again, but this time, the job's master slot is put on a
> machine that already has a job on it:
>
> scmic at l5-auto-du ~ $ qdel "157"
> scmic has registered the job 157 for deletion
> scmic at l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
> Your job 158 ("STDIN") has been submitted
> scmic at l5-auto-du ~ $ qhost -j
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> l5-auto-du              -               -     -       -       -       -       -
> l5-node01               lx26-amd64      4  0.01    7.9G  383.5M    9.8G     0.0
>     job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
>     ----------------------------------------------------------------------------------------------
>         156 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
> l5-node02               lx26-amd64      4  0.03    7.9G  377.2M    9.8G     0.0
> l5-node03               lx26-amd64      4  0.05    7.9G  383.0M    9.8G     0.0
> l5-node04               lx26-amd64      4  0.00    7.9G  386.7M    9.8G     0.0
>         155 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
>                                                                       express-dm SLAVE
>                                                                       express-dm SLAVE
>         158 0.55500 STDIN      scmic        r     07/21/2010 09:26:22 express-dm MASTER
> l5-node05               lx26-amd64      4  0.00    7.9G  384.5M    9.8G     0.0
>         158 0.55500 STDIN      scmic        r     07/21/2010 09:26:22 express-dm SLAVE
>                                                                       express-dm SLAVE
> l5-node06               lx26-amd64      4  0.04    7.9G  388.7M    9.8G     0.0
> l5-node07               lx26-amd64      4  0.00    7.9G  387.8M    9.8G     0.0
> l5-node08               lx26-amd64      4  0.02    7.9G  370.7M    9.8G     0.0
>
> Thanks,

--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - -
Andy Schwierskott                      Tel: +49 (0)941 3075-200 (x60200)
Manager Oracle Grid Engine Engineering Fax: +49 (0)941 3075-222 (x60222)
ORACLE Deutschland B.V. & Co. KG
Dr.-Leo-Ritter-Str. 7 
mailto:andreas.schwierskott at oracle.com
D-93049 Regensburg                     http://www.sun.com/sge
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - -
