[GE users] master node selection and $fill_up behaviour

weiser m.weiser at science-computing.de
Wed Jul 21 08:34:11 BST 2010


Hello,

On Tue, Jul 20, 2010 at 04:06:55PM +0200, Michael Weiser wrote:

> With SGE 6.2 this seems to have changed: As long as the cluster is
> completely empty, the old behaviour is still followed. But if there are
> jobs already, slot allocation becomes erratic. SGE seems to prefer
> filling up already used machines before re-using free machines.

Just now, I was able to reproduce the behaviour with only three jobs on
completely unloaded machines. It took two tries, though.

If it's relevant: queue_sort_method is seqno, but the seq_no of all
queue instances is 0.

scmic@l5-auto-du ~ $ qconf -ssconf | grep queue_sort
queue_sort_method                 seqno
scmic@l5-auto-du ~ $ qstat -F | grep seq_no | sort -u
        qf:seq_no=0

Submitting three jobs, each of which gets its own machine:

scmic@l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
Your job 154 ("STDIN") has been submitted
scmic@l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
Your job 155 ("STDIN") has been submitted
scmic@l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
Your job 156 ("STDIN") has been submitted
scmic@l5-auto-du ~ $ qhost -j
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
l5-auto-du              -               -     -       -       -       -       -
l5-node01               lx26-amd64      4  0.02    7.9G  380.7M    9.8G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
       156 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
l5-node02               lx26-amd64      4  0.03    7.9G  377.2M    9.8G     0.0
l5-node03               lx26-amd64      4  0.01    7.9G  385.0M    9.8G     0.0
l5-node04               lx26-amd64      4  0.00    7.9G  386.7M    9.8G     0.0
       155 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
l5-node05               lx26-amd64      4  0.00    7.9G  384.5M    9.8G     0.0
l5-node06               lx26-amd64      4  0.01    7.9G  390.6M    9.8G     0.0
l5-node07               lx26-amd64      4  0.00    7.9G  387.8M    9.8G     0.0
l5-node08               lx26-amd64      4  0.01    7.9G  373.0M    9.8G     0.0
       154 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE

Deleting job 154 and submitting a new one (157), which still gets its own machine:

scmic@l5-auto-du ~ $ qdel "154"
scmic has registered the job 154 for deletion
scmic@l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
Your job 157 ("STDIN") has been submitted
scmic@l5-auto-du ~ $ qhost -j
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
l5-auto-du              -               -     -       -       -       -       -
l5-node01               lx26-amd64      4  0.01    7.9G  383.5M    9.8G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
       156 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
l5-node02               lx26-amd64      4  0.03    7.9G  377.2M    9.8G     0.0
l5-node03               lx26-amd64      4  0.05    7.9G  383.0M    9.8G     0.0
l5-node04               lx26-amd64      4  0.00    7.9G  386.7M    9.8G     0.0
       155 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
l5-node05               lx26-amd64      4  0.00    7.9G  384.5M    9.8G     0.0
l5-node06               lx26-amd64      4  0.04    7.9G  388.7M    9.8G     0.0
l5-node07               lx26-amd64      4  0.00    7.9G  387.8M    9.8G     0.0
       157 0.55500 STDIN      scmic        r     07/21/2010 09:26:11 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
l5-node08               lx26-amd64      4  0.02    7.9G  370.7M    9.8G     0.0

Same thing again, but this time the new job (158) is split: its master
slot is put on l5-node04, which is already running job 155, and its two
slave slots go to the free l5-node05:

scmic@l5-auto-du ~ $ qdel "157"
scmic has registered the job 157 for deletion
scmic@l5-auto-du ~ $ echo sleep 100 | qsub -pe dmp 3
Your job 158 ("STDIN") has been submitted
scmic@l5-auto-du ~ $ qhost -j
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
l5-auto-du              -               -     -       -       -       -       -
l5-node01               lx26-amd64      4  0.01    7.9G  383.5M    9.8G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
       156 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
l5-node02               lx26-amd64      4  0.03    7.9G  377.2M    9.8G     0.0
l5-node03               lx26-amd64      4  0.05    7.9G  383.0M    9.8G     0.0
l5-node04               lx26-amd64      4  0.00    7.9G  386.7M    9.8G     0.0
       155 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
       158 0.55500 STDIN      scmic        r     07/21/2010 09:26:22 express-dm MASTER
l5-node05               lx26-amd64      4  0.00    7.9G  384.5M    9.8G     0.0
       158 0.55500 STDIN      scmic        r     07/21/2010 09:26:22 express-dm SLAVE
                                                                     express-dm SLAVE
l5-node06               lx26-amd64      4  0.04    7.9G  388.7M    9.8G     0.0
l5-node07               lx26-amd64      4  0.00    7.9G  387.8M    9.8G     0.0
l5-node08               lx26-amd64      4  0.02    7.9G  370.7M    9.8G     0.0
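
For anyone who wants to spot this without reading the listing by eye,
here is a rough Python sketch that parses the plain-text `qhost -j`
output and reports hosts carrying slots of more than one job. It assumes
the exact layout shown above (unindented host rows, indented job rows,
MASTER/SLAVE in the last column). The function names parse_qhost_j and
shared_hosts are made up for this example and are not part of SGE, and
the sample text is an abbreviated copy of the last listing.

```python
from collections import defaultdict

# Abbreviated excerpt of the last `qhost -j` listing above.
SAMPLE = """\
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
l5-node04               lx26-amd64      4  0.00    7.9G  386.7M    9.8G     0.0
       155 0.55500 STDIN      scmic        r     07/21/2010 09:25:57 express-dm MASTER
                                                                     express-dm SLAVE
                                                                     express-dm SLAVE
       158 0.55500 STDIN      scmic        r     07/21/2010 09:26:22 express-dm MASTER
l5-node05               lx26-amd64      4  0.00    7.9G  384.5M    9.8G     0.0
       158 0.55500 STDIN      scmic        r     07/21/2010 09:26:22 express-dm SLAVE
                                                                     express-dm SLAVE
"""


def parse_qhost_j(text):
    """Return {job_id: [(host, role), ...]} from `qhost -j` text (assumed layout)."""
    slots = defaultdict(list)
    host = job = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or set(line.strip()) == {"-"}:
            continue  # blank line or dashed separator
        if not line[0].isspace():
            # Unindented lines are either the column header or a host row.
            if tokens[0] != "HOSTNAME":
                host, job = tokens[0], None
            continue
        if tokens[0].isdigit():
            job = tokens[0]  # this row starts a job entry on the current host
        # Continuation rows carry only the queue name and the role.
        role = next((t for t in tokens if t in ("MASTER", "SLAVE")), None)
        if host and job and role:
            slots[job].append((host, role))
    return dict(slots)


def shared_hosts(slots):
    """Return {host: sorted job ids} for hosts holding slots of several jobs."""
    jobs_on = defaultdict(set)
    for job, entries in slots.items():
        for host, _role in entries:
            jobs_on[host].add(job)
    return {h: sorted(j) for h, j in jobs_on.items() if len(j) > 1}


if __name__ == "__main__":
    print(shared_hosts(parse_qhost_j(SAMPLE)))
```

Run on the excerpt, this prints {'l5-node04': ['155', '158']}, i.e. the
master slot of job 158 shares l5-node04 with job 155. On the first two
listings, where every job has a machine to itself, the result would be
an empty dict.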

Thanks,
-- 
Michael Weiser                science + computing ag
Senior Systems Engineer       Geschaeftsstelle Duesseldorf
                              Martinstrasse 47-55, Haus A
phone: +49 211 302 708 32     D-40223 Duesseldorf
fax:   +49 211 302 708 50     www.science-computing.de