[GE users] Problem filling up cores on a node in a PE using fill_up allocation

leonardz leonardz at sickkids.ca
Wed Feb 18 19:56:24 GMT 2009


Javier:

Thanks, this almost does it.

I am using a beta of 6.2u2 now, and I found that adding the following to the queue definition:
....
slots                 8,[cn-r3-4=4],[cn-r3-5=4],[cn-r3-6=4],[cn-r3-7=4]
....
complex_values        slots=8

scheduled the 8 slots across two 4-core nodes, as expected.
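For anyone reproducing this: rather than editing the full queue definition interactively, the same slots attribute can be set from the command line with qconf. This is only a sketch; the queue name ompitest-8 is assumed from the qhost output below:

```
# Show the current queue configuration (queue name assumed to be ompitest-8)
qconf -sq ompitest-8

# Set the default slot count plus the per-host overrides in one step
# (equivalent to editing the "slots" line via "qconf -mq ompitest-8")
qconf -mattr queue slots "8,[cn-r3-4=4],[cn-r3-5=4],[cn-r3-6=4],[cn-r3-7=4]" ompitest-8
```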

I have 16 cores on 4 nodes in this test queue.

When I submit 5 identical jobs, each requesting 8 cores, I expect two jobs to run in parallel: one job on two of the nodes and another on the remaining two.

Much to my surprise, only one parallel job runs at a time in the PE, which has 16 slots defined. (PS: these jobs are just MPI hello-world runs with very short run times.)

It runs one job on two nodes, leaving the other two idle; only after that job completes does it schedule the next job on two nodes, and so on.

How do I get all cores to be used?

Details from qhost -j

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
cn-r3-4                 lx24-amd64      4  0.00    7.7G   37.5M    2.0G     0.0
cn-r3-5                 lx24-amd64      4  0.00    7.7G   38.7M    2.0G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
       101 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:04 ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
cn-r3-6                 lx24-amd64      4  0.00    7.7G   38.0M    2.0G     0.0
cn-r3-7                 lx24-amd64      4  0.00    7.7G   41.0M    2.0G     0.0
       101 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:04 ompitest-8 MASTER
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE


Job 102 then starts 15 seconds later:

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
cn-r3-4                 lx24-amd64      4  0.00    7.7G   37.5M    2.0G     0.0
cn-r3-5                 lx24-amd64      4  0.00    7.7G   41.3M    2.0G     0.0
cn-r3-6                 lx24-amd64      4  0.00    7.7G   38.2M    2.0G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
       102 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:19 ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
cn-r3-7                 lx24-amd64      4  0.00    7.7G   41.0M    2.0G     0.0
       102 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:19 ompitest-8 MASTER
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE
                                                                     ompitest-8 SLAVE

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=109195
