[GE users] Problem filling up cores on a node in a PE using fill_up allocation

reuti reuti at staff.uni-marburg.de
Wed Feb 18 21:19:16 GMT 2009


Do you run all jobs with a multiple of 4 slots? Then you could also use
"allocation_rule 4" in your PE.

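As a rough sketch of what such a PE could look like: the snippet below only writes a definition file. The PE name "ompi_fixed4" and the file location are invented for this example, and the remaining field values are common defaults rather than the poster's actual settings, so compare them with the output of "qconf -sp <pe_name>" on the real cluster. With an integer allocation rule each host contributes exactly that many slots, so an 8-slot job would be placed on two hosts with 4 slots each.

```shell
#!/bin/sh
# Hypothetical PE definition; "ompi_fixed4" is an invented name, not from this thread.
pe_file="${TMPDIR:-/tmp}/ompi_fixed4.pe"

cat > "$pe_file" <<'EOF'
pe_name            ompi_fixed4
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    4
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF

# On a host with manager rights the file could then be loaded with:
#   qconf -Ap "$pe_file"
grep allocation_rule "$pe_file"
```

The PE still has to be listed in the queue's "pe_list" before jobs can request it.
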
On 18.02.2009 at 20:56, leonardz wrote:

> Javier:
>
> Thanks this almost does it.
>
> I am using a beta of 6.2u2 now and I found that for the queue  
> definition adding:
> ....
> slots                 8,[cn-r3-4=4],[cn-r3-5=4],[cn-r3-6=4],[cn-r3-7=4]
> ....
> complex_values        slots=8

Well, the built-in "slots" complex is special and shouldn't be touched by
such a setup. You can already see a contradiction here: first you
define 4 slots, then 8 slots per queue instance.
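
A consistent variant, as a sketch only, would state the per-instance slot count once in the "slots" line and leave "complex_values" alone. The value 4 assumes one slot per core on the 4-core nodes. The queue name "ompitest-8" is taken from the qhost output quoted further down, and the file path is made up.

```shell
#!/bin/sh
# Sketch: the two queue attributes as they might look without the contradiction.
# Assumes one slot per core on the 4-core nodes; the path is hypothetical.
queue_fragment="${TMPDIR:-/tmp}/ompitest-8.fragment"

cat > "$queue_fragment" <<'EOF'
slots                 4
complex_values        NONE
EOF

# To apply this on a real cluster (manager rights needed), edit the queue with:
#   qconf -mq ompitest-8
# and change the two lines accordingly.
cat "$queue_fragment"
```
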


> scheduled 8 cores on two 4-core nodes, as expected.

Then you should also just define 8 in the "slots ..." line. But I
don't know whether this works. Do you want to oversubscribe the
nodes with 8 processes per node?


> I have 16 cores on 4 nodes in this test queue.

Did you also define "16" in the PE definition?


> When I schedule 5 identical jobs, each using 8 cores, I expect two  
> jobs to run in parallel, one job on 2 nodes and another on the  
> other 2 nodes.
>
> Much to my surprise, it runs only one parallel job in the PE which  
> has 16 slots defined. (PS this is just hello world in mpi, very  
> short run times).

When you run "qstat -j <jobid>" you will see some scheduler messages
that help to investigate this behavior. I would assume that only 8
slots in total are available. There is an mpihello which will run
longer and also put some load on the machines:

http://gridengine.sunsource.net/howto/mpich2-integration/mpihello.tgz
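
The slot arithmetic behind this guess can be sketched as follows. The host and slot counts come from the setup described in this thread; the figure of 8 available slots is only my assumption from above, not something verified on the cluster. The qstat call runs only if the command exists and a job ID is passed as the first argument, and the scheduler messages appear only if "schedd_job_info" is enabled in the scheduler configuration (see "qconf -ssconf").

```shell
#!/bin/sh
# Slot arithmetic for the test setup described in this thread.
hosts=4            # cn-r3-4 .. cn-r3-7
slots_per_host=4   # one slot per core
job_slots=8        # each test job requests 8 slots

total=$((hosts * slots_per_host))
expected_jobs=$((total / job_slots))
echo "total slots: $total, concurrent 8-slot jobs expected: $expected_jobs"

# Assumption: only 8 slots are effectively available to the scheduler.
available=8
actual_jobs=$((available / job_slots))
echo "with $available slots available: $actual_jobs job(s) at a time"

# On a cluster, show the scheduler's messages for one job.
# The job ID is taken from the first script argument, if given.
jobid="${1:-}"
if command -v qstat >/dev/null 2>&1 && [ -n "$jobid" ]; then
    qstat -j "$jobid"
fi
```
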

-- Reuti



> It runs one job on 2 nodes, leaving two idle, and then schedules the
> next job on two nodes after the first has completed, and so on.
>
> How do I get all cores to be used?
>
> Details from qhost -j
>
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> cn-r3-4                 lx24-amd64      4  0.00    7.7G   37.5M    2.0G     0.0
> cn-r3-5                 lx24-amd64      4  0.00    7.7G   38.7M    2.0G     0.0
>    job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
>    ----------------------------------------------------------------------------------------------
>        101 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:04 ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
> cn-r3-6                 lx24-amd64      4  0.00    7.7G   38.0M    2.0G     0.0
> cn-r3-7                 lx24-amd64      4  0.00    7.7G   41.0M    2.0G     0.0
>        101 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:04 ompitest-8 MASTER
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>
>
> and then job 102 starts 15 seconds later
>
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> cn-r3-4                 lx24-amd64      4  0.00    7.7G   37.5M    2.0G     0.0
> cn-r3-5                 lx24-amd64      4  0.00    7.7G   41.3M    2.0G     0.0
> cn-r3-6                 lx24-amd64      4  0.00    7.7G   38.2M    2.0G     0.0
>    job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
>    ----------------------------------------------------------------------------------------------
>        102 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:19 ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
> cn-r3-7                 lx24-amd64      4  0.00    7.7G   41.0M    2.0G     0.0
>        102 0.55500 ompi_test_ leonardz     r     02/18/2009 14:01:19 ompitest-8 MASTER
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>                                                                      ompitest-8 SLAVE
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=109195
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=109240
