[GE users] process to core distribution

Reuti reuti at Staff.Uni-Marburg.DE
Fri Oct 17 15:57:58 BST 2008


On 17.10.2008, at 16:12, Joseph Hargitai wrote:

> ----- Original Message -----
> From: Reuti <reuti at staff.uni-marburg.de>
> Date: Friday, October 17, 2008 9:37 am
> Subject: Re: [GE users] process to core distribution
>
>> On 17.10.2008, at 12:48, Joseph Hargitai wrote:
>>
>>> Running parallel jobs:
>>>
>>> with Open MPI the distribution of processes on our 16-core nodes is
>>> balanced, even when using a PE that allows only 8 or 4 cores per
>>> node. Meaning you get 2 processes on each of the CPUs.
>>
>> Can you explain this in more detail? I don't get it.
>
> We have a set of PEs for Open MPI 1.x as follows: orte-4, orte-8,
> orte-12 - to allow fewer than all 16 cores to be used on our 16-core
> nodes (4 quad-core CPUs).
>
> Here is orte-8:
>
> pe_name           orte-8
> slots             9999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /bin/true
> allocation_rule   8
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> When you run a job with "qsub -pe orte-8 16", the 16 processes get
> distributed evenly across the nodes -

This is done by SGE.
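
To illustrate: with allocation_rule 8 and 16 requested slots, SGE
grants 8 slots on each of two nodes. The $pe_hostfile handed to the
job would then look something like this (hostnames and queue name are
made up):

  node01 8 all.q@node01 UNDEFINED
  node02 8 all.q@node02 UNDEFINED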

> 8 and 8. And on the node the processes also get evenly distributed:
> each of the 4 CPUs gets 2 processes.

This is done by the Linux scheduler.
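
You can check this yourself on a node while the job is running
(binary name and PID below are placeholders):

  # show which core each process last ran on (psr column)
  ps -o pid,psr,comm -C your_mpi_binary

  # show which cores a process is allowed to run on (its affinity)
  taskset -cp <pid>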

> Whereas with our MVAPICH configuration:
>
> pe_name           mvapich-8
> slots             16
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /opt/gridengine/mpi/stopmpi.sh
> allocation_rule   8
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
>
> an invocation with "mvapich-8 16" distributes 8 and 8 between the
> nodes, as it should, but

This is done by SGE.

> the process distribution on the node itself is skewed: the first two
> CPUs get all 8 processes, and nothing runs on the remaining two CPUs.

Maybe some CPU affinity is built into MVAPICH. The advantage would be
that processes stay on the same cache. From their homepage:

- Processor Affinity
- Flexible user-defined processor affinity for better resource
utilization on multi-core systems

So the MVAPICH manual will explain this and how to change it.
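
For example (untested here, and the exact variable names depend on
the MVAPICH version - please verify them in the manual for your
release):

  # MVAPICH 1.x: disable the built-in CPU affinity
  export VIADEV_USE_AFFINITY=0

  # MVAPICH2: the equivalent switch
  export MV2_ENABLE_AFFINITY=0

With affinity disabled, the Linux scheduler spreads the processes
just as it does for Open MPI.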

> We are aware of the option to use numactl. We were just wondering
> what would cause Open MPI to do the mapping differently than MVAPICH.
>
>
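If you go the numactl route, one way is a wrapper script around the
binary. A rough sketch (it assumes mpirun_rsh exports MPIRUN_RANK to
each process - please check this for your MVAPICH version):

  #!/bin/sh
  # bind each rank round-robin to one of the 4 NUMA nodes (= sockets)
  NODE=$(( MPIRUN_RANK % 4 ))
  exec numactl --cpunodebind=$NODE --membind=$NODE "$@"
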
> PS: also, since you mentioned the possibility of CPUs being already
> used by other jobs (which is not the case here): what is the easiest
> SGE way to make nodes exclusive to a job? We do not want parallel
> jobs to mix. They need to own the nodes while running, even when not
> using all cores.

It's an RFE (request for enhancement). For now you will need to
either:

- request all cores on every machine and adjust the hostlist and slot
count to use only the desired number of cores, or

- request an amount of memory such that there is no memory left for
other applications (assuming h_vmem or virtual_free was made
consumable and set to a sensible value in the exechost definition);
see the sketch below.
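
A minimal sketch of the memory approach, assuming nodes with 32 GB
(hostname and values are made up):

  # make h_vmem consumable: in "qconf -mc" set the consumable column
  # of h_vmem to YES, then give each exechost a capacity:
  #   qconf -me node01   ->   complex_values   h_vmem=32G

  # a 16-process job requesting 4G per slot then fills both nodes:
  qsub -pe orte-8 16 -l h_vmem=4G job.sh

As h_vmem is requested per slot, 8 slots x 4G consume the whole 32 GB
on each node, and nothing is left for other jobs.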

-- Reuti




