[GE users] mpich <-> sge --> controlling hosts machinefile

Reuti reuti at staff.uni-marburg.de
Thu Jul 5 11:11:50 BST 2007



Hi,

Please try setting the following in the PE for MPICH1:

job_is_first_task TRUE
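
For example, a minimal sketch (untested; the PE name is the one from
your posting):

    qconf -mp mpich
    # ... in the editor that opens, change:
    #   job_is_first_task FALSE
    # to:
    #   job_is_first_task TRUE

Roughly speaking: MPICH1's mpirun starts rank 0 locally on the master
node, so when the job script is not counted as the first task, the
generated machinefile contains one host entry too many and a slave
host can end up with an extra process - which would match the third
process you see on lc19.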

-- Reuti


Am 05.07.2007 um 11:47 schrieb Gerolf Ziegenhain:

> OK, this seems to be the point:
>
> The current state on the node running three jobs is:
>
> ssh lc19 ps x
>   PID  STAT   TIME COMMAND
> 9493 S 0:00 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel -p4pg /nas2/ziegen/CSI/LJ/L300_R15/PI9268 -p4wd /nas2/ziegen/CSI/LJ/L300_R15
> 10321 S 37:54 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel lc12.rhrk.uni-kl.de 43886 4amslave -p4yourname lc19 -p4rmrank 1
> 10322 S 0:00 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel lc12.rhrk.uni-kl.de 43886 4amslave -p4yourname lc19 -p4rmrank 1
> 10324 D 37:24 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel lc12.rhrk.uni-kl.de 43886 4amslave -p4yourname lc19 -p4rmrank 2
> 10325 S 0:00 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel lc12.rhrk.uni-kl.de 43886 4amslave -p4yourname lc19 -p4rmrank 2
> 10327 D 39:02 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel lc12.rhrk.uni-kl.de 43886 4amslave -p4yourname lc19 -p4rmrank 7
> 10328 S 0:00 /gu2/ziegen/LINUX/data/bin//lmp_rhrk_parallel lc12.rhrk.uni-kl.de 43886 4amslave -p4yourname lc19 -p4rmrank 7
>
> So there are definitely three tasks running there.
>
> But the queue says this:
> qstat -g t -u ziegen
> job-ID  prior   name       user         state submit/start at      queue                          master ja-task-ID
> --------------------------------------------------------------------------------------------------------------------
>  244224 1.60000 R81        ziegen       r     07/04/2007 20:38:33  q_mpich at lc12                   MASTER
>                                                                    q_mpich at lc12                   SLAVE
>                                                                    q_mpich at lc12                   SLAVE
>  244224 1.60000 R81        ziegen       r     07/04/2007 20:38:33  q_mpich at lc13                   SLAVE
>                                                                    q_mpich at lc13                   SLAVE
>  244224 1.60000 R81        ziegen       r     07/04/2007 20:38:33  q_mpich at lc14                   SLAVE
>                                                                    q_mpich at lc14                   SLAVE
>  244224 1.60000 R81        ziegen       r     07/04/2007 20:38:33  q_mpich at lc19                   SLAVE
>                                                                    q_mpich at lc19                   SLAVE
>
>
> So according to SGE there should be only two tasks running on lc19.
>
> For completeness, here again is how the parallel environment mpich is
> currently configured:
> qconf -sp mpich
> pe_name           mpich
> slots             60
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /opt/N1GE/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /opt/N1GE/mpi/stopmpi.sh
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> And this is the queue configuration:
>  qconf -sq q_mpich
> qname                 q_mpich
> hostlist              lc10 lc11 lc12 lc13 lc14 lc15 lc18 lc19
> seq_no                21,[@b_hosts=22],[@x_hosts=23]
> load_thresholds       np_load_avg=1,np_load_short=1,n_slots=2, \
>                       [@b_hosts=np_load_avg=1,np_load_short=1,n_slots=2], \
>                       [@x_hosts=np_load_avg=1,np_load_short=1,n_slots=2]
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH
> ckpt_list             NONE
> pe_list               mpich
> rerun                 TRUE
> slots                 2
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      unix_behavior
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            ziegen,[@x_hosts=big]
> xuser_lists           matlab matlab1 thor
> subordinate_list      NONE
> complex_values        synchron=0,virtual_free=3G,n_slots=2, \
>                       [@b_hosts=synchron=0,virtual_free=5G,n_slots=2], \
>                       [@x_hosts=synchron=0,virtual_free=17G,n_slots=2]
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 100:00:00
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                2G,[@b_hosts=4G],[@x_hosts=16G]
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                3G,[@b_hosts=5G],[@x_hosts=17G]
>
>
> I submit my jobs with this script:
> #!/bin/zsh
> #$ -pe mpich 8
> #$ -S /bin/zsh
> #$ -M ziegen
> #$ -N "R81"
> #$ -l h_cpu=300000,h_vmem=3000m
> #$ -m bea
> #$ -r n
> #$ -cwd
> #$ -e ./lammps.stderr
> #$ -o ./lammps.stdout
> QSUB="/opt/N1GE/bin/lx24-amd64/qsub"
> MPIRUN="/opt/mpich/bin/mpirun"
> BINDIR="/gu2/ziegen/LINUX/data/bin/"
> MACHINEFILE=$TMPDIR/machines
> PATH=$PATH
> function run { ${MPIRUN} -v -machinefile $MACHINEFILE -np $NSLOTS $BINDIR/lmp_rhrk_parallel < ${GRID_SCRIPT} }
> GRID_SCRIPT=simulation_script
> run
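>
> (A small, untested addition that would make the mismatch visible is to
> dump what SGE actually hands to mpirun, just before calling run:)
>
> echo "NSLOTS=$NSLOTS"        # slots granted by SGE for this job
> echo "--- machinefile ---"
> cat $MACHINEFILE             # generated by startmpi.sh in $TMPDIR
> echo "-------------------"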
>
> The installed mpich is mpich-1.2.5.2.
>
> What am I missing in my SGE configuration?
>
> /BR:
>    Gerolf
>
>
>
>
> 2007/7/5, Ravi Chandra Nallan <Ravichandra.Nallan at sun.com>:
> Hi Gerolf,
> you must have created the complex n_slots, which is a consumable, but
> it seems redundant: SGE maintains a built-in complex 'slots', which
> does the same thing in your case (both slots and n_slots are 2). You
> can see the number of slots assigned to the queue instance(s) using
> qstat -g c.
> I am not sure how 3 tasks got scheduled on one node; can you please
> check with qstat -g t, just to make sure we are seeing the right thing?
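>
> For example (column names vary a little between SGE versions):
>
>   qstat -g c            # per cluster queue: USED / AVAIL / TOTAL slots
>   qstat -g t -u ziegen  # one MASTER/SLAVE line per granted task and host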
>
> regards,
> -Ravi
>
> Gerolf Ziegenhain wrote:
> > Hi Chris,
> >
> > The result of this experiment is predictable: if I remove the n_slots
> > threshold, SGE will also start single slots on nodes which already
> > have a load of 1, which means it can only start one process there;
> > and if the allocation of slots is fixed at "2", nothing changes: the
> > behaviour is as described before.
> >
> > /BR: Gerolf
> >
> > 2007/7/4, Chris Dagdigian <dag at sonsorol.org>:
> >
> >
> >     Hi Gerolf,
> >
> >     I run MPI apps on 2-way compute nodes all the time without
> >     problems and I've always just let "slots 2" stay in the queue
> >     configuration. The SGE scheduler always did the right thing. Can
> >     you remove your n_slots load thresholds and queue consumables and
> >     see what happens? I'm still curious as to why you had three tasks
> >     land on one node.
> >
> >
> >
> >
> >
> >     On Jul 4, 2007, at 3:35 PM, Gerolf Ziegenhain wrote:
> >
> >     > Thanks for the very quick reply ;)
> >     >
> >     > allocation_rule = $round_robin results in 1 job per node, which
> >     > increases the communication effort. So maybe allocation_rule=2
> >     > would be the best choice in my case?
> >     >
> >     > This is the configuration of the queue:
> >     > qconf -sq q_mpich
> >     > qname                 q_mpich
> >     > hostlist              lc10 lc11 lc12 lc13 lc14 lc15 lc18 lc19
> >     > seq_no                21,[@b_hosts=22],[@x_hosts=23]
> >     > load_thresholds       np_load_avg=1,np_load_short=1,n_slots=2, \
> >     >                       [@b_hosts=np_load_avg=1,np_load_short=1,n_slots=2], \
> >     >                       [@x_hosts=np_load_avg=1,np_load_short=1,n_slots=2]
> >     > suspend_thresholds    NONE
> >     > nsuspend              1
> >     > suspend_interval      00:05:00
> >     > priority              0
> >     > min_cpu_interval      00:05:00
> >     > processors            UNDEFINED
> >     > qtype                 BATCH
> >     > ckpt_list             NONE
> >     > pe_list               mpich
> >     > rerun                 TRUE
> >     > slots                 2
> >     > tmpdir                /tmp
> >     > shell                 /bin/bash
> >     > prolog                NONE
> >     > epilog                NONE
> >     > shell_start_mode      unix_behavior
> >     > starter_method        NONE
> >     > suspend_method        NONE
> >     > resume_method         NONE
> >     > terminate_method      NONE
> >     > notify                00:00:60
> >     > owner_list            NONE
> >     > user_lists            ziegen,[@x_hosts=big]
> >     > xuser_lists           matlab matlab1 thor
> >     > subordinate_list      NONE
> >     > complex_values        synchron=0,virtual_free=3G,n_slots=2, \
> >     >                       [@b_hosts=synchron=0,virtual_free=5G,n_slots=2], \
> >     >                       [@x_hosts=synchron=0,virtual_free=17G,n_slots=2]
> >     > projects              NONE
> >     > xprojects             NONE
> >     > calendar              NONE
> >     > initial_state         default
> >     > s_rt                  INFINITY
> >     > h_rt                  INFINITY
> >     > s_cpu                 INFINITY
> >     > h_cpu                 100:00:00
> >     > s_fsize               INFINITY
> >     > h_fsize               INFINITY
> >     > s_data                INFINITY
> >     > h_data                2G,[@b_hosts=4G],[@x_hosts=16G]
> >     > s_stack               INFINITY
> >     > h_stack               INFINITY
> >     > s_core                INFINITY
> >     > h_core                INFINITY
> >     > s_rss                 INFINITY
> >     > h_rss                 INFINITY
> >     > s_vmem                INFINITY
> >     > h_vmem                3G,[@b_hosts=5G],[@x_hosts=17G]
> >     >
> >     >
> >     > /BR:
> >     >    Gerolf
> >     >
> >     >
> >     > 2007/7/4, Chris Dagdigian <dag at sonsorol.org>:
> >     > Not sure if this totally answers your question but you can play
> >     > with the host selection process by adjusting your
> >     > $allocation_rule in your parallel environment configuration.
> >     >
> >     > For instance, you have $fill_up configured, which is why your
> >     > parallel slots are being packed on as few nodes as possible.
> >     > Changing to $round_robin will spread it out among as many
> >     > machines as possible.
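> >     >
> >     > (Untested sketch - the relevant line, reachable via
> >     > "qconf -mp mpich", would then read:
> >     >
> >     >   allocation_rule   $round_robin
> >     >
> >     > instead of $fill_up.)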
> >     >
> >     > For your main symptom:
> >     >
> >     > If your parallel jobs are running more than 2 tasks per node
> >     > then something may be off with your slot count - perhaps SGE is
> >     > detecting multi-core CPUs on your 2-way boxes and setting
> >     > slots=4 on each node. Posting the config of the queue
> >     > "mpich-queue" may help get to the bottom of this as I'm not sure
> >     > about the n_slots "limit" you are referring to.
> >     >
> >     >
> >     > Regards,
> >     > Chris
> >     >
> >     >
> >     >
> >     > On Jul 4, 2007, at 3:14 PM, Gerolf Ziegenhain wrote:
> >     >
> >     > > Hi,
> >     > >
> >     > > Maybe it is a very stupid question, but: how do I control the
> >     > > number of jobs per node? Consider the following hardware: 38
> >     > > nodes with two processors on each. When I start a job with
> >     > > -pe mpich 8, there should be 4 nodes used with 2 jobs on each.
> >     > > What do I have to do in order to achieve this?
> >     > >
> >     > > My parallel environment is configured like this:
> >     > > qconf -sp mpich
> >     > > pe_name           mpich
> >     > > slots             60
> >     > > user_lists        NONE
> >     > > xuser_lists       NONE
> >     > > start_proc_args   /opt/N1GE/mpi/startmpi.sh -catch_rsh $pe_hostfile
> >     > > stop_proc_args    /opt/N1GE/mpi/stopmpi.sh
> >     > > allocation_rule   $fill_up
> >     > > control_slaves    TRUE
> >     > > job_is_first_task FALSE
> >     > > urgency_slots     min
> >     > >
> >     > > My mpich-queue has these limits:
> >     > > np_load_avg=1
> >     > > np_load_short=1
> >     > > n_slots=2
> >     > >
> >     > > However, if I start a job, something like this will happen in
> >     > > the PI1234 file:
> >     > > lc12.rhrk.uni-kl.de 0 prog
> >     > > lc19 1 prog
> >     > > lc19 1 prog
> >     > > lc19 1 prog
> >     > > lc14 1 prog
> >     > > lc14 1 prog
> >     > > lc13 1 prog
> >     > > lc13 1 prog
> >     > >
> >     > > So there are three jobs on lc19, which has only two CPUs. One
> >     > > of these three jobs should rather be running on lc12. How can
> >     > > I fix this?
> >     > >
> >     > >
> >     > > Thanks in advance:
> >     > >    Gerolf
> >     > >
> >     > >
> >     > >
> >     > >
> >     > > --
> >     > > Dipl. Phys. Gerolf Ziegenhain
> >     > > Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern - Germany
> >     > > Web: gerolf.ziegenhain.com
> >     > >
> >     > >
> >     >
> >     >
> >     >
> >     >
> >     >
> >     >
> >     > --
> >     > Dipl. Phys. Gerolf Ziegenhain
> >     > Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern - Germany
> >     > Web: gerolf.ziegenhain.com
> >     >
> >
> >
> >
> >
> >
> > --
> > Dipl. Phys. Gerolf Ziegenhain
> > Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern - Germany
> > Web: gerolf.ziegenhain.com
>
>
>
>
>
> -- 
> Dipl. Phys. Gerolf Ziegenhain
> Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern - Germany
> Web: gerolf.ziegenhain.com

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



