[GE users] Still problems submitting mpich jobs - wrong hosts

Gerolf Ziegenhain mail.gerolf at ziegenhain.com
Fri Jul 6 18:44:13 BST 2007


We are using mpich-1.5.2. Maybe this is the reason? Even mpirun -machinefile
XXX -np N -nodes N/2 doesn't help.

/BR
   Gerolf
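
For anyone debugging the same symptom: a quick way to see which hosts a
job was actually given is to count the entries in the machinefile before
calling mpirun. The following is only a minimal sketch. It assumes the
machinefile holds one host name per line (one line per slot), and the
function name check_machinefile is made up for illustration.

```shell
#!/bin/sh
# Count how often each host appears in a machinefile and flag every
# host whose count differs from the expected number of slots per node.
# Usage: check_machinefile FILE EXPECTED_PER_HOST
# (hypothetical helper, not part of SGE or MPICH)
check_machinefile() {
    file=$1
    expected=$2
    # "sort | uniq -c" prints "<count> <host>" for every distinct line;
    # awk turns that into "<host> <count> <status>" and sets the exit code.
    sort "$file" | uniq -c | awk -v want="$expected" '
        {
            status = ($1 == want) ? "ok" : "MISMATCH"
            if ($1 != want) bad = 1
            print $2, $1, status
        }
        END { exit bad + 0 }
    '
}
```

Inside the job script it could be called as
check_machinefile "$TMPDIR/machines" 2 right before the mpirun line.
With allocation_rule 2 every host should then be listed exactly twice;
a non-zero exit status means at least one host deviates.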


2007/7/6, Reuti <reuti at staff.uni-marburg.de>:
>
> On 06.07.2007 at 19:17, Gerolf Ziegenhain wrote:
>
> > Solved:
> > mpirun -map host1:host2:...
> > Everything else behaves _strangely_
>
> As we have used mpich1.2.7p1 for years and never experienced this
> behavior, it would be interesting to find the cause of it. The only
> thing different in our setup is the order of arguments: I usually
> specify -np before -machinefile.
>
> -- Reuti
>
> > /BR: Gerolf
> >
> > 2007/7/6, Reuti <reuti at staff.uni-marburg.de>:
> > On 06.07.2007 at 13:40, Gerolf Ziegenhain wrote:
> >
> > > Everything is OK: all of them have an FQDN.
> >
> > Are all the jobs using the same script? I mean, does one of them
> > include a "-nolocal" or similar by accident?
> >
> > -- Reuti
> >
> > > /BR: Gerolf
> > >
> > > 2007/7/6, Reuti <reuti at staff.uni-marburg.de>:
> > > Okay,
> > >
> > > can you please check the output of the command `hostname` on all of
> > > your nodes? If it is not consistent (either only the hostname or
> > > the FQDN everywhere), the creation of the machinefile must be
> > > adjusted to handle this.
> > >
> > > -- Reuti
> > >
> > >
> > > On 06.07.2007 at 12:33, Gerolf Ziegenhain wrote:
> > >
> > > > > Yes. It seems to be a random effect :/ Sometimes it is working
> > > > > very nicely and sometimes not.
> > > >
> > > > /BR: Gerolf
> > > >
> > > > > 2007/7/6, Reuti <reuti at staff.uni-marburg.de>:
> > > > > On 06.07.2007 at 11:24, Gerolf Ziegenhain wrote:
> > > >
> > > > > > To sum it up once again: I want to start mpich jobs on my
> > > > > > SGE. On each node there should be exactly two jobs running.
> > > > > > How can I achieve this?
> > > >
> > > > > You mean it is still not working, although you patched the
> > > > > creation of the machinefile in startmpi.sh? - Reuti
> > > >
> > > > > My script looks like this:
> > > > > #$ -pe mpich 8
> > > > > #$ -S /bin/zsh
> > > > > #$ -r n
> > > > > #$ -cwd
> > > > > MPIRUN="/opt/mpich/bin/mpirun"
> > > > > ${MPIRUN} -v -machinefile $TMPDIR/machines -np $NSLOTS PROGRAM
> > > > >
> > > > > The parallel environment is
> > > > > qconf -sp mpich
> > > > > pe_name           mpich
> > > > > slots             72
> > > > > user_lists        NONE
> > > > > xuser_lists       NONE
> > > > > > start_proc_args   /opt/N1GE/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > > > > stop_proc_args    /opt/N1GE/mpi/stopmpi.sh
> > > > > allocation_rule   2
> > > > > control_slaves    TRUE
> > > > > job_is_first_task TRUE
> > > > > urgency_slots     min
> > > > >
> > > > > The queue is
> > > > > qconf -sq q_mpich
> > > > > qname                 q_mpich
> > > > > hostlist              @s_hosts
> > > > > seq_no                21,[@b_hosts=22],[@x_hosts=23]
> > > > > > load_thresholds       np_load_avg=1,np_load_short=1,n_slots=2, \
> > > > > >                       [@b_hosts=np_load_avg=1,np_load_short=1,n_slots=2], \
> > > > > >                       [@x_hosts=np_load_avg=1,np_load_short=1,n_slots=2]
> > > > > suspend_thresholds    NONE
> > > > > nsuspend              1
> > > > > suspend_interval      00:05:00
> > > > > priority              0
> > > > > min_cpu_interval      00:05:00
> > > > > processors            UNDEFINED
> > > > > qtype                 BATCH
> > > > > ckpt_list             NONE
> > > > > pe_list               mpich mpich2
> > > > > rerun                 TRUE
> > > > > slots                 2
> > > > > tmpdir                /tmp
> > > > > shell                 /bin/bash
> > > > > prolog                NONE
> > > > > epilog                NONE
> > > > > shell_start_mode      unix_behavior
> > > > > starter_method        NONE
> > > > > suspend_method        NONE
> > > > > resume_method         NONE
> > > > > terminate_method      NONE
> > > > > notify                00:00:60
> > > > > owner_list            NONE
> > > > > user_lists            ziegen,[@x_hosts=big]
> > > > > xuser_lists           matlab matlab1 thor
> > > > > subordinate_list      NONE
> > > > > > complex_values        synchron=0,virtual_free=3G,n_slots=2, \
> > > > > >                       [@b_hosts=synchron=0,virtual_free=5G,n_slots=2], \
> > > > > >                       [@x_hosts=synchron=0,virtual_free=17G,n_slots=2]
> > > > > projects              NONE
> > > > > xprojects             NONE
> > > > > calendar              NONE
> > > > > initial_state         default
> > > > > s_rt                  INFINITY
> > > > > h_rt                  INFINITY
> > > > > s_cpu                 INFINITY
> > > > > h_cpu                 100:00:00
> > > > > s_fsize               INFINITY
> > > > > h_fsize               INFINITY
> > > > > s_data                INFINITY
> > > > > h_data                2G,[@b_hosts=4G],[@x_hosts=16G]
> > > > > s_stack               INFINITY
> > > > > h_stack               INFINITY
> > > > > s_core                INFINITY
> > > > > h_core                INFINITY
> > > > > s_rss                 INFINITY
> > > > > h_rss                 INFINITY
> > > > > s_vmem                INFINITY
> > > > > h_vmem                3G,[@b_hosts=5G],[@x_hosts=17G]
> > > > >
> > > > >
> > > > >
> > > > > /BR: Gerolf
> > > > >
> > > > > --
> > > > > Dipl. Phys. Gerolf Ziegenhain
> > > > > > Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern
> > > > > > - Germany
> > > > > Web: gerolf.ziegenhain.com
> > > > >
> > > >
> > > >
> > >
> > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > > For additional commands, e-mail: users-
> > help at gridengine.sunsource.net
>
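
Reuti's question about the output of `hostname` in the thread above
concerns how the machinefile is built from $pe_hostfile. As a rough
sketch only (this is not the startmpi.sh shipped with SGE), the
conversion could look like the code below. It assumes that each
pe_hostfile line starts with the host name followed by the number of
granted slots. The function name pe_hostfile_to_machines is made up,
and stripping the domain part is an assumption that only fits when
`hostname` returns the short name on every node.

```shell
#!/bin/sh
# Print each host of an SGE pe_hostfile once per granted slot.
# Assumed line format: "<host> <slots> <further fields...>".
# Usage: pe_hostfile_to_machines PE_HOSTFILE > machinefile
# (hypothetical helper for illustration)
pe_hostfile_to_machines() {
    while read -r host nslots _rest; do
        # skip empty lines and lines without a numeric slot count
        case $nslots in
            ''|*[!0-9]*) continue ;;
        esac
        # Assumption: short host names are wanted, so drop the domain
        # part. Remove this line if the nodes report their FQDN.
        host=${host%%.*}
        i=0
        while [ "$i" -lt "$nslots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}
```

With allocation_rule 2 and 8 requested slots, the resulting file should
contain four distinct hosts, each listed twice, spelled the same way
`hostname` prints them on the nodes.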


-- 
Dipl. Phys. Gerolf Ziegenhain
Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern - Germany
Web: gerolf.ziegenhain.com


