[GE users] Still problems submitting mpich jobs - wrong hosts

Reuti reuti at staff.uni-marburg.de
Fri Jul 6 18:36:59 BST 2007



On 06.07.2007 at 19:17, Gerolf Ziegenhain wrote:

> Solved:
> mpirun -map host1:host2:...
> Everything else behaves _strange_

We have used mpich1.2.7p1 for years and never experienced this behavior,
so it would be interesting to get to the cause of it. The only difference
in our setup is the order of arguments: I usually specify -np before
-machinefile.
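
For illustration, a minimal sketch of that argument order, reusing the
variable names from the job script quoted further down. The fallback values
are placeholders so the snippet also runs outside SGE, and it only prints the
command line instead of starting mpirun:

```shell
#!/bin/sh
# Sketch only: print the mpirun command line with -np ahead of -machinefile.
# MPIRUN, NSLOTS and TMPDIR mirror the quoted job script; the defaults below
# are placeholders for running this outside of an SGE job.
MPIRUN="${MPIRUN:-/opt/mpich/bin/mpirun}"
NSLOTS="${NSLOTS:-8}"
TMPDIR="${TMPDIR:-/tmp}"

build_cmd() {
    # $1 = program to start
    echo "$MPIRUN -np $NSLOTS -machinefile $TMPDIR/machines $1"
}

build_cmd PROGRAM
```

Whether the order actually matters for the behavior described here is not
established; this only shows what differs between the two setups.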

-- Reuti

> /BR: Gerolf
>
> 2007/7/6, Reuti <reuti at staff.uni-marburg.de>:
> On 06.07.2007 at 13:40, Gerolf Ziegenhain wrote:
>
> > Everything ok: All of them have a FQDN.
>
> Are all the jobs using the same script? I mean, is a "-nolocal" or
> similar included in some of them by accident?
>
> -- Reuti
>
> > /BR: Gerolf
> >
> > 2007/7/6, Reuti <reuti at staff.uni-marburg.de>:
> >
> > Okay,
> >
> > can you please check the response of the command `hostname` on all of
> > your nodes? If this is not consistent (either only the hostname or
> > the FQDN), the creation of the machinefile must be adjusted to handle
> > this.
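
The consistency check described above could be sketched roughly as follows.
The function name and the sample host names are made up for illustration;
the only rule applied is that a name containing a dot counts as an FQDN:

```shell
#!/bin/sh
# Sketch only: report whether a list of host names mixes short names and FQDNs.
# Reads one name per line on stdin; a name containing a dot is treated as FQDN.
check_hostname_style() {
    short=0
    fqdn=0
    while read -r name; do
        [ -n "$name" ] || continue
        case "$name" in
            *.*) fqdn=$((fqdn + 1)) ;;
            *)   short=$((short + 1)) ;;
        esac
    done
    if [ "$short" -gt 0 ] && [ "$fqdn" -gt 0 ]; then
        echo "mixed"
    else
        echo "consistent"
    fi
}

# Example with made-up names; on a real cluster, feed it the collected
# `hostname` output of every node instead.
printf '%s\n' node01 node02.example.com | check_hostname_style
```
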
> >
> > -- Reuti
> >
> >
> > On 06.07.2007 at 12:33, Gerolf Ziegenhain wrote:
> >
> > > Yes. It seems to be a random effect :/ Sometimes it is working very
> > > nicely and sometimes not.
> > >
> > > /BR: Gerolf
> > >
> > > 2007/7/6, Reuti <reuti at staff.uni-marburg.de>:
> > > On 06.07.2007 at 11:24, Gerolf Ziegenhain wrote:
> > >
> > > > To sum it up once again: I want to start mpich jobs on my SGE. On
> > > > each node there should be exactly two jobs running. How can I
> > > > achieve this?
> > >
> > > You mean it is still not working, although you patched the creation
> > > of the machinefile in startmpi.sh? - Reuti
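
For readers unfamiliar with that step, here is a rough sketch of how a
machinefile can be derived from a PE hostfile. It is not the startmpi.sh
shipped with SGE and not the patch discussed in this thread. It assumes the
usual hostfile layout (host name, slot count, queue, processor range per
line) and writes each host name unchanged, once per slot; the function name
is invented:

```shell
#!/bin/sh
# Sketch only: turn a pe_hostfile-style file into a machinefile listing,
# repeating each host name once per granted slot. Host names are passed
# through as-is (no stripping or adding of the domain part).
pe_hostfile_to_machines() {
    # $1 = path to a file with lines like: <host> <slots> <queue> <range>
    while read -r host nslots _rest; do
        [ -n "$host" ] || continue
        case "$nslots" in
            ''|*[!0-9]*) continue ;;   # skip lines without a numeric slot count
        esac
        i=0
        while [ "$i" -lt "$nslots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}

# Inside an SGE job the hostfile path is available as $PE_HOSTFILE.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    pe_hostfile_to_machines "$PE_HOSTFILE"
fi
```

With allocation_rule 2, every host in the hostfile should carry a slot count
of 2, so each name would appear twice in the resulting list.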
> > >
> > > > My script looks like this:
> > > > #$ -pe mpich 8
> > > > #$ -S /bin/zsh
> > > > #$ -r n
> > > > #$ -cwd
> > > > MPIRUN="/opt/mpich/bin/mpirun"
> > > > ${MPIRUN} -v -machinefile $TMPDIR/machines -np $NSLOTS PROGRAM
> > > >
> > > > The parallel environment is
> > > > qconf -sp mpich
> > > > pe_name           mpich
> > > > slots             72
> > > > user_lists        NONE
> > > > xuser_lists       NONE
> > > > start_proc_args   /opt/N1GE/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > > > stop_proc_args    /opt/N1GE/mpi/stopmpi.sh
> > > > allocation_rule   2
> > > > control_slaves    TRUE
> > > > job_is_first_task TRUE
> > > > urgency_slots     min
> > > >
> > > > The queue is
> > > > qconf -sq q_mpich
> > > > qname                 q_mpich
> > > > hostlist              @s_hosts
> > > > seq_no                21,[@b_hosts=22],[@x_hosts=23]
> > > > load_thresholds       np_load_avg=1,np_load_short=1,n_slots=2, \
> > > >                       [@b_hosts=np_load_avg=1,np_load_short=1,n_slots=2], \
> > > >                       [@x_hosts=np_load_avg=1,np_load_short=1,n_slots=2]
> > > > suspend_thresholds    NONE
> > > > nsuspend              1
> > > > suspend_interval      00:05:00
> > > > priority              0
> > > > min_cpu_interval      00:05:00
> > > > processors            UNDEFINED
> > > > qtype                 BATCH
> > > > ckpt_list             NONE
> > > > pe_list               mpich mpich2
> > > > rerun                 TRUE
> > > > slots                 2
> > > > tmpdir                /tmp
> > > > shell                 /bin/bash
> > > > prolog                NONE
> > > > epilog                NONE
> > > > shell_start_mode      unix_behavior
> > > > starter_method        NONE
> > > > suspend_method        NONE
> > > > resume_method         NONE
> > > > terminate_method      NONE
> > > > notify                00:00:60
> > > > owner_list            NONE
> > > > user_lists            ziegen,[@x_hosts=big]
> > > > xuser_lists           matlab matlab1 thor
> > > > subordinate_list      NONE
> > > > complex_values        synchron=0,virtual_free=3G,n_slots=2, \
> > > >                       [@b_hosts=synchron=0,virtual_free=5G,n_slots=2], \
> > > >                       [@x_hosts=synchron=0,virtual_free=17G,n_slots=2]
> > > > projects              NONE
> > > > xprojects             NONE
> > > > calendar              NONE
> > > > initial_state         default
> > > > s_rt                  INFINITY
> > > > h_rt                  INFINITY
> > > > s_cpu                 INFINITY
> > > > h_cpu                 100:00:00
> > > > s_fsize               INFINITY
> > > > h_fsize               INFINITY
> > > > s_data                INFINITY
> > > > h_data                2G,[@b_hosts=4G],[@x_hosts=16G]
> > > > s_stack               INFINITY
> > > > h_stack               INFINITY
> > > > s_core                INFINITY
> > > > h_core                INFINITY
> > > > s_rss                 INFINITY
> > > > h_rss                 INFINITY
> > > > s_vmem                INFINITY
> > > > h_vmem                3G,[@b_hosts=5G],[@x_hosts=17G]
> > > >
> > > >
> > > >
> > > > /BR: Gerolf
> > > >
> > > > --
> > > > Dipl. Phys. Gerolf Ziegenhain
> > > > Office: Room 46-332 - Erwin-Schrödinger-Str.46 - TU Kaiserslautern
> > > > - Germany
> > > > Web: gerolf.ziegenhain.com
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
