[GE users] Running MPICH jobs

Reuti reuti at staff.uni-marburg.de
Tue May 2 15:12:28 BST 2006


On 02.05.2006 at 11:47, Duong Ta wrote:

> Dear Reuti,
>
> - I am using Linux x86_64, 2.6.9-22.0.2.ELsmp, AMD Dual-core machines.
> - The /etc/hosts file is identical on every node in the system.  
> Entries look like:
>
> IP address           xxx.ihpc.a-star.edu.sg xxx
> IP address           yyy.ihpc.a-star.edu.sg yyy
>
> - Command to submit MPI jobs: mpirun -np $NSLOTS -machinefile $TMPDIR/machines
>
> I do not quite understand this one. Could you please explain a bit more?
> >> For parallel jobs, resource requests apply to each requested slot.

E.g. if you request license=1, this will be multiplied by the number
of requested slots. Likewise, each memory request is handled as a
request for each task of the parallel job. - Reuti
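
As a rough illustration of that multiplication, here is a minimal sketch.
The slot count, the 512 MB figure and the use of h_vmem for the memory
request are assumed values for the example, not taken from this cluster,
and "license" stands for a consumable that the administrator would have
to define. It only does the arithmetic; it does not talk to SGE.

```shell
#!/bin/sh
# Assumed example values (hypothetical, for illustration only).
slots=7              # as requested with: qsub -pe <pe_name> 7
license_per_slot=1   # as requested with: -l license=1
mem_per_slot_mb=512  # as requested with: -l h_vmem=512M

# A per-slot request is charged once for every granted slot.
total_licenses=$((slots * license_per_slot))
total_mem_mb=$((slots * mem_per_slot_mb))

echo "licenses consumed by the job: $total_licenses"
echo "memory requested by the job:  ${total_mem_mb} MB"
```

So a job that should use only one license in total cannot simply ask
for license=1 together with 7 slots, because under this rule it would
be charged for 7.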


> Thank you very much for your time.
>
> Best regards,
> Duong
>
>
> On 5/2/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>
> Hi,
>
> On 02.05.2006 at 11:07, Duong Ta wrote:
>
> > Dear Reuti,
> >
> > Previously I used $round_robin in my PE. Just now I changed to
> > $fill_up, following what you suggested. With allocation_rule=$fill_up,
> > when I submit an MPI job requiring 4 slots, all 4 "fake" slots on one
> > machine are used and the other machine remains untouched. However,
> > when I request more than 4 slots, the job just keeps waiting
> > (status = qw).
> >
> > One more thing: occasionally I get this error when running MPI
> > jobs over SGE (I suspect this might be related to the above problem):
> >
> > error: commlib error: access denied (client IP resolved to host
> > name "". This is not identical to clients host name "")
> > error: executing task of job 198 failed: failed sending task to
> > execd at viz002: can't find connection
> > p0_14351:  p4_error: Child process exited while making connection
> > to remote process on viz002: 0
> > p0_14351: (3.828125 ) net_send: could not write to fd=4, errno = 32
>
> Which Linux distribution are you using? What does your /etc/hosts
> file look like (on the nodes)? Did you submit the job with
> mpirun -machinefile $TMPDIR/machines?
>
> -- Reuti
>
> BTW: For parallel jobs, resource requests apply to each requested slot.
>
>
> > Could there be something wrong with my cluster configuration?
> >
> > Thank you in advance.
> >
> > Best regards,
> > Duong
> >
> >
> > On 5/2/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> >
> > Hi,
> >
> > On 02.05.2006 at 09:45, Duong Ta wrote:
> >
> > > Dear Rayson,
> > >
> > > After I changed the slots attribute, the output of qstat -f is as
> > > follows:
> > >
> > > queuename                      qtype used/tot. load_avg arch          states
> > > ----------------------------------------------------------------------------
> > > all.q@viz001.ihpc.a-star.edu.s BIP   0/4       1.03     lx24-amd64
> > > ----------------------------------------------------------------------------
> > > all.q@viz002.ihpc.a-star.edu.s BIP   0/4       1.00     lx24-amd64
> > >
> > > I can then run a tightly integrated MPI job under SGE that requires
> > > 4 slots, plus a few more batch jobs at the same time. That means the
> > > trick worked, i.e., the system now has 8 "fake" slots. However, I
> > > could not run MPI jobs requiring more than 4 slots (which is the
> > > number of "real" slots in the system). Any advice?
> >
> > Which allocation_rule did you specify in your PE definition? For your
> > application it should be $round_robin or $fill_up.
> >
> > -- Reuti
> >
> >
> > >
> > > Thank you very much.
> > >
> > > Best regards,
> > > Duong
> > >
> > >
> > > On 5/2/06, Rayson Ho <rayrayson at gmail.com> wrote:
> > >
> > > You can change the "slots" attribute, something like:
> > > http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=13087
> > >
> > > Rayson
> > >
> > >
> > >
> > > On 5/1/06, Duong Ta <duongtnb at gmail.com> wrote:
> > > > Dear,
> > > >
> > > > I'd like to run an MPICH job under tight integration with SGE that
> > > > needs to start 7 processes (1 master, 6 slaves) in total. However,
> > > > my cluster only has 4 slots (2 dual-core execution hosts). Is there
> > > > any trick to force SGE to start more than one MPI process on a slot?
> > > >
> > > > Best regards,
> > > > Duong
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > >
> > >
> >
>

