[GE users] Running MPICH jobs

Duong Ta duongtnb at gmail.com
Tue May 2 10:07:19 BST 2006


Dear Reuti,

Previously I used $round_robin in my PE. I have just changed it to $fill_up,
as you suggested. With allocation_rule=$fill_up, when I submit an MPI job
requiring 4 slots, all 4 "fake" slots on one machine are used and the other
machine remains untouched. However, when I request more than 4 slots, the
job just keeps waiting (state = qw).
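
To make the difference between the two rules concrete, here is a minimal Python sketch of how $fill_up and $round_robin would spread a request over the two hosts, each with 4 configured slots. This is only an illustration under simplifying assumptions, not SGE's scheduler code: the function name `allocate` is hypothetical, hosts are taken in a fixed order, and load-based host ordering and any other limits are ignored.

```python
def allocate(rule, hosts, requested):
    """Distribute `requested` slots over `hosts` according to `rule`.

    hosts: list of (hostname, free_slots) in the order they are considered.
    Returns a dict hostname -> granted slots, or None if the request
    cannot be satisfied (the job would stay pending).
    """
    free = dict(hosts)
    if requested > sum(free.values()):
        return None

    granted = {name: 0 for name, _ in hosts}
    remaining = requested

    if rule == "$fill_up":
        # Take as many slots as possible from one host before moving on.
        for name, _ in hosts:
            take = min(free[name], remaining)
            granted[name] = take
            remaining -= take
            if remaining == 0:
                break
    elif rule == "$round_robin":
        # Hand out one slot per host per pass until the request is met.
        while remaining > 0:
            for name, _ in hosts:
                if remaining == 0:
                    break
                if granted[name] < free[name]:
                    granted[name] += 1
                    remaining -= 1
    else:
        raise ValueError("unsupported allocation rule: %s" % rule)

    return granted


hosts = [("viz001", 4), ("viz002", 4)]
print(allocate("$fill_up", hosts, 4))
print(allocate("$round_robin", hosts, 4))
print(allocate("$fill_up", hosts, 7))
```

Under these assumptions a 4-slot request lands entirely on one host with $fill_up and is split evenly with $round_robin, which matches what I see for 4 slots. A 7-slot request would fit on the two hosts under either rule in this sketch, so the pending state for larger requests presumably comes from something the sketch does not model.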

One more thing: occasionally I get this error when running MPI jobs under
SGE (I suspect it might be related to the problem above):

error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
error: executing task of job 198 failed: failed sending task to execd@viz002: can't find connection
p0_14351:  p4_error: Child process exited while making connection to remote process on viz002: 0
p0_14351: (3.828125) net_send: could not write to fd=4, errno = 32

Could there be something wrong with my cluster configuration?
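
Since the commlib message complains that the name resolved from the client IP does not match the client's host name, one thing that can be checked on each node is whether forward and reverse lookups agree. The following is a rough Python sketch of such a check, not an SGE tool: the function name `check_host` is hypothetical, and comparing only the short host names is an assumption on my part that may not match how SGE compares names on a given installation.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve `name` to an IP, resolve that IP back to a name, and compare.

    Returns (ip, reverse_name, consistent). The `forward` and `reverse`
    parameters default to the standard resolver calls and exist only so
    the function can be exercised without a network.
    """
    ip = forward(name)
    # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
    reverse_name = reverse(ip)[0]

    def short(host):
        # Assumption: compare only the part before the first dot,
        # case-insensitively.
        return host.split(".")[0].lower()

    return ip, reverse_name, short(name) == short(reverse_name)


# Example use on a cluster node (performs real lookups):
#   for host in ("viz001", "viz002"):
#       print(host, check_host(host))
```

Running this on both execution hosts and on the master, and comparing the results, should show whether every machine sees the same name for the same address.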

Thank you in advance.

Best regards,
Duong


On 5/2/06, Reuti <reuti at staff.uni-marburg.de> wrote:
>
> Hi,
>
> On 02.05.2006 at 09:45, Duong Ta wrote:
>
> > Dear Rayson,
> >
> > After I changed the slots attribute, the output of qstat -f is as
> > follows:
> >
> > queuename                      qtype used/tot. load_avg arch          states
> > ----------------------------------------------------------------------------
> > all.q@viz001.ihpc.a-star.edu.s BIP   0/4       1.03     lx24-amd64
> > ----------------------------------------------------------------------------
> > all.q@viz002.ihpc.a-star.edu.s BIP   0/4       1.00     lx24-amd64
> >
> > I am now able to run a tightly integrated MPI job under SGE that
> > requires 4 slots, plus a few more batch jobs at the same time. That
> > means the trick worked, i.e., the system now has 8 "fake" slots.
> > However, I could not run MPI jobs requiring more than 4 slots
> > (which is the number of "real" slots in the system). Any advice?
>
> Which allocation_rule did you specify in your PE definition? For your
> application it should be $round_robin or $fill_up.
>
> -- Reuti
>
>
> >
> > Thank you very much.
> >
> > Best regards,
> > Duong
> >
> >
> > On 5/2/06, Rayson Ho <rayrayson at gmail.com> wrote:
> > You can change the "slots" attribute, something like:
> > http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=13087
> >
> > Rayson
> >
> >
> >
> > On 5/1/06, Duong Ta <duongtnb at gmail.com> wrote:
> > > Dear,
> > >
> > > I'd like to run an MPICH job, tightly integrated with SGE, that needs to
> > > start 7 processes (1 master, 6 slaves) in total. However, my cluster only
> > > has 4 slots (2 dual-core execution hosts). Is there any trick to force SGE
> > > to start more than one MPI process on a slot?
> > >
> > > Best regards,
> > > Duong
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
>


