[GE users] Running MPICH jobs

Duong Ta duongtnb at gmail.com
Tue May 2 11:31:10 BST 2006



Dear Goncalo,

I think I've fixed the problem. The cause was that I forgot to change the
number of slots in the PE configuration (from 4 to 8) after running
qconf -mattr queue slots 8 `qselect`.
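
As a minimal sketch of why the PE setting matters, the check below models a
parallel job as runnable only if its request fits both the PE's own slot
limit and the slots offered by the queue instances. This is a simplified
model, not SGE's actual scheduler logic: the function name can_schedule is
hypothetical, and slots already in use, load thresholds, and allocation
rules are ignored. The host names and slot counts are taken from the
qstat -f output quoted further down.

```python
def can_schedule(requested, pe_slots, queue_slots_per_host):
    """Return (ok, reason) for a parallel job requesting `requested` slots.

    Simplified assumption: the job needs requested <= PE slot limit and
    requested <= sum of the slots of all queue instances.
    """
    total_queue_slots = sum(queue_slots_per_host.values())
    if requested > pe_slots:
        return False, f"PE offers {pe_slots} slots, job requests {requested}"
    if requested > total_queue_slots:
        return False, f"queues offer {total_queue_slots} slots, job requests {requested}"
    return True, "ok"


# Two queue instances with 4 slots each, as in the qstat -f output.
hosts = {"viz001": 4, "viz002": 4}

# PE still limited to 4 slots: a 6-slot job stays pending.
print(can_schedule(6, 4, hosts))

# PE raised to 8 slots: the same job fits.
print(can_schedule(6, 8, hosts))
```

Under this model, raising only the queue slots is not enough; the PE limit
has to be raised as well before a 6-slot job can leave the qw state.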

It also seems that one of our execution hosts has a problem, but I'm not
sure what the cause is, because I occasionally get the following error
message. After I removed this node from SGE and used another one,
the problem went away.

error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
error: executing task of job 198 failed: failed sending task to execd at viz002: can't find connection
p0_14351:  p4_error: Child process exited while making connection to remote process on viz002: 0
p0_14351: (3.828125) net_send: could not write to fd=4, errno = 32
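
The commlib message complains that the host name obtained from the client's
IP address does not match the client's own host name. Whether name
resolution was really at fault on viz002 is not established in this thread,
but a forward/reverse lookup check is a cheap first test. The sketch below
is an assumption-laden illustration: check_host is a hypothetical helper,
and the resolver functions are parameters only so that the logic can be
exercised without a network.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, detail): does the reverse lookup of `name`'s IP give `name` back?

    `forward` maps a host name to an IP string; `reverse` maps an IP string
    to a (hostname, aliases, addresses) tuple, like the socket functions do.
    """
    try:
        ip = forward(name)
        resolved, aliases, _ = reverse(ip)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        return False, f"{name}: lookup failed ({exc})"
    known_names = {resolved.lower(), *(alias.lower() for alias in aliases)}
    if name.lower() in known_names:
        return True, f"{name} -> {ip} -> {resolved}"
    return False, f"{name} -> {ip} -> {resolved} (mismatch)"
```

Calling check_host("viz002") on the qmaster host and on each execution host,
and comparing the results, would show whether the hosts agree on each
other's names.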

Thank you all for your time and help.

Best regards,
Duong

On 5/2/06, Goncalo Borges <goncalo at lip.pt> wrote:
>
>
> Hi,
> can you run "qconf -sp <name_of_the_pe>" and post the output?
> Thanks
> Goncalo
>
>
> On Tue, 2 May 2006, Duong Ta wrote:
>
> > Dear Goncalo,
> >
> > Yes, I did a qstat -j, and the reason was: Jobs can not run because
> > available slots combined under PE are not in range of job 199.
> >
> > A qstat -f gave (the job requires 6 slots):
> >
> > queuename                      qtype used/tot. load_avg arch          states
> > ----------------------------------------------------------------------------
> > all.q at viz001.ihpc.a-star.edu.s BIP   0/4       0.00     lx24-amd64
> > ----------------------------------------------------------------------------
> > all.q at viz002.ihpc.a-star.edu.s BIP   0/4       0.01     lx24-amd64
> >
> > ############################################################################
> > - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> > ############################################################################
> >    199 0.55500 cpi        griduser     qw    05/02/2006 16:54:05     6
> >
> > Best regards,
> > Duong
> >
> > On 5/2/06, Goncalo Borges <goncalo at lip.pt> wrote:
> > >
> > >
> > > Hi Duong,
> > > The fact that your job is permanently waiting could be due to several
> > > things, most of them probably related to configuration issues.
> > > Therefore I suggest the following:
> > >
> > > - When your job is in the waiting state, please run "qstat -j". This will
> > > give you a message explaining why it is not running.
> > > Alternatively, you can use the SGE GUI tool qmon. Choose the
> > > job panel -> pending jobs panel -> explain button
> > > to see why your job is not running.
> > >
> > > Cheers
> > > Goncalo
> > >
> > >
> > >
> > >
> > > On Tue, 2 May 2006, Duong Ta wrote:
> > >
> > > > Dear Reuti,
> > > >
> > > > > Previously I used $round_robin in my PE. Just now I changed to $fill_up
> > > > > following what you have suggested. For allocation_rule=$fill_up, when I
> > > > > submit an MPI job requiring 4 slots, all 4 "fake" slots on one machine are
> > > > > used and the other machine remains untouched. However, when I try to
> > > > > request more than 4 slots, the job just keeps waiting (status = qw).
> > > >
> > > > > One more thing, occasionally I've got this error when running MPI jobs
> > > > > over SGE (I suspect this might be related to the above problem):
> > > > >
> > > > > error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
> > > > > error: executing task of job 198 failed: failed sending task to execd at viz002: can't find connection
> > > > > p0_14351:  p4_error: Child process exited while making connection to remote process on viz002: 0
> > > > > p0_14351: (3.828125) net_send: could not write to fd=4, errno = 32
> > > >
> > > > > Could there be something wrong with my cluster configuration?
> > > >
> > > > Thank you in advance.
> > > >
> > > > Best regards,
> > > > Duong
> > > >
> > > >
> > > > On 5/2/06, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > On 02.05.2006 at 09:45, Duong Ta wrote:
> > > > >
> > > > > > Dear Rayson,
> > > > > >
> > > > > > > After I changed the slots attribute, the output of qstat -f is as
> > > > > > > follows:
> > > > > > >
> > > > > > > queuename                      qtype used/tot. load_avg arch          states
> > > > > > > ----------------------------------------------------------------------------
> > > > > > > all.q at viz001.ihpc.a-star.edu.s BIP   0/4       1.03     lx24-amd64
> > > > > > > ----------------------------------------------------------------------------
> > > > > > > all.q at viz002.ihpc.a-star.edu.s BIP   0/4       1.00     lx24-amd64
> > > > > > >
> > > > > > > Then I am able to run over SGE a tightly-integrated MPI job that
> > > > > > > requires 4 slots, plus a few more batch jobs at the same time. That
> > > > > > > means the trick worked, i.e., the system now has 8 "fake" slots.
> > > > > > > However, I could not run MPI jobs requiring more than 4 slots
> > > > > > > (which is the number of "real" slots in the system). Any advice?
> > > > >
> > > > > > Which allocation_rule did you specify in your PE definition? For your
> > > > > > application it should be $round_robin or $fill_up.
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > >
> > > > > > Thank you very much.
> > > > > >
> > > > > > Best regards,
> > > > > > Duong
> > > > > >
> > > > > >
> > > > > > > On 5/2/06, Rayson Ho <rayrayson at gmail.com> wrote:
> > > > > > > You can change the "slots" attribute, something like:
> > > > > > > http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=13087
> > > > > >
> > > > > > Rayson
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 5/1/06, Duong Ta <duongtnb at gmail.com> wrote:
> > > > > > > Dear,
> > > > > > >
> > > > > > > > I'd like to run an MPICH job with tight SGE integration that needs to
> > > > > > > > start 7 processes (1 master, 6 slaves) in total. However, my cluster only
> > > > > > > > has 4 slots (2 dual-core execution hosts). Is there any trick to force SGE
> > > > > > > > to start more than one MPI process on a slot?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Duong
> > > > > > >
> > > > > >
> > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > > > > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > > > > >
> > > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > > > For additional commands, e-mail:
> users-help at gridengine.sunsource.net
> > > > >
> > > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > > For additional commands, e-mail: users-help at gridengine.sunsource.net
> > >
> > >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>


