[GE users] MPICH2 tight integration: Jobs not running

Reuti reuti at staff.uni-marburg.de
Tue Apr 1 14:05:38 BST 2008


Am 01.04.2008 um 11:49 schrieb Azhar Ali Shah:
> Reuti <reuti at staff.uni-marburg.de> wrote:
> > although often the supplied rsh-mechanism in SGE is safe enough in a
> > private cluster (as a dedicated port and daemon is started for each
> > qrsh - no need to have rshd running all the time) you can of course
> > also use ssh.
>
> > you are not using 1.0.6p1, as its SGE integration is broken (in
> > fact: rsh-startup is broken)?
>
> I am using 1.0.7rc2

Fine.


> > you compiled mpich2 in the default way, which will use ssh?
>
> I compiled it with the following configuration (where /usr/SGE6 is an
> NFS-mounted directory):
> ./configure --prefix=/usr/SGE6/mpich2_smpd --with-pm=smpd --with-pmi=smpd
>
> > you set up SGE according to
> > http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
>
> Yes

But your demo mpihello used rsh (according to the ps-output you posted)?


> > in the start/stop-script the created/removed link must be changed
> > to read ssh to also get a Tight Integration
>
> I can't understand this point. Could you please provide some hint?

SGE will catch the "rsh" command (initiated by mpiexec) by installing  
a link in $TMPDIR pointing to the rsh-wrapper. If your application is  
now calling "ssh" instead of "rsh", this catch-mechanism won't work  
as intended. You can either convince the application to use just rsh  
(as mentioned in the Howto: "MPIEXEC_RSH=rsh; export MPIEXEC_RSH"),  
or create a link called ssh in $TMPDIR (or both).
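As a sketch, the essential part of that catch mechanism looks like the following (throw-away demo paths throughout; in the real start_proc_args script the wrapper path comes from your Howto installation, e.g. $SGE_ROOT/mpi/rsh):

```shell
# Stand-ins for the job's $SGE_ROOT and $TMPDIR (demo directories only):
demo_root=$(mktemp -d)
demo_tmp=$(mktemp -d)
mkdir -p "$demo_root/mpi" && touch "$demo_root/mpi/rsh"   # the rsh-wrapper
# The essential step: link BOTH names to SGE's rsh-wrapper, so mpiexec's
# "ssh" calls are caught the same way as its "rsh" calls.
ln -sf "$demo_root/mpi/rsh" "$demo_tmp/rsh"
ln -sf "$demo_root/mpi/rsh" "$demo_tmp/ssh"
```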

Although the link will then be called ssh, SGE will still use rsh  
unless you set up SGE to use ssh.

You could even set "MPIEXEC_RSH=any_name; export MPIEXEC_RSH" and  
create a link called "any_name" in $TMPDIR pointing to SGE's rsh- 
wrapper. It's just a name, and whether you use rsh or ssh in the end  
doesn't matter at that point.
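In the job script this amounts to the following (the commented mpiexec line follows the smpd Howto and is illustration only, not executed here):

```shell
# The remote-shell name mpiexec will call; it must match a link that
# exists in $TMPDIR ("rsh" as in the Howto, or any_name as described above).
MPIEXEC_RSH=rsh; export MPIEXEC_RSH
# Typical daemonless launch from the Howto (shown, not run):
#   mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines ./mpihello
```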

Again the question: was your special application also compiled with  
this MPICH2 version's mpicc?

-- Reuti


> many thanks for your help
> Azhar
>
>
>
>
> > -catch_rsh /usr/SGE6//default/spool/smeg/active_jobs/117.1/pe_hostfile
> > smeg
> > justice
> > taramel
> > eomer
> > connect to address 128.243.24.47: Connection refused
> > connect to address 128.243.24.47: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > connect to address 128.243.24.98: Connection refused
> > connect to address 128.243.24.98: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > connect to address 128.243.18.20: Connection refused
> > connect to address 128.243.18.20: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > connect to address 128.243.24.110: Connection refused
> > connect to address 128.243.24.110: Connection refused
> > trying normal rsh (/usr/bin/rsh)
> > eomer.cs.nott.ac.uk: Connection refused
> > smeg.cs.nott.ac.uk: Connection refused
> > taramel.cs.nott.ac.uk: Connection refused
> > justice.cs.nott.ac.uk: Connection refused
>
>
>
> > what could be wrong?
> >
> >
> > Reuti wrote: Hi,
> >
> > what I see is:
> >
> > Am 28.03.2008 um 20:13 schrieb Azhar Ali Shah:
> > > Now it seems that the job gets scheduled, but it fails with the
> > > following log messages on the node:
> > >
> > > 03/28/2008 16:22:35|execd|justice|I|starting up GE 6.1u3 (lx24-x86)
> > > 03/28/2008 17:38:40|execd|justice|E|shepherd of job 85.1 exited
> > > with exit status = 10
> > > 03/28/2008 17:38:40|execd|justice|W|reaping job "85" ptf complains:
> > > Job does not exist
> > >
> > > I get the job failed email:
> > >
> > > Job 85 (pejob) Aborted
> > > Exit Status = -1
> > > Signal = unknown signal
> > > User = aas
> > > Queue = all.q at justice.cs.nott.ac.uk
> > > Host = justice.cs.nott.ac.uk
> > > Start Time =
> > > End Time =
> > > CPU = NA
> > > Max vmem = NA
> > > failed in pestart because:
> > >
> > > 03/28/2008 17:38:40 [0:4512]: exit_status of pe_start = 1
> > > I don't know what makes the PE behave like this?
> > >
> > >
> > > Reuti wrote: Hi,
> > >
> > > Am 28.03.2008 um 17:36 schrieb Azhar Ali Shah:
> > > > scheduling info:
> > > > cannot run in queue "all.q" because PE "mpich2_smpd_rsh" is not in
> > > > pe list
> > > > cannot run in PE "mpich2_smpd_rsh" because it only offers 0 slots
> > > >
> > > > but
> > > >
> > > > [aas at taramel sge_jobs]$ qconf -sp mpich2_smpd_rsh
> > > > pe_name mpich2_smpd_rsh
> > > > slots 999
> > > > user_lists NONE
> > > > xuser_lists NONE
> > > > start_proc_args /usr/SGE6/mpich2_smpd/startmpich2.sh -catch_rsh \
> > > > /home/aas/.smpd
> >
> > What did you include here? It must be the granted host list of SGE's
> > elected nodes, which will be reformatted - hence $pe_hostfile.
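Given that advice, the corrected PE entry would pass SGE's granted host list, i.e. (script path as shown in the qconf output above):

```
start_proc_args    /usr/SGE6/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile
```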
> >
> > Did you also adjust the PATHs in the start/stop-proc to reflect your
> > installation?
> >
> > -- Reuti
> >
> >
> > > > stop_proc_args /usr/SGE6/mpich2_smpd/stopmpich2.sh
> > > > allocation_rule $round_robin
> > > > control_slaves TRUE
> > > > job_is_first_task FALSE
> > > > urgency_slots min
> > > >
> > > > I am bit confused?
> > >
> > > you additionally have to attach this PE to a queue, in your case
> > > all.q, in the entry "pe_list mpich2_smpd_rsh"
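A sketch of attaching it with qconf (PE and queue names taken from this thread; adjust to your setup):

```shell
qconf -aattr queue pe_list mpich2_smpd_rsh all.q
# verify afterwards:
qconf -sq all.q | grep pe_list
```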
> > >
> > > -- Reuti
> > >
> > >
> > > >
> > > >
> > > > Chris Dagdigian wrote:
> > > > When you have a job in "qw" state, one of the best ways to learn
> > > > why it is still pending is to run:
> > > >
> > > > qstat -j <job_id>
> > > >
> > > > ... on the job that is in "qw" state. There will be information
> > > > in the output called "scheduling message" or "scheduling info"
> > > > that will give you some insight as to why the job could not be
> > > > placed during the previous scheduling interval.
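For example, with a pending job id (85 appears later in this thread; any "qw" job id works):

```shell
qstat -j 85
# the relevant output lines are labelled "scheduling info:"
```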
> > > >
> > > > Regards,
> > > > Chris
> > > >
> > > >
> > > > On Mar 27, 2008, at 4:28 PM, Azhar Ali Shah wrote:
> > > > > Hi,
> > > > >
> > > > > I am trying to integrate a daemonless smpd-based installation of
> > > > > MPICH2-1.0.7rc on an SGE cluster using:
> > > > > http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> > > > >
> > > > > After finishing the process, when I submit a test job it just
> > > > > goes into qw state. I checked startmpich2.sh and stopmpich2.sh
> > > > > and they work fine on the command line.
> > > > >
> > > > > Any ideas on what could possibly be wrong? I have read the docs
> > > > > again and again but it didn't help!
> > > > >
> > > > > thanks in advance for your help
> > > > > Azhar
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > -----------------------------------
> > > Azhar Ali Shah,
> > > Doctoral Student,
> > > Automated Scheduling Optimization And Planning(ASAP) Group,
> > > School of Computer Science, University of Nottingham, UK
> > > URL: http://www.cs.nott.ac.uk/~aas/
> > >
> >
> >
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



