[GE users] MPICH2 tight integration: Jobs not running

Azhar Ali Shah aas_lakyari at yahoo.com
Tue Apr 1 10:49:29 BST 2008



Reuti <reuti at staff.uni-marburg.de> wrote:

>although often the supplied rsh-mechanism in SGE is safe enough in a
>private cluster (as a dedicated port and daemon is started for each
>qrsh - no need to have rshd running all the time) you can of course
>also use ssh.

>you are not using 1.0.6p1, as its SGE integration is broken (in
>fact: rsh-startup is broken)?

I am using 1.0.7rc2

>you compiled mpich2 in the default way, which will use ssh?

I compiled it with the following configuration (where /usr/SGE6 is an NFS-mounted directory):

./configure --prefix=/usr/SGE6/mpich2_smpd --with-pm=smpd --with-pmi=smpd
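
(For reference, with an smpd build like this the job script itself drives the daemonless startup; a minimal sketch, assuming the howto's rsh-style flags and that the start-script writes the reformatted host list to $TMPDIR/machines - the application path is illustrative:

    #!/bin/sh
    # -rsh -nopm: let the smpd mpiexec spawn the ranks itself via
    # rsh/ssh instead of contacting pre-started smpd daemons
    /usr/SGE6/mpich2_smpd/bin/mpiexec -rsh -nopm -n $NSLOTS \
        -machinefile $TMPDIR/machines /path/to/mpi_app

)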


>you set up SGE according to http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html

Yes
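
(In essence that howto re-points SGE's qrsh machinery at ssh; a minimal sketch of the relevant cluster configuration entries, assuming stock OpenSSH paths - edited via qconf -mconf:

    rsh_command              /usr/bin/ssh
    rsh_daemon               /usr/sbin/sshd -i

)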

>in the start/stop-script the created/removed link must be changed
>to read ssh to also get a Tight Integration

I can't understand this point. Could you please provide a hint?
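
(The point referred to: the -catch_rsh start-script places a link named rsh in the job's $TMPDIR, pointing at SGE's rsh wrapper, so that the MPI startup is transparently routed through qrsh. If MPICH2 execs ssh instead, the link must carry that name. A rough sketch of the idea, not the howto's literal script:

    # startmpich2.sh: name the wrapper link after the remote shell
    # the MPI library will actually call - ssh instead of rsh
    ln -s $SGE_ROOT/mpi/rsh $TMPDIR/ssh
    # stopmpich2.sh: remove the same link on teardown
    rm -f $TMPDIR/ssh

)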

many thanks for your help
Azhar



Reuti <reuti at staff.uni-marburg.de> also wrote:

About your special application: you also compiled it on your own  
using your MPICH2 build?

-- Reuti
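
(If not, a sketch of compiling against the same build, using the prefix from the configure line above - the source file name is illustrative:

    /usr/SGE6/mpich2_smpd/bin/mpicc -o my_app my_app.c

)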


> -catch_rsh /usr/SGE6//default/spool/smeg/active_jobs/117.1/pe_hostfile
> smeg
> justice
> taramel
> eomer
> connect to address 128.243.24.47: Connection refused
> connect to address 128.243.24.47: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 128.243.24.98: Connection refused
> connect to address 128.243.24.98: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 128.243.18.20: Connection refused
> connect to address 128.243.18.20: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 128.243.24.110: Connection refused
> connect to address 128.243.24.110: Connection refused
> trying normal rsh (/usr/bin/rsh)
> eomer.cs.nott.ac.uk: Connection refused
> smeg.cs.nott.ac.uk: Connection refused
> taramel.cs.nott.ac.uk: Connection refused
> justice.cs.nott.ac.uk: Connection refused
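
(The fallback to "normal rsh (/usr/bin/rsh)" followed by "Connection refused" suggests the wrapper link was not picked up and plain rsh was executed against nodes with no rshd listening - expected on an ssh-only cluster. A quick sanity check, assuming the ssh setup discussed above:

    # confirm SGE's remote-startup plumbing points at ssh, not rsh
    qconf -sconf | egrep 'rsh_command|rsh_daemon'

)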



> what could be wrong?
>
>
> Reuti  wrote: Hi,
>
> what I see is:
>
> On 28.03.2008 at 20:13, Azhar Ali Shah wrote:
> > Now it seems that the job gets scheduled, but it fails with the
> > following log messages on the node:
> >
> > 03/28/2008 16:22:35|execd|justice|I|starting up GE 6.1u3 (lx24-x86)
> > 03/28/2008 17:38:40|execd|justice|E|shepherd of job 85.1 exited
> > with exit status = 10
> > 03/28/2008 17:38:40|execd|justice|W|reaping job "85" ptf complains:
> > Job does not exist
> >
> > I get the job failed email:
> >
> > Job 85 (pejob) Aborted
> > Exit Status = -1
> > Signal = unknown signal
> > User = aas
> > Queue = all.q at justice.cs.nott.ac.uk
> > Host = justice.cs.nott.ac.uk
> > Start Time =
> > End Time =
> > CPU = NA
> > Max vmem = NA
> > failed in pestart because:
> >
> > 03/28/2008 17:38:40 [0:4512]: exit_status of pe_start = 1
> > I don't know what makes the PE behave like this.
> >
> >
> > Reuti wrote: Hi,
> >
> > On 28.03.2008 at 17:36, Azhar Ali Shah wrote:
> > > scheduling info:
> > > cannot run in queue "all.q" because PE "mpich2_smpd_rsh" is not in
> > > pe list
> > > cannot run in PE "mpich2_smpd_rsh" because it only offers 0 slots
> > >
> > > but
> > >
> > > [aas at taramel sge_jobs]$ qconf -sp mpich2_smpd_rsh
> > > pe_name mpich2_smpd_rsh
> > > slots 999
> > > user_lists NONE
> > > xuser_lists NONE
> > > start_proc_args /usr/SGE6/mpich2_smpd/startmpich2.sh -catch_rsh \
> > > /home/aas/.smpd
>
> What did you include here? It must be the granted host list of SGE's
> elected nodes, which will be reformatted - hence $pe_hostfile.
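
(In other words, -catch_rsh must receive SGE's generated host file; the corrected entry would presumably read, keeping the paths from the definition above:

    start_proc_args    /usr/SGE6/mpich2_smpd/startmpich2.sh -catch_rsh \
                       $pe_hostfile

)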
>
> Did you also adjust the PATHs in the start/stop-proc to reflect your
> installation?
>
> -- Reuti
>
>
> > > stop_proc_args /usr/SGE6/mpich2_smpd/stopmpich2.sh
> > > allocation_rule $round_robin
> > > control_slaves TRUE
> > > job_is_first_task FALSE
> > > urgency_slots min
> > >
> > > I am a bit confused.
> >
> > in addition you have to attach this PE to a queue, in your case
> > all.q, in its entry "pe_list mpich2_smpd_rsh"
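
(For example, from the command line - queue and PE names as above:

    # append the PE to all.q's pe_list without opening an editor
    qconf -aattr queue pe_list mpich2_smpd_rsh all.q

or run qconf -mq all.q and extend the pe_list line by hand.)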
> >
> > -- Reuti
> >
> >
> > >
> > >
> > > Chris Dagdigian wrote:
> > > When you have jobs in "qw" state, one of the best ways to learn
> > > why it is still pending is to run:
> > >
> > > qstat -j <job_id>
> > >
> > > ... on the job that is in "qw" state. There will be information
> > > in the output called "scheduling message" or "scheduling info"
> > > that will give you some insight as to why the job could not be
> > > placed during the previous scheduling interval.
> > >
> > > Regards,
> > > Chris
> > >
> > >
> > > On Mar 27, 2008, at 4:28 PM, Azhar Ali Shah wrote:
> > > > Hi,
> > > >
> > > > I am trying to integrate a daemonless, smpd-based installation of
> > > > MPICH2-1.0.7rc on an SGE cluster using:
> > > > http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
> > > >
> > > > After finishing the process, when I submit a test job it just
> > > > goes into qw state. I checked startmpich2.sh and stopmpich2.sh
> > > > and they work fine on the command line.
> > > >
> > > > Any ideas on what could possibly be wrong? I have read the docs
> > > > again and again, but it didn't help!
> > > >
> > > > thanks in advance for your help
> > > > Azhar
> > > >
> > >
> > >
> > >
> >
> > --
> > Azhar Ali Shah,
> > Doctoral Student,
> > Automated Scheduling Optimization And Planning(ASAP) Group,
> > School of Computer Science, University of Nottingham, UK
> > URL: http://www.cs.nott.ac.uk/~aas/