[GE users] Interactive jobs not starting

Reuti reuti at staff.uni-marburg.de
Fri Nov 30 22:41:43 GMT 2007


On 30.11.2007 at 22:22, VS Ang wrote:

> Yes, with the "patched" ssh with tight-integration. There are no  
> firewalls on the cluster. Also, I tried using the patched "ssh"  
> command to login to the node directly, and it works fine. Only when  
> doing qrsh or qlogin it doesn't work.

In that case you tested only the default port 22. AFAIK the Tight SSH
Integration will still use a random port, but supply the additional
group ID.
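
To illustrate the difference between a fixed port 22 and a random port, here is a minimal Python sketch. It is not part of SGE, and the function names listen_on_random_port and can_connect are made up for this example. It lets the operating system pick a free port by binding to port 0, then checks whether a TCP connection to that port succeeds. Run as shown it only tests the loopback interface; to check for blocked ports between two nodes you would, as an assumption about your setup, run the listener on one host bound to its real address and call can_connect from the other.

```python
import socket


def listen_on_random_port(host="127.0.0.1"):
    """Open a listening TCP socket on a port chosen by the OS (port 0)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))   # port 0 asks the kernel for any free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    srv, port = listen_on_random_port()
    print("listening on port", port)
    print("reachable:", can_connect("127.0.0.1", port))
    srv.close()
```

A successful login on port 22 says nothing about whether such a randomly chosen port is reachable, which is why the direct ssh test and qrsh can behave differently.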

-- Reuti


> ----- Original Message ----
> From: Reuti <reuti at staff.uni-marburg.de>
> To: users at gridengine.sunsource.net
> Sent: Friday, November 30, 2007 8:50:18 AM
> Subject: Re: [GE users] Interactive jobs not starting
>
> Hi,
>
> On 30.11.2007 at 00:12, VS Ang wrote:
>
> > Hello,
> >
> > When I attempt to submit interactive jobs using qrsh or qlogin
> > commands, the job never starts. The "qrsh" command simply returns
> > after a while:
>
> Is this with the default rsh or with the ssh you defined (from your
> other post)? Is any firewall active that blocks certain ports?
>
> -- Reuti
>
>
> > $ qrsh -verbose
> > Your job 45 ("QRLOGIN") has been submitted
> > waiting for interactive job to be scheduled ...timeout (3 s)
> > expired while waiting on socket fd 4
> >
> > Could not start interactive job.
> >
> > Same thing happens with qlogin:
> >
> > $ qlogin -verbose
> > Your job 46 ("QLOGIN") has been submitted
> > waiting for interactive job to be scheduled ...timeout (4 s)
> > expired while waiting on socket fd 4
> >
> > Could not start interactive job.
> >
> > Also, in the messages of the compute nodes, I see the following
> > errors.
> >
> > 11/29/2007 16:01:26|execd|compute-1-5|E|shepherd of job 26.1 exited
> > with exit status = 9
> > 11/29/2007 16:01:26|execd|compute-1-5|W|reaping job "26" ptf
> > complains: Job does not exist
> > 11/29/2007 16:07:23|execd|compute-1-5|E|shepherd of job 27.1 exited
> > with exit status = 9
> > 11/29/2007 16:07:23|execd|compute-1-5|W|reaping job "27" ptf
> > complains: Job does not exist
> > 11/29/2007 16:44:37|execd|compute-1-5|W|reaping job "33" ptf
> > complains: Job does not exist
> > 11/29/2007 17:57:05|execd|compute-1-5|E|shepherd of job 43.1 exited
> > with exit status = 11
> > 11/29/2007 17:57:05|execd|compute-1-5|W|reaping job "43" ptf
> > complains: Job does not exist
> > 11/29/2007 18:07:34|execd|compute-1-5|W|reaping job "46" ptf
> > complains: Job does not exist
> >
> >
> > And, on the qmaster host:
> >
> > 11/29/2007 17:57:06|qmaster|admin|W|job 43.1 failed on host
> > compute-1-5.local general before job because: 11/29/2007 17:57:05
> > [0:18592]: can't open file /tmp/43.1.all.q/pid: No such file or
> > directory
> > 11/29/2007 17:57:06|qmaster|admin|W|rescheduling job 43.1
> > 11/29/2007 18:06:27|qmaster|admin|W|job 44.1 failed on host
> > compute-1-4.local assumedly after job because: job 44.1 died
> > through signal KILL (9)
> > 11/29/2007 18:07:03|qmaster|admin|W|job 45.1 failed on host
> > compute-1-1.local assumedly after job because: job 45.1 died
> > through signal KILL (9)
> > 11/29/2007 18:07:35|qmaster|admin|W|job 46.1 failed on host
> > compute-1-5.local assumedly after job because: job 46.1 died
> > through signal KILL (9)
> >
> > I am using SGE 6.1u2 (compiled out of sources). Any pointers will
> > be appreciated!
> >
> > Srihari
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>
