[GE users] only 1 set of Qs will run

reuti reuti at staff.uni-marburg.de
Thu Nov 27 12:00:04 GMT 2008


On 27.11.2008 at 04:38, Harry Mangalam wrote:

> I have 2 subclusters, one AMD64, one i32, both running CentOS5.2, both
> under control of one SGE 6.2, which is slowly starting to behave.
>
> The i32 nodes are running in a private net, along with the qmaster,
> which has a public and private interface.
>
> The AMD64 nodes are running 'remotely' on public IP #s across campus.

Which qrsh_command et al. entries are you using: builtin, rsh or ssh?
Is a firewall active on the external interface of the login node?
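
To see which startup method is configured, something along these lines
could be used. It is only a rough sketch: the helper name
remote_startup_entries is made up for illustration, and it assumes the
usual qlogin_*, rlogin_* and rsh_* parameter names as printed by
"qconf -sconf".

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): keep only the remote-startup
# entries from a cluster configuration dump read on stdin.
remote_startup_entries() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Assumed usage on the login node (global configuration):
#   qconf -sconf | remote_startup_entries
```

If the entries show rsh or ssh rather than builtin, the matching daemon
has to be reachable on the exec hosts as well.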

-- Reuti

> Both groups show up correctly on a qhost query.
> Both sets of nodes allow passwordless ssh from the login node, which
> has both private and public interfaces.
>
> Because of the arch & geographic differences, I've set up different Qs
> to feed each subcluster (xxx_i32, xxx_a64)
>
> After a few hiccups, the private net nodes are running both
> interactive and batch jobs correctly after being submitted from the
> login node, but the remote AMD64 nodes are still refusing to execute
> the jobs.
>
> For example, here is an attempt to log into an a64 Q that has been
> defined as interactive:
>
> ----- example start -----
> $ qrsh -verbose -q int_a64
> local configuration bduc-login.nacs.uci.edu not defined - using global
> configuration
> Your job 140 ("QRLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired
> while waiting on socket fd 4
>
> Your "qrsh" request could not be scheduled, try again later.
> ----- example end -----
>
>
> A qsub of simple.sh to one of the a64 Qs is held in 'qw'
> or 'Pending' status until killed.
>
> If I use qmon and click on the job and then the "why?" button, it
> shows:
> scheduling info: (Collecting of scheduling job information is turned
> off)
>
> It also USED to show this error:
>
> Error for job 108: can't create directory active_jobs/108.1: Stale NFS
> file handle
>
> but after I restarted the sge_execd on the nodes, it no longer shows
> that error, just the one noted above.
>
> The differences between the Q definitions of the working i32 Qs and
> the nonworking a64 Qs are minimal:
> $ qconf -sq int_i32 >int_i32.q_config
> $ qconf -sq int_a64 >int_a64.q_config
>
> $ diff int_a64.q_config int_i32.q_config
> 1,2c1,2
> < qname                 int_a64
> < hostlist              @int_a64
> ---
>> qname                 int_i32
>> hostlist              @int_i32
>
> I'm missing something but don't know what...
>
> -- 
> Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway,
> UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
> ---
> Good judgment comes from experience;
> Experience comes from bad judgment. [F. Brooks.]
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90044
>
> To unsubscribe from this discussion, e-mail:
> [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=90081

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


