[GE users] only 1 set of Qs will run

Harry Mangalam harry.mangalam at uci.edu
Mon Dec 1 18:57:54 GMT 2008

The solution to this problem is embarrassing, but since it relates to 
the mixed-administration model we're running under (not unfamiliar in 
the academic world), I thought it might be useful to others in the 
same boat, even though the lessons are the definition of common sense:

1: Having mixed administration of SGE subclusters is, in general, not 
a good idea.

2: If you have to do it, agree on a common configuration approach and 
common config files.

3: If you don't, you'll be posting confusing questions to the SGE list 
like mine.

The problem was that the admin of the local nodes and the admin of the 
remote nodes were using different config approaches and variables: the 
local nodes were using the default ports 536/537 (successfully), while 
the remote nodes were trying to use ports 1536/1537 (unsuccessfully).
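In hindsight, a mismatch like this can be caught with a quick diff of 
the port settings each admin is shipping. A sketch only: the two 
settings files below are hypothetical stand-ins for each subcluster's 
$SGE_ROOT/default/common/settings.sh, and the filenames are made up.

```shell
#!/bin/sh
# Hypothetical copies of each subcluster's SGE settings file.
cat > local_settings.sh <<'EOF'
SGE_QMASTER_PORT=536; export SGE_QMASTER_PORT
SGE_EXECD_PORT=537; export SGE_EXECD_PORT
EOF

cat > remote_settings.sh <<'EOF'
SGE_QMASTER_PORT=1536; export SGE_QMASTER_PORT
SGE_EXECD_PORT=1537; export SGE_EXECD_PORT
EOF

# Lines appearing in only one file indicate a config mismatch;
# silence here means the two subclusters agree on their ports.
grep -h 'SGE_.*_PORT' local_settings.sh remote_settings.sh | sort | uniq -u
```

Had we run something like this up front, the 536/1536 disagreement 
would have been visible immediately.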

It was my fault for not enforcing a common config, but it would have 
been much easier to debug if the error message had indicated the port 
number, e.g.:

" ...timeout (3 s) expired while waiting on port 1536"

instead of:

" ...timeout (3 s) expired while waiting on socket fd 4"
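Until the port does show up in the error text, the port a client will 
try can be read from the standard SGE environment variables, falling 
back to the sge_qmaster/sge_execd entries in /etc/services. A sketch 
(the strace line in the comment is illustrative, and the queue name 
is from my setup):

```shell
#!/bin/sh
# Show which qmaster/execd ports this shell's SGE client would use.
echo "SGE_QMASTER_PORT=${SGE_QMASTER_PORT:-unset (falls back to /etc/services)}"
echo "SGE_EXECD_PORT=${SGE_EXECD_PORT:-unset (falls back to /etc/services)}"

# Fallback source: service entries, if the admin defined them.
getent services sge_qmaster sge_execd 2>/dev/null \
  || echo "no sge_qmaster/sge_execd entries in /etc/services"

# When all else fails, watching the connect() calls exposes the real
# target port, e.g.:
#   strace -f -e trace=connect qrsh -q int_a64 true 2>&1 | grep connect
```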

Apologies to those whose time I wasted, and many thanks to them for 
helping me track it down and suggest corrections to my config.


On Wednesday 26 November 2008, Harry Mangalam wrote:
> I have 2 subclusters, one AMD64, one i32, both running CentOS5.2,
> both under control of one SGE 6.2, which is slowly starting to
> behave.
> The i32 nodes are running in a private net, along with the qmaster,
> which has a public and private interface.
> The AMD64 nodes are running 'remotely' on public IP #s across
> campus.
> Both groups show up correctly on a qhost query.
> Both nodes can be passwordlessly ssh'ed into from the login node
> which has both private and public interfaces.
> Because of the arch & geographic differences, I've set up different
> Qs to feed each subcluster (xxx_i32, xxx_a64)
> After a few hiccups, the private net nodes are running both
> interactive and batch jobs correctly after being submitted from the
> login node, but the remote AMD64 nodes are still refusing to
> execute the jobs.
> For example, trying to log into an a64 Q that has been defined to
> be interactive:
> ----- example start -----
> $ qrsh -verbose -q int_a64
> local configuration bduc-login.nacs.uci.edu not defined - using
> global configuration
> Your job 140 ("QRLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s)
> expired while waiting on socket fd 4
> Your "qrsh" request could not be scheduled, try again later.
> ----- example end -----
> A qsub of simple.sh to one of the a64 Qs is held in 'qw'
> or 'Pending' status until killed.
> If I use qmon and click on the job and then the "why?" button, it
> shows:
> scheduling info: (Collecting of scheduling job information is
> turned off)
> It also USED to show this error:
> Error for job 108: can't create directory active_jobs/108.1: Stale
> NFS file handle
> but after I restarted the sge_execd on the nodes, it no longer
> shows that error, just the one noted above.
> The differences between the Q definitions of the working i32 Qs and
> the nonworking a64 Qs are minimal:
> $ qconf -sq int_i32 >int_i32.q_config
> $ qconf -sq int_a64 >int_a64.q_config
> $ diff int_a64.q_config int_i32.q_config
> 1,2c1,2
> < qname                 int_a64
> < hostlist              @int_a64
> ---
> > qname                 int_i32
> > hostlist              @int_i32
> I'm missing something but don't know what...

Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
Good judgment comes from experience; 
Experience comes from bad judgment. [F. Brooks.]
