[GE users] only 1 set of Qs will run

Harry Mangalam harry.mangalam at uci.edu
Thu Nov 27 03:38:17 GMT 2008

I have 2 subclusters, one AMD64, one i32, both running CentOS5.2, both 
under control of one SGE 6.2, which is slowly starting to behave.  

The i32 nodes are running in a private net, along with the qmaster, 
which has a public and private interface.

The AMD64 nodes are running 'remotely' on public IP #s across campus.

Both groups show up correctly on a qhost query. 
Both sets of nodes can be ssh'ed into without a password from the 
login node, which has both private and public interfaces.

Because of the arch & geographic differences, I've set up different Qs 
to feed each subcluster (xxx_i32, xxx_a64)

After a few hiccups, the private net nodes are running both 
interactive and batch jobs correctly after being submitted from the 
login node, but the remote AMD64 nodes are still refusing to execute 
the jobs.

for example, trying to log into an a64 Q that has been defined to be 

----- example start -----
$ qrsh -verbose -q int_a64
local configuration bduc-login.nacs.uci.edu not defined - using global 
Your job 140 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (3 s) expired 
while waiting on socket fd 4

Your "qrsh" request could not be scheduled, try again later.
----- example end -----

A qsub of simple.sh to one of the a64 Qs is held in 'qw' 
or 'Pending' status until killed.

If I use qmon and click on the job and then the "why?" button, it 
shows:
scheduling info: (Collecting of scheduling job information is turned 
off)

It also USED to show this error:

Error for job 108: can't create directory active_jobs/108.1: Stale NFS 
file handle 

but after I restarted the sge_execd on the nodes, it no longer shows 
that error, just the one noted above.
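
The "scheduling info" line quoted above appears to mean only that the 
scheduler is not recording why jobs stay pending. As far as I can 
tell, this is governed by the schedd_job_info parameter in the 
scheduler configuration (shown by 'qconf -ssconf', edited with 
'qconf -msconf'). A small sketch for checking it follows; the function 
name job_info_enabled is made up for illustration, and it assumes the 
usual "key value" layout of the 'qconf -ssconf' output.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): reads scheduler configuration
# text on stdin and succeeds if schedd_job_info is set to anything
# other than "false" (that is, "true" or "job_list ...").
job_info_enabled() {
    awk '$1 == "schedd_job_info" { found = ($2 != "false") }
         END { exit (found ? 0 : 1) }'
}

# Intended use on the qmaster (assumes qconf is on the PATH):
#   qconf -ssconf | job_info_enabled \
#       || echo "set 'schedd_job_info true' via qconf -msconf"
# After enabling it, 'qstat -j <jobid>' should list the scheduler's
# reasons for leaving the job in qw.
```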

The differences between the Q definitions of the working i32 Qs and 
the nonworking a64 Qs are minimal:
$ qconf -sq int_i32 >int_i32.q_config
$ qconf -sq int_a64 >int_a64.q_config

$ diff int_a64.q_config int_i32.q_config
< qname                 int_a64
< hostlist              @int_a64
> qname                 int_i32
> hostlist              @int_i32
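
To make that comparison repeatable, the two dumps can be diffed with 
the lines that are expected to differ filtered out first. This is 
only a sketch: qdiff_ignoring is a made-up helper name, and it 
assumes the files are in the 'qconf -sq' key/value layout used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): diff two saved queue configs,
# ignoring the qname and hostlist lines, which are expected to differ.
# Returns diff's exit status: 0 if nothing else differs, 1 otherwise.
qdiff_ignoring() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    grep -Ev '^(qname|hostlist)[[:space:]]' "$1" > "$a"
    grep -Ev '^(qname|hostlist)[[:space:]]' "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Intended use with the files saved above:
#   qdiff_ignoring int_a64.q_config int_i32.q_config \
#       && echo "queues differ only in qname and hostlist"
```

If that reports no remaining differences, the cause is more likely in 
something the queue definitions point to (the host groups or the 
hosts themselves) than in the queue settings.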

I'm missing something but don't know what...

Harry Mangalam - Research Computing, NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824-0084(o), 949 285-4487(c)
Good judgment comes from experience; 
Experience comes from bad judgment. [F. Brooks.]

