[GE users] Diagnosis on a queue

mad margaret_Doll at brown.edu
Fri May 15 13:51:05 BST 2009

Why would Grid Engine not distribute jobs evenly across the queues, and
how should I go about finding the problem?
Why would one compute node in particular keep having problems with its queues?

I have various groups of users on a ROCKS system where we are using  
Grid Engine 6.1u4 to manage the queues.  I have different queues for  
each of the groups.  I am having problems only with one of the compute  
nodes on the chemistry and het queues.

I keep getting

queue:  chemistry@compute-0-33.local
		queue chemistry marked QERROR as result of job 21162's failure at host compute-0-33.local++++++++


qstat -u user1
  21162 0.50500 prog-05130 user1       r     05/15/2009 01:23:36 chemistry@compute-0-31.local

shows that the job is running perfectly well on compute-0-31.  Both  
compute-0-33 and compute-0-31 are in both the chemistry and het queues.

[root@cluster ~]# chem
chemistry@compute-0-30.local   BIP   2/8       2.01     lx26-amd64
chemistry@compute-0-31.local   BIP   2/8       4.00     lx26-amd64
chemistry@compute-0-32.local   BIP   0/8       3.90     lx26-amd64
chemistry@compute-0-33.local   BIP   0/8       0.00     lx26-amd64    E
chemistry@compute-0-6.local    BIP   0/8       5.93     lx26-amd64
chemistry@compute-0-7.local    BIP   6/8       6.10     lx26-amd64
[root@cluster ~]# het
het@compute-0-30.local         BIP   0/8       2.02     lx26-amd64
het@compute-0-31.local         BIP   2/8       4.00     lx26-amd64
het@compute-0-32.local         BIP   4/8       3.92     lx26-amd64
het@compute-0-33.local         BIP   0/8       0.02     lx26-amd64    E
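
To make the listings above easier to summarize, here is a minimal Python sketch that parses lines of this shape and reports which queue instances carry the E state and how many slots are free elsewhere. It assumes the six-column layout shown here (instance, type, used/total, load, arch, optional states) with instance names in the usual queue@host form; the function names and the SAMPLE constant are my own illustration, not part of Grid Engine.

```python
import re

# Sample lines copied from the two listings above (queue@host form).
SAMPLE = """\
chemistry@compute-0-30.local   BIP   2/8       2.01     lx26-amd64
chemistry@compute-0-31.local   BIP   2/8       4.00     lx26-amd64
chemistry@compute-0-32.local   BIP   0/8       3.90     lx26-amd64
chemistry@compute-0-33.local   BIP   0/8       0.00     lx26-amd64    E
chemistry@compute-0-6.local    BIP   0/8       5.93     lx26-amd64
chemistry@compute-0-7.local    BIP   6/8       6.10     lx26-amd64
het@compute-0-30.local         BIP   0/8       2.02     lx26-amd64
het@compute-0-31.local         BIP   2/8       4.00     lx26-amd64
het@compute-0-32.local         BIP   4/8       3.92     lx26-amd64
het@compute-0-33.local         BIP   0/8       0.02     lx26-amd64    E
"""

# Assumed layout: instance, qtype, used/total, load_avg, arch, optional states.
LINE_RE = re.compile(
    r"^(\S+)@(\S+)\s+(\S+)\s+(\d+)/(\d+)\s+(\S+)\s+(\S+)(?:\s+(\S+))?\s*$"
)


def parse_queue_lines(text):
    """Return one dict per queue-instance line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # prompts, headers, blank lines
        queue, host, qtype, used, total, load, arch, states = m.groups()
        rows.append({
            "queue": queue,
            "host": host,
            "used": int(used),
            "total": int(total),
            "load": load,            # kept as text; may not be numeric
            "states": states or "",  # e.g. "E" for error state
        })
    return rows


def errored_instances(rows):
    """Queue instances whose state column contains E."""
    return [f"{r['queue']}@{r['host']}" for r in rows if "E" in r["states"]]


def free_slots(rows, queue):
    """Unused slots in one queue, ignoring instances in error state."""
    return sum(r["total"] - r["used"]
               for r in rows
               if r["queue"] == queue and "E" not in r["states"])


if __name__ == "__main__":
    rows = parse_queue_lines(SAMPLE)
    print("error state:", errored_instances(rows))
    print("free chemistry slots:", free_slots(rows, "chemistry"))
    print("free het slots:", free_slots(rows, "het"))
```

On the sample above this flags only the two compute-0-33 instances and counts 30 free chemistry slots on the other nodes, which is why I would expect the waiting 4-slot jobs to have somewhere to go.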

The users are now only submitting to the chemistry queue, but we have  
a few jobs which have been running a long time on the het queue.

I have cleared the error from compute-0-33 several times using qmon.   
We have several jobs waiting on the chemistry queue for slots, but the  
available slots are not being filled.

  21166 0.51167 Ladder Dim user2       qw    05/14/2009 11:01:06                                    4
  21167 0.51167 Symmetric  user2       qw    05/14/2009 11:01:08                                    4
  21168 0.51167 IRC RHF    user2       qw    05/14/2009 12:21:53                                    4
  21170 0.51167 Me QST3    user2       qw    05/14/2009 16:00:01                                    4

User2 currently has jobs running on both queues.

  21154 0.51167 Trimer 6-3  user2       r     05/11/2009 17:44:00 chemistry@compute-0-7.local
  21161 0.51167 IRC        user2       r     05/14/2009 00:38:35 het@compute-0-32.local


