[GE users] Diagnosis on a queue

craffi dag at sonsorol.org
Fri May 15 13:55:44 BST 2009


If a job fails repeatedly on one machine but runs fine elsewhere, then
you really need to drill down and find out what is wrong with the "bad"
machine. The queue-level E state makes this even clearer.

Something is flaky on that one machine. It could be user related (bad
UID/GID, user does not exist on the node), OS related (odd SELinux
config, firewall, bad permissions), or it could be flaky hardware.
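
For the user-related causes, a quick check run directly on the suspect node can rule out the most common ones. The sketch below is only an illustration, not part of Grid Engine: `check_user` is a hypothetical helper that uses Python's standard `pwd` module to confirm the account resolves on the local host and that its home directory is present. It does not look at SELinux, firewalls or hardware.

```python
import os
import pwd


def check_user(name):
    """Return a list of problems found for account `name` on this host.

    Hypothetical helper for illustration; an empty list means the
    account resolves and its home directory is visible.
    """
    try:
        entry = pwd.getpwnam(name)  # raises KeyError if the user is unknown
    except KeyError:
        return [f"user {name!r} does not exist on this host"]

    problems = []
    # A missing home directory often points at a missing NFS mount.
    if not os.path.isdir(entry.pw_dir):
        problems.append(f"home directory {entry.pw_dir} is missing")
    return problems
```

Running `check_user("user1")` on both a good node and the bad node, and comparing the results, is one way to spot an account or mount that exists on one but not the other.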

Queue-level E states are triggered when a job fails in a spectacular
manner. This often means things like the user not existing on that
node, missing NFS mounts, and other machine- or OS-level problems.
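
To see at a glance which queue instances are affected, the listing can be filtered for the E state. This is a minimal sketch that assumes the six-column layout shown in the quoted message below (queue instance, type, used/total slots, load, architecture, optional states); `queues_in_error` and `SAMPLE` are names made up for the example, and other Grid Engine versions may print different columns.

```python
# Sample lines taken from the listing in the quoted message.
SAMPLE = """\
chemistry@compute-0-32.local   BIP   0/8       3.90     lx26-amd64
chemistry@compute-0-33.local   BIP   0/8       0.00     lx26-amd64    E
het@compute-0-32.local         BIP   4/8       3.92     lx26-amd64
het@compute-0-33.local         BIP   0/8       0.02     lx26-amd64    E
"""


def queues_in_error(text):
    """Return the queue instances whose states column contains 'E'."""
    errored = []
    for line in text.splitlines():
        fields = line.split()
        # Only lines with a sixth (states) column can carry an E flag.
        if len(fields) == 6 and "@" in fields[0] and "E" in fields[5]:
            errored.append(fields[0])
    return errored


print(queues_in_error(SAMPLE))
```

Here both flagged instances sit on compute-0-33, which points at the host rather than at either queue. As far as I recall, `qstat -f -explain E` prints the reason for the error state and `qmod -cq` clears it from the command line, but check the man pages for your version.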

Chris




On May 15, 2009, at 8:51 AM, mad wrote:

> Why would queues not distribute the jobs evenly?  Or how should I go
> about finding the problem?
> Why does one compute node in particular have problems with the queues?
>
> I have various groups of users on a ROCKS system where we are using
> Grid Engine 6.1u4 to manage the queues.  I have different queues for
> each of the groups.  I am having problems only with one of the compute
> nodes on the chemistry and het queues.
>
> I keep getting
>
> queue:  chemistry@compute-0-33.local
> 		queue chemistry marked QERROR as result of job 21162's failure at host compute-0-33.local++++++++
>
> However,
>
> qstat -u user1
>  21162 0.50500 prog-05130 user1       r     05/15/2009 01:23:36 chemistry@compute-0-31.local     1
>
> shows that the job is running perfectly well on compute-0-31.  Both
> compute-0-33 and compute-0-31 are in both the chemistry and het queues.
>
>
> [root@cluster ~]# chem
> chemistry@compute-0-30.local   BIP   2/8       2.01     lx26-amd64
> chemistry@compute-0-31.local   BIP   2/8       4.00     lx26-amd64
> chemistry@compute-0-32.local   BIP   0/8       3.90     lx26-amd64
> chemistry@compute-0-33.local   BIP   0/8       0.00     lx26-amd64    E
> chemistry@compute-0-6.local    BIP   0/8       5.93     lx26-amd64
> chemistry@compute-0-7.local    BIP   6/8       6.10     lx26-amd64
> [root@cluster ~]# het
> het@compute-0-30.local         BIP   0/8       2.02     lx26-amd64
> het@compute-0-31.local         BIP   2/8       4.00     lx26-amd64
> het@compute-0-32.local         BIP   4/8       3.92     lx26-amd64
> het@compute-0-33.local         BIP   0/8       0.02     lx26-amd64    E
>
> The users are now only submitting to the chemistry queue, but we have
> a few jobs which have been running a long time on the het queue.
>
> I have cleared the error from compute-0-33 several times using qmon.
> We have several jobs waiting on the chemistry queue for slots, but the
> available slots are not being filled.
>
>  21166 0.51167 Ladder Dim user2       qw    05/14/2009 11:01:06                                    4
>  21167 0.51167 Symmetric  user2       qw    05/14/2009 11:01:08                                    4
>  21168 0.51167 IRC RHF    user2       qw    05/14/2009 12:21:53                                    4
>  21170 0.51167 Me QST3    user2       qw    05/14/2009 16:00:01                                    4
>
> User2 currently has jobs running on both queues.
>
>  21154 0.51167 Trimer 6-3 user2       r     05/11/2009 17:44:00 chemistry@compute-0-7.local      4
>  21161 0.51167 IRC        user2       r     05/14/2009 00:38:35 het@compute-0-32.local           4
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=195895
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=195898
