[GE users] SGE jobs stuck in pending state

craffi dag at sonsorol.org
Fri Jul 24 17:24:45 BST 2009


Something is badly wrong with your setup. You do not appear to have  
any functional cluster queues or queue instances.

"qstat -f" should be showing you the state and status of all queue  
instances. Even if the sge_execd was not running you'd at least see  
the queue instance states represented with the status code 'au'  
(alarm, unknown).
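For reference, on a working setup "qstat -f" prints one block per  
queue instance ahead of the pending-job list. A rough sketch of what  
that typically looks like (queue and host names here are just  
illustrative, not from your config):

  $ qstat -f
  queuename                  qtype resv/used/tot. load_avg arch        states
  ---------------------------------------------------------------------------
  all.q@node01               BIP   0/0/4          0.01     lx24-amd64

An empty 'states' column is what you want; 'au' there means the  
qmaster cannot reach the execd on that host.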

The fact that you see nothing regarding queue state seems to mean that  
you have no queues active or configured at all.

Either that or I don't understand something that is happening due to  
your unusual non-root installation.

Also, don't worry about the error you mention below about not  
submitting jobs as root. One of the side effects of your decision to  
not install SGE as root is the caveat that you can only run jobs as  
the user who owns the SGE installation. You need to install as root if  
you want more than just the binary owner to be able to submit tasks.

I'd recommend reinstalling as root anyway, just so you can get some  
experience with what a working setup should look and feel like; you  
can then tear it down or otherwise change various things. Right now  
you are in a very odd and nonfunctional state.

You could try to reconstitute a queue with the "qconf -aq" command,  
but it may be easier just to blow away the install and replace it  
with a more generic config that has:

  - shared filesystem for $SGE_ROOT shared among all participating hosts
  - installed as root user with 'em162155' as the named SGE admin account
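If you do want to try repairing in place first, the commands would be  
along these lines (queue and host names here are placeholders, not  
taken from your setup):

  $ qconf -aq all.q     # opens an editor; set hostlist and slots, then save
  $ qconf -ah node01    # register an administrative host if needed
  $ qconf -as node01    # register a submit host if needed

After that, "qstat -f" should list a queue instance for each host in  
the queue's hostlist.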

-Chris


On Jul 24, 2009, at 12:13 PM, emallove wrote:

> On Fri, Jul/24/2009 11:26:58AM, craffi wrote:
>> Does the output of "qstat -f" really not show you the state of your
>> queues and queue instances and only shows the pending jobs?
>
> Correct. Below is the qstat output verbatim. qstat prints the same
> info from the qmaster node as from my one other non-qmaster node,
> which I assume should always be the case. Interestingly, jobs
> submitted as "root" show the same "unable to run" error, but then do
> not show up in the qstat -f output, e.g., notice job 9 does not show
> up in qstat:
>
>  $ sudo qsub /home/em162155/tmp/hostname.sh
>  Unable to run job: warning: root your job is not allowed to run in any queue
>  Your job 9 ("hostname.sh") has been submitted.
>  Exiting.
>  $ qstat -f
>
> ############################################################################
>   - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>        1 0.75000 hostname.s em162155     qw    07/15/2009 16:11:46     1
>        2 0.74958 hostname.s em162155     qw    07/15/2009 16:21:29     1
>        3 0.74955 hostname.s em162155     qw    07/15/2009 16:22:19     1
>        4 0.74944 hostname.s em162155     qw    07/15/2009 16:24:47     1
>        5 0.74912 hostname.s em162155     qw    07/15/2009 16:32:08     1
>        6 0.74911 hostname.s em162155     qw    07/15/2009 16:32:23     1
>        8 0.25000 hostname.s em162155     qw    07/23/2009 17:43:42     1
>
> Now, notice job 10 *does* show up in qstat:
>
>  $ qsub /home/em162155/tmp/hostname.sh
>  Unable to run job: warning: em162155 your job is not allowed to run in any queue
>  Your job 10 ("hostname.sh") has been submitted.
>  Exiting.
>  $ qstat -f
>
> ############################################################################
>   - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>        1 0.75000 hostname.s em162155     qw    07/15/2009 16:11:46     1
>        2 0.74962 hostname.s em162155     qw    07/15/2009 16:21:29     1
>        3 0.74959 hostname.s em162155     qw    07/15/2009 16:22:19     1
>        4 0.74949 hostname.s em162155     qw    07/15/2009 16:24:47     1
>        5 0.74920 hostname.s em162155     qw    07/15/2009 16:32:08     1
>        6 0.74919 hostname.s em162155     qw    07/15/2009 16:32:23     1
>        8 0.29364 hostname.s em162155     qw    07/23/2009 17:43:42     1
>       10 0.25000 hostname.s em162155     qw    07/24/2009 12:14:14     1
>
>  $ qconf |& head -1
>  GE 6.2u3
>
> -Ethan
>
>>
>> -Chris
>>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=209357

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
