[GE users] SGE jobs stuck in pending state

emallove ethan.mallove at sun.com
Fri Jul 24 20:34:54 BST 2009


On Fri, Jul/24/2009 10:13:56AM, templedf wrote:
> Oh, well that's the problem.  Look at the hostlist line.  It says NONE.
> Replace it either with "@allhosts" (and make sure your hosts are in the
> @allhosts host group), or with a list of all your hosts.

First I had to create the @allhosts group with "qconf -ahgrp"; after
that it worked. qsub runs without errors, and the job output lands in
my SGE logs directory.
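
For the archives, here is the fix sketched as commands. This is only a
sketch: the hostnames are placeholders, and since "qconf -ahgrp" opens an
interactive editor, the non-interactive "-Ahgrp"/"-mattr" forms are shown
instead.

```shell
# Create the @allhosts host group from a file (avoids the interactive editor).
cat > /tmp/allhosts.conf <<'EOF'
group_name @allhosts
hostlist   host1 host2
EOF
qconf -Ahgrp /tmp/allhosts.conf

# Point the cluster queue at the new host group.
qconf -mattr queue hostlist @allhosts all.q

# Verify: hostlist should now read @allhosts instead of NONE.
qconf -sq all.q | grep hostlist
```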

Thanks!

-Ethan

>
> Daniel
>
> emallove wrote:
> > On Thu, Jul/23/2009 06:57:48PM, craffi wrote:
> >
> >> What is the output of "qconf -sq all.q" ?
> >>
> >
> > $ qconf -sq all.q
> > qname                 all.q
> > hostlist              NONE
> > seq_no                0
> > load_thresholds       np_load_avg=1.75
> > suspend_thresholds    NONE
> > nsuspend              1
> > suspend_interval      00:05:00
> > priority              0
> > min_cpu_interval      00:05:00
> > processors            UNDEFINED
> > qtype                 BATCH INTERACTIVE
> > ckpt_list             NONE
> > pe_list               make
> > rerun                 FALSE
> > slots                 1,[burl-ct-v20z-11=2],[burl-ct-v20z-34=2], \
> >                       [burl-ct-v20z-35=2],[burl-ct-v20z-36=2], \
> >                       [burl-ct-v20z-37=2],[burl-ct-v20z-38=2], \
> >                       [burl-ct-v20z-4=2],[burl-ct-v20z-39=2], \
> >                       [burl-ct-v20z-40=2],[burl-ct-v20z-22=2], \
> >                       [burl-ct-v20z-3=2],[burl-ct-v20z-31=2], \
> >                       [burl-ct-v20z-41=2],[burl-ct-v20z-42=2], \
> >                       [burl-ct-v20z-43=2],[burl-ct-v20z-44=2], \
> >                       [burl-ct-v20z-45=2],[burl-ct-v20z-46=2], \
> >                       [burl-ct-v20z-47=2],[burl-ct-v20z-48=2], \
> >                       [burl-ct-v20z-5=2],[burl-ct-v20z-49=2], \
> >                       [burl-ct-v20z-50=2],[burl-ct-v20z-51=2], \
> >                       [burl-ct-v20z-53=2],[burl-ct-v20z-54=2], \
> >                       [burl-ct-v20z-55=2],[burl-ct-v20z-56=2], \
> >                       [burl-ct-v20z-57=2],[burl-ct-v20z-58=2], \
> >                       [burl-ct-v20z-59=2],[burl-ct-v20z-60=2], \
> >                       [burl-ct-v20z-61=2],[burl-ct-v20z-62=2], \
> >                       [burl-ct-v20z-63=2],[burl-ct-v20z-64=2], \
> >                       [burl-ct-v20z-65=2],[burl-ct-v20z-66=2], \
> >                       [burl-ct-v20z-67=2],[burl-ct-v20z-7=2], \
> >                       [burl-ct-v20z-68=2],[burl-ct-v20z-69=2], \
> >                       [burl-ct-v20z-8=2],[burl-ct-v20z-6=2], \
> >                       [burl-ct-v20z-86=2],[burl-ct-v40z-1=8], \
> >                       [burl-ct-v440-1=4],[burl-ct-v440-0=4], \
> >                       [burl-ct-v440-5=4],[burl-ct-v440-6=4], \
> >                       [burl-ct-v40z-0=7],[burl-ct-v440-3=4], \
> >                       [burl-ct-v440-7=4],[burl-ct-v440-2=4], \
> >                       [burl-ct-280r-1=2],[burl-ct-280r-6=2], \
> >                       [burl-ct-280r-0=2],[burl-ct-280r-2=2], \
> >                       [burl-ct-280r-3=2],[burl-ct-280r-5=2], \
> >                       [burl-ct-280r-4=2],[burl-ct-280r-7=2], \
> >                       [burl-ct-280r-8=2],[burl-ct-v20z-101=2], \
> >                       [burl-ct-v20z-72=2],[burl-ct-v20z-74=2], \
> >                       [burl-ct-v20z-79=2],[burl-ct-v20z-73=2], \
> >                       [burl-ct-v20z-71=2],[burl-ct-v20z-76=2], \
> >                       [burl-ct-v20z-77=2],[burl-ct-v20z-70=2], \
> >                       [burl-ct-v20z-75=2],[burl-ct-v20z-84=2], \
> >                       [burl-ct-v20z-88=2],[burl-ct-v20z-78=2], \
> >                       [burl-ct-v20z-96=2],[burl-ct-v20z-81=2], \
> >                       [burl-ct-v20z-92=2],[burl-ct-v20z-94=2], \
> >                       [burl-ct-v20z-90=2],[burl-ct-v20z-95=2], \
> >                       [burl-ct-v20z-83=2],[burl-ct-v20z-91=2], \
> >                       [burl-ct-v20z-85=2],[burl-ct-v20z-89=2], \
> >                       [burl-ct-v20z-87=2],[burl-ct-v20z-82=2], \
> >                       [burl-ct-v20z-93=2],[burl-ct-v20z-97=2], \
> >                       [burl-ct-v20z-98=2],[burl-ct-v20z-99=2], \
> >                       [burl-ct-280r-11=2],[burl-ct-280r-10=2], \
> >                       [burl-ct-280r-12=2],[burl-ct-280r-13=2], \
> >                       [burl-ct-280r-16=2],[burl-ct-280r-15=2], \
> >                       [burl-ct-280r-14=2],[burl-ct-280r-17=2], \
> >                       [burl-ct-280r-18=2],[burl-ct-280r-20=2], \
> >                       [burl-ct-280r-21=2],[burl-ct-280r-24=2], \
> >                       [burl-ct-280r-25=2],[burl-ct-280r-23=2], \
> >                       [burl-ct-280r-26=2],[burl-ct-280r-29=2], \
> >                       [burl-ct-280r-31=2],[burl-ct-280r-33=2], \
> >                       [burl-ct-280r-9=2],[burl-ct-280r-19=2], \
> >                       [burl-ct-280r-22=2],[burl-ct-280r-36=2], \
> >                       [burl-ct-280r-37=2],[burl-ct-280r-35=2], \
> >                       [burl-ct-280r-39=2],[burl-ct-280r-40=2], \
> >                       [burl-ct-280r-41=2],[burl-ct-280r-38=2], \
> >                       [burl-ct-280r-42=2],[burl-ct-280r-43=2], \
> >                       [burl-ct-280r-45=2],[burl-ct-280r-48=2], \
> >                       [burl-ct-280r-49=2],[burl-ct-280r-51=2], \
> >                       [burl-ct-280r-50=2],[burl-ct-280r-53=2], \
> >                       [burl-ct-280r-52=2],[burl-ct-280r-54=2], \
> >                       [burl-ct-280r-55=2],[burl-ct-280r-59=2], \
> >                       [burl-ct-280r-56=2],[burl-ct-280r-58=2], \
> >                       [burl-ct-280r-60=2],[burl-ct-280r-57=1], \
> >                       [burl-ct-280r-62=2],[burl-ct-280r-65=2], \
> >                       [burl-ct-280r-63=2],[burl-ct-280r-64=2], \
> >                       [burl-ct-280r-67=2],[burl-ct-280r-69=2], \
> >                       [burl-ct-280r-70=2],[burl-ct-280r-68=2], \
> >                       [burl-ct-280r-71=2],[burl-ct-280r-72=2], \
> >                       [burl-ct-280r-73=2],[burl-ct-280r-77=2], \
> >                       [burl-ct-280r-78=2],[burl-ct-280r-79=2], \
> >                       [burl-ct-280r-80=2],[burl-ct-280r-81=2], \
> >                       [burl-ct-280r-82=2],[burl-ct-280r-83=2], \
> >                       [burl-ct-280r-84=2],[burl-ct-280r-85=2], \
> >                       [burl-ct-280r-86=2],[burl-ct-280r-89=2], \
> >                       [burl-ct-280r-91=2],[burl-ct-280r-90=2], \
> >                       [burl-ct-280r-93=2],[burl-ct-280r-92=2], \
> >                       [burl-ct-280r-94=2],[burl-ct-280r-88=2], \
> >                       [burl-ct-280r-95=2],[burl-ct-280r-96=2], \
> >                       [burl-ct-280r-97=2],[burl-ct-280r-99=2], \
> >                       [burl-ct-280r-101=2],[burl-ct-280r-100=2], \
> >                       [burl-ct-280r-98=2],[burl-ct-280r-102=2], \
> >                       [burl-ct-280r-103=2],[burl-ct-280r-104=2], \
> >                       [burl-ct-280r-106=2],[burl-ct-280r-105=2]
> > tmpdir                /tmp
> > shell                 /bin/csh
> > prolog                NONE
> > epilog                NONE
> > shell_start_mode      posix_compliant
> > starter_method        NONE
> > suspend_method        NONE
> > resume_method         NONE
> > terminate_method      NONE
> > notify                00:00:60
> > owner_list            NONE
> > user_lists            NONE
> > xuser_lists           NONE
> > subordinate_list      NONE
> > complex_values        NONE
> > projects              NONE
> > xprojects             NONE
> > calendar              NONE
> > initial_state         default
> > s_rt                  INFINITY
> > h_rt                  INFINITY
> > s_cpu                 INFINITY
> > h_cpu                 INFINITY
> > s_fsize               INFINITY
> > h_fsize               INFINITY
> > s_data                INFINITY
> > h_data                INFINITY
> > s_stack               INFINITY
> > h_stack               INFINITY
> > s_core                INFINITY
> > h_core                INFINITY
> > s_rss                 INFINITY
> > h_rss                 INFINITY
> > s_vmem                INFINITY
> > h_vmem                INFINITY
> >
> > -Ethan
> >
> >
> >
> >> -Chris
> >>
> >>
> >> On Jul 23, 2009, at 6:02 PM, emallove wrote:
> >>
> >>
> >>> Hello,
> >>>
> >>> All my jobs are getting stuck in the "pending" state, e.g.,
> >>>
> >>>  $ qconf -au em162155 user_lists
> >>>  "em162155" is already in access list "user_lists"
> >>>
> >>>  $ qsub /tmp/hostname.sh
> >>>  Unable to run job: warning: em162155 your job is not allowed to run
> >>> in any queue
> >>>  Your job 8 ("hostname.sh") has been submitted.
> >>>  Exiting.
> >>>
> >>>  $ qconf -sql
> >>>  all.q
> >>>  default
> >>>
> >>>  $ qstat -f
> >>>
> >>> ############################################################################
> >>>   - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS -
> >>> PENDING JOBS
> >>>
> >>> ############################################################################
> >>>        1 0.75000 hostname.s em162155     qw    07/15/2009
> >>> 16:11:46     1
> >>>        2 0.74958 hostname.s em162155     qw    07/15/2009
> >>> 16:21:29     1
> >>>        3 0.74955 hostname.s em162155     qw    07/15/2009
> >>> 16:22:19     1
> >>>        4 0.74944 hostname.s em162155     qw    07/15/2009
> >>> 16:24:47     1
> >>>        5 0.74912 hostname.s em162155     qw    07/15/2009
> >>> 16:32:08     1
> >>>        6 0.74911 hostname.s em162155     qw    07/15/2009
> >>> 16:32:23     1
> >>>        8 0.25000 hostname.s em162155     qw    07/23/2009
> >>> 17:43:42     1
> >>>
> >>>  $ qstat -g c
> >>>  CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL
> >>> aoACDS  cdsuE
> >>>
> >>> --------------------------------------------------------------------------------
> >>>  all.q                             -NA-      0      0      0
> >>> 0      0      0
> >>>  default                           -NA-      0      0      0
> >>> 0      0      0
> >>>
> >>>  $ qstat -j 1
> >>>  ==============================================================
> >>>  job_number:                 1
> >>>  ...
> >>>  scheduling info:            All queues dropped because of overload
> >>> or full
> >>>
> >>>  $ qstat -V |& head -1
> >>>  GE 6.2u3
> >>>
> >>> Any idea how to fix this?
> >>>
> >>> Note: I only have the sge_execd daemon running on two hosts (the
> >>> master + another host), because I'm trying to configure a small
> >>> sandbox SGE configuration before scaling up to a large one. All my
> >>> daemons are running as user "em162155".
> >>>
> >>> Thanks,
> >>> Ethan
> >>>
> >>> ------------------------------------------------------
> >>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=209256
> >>>
> >>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net
> >>> ].
> >>>
> >>
> >
> >
>


More information about the gridengine-users mailing list