[GE users] Fwd: subnode with empty slots but jobs in queue

reuti reuti at staff.uni-marburg.de
Mon Dec 6 19:02:45 GMT 2010


On 06.12.2010 at 19:55, jlforrest wrote:

> On 12/6/2010 10:17 AM, reuti wrote:
> 
>>> Running 'qstat -g t -l h=compute-0-0 -s' results in
>>> no output. Is this correct?
>> 
>> No, I forgot to mention that -u "*" is also needed to get the list of all users' jobs.
> 
> No problem. At least it wasn't me screwing up. The output
> is below.
> 
> I think I might have some idea of what might be causing
> this. compute-0-7 crashed last week, I think on 12/02/2010.
> I brought it up soon afterwards. So, the jobs that show
> a submit time of before 12/02/2010 are not really there.
> I counted and there are 19 of them. This, plus the 29 that
> are running, equals 48, which is the number of cores.
> 
> So the real question is why these jobs remained
> visible to SGE after compute-0-7 was rebooted.

Was the node only rebooted, or was SGE's local spool directory also removed? If the local spool directory still exists after the reboot, the execd would inform the qmaster about the failed jobs. If there is no information left on the node about the jobs that were last running, the execd won't report anything to the qmaster, which in turn just keeps waiting for the jobs to reappear.

-- Reuti


> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
> ------------------------------------------------------------------------------------------------------------------
>    6954 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    5874 0.55500 Job descri wendy        r     11/30/2010 14:38:49 all.q@compute-0-7.local        MASTER
>    6959 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    5228 0.55500 Job descri maximoff     r     11/23/2010 15:22:34 all.q@compute-0-7.local        MASTER
>    6980 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6969 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6088 0.55500 Job descri maximoff     r     12/01/2010 11:35:19 all.q@compute-0-7.local        MASTER
>    6965 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6973 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6977 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    5873 0.55500 Job descri wendy        r     11/30/2010 14:37:34 all.q@compute-0-7.local        MASTER
>    5225 0.55500 Job descri maximoff     r     11/23/2010 15:14:34 all.q@compute-0-7.local        MASTER
>    6093 0.55500 Job descri maximoff     r     12/01/2010 11:37:04 all.q@compute-0-7.local        MASTER
>    5224 0.55500 Job descri maximoff     r     11/23/2010 15:13:04 all.q@compute-0-7.local        MASTER
>    6962 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6970 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6091 0.55500 Job descri maximoff     r     12/01/2010 11:36:19 all.q@compute-0-7.local        MASTER
>    6979 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6967 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6971 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6957 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6956 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6961 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6098 0.55500 Job descri maximoff     r     12/01/2010 11:41:49 all.q@compute-0-7.local        MASTER
>    6096 0.55500 Job descri maximoff     r     12/01/2010 11:40:19 all.q@compute-0-7.local        MASTER
>    6084 0.55500 Job descri maximoff     r     12/01/2010 11:11:34 all.q@compute-0-7.local        MASTER
>    6090 0.55500 Job descri maximoff     r     12/01/2010 11:36:04 all.q@compute-0-7.local        MASTER
>    5226 0.55500 Job descri maximoff     r     11/23/2010 15:17:04 all.q@compute-0-7.local        MASTER
>    6978 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    3003 0.55500 QQQ        mforrest     r     10/29/2010 11:33:56 all.q@compute-0-7.local        MASTER
>    6960 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6958 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6085 0.55500 Job descri maximoff     r     12/01/2010 11:11:49 all.q@compute-0-7.local        MASTER
>    6087 0.55500 Job descri maximoff     r     12/01/2010 11:34:49 all.q@compute-0-7.local        MASTER
>    5230 0.55500 Job descri maximoff     r     11/23/2010 15:28:04 all.q@compute-0-7.local        MASTER
>    6089 0.55500 Job descri maximoff     r     12/01/2010 11:35:34 all.q@compute-0-7.local        MASTER
>    6099 0.55500 Job descri maximoff     r     12/01/2010 11:42:34 all.q@compute-0-7.local        MASTER
>    6981 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6955 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6974 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6982 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6963 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6964 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6966 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6972 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6976 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6975 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
>    6968 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19 all.q@compute-0-7.local        MASTER
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=302529
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
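
The count in the quoted message (19 jobs started before the crash plus 29 started after it, 48 in total) can be re-checked from a listing like the one above. The following is only a minimal sketch: the function name split_by_cutoff is made up for illustration, the cutoff of midnight on 12/02/2010 is an assumption based on the "I think on 12/02/2010" remark, and the sample holds just five rows copied from the listing, cut off after the timestamp.

```python
import re
from datetime import datetime

# Matches the "submit/start at" timestamp in a qstat row, e.g. 12/06/2010 09:07:19.
TIMESTAMP = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}")


def split_by_cutoff(rows, cutoff):
    """Return (before, after): job IDs with a timestamp before / not before cutoff.

    Lines without a timestamp (header, separator, wrapped queue lines)
    are skipped, so the pasted listing can be fed in unchanged.
    """
    before, after = [], []
    for row in rows:
        row = row.lstrip("> ").strip()  # drop the mail quote prefix
        match = TIMESTAMP.search(row)
        if match is None:
            continue
        job_id = row.split()[0]  # the job ID is the first column
        started = datetime.strptime(match.group(), "%m/%d/%Y %H:%M:%S")
        (before if started < cutoff else after).append(job_id)
    return before, after


# Five rows taken from the listing (queue column omitted).
sample = [
    ">    6954 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19",
    ">    5874 0.55500 Job descri wendy        r     11/30/2010 14:38:49",
    ">    6959 0.55500 T.1.0.N.11 an           r     12/06/2010 09:07:19",
    ">    5228 0.55500 Job descri maximoff     r     11/23/2010 15:22:34",
    ">    3003 0.55500 QQQ        mforrest     r     10/29/2010 11:33:56",
]

# Assumed cutoff: start of the day on which the node is thought to have crashed.
cutoff = datetime(2010, 12, 2)
stale, live = split_by_cutoff(sample, cutoff)
print(len(stale), len(live))  # prints "3 2" for the five sample rows
```

Fed the complete listing instead of the five sample rows, the two lists should have lengths 19 and 29 if the count in the quoted message is right.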
