[GE users] NULL element for JAT_prio

andreas andreas.haas at sun.com
Mon Dec 15 14:30:24 GMT 2008


Hi Jesse,

On Thu, 11 Dec 2008, Jesse Becker wrote:

> Since there's nothing running, and users aren't running jobs anyway, I'm
> trying to kill off the pending jobs.
>
> Since I can't get the job list from qstat, I have to go poking around in the
> spool directory, and recreate the job IDs from there, and call qdel on each in
> turn.  So I'm running this:
>
> 	for i in `ls $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs/00/0132/`; do
> 		qdel  132$i
> 	done
>
> This doesn't work, and throws these three lines in the qmaster messages file:
>
> 12/11/2008 16:24:11|worker|saturn|I|beckerjes has deleted job 1328049
> 12/11/2008 16:28:18|worker|saturn|E|unable to retrieve template task

This happens in job_get_ja_task_template_pending(). For some reason this
job had an empty JB_ja_template field. According to the code this can't happen,
since all jobs get their JB_ja_template at submission time in qmaster.

> 12/11/2008 16:28:18|worker|saturn|C|!!!!!!!!!! got NULL element for
> JAT_granted_destin_identifier_list !!!!!!!!!!

I can't explain why it happens, but this is what causes qmaster to abort.
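
To see which error and critical entries precede each abort, the messages file can be filtered by severity. The following is only a sketch: it assumes every entry has the form "date time|thread|host|level|text", as the three lines quoted above suggest, and that the level codes I, E and C mean info, error and critical. The function names are made up for this illustration and are not part of Grid Engine.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class MessageLine(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Split 'date time|thread|host|level|text'; return None if it does not fit."""
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return MessageLine(*parts)


def severe_lines(lines: Iterable[str], levels=("E", "C")) -> Iterator[MessageLine]:
    """Yield parsed entries whose level is one of `levels` (assumed codes)."""
    for line in lines:
        entry = parse_message_line(line)
        if entry is not None and entry.level in levels:
            yield entry


if __name__ == "__main__":
    sample = [
        "12/11/2008 16:24:11|worker|saturn|I|beckerjes has deleted job 1328049",
        "12/11/2008 16:28:18|worker|saturn|E|unable to retrieve template task",
    ]
    for entry in severe_lines(sample):
        print(entry.timestamp, entry.level, entry.text)
```

Passing an open file object for the messages file instead of the sample list works the same way, since the function only iterates over lines.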

> Qmaster then aborts with error code 134.
>
> Since this obviously failed, and the problem appears to be related to
> one of the jobs, I did the following:
>
> 1)  shutdown qmaster
> 2)  move both the "jobs" and "job_scripts" directories in
> $SGE_ROOT/$SGE_CELL/spool/qmaster/ to other locations
> 3)  start qmaster
>
> *This* seems to have fixed the problem, but at the cost of losing all of the
> running jobs.  Shutting qmaster down and moving these directories back into
> place brings the problem back.
>
> So I'd still like to know what broke, and if there's a better way of fixing
> the problem in the future.  If it is a matter of a corrupted job, then I'm
> fine if it needs to be forcibly removed to let the others through, but I would
> like to avoid a full-scale wipe in the future.

I think we must find the root cause of this phenomenon.

Did anything suspicious happen before qstat started failing on the pending
jobs? For example, had these pending jobs been running earlier and then been
rescheduled?

Could you send me the qmaster accounting and messages files for investigation?
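
To narrow down which spooled job might be damaged, it can help to list the job IDs present in the spool directory without calling qdel on them. This is a minimal sketch, assuming the layout implied by the loop quoted above: a path such as jobs/00/0132/8049 belongs to job 1328049, with every component zero-padded. The function names are hypothetical, and array jobs or other spooling methods may be laid out differently.

```python
import os
import tempfile
from typing import List


def job_id_from_spool_path(rel_path: str) -> int:
    """Map a path like '00/0132/8049' (relative to the jobs directory)
    to the job ID 1328049 by concatenating the digit groups."""
    parts = [p for p in rel_path.split("/") if p]
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected spool path: {rel_path!r}")
    return int("".join(parts))


def list_spooled_job_ids(jobs_dir: str) -> List[int]:
    """Walk jobs_dir two directory levels deep and return sorted job IDs."""
    ids = []
    for top in os.listdir(jobs_dir):
        top_path = os.path.join(jobs_dir, top)
        if not os.path.isdir(top_path):
            continue
        for mid in os.listdir(top_path):
            mid_path = os.path.join(top_path, mid)
            if not os.path.isdir(mid_path):
                continue
            for leaf in os.listdir(mid_path):
                try:
                    ids.append(job_id_from_spool_path(f"{top}/{mid}/{leaf}"))
                except ValueError:
                    pass  # ignore entries that do not look like job spool names
    return sorted(ids)


if __name__ == "__main__":
    # Demo on a throwaway directory that mimics the assumed layout.
    with tempfile.TemporaryDirectory() as jobs_dir:
        os.makedirs(os.path.join(jobs_dir, "00", "0132"))
        for name in ("8049", "7918"):
            open(os.path.join(jobs_dir, "00", "0132", name), "w").close()
        print(list_spooled_job_ids(jobs_dir))
```

Because the sketch only reads directory names, it can be run while qmaster is stopped, and the resulting list can be compared with the job IDs that show up in the messages and accounting files.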

Regards,
Andreas

>
> Becker, Jesse (NIH/NHGRI) [C] wrote:
>> andreas wrote:
>>> Thanks. It means qstat fails when it tries to sort the pending jobs so as to print
>>> them in priority order. What is really bewildering is that the pending jobs have no
>>> JAT_prio data field, whereas running jobs do have JAT_prio, as one can see
>>> from the output in the 'prior' column ...
>>
>> Right.  I figured that something like that was happening, but didn't want to
>> speculate about code with which I'm not familiar. :)
>>
>>> In your first post you wrote that restarting qmaster fixes the issue. Are there any
>>> log entries in the qmaster messages file?
>>
>> Sorry, I didn't mean to imply that restarting qmaster fixes this.  In fact,
>> restarting qmaster specifically does *not* fix this JAT_prio problem.  I've
>> tried bouncing qmaster a number of times.
>>
>> I've let the running jobs flush out, so I'm going to try a few more things
>> that I didn't want to do previously (like bounce execd processes).
>>
>>> What do you get from 'qconf -ssconf | grep report_pjob_tickets'?
>>
>> [beckerjes@saturn ~]$ qconf -ssconf | grep report_pjob_tickets
>> report_pjob_tickets               TRUE
>>
>> I tried setting this to FALSE, then bouncing qmaster.  The error still occurs.
>>
>>
>>
>>
>>> Regards,
>>> Andreas
>>>
>>> On Thu, 11 Dec 2008, Jesse Becker wrote:
>>>
>>>> andreas wrote:
>>>>> Hi Jesse,
>>>>>
>>>>> thanks for reporting.
>>>> Thanks for the reply, and I'm happy to help.
>>>>
>>>>> Have you verified that qstat(1) and sge_qmaster(8) return the same version string when
>>>>> you run them with the -help option?
>>>> Both of them return the string  "GE 6.2".
>>>>
>>>> Something else that isn't immediately obvious:  running jobs are displayed, but
>>>> no pending jobs are shown.  I believe that all running jobs are shown, as
>>>> opposed to just some of them.  I verified this by running 'qstat -s r', which
>>>> does *not* abort, and 'qstat -s p', which does.
>>>>
>>>>
>>>>> For further investigation I'd ask you to run qstat under gdb/dbx control
>>>>> when it happens next time, so that we get a full stack trace.
>>>> Here you go:
>>>>
>>>> [beckerjes@saturn ~]$ gdb qstat
>>>> GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.2rh)
>>>> Copyright 2004 Free Software Foundation, Inc.
>>>> GDB is free software, covered by the GNU General Public License, and you are
>>>> welcome to change it and/or distribute copies of it under certain conditions.
>>>> Type "show copying" to see the conditions.
>>>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>>>> This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols
>>>> found)
>>>> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>>>>
>>>> (gdb) run
>>>> Starting program: /opt/gridengine/bin/lx24-amd64/qstat
>>>> (no debugging symbols found)
>>>> (no debugging symbols found)
>>>> (no debugging symbols found)
>>>> [Thread debugging using libthread_db enabled]
>>>> [New Thread 182894186240 (LWP 21365)]
>>>> (no debugging symbols found)
>>>> (no debugging symbols found)
>>>> (no debugging symbols found)
>>>> (no debugging symbols found)
>>>> (no debugging symbols found)
>>>> critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>
>>>> Program received signal SIGABRT, Aborted.
>>>> [Switching to Thread 182894186240 (LWP 21365)]
>>>> 0x000000392712e25d in raise () from /lib64/tls/libc.so.6
>>>> (gdb) bt
>>>> #0  0x000000392712e25d in raise () from /lib64/tls/libc.so.6
>>>> #1  0x000000392712fa5e in abort () from /lib64/tls/libc.so.6
>>>> #2  0x00000000004d4299 in lGetPosViaElem ()
>>>> #3  0x00000000004d5177 in lGetDouble ()
>>>> #4  0x000000000046a6c8 in sgeee_sort_jobs_by ()
>>>> #5  0x000000000046a605 in sgeee_sort_jobs ()
>>>> #6  0x0000000000443bdd in qstat_no_group ()
>>>> #7  0x000000000042370a in main ()
>>>> (gdb)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Regards,
>>>>> Andreas
>>>>>
>>>>> On Wed, 10 Dec 2008, Jesse Becker wrote:
>>>>>
>>>>>> Came into work this morning and discovered that qstat from my 6.2 install was throwing this lovely message:
>>>>>>
>>>>>>   critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>>>
>>>>>> More specifically, it looks like this:
>>>>>>
>>>>>> [saturn ~]# qstat
>>>>>> job-ID  prior   name       user     state  submit/start at     queue                   slots ja-task-ID
>>>>>> ---------------------------------------------------------------------------------------------------------
>>>>>> 1327918 0.73703 solexa-clu solexa    r     12/10/2008 08:15:46 interactive.q@gcr06n20     4
>>>>>> 1327917 0.62500 s081209.15 solexa    S     12/10/2008 08:01:24 high.q@gcr06n20            4
>>>>>> 1327920 0.73688 solexa-clu solexa    r     12/10/2008 08:15:56 interactive.q@gcr06n24     4
>>>>>> 1327928 1.00000 s081209.16 solexa    r     12/10/2008 09:21:25 high.q@gcr06n08            4
>>>>>> 1327953 0.21717 FA081106.1 solexa    r     12/10/2008 09:41:13 low.q@gcr07n12             1
>>>>>> 1327967 0.92874 s081209.17 solexa    r     12/10/2008 10:41:06 high.q@gcr06n17            4
>>>>>> critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>>> Aborted
>>>>>> [saturn ~]#
>>>>>>
>>>>>>
>>>>>> Digging a little bit deeper, using 'dl 2', I get this output (trimmed for space):
>>>>>>
>>>>>>   756  14264         main <-- job_stdout_job() ../clients/qstat/qstat.c 1134 }
>>>>>>   757  14264         main <-- sge_handle_job() ../clients/common/sge_qstat.c 2491 }
>>>>>>   758  14264         main <-- handle_jobs_queue() ../clients/common/sge_qstat.c 715 }
>>>>>>   759  14264         main <-- qstat_handle_running_jobs() ../clients/common/sge_qstat.c 505 }
>>>>>>   760  14264         main --> sge_log() {
>>>>>>   761  14264         main     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>>>   762  14264         main <-- sge_log() ../libs/uti/sge_log.c 620 }
>>>>>>
>>>>>> Using 'dl 4', there's a little bit more:
>>>>>>
>>>>>>  1253  19495         main <-- job_stdout_job() ../clients/qstat/qstat.c 1134 }
>>>>>>  1254  19495         main <-- sge_handle_job() ../clients/common/sge_qstat.c 2491 }
>>>>>>  1255  19495         main <-- handle_jobs_queue() ../clients/common/sge_qstat.c 715 }
>>>>>>  1256  19495         main <-- qstat_handle_running_jobs() ../clients/common/sge_qstat.c 505 }
>>>>>>  1257  19495 182894186240 --> sge_set_message_id_output() {
>>>>>>  1258  19495 182894186240 <-- sge_set_message_id_output() ../libs/uti/sge_language.c 498 }
>>>>>>  1259  19495 182894186240 --> sge_gettext_() {
>>>>>>  1260  19495 182894186240 --> sge_get_message_id_output_implementation() {
>>>>>>  1261  19495 182894186240 <-- sge_get_message_id_output_implementation() ../libs/uti/sge_language.c 582 }
>>>>>>  1262  19495 182894186240 <-- sge_gettext_() ../libs/uti/sge_language.c 730 }
>>>>>>  1263  19495 182894186240 --> sge_set_message_id_output() {
>>>>>>  1264  19495 182894186240 <-- sge_set_message_id_output() ../libs/uti/sge_language.c 498 }
>>>>>>  1265  19495         main --> sge_log() {
>>>>>>  1266  19495         main     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>>>  1267  19495 182894186240 --> sge_gettext_() {
>>>>>>  1268  19495 182894186240 --> sge_get_message_id_output_implementation() {
>>>>>>  1269  19495 182894186240 <-- sge_get_message_id_output_implementation() ../libs/uti/sge_language.c 582 }
>>>>>>  1270  19495 182894186240 <-- sge_gettext_() ../libs/uti/sge_language.c 730 }
>>>>>>  1271  19495         main <-- sge_log() ../libs/uti/sge_log.c 620 }
>>>>>>
>>>>>>
>>>>>> Restarting qmaster doesn't help.
>>>>>>
>>>>>> Possibly related to this, is that I did the upgrade in October, and
>>>>>> occasionally get the message "got NULL element for SME_message_list"
>>>>>> in the qmaster messages file.  When this happens qmaster shuts down,
>>>>>> and must be restarted.
>>>>>>
>>>>>> This happens a few times a week, and I haven't been able to track down
>>>>>> the cause.  Looking over the logs, there may be a correlation between
>>>>>> large numbers of jobs finishing and this message, but nothing more solid.
>>>>>> As a workaround, I've a cron job that checks if the qmaster is running,
>>>>>> and starts it if needed.
>>>>>>
>>>>>> Any suggestions or ideas?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jesse Becker
>>>>>> NHGRI Linux support (Digicon Contractor)
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92089
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>>>
>>>>> http://gridengine.info/
>>>>>
>>>>> Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
>>>>> Amtsgericht Muenchen: HRB 161028
>>>>> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
>>>>> Vorsitzender des Aufsichtsrates: Martin Haering
>>>>>
>>>>> ------------------------------------------------------
>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92194
>>>>>
>>>> --
>>>> Jesse Becker
>>>> NHGRI Linux support (Digicon Contractor)
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92271
>>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92280
>>>
>>
>>
>
>
> -- 
> Jesse Becker
> NHGRI Linux support (Digicon Contractor)
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92311
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92679
