[GE users] NULL element for JAT_prio

andreas andreas.haas at sun.com
Thu Dec 11 09:57:40 GMT 2008


Hi Jesse,

Thanks for reporting this.

Have you verified that qstat(1) and sge_qmaster(8) return the same version
string when you run them with the -help option?
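
For the version check, here is a minimal sketch of one way to compare the
two. It assumes both binaries are on your PATH and that the version string
is the first non-empty line printed by -help; please confirm that on your
installation. The function names are made up for illustration.

```python
import subprocess


def version_line(help_text):
    """Return the first non-empty line of -help output.

    Assumption: that line carries the version string.
    """
    for line in help_text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def help_output(command):
    """Run `<command> -help` and return stdout and stderr combined."""
    result = subprocess.run(
        [command, "-help"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def same_version(client_help, master_help):
    """True if both help texts start with the same non-empty line."""
    client = version_line(client_help)
    return client != "" and client == version_line(master_help)


def compare_installed():
    """Print both version lines and whether they match."""
    client = help_output("qstat")
    master = help_output("sge_qmaster")
    print("qstat:      ", version_line(client))
    print("sge_qmaster:", version_line(master))
    print("match" if same_version(client, master) else "MISMATCH")
```

Calling compare_installed() on the master host prints both lines side by
side, so a client built from a different release than the qmaster is easy
to spot.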

To investigate further, please run qstat under gdb or dbx the next time
this happens, so that we get a full stack trace.
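
If it helps, the gdb run can be scripted. The following is only a sketch
for gdb (not dbx): it starts the program in gdb's batch mode and prints a
backtrace once the program stops, for example on the abort. The helper
names are hypothetical, and the trace is only useful if the qstat binary
still has its symbols.

```python
import subprocess


def gdb_backtrace_command(program, *args):
    """Build a gdb command line that runs `program` and then prints a backtrace."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", program] + list(args)


def capture_backtrace(program, *args):
    """Run the program under gdb and return everything gdb printed."""
    result = subprocess.run(
        gdb_backtrace_command(program, *args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

For example, print(capture_backtrace("qstat")) shows the qstat output
followed by the stack at the point where it aborted. If qstat exits
normally, gdb has no stack to report.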

Regards,
Andreas

On Wed, 10 Dec 2008, Jesse Becker wrote:

> Came into work this morning and discovered that qstat from my 6.2 install was throwing this lovely message:
>
>   critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>
> More specifically, it looks like this:
>
> [saturn ~]# qstat
> job-ID  prior   name       user     state  submit/start at     queue                   slots ja-task-ID
> ---------------------------------------------------------------------------------------------------------
> 1327918 0.73703 solexa-clu solexa    r     12/10/2008 08:15:46 interactive.q@gcr06n20   4
> 1327917 0.62500 s081209.15 solexa    S     12/10/2008 08:01:24 high.q@gcr06n20          4
> 1327920 0.73688 solexa-clu solexa    r     12/10/2008 08:15:56 interactive.q@gcr06n24   4
> 1327928 1.00000 s081209.16 solexa    r     12/10/2008 09:21:25 high.q@gcr06n08          4
> 1327953 0.21717 FA081106.1 solexa    r     12/10/2008 09:41:13 low.q@gcr07n12           1
> 1327967 0.92874 s081209.17 solexa    r     12/10/2008 10:41:06 high.q@gcr06n17          4
> critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
> Aborted
> [saturn ~]#
>
>
> Digging a little bit deeper, using 'dl 2', I get this output (trimmed for space):
>
>   756  14264         main <-- job_stdout_job() ../clients/qstat/qstat.c 1134 }
>   757  14264         main <-- sge_handle_job() ../clients/common/sge_qstat.c 2491 }
>   758  14264         main <-- handle_jobs_queue() ../clients/common/sge_qstat.c 715 }
>   759  14264         main <-- qstat_handle_running_jobs() ../clients/common/sge_qstat.c 505 }
>   760  14264         main --> sge_log() {
>   761  14264         main     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>   762  14264         main <-- sge_log() ../libs/uti/sge_log.c 620 }
>
> Using 'dl 4', there's a little bit more:
>
>  1253  19495         main <-- job_stdout_job() ../clients/qstat/qstat.c 1134 }
>  1254  19495         main <-- sge_handle_job() ../clients/common/sge_qstat.c 2491 }
>  1255  19495         main <-- handle_jobs_queue() ../clients/common/sge_qstat.c 715 }
>  1256  19495         main <-- qstat_handle_running_jobs() ../clients/common/sge_qstat.c 505 }
>  1257  19495 182894186240 --> sge_set_message_id_output() {
>  1258  19495 182894186240 <-- sge_set_message_id_output() ../libs/uti/sge_language.c 498 }
>  1259  19495 182894186240 --> sge_gettext_() {
>  1260  19495 182894186240 --> sge_get_message_id_output_implementation() {
>  1261  19495 182894186240 <-- sge_get_message_id_output_implementation() ../libs/uti/sge_language.c 582 }
>  1262  19495 182894186240 <-- sge_gettext_() ../libs/uti/sge_language.c 730 }
>  1263  19495 182894186240 --> sge_set_message_id_output() {
>  1264  19495 182894186240 <-- sge_set_message_id_output() ../libs/uti/sge_language.c 498 }
>  1265  19495         main --> sge_log() {
>  1266  19495         main     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>  1267  19495 182894186240 --> sge_gettext_() {
>  1268  19495 182894186240 --> sge_get_message_id_output_implementation() {
>  1269  19495 182894186240 <-- sge_get_message_id_output_implementation() ../libs/uti/sge_language.c 582 }
>  1270  19495 182894186240 <-- sge_gettext_() ../libs/uti/sge_language.c 730 }
>  1271  19495         main <-- sge_log() ../libs/uti/sge_log.c 620 }
>
>
> Restarting qmaster doesn't help.
>
> Possibly related: I did the upgrade in October, and I occasionally
> get the message "got NULL element for SME_message_list" in the qmaster
> messages file.  When this happens, qmaster shuts down and must be
> restarted.
>
> This happens a few times a week, and I haven't been able to track down
> the cause.  Looking over the logs, there may be a correlation between
> large numbers of jobs finishing and this message, but nothing more solid.
> As a workaround, I've a cron job that checks if the qmaster is running,
> and starts it if needed.
>
> Any suggestions or ideas?
>
>
> -- 
> Jesse Becker
> NHGRI Linux support (Digicon Contractor)
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92089
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

http://gridengine.info/


------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92194
