[GE users] NULL element for JAT_prio

Jesse Becker beckerjes at mail.nih.gov
Thu Dec 11 16:34:29 GMT 2008


andreas wrote:
> Thanks. It means qstat fails when it tries to sort the pending jobs so that it can print
> them in priority order. What is really bewildering is that there is no JAT_prio data
> field with pending jobs, whereas running jobs do have JAT_prio, as one can see
> from the output in the 'prior' column ...

Right.  I figured that something like that was happening, but didn't want to 
speculate about code with which I'm not familiar. :)
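
For anyone following along, here is a toy Python sketch of the failure mode
as described above. It is not SGE code: the dict records, get_double() and
sort_by_priority() are hypothetical stand-ins, and the two pending job IDs
are made up. It only shows why a strict getter makes a priority sort succeed
for running jobs and fail for pending ones.

```python
# Toy model only: plain dicts stand in for job-task records; this is not SGE code.
class MissingField(Exception):
    """Stands in for the abort that qstat performs on a missing element."""


def get_double(record, field):
    # Strict getter: refuses to invent a value for an absent record or field.
    if record is None or field not in record:
        raise MissingField(f"got NULL element for {field}")
    return record[field]


def sort_by_priority(tasks):
    # Highest priority first; fails as soon as one record lacks JAT_prio.
    return sorted(tasks, key=lambda t: get_double(t, "JAT_prio"), reverse=True)


def try_sort(tasks):
    # Return the sorted job IDs, or the error text if the sort could not run.
    try:
        return [t["id"] for t in sort_by_priority(tasks)]
    except MissingField as err:
        return str(err)


# Two running jobs taken from the qstat listing in this thread.
running = [
    {"id": 1327918, "JAT_prio": 0.73703},
    {"id": 1327928, "JAT_prio": 1.00000},
]
# Hypothetical pending jobs: no JAT_prio field at all.
pending = [{"id": 1}, {"id": 2}]

print(try_sort(running))  # [1327928, 1327918]
print(try_sort(pending))  # got NULL element for JAT_prio
```

This mirrors what I see: listing only running jobs works, while anything
that has to sort pending jobs aborts.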

> With your first post you wrote that restarting qmaster fixes that issue. Is anything
> logged in the qmaster messages file?

Sorry, I didn't mean to imply that restarting qmaster fixes this.  In fact, 
restarting qmaster specifically does *not* fix this JAT_prio problem.  I've 
tried bouncing qmaster a number of times.

I've let the running jobs flush out, so I'm going to try a few more things 
that I didn't want to do previously (like bounce execd processes).

> 
> What do you get from 'qconf -ssconf | grep report_pjob_tickets'?

[beckerjes@saturn ~]$ qconf -ssconf | grep report_pjob_tickets
report_pjob_tickets               TRUE

I tried setting this to FALSE, then bouncing qmaster.  The error still occurs.
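
To confirm the value after each change without eyeballing grep output, a
small parser along these lines may help. This is only a sketch: parse_sconf()
is a hypothetical helper, and it assumes every line of the scheduler
configuration is a parameter name followed by whitespace and a value, as in
the line shown above.

```python
def parse_sconf(text):
    """Map each parameter name to its value from 'name   value' lines."""
    params = {}
    for line in text.splitlines():
        # Split on the first run of whitespace: name, then the rest as value.
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


# Sample input copied from the qconf output in this message.
sample = "report_pjob_tickets               TRUE\n"
print(parse_sconf(sample)["report_pjob_tickets"])  # TRUE
```

On a live system the text could come from
subprocess.run(["qconf", "-ssconf"], capture_output=True, text=True).stdout.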




> 
> Regards,
> Andreas
> 
> On Thu, 11 Dec 2008, Jesse Becker wrote:
> 
>> andreas wrote:
>>> Hi Jesse,
>>>
>>> thanks for reporting.
>> Thanks for the reply, and I'm happy to help.
>>
>>> Have you verified that qstat(1) and sge_qmaster(8) return the same version string when
>>> you run them with the -help option?
>> Both of them return the string  "GE 6.2".
>>
>> Something else that isn't immediately obvious:  running jobs are displayed, but
>> no pending jobs are shown.  I believe that all running jobs are shown, as
>> opposed to just some of them.  I verified this by running 'qstat -s r', which
>> does *not* abort, and 'qstat -s p', which does.
>>
>>
>>> For further investigation I'd ask you to run qstat under gdb/dbx control
>>> when it happens next time, so that we get a full stack trace.
>> Here you go:
>>
>> [beckerjes@saturn ~]$ gdb qstat
>> GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.2rh)
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols
>> found)
>> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>>
>> (gdb) run
>> Starting program: /opt/gridengine/bin/lx24-amd64/qstat
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> [Thread debugging using libthread_db enabled]
>> [New Thread 182894186240 (LWP 21365)]
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>
>> Program received signal SIGABRT, Aborted.
>> [Switching to Thread 182894186240 (LWP 21365)]
>> 0x000000392712e25d in raise () from /lib64/tls/libc.so.6
>> (gdb) bt
>> #0  0x000000392712e25d in raise () from /lib64/tls/libc.so.6
>> #1  0x000000392712fa5e in abort () from /lib64/tls/libc.so.6
>> #2  0x00000000004d4299 in lGetPosViaElem ()
>> #3  0x00000000004d5177 in lGetDouble ()
>> #4  0x000000000046a6c8 in sgeee_sort_jobs_by ()
>> #5  0x000000000046a605 in sgeee_sort_jobs ()
>> #6  0x0000000000443bdd in qstat_no_group ()
>> #7  0x000000000042370a in main ()
>> (gdb)
>>
>>
>>
>>
>>
>>> Regards,
>>> Andreas
>>>
>>> On Wed, 10 Dec 2008, Jesse Becker wrote:
>>>
>>>> Came into work this morning and discovered that qstat from my 6.2 install was throwing this lovely message:
>>>>
>>>>   critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>
>>>> More specifically, it looks like this:
>>>>
>>>> [saturn ~]# qstat
>>>> job-ID  prior   name       user     state  submit/start at     queue                   slots ja-task-ID
>>>> ---------------------------------------------------------------------------------------------------------
>>>> 1327918 0.73703 solexa-clu solexa    r     12/10/2008 08:15:46 interactive.q@gcr06n20      4
>>>> 1327917 0.62500 s081209.15 solexa    S     12/10/2008 08:01:24 high.q@gcr06n20             4
>>>> 1327920 0.73688 solexa-clu solexa    r     12/10/2008 08:15:56 interactive.q@gcr06n24      4
>>>> 1327928 1.00000 s081209.16 solexa    r     12/10/2008 09:21:25 high.q@gcr06n08             4
>>>> 1327953 0.21717 FA081106.1 solexa    r     12/10/2008 09:41:13 low.q@gcr07n12              1
>>>> 1327967 0.92874 s081209.17 solexa    r     12/10/2008 10:41:06 high.q@gcr06n17             4
>>>> critical error: !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>> Aborted
>>>> [saturn ~]#
>>>>
>>>>
>>>> Digging a little bit deeper, using 'dl 2', I get this output (trimmed for space):
>>>>
>>>>   756  14264         main <-- job_stdout_job() ../clients/qstat/qstat.c 1134 }
>>>>   757  14264         main <-- sge_handle_job() ../clients/common/sge_qstat.c 2491 }
>>>>   758  14264         main <-- handle_jobs_queue() ../clients/common/sge_qstat.c 715 }
>>>>   759  14264         main <-- qstat_handle_running_jobs() ../clients/common/sge_qstat.c 505 }
>>>>   760  14264         main --> sge_log() {
>>>>   761  14264         main     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>   762  14264         main <-- sge_log() ../libs/uti/sge_log.c 620 }
>>>>
>>>> Using 'dl 4', there's a little bit more:
>>>>
>>>>  1253  19495         main <-- job_stdout_job() ../clients/qstat/qstat.c 1134 }
>>>>  1254  19495         main <-- sge_handle_job() ../clients/common/sge_qstat.c 2491 }
>>>>  1255  19495         main <-- handle_jobs_queue() ../clients/common/sge_qstat.c 715 }
>>>>  1256  19495         main <-- qstat_handle_running_jobs() ../clients/common/sge_qstat.c 505 }
>>>>  1257  19495 182894186240 --> sge_set_message_id_output() {
>>>>  1258  19495 182894186240 <-- sge_set_message_id_output() ../libs/uti/sge_language.c 498 }
>>>>  1259  19495 182894186240 --> sge_gettext_() {
>>>>  1260  19495 182894186240 --> sge_get_message_id_output_implementation() {
>>>>  1261  19495 182894186240 <-- sge_get_message_id_output_implementation() ../libs/uti/sge_language.c 582 }
>>>>  1262  19495 182894186240 <-- sge_gettext_() ../libs/uti/sge_language.c 730 }
>>>>  1263  19495 182894186240 --> sge_set_message_id_output() {
>>>>  1264  19495 182894186240 <-- sge_set_message_id_output() ../libs/uti/sge_language.c 498 }
>>>>  1265  19495         main --> sge_log() {
>>>>  1266  19495         main     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for JAT_prio !!!!!!!!!!
>>>>  1267  19495 182894186240 --> sge_gettext_() {
>>>>  1268  19495 182894186240 --> sge_get_message_id_output_implementation() {
>>>>  1269  19495 182894186240 <-- sge_get_message_id_output_implementation() ../libs/uti/sge_language.c 582 }
>>>>  1270  19495 182894186240 <-- sge_gettext_() ../libs/uti/sge_language.c 730 }
>>>>  1271  19495         main <-- sge_log() ../libs/uti/sge_log.c 620 }
>>>>
>>>>
>>>> Restarting qmaster doesn't help.
>>>>
>>>> Possibly related to this: I did the upgrade in October, and I
>>>> occasionally get the message "got NULL element for SME_message_list"
>>>> in the qmaster messages file.  When this happens, qmaster shuts down
>>>> and must be restarted.
>>>>
>>>> This happens a few times a week, and I haven't been able to track down
>>>> the cause.  Looking over the logs, there may be a correlation between
>>>> large numbers of jobs finishing and this message, but nothing more solid.
>>>> As a workaround, I've a cron job that checks if the qmaster is running,
>>>> and starts it if needed.
>>>>
>>>> Any suggestions or ideas?
>>>>
>>>>
>>>> --
>>>> Jesse Becker
>>>> NHGRI Linux support (Digicon Contractor)
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92089
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>
>>> http://gridengine.info/
>>>
>>> Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
>>> Amtsgericht Muenchen: HRB 161028
>>> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
>>> Vorsitzender des Aufsichtsrates: Martin Haering
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92194
>>>
>>
>> -- 
>> Jesse Becker
>> NHGRI Linux support (Digicon Contractor)
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92271
>>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92280
> 


-- 
Jesse Becker
NHGRI Linux support (Digicon Contractor)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=92283
