No subject


Wed Jan 12 20:38:46 GMT 2011


#0  0x000000000056f9b0 in cull_hash_free_descr ()
#1  0x000000000056e41c in lFreeList ()
#2  0x000000000056e5bf in lFreeElem ()
#3  0x000000000056e2f3 in lRemoveElem ()
#4  0x000000000056e3e1 in lFreeList ()
#5  0x0000000000432469 in sge_scheduler_main ()
#6  0x00007ffff7771070 in start_thread () from /lib64/libpthread.so.0
#7  0x00007ffff74e411d in clone () from /lib64/libc.so.6

or:
#0  0x0000000000557198 in lCopySwitchPack ()
#1  0x0000000000556e81 in lCopyElemHash ()
#2  0x0000000000556e0e in lCopyElem ()
#3  0x00000000005583fb in lCopyListHash ()
#4  0x0000000000562e89 in lSelectHashPack ()
#5  0x00000000005571b3 in lCopySwitchPack ()
#6  0x0000000000556e81 in lCopyElemHash ()
#7  0x0000000000556e0e in lCopyElem ()
#8  0x00000000005583fb in lCopyListHash ()
#9  0x0000000000562e89 in lSelectHashPack ()
#10 0x00000000005571b3 in lCopySwitchPack ()
#11 0x0000000000556e81 in lCopyElemHash ()
#12 0x0000000000556e0e in lCopyElem ()
#13 0x00000000005583fb in lCopyListHash ()
#14 0x000000000055839e in lCopyList ()
#15 0x00000000004316e9 in sge_scheduler_main ()
#16 0x0000002a959cfaff in start_thread () from /lib64/tls/libpthread.so.0
#17 0x0000002a95b974b3 in clone () from /lib64/tls/libc.so.6

or we get an abort in lRemoveElem(), with this message logged in the qmaster messages file:
"Removing element from other list !!!"

The root cause is an incorrect descriptor for a pe_task in the mirrored lists in sge_scheduler:
The scheduler holds reduced objects, in this case for the PET_Type,
but the JAT_task_list containing the pe_tasks has a full descriptor!

This is most probably caused by a bug in the event client total update code:
When the scheduler thread starts up, it requests a total update from the event master thread. As part of this total update it receives
the job list with all jobs, all their array tasks, and, per array task, the list of pe tasks.
It looks as if this list of pe tasks has a full object descriptor, while later updates (e.g. adding new pe tasks) are done using reduced objects.

A memory debugger shows where the reduced element is created and where the out-of-range read occurs:

[cull_multitype.c:1042] **READ_BAD_INDEX**
>>    return (lList *) ep->cont[pos].glp;

  Reading array out of range: &(ep->cont)[pos]

  Index used : 6
  In block   : 0x00002aaab00825d0 thru 0x00002aaab00825ff (48 bytes)
               calloc(1, sizeof(lMultiType) * n), allocated at cull_list.c, 919
                          calloc()  (interface)
                     lCreateElem()  ../libs/cull/cull_list.c, 919
                lSelectElemDPack()  ../libs/cull/cull_db.c, 670
                    lSelectDPack()  ../libs/cull/cull_db.c, 902
           add_list_event_direct()  ../libs/evm/sge_event_master.c, 2701
   sge_event_master_process_send()  ../libs/evm/sge_event_master.c, 1624
sge_event_master_process_requests()  ../libs/evm/sge_event_master.c, 3488
           sge_event_master_main()  ../daemons/qmaster/sge_thread_event_master.c, 166

  Stack trace where the error occurred:
                     lGetPosList()  ../libs/cull/cull_multitype.c, 1042
                     lWriteElem_()  ../libs/cull/cull_list.c, 728
                    lWriteElemTo()  ../libs/cull/cull_list.c, 671
pe_task_update_master_list_usage()  ../libs/mir/sge_pe_task_mirror.c, 113
    job_update_master_list_usage()  ../libs/mir/sge_job_mirror.c, 94
          job_update_master_list()  ../libs/mir/sge_job_mirror.c, 177
        sge_mirror_process_event()  ../libs/mir/sge_mirror.c, 1571
  sge_mirror_process_event_list_()  ../libs/mir/sge_mirror.c, 1272
   sge_mirror_process_event_list()  ../libs/mir/sge_mirror.c, 1526
              sge_scheduler_main()  ../daemons/qmaster/sge_thread_scheduler.c, 657

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=234596
