[GE users] too many dynamic jobs (qsub -sync y) causes sge_qmaster issues

templedf dan.templeton at sun.com
Mon Aug 24 20:53:29 BST 2009


Watch IZ3113:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=3113.  When the
fix is checked in, the IZ will be updated.  To get it, you'll either
have to build from source or wait until u4.

Daniel

rpatterson wrote:
> Daniel,
>
> What's the best way to find out when a patch is released for this? This
> issue is killing me :)
>
> Ron
>
>
> -----Original Message-----
> From: templedf [mailto:dan.templeton at sun.com]
> Sent: Monday, August 24, 2009 2:06 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] too many dynamic jobs (qsub -sync y) causes
> sge_qmaster issues
>
> Hang on!  The subject line isn't actually meant to reference an IZ; it's
> just suspiciously similar.  In any case, it looks like we've found and
> fixed the issue (although it may not be checked in yet).
>
> Daniel
>
> templedf wrote:
>
>> Ron,
>>
>> Sounds like it might be the issue that you referenced in the subject
>> line.  We've tracked down that issue to out-of-order lock handling in
>> the qmaster, and it should be fixed for u4.  It has to do with the event
>> master thread competing with the GDI threads for locks, which is why
>> DRMAA and qsub -sync tend to cause it.
>>
>> Daniel
>>
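
For anyone unfamiliar with that failure mode, here is a minimal sketch of
out-of-order locking (illustrative only, not qmaster source; the thread
names are just labels for the roles described above).  Each thread takes
the same pair of locks in the opposite order, so each ends up holding one
lock while waiting forever for the other, which is the general shape of the
hang visible in the gstack output quoted further down.  It compiles with
"gcc -pthread deadlock.c".

/* Illustrative lock-ordering deadlock, not qmaster source. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the event master thread: takes lock_a, then lock_b. */
static void *event_master_like(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_a);
    sleep(1);                       /* widen the race window */
    pthread_mutex_lock(&lock_b);    /* blocks once the peer holds lock_b */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

/* Stands in for a GDI worker thread: takes lock_b, then lock_a. */
static void *gdi_worker_like(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_b);
    sleep(1);
    pthread_mutex_lock(&lock_a);    /* blocks once the peer holds lock_a */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, event_master_like, NULL);
    pthread_create(&t2, NULL, gdi_worker_like, NULL);
    pthread_join(t1, NULL);         /* never returns: both threads are deadlocked */
    pthread_join(t2, NULL);
    puts("not reached");
    return 0;
}
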
>> rpatterson wrote:
>>
>>
>>> I've been trying to track down the root cause of periodic problems with
>>> our SGE master (6.2u3), which I've described to this list before
>>> (sge_qmaster is up and running, but fails any new client connections,
>>> and won't schedule new jobs).
>>>
>>> Several of our users make heavy use of DRMAA and other dynamic jobs
>>> submitted with "qsub -sync y -r y", the latter mostly issued from within
>>> makefiles used to control intra-job dependencies.
>>>
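
For context, the DRMAA path does the library-level equivalent of the
"qsub -sync y" pattern.  Below is a minimal, assumed sketch (the script
path is a placeholder, and most error checks are omitted): while the
session is open and waiting on the job, it holds one of the qmaster's
dynamic event client slots, which is what the MAX_DYN_EC setting mentioned
just below bounds.

/* Minimal DRMAA submit-and-wait sketch; illustrative only. */
#include <stdio.h>
#include "drmaa.h"

int main(void)
{
    char err[DRMAA_ERROR_STRING_BUFFER];
    char jobid[DRMAA_JOBNAME_BUFFER];
    char waited[DRMAA_JOBNAME_BUFFER];
    drmaa_job_template_t *jt = NULL;
    drmaa_attr_values_t *rusage = NULL;
    int stat = 0;

    /* Opening the session registers this process with the qmaster. */
    if (drmaa_init(NULL, err, sizeof(err) - 1) != DRMAA_ERRNO_SUCCESS) {
        fprintf(stderr, "drmaa_init: %s\n", err);
        return 1;
    }

    drmaa_allocate_job_template(&jt, err, sizeof(err) - 1);
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/path/to/sleeper.sh",
                        err, sizeof(err) - 1);
    drmaa_run_job(jobid, sizeof(jobid) - 1, jt, err, sizeof(err) - 1);

    /* Block until the job finishes, like "qsub -sync y" does. */
    drmaa_wait(jobid, waited, sizeof(waited) - 1, &stat,
               DRMAA_TIMEOUT_WAIT_FOREVER, &rusage, err, sizeof(err) - 1);

    drmaa_release_attr_values(rusage);
    drmaa_delete_job_template(jt, err, sizeof(err) - 1);
    drmaa_exit(err, sizeof(err) - 1);
    return 0;
}
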
>>> I have MAX_DYN_EC=1024 set in my qmaster_params, with 8000+ file
>>> descriptors available to the master. Our production cluster has about
>>> 250 server nodes and our test cluster has 6 server nodes. I can
>>> reproduce this problem on both.
>>>
>>> I found that if I submit 200 jobs or so in a quick loop like so:
>>>
>>>
>>> for h in `seq 1 200`; do echo $h; qsub -sync y -r y -q default \
>>>     /netopt/lsf/sgetest/examples/jobs/sleeper.sh & done
>>>
>>> Within two or three passes through this loop, the sge_qmaster will stop
>>> responding, and the qsub clients will start reporting:
>>>
>>>
>>>
>>> error: unable to send message to qmaster using port 7979 on host
>>> "distcc01.be-md.ncbi.nlm.nih.gov": got send timeout
>>> error: unable to send message to qmaster using port 7979 on host
>>> "distcc01.be-md.ncbi.nlm.nih.gov": got send timeout
>>> error: unable to send message to qmaster using port 7979 on host
>>> "distcc01.be-md.ncbi.nlm.nih.gov": got send timeout
>>> error: unable to send message to qmaster using port 7979 on host
>>> "distcc01.be-md.ncbi.nlm.nih.gov": got send error
>>> error: unable to contact qmaster using port 7979 on host
>>> "distcc01.be-md.ncbi.nlm.nih.gov"
>>> error: unable to contact qmaster using port 7979 on host
>>> "distcc01.be-md.ncbi.nlm.nih.gov"
>>>
>>> The master will never respond again until it's killed and restarted.
>>> More often than not, the master will not come back after a restart until
>>> I go and kill a bunch of hanging dynamic job requests.
>>>
>>> Once the master is hung, I see the following gstack trace on the master:
>>>
>>> # gstack 13897
>>> Thread 13 (Thread 1082132832 (LWP 13900)):
>>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>>> #2  0x0000000000594392 in cl_thread_wait_for_event ()
>>> #3  0x0000000000581944 in cl_com_trigger_thread ()
>>> #4  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #6  0x0000000000000000 in ?? ()
>>> Thread 12 (Thread 1090525536 (LWP 13902)):
>>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>>> #2  0x0000000000594392 in cl_thread_wait_for_event ()
>>> #3  0x0000000000581b01 in cl_com_handle_service_thread ()
>>> #4  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #6  0x0000000000000000 in ?? ()
>>> Thread 11 (Thread 1098918240 (LWP 13903)):
>>> #0  0x0000002a95a8e3b2 in poll () from /lib64/tls/libc.so.6
>>> #1  0x000000000058f3ab in cl_com_tcp_open_connection_request_handler ()
>>> #2  0x000000000056b9d9 in cl_com_open_connection_request_handler ()
>>> #3  0x0000000000581d52 in cl_com_handle_read_thread ()
>>> #4  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #6  0x0000000000000000 in ?? ()
>>> Thread 10 (Thread 1107310944 (LWP 13904)):
>>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>>> #2  0x0000000000594392 in cl_thread_wait_for_event ()
>>> #3  0x0000000000582a80 in cl_com_handle_write_thread ()
>>> #4  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #6  0x0000000000000000 in ?? ()
>>> Thread 9 (Thread 1115703648 (LWP 13908)):
>>> #0  0x0000002a958d44eb in do_sigwait () from
>>> /lib64/tls/libpthread.so.0
>>> #1  0x0000002a958d45ad in sigwait () from /lib64/tls/libpthread.so.0
>>> #2  0x00000000004329ce in sge_signaler_main ()
>>> #3  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #4  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #5  0x0000000000000000 in ?? ()
>>> Thread 8 (Thread 1124096352 (LWP 13909)):
>>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>>> #1  0x00000000005b4eb9 in sge_fifo_lock ()
>>> #2  0x00000000005b4acd in sge_lock ()
>>> #3  0x00000000004d7ee7 in sge_event_master_process_mod_event_client ()
>>> #4  0x00000000004dd2be in sge_event_master_process_requests ()
>>> #5  0x000000000042df6e in sge_event_master_main ()
>>> #6  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #8  0x0000000000000000 in ?? ()
>>> Thread 7 (Thread 1132489056 (LWP 13910)):
>>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>>> #1  0x00000000005b4eb9 in sge_fifo_lock ()
>>> #2  0x00000000005b4acd in sge_lock ()
>>> #3  0x0000000000451254 in sge_load_value_cleanup_handler ()
>>> #4  0x0000000000487a55 in te_scan_table_and_deliver ()
>>> #5  0x000000000042eeba in sge_timer_main ()
>>> #6  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #8  0x0000000000000000 in ?? ()
>>> Thread 6 (Thread 1140881760 (LWP 13911)):
>>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>>> #1  0x00000000005b4eb9 in sge_fifo_lock ()
>>> #2  0x00000000005b4acd in sge_lock ()
>>> #3  0x000000000042d880 in sge_worker_main ()
>>> #4  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #6  0x0000000000000000 in ?? ()
>>> Thread 5 (Thread 1149274464 (LWP 13912)):
>>> #0  0x0000002a958d30ab in __lll_mutex_lock_wait ()
>>> #1  0x0000002a958cffcc in pthread_mutex_lock () from
>>> /lib64/tls/libpthread.so.0
>>> #2  0x000000000055e207 in lGetObject ()
>>> #3  0x00000000005660a6 in cull_state_getspecific ()
>>> #4  0x000000000055d5ca in lGetPosViaElem ()
>>> #5  0x0000002a958d029e in pthread_mutex_unlock ()
>>> #6  0x00000000005b4f58 in sge_fifo_lock ()
>>> #7  0x00000000004d85b0 in sge_set_max_dynamic_event_clients ()
>>> #8  0x000000000043cbb9 in sge_c_gdi_add ()
>>> #9  0x000000000043bb8c in sge_c_gdi ()
>>> #10 0x000000000042d854 in sge_worker_main ()
>>> #11 0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #12 0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #13 0x0000000000000000 in ?? ()
>>> Thread 4 (Thread 1157667168 (LWP 13913)):
>>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>>> #2  0x000000000057e708 in cl_commlib_receive_message ()
>>> #3  0x00000000004efe35 in sge_gdi2_get_any_request ()
>>> #4  0x000000000048b20e in sge_qmaster_process_message ()
>>> #5  0x000000000042cd38 in sge_listener_main ()
>>> #6  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #8  0x0000000000000000 in ?? ()
>>> Thread 3 (Thread 1166059872 (LWP 13914)):
>>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>>> #2  0x000000000057e708 in cl_commlib_receive_message ()
>>> #3  0x00000000004efe35 in sge_gdi2_get_any_request ()
>>> #4  0x000000000048b20e in sge_qmaster_process_message ()
>>> #5  0x000000000042cd38 in sge_listener_main ()
>>> #6  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #8  0x0000000000000000 in ?? ()
>>> Thread 2 (Thread 1174452576 (LWP 13915)):
>>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>> #1  0x000000000043251f in sge_scheduler_wait_for_event ()
>>> #2  0x0000000000431c39 in sge_scheduler_main ()
>>> #3  0x0000002a958ceb8f in start_thread () from
>>> /lib64/tls/libpthread.so.0
>>> #4  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>>> #5  0x0000000000000000 in ?? ()
>>> Thread 1 (Thread 182894214976 (LWP 13897)):
>>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>>> #1  0x00000000005adc80 in sge_thread_wait_for_signal ()
>>> #2  0x000000000042bd0b in main ()
>>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>>>
>>> There are no errors logged by the sge_qmaster, but no client connections
>>> can be established (qhost/qsub/qstat). Although my production cluster is
>>> running classic spooling, this also happens on my test cluster, which is
>>> using Berkeley DB spooling.
>>>
>>>
>>> Is this a known issue? The master does not appear to be running out of
>>> file descriptors or any other resource that I can see. Any ideas would
>>> be appreciated.
>>>
>>> Ron
>>>
>>>
>>>
>>>
>>>  -----------------------------------
>>> Ron Patterson
>>> UNIX Systems Administrator
>>> NCBI/NLM/NIH contractor
>>> 301.435.5956

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=214036

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


