[GE users] too many dynamic jobs (qsub -sync y) causes sge_qmaster issues

rpatterson patterso at mail.nih.gov
Mon Aug 24 19:08:15 BST 2009


Daniel,

What's the best way to find out when a patch is released for this? This
issue is killing me :)

Ron


-----Original Message-----
From: templedf [mailto:dan.templeton at sun.com]
Sent: Monday, August 24, 2009 2:06 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] too many dynamic jobs (qsub -sync y) causes
sge_qmaster issues

Hang on!  The subject line isn't actually intended to be referencing an
IZ.  It's just suspiciously similar.  In any case, it looks like we've
found and fixed the issue (although it may not be checked in yet).

Daniel

templedf wrote:
> Ron,
>
> Sounds like it might be the issue that you referenced in the subject
> line.  We've tracked down that issue to out-of-order lock handling in
> the qmaster, and it should be fixed for u4.  It has to do with the event
> master thread competing with the GDI threads for locks, which is why
> DRMAA and qsub -sync tend to cause it.
>
> Daniel
>
> rpatterson wrote:
>
>> I've been trying to track down the root cause of periodic problems with
>> our SGE master (6.2u3), which I've described to this list before
>> (sge_qmaster is up and running, but fails any new client connections,
>> and won't schedule new jobs).
>>
>> Several of our users make heavy use of DRMAA and other dynamic jobs
>> submitted with "qsub -sync y -r y", the latter mostly issued from within
>> makefiles used to control intra-job dependencies.
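>>
>> For illustration, the pattern those makefiles follow is roughly this
>> (the step scripts and queue name here are made up, not our actual setup):
>>
>>     #!/bin/sh
>>     # -sync y makes each qsub block until its job finishes, so the
>>     # caller's ordering carries over to the cluster jobs; every blocked
>>     # qsub also holds a dynamic event client registration with the qmaster
>>     qsub -sync y -r y -q default ./step1.sh && \
>>     qsub -sync y -r y -q default ./step2.sh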
>>
>> I have MAX_DYN_EC=1024 set in my qmaster_params, with 8000+ file
>> descriptors available to the master. Our production cluster has about
>> 250 server nodes and our test cluster has 6 server nodes. I can
>> reproduce this problem on both.
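>>
>> For reference, the current qmaster_params setting can be checked with
>> qconf (the value shown in the comment is just what I'd expect, roughly):
>>
>>     qconf -sconf global | grep qmaster_params
>>     # e.g.  qmaster_params               MAX_DYN_EC=1024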
>>
>> I found that if I submit 200 jobs or so in a quick loop like so:
>>
>>
>> for h in `seq 1 200`; do echo $h; qsub -sync y -r y -q default \
>>   /netopt/lsf/sgetest/examples/jobs/sleeper.sh & done
>>
>> then within two or three passes through the loop, the sge_qmaster will
>> stop responding and the qsub clients will start reporting:
>>
>>
>>
>> error: unable to send message to qmaster using port 7979 on host
>> "distcc01.be-md.ncbi.nlm.nih.gov": got send timeout
>> error: unable to send message to qmaster using port 7979 on host
>> "distcc01.be-md.ncbi.nlm.nih.gov": got send timeout
>> error: unable to send message to qmaster using port 7979 on host
>> "distcc01.be-md.ncbi.nlm.nih.gov": got send timeout
>> error: unable to send message to qmaster using port 7979 on host
>> "distcc01.be-md.ncbi.nlm.nih.gov": got send error
>> error: unable to contact qmaster using port 7979 on host
>> "distcc01.be-md.ncbi.nlm.nih.gov"
>> error: unable to contact qmaster using port 7979 on host
>> "distcc01.be-md.ncbi.nlm.nih.gov"
>>
>> The master will never respond again until it's killed and restarted.
>> More often than not, the master will not come back after a restart until
>> I go and kill a bunch of hanging dynamic job requests.
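>>
>> Killing them looks roughly like this on the submit host (the pattern is
>> only an example -- check what it matches before killing anything, and
>> note this kills the waiting clients, not the jobs themselves):
>>
>>     ps -ef | grep '[q]sub -sync'     # see which submit clients are stuck
>>     pkill -f 'qsub -sync y'          # terminate the stuck clients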
>>
>> Once the master is hung, I see the following gstack trace on the master:
>>
>> # gstack 13897
>> Thread 13 (Thread 1082132832 (LWP 13900)):
>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>> #2  0x0000000000594392 in cl_thread_wait_for_event ()
>> #3  0x0000000000581944 in cl_com_trigger_thread ()
>> #4  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #6  0x0000000000000000 in ?? ()
>> Thread 12 (Thread 1090525536 (LWP 13902)):
>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>> #2  0x0000000000594392 in cl_thread_wait_for_event ()
>> #3  0x0000000000581b01 in cl_com_handle_service_thread ()
>> #4  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #6  0x0000000000000000 in ?? ()
>> Thread 11 (Thread 1098918240 (LWP 13903)):
>> #0  0x0000002a95a8e3b2 in poll () from /lib64/tls/libc.so.6
>> #1  0x000000000058f3ab in cl_com_tcp_open_connection_request_handler ()
>> #2  0x000000000056b9d9 in cl_com_open_connection_request_handler ()
>> #3  0x0000000000581d52 in cl_com_handle_read_thread ()
>> #4  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #6  0x0000000000000000 in ?? ()
>> Thread 10 (Thread 1107310944 (LWP 13904)):
>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>> #2  0x0000000000594392 in cl_thread_wait_for_event ()
>> #3  0x0000000000582a80 in cl_com_handle_write_thread ()
>> #4  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #6  0x0000000000000000 in ?? ()
>> Thread 9 (Thread 1115703648 (LWP 13908)):
>> #0  0x0000002a958d44eb in do_sigwait () from /lib64/tls/libpthread.so.0
>> #1  0x0000002a958d45ad in sigwait () from /lib64/tls/libpthread.so.0
>> #2  0x00000000004329ce in sge_signaler_main ()
>> #3  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #4  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #5  0x0000000000000000 in ?? ()
>> Thread 8 (Thread 1124096352 (LWP 13909)):
>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>> #1  0x00000000005b4eb9 in sge_fifo_lock ()
>> #2  0x00000000005b4acd in sge_lock ()
>> #3  0x00000000004d7ee7 in sge_event_master_process_mod_event_client ()
>> #4  0x00000000004dd2be in sge_event_master_process_requests ()
>> #5  0x000000000042df6e in sge_event_master_main ()
>> #6  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #8  0x0000000000000000 in ?? ()
>> Thread 7 (Thread 1132489056 (LWP 13910)):
>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>> #1  0x00000000005b4eb9 in sge_fifo_lock ()
>> #2  0x00000000005b4acd in sge_lock ()
>> #3  0x0000000000451254 in sge_load_value_cleanup_handler ()
>> #4  0x0000000000487a55 in te_scan_table_and_deliver ()
>> #5  0x000000000042eeba in sge_timer_main ()
>> #6  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #8  0x0000000000000000 in ?? ()
>> Thread 6 (Thread 1140881760 (LWP 13911)):
>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>> #1  0x00000000005b4eb9 in sge_fifo_lock ()
>> #2  0x00000000005b4acd in sge_lock ()
>> #3  0x000000000042d880 in sge_worker_main ()
>> #4  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #5  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #6  0x0000000000000000 in ?? ()
>> Thread 5 (Thread 1149274464 (LWP 13912)):
>> #0  0x0000002a958d30ab in __lll_mutex_lock_wait ()
>> #1  0x0000002a958cffcc in pthread_mutex_lock () from
>> /lib64/tls/libpthread.so.0
>> #2  0x000000000055e207 in lGetObject ()
>> #3  0x00000000005660a6 in cull_state_getspecific ()
>> #4  0x000000000055d5ca in lGetPosViaElem ()
>> #5  0x0000002a958d029e in pthread_mutex_unlock ()
>> #6  0x00000000005b4f58 in sge_fifo_lock ()
>> #7  0x00000000004d85b0 in sge_set_max_dynamic_event_clients ()
>> #8  0x000000000043cbb9 in sge_c_gdi_add ()
>> #9  0x000000000043bb8c in sge_c_gdi ()
>> #10 0x000000000042d854 in sge_worker_main ()
>> #11 0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #12 0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #13 0x0000000000000000 in ?? ()
>> Thread 4 (Thread 1157667168 (LWP 13913)):
>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>> #2  0x000000000057e708 in cl_commlib_receive_message ()
>> #3  0x00000000004efe35 in sge_gdi2_get_any_request ()
>> #4  0x000000000048b20e in sge_qmaster_process_message ()
>> #5  0x000000000042cd38 in sge_listener_main ()
>> #6  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #8  0x0000000000000000 in ?? ()
>> Thread 3 (Thread 1166059872 (LWP 13914)):
>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>> #1  0x0000000000593c6c in cl_thread_wait_for_thread_condition ()
>> #2  0x000000000057e708 in cl_commlib_receive_message ()
>> #3  0x00000000004efe35 in sge_gdi2_get_any_request ()
>> #4  0x000000000048b20e in sge_qmaster_process_message ()
>> #5  0x000000000042cd38 in sge_listener_main ()
>> #6  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #7  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #8  0x0000000000000000 in ?? ()
>> Thread 2 (Thread 1174452576 (LWP 13915)):
>> #0  0x0000002a958d0ef2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>> #1  0x000000000043251f in sge_scheduler_wait_for_event ()
>> #2  0x0000000000431c39 in sge_scheduler_main ()
>> #3  0x0000002a958ceb8f in start_thread () from
>> /lib64/tls/libpthread.so.0
>> #4  0x0000002a95a96693 in clone () from /lib64/tls/libc.so.6
>> #5  0x0000000000000000 in ?? ()
>> Thread 1 (Thread 182894214976 (LWP 13897)):
>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>> #1  0x00000000005adc80 in sge_thread_wait_for_signal ()
>> #2  0x000000000042bd0b in main ()
>> #0  0x0000002a958d0ced in pthread_cond_wait@@GLIBC_2.3.2 ()
>>
>> There are no errors logged by the sge_qmaster, but no client connections
>> can be established (qhost/qsub/qstat). Although my production cluster is
>> running classic spooling, this also happens on my test cluster, which is
>> using Berkeley DB spooling.
>>
>>
>> Is this a known issue? The master does not appear to be running out of
>> file descriptors or any other resource that I can see. Any ideas would
>> be appreciated.
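>>
>> A rough way to check descriptor usage on the running master would be the
>> following (Linux-specific, and /proc/<pid>/limits only exists on newer
>> kernels):
>>
>>     pid=$(pgrep -x sge_qmaster)
>>     ls /proc/$pid/fd | wc -l               # descriptors currently open
>>     grep 'open files' /proc/$pid/limits    # per-process limit, if present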
>>
>> Ron
>>
>>
>>
>>
>>  -----------------------------------
>> Ron Patterson
>> UNIX Systems Administrator
>> NCBI/NLM/NIH contractor
>> 301.435.5956
>>