[GE users] strange delays in gridengine commands

templedf dan.templeton at sun.com
Thu Jan 28 15:25:01 GMT 2010


I assume you're not running on Solaris.  If you were, the DTrace script 
that comes with Grid Engine would point you in the right direction.  
Instead, you can try running the qmaster with debugging output turned on 
and redirected to a file, although doing so will itself slow the qmaster 
down slightly.  See:

http://blogs.sun.com/templedf/entry/using_debugging_output
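
Because the delays are intermittent, it may also help to record when they occur, so that the slow calls can be matched against timestamps in the debug output. Below is a minimal sketch in Python, not a Grid Engine tool: the function names `time_command` and `probe`, and the sample count, interval, and threshold, are all illustrative assumptions. It only runs a client command repeatedly and remembers the runs that were slow or timed out.

```python
import subprocess
import time
from datetime import datetime


def time_command(cmd, timeout=120.0):
    """Run cmd once; return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within `timeout`.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        rc = proc.returncode
    except subprocess.TimeoutExpired:
        rc = None
    return time.monotonic() - start, rc


def probe(cmd, samples, interval, threshold):
    """Run cmd `samples` times, sleeping `interval` seconds between runs.

    Returns a list of (start_timestamp, elapsed_seconds) for every run
    that timed out or took at least `threshold` seconds.
    """
    slow = []
    for i in range(samples):
        stamp = datetime.now().isoformat(timespec="seconds")
        elapsed, rc = time_command(cmd)
        if rc is None or elapsed >= threshold:
            slow.append((stamp, elapsed))
        if i < samples - 1:
            time.sleep(interval)
    return slow
```

For example, `probe(["qhost"], samples=120, interval=30.0, threshold=5.0)` would sample qhost every 30 seconds for about an hour and return the start times of any call that took 5 seconds or longer. Those values are arbitrary and should be adjusted to how often the delay shows up.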

Daniel

On 01/28/10 03:48, ruppert wrote:
> Hi,
>
> For the past few days we have been seeing strange delays when executing
> gridengine commands. For example, a simple 'qhost' or 'qstat'
> command, which usually takes less than one second to complete,
> takes almost one minute. The same command, issued some minutes
> later, may complete without this delay.
>
> This is not load related; we have only about 60 single-processor
> execution nodes (Solaris10/Sparc), the load on the qmaster
> host is usually around 0.1, and this also happens when all execution
> hosts are idle. The SGE version is 6.0u6. There is nothing obviously
> suspicious in the various messages files.
>
> How can I investigate this further? Is there any trace
> facility that could reveal where these commands spend their time?
>
> From a simple 'truss qhost' I see that the client side transmits
> a binary packet to the qmaster port on the qmaster host, and then
> there is a long delay in "pollsys" (probably a select) before a
> response arrives:
>
> ...
> write(6, 0x1002F33A0, 99)			= 99
>     <  m i h   v e r s i o n = " 0 . 1 ">  <  m i d>  1<  / m i d>  <
>     d l>  4 6 1<  / d l>  <  d f>  b i n<  / d f>  <  m a t>  a c k<
>     / m a t>  <  t a g>  2<  / t a g>  <  r i d>  0<  / r i d>  <  / m
>     i h>
> write(6, 0x1002F43B0, 461)			= 461
>    \0\0\0\01002\0\0\0\0\001\0\0\00310\01001\0\0\0\0\0\0\0\0\0\0\001
>    ...
>    d d d d ~ * d b d , d * d * ~ * h ~ h , ~ * g n d ~ g = g d g l
>    \0\0\0\005\0\0\0\0\0\0\0\0
> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) (sleeping...)
> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 0
>
> ... many pollsys ...
>
> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) (sleeping...)
> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 0
> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 0
> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 1
> read(6, 0x1002F2390, 22)			= 22
>     <  g m s h>  <  d l>  9 7<  / d l>  <  / g m s
> read(6, " h", 1)				= 1
> read(6, ">", 1)				= 1
> read(6, 0x1002F2390, 97)			= 97
>     <  m i h   v e r s i o n = " 0 . 1 ">  <  m i d>  1<  / m i d>  <
>     d l>  3 5<  / d l>  <  d f>  a m<  / d f>  <  m a t>  n a k<  / m
>     a t>  <  t a g>  0<  / t a g>  <  r i d>  0<  / r i d>  <  / m i h
>     >
>
> Is it possible to somehow trace the qmaster side?
>
> Regards
> D. Ruppert
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241480
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241516