[GE users] strange delays in gridengine commands

templedf dan.templeton at sun.com
Thu Jan 28 15:51:43 GMT 2010



The DTrace script was added in 6.0u8, I believe.  There's no reason it 
shouldn't work with 6.0u6.  Just grab the scripts and readme from here:

http://gridengine.sunsource.net/source/browse/gridengine/source/scripts/dtrace/

You'll probably need help interpreting the results, so just send us what 
you get.

Daniel

On 01/28/10 07:40, ruppert wrote:
> Yes, we use Solaris 10, but I was not aware of a DTrace script (we
> use SGE 6.0u6). Is it part of newer SGE versions? Might it also
> work with 6.0u6?
>
> D. Ruppert
>
>    
>> I assume you're not running on Solaris.  If you were, the DTrace script
>> that comes with Grid Engine would point you in the right direction.
>> Instead, you can try running the qmaster with debugging turned on and
>> redirected to a file, but that will itself cause some minor qmaster
>> performance issues.  See:
>>
>> http://blogs.sun.com/templedf/entry/using_debugging_output
>>
>> Daniel
>>
>> On 01/28/10 03:48, ruppert wrote:
>>      
>>> Hi,
>>>
>>> For a few days now, we have been experiencing strange delays when
>>> executing gridengine commands. For example, a simple 'qhost' or
>>> 'qstat' command, which usually completes in less than one second,
>>> takes almost one minute. The same command, issued a few minutes
>>> later, may complete without this delay.
>>>
>>> This is not load-related: we have only about 60 single-processor
>>> execution nodes (Solaris10/Sparc), the load on the qmaster host is
>>> usually around 0.1, and the delay also occurs when all execution
>>> hosts are idle. The SGE version is 6.0u6. There is nothing obviously
>>> suspicious in the various messages files.
>>>
>>> How can I investigate this further? Is there a trace facility that
>>> could reveal where these commands spend their time?
>>>
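Because the delay is intermittent, it can help to sample a client command over time and record how long each run takes, so slow runs can later be matched against qmaster activity or the time of day. The following is a minimal Python sketch of that idea; the helper names `time_command` and `slow_runs`, the sampling interval, and the threshold are hypothetical choices, not part of Grid Engine.

```python
import subprocess
import time


def time_command(cmd, runs=5, pause=0.0):
    """Run cmd (an argument list such as ["qstat"]) repeatedly.

    Returns the wall-clock duration of each run in seconds. Output is
    discarded and a non-zero exit status is not treated as an error.
    Hypothetical helper, not a Grid Engine tool.
    """
    durations = []
    for i in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.monotonic() - start)
        # Sleep between runs, but not after the last one.
        if pause and i < runs - 1:
            time.sleep(pause)
    return durations


def slow_runs(durations, threshold):
    """Return (run_index, seconds) pairs for runs slower than threshold."""
    return [(i, d) for i, d in enumerate(durations) if d > threshold]
```

For example, `time_command(["qstat"], runs=60, pause=60.0)` would sample `qstat` once a minute for an hour, and `slow_runs(samples, threshold=5.0)` would list the runs that hit the delay.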
>>> From a simple 'truss qhost' I see that the client sends a binary
>>> packet to the qmaster port on the qmaster host and then waits a long
>>> time in "pollsys" (probably a select) before a response arrives:
>>>
>>> ...
>>> write(6, 0x1002F33A0, 99)			= 99
>>>      <   m i h   v e r s i o n = " 0 . 1 ">   <   m i d>   1<   / m i d>   <
>>>      d l>   4 6 1<   / d l>   <   d f>   b i n<   / d f>   <   m a t>   a c k<
>>>      / m a t>   <   t a g>   2<   / t a g>   <   r i d>   0<   / r i d>   <   / m
>>>      i h>
>>> write(6, 0x1002F43B0, 461)			= 461
>>>     \0\0\0\01002\0\0\0\0\001\0\0\00310\01001\0\0\0\0\0\0\0\0\0\0\001
>>>     ...
>>>     d d d d ~ * d b d , d * d * ~ * h ~ h , ~ * g n d ~ g = g d g l
>>>     \0\0\0\005\0\0\0\0\0\0\0\0
>>> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) (sleeping...)
>>> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 0
>>>
>>> ... many pollsys ...
>>>
>>> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) (sleeping...)
>>> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 0
>>> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 0
>>> pollsys(0xFFFFFFFF7FFF8100, 1, 0xFFFFFFFF7FFF8200, 0x00000000) = 1
>>> read(6, 0x1002F2390, 22)			= 22
>>>      <   g m s h>   <   d l>   9 7<   / d l>   <   / g m s
>>> read(6, " h", 1)				= 1
>>> read(6, ">", 1)				= 1
>>> read(6, 0x1002F2390, 97)			= 97
>>>      <   m i h   v e r s i o n = " 0 . 1 ">   <   m i d>   1<   / m i d>   <
>>>      d l>   3 5<   / d l>   <   d f>   a m<   / d f>   <   m a t>   n a k<   / m
>>>      a t>   <   t a g>   0<   / t a g>   <   r i d>   0<   / r i d>   <   / m i h
>>>      >
>>>
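If the trace is rerun with timestamps (Solaris `truss -d` prefixes each line with the seconds elapsed since the start of the trace), the length of the pollsys wait can be measured rather than inferred from the number of pollsys lines. The sketch below assumes that line shape (an optional `/LWP:` prefix, then `seconds.fraction`, then the system call); the regular expression and the helper names are hypothetical and may need adjusting to the actual output.

```python
import re

# Assumed 'truss -d' line shape: optional "/LWP:" prefix, a seconds.fraction
# timestamp, then the system call, e.g. "12.3456 pollsys(...) = 0".
TRUSS_LINE = re.compile(r"^\s*(?:/\d+:\s*)?(\d+\.\d+)\s+(\w+)\(")


def parse_truss(lines):
    """Return (timestamp, syscall_name, raw_line) for lines matching TRUSS_LINE.

    Continuation lines (such as the hex/character dumps of write buffers)
    do not match and are skipped.
    """
    events = []
    for line in lines:
        m = TRUSS_LINE.match(line)
        if m:
            events.append((float(m.group(1)), m.group(2), line.rstrip("\n")))
    return events


def largest_gaps(events, top=3):
    """Return the `top` largest gaps between consecutive events.

    Each item is (gap_seconds, line_before_gap, line_after_gap).
    """
    gaps = [(t1 - t0, before, after)
            for (t0, _, before), (t1, _, after) in zip(events, events[1:])]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

Feeding a saved trace through `largest_gaps(parse_truss(open(path)))` should show whether nearly all of the minute is spent between the last `write` to the qmaster and the `pollsys` that finally returns 1.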
>>> Is it possible to somehow trace the qmaster side?
>>>
>>> Regards
>>> D. Ruppert
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241480
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241516
>>
>
> ----------------------------------
> ePS&  RTS Automation Software GmbH
> Benzstr. 1
> D-71272 Renningen
> Managing Directors: Gernot Kral, Frank Lubnau
> Registered office: Renningen
> Register court: Leonberg HRB 253220
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241520
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241528
