[GE users] Failed receiving gdi request

Kogan, Felix Felix-Kogan at deshaw.com
Thu Aug 31 15:01:49 BST 2006


Thanks for the prompt and detailed answer, Andreas. I hope the fixes for
the bugs you mentioned will arrive soon.

A note regarding the possible fix for the failed gdi request problem:
I've noticed that static thresholds are rarely a satisfactory solution.
They inevitably end up throttling the system without real need. In
our case, where would I set the threshold? At 200 messages in the queue?
At 600? Finding out would require extensive research, turning on
profiling and other such interesting but time-consuming efforts. And
everything would need to be done again if, for example, the hardware or
software platform qmaster runs on changed.

I think the algorithm should be subtler than that. I think it should
track the average request processing time (requests are FIFO, I
assume?) and start reporting overload errors when that time grows to,
say, 0.75 of the shortest known timeout (I assume qmaster would know all
the timeouts that are set up). In that case everything would be resolved
automatically.
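
To make the idea concrete, here is a minimal Python sketch of such an adaptive check. It is only an illustration, not anything qmaster implements: the class name, the sliding window of 100 samples, and the example timeouts (600 s and 900 s) are all assumptions made for the example.

```python
from collections import deque


class OverloadDetector:
    """Flags overload when the average request processing time
    approaches the shortest known client timeout."""

    def __init__(self, timeouts_s, fraction=0.75, window=100):
        if not timeouts_s:
            raise ValueError("at least one timeout is required")
        # Assumption: the server knows every timeout its clients use.
        self.limit_s = fraction * min(timeouts_s)
        # Keep only the most recent processing times (sliding window).
        self.samples = deque(maxlen=window)

    def record(self, processing_time_s):
        """Store the processing time of one completed request."""
        self.samples.append(processing_time_s)

    def average(self):
        """Mean processing time over the window (0.0 if empty)."""
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def overloaded(self):
        """True once the average reaches the configured fraction."""
        return self.average() >= self.limit_s


# Illustrative values only: two hypothetical client timeouts.
detector = OverloadDetector(timeouts_s=[600.0, 900.0])
for t in (100.0, 200.0, 300.0):
    detector.record(t)
print(detector.overloaded())
```

Unlike a fixed queue-length threshold, the limit here follows from the timeouts themselves, so it would not need retuning when the hardware changes.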


Thanks, 

Felix


-----Original Message-----
From: Andreas.Haas at Sun.COM [mailto:Andreas.Haas at Sun.COM] 
Sent: Thursday, August 31, 2006 8:16 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] Failed receiving gdi request


Hi Felix,

On Wed, 30 Aug 2006, Kogan, Felix wrote:

> What I would really like to know: does receiving this error message
> ("failed receiving gdi request") _always_ mean that the requested action
> (job submission, qmod action, qstat request) didn't go through? Is a
> situation when the job has actually been submitted by qsub (i.e. landed
> in the pending queue), but qsub reported gdi request error, possible? If

It is possible. The "failed receiving gdi request" message 
always means that the client waited for a reply to a request that
was accepted by qmaster, but the reply wasn't delivered within a 
timeout of 10 minutes or so.

> it is not possible, a reasonable workaround would be to retry the
> requests (e.g. qsub command execution) until succeeded. If it is
> possible, the retrying approach wouldn't work, as we inevitably will end
> up with duplicate submissions.

It is true that this problem exists.

The problem is that overloaded qmasters can never be ruled out 
entirely. That means it is necessary to improve the behaviour of Grid 
Engine in such situations. In the concrete case you describe
above, the improvement would be that qsub(1) doesn't fail in some
undefined way, but instead returns an error indication telling you that
qmaster is overloaded and submission failed for that reason. I believe 
this could be achieved by enhancing qmaster so that incoming
requests are no longer accepted when a certain threshold of 
pending messages in the read buffer is exceeded. If the client then
got a notification about the rejected request, this would be
a situation where error code 25 would be returned by qsub(1), 
very much as in cases when the sge_conf(5) max_jobs limit is reached.
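
As a rough sketch of how a submit wrapper could use such an error indication, the Python below retries only when the exit status is 25 and stops on anything else, since any other failure leaves the outcome unknown and a retry might duplicate the job. It assumes the enhancement described above exists; `run_qsub` is a hypothetical callable that returns the qsub exit status, and the attempt count and delay are arbitrary.

```python
import time

# Exit status mentioned above for a submission rejected by qmaster.
REJECTED = 25


def submit_with_retry(run_qsub, max_attempts=5, delay_s=0.0):
    """Call run_qsub() until it succeeds or retrying becomes unsafe.

    run_qsub is a hypothetical callable returning qsub's exit status.
    Returns (outcome, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        status = run_qsub()
        if status == 0:
            return ("submitted", attempt)
        if status != REJECTED:
            # Outcome unknown (e.g. a lost reply): the job may already
            # be queued, so a retry could create a duplicate.
            return ("unknown", attempt)
        if attempt < max_attempts:
            time.sleep(delay_s)  # back off before the next attempt
    return ("rejected", max_attempts)


# Simulated exit statuses: rejected twice, then accepted.
statuses = iter([25, 25, 0])
print(submit_with_retry(lambda: next(statuses)))
```

In real use, `run_qsub` could wrap something like `subprocess.run(["qsub", "job.sh"]).returncode`, where the script name is a placeholder.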

> I've run some profiling on our SGE installation (v6 update 8) that also
> produced gdi request errors recently. It appears that these errors
> always correspond in time to very long scheduler runs (normal - about 2
> seconds, long - about 60 seconds). How is successful submission related
> to a successful scheduling? I thought these are two quite independent
> processes. Or does it mean that qmaster machine was just overloaded at
> that time? It is strange - we have quite peppy dual Opteron-based
> Solaris 10 machines running as dedicated qmasters and I have never seen
> them particularly loaded...

I agree with you: long scheduling times and GDI request 
errors shouldn't actually be related, but unfortunately it would take me 
much more time to really understand your concrete case. Yet if the GDI 
request errors really are an outcome of overly long scheduling 
times, it is quite possible that you will see a decisive improvement 
when you use 6.0u9, as it will bring fixes for

    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2093
    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2094

OTOH, if you wish to trace this yourself, I recommend that you use
qping -dump in the meantime:

    # qping -dump gridware $SGE_QMASTER_PORT qmaster 1
    open connection to "es-ergb01-01/qmaster/1" ... no error happened
               time|local                 |d.|remote                        |format|ack type|               msg tag|msg id|msg rid|msg len|       msg time|   msg ltime|con count|
    ---------------|----------------------|--|------------------------------|------|--------|----------------------|------|-------|-------|---------------|------------|---------|
    14:07:03.632328|es-ergb01-01/qmaster/1|->|es-ergb01-01/debug_client/475 |   crm|     nak|                     0|     0|      0|    235|14:07:03.632327|00:00.000000|       10|
    14:07:04.001733|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|    TAG_REPORT_REQUEST| 33937|      0|    197|14:07:04.001558|00:00.000174|       10|
    14:07:04.003214|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_ACK_REQUEST|  9088|      0|     20|14:07:04.003202|00:00.000011|       10|
    14:07:04.012578|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_GDI_REQUEST|  9089|      0|    309|14:07:04.012558|00:00.000019|       10|
    14:07:04.013149|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_GDI_REQUEST| 33938|   9089|    107|14:07:04.013045|00:00.000103|       10|
    14:07:04.014032|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1         |   bin|     ack|       TAG_GDI_REQUEST|  9090|      0|    801|14:07:04.014025|00:00.000007|       10|
    14:07:04.014177|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |    am|     nak|                     0| 33939|      0|     38|14:07:04.014059|00:00.000118|       10|
    14:07:04.014650|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_GDI_REQUEST| 33940|   9090|    205|14:07:04.014489|00:00.000160|       10|
    14:07:05.001541|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|    TAG_REPORT_REQUEST| 33941|      0|    197|14:07:05.001347|00:00.000192|       10|

It allows you to track all incoming and outgoing messages of qmaster.

If you can reproduce your error condition, qping -dump would allow
you to figure out which component (client/execd/schedd?) is actually
sending so many requests to qmaster in the phase where the read messages
grow so rapidly.
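
A captured dump can be summarised with a few lines of Python. The sketch below is not part of Grid Engine. It assumes one record per line with the thirteen `|`-separated columns named in the qping header, and it reads `<-` as a message received by qmaster from the remote side, which matches the sample output. The function name is made up for the example.

```python
from collections import Counter


def count_incoming_by_component(dump_lines):
    """Count messages qmaster received ('<-' rows), grouped by the
    component name in the remote column (host/component/id)."""
    counts = Counter()
    for line in dump_lines:
        fields = [f.strip() for f in line.split("|")]
        # Skip the header, the separator row, and anything malformed.
        if len(fields) < 13 or fields[2] not in ("->", "<-"):
            continue
        if fields[2] != "<-":
            continue  # outgoing message, not a request to qmaster
        parts = fields[3].split("/")
        # Fall back to the whole remote string if it has another shape.
        name = parts[1] if len(parts) == 3 else fields[3]
        counts[name] += 1
    return counts


# Three rows in the format shown in the qping output.
sample = [
    "14:07:04.001733|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1|bin|nak|TAG_REPORT_REQUEST|33937|0|197|14:07:04.001558|00:00.000174|10|",
    "14:07:04.003214|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1|bin|nak|TAG_ACK_REQUEST|9088|0|20|14:07:04.003202|00:00.000011|10|",
    "14:07:04.012578|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1|bin|nak|TAG_GDI_REQUEST|9089|0|309|14:07:04.012558|00:00.000019|10|",
]
print(count_incoming_by_component(sample).most_common())
```

Running this over the period in which the read messages pile up should show which component dominates the incoming traffic.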

Regards,
Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
