[GE users] Failed receiving gdi request

Andreas.Haas at Sun.COM
Thu Aug 31 13:15:42 BST 2006


Hi Felix,

On Wed, 30 Aug 2006, Kogan, Felix wrote:

> What I would really like to know: does receiving this error message
> ("failed receiving gdi request") _always_ mean that the requested action
> (job submission, qmod action, qstat request) didn't go through? Is a
> situation when the job has actually been submitted by qsub (i.e. landed
> in the pending queue), but qsub reported gdi request error, possible? If

It is possible. The "failed receiving gdi request" message always
means that the client waited for a reply to a request that had been
accepted by qmaster, but the reply was not delivered within a
timeout of roughly 10 minutes.

> it is not possible, a reasonable workaround would be to retry the
> requests (e.g. qsub command execution) until succeeded. If it is
> possible, the retrying approach wouldn't work, as we inevitably will end
> up with duplicate submissions.

It is true that this problem exists.
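
Just as an illustration of how one could at least reduce the risk: a
retry wrapper can look for the job before resubmitting, so that a qsub
failure caused by the reply timeout does not automatically turn into a
duplicate submission. This is only a sketch; the unique job name
marker, the script name and the parsing of qstat -r output are my
assumptions and would have to be adapted to your site:

    #!/bin/sh
    # Hypothetical retry wrapper around qsub. The unique marker in the
    # job name lets us check whether a seemingly failed submission
    # actually reached qmaster before we submit it again.
    MARKER="myjob_$$_`date '+%Y%m%d%H%M%S'`"
    MAX_TRIES=5

    try=1
    while [ $try -le $MAX_TRIES ]; do
        if qsub -N "$MARKER" job_script.sh; then
            echo "submitted as $MARKER"
            exit 0
        fi
        # qsub failed, e.g. with "failed receiving gdi request", but the
        # request may still have been accepted by qmaster. Look for the
        # job before retrying to avoid a duplicate submission.
        sleep 30
        if qstat -r -u "$USER" 2>/dev/null | \
               grep "Full jobname: *$MARKER" >/dev/null; then
            echo "job $MARKER is already known to qmaster, not resubmitting"
            exit 0
        fi
        try=`expr $try + 1`
    done
    echo "giving up after $MAX_TRIES attempts" >&2
    exit 1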

The problem is that overloaded qmasters can never be ruled out
entirely. That means it is necessary to improve the behaviour of
Grid Engine in such situations. In the concrete case you describe
above, the improvement would be that qsub(1) does not fail in some
unspecified way, but instead returns an error indication telling you
that qmaster is overloaded and that the submission failed for that
reason. I believe this could be achieved by enhancing qmaster so
that incoming requests are no longer accepted once a certain
threshold of pending messages in the read buffer is exceeded. If the
client were then notified about the rejected request, this would be
the situation where error code 25 would be returned by qsub(1), very
much as in the case when the sge_conf(5) max_jobs limit is reached.
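
With such a change a submit wrapper could tell a clean rejection apart
from the ambiguous timeout case. As a sketch only (error code 25 for
an overloaded qmaster is just the proposed behaviour; today you would
always have to fall back to a duplicate check like the one sketched
above, and check_for_duplicate_first is a hypothetical helper doing
exactly that):

    qsub -N "$MARKER" job_script.sh
    rc=$?
    if [ $rc -eq 0 ]; then
        :                            # accepted, nothing to do
    elif [ $rc -eq 25 ]; then
        sleep 60                     # definitely rejected, plain retry is safe
    else
        check_for_duplicate_first    # ambiguous failure, e.g. reply timeout
    fi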

> I've run some profiling on our SGE installation (v6 update 8) that also
> produced gdi request errors recently. It appears that these errors
> always correspond in time to very long scheduler runs (normal - about 2
> seconds, long - about 60 seconds). How is successful submission related
> to a successful scheduling? I thought these are two quite independent
> processes. Or does it mean that qmaster machine was just overloaded at
> that time? It is strange - we have quite peppy dual Opteron-based
> Solaris 10 machines running as dedicated qmasters and I have never seen
> them particularly loaded...

I agree with you: long scheduling times and GDI request errors
really should not be related, but unfortunately it would take me
much more time to fully understand your concrete case. Yet if the
GDI request errors really are a consequence of overly long
scheduling times, it is quite possible that you will see a decisive
improvement with 6.0u9, as it will bring fixes for

    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2093
    http://gridengine.sunsource.net/issues/show_bug.cgi?id=2094

OTOH, if you wish to trace into this, I recommend that you use
qping -dump in the meantime:

    # qping -dump gridware $SGE_QMASTER_PORT qmaster 1
    open connection to "es-ergb01-01/qmaster/1" ... no error happened
               time|local                 |d.|remote                        |format|ack type|               msg tag|msg id|msg rid|msg len|       msg time|   msg ltime|con count|
    ---------------|----------------------|--|------------------------------|------|--------|----------------------|------|-------|-------|---------------|------------|---------|
    14:07:03.632328|es-ergb01-01/qmaster/1|->|es-ergb01-01/debug_client/475 |   crm|     nak|                     0|     0|      0|    235|14:07:03.632327|00:00.000000|       10|
    14:07:04.001733|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|    TAG_REPORT_REQUEST| 33937|      0|    197|14:07:04.001558|00:00.000174|       10|
    14:07:04.003214|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_ACK_REQUEST|  9088|      0|     20|14:07:04.003202|00:00.000011|       10|
    14:07:04.012578|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_GDI_REQUEST|  9089|      0|    309|14:07:04.012558|00:00.000019|       10|
    14:07:04.013149|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_GDI_REQUEST| 33938|   9089|    107|14:07:04.013045|00:00.000103|       10|
    14:07:04.014032|es-ergb01-01/qmaster/1|<-|es-ergb01-01/schedd/1         |   bin|     ack|       TAG_GDI_REQUEST|  9090|      0|    801|14:07:04.014025|00:00.000007|       10|
    14:07:04.014177|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |    am|     nak|                     0| 33939|      0|     38|14:07:04.014059|00:00.000118|       10|
    14:07:04.014650|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|       TAG_GDI_REQUEST| 33940|   9090|    205|14:07:04.014489|00:00.000160|       10|
    14:07:05.001541|es-ergb01-01/qmaster/1|->|es-ergb01-01/schedd/1         |   bin|     nak|    TAG_REPORT_REQUEST| 33941|      0|    197|14:07:05.001347|00:00.000192|       10|

It allows you to track all incoming and outgoing messages of qmaster.

If you can reproduce your error condition, qping -dump would allow
you to figure out which component (client/execd/schedd?) is actually
sending so many requests to qmaster in the phase where the read
messages grow so rapidly.
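
To see where the traffic comes from you can also aggregate the dump.
The columns are separated by '|' (field positions as in the output
above); the file name is of course just an example:

    # let qping -dump run while the problem occurs, stop it with CTRL-C
    qping -dump gridware $SGE_QMASTER_PORT qmaster 1 > /tmp/qmaster-msgs.txt

    # afterwards count incoming messages per sender and message tag
    awk -F'|' '$3 == "<-" { cnt[$4 " " $7]++ }
               END { for (k in cnt) print cnt[k], k }' /tmp/qmaster-msgs.txt | sort -rn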

Regards,
Andreas
