[GE issues] [Issue 3235] modifying the global configuration make sge_qmaster unresponsive in a huge cluster

joga Joachim.Gabler at sun.com
Fri Jan 29 08:40:20 GMT 2010


User joga changed the following:

                What    |Old value                 |New value
             Assigned to|ernst                     |joga
                 Summary| modifying the global conf|modifying the global confi
                        |iguration make sge_qmaster|guration make sge_qmaster 
                        | unresponsive in a huge cl|unresponsive in a huge clu
                        |uster                     |ster
        Target milestone|---                       |6.2u6
                 Version|6.2                       |6.0

------- Additional comments from joga at sunsource.net Fri Jan 29 00:40:18 -0800 2010 -------

Changing the configuration (qconf -mconf) increments the configuration version number.
This triggers an update of all execution daemons via the following protocol:

-> execd reports the config version it is using (in every load report interval)
-> qmaster recognizes that execd is using an old version
-> qmaster tells execd to update configuration
-> execd updates its configuration via GDI GET request
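
The four steps above can be sketched as a toy model. This is only an illustration of the version handshake, not qmaster or execd code: all types and functions (toy_qmaster, toy_execd, toy_load_report_interval, ...) are hypothetical names made up for this sketch, and the real GDI transport is replaced by plain function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real Grid Engine data structures. */
typedef struct {
    unsigned long config_version;  /* incremented on every config change */
} toy_qmaster;

typedef struct {
    unsigned long config_version;  /* version this execd currently uses */
    unsigned long get_requests;    /* config fetches issued so far */
} toy_execd;

/* A configuration change (think: qconf -mconf) increments the version. */
void toy_modify_config(toy_qmaster *qm)
{
    assert(qm != NULL);
    qm->config_version++;
}

/* execd reports its version; qmaster returns true if execd must update. */
bool toy_handle_load_report(const toy_qmaster *qm, const toy_execd *ed)
{
    assert(qm != NULL && ed != NULL);
    return ed->config_version != qm->config_version;
}

/* execd fetches the configuration (stands in for the GDI GET request). */
void toy_fetch_config(const toy_qmaster *qm, toy_execd *ed)
{
    assert(qm != NULL && ed != NULL);
    ed->config_version = qm->config_version;
    ed->get_requests++;
}

/* One load report interval: every execd reports once.
 * Returns the number of config fetches triggered in this interval. */
size_t toy_load_report_interval(const toy_qmaster *qm, toy_execd *execds, size_t n)
{
    size_t fetches = 0;
    for (size_t i = 0; i < n; i++) {
        if (toy_handle_load_report(qm, &execds[i])) {
            toy_fetch_config(qm, &execds[i]);
            fetches++;
        }
    }
    return fetches;
}
```

In this model, a single call to toy_modify_config() makes the next toy_load_report_interval() return n, i.e. one fetch per execd within the same interval, and 0 in the interval after that.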

Within one load report interval, all execds try to update their configuration.
Because the GDI GET CONF request is implemented suboptimally in qmaster,
these operations are expensive.
This leads to high load on qmaster and to GDI GET requests timing out;
the execds then reconnect to qmaster and issue GDI GET CONF requests again, and so on.

In big clusters this leaves qmaster unresponsive for a long time, possibly indefinitely.

In this scenario there are two expensive operations:
1. The GDI GET CONF request: execd requests the global config and its local config. The code processing the request first makes a copy of
the whole config list (some 3500 objects), then selects (copies) the two requested config objects, and finally frees the copied list:
         lList *conf = NULL;
         /* copies the whole config list */
         conf = sge_get_configuration();
         /* selects (copies) the requested config objects out of that copy */
         task->data_list = lSelectHashPack("", conf, task->condition, task->enumeration, false, NULL);
         /* the copied list is freed afterwards (not shown here) */
Instead, it should just select/copy the two requested objects.

2. When execds time out and reconnect, they resend their static load values, which leads to an unnecessary spooling operation (see IZ 3236).
This generates additional load and significant delays on sge_qmaster, especially with classic spooling on a shared filesystem.
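
For the first operation, the suggested direction (select/copy only the two requested objects, without copying the whole list first) could look roughly like the sketch below. It deliberately does not use the CULL list API: toy_conf and toy_select_two are hypothetical names, and the plain array stands in for the master config list. The entry name "global" for the global configuration is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one configuration object. */
typedef struct {
    char host[64];          /* "global" or an execution host name */
    unsigned long version;  /* payload placeholder */
} toy_conf;

/* Walk the master list once and copy only the entries named "global"
 * and `host` into `out` (capacity 2). The master list itself is never
 * duplicated. Returns the number of entries copied (0, 1 or 2). */
size_t toy_select_two(const toy_conf *confs, size_t n,
                      const char *host, toy_conf out[2])
{
    size_t found = 0;

    assert(confs != NULL && host != NULL && out != NULL);
    for (size_t i = 0; i < n && found < 2; i++) {
        if (strcmp(confs[i].host, "global") == 0 ||
            strcmp(confs[i].host, host) == 0) {
            out[found++] = confs[i];  /* copy just this one object */
        }
    }
    return found;
}
```

With some 3500 config objects this copies at most two objects per request instead of all of them. In the real code the master list would presumably have to be protected against concurrent modification while it is read; that part is left out here.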

