[GE users] 6.2 qping and deadlocks

Christian Reissmann Christian.Reissmann at Sun.COM
Fri Oct 31 10:49:39 GMT 2008



Hi Justin,

regarding the qping -info output, I have to say that this is definitely an
issue. The info line reports warnings and errors when a thread does
not update an internal counter. It seems the threads do not update it
when there is nothing for them to do.

The counter was introduced to detect whether a thread is deadlocked or
is under heavy load in its working loop.
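
As a rough illustration of the idea (this is not the qmaster code): each
thread records a timestamp on every pass through its working loop, and a
monitor classifies the thread by the age of that timestamp. The class
name, both thresholds and the letter R for a healthy thread are
assumptions made up for this sketch.

```python
import time

# Hypothetical thresholds in seconds; the real qmaster values are not given here.
WARNING_AFTER = 1.0
ERROR_AFTER = 10.0


class ThreadHeartbeat:
    """Tracks when a thread last reported progress (illustrative only)."""

    def __init__(self, name, now=None):
        self.name = name
        self.last_update = time.monotonic() if now is None else now

    def update(self, now=None):
        # Called by the thread once per pass through its working loop.
        self.last_update = time.monotonic() if now is None else now

    def state(self, now=None):
        # Returns (letter, seconds since the last update).
        now = time.monotonic() if now is None else now
        age = now - self.last_update
        if age >= ERROR_AFTER:
            return "E", age
        if age >= WARNING_AFTER:
            return "W", age
        return "R", age
```

A thread that blocks while waiting for work never calls update(), so in
a scheme like this an idle thread ages into W and then E exactly like a
deadlocked one. That would match what the info line shows.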

Here is the qping -info output on my GE 6.2u1 release candidate cluster:

 > qping -info hostfoo $SGE_QMASTER_PORT qmaster 1
10/31/2008 11:19:11:
SIRM version:             0.1
SIRM message id:          1
start time:               10/31/2008 11:18:43 (1225448323)
run time [s]:             28
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 6
status:                   2
info:                     MAIN: W (28.46) | signaler000: W (28.35) | 
event_master000: W (0.86) | timer000: W (3.98) | worker000: W (3.53) | 
worker001: W (0.86) | listener000: W (0.86) | listener001: W (3.53) | 
jvm000: W (28.34) | scheduler000: W (8.98) | ERROR
Monitor:                  disabled
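
To compare such info lines between hosts or releases it can help to
split them into per-thread values. A minimal sketch, assuming the format
is always "name: LETTER (seconds)" separated by "|" with the overall
status as the last field; the function name parse_info is made up here,
and the format is inferred only from the output above.

```python
import re

# Matches entries such as "worker000: W (3.53)".
ENTRY = re.compile(r"(\w+):\s+([A-Z])\s+\(([\d.]+)\)")


def parse_info(info):
    """Return (threads, overall) from a qping info string."""
    text = " ".join(info.split())  # undo mail/terminal line wrapping
    threads = {m.group(1): (m.group(2), float(m.group(3)))
               for m in ENTRY.finditer(text)}
    # Assumption: the last "|"-separated field is the overall status.
    overall = text.rsplit("|", 1)[-1].strip()
    return threads, overall


sample = """MAIN: W (28.46) | signaler000: W (28.35) |
event_master000: W (0.86) | timer000: W (3.98) | worker000: W (3.53) |
worker001: W (0.86) | listener000: W (0.86) | listener001: W (3.53) |
jvm000: W (28.34) | scheduler000: W (8.98) | ERROR"""

threads, overall = parse_info(sample)
print(overall, threads["MAIN"])
```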


 > qstat -help | head -1
GE 6.2 (build 20081029)

We have a new bugfix in the changelog of V62_BRANCH:

JG-2008-10-30-0: Bugfix:      fixed a qmaster deadlock on linux
                               fixed minor memory leaks
                  Review:      RD

Perhaps this is the root cause of the worker Warning/Error states.

The info output also seems to be incorrect for the MAIN thread. I think
the main thread is waiting for the other threads, so it cannot update
its counter.

Please file an issue regarding this info output problem. It might be a
good idea to review the code that generates the info message string.

The DB_LOCK_DEADLOCK log entries might be a result of the deadlock problem
fixed with changelog entry JG-2008-10-30-0. It might also be a problem with
the RPC server. Is the spooling directory where the RPC server stores
its data on an NFS3-mounted file system? It should be either a local
file system or NFS4.
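
One way to check this on Linux is to look up the spooling directory in
/proc/mounts. A minimal sketch, assuming the usual /proc/mounts layout
(device, mount point, type, options); the function name fs_type and the
example path are hypothetical. NFSv4 mounts normally show up with type
"nfs4", while NFSv2/v3 mounts show up as "nfs".

```python
import os


def fs_type(path, mounts_text):
    """Return (mount point, fs type) of the mount that contains path."""
    path = os.path.abspath(path)
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; on a tie the later
        # entry wins, since it is mounted on top of the earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fstype)
    return best


# Usage on the host running the RPC server (path is only an example):
#   with open("/proc/mounts") as f:
#       print(fs_type("/path/to/spooling/dir", f.read()))
```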

Many thanks for reporting your observations,

Christian


On 10/30/08 15:33, Justin Ottley wrote:
> Hey all,
> Im debugging a new 6.2 install, and noticed that the qping 'info' output 
> looks like this:
> 
> info:   MAIN: E (60015.99) | signaler000: E (60015.58) | 
> event_master000: E (0.63) | timer000: E (0.63) | worker000: E (59431.99) 
> | worker001: E (59446.31) | listener000: E (3.53) | listener001: E 
> (3.53) | scheduler000: E (0.63) | ERROR
> 
> pretty much all the time, even under the following conditions:
> - no visible problem (jobs get queued and run, commands like qstat, 
> qconf, etc work, qmon is functional)
> - a clean, minimal install of 6.2 using berkeley db RPC (no execd, no 
> shadowd, no arco)
> - a clean, minimal install of 6.2 using classic spooling (no execd, no 
> shadowd, no arco)
> 
> anyone know whether this behavior is normal or not?
> the output of qping on a 6.1 install on the same box shows no such 
> errors (i acknowledge the qping info format is different in 6.1, but 
> shows OK)
> 
> In addition, the problem im actually having is my 6.2 / berkeley db RPC 
> install seems to suffer from relatively frequent deadlocks, with errors 
> of the form:
> 
> |E|error writing object with key "JOB:     259" into berkeley database: 
> (-30995) DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
> 
> This is after restarts of the RPC server and qmaster.
> Ive ran thousands of jobs in a 6.1 install and never saw this error..
> 
> Arch: lx24-x86
> Fedora Core 4, Fedora Core 6
> 
> thanks for any help/info/advice,
> -justin
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 

-- 
Sun Microsystems GmbH             Christian Reissmann
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 112
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: Christian.Reissmann at sun.com
                                   http://www.sun.com/gridengine
Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering




