[GE users] 6.2 qping and deadlocks

Justin Ottley ottley at coredp.com
Fri Oct 31 15:17:13 GMT 2008


Hey Christian,

Thanks for the reply; comments inline.

Christian Reissmann wrote:
> Hi justin,
>
> regarding the qping -info output I have to say that this is definitely 
> an issue. The info line reports warnings and errors when a thread 
> does not update an internal counter. The update seems not to happen 
> when the threads have nothing to do.
>
> The counter was introduced to detect if a thread is deadlocked or has
> a high working load in its working loop.
>
> Here is the qping -info output on my GE 6.2u1 release candidate cluster:
>
> > qping -info hostfoo $SGE_QMASTER_PORT qmaster 1
> 10/31/2008 11:19:11:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               10/31/2008 11:18:43 (1225448323)
> run time [s]:             28
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 6
> status:                   2
> info:                     MAIN: W (28.46) | signaler000: W (28.35) | 
> event_master000: W (0.86) | timer000: W (3.98) | worker000: W (3.53) | 
> worker001: W (0.86) | listener000: W (0.86) | listener001: W (3.53) | 
> jvm000: W (28.34) | scheduler000: W (8.98) | ERROR
> Monitor:                  disabled
>
>
> > qstat -help | head -1
> GE 6.2 (build 20081029)
>
> We have a new bugfix in the changelog of V62_BRANCH:
>
> JG-2008-10-30-0: Bugfix:      fixed a qmaster deadlock on linux
>                               fixed minor memory leaks
>                  Review:      RD
>
> Perhaps this is the root cause of the worker Warning/Error states.
>
> The info output also seems to be incorrect for the MAIN thread. I think
> the main thread is waiting for the other threads, so it cannot update
> its counter.
>
> Please file an issue regarding this info output problem. It might be a
> good idea to validate the code which is generating the info message 
> string.
Ahh, thanks for that bit of info. I'll file an issue.
I'll also include that the qping output for any execd always shows 
a warning in the info line (likely related?):

: qping -info exechost $SGE_EXECD_PORT execd 1

10/31/2008 11:13:21:
SIRM version:             0.1
SIRM message id:          1
start time:               10/31/2008 11:04:49 (1225465489)
run time [s]:             512
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 2
status:                   1
info:                     sge_execd_process_messages: W (509.61) | WARNING
malloc:                   arena(135168) |ordblks(26) | smblks(3) | 
hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(112) | uordblks(77240) | 
fordblks(57928) | keepcost(51464)
Monitor:                  disabled
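
For comparing these info lines across hosts, a small parser helps. This is only a sketch based on the output shown in this thread: it assumes the info value is a "|"-separated list of "name: STATE (number)" entries followed by one summary word, and that wrapped lines have been joined and any quote markers removed first. The names parse_info and ThreadStatus are made up for illustration; they are not part of Grid Engine. The number is presumably the time in seconds since the thread last updated its counter, but that is my reading, not something confirmed here.

```python
import re
from typing import NamedTuple


class ThreadStatus(NamedTuple):
    name: str       # thread name as printed by qping, e.g. "worker000"
    state: str      # single-letter state as printed, e.g. "W" or "E"
    seconds: float  # number in parentheses (assumed: seconds since last counter update)


# One entry looks like "worker000: W (3.53)" in the output above.
_ENTRY = re.compile(r"^(?P<name>\S+):\s+(?P<state>[A-Z])\s+\((?P<secs>\d+(?:\.\d+)?)\)$")


def parse_info(info: str):
    """Split a qping 'info:' value into per-thread entries and the summary word."""
    text = " ".join(info.split())          # collapse wrapped lines and padding
    if text.startswith("info:"):
        text = text[len("info:"):].strip()
    threads = []
    summary = None
    for part in (p.strip() for p in text.split("|")):
        match = _ENTRY.match(part)
        if match:
            threads.append(
                ThreadStatus(match["name"], match["state"], float(match["secs"]))
            )
        elif part:
            summary = part                 # trailing word such as "WARNING" or "ERROR"
    return threads, summary


sample = "info:                     sge_execd_process_messages: W (509.61) | WARNING"
threads, summary = parse_info(sample)
for t in threads:
    print(f"{t.name:30s} {t.state} {t.seconds:10.2f}")
print("overall:", summary)
```

Sorting the returned entries by the seconds field makes it easy to see which threads have gone longest without an update.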

>
> The DB_LOCK_DEADLOCK log entries might be a result of the deadlock problem
> fixed with Changelog JG-2008-10-30-0. It might also be a problem with the
> RPC server. Is the spooling directory where the RPC server stores
> its data a mounted NFS3 file system? It should be either a local 
> filesystem or NFS4.
The spooling directory is on a local filesystem. Hopefully it's related 
to Changelog JG-2008-10-30-0.
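
To see whether the deadlocks actually stop after that fix, one option is to count the DB_LOCK_DEADLOCK entries in the qmaster messages file before and after upgrading. A rough sketch, assuming each entry sits on one line and contains the same text as the error quoted further down in this thread; count_deadlocks is a made-up helper name, and the file path is whatever your installation uses.

```python
import re
from collections import Counter

# Matches the message text seen in this thread, e.g.
# error writing object with key "JOB:     259" into berkeley database: (-30995) DB_LOCK_DEADLOCK: ...
_DEADLOCK = re.compile(
    r'error writing object with key "(?P<key>[^"]*)" into berkeley database: '
    r"\((?P<code>-?\d+)\) DB_LOCK_DEADLOCK"
)


def count_deadlocks(lines):
    """Count DB_LOCK_DEADLOCK write errors per object type (the part of the key before ':')."""
    counts = Counter()
    for line in lines:
        match = _DEADLOCK.search(line)
        if match:
            object_type = match["key"].split(":", 1)[0].strip()
            counts[object_type] += 1
    return counts


sample = (
    '|E|error writing object with key "JOB:     259" into berkeley '
    "database: (-30995) DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock"
)
print(count_deadlocks([sample]))
```

An open file object can be passed directly as the lines argument, since iterating over it yields one line at a time.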
>
> Many thanks for reporting your observations,
>
> Christian
>
>
> On 10/30/08 15:33, Justin Ottley wrote:
>> Hey all,
>> I'm debugging a new 6.2 install, and noticed that the qping 'info' 
>> output looks like this:
>>
>> info:   MAIN: E (60015.99) | signaler000: E (60015.58) | 
>> event_master000: E (0.63) | timer000: E (0.63) | worker000: E 
>> (59431.99) | worker001: E (59446.31) | listener000: E (3.53) | 
>> listener001: E (3.53) | scheduler000: E (0.63) | ERROR
>>
>> pretty much all the time, even under the following conditions:
>> - no visible problem (jobs get queued and run, commands like qstat, 
>> qconf, etc work, qmon is functional)
>> - a clean, minimal install of 6.2 using berkeley db RPC (no execd, no 
>> shadowd, no arco)
>> - a clean, minimal install of 6.2 using classic spooling (no execd, 
>> no shadowd, no arco)
>>
>> Does anyone know whether this behavior is normal or not?
>> The output of qping on a 6.1 install on the same box shows no such 
>> errors (I acknowledge the qping info format is different in 6.1, but 
>> it shows OK).
>>
>> In addition, the problem I'm actually having is that my 6.2 / berkeley db 
>> RPC install seems to suffer from relatively frequent deadlocks, with 
>> errors of the form:
>>
>> |E|error writing object with key "JOB:     259" into berkeley 
>> database: (-30995) DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>>
>> This is after restarts of the RPC server and qmaster.
>> I've run thousands of jobs in a 6.1 install and never saw this error.
>>
>> Arch: lx24-x86
>> Fedora Core 4, Fedora Core 6
>>
>> thanks for any help/info/advice,
>> -justin
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>

