[GE users] Failed receiving gdi request

Kogan, Felix Felix-Kogan at deshaw.com
Wed Aug 30 18:31:04 BST 2006


What I would really like to know: does receiving this error message
("failed receiving gdi request") _always_ mean that the requested action
(job submission, qmod action, qstat request) didn't go through? Is a
situation when the job has actually been submitted by qsub (i.e. landed
in the pending queue), but qsub reported gdi request error, possible? If
it is not possible, a reasonable workaround would be to retry the
requests (e.g. qsub command execution) until succeeded. If it is
possible, the retrying approach wouldn't work, as we inevitably will end
up with duplicate submissions.

I've run some profiling on our SGE installation (v6 update 8) that also
produced gdi request errors recently. It appears that these errors
always correspond in time to very long scheduler runs (normal - about 2
seconds, long - about 60 seconds). How is successful submission related
to a successful scheduling? I thought these are two quite independent
processes. Or does it mean that qmaster machine was just overloaded at
that time? It is strange - we have quite peppy dual Opteron-based
Solaris 10 machines running as dedicated qmasters and I have never seen
them particularly loaded...

Normal qping output:

08/30/2006 07:16:30:
SIRM version:             0.1
SIRM message id:          2116
start time:               08/29/2006 09:56:51 (1156859811)
run time [s]:             76779
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 605
status:                   0
info:                     TET: R (2.23) | EDT: R (0.08) | SIGT: R
(76349.04) | MT(1): R (0.21) | MT(2): R (0.34) | OK
Monitor:
08/30/2006 07:16:11 | TET: runs: 0.67r/s (pending: 15.30 executed:
0.60/s) out:
0.00m/s APT: 0.1246s/m idle: 91.69% wait: 1.82% time: 29.99s
08/30/2006 07:16:22 | EDT: runs: 21.40r/s (clients: 8.00 mod: 0.10/s
ack: 0.33/s blocked: 0.00 busy: 0.00 | events: 41.14/s added: 41.11/s
skipt: 0.03/s) out: 0.33m/s APT: 0.0006s/m idle: 98.81% wait: 0.00%
time: 30.00s
08/29/2006 10:04:01 | SIGT: no monitoring data available
08/30/2006 07:16:04 | MT(1): runs: 10.34r/s (execd
(l:9.00,j:9.27,c:9.00,p:9.00,a:0.03)/s GDI
(a:0.10,g:0.67,m:0.03,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.27/s)
out: 0.93m/s APT: 0.0321s/m idle: 66.87% wait: 13.82% time: 29.99s
08/30/2006 07:16:04 | MT(2): runs: 8.62r/s (execd
(l:7.30,j:7.54,c:7.30,p:7.30,a:0.00)/s GDI
(a:0.17,g:0.71,m:0.10,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.20/s)
out: 1.14m/s APT: 0.0391s/m idle: 66.28% wait: 13.48% time: 29.71s

Overloaded (gdi errors time) qping output:

08/30/2006 07:22:30:
SIRM version:             0.1
SIRM message id:          2128
start time:               08/29/2006 09:56:51 (1156859811)
run time [s]:             77139
messages in read buffer:  1442
messages in write buffer: 0
nr. of connected clients: 645
status:                   2
info:                     TET: W (30.52) | EDT: W (0.52) | SIGT: W
(76709.18) |
MT(1): W (37.16) | MT(2): W (36.68) | ERROR
Monitor:
08/30/2006 07:21:44 | TET: runs: 0.44r/s (pending: 56.00 executed:
0.31/s) out:
0.00m/s APT: 0.0540s/m idle: 97.60% wait: 0.64% time: 29.29s
08/30/2006 07:22:22 | EDT: runs: 2.07r/s (clients: 7.00 mod: 0.00/s ack:
0.00/s
blocked: 0.00 busy: 1.00 | events: 3.23/s added: 3.17/s skipt: 0.07/s)
out: 0.20m/s APT: 0.0005s/m idle: 99.90% wait: 0.00% time: 30.01s
08/29/2006 10:04:01 | SIGT: no monitoring data available
08/30/2006 07:21:44 | MT(1): runs: 9.70r/s (execd
(l:8.09,j:8.99,c:8.09,p:8.09,a:0.00)/s GDI
(a:0.30,g:0.67,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.07/s)
out: 1.40m/s APT: 0.1031s/m idle: 0.03% wait: 46.19% time: 29.91s
08/30/2006 07:21:46 | MT(2): runs: 9.41r/s (execd
(l:8.03,j:8.90,c:8.03,p:8.03,a:0.00)/s GDI
(a:0.10,g:0.58,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.16/s)
out: 1.10m/s APT: 0.1062s/m idle: 0.03% wait: 46.64% time: 31.02s


08/30/2006 07:23:00:
SIRM version:             0.1
SIRM message id:          2129
start time:               08/29/2006 09:56:51 (1156859811)
run time [s]:             77169
messages in read buffer:  1925
messages in write buffer: 0
nr. of connected clients: 658
status:                   2
info:                     TET: W (60.53) | EDT: W (0.53) | SIGT: W
(76739.19) |
MT(1): W (67.17) | MT(2): W (66.69) | ERROR
Monitor:
08/30/2006 07:21:44 | TET: runs: 0.44r/s (pending: 56.00 executed:
0.31/s) out:
0.00m/s APT: 0.0540s/m idle: 97.60% wait: 0.64% time: 29.29s
08/30/2006 07:22:52 | EDT: runs: 0.97r/s (clients: 7.00 mod: 0.00/s ack:
0.00/s
blocked: 0.00 busy: 1.00 | events: 0.00/s added: 0.00/s skipt: 0.00/s)
out: 0.20m/s APT: 0.0394s/m idle: 96.19% wait: 0.00% time: 30.00s
08/29/2006 10:04:01 | SIGT: no monitoring data available
08/30/2006 07:21:44 | MT(1): runs: 9.70r/s (execd
(l:8.09,j:8.99,c:8.09,p:8.09,a:0.00)/s GDI
(a:0.30,g:0.67,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.07/s)
out: 1.40m/s APT: 0.1031s/m idle: 0.03% wait: 46.19% time: 29.91s
08/30/2006 07:21:46 | MT(2): runs: 9.41r/s (execd
(l:8.03,j:8.90,c:8.03,p:8.03,a:0.00)/s GDI
(a:0.10,g:0.58,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.16/s)
out: 1.10m/s APT: 0.1062s/m idle: 0.03% wait: 46.64% time: 31.02s

Normal scheduler profile:

08/30/2006 07:15:50|schedd|desags1|P|PROF: sge_mirror processed 334
events in 0.010 s

08/30/2006 07:15:51|schedd|desags1|P|PROF: static urgency took 0.000 s
08/30/2006 07:15:51|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.040 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:15:51|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:15:51|schedd|desags1|P|PROF: normalizing job tickets took
0.000 s
08/30/2006 07:15:51|schedd|desags1|P|PROF: create active job orders:
0.170 s
08/30/2006 07:15:51|schedd|desags1|P|PROF: job-order calculation took
0.490 s
08/30/2006 07:15:51|schedd|desags1|P|PROF: job sorting took 0.000 s
08/30/2006 07:15:52|schedd|desags1|P|PROF: job dispatching took 0.370 s
(2 fast, 0 comp, 0 pe, 0 res)
08/30/2006 07:15:52|schedd|desags1|P|PROF: create pending job orders:
0.020 s
08/30/2006 07:15:52|schedd|desags1|P|PROF: scheduled in 1.620 (u 0.790 +
s 0.010 = 0.800): 0 sequential, 0 parallel, 715 orders, 639 H, 2305 Q,
3000 QA, 2 J(qw), 296 J(r), 3 J(s), 0 J(h), 0 J(e), 4 J(x), 299 J(all),
75 C, 3 ACL, 3 PE, 96 U, 1 D, 4 PRJ, 1 ST, 0 CKPT, 0 RU, 695 gMes, 4610
jMes, 712/1 pre-send, 0/0/0 pe-alg

08/30/2006 07:15:52|schedd|desags1|P|PROF: send orders and cleanup took:
0.100 (u 0.060,s 0.000) s
08/30/2006 07:15:53|schedd|desags1|P|PROF: schedd run took: 2.840 s
(init: 0.000 s, copy: 0.580 s, run:1.880, free: 0.380 s, jobs: 299,
categories: 15/63)
08/30/2006 07:16:00|schedd|desags1|P|PROF: sge_mirror processed 210
events in 0.010 s

08/30/2006 07:16:01|schedd|desags1|P|PROF: static urgency took 0.000 s
08/30/2006 07:16:01|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.030 s, pass 1: 0.010, pass2: 0.000, calc: 0.000 s
08/30/2006 07:16:01|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:16:01|schedd|desags1|P|PROF: normalizing job tickets took
0.000 s
08/30/2006 07:16:01|schedd|desags1|P|PROF: create active job orders:
0.180 s
08/30/2006 07:16:01|schedd|desags1|P|PROF: job-order calculation took
0.510 s
08/30/2006 07:16:01|schedd|desags1|P|PROF: job sorting took 0.000 s
08/30/2006 07:16:02|schedd|desags1|P|PROF: job dispatching took 0.360 s
(2 fast, 0 comp, 0 pe, 0 res)
08/30/2006 07:16:02|schedd|desags1|P|PROF: create pending job orders:
0.010 s
08/30/2006 07:16:02|schedd|desags1|P|PROF: scheduled in 1.670 (u 0.790 +
s 0.020 = 0.810): 0 sequential, 0 parallel, 710 orders, 639 H, 2305 Q,
3000 QA, 2 J(qw), 296 J(r), 3 J(s), 0 J(h), 0 J(e), 0 J(x), 299 J(all),
75 C, 3 ACL, 3 PE, 96 U, 1 D, 4 PRJ, 1 ST, 0 CKPT, 0 RU, 695 gMes, 4610
jMes, 707/1 pre-send, 0/0/0 pe-alg

08/30/2006 07:16:02|schedd|desags1|P|PROF: send orders and cleanup took:
0.110 (u 0.060,s 0.000) s
08/30/2006 07:16:03|schedd|desags1|P|PROF: schedd run took: 2.930 s
(init: 0.000 s, copy: 0.620 s, run:1.930, free: 0.380 s, jobs: 299,
categories: 15/63)

Overloaded scheduler profile:

08/30/2006 07:20:36|schedd|desags1|P|PROF(1): profiling summary:

08/30/2006 07:20:36|schedd|desags1|P|PROF(1): other          : wc =
193.950s, utime =      1.150s, stime =     19.990s, utilization =  11%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): packing        : wc =
653.030s, utime =    641.430s, stime =      0.570s, utilization =  98%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): eventclient    : wc =
6.280s, utime =      5.490s, stime =      0.340s, utilization =  93%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): mirror         : wc =
215.240s, utime =    210.300s, stime =      0.350s, utilization =  98%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): gdi            : wc =
47394.910s, utime =    134.010s, stime =     17.640s, utilization =   0%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): ht-resize      : wc =
0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): scheduler      : wc =
2760.580s, utime =   2495.590s, stime =     41.360s, utilization =  92%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): pending ticket : wc =
534.430s, utime =    329.040s, stime =     30.300s, utilization =  67%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): job sorting    : wc =
13.260s, utime =     12.140s, stime =      0.100s, utilization =  92%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): job dispatching: wc =
6240.540s, utime =   4987.190s, stime =    129.390s, utilization =  82%
08/30/2006 07:20:36|schedd|desags1|P|PROF(1): send orders    : wc =
165.040s, utime =    160.170s, stime =      0.190s, utilization =  97%
08/30/2006 07:20:37|schedd|desags1|P|PROF(1): scheduler event: wc =
0.200s, utime =      0.150s, stime =      0.120s, utilization = 135%
08/30/2006 07:20:37|schedd|desags1|P|PROF(1): copy lists     : wc =
6477.260s, utime =   6289.690s, stime =     15.260s, utilization =  97%
08/30/2006 07:20:37|schedd|desags1|P|PROF(1): total          : wc =
65647.790s, utime =  16091.970s, stime =    278.280s, utilization =  25%
08/30/2006 07:20:58|schedd|desags1|P|PROF: sge_mirror processed 0 events
in 0.000 s

08/30/2006 07:20:59|schedd|desags1|P|PROF: static urgency took 0.000 s
08/30/2006 07:20:59|schedd|desags1|P|PROF: job ticket calculation: init:
0.020 s, pass 0: 0.040 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:20:59|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.000 s, pass 1: 0.010, pass2: 0.000, calc: 0.000 s
08/30/2006 07:20:59|schedd|desags1|P|PROF: normalizing job tickets took
0.000 s
08/30/2006 07:20:59|schedd|desags1|P|PROF: create active job orders:
0.170 s
08/30/2006 07:21:00|schedd|desags1|P|PROF: job-order calculation took
0.490 s
08/30/2006 07:21:00|schedd|desags1|P|PROF: job sorting took 0.000 s
08/30/2006 07:21:15|schedd|desags1|P|PROF: job dispatching took 15.630 s
(86 fast, 0 comp, 0 pe, 0 res)
08/30/2006 07:21:15|schedd|desags1|P|PROF: create pending job orders:
0.010 s
08/30/2006 07:21:16|schedd|desags1|P|PROF: scheduled in 16.930 (u 2.640
+ s 0.220 = 2.860): 84 sequential, 0 parallel, 856 orders, 639 H, 2233
Q, 3000 QA, 2 J(qw), 310 J(r), 1 J(s), 0 J(h), 0 J(e), 8 J(x), 313
J(all), 75 C, 3 ACL, 3 PE, 96
U, 1 D, 4 PRJ, 1 ST, 0 CKPT, 0 RU, 767 gMes, 4634 jMes, 769/1 pre-send,
0/0/0 pe-alg

08/30/2006 07:21:58|schedd|desags1|P|PROF: send orders and cleanup took:
42.570
(u 0.100,s 0.000) s
08/30/2006 07:21:59|schedd|desags1|P|PROF: schedd run took: 60.630 s
(init: 0.000 s, copy: 0.600 s, run:59.660, free: 0.370 s, jobs: 313,
categories: 15/64)

...

08/30/2006 07:23:06|schedd|desags1|P|PROF: sge_mirror processed 0 events
in 0.000 s

08/30/2006 07:23:07|schedd|desags1|P|PROF: static urgency took 0.000 s
08/30/2006 07:23:07|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.040 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:23:07|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:23:08|schedd|desags1|P|PROF: normalizing job tickets took
0.000 s
08/30/2006 07:23:08|schedd|desags1|P|PROF: create active job orders:
0.580 s
08/30/2006 07:23:08|schedd|desags1|P|PROF: job-order calculation took
0.900 s
08/30/2006 07:23:08|schedd|desags1|P|PROF: job sorting took 0.000 s
08/30/2006 07:23:09|schedd|desags1|P|PROF: job dispatching took 0.750 s
(6 fast, 0 comp, 0 pe, 0 res)
08/30/2006 07:23:09|schedd|desags1|P|PROF: create pending job orders:
0.010 s
08/30/2006 07:23:09|schedd|desags1|P|PROF: scheduled in 2.350 (u 0.840 +
s 0.030 = 0.870): 4 sequential, 0 parallel, 789 orders, 639 H, 2240 Q,
3000 QA, 2 J(qw), 310 J(r), 1 J(s), 0 J(h), 0 J(e), 6 J(x), 317 J(all),
75 C, 3 ACL, 3 PE, 96 U, 1 D, 4 PRJ, 1 ST, 0 CKPT, 0 RU, 760 gMes, 4488
jMes, 782/1 pre-send, 0/0/0 pe-alg

08/30/2006 07:23:51|schedd|desags1|P|PROF: send orders and cleanup took:
41.910
(u 0.100,s 0.010) s
08/30/2006 07:23:52|schedd|desags1|P|PROF: schedd run took: 45.360 s
(init: 0.000 s, copy: 0.590 s, run:44.420, free: 0.350 s, jobs: 317,
categories: 15/64)
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): profiling summary:

08/30/2006 07:23:52|schedd|desags1|P|PROF(1): other          : wc =
197.160s, utime =      1.160s, stime =     20.020s, utilization =  11%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): packing        : wc =
653.270s, utime =    641.710s, stime =      0.570s, utilization =  98%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): eventclient    : wc =
6.280s, utime =      5.490s, stime =      0.340s, utilization =  93%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): mirror         : wc =
215.350s, utime =    210.400s, stime =      0.350s, utilization =  98%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): gdi            : wc =
47562.790s, utime =    134.190s, stime =     17.660s, utilization =   0%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): ht-resize      : wc =
0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): scheduler      : wc =
2762.850s, utime =   2496.500s, stime =     41.390s, utilization =  92%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): pending ticket : wc =
536.110s, utime =    329.160s, stime =     30.320s, utilization =  67%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): job sorting    : wc =
13.260s, utime =     12.140s, stime =      0.100s, utilization =  92%
08/30/2006 07:23:52|schedd|desags1|P|PROF(1): job dispatching: wc =
6257.720s, utime =   4990.250s, stime =    129.610s, utilization =  82%
08/30/2006 07:23:53|schedd|desags1|P|PROF(1): send orders    : wc =
165.100s, utime =    160.230s, stime =      0.190s, utilization =  97%
08/30/2006 07:23:53|schedd|desags1|P|PROF(1): scheduler event: wc =
0.200s, utime =      0.150s, stime =      0.120s, utilization = 135%
08/30/2006 07:23:53|schedd|desags1|P|PROF(1): copy lists     : wc =
6480.630s, utime =   6292.570s, stime =     15.270s, utilization =  97%
08/30/2006 07:23:53|schedd|desags1|P|PROF(1): total          : wc =
65845.830s, utime =  16099.870s, stime =    278.610s, utilization =  25%
08/30/2006 07:23:53|schedd|desags1|P|PROF: sge_mirror processed 3179
events in 0.100 s
08/30/2006 07:23:54|schedd|desags1|P|PROF: static urgency took 0.000 s
08/30/2006 07:23:55|schedd|desags1|P|PROF: job ticket calculation: init:
0.020 s, pass 0: 0.040 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:23:55|schedd|desags1|P|PROF: job ticket calculation: init:
0.010 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
08/30/2006 07:23:55|schedd|desags1|P|PROF: normalizing job tickets took
0.000 s
08/30/2006 07:23:55|schedd|desags1|P|PROF: create active job orders:
0.170 s
08/30/2006 07:23:55|schedd|desags1|P|PROF: job-order calculation took
0.520 s
08/30/2006 07:23:55|schedd|desags1|P|PROF: job sorting took 0.000 s
08/30/2006 07:23:58|schedd|desags1|P|PROF: job dispatching took 2.670 s
(24 fast, 0 comp, 0 pe, 0 res)
08/30/2006 07:23:58|schedd|desags1|P|PROF: create pending job orders:
0.010 s
08/30/2006 07:23:58|schedd|desags1|P|PROF: scheduled in 3.890 (u 1.260 +
s 0.060 = 1.320): 22 sequential, 0 parallel, 807 orders, 639 H, 2292 Q,
3000 QA, 2 J(qw), 313 J(r), 1 J(s), 0 J(h), 0 J(e), 26 J(x), 335 J(all),
75 C, 3 ACL, 3 PE, 96
U, 1 D, 4 PRJ, 1 ST, 0 CKPT, 0 RU, 708 gMes, 4626 jMes, 782/1 pre-send,
0/0/0 pe-alg

08/30/2006 07:24:13|schedd|desags1|P|PROF: send orders and cleanup took:
14.790
(u 0.090,s 0.010) s
08/30/2006 07:24:13|schedd|desags1|P|PROF: schedd run took: 19.860 s
(init: 0.000 s, copy: 0.630 s, run:18.850, free: 0.380 s, jobs: 335,
categories: 15/65)

Other profiling information at around that time is also available if
necssary (I've turned all profiling up to figure out the problem).

Thanks,

Felix Kogan



-----Original Message-----
From: Thomas Neumann [mailto:Thomas.Neumann at exasol.com] 
Sent: Thursday, August 17, 2006 3:35 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Failed receiving gdi request


Hello !

Yesterday the problem with the qmaster occured even twice (~13:00 and 
~17:25). I have collected the data you asked for:

1) Data from qmaster host:

top - 17:25:12 up 122 days, 23:23,  1 user,  load average: 4.35, 5.72,
4.96
Tasks: 152 total,   1 running, 151 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.1% us,  2.3% sy,  0.2% ni, 93.3% id,  0.7% wa,  0.3% hi,
1.0% si
Mem:   3116188k total,  1423984k used,  1692204k free,   369328k buffers
Swap:  7823644k total,        4k used,  7823640k free,   460340k cached
[...]
There is some NFS activity (32 nfsd processes)



2) Cluster-Data:
70 machines in configuration.
Average time for ssh login on machines: 3 sec
Load:
    Short time:  min 0.00, max 6.5, average: 0.806
    Long time: min 0.00, max 7.00, average: 0.911

Processes:
    min: 53, max: 201, average: 86



3) qstat when first alarm was given (100 messages in read buffer, time 
17:26):

job-ID  prior   name       user         state submit/start at     
queue                          slots ja-task-ID
------------------------------------------------------------------------
-----------------------------------------
  66912 0.52067 startstop. sr           r     08/16/2006 16:55:39 
64Bit at cmo52.gelb.exasol.com       20
  66913 0.55624 startstop. sr           r     08/16/2006 16:55:39 
64Bit at cmo52.gelb.exasol.com       64
  66914 0.55624 startstop. sr           r     08/16/2006 16:55:39 
64Bit at cmo52.gelb.exasol.com       64
  66909 0.51744 startstop. sr           r     08/16/2006 16:55:24 
64Bit at cmo55.gelb.exasol.com       16
  66910 0.51744 startstop. sr           r     08/16/2006 16:55:24 
64Bit at cmo55.gelb.exasol.com       16
  66911 0.52067 startstop. sr           r     08/16/2006 16:55:24 
64Bit at cmo55.gelb.exasol.com       20
  66858 0.50500 INTERACTIV ct           r     08/16/2006 14:48:08 
64Bit at cmo56.gelb.exasol.com       16
  66713 0.52391 INTERACTIV tn           r     08/16/2006 13:01:38 
64Bit at cmw1.gelb.exasol.com        24
  66929 0.60500 package.13 bm           r     08/16/2006 17:22:25 
64Bit at cmw10.gelb.exasol.com       32
  66926 0.60500 package.12 bm           r     08/16/2006 17:05:54 
64Bit at cmw8.gelb.exasol.com        32
  66920 0.52067 startstop. sr           r     08/16/2006 17:02:39 
cmps at cmp10.gelb.exasol.com        20
  66922 0.55624 startstop. sr           r     08/16/2006 17:02:39 
cmps at cmp7.gelb.exasol.com         64
  66918 0.51744 startstop. sr           r     08/16/2006 17:02:24 
cns at cn15.gelb.exasol.com          16
  66921 0.55624 startstop. sr           r     08/16/2006 17:02:39 
cns at cn27.gelb.exasol.com          64
  66906 0.60450 package.12 bm           qw    08/16/2006 
16:46:02                                   16
  66928 0.60450 package.12 se            qw    08/16/2006 
17:17:26                                   16



4) Extract from jobcounting:

time   jobs   messages in read buffer
[... everthing looks normal - last start after hang up on 13:01 ]
17:01:00 14 0
17:02:00 14 0
17:03:01 22 0
17:04:01 22 0
17:05:02 26 0
17:06:04 25 0
17:07:05 23 0
17:08:06 23 0
17:09:07 22 8
17:10:08 21 0
17:11:08 21 0
17:12:09 21 0
17:13:10 21 0
17:14:10 21 0
17:15:11 20 0
17:16:12 20 0
17:17:13 19 0
17:18:13 18 12
17:19:14 19 10
17:20:15 19 43
17:21:15 19 64
17:22:16 17 16
17:23:17 17 15
17:24:17 16 6
17:25:18 16 122
17:26:18 16 246
17:27:19 16 413
17:28:19 16 606
17:29:20 16 851
17:30:21 15 1169
17:31:21 15 1303
17:32:22 13 1471
17:33:22 13 1566
17:34:23 13 1885
17:35:23 13 2262
17:36:24 13 2673
17:37:24 13 3066
17:38:25 13 3529
17:39:25 13 4034
17:40:25 12 4610
[...]
17:49:30 12 11250
[...]
17:54:32 12 14941
[... here we stopped the qmaster and restarted the whole system ]




5) Output of qping:

08/16/2006 17:26:42:
SIRM version:             0.1
SIRM message id:          1
start time:               08/16/2006 12:56:50 (1155725810)
run time [s]:             16192
messages in read buffer:  452
messages in write buffer: 0
nr. of connected clients: 81
status:                   0
info:                     TET: R (2.94) | EDT: R (0.00) | SIGT: R 
(16192.12) | MT(1): R (0.00) | MT(2): R (0.05) | OK
Monitor:
08/16/2006 17:26:30 | TET: runs: 0.40r/s (pending: 11.00 executed: 
0.40/s) out: 0.00m/s APT: 0.0129s/m idle: 99.48% wait: 0.44% time:
20.00s
08/16/2006 17:26:30 | EDT: runs: 39.35r/s (clients: 1.00 mod: 0.05/s 
ack: 0.05/s blocked: 0.00 busy: 0.59 | events: 39.10/s added: 39.10/s 
skipt: 0.00/s) out: 0.05m/s APT: 0.0003s/m idle: 98.82% wait: 0.08% 
time: 20.00s
08/16/2006 12:56:50 | SIGT: no monitoring data available
08/16/2006 17:26:33 | MT(1): runs: 16.14r/s (execd 
(l:1.05,j:15.84,c:1.05,p:1.05,a:0.00)/s GDI 
(a:0.10,g:1.64,m:0.05,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.05/s)

out: 15.74m/s APT: 0.0619s/m idle: 0.07% wait: 46.75% time: 20.07s
08/16/2006 17:26:33 | MT(2): runs: 15.98r/s (execd 
(l:1.15,j:15.68,c:1.15,p:1.15,a:0.00)/s GDI 
(a:0.00,g:3.29,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s)

out: 15.48m/s APT: 0.0625s/m idle: 0.08% wait: 50.90% time: 20.09s

[And ten minutes later]
08/16/2006 17:36:38:
SIRM version:             0.1
SIRM message id:          1
start time:               08/16/2006 12:56:50 (1155725810)
run time [s]:             16788
messages in read buffer:  3218
messages in write buffer: 0
nr. of connected clients: 76
status:                   0
info:                     TET: R (0.67) | EDT: R (0.01) | SIGT: R 
(16787.86) | MT(1): R (0.04) | MT(2): R (0.01) | OK
Monitor:
08/16/2006 17:36:30 | TET: runs: 0.40r/s (pending: 11.00 executed: 
0.40/s) out: 0.00m/s APT: 0.0116s/m idle: 99.53% wait: 0.40% time:
20.00s
08/16/2006 17:36:30 | EDT: runs: 27.95r/s (clients: 1.00 mod: 0.00/s 
ack: 0.00/s blocked: 0.00 busy: 1.00 | events: 27.15/s added: 27.15/s 
skipt: 0.00/s) out: 0.00m/s APT: 0.0002s/m idle: 99.49% wait: 0.05% 
time: 20.00s
08/16/2006 12:56:50 | SIGT: no monitoring data available
08/16/2006 17:36:33 | MT(1): runs: 11.52r/s (execd 
(l:0.70,j:11.52,c:0.70,p:0.70,a:0.00)/s GDI 
(a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s)

out: 11.12m/s APT: 0.0868s/m idle: 0.05% wait: 47.95% time: 20.05s
08/16/2006 17:36:33 | MT(2): runs: 11.28r/s (execd 
(l:0.90,j:11.28,c:0.90,p:0.90,a:0.00)/s GDI 
(a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s)

out: 10.83m/s APT: 0.0886s/m idle: 0.05% wait: 49.47% time: 19.94s


Hopefully this helps to analyse the problem. I'm looking forward to your

answer, thanks

        Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list