[GE users] Failed receiving gdi request

Thomas Neumann Thomas.Neumann at exasol.com
Thu Aug 17 08:34:50 BST 2006



Hello!

Yesterday the problem with the qmaster occurred twice (at ~13:00 and 
~17:25). I have collected the data you asked for:

1) Data from qmaster host:

top - 17:25:12 up 122 days, 23:23,  1 user,  load average: 4.35, 5.72, 4.96
Tasks: 152 total,   1 running, 151 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.1% us,  2.3% sy,  0.2% ni, 93.3% id,  0.7% wa,  0.3% hi,  1.0% si
Mem:   3116188k total,  1423984k used,  1692204k free,   369328k buffers
Swap:  7823644k total,        4k used,  7823640k free,   460340k cached
[...]
There is some NFS activity (32 nfsd processes)



2) Cluster data:
70 machines in configuration.
Average time for ssh login on machines: 3 sec
Load:
    Short time:  min 0.00, max 6.5, average: 0.806
    Long time: min 0.00, max 7.00, average: 0.911

Processes:
    min: 53, max: 201, average: 86



3) qstat output when the first alarm was raised (100 messages in read 
buffer, at 17:26):

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  66912 0.52067 startstop. sr           r     08/16/2006 16:55:39 64Bit@cmo52.gelb.exasol.com       20
  66913 0.55624 startstop. sr           r     08/16/2006 16:55:39 64Bit@cmo52.gelb.exasol.com       64
  66914 0.55624 startstop. sr           r     08/16/2006 16:55:39 64Bit@cmo52.gelb.exasol.com       64
  66909 0.51744 startstop. sr           r     08/16/2006 16:55:24 64Bit@cmo55.gelb.exasol.com       16
  66910 0.51744 startstop. sr           r     08/16/2006 16:55:24 64Bit@cmo55.gelb.exasol.com       16
  66911 0.52067 startstop. sr           r     08/16/2006 16:55:24 64Bit@cmo55.gelb.exasol.com       20
  66858 0.50500 INTERACTIV ct           r     08/16/2006 14:48:08 64Bit@cmo56.gelb.exasol.com       16
  66713 0.52391 INTERACTIV tn           r     08/16/2006 13:01:38 64Bit@cmw1.gelb.exasol.com        24
  66929 0.60500 package.13 bm           r     08/16/2006 17:22:25 64Bit@cmw10.gelb.exasol.com       32
  66926 0.60500 package.12 bm           r     08/16/2006 17:05:54 64Bit@cmw8.gelb.exasol.com        32
  66920 0.52067 startstop. sr           r     08/16/2006 17:02:39 cmps@cmp10.gelb.exasol.com        20
  66922 0.55624 startstop. sr           r     08/16/2006 17:02:39 cmps@cmp7.gelb.exasol.com         64
  66918 0.51744 startstop. sr           r     08/16/2006 17:02:24 cns@cn15.gelb.exasol.com          16
  66921 0.55624 startstop. sr           r     08/16/2006 17:02:39 cns@cn27.gelb.exasol.com          64
  66906 0.60450 package.12 bm           qw    08/16/2006 16:46:02                                   16
  66928 0.60450 package.12 se           qw    08/16/2006 17:17:26                                   16



4) Extract from jobcounting:

time   jobs   messages in read buffer
[... everything looks normal - last start after the hang at 13:01 ]
17:01:00 14 0
17:02:00 14 0
17:03:01 22 0
17:04:01 22 0
17:05:02 26 0
17:06:04 25 0
17:07:05 23 0
17:08:06 23 0
17:09:07 22 8
17:10:08 21 0
17:11:08 21 0
17:12:09 21 0
17:13:10 21 0
17:14:10 21 0
17:15:11 20 0
17:16:12 20 0
17:17:13 19 0
17:18:13 18 12
17:19:14 19 10
17:20:15 19 43
17:21:15 19 64
17:22:16 17 16
17:23:17 17 15
17:24:17 16 6
17:25:18 16 122
17:26:18 16 246
17:27:19 16 413
17:28:19 16 606
17:29:20 16 851
17:30:21 15 1169
17:31:21 15 1303
17:32:22 13 1471
17:33:22 13 1566
17:34:23 13 1885
17:35:23 13 2262
17:36:24 13 2673
17:37:24 13 3066
17:38:25 13 3529
17:39:25 13 4034
17:40:25 12 4610
[...]
17:49:30 12 11250
[...]
17:54:32 12 14941
[... here we stopped the qmaster and restarted the whole system ]
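
To put a number on how fast the read buffer fills up, the samples above can be fed through a small script. This is only a sketch in Python and not part of our monitoring setup; the names parse_samples and growth_rate are made up for illustration, and only a handful of the samples are copied in.

```python
from datetime import datetime

# A few "time jobs messages-in-read-buffer" samples copied from the extract.
SAMPLES = """\
17:24:17 16 6
17:25:18 16 122
17:30:21 15 1169
17:40:25 12 4610
17:49:30 12 11250
17:54:32 12 14941
"""


def parse_samples(text):
    """Return a list of (time, jobs, messages) tuples; skip anything else."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # header, "[...]" markers, blank lines
        try:
            stamp = datetime.strptime(parts[0], "%H:%M:%S")
            rows.append((stamp, int(parts[1]), int(parts[2])))
        except ValueError:
            continue  # three fields, but not a sample line
    return rows


def growth_rate(rows):
    """Average backlog growth in messages per second, first to last sample."""
    (t0, _, m0), (t1, _, m1) = rows[0], rows[-1]
    return (m1 - m0) / (t1 - t0).total_seconds()


rows = parse_samples(SAMPLES)
# Measure from 17:25:18, the first sample of the final build-up.
rate = growth_rate(rows[1:])
print(f"read buffer grows by {rate:.2f} messages/s")
```

Between 17:25:18 and 17:54:32 this works out to roughly 8 messages per second of net backlog growth, averaged over the whole interval.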




5) Output of qping:

08/16/2006 17:26:42:
SIRM version:             0.1
SIRM message id:          1
start time:               08/16/2006 12:56:50 (1155725810)
run time [s]:             16192
messages in read buffer:  452
messages in write buffer: 0
nr. of connected clients: 81
status:                   0
info:                     TET: R (2.94) | EDT: R (0.00) | SIGT: R (16192.12) | MT(1): R (0.00) | MT(2): R (0.05) | OK
Monitor:
08/16/2006 17:26:30 | TET: runs: 0.40r/s (pending: 11.00 executed: 0.40/s) out: 0.00m/s APT: 0.0129s/m idle: 99.48% wait: 0.44% time: 20.00s
08/16/2006 17:26:30 | EDT: runs: 39.35r/s (clients: 1.00 mod: 0.05/s ack: 0.05/s blocked: 0.00 busy: 0.59 | events: 39.10/s added: 39.10/s skipt: 0.00/s) out: 0.05m/s APT: 0.0003s/m idle: 98.82% wait: 0.08% time: 20.00s
08/16/2006 12:56:50 | SIGT: no monitoring data available
08/16/2006 17:26:33 | MT(1): runs: 16.14r/s (execd (l:1.05,j:15.84,c:1.05,p:1.05,a:0.00)/s GDI (a:0.10,g:1.64,m:0.05,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.05/s) out: 15.74m/s APT: 0.0619s/m idle: 0.07% wait: 46.75% time: 20.07s
08/16/2006 17:26:33 | MT(2): runs: 15.98r/s (execd (l:1.15,j:15.68,c:1.15,p:1.15,a:0.00)/s GDI (a:0.00,g:3.29,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 15.48m/s APT: 0.0625s/m idle: 0.08% wait: 50.90% time: 20.09s

[And ten minutes later]
08/16/2006 17:36:38:
SIRM version:             0.1
SIRM message id:          1
start time:               08/16/2006 12:56:50 (1155725810)
run time [s]:             16788
messages in read buffer:  3218
messages in write buffer: 0
nr. of connected clients: 76
status:                   0
info:                     TET: R (0.67) | EDT: R (0.01) | SIGT: R (16787.86) | MT(1): R (0.04) | MT(2): R (0.01) | OK
Monitor:
08/16/2006 17:36:30 | TET: runs: 0.40r/s (pending: 11.00 executed: 0.40/s) out: 0.00m/s APT: 0.0116s/m idle: 99.53% wait: 0.40% time: 20.00s
08/16/2006 17:36:30 | EDT: runs: 27.95r/s (clients: 1.00 mod: 0.00/s ack: 0.00/s blocked: 0.00 busy: 1.00 | events: 27.15/s added: 27.15/s skipt: 0.00/s) out: 0.00m/s APT: 0.0002s/m idle: 99.49% wait: 0.05% time: 20.00s
08/16/2006 12:56:50 | SIGT: no monitoring data available
08/16/2006 17:36:33 | MT(1): runs: 11.52r/s (execd (l:0.70,j:11.52,c:0.70,p:0.70,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 11.12m/s APT: 0.0868s/m idle: 0.05% wait: 47.95% time: 20.05s
08/16/2006 17:36:33 | MT(2): runs: 11.28r/s (execd (l:0.90,j:11.28,c:0.90,p:0.90,a:0.00)/s GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s) out: 10.83m/s APT: 0.0886s/m idle: 0.05% wait: 49.47% time: 20.05s
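
What stands out in both dumps is that the two MT threads report almost no idle time while TET and EDT are nearly always idle. To compare the threads at a glance, the idle and wait percentages can be pulled out of the monitor lines with a few lines of Python. This is a rough sketch that only assumes the line layout shown above; thread_load and busy_threads are illustrative names, and the 1% idle threshold is an arbitrary choice, not anything defined by Grid Engine.

```python
import re

# Monitor lines from the 17:36 dump, wrapped as in the mail.
SAMPLE = """\
08/16/2006 17:36:30 | TET: runs: 0.40r/s (pending: 11.00 executed:
0.40/s) out: 0.00m/s APT: 0.0116s/m idle: 99.53% wait: 0.40% time: 20.00s
08/16/2006 12:56:50 | SIGT: no monitoring data available
08/16/2006 17:36:33 | MT(1): runs: 11.52r/s (execd
(l:0.70,j:11.52,c:0.70,p:0.70,a:0.00)/s GDI
(a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s)
out: 11.12m/s APT: 0.0868s/m idle: 0.05% wait: 47.95% time: 20.05s
08/16/2006 17:36:33 | MT(2): runs: 11.28r/s (execd
(l:0.90,j:11.28,c:0.90,p:0.90,a:0.00)/s GDI
(a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s event-acks: 0.00/s)
out: 10.83m/s APT: 0.0886s/m idle: 0.05% wait: 49.47% time: 19.94s
"""

# Thread name, then the first "idle: X% wait: Y%" pair that follows it.
PATTERN = re.compile(
    r"\| (\w+(?:\(\d+\))?): runs: .*? idle: ([\d.]+)% wait: ([\d.]+)%"
)


def thread_load(text):
    """Map thread name -> (idle %, wait %) for threads with monitoring data."""
    flat = " ".join(text.split())  # undo the mail client's line wrapping
    return {
        name: (float(idle), float(wait))
        for name, idle, wait in PATTERN.findall(flat)
    }


def busy_threads(loads, max_idle=1.0):
    """Names of threads whose idle share is below max_idle percent."""
    return [name for name, (idle, _wait) in loads.items() if idle < max_idle]


loads = thread_load(SAMPLE)
for name, (idle, wait) in loads.items():
    print(f"{name:6s} idle {idle:6.2f}%  wait {wait:6.2f}%")
print("busy:", busy_threads(loads))
```

Threads without monitoring data (SIGT here) are simply left out of the result.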


I hope this helps to analyse the problem. I'm looking forward to your 
answer. Thanks,

        Thomas
