[GE users] schedd profiling and sge_mirror

John Tseng jtseng at montalvosystems.com
Sat May 26 15:36:26 BST 2007


    [ The following text is in the "ISO-8859-1" character set. ]
    [ Your display is set for the "ISO-8859-10" character set.  ]
    [ Some special characters may be displayed incorrectly. ]

Hi folks, we are running into an issue where our sched/qmaster rockets 
in memory size: 7 and 5 GB respectievly in less than 30 minutes.

Below are the two consective scheduling cycle (the :06 and :14 minutes 
after the hour), labeled "GOOD" and "BAD".
It appears that the "BAD" scheduling cycle had to "sge_mirror" 221304 
events which caused or is due to a memory leak.

We are running 6.0u10 on RHEL4 on a opteron machine with an average of 
600 running jobs and about 200 task array jobs (10,000 total "runs")

Reviewing the source code seems to indicate that the scheduler "mirrors" 
the state of the system from the qmaster.  Is it possible
to find out why the number of events jump from 25-50, to over 
200/400,000 in  a short period of time?

Is it possible that the issue is on the qmaster side?  Or is this some 
unusualy bug in sge_mirror?

This seems to happen every 3 days and typically at 11 PM localtime.  We 
are trying to understand what is trigging this.
Any help would greatly be appreciated. 

Thanks,
john


"GOOD"
05/24/2007 23:06:48|schedd|sge-master|P|PROF: sge_mirror processed 23 events in 0.060 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: scheduled in 0.680 (u 0.680 + s 0.000 = 0.680): 1 sequential, 0 parallel, 690 orders, 281 H, 931 Q, 2366 QA, 50 J(qw), 42 J(r), 13 J(s), 0 J(h), 0 J(e), 1 J(x), 147 J(all), 246 C, 5 ACL, 3 PE, 42 U, 1 D, 4 PRJ, 0 ST, 2 CKPT, 0 RU, 1 gMes, 0 jMes, 594/2 pre-send, 0/0/0 pe-alg

"BAD" (immediately following GOOD)
05/24/2007 23:13:59|schedd|sge-master|P|PROF: sge_mirror processed 221304 events in 294.850 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: scheduled in 0.680 (u 0.670 + s 0.000 = 0.670): 16 sequential, 0 parallel, 705 orders, 281 H, 911 Q, 2366 QA, 50 J(qw), 42 J(r), 13 J(s), 0 J(h), 0 J(e), 0 J(x), 147 J(all), 246 C, 5 ACL, 3 PE, 42 U, 1 D, 4 PRJ, 0 ST, 2 CKPT, 0 RU, 1 gMes, 0 jMes, 609/2 pre-send, 0/0/0 pe-alg




Details data:
"GOOD" scheduling cycle

05/24/2007 23:06:47|schedd|sge-master|P|PROF: sge_mirror processed 40 
events in 0.010 s

05/24/2007 23:06:48|schedd|sge-master|P|PROF: static urgency took 0.000 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: job ticket calculation: 
init: 0.000 s, pass 0: 0.020 s, pass 1: 0.000, pass2
: 0.000, calc: 0.010 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: job ticket calculation: 
init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2
: 0.000, calc: 0.000 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: normalizing job tickets 
took 0.000 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: create active job orders: 
0.010 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: job-order calculation took 
0.040 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: job sorting took 0.010 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: job dispatching took 0.110 
s (0 fast, 51 comp, 0 pe, 0 res)
05/24/2007 23:06:48|schedd|sge-master|P|PROF: create pending job orders: 
0.000 s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: scheduled in 0.680 (u 
0.680 + s 0.000 = 0.680): 1 sequential, 0 parallel, 69
0 orders, 281 H, 931 Q, 2366 QA, 50 J(qw), 42 J(r), 13 J(s), 0 J(h), 0 
J(e), 1 J(x), 147 J(all), 246 C, 5 ACL, 3 PE, 42 U,
 1 D, 4 PRJ, 0 ST, 2 CKPT, 0 RU, 1 gMes, 0 jMes, 594/2 pre-send, 0/0/0 
pe-alg

05/24/2007 23:06:48|schedd|sge-master|P|PROF: send orders and cleanup 
took: 0.110 (u 0.060,s 0.000) s
05/24/2007 23:06:48|schedd|sge-master|P|PROF: schedd run took: 1.040 s 
(init: 0.000 s, copy: 0.190 s, run:0.790, free: 0.0
60 s, jobs: 147, categories: 12/25)
05/24/2007 23:06:48|schedd|sge-master|P|PROF: sge_mirror processed 23 
events in 0.060 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: static urgency took 0.000 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: job ticket calculation: 
init: 0.010 s, pass 0: 0.020 s, pass 1: 0.000, pass2
: 0.000, calc: 0.010 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: job ticket calculation: 
init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2
: 0.000, calc: 0.000 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: normalizing job tickets 
took 0.000 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: create active job orders: 
0.010 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: job-order calculation took 
0.160 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: job sorting took 0.000 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: job dispatching took 0.000 
s (0 fast, 50 comp, 0 pe, 0 res)
05/24/2007 23:06:49|schedd|sge-master|P|PROF: create pending job orders: 
0.000 s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: scheduled in 0.710 (u 
0.550 + s 0.000 = 0.550): 0 sequential, 0 parallel, 68
9 orders, 281 H, 931 Q, 2366 QA, 50 J(qw), 42 J(r), 13 J(s), 0 J(h), 0 
J(e), 0 J(x), 147 J(all), 246 C, 5 ACL, 3 PE, 42 U,
 1 D, 4 PRJ, 0 ST, 2 CKPT, 0 RU, 1 gMes, 0 jMes, 593/1 pre-send, 0/0/0 
pe-alg
05/24/2007 23:06:49|schedd|sge-master|P|PROF: send orders and cleanup 
took: 0.090 (u 0.020,s 0.000) s
05/24/2007 23:06:49|schedd|sge-master|P|PROF: schedd run took: 1.040 s 
(init: 0.000 s, copy: 0.180 s, run:0.800, free: 0.0
60 s, jobs: 147, categories: 12/25)



"BAD" scheduling cycle. (again immediately following GOOD)
05/24/2007 23:13:59|schedd|sge-master|P|PROF: sge_mirror processed 
221304 events in 294.850 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: static urgency took 0.000 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: job ticket calculation: 
init: 0.010 s, pass 0: 0.020 s, pass 1: 0.000, pass2
: 0.000, calc: 0.000 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: job ticket calculation: 
init: 0.010 s, pass 0: 0.000 s, pass 1: 0.000, pass2
: 0.000, calc: 0.000 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: normalizing job tickets 
took 0.000 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: create active job orders: 
0.000 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: job-order calculation took 
0.050 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: job sorting took 0.000 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: job dispatching took 0.200 
s (0 fast, 66 comp, 0 pe, 0 res)
05/24/2007 23:14:00|schedd|sge-master|P|PROF: create pending job orders: 
0.000 s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: scheduled in 0.680 (u 
0.670 + s 0.000 = 0.670): 16 sequential, 0 parallel, 7
05 orders, 281 H, 911 Q, 2366 QA, 50 J(qw), 42 J(r), 13 J(s), 0 J(h), 0 
J(e), 0 J(x), 147 J(all), 246 C, 5 ACL, 3 PE, 42 U
, 1 D, 4 PRJ, 0 ST, 2 CKPT, 0 RU, 1 gMes, 0 jMes, 609/2 pre-send, 0/0/0 
pe-alg

05/24/2007 23:14:00|schedd|sge-master|P|PROF: send orders and cleanup 
took: 0.050 (u 0.010,s 0.000) s
05/24/2007 23:14:00|schedd|sge-master|P|PROF: schedd run took: 1.350 s 
(init: 0.000 s, copy: 0.570 s, run:0.730, free: 0.0
50 s, jobs: 147, categories: 12/25)
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): profiling 
summary:

05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
other          : wc =     16.880s, utime =      2.260s, stime =
      2.370s, utilization =  27%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
packing        : wc =   3428.210s, utime =   3419.270s, stime =
      4.910s, utilization = 100%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
eventclient    : wc =    251.660s, utime =    231.660s, stime =
     18.780s, utilization = 100%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
mirror         : wc =   3597.870s, utime =   3589.920s, stime =
      3.200s, utilization = 100%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
gdi            : wc =  29128.360s, utime =    246.570s, stime =
    255.770s, utilization =   2%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
ht-resize      : wc =      0.000s, utime =      0.000s, stime =
      0.000s, utilization =   0%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
scheduler      : wc =  13720.790s, utime =  13639.370s, stime =
     64.430s, utilization = 100%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): pending 
ticket : wc =    452.940s, utime =    428.860s, stime =
      7.460s, utilization =  96%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): job 
dispatching: wc =   1663.660s, utime =   1661.870s, stime =
      1.760s, utilization = 100%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): send 
orders    : wc =    200.850s, utime =    200.480s, stime =
      0.630s, utilization = 100%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): scheduler 
event: wc =      0.150s, utime =      0.100s, stime =
      0.110s, utilization = 140%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): copy 
lists     : wc =   6673.620s, utime =   6700.680s, stime =
      6.590s, utilization = 101%
05/24/2007 23:14:00|schedd|sge-master|P|PROF(-1789437376): 
total          : wc =  60624.960s, utime =  31599.940s, stime =
    374.700s, utilization =  53%


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list