Opened 12 years ago

Closed 4 years ago

#435 closed defect (fixed)

IZ2298: array job accounting or scheduling problem

Reported by: pascalucsf Owned by: Mark Dixon <m.c.dixon@…>
Priority: normal Milestone:
Component: sge Version: 6.1
Severity: major Keywords: scheduling
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2298]

        Issue #:           2298
        Component:         gridengine
        Subcomponent:      scheduling
        Version:           6.1
        Platform:          All
        OS:                All
        Reporter:          pascalucsf (pascalucsf)
        CC:                None defined
        Status:            REOPENED
        Priority:          P3
        Resolution:
        Issue type:        DEFECT
        Target milestone:  ---
        Assigned to:       andreas (andreas)
        QA Contact:        andreas
        URL:
      * Summary:           array job accounting or scheduling problem
        Status whiteboard:
        Attachments:

     Issue 2298 blocks:
   Votes for issue 2298:  35


   Opened: Tue Jun 19 14:26:00 -0700 2007 
------------------------


Array job usage seems to be accounted for incorrectly.

Example:

100 CPUs in the cluster, OS fairshare policy, evenly balanced share tree.
User a submits 1000 jobs.
User b submits 1000 jobs.
User c submits 1 array job, with 1000 members.

Results look something like:
48 of user a's jobs running at any time
48 of user b's jobs running at any time
4 of user c's array job members run at any time.

If the queue is empty except for user c's jobs, they will all begin executing.

Looking at fairshare usage (via qmon) shows that user c's "Actual Resource
Share" (policy configuration -> share tree policy) is very high (like 50-80%).

I can provide detailed configuration on request.

   ------- Additional comments from pascalucsf Thu Jun 21 14:57:29 -0700 2007 -------
Notes from testing:

4 nodes totaling 7 CPUs on all.q; each node has 4 slots in the queue configuration.

scheduler conf:

policy_hierarchy OS
weight_tickets_share 100000

share tree:

id=0
name=template
type=0
shares=0
childnodes=1
id=1
name=default
type=0
shares=100
childnodes=NONE

The queue is disabled and empty.
1000 individual jobs are queued as user pascal.
1 array job of 1000 subjobs is queued as user ben.

Usage is cleared (qconf -clearusage).

At the starting line:

Queued per user:
   1000 pascal qw
   1000 ben qw

bang: qmod -e all.q

1 minute in:

Running per user:
      8 pascal r
      8 ben r
Queued per user:
    992 pascal qw
    992 ben qw

(jobs are CPU burners, 5 minutes each)

A while later:


Running per user:
     10 pascal r
      1 ben r
Queued per user:
    991 ben qw
    973 pascal qw


And it continues this way.
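
For reference, the test sequence above corresponds roughly to the following
sketch (not the exact commands used; /path/to/cpuburn.sh stands in for the
5-minute CPU-burner script):

#!/bin/sh
# Sketch of the reproduction: queue the jobs while all.q is disabled,
# clear share-tree usage, then release everything at once.
qmod -d all.q                      # keep all.q disabled for now

# as user pascal: 1000 individual jobs
i=1
while [ "$i" -le 1000 ]; do
    qsub -b y /path/to/cpuburn.sh
    i=$((i + 1))
done

# as user ben: one array job with 1000 tasks
qsub -t 1-1000 -b y /path/to/cpuburn.sh

qconf -clearusage                  # clear share-tree usage
qmod -e all.q                      # "bang": let the scheduler loose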

   ------- Additional comments from andreas Thu Jun 28 11:49:32 -0700 2007 -------
It is very, very likely that this is yet another symptom of #2222.
The fix for #2222 will be part of 6.1u1 once it is available.
In 6.0u11, #2222 is already fixed.

*** This issue has been marked as a duplicate of 2222 ***

   ------- Additional comments from andreas Fri Jun 29 02:04:12 -0700 2007 -------
Revert: Can't be duplicate since #2222 was already fixed with 6.1.

   ------- Additional comments from andreas Fri Jun 29 02:22:39 -0700 2007 -------
What are you using for

   weight_urgency
   weight_ticket
   weight_waiting_time

in sched_conf(5)? If your waiting-time weight is non-zero, this could
cause the phenomenon you observe: waiting time contributes to job
urgency, and urgency has a higher weight than the ticket policy.
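
A quick way to check these values on a live cluster (a sketch; the grep
pattern just picks out the relevant sched_conf(5) parameters):

# show the current policy weights from the scheduler configuration
qconf -ssconf | grep -E 'weight_(urgency|ticket|waiting_time|deadline|priority)'

# if weight_waiting_time turns out to be non-zero, it can be changed by
# editing the scheduler configuration:
qconf -msconf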

   ------- Additional comments from pascalucsf Fri Jun 29 08:25:58 -0700 2007 -------
These are all of the weight_* values for configurations where I have seen this
problem.

----------------------------------------------------
weight_ticket                     1.000000
weight_waiting_time               0.000000
weight_deadline                   0.000000
weight_urgency                    0.000000
weight_priority                   0.000000
----------------------------------------------------
weight_ticket                     0.900000
weight_waiting_time               0.000000
weight_deadline                   0.000000
weight_urgency                    0.000000
weight_priority                   0.100000
----------------------------------------------------
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
----------------------------------------------------

   ------- Additional comments from andreas Mon Jul 2 08:28:45 -0700 2007 -------
After running

---> 100 times: qsub -P A -b y /bin/sleep 5
---> 100 times: qsub -P B -b y /bin/sleep 5
---> 1 times: qsub -t 1-100 -P C -b y /bin/sleep 5

with SHARETREE_RESERVED_USAGE=true set in the global cluster
configuration, sge_conf(5), I get a combined resource usage which is
sometimes surprisingly unbalanced. I played around with different
arrangements to get a clue about this phenomenon:

Project| Comb. Usage  | Sum. Acct. Usage
-------------------------------------------
A      | 1136.78      | 1085
B      | 1161.77      | 1100
C      | 1292.73      | 1189        (array)
-------------------------------------------
A      | 1294.78      | 1222        (array)
B      | 1159.82      | 1080
C      | 1154.82      | 1097
-------------------------------------------
A      | 1052.86      |  997
B      | 1047.86      |  991
C      | 1224.82      | 1137        (array)
-------------------------------------------
A      |  782.36      |  655        (array)
B      |  646.80      |  590
C      |  645.80      |  586
-------------------------------------------
A      |  635.88      |  568
B      |  634.88      |  570
C      |  647.88      |  569        (array)
-------------------------------------------
A      |  700.77      |  640        (array)
B      |  697.77      |  633
C      |  670.77      |  605
-------------------------------------------
A      |  656.83      |  585        (array)
B      |  629.84      |  570
C      |  640.84      |  581
-------------------------------------------

This shows that the accounted usage of array jobs is consistently higher
than that of sequential jobs! I investigated it to the point that I can
say: for some mysterious reason, array tasks on average take longer from
fork() until /bin/sleep actually starts. Interestingly, use of the
"-shell no" submit option finally gave me a well-balanced distribution
of per-project accounting, but I still cannot explain why array jobs
should be affected more by this overhead than sequential jobs ... :-o

With regard to the share tree behaviour, I recommend using a far lower
compensation factor than the default. The compensation factor controls
how much projects with higher usage are penalized. When I used 1 or 2 as
the compensation factor, I got quite good results despite the unbalanced
accounting.
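
The compensation factor referred to above is the compensation_factor
parameter of sched_conf(5). A sketch of checking and lowering it (the
value 2 follows the suggestion above):

# inspect the current value
qconf -ssconf | grep compensation_factor

# edit the scheduler configuration and set, for example:
#   compensation_factor   2.000000
qconf -msconf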

   ------- Additional comments from jlb Wed Mar 26 10:49:30 -0700 2008 -------
Testing on my production cluster (>300 nodes), I consistently see utilization
numbers (as reported by sge_share_mon) ~22% higher for array jobs than for the
equivalent number of individually submitted jobs.  This is a rather significant
difference, in my opinion.  Using "-shell no" has absolutely no effect on this
over-accounting in my testing.

   ------- Additional comments from pascalucsf Wed Mar 26 16:22:20 -0700 2008 -------
Another take on this:

From a running, saturated queue on a cluster of 256 CPUs across 128
machines, a sample of 3 users (2 running 1 array job each and 1 running
21 single jobs) is taken 60 seconds after a qconf -clearusage. In the
following output, user1 is the user with the single jobs.

Side notes:
Each of these jobs runs only on a single processor.
Each of these jobs is CPU bound.
Other jobs by other users are running on the cluster.
Other jobs by THESE users are NOT running on the cluster.

Wed Mar 26 15:57:34 PDT 2008
user1 jobs running:
21
user2 jobs running:
54
user3 jobs running:
53
Wed Mar 26 15:58:34 PDT 2008
usage_time      user_name    actual_share     usage            cpu
1206572554      user1        1120.975817      1120.975817      1426.595958
1206572554      user3        46592.519562     46592.519562     12893.812376
1206572554      user2        45888.666024     45888.666024     12691.742827
user1 jobs running:
21
user2 jobs running:
54
user3 jobs running:
53
user1 jobs running:

So what I think the times should be:
user1: 60 (seconds) * 21 (jobs) = 1260 cpuseconds
user2: 60 (seconds) * 54 (jobs) = 3240 cpuseconds
user3: 60 (seconds) * 53 (jobs) = 3180 cpuseconds

user1's output from sge_share_mon lines up reasonably well. Users 2 and 3
are well over their estimated usage. It is also unclear why usage and cpu
differ so much, as I am only using CPU time for usage:

usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000

Are there any flaws in my testing method here?
Does this shine any light on the situation?

Thanks,
-Pascal
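
The sampling method described above amounts to roughly this sketch (the
user names are placeholders, and the "-c 1" single-collection option for
sge_share_mon is an assumption; check sge_share_mon -help):

#!/bin/sh
# Clear usage, wait a minute, then compare running-job counts per user
# with the share-tree usage reported for the same users.
qconf -clearusage
sleep 60
date
for u in user1 user2 user3; do
    echo "$u jobs running:"
    qstat -u "$u" -s r | grep -c " r "
done
# one snapshot of usage_time / user_name / actual_share / usage / cpu
sge_share_mon -c 1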

   ------- Additional comments from jlb Mon Apr 7 16:49:17 -0700 2008 -------
Further observation -- the CPU usage reported by qacct is essentially equal for
array jobs and equivalent numbers of individual jobs.  In other words, 'ltcpu'
as reported by sge_share_mon differs from 'CPU' as reported by qacct.  Does that
help narrow down where this bug may be at all?

Also, if I switch to the functional share policy, then array jobs are scheduled
with priority equal to that of individually submitted jobs.
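
One way to put the two views side by side (a sketch; the owner names are
placeholders and the sge_share_mon "-c 1" option is an assumption):

# accounting-file view: per-owner CPU totals once the jobs have finished
qacct -o ben
qacct -o pascal

# share-tree view: the ltcpu/usage columns for the same users
sge_share_mon -c 1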

   ------- Additional comments from andreas Mon Jul 7 09:08:19 -0700 2008 -------
Actually the result of my investigation was that array jobs on average
cause higher utilization in SGE accounting than sequential jobs, but I
could not find a reason for this.

Are you using local spooling for your execution daemons?

My suspicion was that the deviations from the ideal total job run-time
are an outcome of delays during job startup/shutdown due to a bottleneck
at the file server. This would not explain the higher utilization by
array jobs, but I think understanding the net/gross deviation is a
prerequisite for getting an idea of how to level out the sequential/array
job variation.
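
Whether the execution daemons spool locally can be checked in the cluster
configuration (a sketch; <exec_host> is a placeholder, and a spool path on
a local filesystem rather than on the file server indicates local
spooling):

# global setting and any per-host override of the execd spool directory
qconf -sconf | grep execd_spool_dir
qconf -sconf <exec_host> | grep execd_spool_dir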

Attachments (15)

0001-build_usage_list-added-missing-static-keyword.patch (793 bytes) - added by markdixon 4 years ago.
0002-decay_and_sum_usage-updated-comments-and-whitespace.patch (6.5 KB) - added by markdixon 4 years ago.
0003-usage_list_sum-also-sum-finished_jobs-attribute.patch (1.9 KB) - added by markdixon 4 years ago.
0004-decay_and_sum_usage-use-usage_list_sum.patch (1.4 KB) - added by markdixon 4 years ago.
0005-Added-get_usage_or_create-replace-get_usage-create_u.patch (4.3 KB) - added by markdixon 4 years ago.
0006-Added-get_or_build_usage_list-extends-build_usage_li.patch (4.0 KB) - added by markdixon 4 years ago.
0007-Fix-435-Fix-task-array-usage-accumulation-while-runn.patch (1.9 KB) - added by markdixon 4 years ago.
0008-Move-job-usage-calc-decay_and_sum_usage-sum_job_usag.patch (2.8 KB) - added by markdixon 4 years ago.
0009-Move-old-job-usage-retrieval-decay_and_sum_usage-cop.patch (2.6 KB) - added by markdixon 4 years ago.
0010-Move-saving-old-job-usage-decay_and_sum_usage-save_o.patch (3.0 KB) - added by markdixon 4 years ago.
0011-Move-node-project-usage-updates-decay_and_sum_usage-.patch (6.8 KB) - added by markdixon 4 years ago.
0012-Added-usage_list_sub-subtract-one-usage-list-from-an.patch (5.3 KB) - added by markdixon 4 years ago.
0013-Fix-435-Fix-task-array-accumulation-where-tasks-over.patch (5.4 KB) - added by markdixon 4 years ago.
0014-Fix-435-Only-decay_and_sum_usage-once-per-job.patch (2.3 KB) - added by markdixon 4 years ago.
0015-Fix-435-sge_calc_tickets-redundant-finished-task-pro.patch (3.4 KB) - added by markdixon 4 years ago.


Change History (18)

comment:1 Changed 4 years ago by markdixon

  • Severity set to minor

Sharetree usage of array jobs is wrong:

Underlying storage keeps per-job usage, not per-task, but decay_and_sum_usage and delete_debited_job_usage are called per-task. This results in:

  • While tasks are running, usage accumulates only at the rate of the last task in the job array.
  • When a task ends, the spool forgets the entire job, so the entire usage of the last task is added to the spool instead of a delta from the last time usage was calculated.

So task array usage climbs too slowly, then jumps at the end of a task - either to the correct value if no other task is running, or to too large a value if other tasks are running.

A set of patches, prepared against 8.1.8, to fix this issue follows.

A side effect of the patches is that they squash a few attributes in the spool's user/project objects that don't need to be there (e.g. I doubt that the sum of all job submission times is useful).

Mark
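
One way to watch the symptom described above while an array job is running
(a sketch; the user name and interval are placeholders, and the
sge_share_mon "-c 1" option is an assumption):

#!/bin/sh
# With the bug, the user's usage climbs roughly at the rate of a single
# task while tasks run, then jumps each time a task finishes.
while true; do
    date
    qstat -u arrayuser -s r | grep -c " r "   # running task count
    sge_share_mon -c 1 | grep arrayuser       # share-tree usage snapshot
    sleep 30
done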

comment:2 Changed 4 years ago by markdixon

  • Severity changed from minor to major

comment:3 Changed 4 years ago by Mark Dixon <m.c.dixon@…>

  • Owner set to Mark Dixon <m.c.dixon@…>
  • Resolution set to fixed
  • Status changed from new to closed

In 4840/sge:

Fix #435: Correct share tree usage of array jobs
Sharetree usage of array jobs was wrong:

Underlying storage keeps per-job usage, not per-task, but
decay_and_sum_usage and delete_debited_job_usage are called
per-task. This results in:

  • While tasks are running, usage accumulates only at the rate of the last task in the job array.
  • When a task ends, the spool forgets the entire job, so the entire usage of the last task is added to the spool instead of a delta from the last time usage was calculated.

So task array usage climbs too slowly, then jumps at the end of a task -
either to the correct value if no other task is running, or to too
large a value if other tasks are running.

A side effect is that a few attributes that don't need to be there are
squashed from the spool's user/project objects (e.g. I doubt that the
sum of all job submission times is useful).
