#464 new enhancement

IZ2386: Slow scheduling with many queue instances

Reported by: sgaure Owned by:
Priority: normal Milestone:
Component: sge Version: 6.1u2
Severity: Keywords: scheduling


[Imported from gridengine issuezilla]

        Issue #:      2386             Platform:     All           Reporter: sgaure (sgaure)
       Component:     gridengine          OS:        All
     Subcomponent:    scheduling       Version:      6.1u2            CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    ENHANCEMENT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     andreas
       * Summary:     Slow scheduling with many queue instances
   Status whiteboard:

     Issue 2386 blocks:
   Votes for issue 2386:

   Opened: Tue Oct 2 14:13:00 -0700 2007 

We are having problems with scheduling taking too long, i.e. qlogins time out
or have to wait half a minute or more, even though there are lots of available
slots in the cluster. With max_reservation=1 (instead of the current 0), we are
most of the time not able to schedule interactive jobs at all.  We typically
have approx. 10-20 pending jobs, some of them array jobs with 10-100k tasks. We
have set max_pending_tasks_per_job=20.

I've done some profiling with oprofile on sge_schedd and found that it spends
most of its time in sge_eval_expression(), sge_is_expression(), sge_strlcpy(),
sge_hostcpy() and sge_hostcmp().

We have approx. 25 cluster queues on approx. 450 nodes, i.e. approx. 10,000
queue instances.  One of the queues is subordinate to all the others.  Access
is governed by resource quota sets (rqs).

Though I have not yet been able to run gprof (this is a system in full
production), it seems very likely that the routine qinstance_list_locate() is
to blame.  It does a linear search through the list of queue instances, with
two quite elaborate tests (sge_eval_expression()) that involve a lot of setup
(and copying, in sge_hostcmp()) for what boils down to, more or less, a
strcmp().  I guess it must be initiated from the cqueue_locate_qinstance()
call in so_list_resolve().

For scalability reasons, I strongly suggest this part of the code be rewritten
to be more efficient.

   ------- Additional comments from sgaure Wed Oct 3 02:52:48 -0700 2007 -------
Nah, it's not in one place, it's all over the place: linear searches through
host lists and queue instances, and calls into fancy routines with malloc,
copying and free that typically end in a simple strcmp(), over and over again
on the same data. We have problems now, with a mere 450 nodes and a 4-CPU
dedicated sge server. In a couple of years we might have 3000.  If sge_schedd
could be made parallel, we could set aside a 20-node cluster with 160 CPUs for
running it.

A serious effort should be made to make sge scale.

   ------- Additional comments from sgaure Fri Oct 5 17:28:45 -0700 2007 -------
When my sge_schedd gets one of its fits, the copy in sge_hostcmp() is the top
CPU hog.  I suppose the hostcpy is used to strip domain names etc., but this
should not have to be done over and over again during scheduling: hostnames
should be normalized before they are admitted to the scheduler's internal
structures. If users are required to see the exact hostname they supplied, a
literal copy can be kept in the job structure.

   ------- Additional comments from andreas Mon Oct 8 10:29:36 -0700 2007 -------
Thanks for reporting the observations. Efforts to improve the scheduler with
many, many queue instances are in fact already underway, and 6.1u3 will
already be faster. I cannot predict how large the improvement will be for your
particular setup, but the gains are significant, in particular when many
resource quotas and many hosts are involved.

Implementation-wise, the improvements do not change functions like
sge_eval_expression(), but instead aim at reducing the overall number of calls
to such functions.
