Custom Query (431 matches)

Results (103 - 105 of 431)

Ticket | Resolution | Summary | Owner | Reporter
#1590 | fixed | Hangs in execd status - with FIX | Dave Love <d.love@…> | pete.forman@…
Description

I have installed SGE 8.1.9 from RPM on SUSE 11 SP4.

PROBLEM: Running this command hangs with no output.

sudo /sbin/service sgeexecd.p6444 status

DIAGNOSIS: The cause is that the sgeexecd script fails to set pidfile on the status path; the subsequent cat then has no file argument, reads from STDIN, and waits indefinitely.
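
A minimal sketch of that failure mode, using the script's variable names (the empty value is an assumption for illustration; running this will block, which is the point):

   #!/bin/sh
   pidfile=""                  # never assigned on the "status" code path
   if [ -f $pidfile ]; then    # expands to `[ -f ]`, a one-argument test that is true
      pid=`cat $pidfile`       # expands to a bare cat, which reads STDIN and hangs
   fi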

FIX: This is the patch I applied to the script in /etc/init.d. It is a two-line change that can also be applied to the corresponding source file.

--- sgeexecd.p6444.orig 2016-11-18 17:42:28.000000000 +0000
+++ sgeexecd.p6444      2016-11-22 15:16:04.000000000 +0000
@@ -448,9 +448,10 @@
    fi

    if [ "$status" = true ]; then
+      pidfile=$execd_run_dir/execd.pid
       if [ -f $pidfile ]; then
          pid=`cat $pidfile`
-         if $utilbin_dir/checkprog $pid $name > /dev/null; then
+         if $utilbin_dir/checkprog $pid sge_execd > /dev/null; then
             echo "execd (pid $pid) is running..."
             exit 0
          else

VERIFY: Applying that patch causes the status subcommand to report correctly.
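
With the patch in place, the status call returns immediately with output along these lines (pid value illustrative):

   $ sudo /sbin/service sgeexecd.p6444 status
   execd (pid 12345) is running...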

Regards,

Pete Forman Senior Programmer

Dolphin Geophysical Limited, a member of the Shearwater GeoServices Group

Direct Dial: +44 (0) 1892 707173 / Mobile: +44 (0) 7840 797658 / pete.forman@… Brockbourne House, 77 Mount Ephraim, Tunbridge Wells, Kent, TN4 8BS, UK www.shearwatergeo.com


#1557 | fixed | darcs patch: aimk changes needed for cygwin compile | Marco Schmidt <marco.schmidt@…> | marco.schmidt@…
Description

1 patch for repository http://arc.liv.ac.uk/repos/darcs/sge:

Fri Sep 18 12:48:50 CEST 2015 Marco Schmidt <marco.schmidt@…>

  • aimk changes needed for cygwin compile

patch-preview.txt

aimk-changes-needed-for-cygwin-compile.dpatch
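
For reference, a darcs patch bundle such as the attachment above would normally be applied from a checkout of that repository roughly as follows (local paths assumed):

   darcs get http://arc.liv.ac.uk/repos/darcs/sge
   cd sge
   darcs apply ../aimk-changes-needed-for-cygwin-compile.dpatch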

#435 | fixed | IZ2298: array job accounting or scheduling problem | Mark Dixon <m.c.dixon@…> | pascalucsf
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2298]

     Issue #:            2298
     Component:          gridengine
     Subcomponent:       scheduling
     Version:            6.1
     Platform:           All
     OS:                 All
     Reporter:           pascalucsf (pascalucsf)
     CC:                 None defined
     Status:             REOPENED
     Priority:           P3
     Resolution:
     Issue type:         DEFECT
     Target milestone:   ---
     Assigned to:        andreas (andreas)
     QA Contact:         andreas
     URL:
     Summary:            array job accounting or scheduling problem
     Status whiteboard:
     Attachments:

     Issue 2298 blocks:
     Votes for issue 2298:  35


   Opened: Tue Jun 19 14:26:00 -0700 2007 
------------------------


Array job usage seems to be accounted for incorrectly.

Example:

100 CPUs on the cluster, OS fairshare policy, evenly balanced share tree.
User a submits 1000 jobs.
User b submits 1000 jobs.
User c submits 1 array job, with 1000 members.

Results look something like:
48 of user a's jobs running at any time
48 of user b's jobs running at any time
4 of user c's array job members run at any time.

If the queue is empty, except for user c's jobs, they will all begin executing.
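
For concreteness, the two submission styles being compared amount to roughly the following (job script name assumed):

   # users a and b: 1000 individual jobs each
   for i in `seq 1000`; do qsub burn.sh; done

   # user c: one array job with 1000 tasks
   qsub -t 1-1000 burn.sh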

Looking at fairshare usage (via qmon) shows that user c's "Actual Resource
Share" (policy configuration -> share tree policy) is very high (like 50-80%).

I can provide detailed configuration on request.

   ------- Additional comments from pascalucsf Thu Jun 21 14:57:29 -0700 2007 -------
Notes from testing:

4 nodes totaling 7 CPUs on all.q; each node has 4 slots in the queue config.

scheduler conf:

policy_hierarchy OS
weight_tickets_share 100000

share tree:

id=0
name=template
type=0
shares=0
childnodes=1
id=1
name=default
type=0
shares=100
childnodes=NONE
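
(The dump above is in the share tree format handled by qconf; it can be inspected or loaded with something like the following, where the file name is an assumption.)

   qconf -sstree                  # show the currently configured share tree
   qconf -Astree sharetree.txt    # load a share tree definition from a file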

The queue is disabled and empty.
1000 individual jobs are queued as user pascal.
1 array job of 1000 subjobs is queued as user ben.

usage is cleared (qconf -clearusage)

at the starting line:

Queued per user:
   1000 pascal qw
   1000 ben qw

bang: qmod -e all.q

1 minute in:

Running per user:
      8 pascal r
      8 ben r
Queued per user:
    992 pascal qw
    992 ben qw

(jobs are cpuburners, 5 minutes each)

A while later:


Running per user:
     10 pascal r
      1 ben r
Queued per user:
    991 ben qw
    973 pascal qw


And it continues this way.
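
(For context, a 5-minute cpuburner of the kind used in this test could be as simple as the sketch below; this is an assumed stand-in, not the reporter's actual job script.)

   #!/bin/sh
   # burn.sh - spin the CPU for roughly 300 seconds
   end=$(( $(date +%s) + 300 ))
   while [ $(date +%s) -lt $end ]; do :; done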

   ------- Additional comments from andreas Thu Jun 28 11:49:32 -0700 2007 -------
It is very, very likely that this is yet another symptom of #2222.
The fix for #2222 will be part of 6.1u1 once it is available.
In 6.0u11, #2222 is already fixed.

*** This issue has been marked as a duplicate of 2222 ***

   ------- Additional comments from andreas Fri Jun 29 02:04:12 -0700 2007 -------
Revert: Can't be duplicate since #2222 was already fixed with 6.1.

   ------- Additional comments from andreas Fri Jun 29 02:22:39 -0700 2007 -------
What are you using for

   weight_urgency
   weight_ticket
   weight_waiting_time

in sched_conf(5)? If your waiting-time weight is non-zero, this could
cause the phenomenon you observe. The reason is that waiting time contributes
to job urgency, and urgency has a higher weight than the ticket policy.
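
(A quick way to pull these values from a live cluster, assuming qconf is on the PATH:)

   qconf -ssconf | grep '^weight_'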

   ------- Additional comments from pascalucsf Fri Jun 29 08:25:58 -0700 2007 -------
These are all of the weight_* values for configurations where I have seen this
problem.

----------------------------------------------------
weight_ticket                     1.000000
weight_waiting_time               0.000000
weight_deadline                   0.000000
weight_urgency                    0.000000
weight_priority                   0.000000
----------------------------------------------------
weight_ticket                     0.900000
weight_waiting_time               0.000000
weight_deadline                   0.000000
weight_urgency                    0.000000
weight_priority                   0.100000
----------------------------------------------------
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
----------------------------------------------------

   ------- Additional comments from andreas Mon Jul 2 08:28:45 -0700 2007 -------
After running

---> 100 times: qsub -P A -b y /bin/sleep 5
---> 100 times: qsub -P B -b y /bin/sleep 5
---> 1 times: qsub -t 1-100 -P C -b y /bin/sleep 5

with SHARETREE_RESERVED_USAGE=true set in the global cluster
configuration, sge_conf(5), I get a combined resource usage
that is sometimes surprisingly unbalanced. I played
around with different arrangements to get a clue about this
phenomenon:

Project| Comb. Usage  | Sum. Acct. Usage
-------------------------------------------
A      | 1136.78      | 1085
B      | 1161.77      | 1100
C      | 1292.73      | 1189        (array)
-------------------------------------------
A      | 1294.78      | 1222        (array)
B      | 1159.82      | 1080
C      | 1154.82      | 1097
-------------------------------------------
A      | 1052.86      |  997
B      | 1047.86      |  991
C      | 1224.82      | 1137        (array)
-------------------------------------------
A      |  782.36      |  655        (array)
B      |  646.80      |  590
C      |  645.80      |  586
-------------------------------------------
A      |  635.88      |  568
B      |  634.88      |  570
C      |  647.88      |  569        (array)
-------------------------------------------
A      |  700.77      |  640        (array)
B      |  697.77      |  633
C      |  670.77      |  605
-------------------------------------------
A      |  656.83      |  585        (array)
B      |  629.84      |  570
C      |  640.84      |  581
-------------------------------------------

This shows that the accounted usage of array jobs is consistently higher
than that of sequential jobs! I investigated it to the
point that I can say: for some mysterious reason, array tasks on
average take longer from fork() until /bin/sleep actually
starts. Interestingly, using the "-shell no" submit option finally
gave me a very well balanced distribution of per-project accounting,
but I still cannot explain why array jobs should be affected by this
overhead more than sequential jobs ... :-o

With regard to the share tree behaviour, I recommend using a far
lower compensation factor than the default. The compensation
factor controls how much projects with higher usage are
penalized. When I used 1 or 2 as the compensation factor I got
quite good results despite the unbalanced accounting.
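
(For anyone wanting to rerun this comparison, the submissions above amount to something like the sketch below; the projects A, B and C are assumed to exist already.)

   for i in `seq 100`; do qsub -P A -b y /bin/sleep 5; done
   for i in `seq 100`; do qsub -P B -b y /bin/sleep 5; done
   qsub -t 1-100 -P C -b y /bin/sleep 5

   # verify that reserved usage is enabled (appears in the execd_params line if set)
   qconf -sconf | grep SHARETREE_RESERVED_USAGE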

   ------- Additional comments from jlb Wed Mar 26 10:49:30 -0700 2008 -------
Testing on my production cluster (>300 nodes), I consistently see utilization
numbers (as reported by sge_share_mon) ~22% higher for array jobs than for the
equivalent number of individually submitted jobs.  This is a rather significant
difference, in my opinion.  Using "-shell no" has absolutely no effect on this
over-accounting in my testing.

   ------- Additional comments from pascalucsf Wed Mar 26 16:22:20 -0700 2008 -------
Another take on this:

From a running, saturated queue on a cluster of 256 CPUs across 128 machines, a
sample of 3 users (2 running 1 array job each, 1 running 21 single jobs) is
taken 60 seconds after a qconf -clearusage. In the following output, user1 is
the user with the single jobs.

Side notes:
Each of these jobs runs only on a single processor.
Each of these jobs is CPU bound.
Other jobs by other users are running on the cluster.
Other jobs by THESE users are NOT running on the cluster.

Wed Mar 26 15:57:34 PDT 2008
user1 jobs running:
21
user2 jobs running:
54
user3 jobs running:
53
Wed Mar 26 15:58:34 PDT 2008
usage_time   user_name   actual_share    usage           cpu
1206572554   user1        1120.975817     1120.975817     1426.595958
1206572554   user3       46592.519562    46592.519562    12893.812376
1206572554   user2       45888.666024    45888.666024    12691.742827
user1 jobs running:
21
user2 jobs running:
54
user3 jobs running:
53
user1 jobs running:

So what I think the times should be:
user1: 60 (seconds) * 21 (jobs) = 1260 cpuseconds
user2: 60 (seconds) * 54 (jobs) = 3240 cpuseconds
user3: 60 (seconds) * 53 (jobs) = 3180 cpuseconds

user1's output from sge_share_mon lines up reasonably well.
Users 2 and 3 are far over their estimated usage. Also, it's unclear why usage
and cpu differ so much, as I am only using CPU time for usage:

usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000

Are there any flaws in my testing method here?
Does this shine any light on the situation?

Thanks,
-Pascal

   ------- Additional comments from jlb Mon Apr 7 16:49:17 -0700 2008 -------
Further observation -- the CPU usage reported by qacct is essentially equal for
array jobs and equivalent numbers of individual jobs.  In other words, 'ltcpu'
as reported by sge_share_mon differs from 'CPU' as reported by qacct.  Does that
help narrow down where this bug may be at all?

Also, if I switch to the functional share policy, then array jobs are scheduled
with priority equal to that of individually submitted jobs.
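
(The per-owner CPU totals that qacct reports can be pulled with something like the following; the usernames are placeholders.)

   qacct -o user1    # per-owner usage summary, including the CPU column
   qacct -o user2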

   ------- Additional comments from andreas Mon Jul 7 09:08:19 -0700 2008 -------
Actually the result of my investigation was that array jobs on average
cause higher utilization in SGE accounting than sequential jobs, but I
could not find a reason for this.

Are you using local spooling for your execution daemons?

My suspicion was that the deviations from the ideal total job run-time are an
outcome of delays during job startup/shutdown due to a bottleneck
at the file server. That alone would not explain the higher utilization by
array jobs, but I think understanding the net/gross deviation is a prerequisite
for getting an idea of how to level out the sequential/array variation.
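
(Whether the execution daemons spool locally can be checked from the cluster configuration; a sketch, assuming a standard install:)

   qconf -sconf | grep execd_spool_dir    # where execds spool; a host-local directory here means local spooling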