Custom Query (431 matches)

Results (133 - 135 of 431)

Ticket Resolution Summary Owner Reporter
#467 fixed IZ2393: qrsh -inherit should access the 'rsh_command'-cache more persevering brs
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2393]

        Issue #:      2393             Platform:     PC       Reporter: brs (brs)
       Component:     gridengine          OS:        Linux
     Subcomponent:    execution        Version:      6.1u4       CC:    None defined
        Status:       STARTED          Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    roland (roland)
      QA Contact:     pollinger
          URL:
       * Summary:     qrsh -inherit should access the 'rsh_command'-cache more persevering
   Status whiteboard:
      Attachments:

     Issue 2393 blocks:
   Votes for issue 2393:


   Opened: Mon Oct 8 08:57:00 -0700 2007 
------------------------


This is an intermittent problem that occurs with greater frequency as you
increase the number of slots used in a parallel environment.  It seems to affect
only OpenMPI jobs.

Jobs would fail with the following OpenMPI messages from a fairly random number
of qrsh processes (sometimes 2 or 3, sometimes -- out of 128 processors -- almost
half of the calls would fail):

[rcn-ib-0013.rc.usf.edu:32328] ERROR: A daemon on node rcn-ib-0003.rc.usf.edu
failed to start as expected.
[rcn-ib-0013.rc.usf.edu:32328] ERROR: There may be more information available from
[rcn-ib-0013.rc.usf.edu:32328] ERROR: the 'qstat -t' command on the Grid Engine
tasks.
[rcn-ib-0013.rc.usf.edu:32328] ERROR: If the problem persists, please restart the
[rcn-ib-0013.rc.usf.edu:32328] ERROR: Grid Engine PE job
[rcn-ib-0013.rc.usf.edu:32328] ERROR: The daemon exited unexpectedly with status
255.

Then a core file from the qrsh (or qsh) executable would be dumped in the user's
home directory.  A quick backtrace with gdb revealed that qrsh was failing on a
call to execvp.  After some investigation, this call was found in
source/clients/qsh/qsh.c:789 (as of a cvs checkout from 10/08/2007).  Though I
am reporting this bug against 6.0u11, I also reproduced it at one point with 6.1u2.

I've gotten around this bug for the time being by LD_PRELOADing the following code:

/*
 * Place holder library to fix bug with qsh.c not correctly passing
 * arguments to execvp()
 */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/types.h>
#define __USE_GNU
#include <dlfcn.h>

int execvp(const char *file, char *const argv[]){
        int pid, argc, i;
        const char *file_n;
        char **args;

        static int (*func)(const char*, char *const *);
        struct utsname u_name;

        uname(&u_name);
        pid = (int)getpid();

        /* substitute a sane command if qrsh passed a NULL file argument */
        if (file == NULL){
                file_n = "/opt/sge-tools/rsh";
                fprintf (stderr, "[%s:%d] execvp(): had to modify null arguments\n",
                         u_name.nodename, pid);
        } else file_n = (const char *)file;

        /* count the arguments and allocate the replacement vector */
        for (argc = 0; argv[argc] != NULL; argc++)
                ;
        args = malloc(((argc > 0 ? argc : 1) + 1) * sizeof(char *));

        /* make sure argv[0] is sane, copy the rest verbatim */
        if (argv[0] == NULL) args[0] = "/opt/sge-tools/rsh";
        else args[0] = (char *)argv[0];
        for (i = 1; i < argc; i++)
                args[i] = (char *)argv[i];
        args[i] = NULL;    /* execvp() requires a NULL-terminated vector */

        /* hand the repaired call to the real execvp() */
        func = (int (*)(const char*, char *const *))dlsym(RTLD_NEXT, "execvp");

        return (func(file_n, args));
}

Basically, it ensures that the arguments passed to execvp are sane and modifies
them if they are not.  I also needed a wrapper for rsh (hence
/opt/sge-tools/rsh) so that the correct arch would be selected in $SGE_ROOT/utilbin.
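
(For reference, an interposer like this is typically built as a shared object,
e.g. "gcc -shared -fPIC -o execvp_fix.so execvp_fix.c -ldl", and injected by
pointing LD_PRELOAD at the resulting .so in the environment of the qrsh
processes; the file names here are only illustrative, not the ones actually
used on this cluster.)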

Here's some general information:
qconf -sq rcnib.q
qname                 rcnib.q
hostlist              @rcnibNodes
seq_no                51
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            2
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich-p4 ompi-ddr mpich-schrod
rerun                 FALSE
slots                 4
tmpdir                /opt/sge/tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            operators
xuser_lists           NONE
subordinate_list      NONE
complex_values        i_ib=true,p_low=false,t_devel=false
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                2M
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

qconf -sp ompi-ddr
pe_name           ompi-ddr
slots             1024
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

qconf -sconf
global:
execd_spool_dir              /opt/sge/rc/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           admins@rc.usf.edu
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    10000-10500
qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind
rsh_command                  /opt/sge-tools/rsh
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             100
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 1

And some more detailed output.  With the following code LD_PRELOADed, a good
call to execvp looks like this:

[rcn-ib-0037.rc.usf.edu:8123] execvp(): /opt/sge-tools/rsh /opt/sge-tools/rsh -n
-p 34391 rcn-ib-0027.rc.usf.edu exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
'/usr/local/sge-6.0-u11/rc/spool/rcn-ib-0027/active_jobs/4268.1/1.rcn-ib-0027'
noshell

A bad call will look something like this:

[rcn-ib-0037.rc.usf.edu:8150] execvp(): (null) (null) -n -p 34537
rcn-ib-0028.rc.usf.edu exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
'/usr/local/sge-6.0-u11/rc/spool/rcn-ib-0028/active_jobs/4268.1/1.rcn-ib-0028'
noshell


Here's the code:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/utsname.h>
#include <sys/types.h>
#define __USE_GNU
#include <dlfcn.h>

int execvp(const char *file, char *const argv[]){
        int pid, i;
        static int (*func)(const char*, char *const *);
        struct utsname u_name;

        uname(&u_name);
        pid = (int)getpid();

        /* log the call exactly as received; glibc prints "(null)" for NULL */
        fprintf (stderr, "[%s:%d] execvp(): %s %s ",
                 u_name.nodename, pid, file, argv[0]);
        for (i = 1; argv[i] != NULL; i++)
                fprintf(stderr, "%s ", argv[i]);
        fprintf(stderr, "\n");

        /* forward the unmodified call to the real execvp() */
        func = (int (*)(const char*, char *const *))dlsym(RTLD_NEXT, "execvp");

        return (func(file, argv));
}

I can provide more information if required.

   ------- Additional comments from brs Mon Oct 8 13:02:57 -0700 2007 -------
Ok, the actual execvp is in source/clients/qrsh/qrsh_starter.c, line 728.
Should have looked more carefully.

   ------- Additional comments from brs Mon Oct 8 13:05:21 -0700 2007 -------
changed ticket summary to reflect correct source file

   ------- Additional comments from brs Sun Apr 6 10:20:19 -0700 2008 -------
This issue still exists in 6.1u4.  Who can look at this issue with me?

   ------- Additional comments from brs Sun Apr 6 10:28:49 -0700 2008 -------
Would like someone to look at this ticket.  Perhaps it was sent to the wrong person.

   ------- Additional comments from andreas Mon Apr 7 08:55:36 -0700 2008 -------
Moving this to execution subcomponent.

   ------- Additional comments from brs Mon Apr 28 11:53:04 -0700 2008 -------
So, from more investigation, I think I've narrowed the problem down.  In short,
using NFS (at least with Solaris 10 as the NFS server) doesn't work reliably.
It seems that multiple qrsh -inherit processes are making gdi requests to
qmaster for the rsh_command data.  I don't know how many gdi requests qmaster
can handle simultaneously from a single job, but it appears that some of the
qrsh -inherits that are spawned by mpirun can't read the data from
qrsh_client_cache and end up getting to the gdi2_get_configuration line before
the first qrsh can write its data out.  This results in a large-ish parallel job
making tons of gdi requests on qmaster.  My guess is that some of these gdi
requests aren't handled properly and so some qrsh'es end up running execvp
without any command string.

Long story short: NFS server on Solaris 10 w/ SGE: bad.  Let me know what you think.

   ------- Additional comments from brs Mon Apr 28 11:53:54 -0700 2008 -------
For clarification, I meant NFS as a path for tmpdir.

   ------- Additional comments from brs Mon Apr 28 11:55:10 -0700 2008 -------
Also, it appears that some type of file locking would completely fix this issue
if the gdi hypothesis is true.
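
For illustration only, here is a minimal sketch of the kind of locking brs
suggests, assuming POSIX fcntl(2) record locks (which, unlike flock(2), generally
work over NFS); the helper name and error handling are assumptions, not code from
the SGE tree:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read the cached rsh_command under a shared read lock
 * so a reader never sees a half-written qrsh_client_cache file.  The writer
 * side would take an F_WRLCK for the duration of its write. */
static int read_rsh_cache_locked(const char *cache_name, char *buf, size_t len)
{
    FILE *cache = fopen(cache_name, "r");
    struct flock fl;
    int ret = -1;

    if (cache == NULL)
        return -1;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_RDLCK;                       /* shared lock, whole file */
    fl.l_whence = SEEK_SET;

    if (fcntl(fileno(cache), F_SETLKW, &fl) == 0) {   /* wait for the writer */
        if (fgets(buf, (int)len, cache) != NULL) {
            buf[strcspn(buf, "\n")] = '\0';
            ret = 0;
        }
        fl.l_type = F_UNLCK;
        fcntl(fileno(cache), F_SETLK, &fl);
    }
    fclose(cache);
    return ret;
}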

   ------- Additional comments from andreas Tue Apr 29 08:15:24 -0700 2008 -------
I can't evaluate the NFS issue, but I think the code in get_client_name() should
be more persistent in dealing with such cases. As it is now, qrsh -inherit tries
only once to read the cache and then immediately falls back to a GDI request.
The GDI query is certainly needed as a fallback, but doing a couple of retries is
certainly not wrong if qrsh could stat(2) the cache file first:

      if (tmpdir != NULL) {
         FILE *cache;

         /* build filename of cache file */
         sge_dstring_init(&cache_name_dstring, cache_name_buffer, SGE_PATH_MAX);
         cache_name = sge_dstring_sprintf(&cache_name_dstring, "%s/%s",
                                          tmpdir, QRSH_CLIENT_CACHE);
         cache = fopen(cache_name, "r");
         if (cache != NULL) {
            char cached_command[SGE_PATH_MAX];

            if (fgets(cached_command, SGE_PATH_MAX, cache) != NULL) {
               if (strcasecmp(cached_command, "builtin") == 0) {
                  g_new_interactive_job_support = true;
               }

               FCLOSE(cache);
               DPRINTF(("found cached client name: %s\n", cached_command));
               DEXIT;
               return strdup(cached_command);
            }
            FCLOSE(cache);
         }
      }
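
A retry wrapper along the lines andreas sketches above might look roughly like
this (illustrative only; the retry count, sleep interval, and helper signature
are assumptions, not code from the SGE tree):

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define QRSH_CACHE_RETRIES 10   /* assumed values, not from the SGE sources */
#define QRSH_CACHE_SLEEP_S 1

/* Try several times to read the qrsh_client_cache file before giving up and
 * letting the caller fall back to a GDI request. */
static int read_cache_with_retries(const char *cache_name, char *buf, size_t len)
{
    int attempt;

    for (attempt = 0; attempt < QRSH_CACHE_RETRIES; attempt++) {
        struct stat sb;
        FILE *cache;

        /* if the cache file does not even exist, fall back to GDI right away */
        if (stat(cache_name, &sb) != 0)
            return -1;

        cache = fopen(cache_name, "r");
        if (cache != NULL) {
            if (fgets(buf, (int)len, cache) != NULL) {
                fclose(cache);
                return 0;                /* got the cached client name */
            }
            fclose(cache);
        }
        sleep(QRSH_CACHE_SLEEP_S);       /* cache exists but is still empty */
    }
    return -1;                           /* caller falls back to GDI */
}
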
#482 duplicate IZ2464: schedd_job_info can cause immense memory consumption rems0
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2464]

   Issue #: 2464   Platform: PC   Reporter: rems0 (rems0)
   Component: gridengine   OS: Linux
   Subcomponent: scheduling   Version: 6.1u2   CC: olle, uddeborg
   Status: REOPENED   Priority: P2
   Resolution:   Issue type: DEFECT
     Target milestone: ---
   Assigned to: andreas (andreas)
   QA Contact: andreas
   URL:
   * Summary: schedd_job_info can cause immense memory consumption
   Status whiteboard:
   Attachments:
   Date/filename:                                                       Description:                                                                                                     Submitted by:
   Wed Jan 16 04:27:00 -0700 2008: valgrind-sge_schedd-debug-3-1.out.gz sge_schedd output, started with SGE_DEBUG_LEVEL="3 3 3 3 3 3 3 3" (text/plain)                                   rems0
   Wed Jan 16 04:38:00 -0700 2008: valgrind-sge_schedd-debug-3-1.out.gz sge_schedd output, started with SGE_DEBUG_LEVEL="3 3 3 3 3 3 3 3" (replaces previous attached file) (text/plain) rems0
   Thu Mar 6 08:50:00 -0700 2008: showmetheleak.c                       To be compiled with "gcc -pthread showmetheleak.c" under MacOS 10.4.x (text/plain)                               andreas
   Thu Mar 6 08:52:00 -0700 2008: showmetheleak.c                       To be compiled with "gcc -pthread showmetheleak.c" under MacOS 10.4.x (text/plain)                               andreas
   Thu Mar 13 03:33:00 -0700 2008: mallinfo.diff                        mallinfo(3) patch for sge_schedd (text/plain)                                                                    andreas
   Tue Mar 18 02:05:00 -0700 2008: messages.gz                          log file from exploding schedd with mallinfo patch (application/x-gzip)                                          olle
   Thu Apr 3 02:14:00 -0700 2008: mallinfo.diff                         mallinfo(3) source patch (text/plain)                                                                            andreas
   Thu Apr 3 02:19:00 -0700 2008: messages                              Scheduler messages file with much more mallinfo(3) loggings (text/plain)                                         andreas
   Tue Apr 8 06:12:00 -0700 2008: messages.gz                           message file from schedd with more mallinfo (application/x-gzip)                                                 olle
     Issue 2464 blocks:
   Votes for issue 2464:

   Opened: Wed Jan 16 04:23:00 -0700 2008 
------------------------


There is a memory leak in sge_schedd of SGE 6.1u2.
I haven't been able to reproduce it at will so far, but apparently it only happens
when there are several free queues and several jobs are submitted. Then the
scheduler apparently gets into a loop, does not start any of the waiting jobs,
and keeps growing in memory consumption until it is stopped.

See thread starting at
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=22410 .

Our config:

1. SGE 6.1u2 on master and nodes.

2. openSUSE 10.3 64 bit on master and nodes.

3. 80 queues and 79 parallel environments. We have at least one queue and
parallel environment per node; see
http://gridengine.info/articles/2006/02/14/grouping-jobs-to-nodes-via-wildcard-pes

4. mem_free resource added to the hosts and requested in many job submissions.

5. qlicserver administering the number of free licenses.

6. many jobs running parallel on one queue/pe/host, using 4 cores.

7. job submission using wildcards (e.g. qsub -pe c*_4_* 4 -l
mem_free=6GB,lic1=0.25 script.sh)

The problem:
The scheduler takes the whole system memory, going up to 3 or 4 GB
(the system has 4 GB of RAM).


A workaround seems to be disabling the queues, restarting SGE, and enabling the
queues one by one, letting the scheduler start one job at a time.

I will attach example configuration files for the queues/hosts/PEs/scheduler
configuration and so on, as suggested by Andreas Haas.

Richard
What more in

   ------- Additional comments from rems0 Wed Jan 16 04:27:00 -0700 2008 -------
Created an attachment (id=143)
sge_schedd output, started with SGE_DEBUG_LEVEL="3 3 3 3 3 3 3 3"

   ------- Additional comments from rems0 Wed Jan 16 04:38:11 -0700 2008 -------
Created an attachment (id=144)
sge_schedd output, started with SGE_DEBUG_LEVEL="3 3 3 3 3 3 3 3" (replaces previous attached file)

   ------- Additional comments from andreas Wed Jan 16 07:26:00 -0700 2008 -------
Will be treated as a duplicate of the reopened issue #2187.

*** This issue has been marked as a duplicate of 2187 ***

   ------- Additional comments from rems0 Tue Feb 5 11:36:12 -0700 2008 -------
A memory leak seems to still be there. I sent Andreas Haas some more info and
log files from sge_schedd output.

This is still on 6.1u2 with Andreas' patched sge_schedd scheduler (see Issue 2187).

Just reopening here, because I can't reopen issue 2187.

   ------- Additional comments from olle Thu Mar 6 06:42:35 -0700 2008 -------
Adding myself to the CC list since we see the same thing at our site (but with
more memory it doesn't stop at 4 GB).

   ------- Additional comments from rems0 Thu Mar 6 07:11:05 -0700 2008 -------
Olle, after talking with Andreas Haas, I installed libhoard.so
(http://www.hoard.org/) and started sge_schedd with it (changing the sge_schedd
starting line in sgemaster to "LD_PRELOAD=/opt/hoard-3.7.1/lib64/libhoard.so
sge_schedd").

There seem to be some problems with malloc and threads not freeing memory (or
something similar; Andreas could explain this properly) which could be
affecting sge_schedd.

Since restarting sge_schedd with hoard I haven't had any memory problems,
but that was only one day ago.

Memory consumption of sge_schedd now seems to be lower than before.
Perhaps you could also try hoard!
Andreas?

cheers, Richard



   ------- Additional comments from andreas Thu Mar 6 07:30:56 -0700 2008 -------
libhoard is worth a try in any event. It is free, it is easy to apply, and it
finally brings certainty about whom to blame for the virtual
memory growth.

Let me know, olle, if you need additional information on the use of libhoard:

---> andreas@sunsource.net

   ------- Additional comments from olle Thu Mar 6 08:15:19 -0700 2008 -------
You write "virtual memory growth", but do you with that mean a growth in address
space allocated by the process or a false memory growth caused by the memory
allocation routines or the kernel itself? (I know there is a memory
fragmentation problem in the RHEL4 kernel we use.)

Nothing stops us from trying libhoard.so and it can't make it much worse.

I will get back when I know the outcome.

   ------- Additional comments from uddeborg Thu Mar 6 08:35:17 -0700 2008 -------
Can't someone remove the requirement for a comment when adding yourself as a CC?

   ------- Additional comments from andreas Thu Mar 6 08:48:56 -0700 2008 -------
I am referring to cases where the memory allocator library's heuristics get
outsmarted by an application, with the result of endless virtual memory growth of
the application process.

For illustration I attach a small test program that caused endless memory growth
on my MacOS 10.4.<something> laptop when I issued it like this

 # a.out 1073741824

The program spawns two threads, "Producer" and "Consumer". Producer allocates
memory using malloc(), Consumer releases it using free(). For passing the
allocated memory, a pthread lock plus condition variables are used. When you look
at the program you'll see that all memory gets freed, but the virtual memory
consumption of this damn program nevertheless grows into the sky.

When I ran the program on a Linux AMD64 box (GNU C Library 2.3.4) no memory
growth happened with this simple program, but there are other test programs
that can provoke memory growth under Linux even though all memory is actually
free()'d.
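
The attachment itself is not reproduced here, but a minimal sketch of the kind
of program described might look like this (treating the command-line argument as
an iteration count and using a fixed chunk size are assumptions; the real
showmetheleak.c may differ):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* One-slot hand-off between the two threads. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static void *slot = NULL;
static long iterations;

static void *producer(void *arg)
{
    long i;
    (void)arg;
    for (i = 0; i < iterations; i++) {
        void *p = malloc(8192);          /* assumed chunk size */
        pthread_mutex_lock(&lock);
        while (slot != NULL)             /* wait until the consumer took it */
            pthread_cond_wait(&cond, &lock);
        slot = p;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    long i;
    (void)arg;
    for (i = 0; i < iterations; i++) {
        void *p;
        pthread_mutex_lock(&lock);
        while (slot == NULL)             /* wait until the producer delivered */
            pthread_cond_wait(&cond, &lock);
        p = slot;
        slot = NULL;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        free(p);                         /* every allocation is freed */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t prod, cons;

    iterations = (argc > 1) ? atol(argv[1]) : 1000000L;

    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);

    puts("done - all memory was freed, now check the process size");
    return 0;
}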

   ------- Additional comments from andreas Thu Mar 6 08:50:45 -0700 2008 -------
Created an attachment (id=156)
To be compiled with "gcc -pthread showmetheleak.c" under MacOS 10.4.x

   ------- Additional comments from andreas Thu Mar 6 08:52:18 -0700 2008 -------
Created an attachment (id=157)
To be compiled with "gcc -pthread showmetheleak.c" under MacOS 10.4.x

   ------- Additional comments from andreas Thu Mar 6 11:06:37 -0700 2008 -------
Changed the summary to separate this phenomenon from "real" memory leaks, i.e.
virtual memory growth due to bugs in sge_schedd as covered by issue #2187.

Btw. somewhat related is

http://ventnorsblog.blogspot.com/2008/02/beta-3.html

though it focuses on memory allocator performance, which is similarly of
interest for us.




   ------- Additional comments from rayson Thu Mar 6 11:36:33 -0700 2008 -------
Related: issue #1794

It would be great if mallinfo() results could be included in the qping output, so
that the admin can see how much memory the qmaster (and the scheduler thread in
6.2) has allocated, and compare the number with external tools such as top and
vmstat.


   ------- Additional comments from olle Tue Mar 11 03:33:41 -0700 2008 -------
libhoard.so did not fix our leak in schedd. :-(

Again it stayed at a stable size for a few days and then suddenly started to
allocate all available memory and became unresponsive.

As a side effect I also ran qmaster with libhoard.so preloaded. Our qmaster
grows by about 1 GB a day, and hoard didn't help here either.

   ------- Additional comments from andreas Tue Mar 11 05:10:20 -0700 2008 -------
Then we need Rayson's qping mallinfo patch.
I assume it is AMD64 Linux.

   ------- Additional comments from rems0 Tue Mar 11 05:23:44 -0700 2008 -------
Olle, are you using Andreas' modified sge_schedd?
Andreas fixed some bugs in 6.1 which are not in 6.1u3. I'm using an sge_schedd
binary Andreas sent me, not the 6.1u2 or 6.1u3 ones.

See http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=22426 .
And http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=23272 .

This new binary apparently corrected one memory leak problem, but not the
apparent memory allocator error I'm seeing now.

Andreas?

   ------- Additional comments from rems0 Tue Mar 11 10:32:57 -0700 2008 -------
The memory leak happened again today, with libhoard enabled. So, although
libhoard seems to lower memory consumption, a memory leak still seems to be
there.

The situation today here was:
* Many jobs (around 100) were waiting, some on hold, some not
* all queues were disabled

I reenabled all queues at once, and the memory leak appeared (again).

   ------- Additional comments from olle Wed Mar 12 01:11:48 -0700 2008 -------
We are running 6.0u3 courtesy binaries and the qmaster host is RHEL4 x86_64,
just in case someone has built a patched schedd already that is compatible with
6.0u3.

   ------- Additional comments from andreas Thu Mar 13 01:10:51 -0700 2008 -------
@olle: Sorry for asking, but are you really using 6.0u3?
I'm curious since most recent binary set is actually 6.1u3.

   ------- Additional comments from olle Thu Mar 13 01:20:01 -0700 2008 -------
We are running 6.1u3. :-)

   ------- Additional comments from andreas Thu Mar 13 03:29:00 -0700 2008 -------
I have prepared a lx24-amd64 sge_schedd binary that dumps mallinfo(3)

  http://www.gnu.org/software/libtool/manual/libc/Statistics-of-Malloc.html

at the beginning of each scheduling interval into the scheduler messages file, like this:

03/13/2008 11:12:48|schedd|gally2|W|sbrk 6356992 unused 521 mmap chunks 0 mmap
size 0 used 1224624 unused 5132368 top-most 133424
03/13/2008 11:12:49|schedd|gally2|W|sbrk 6356992 unused 2 mmap chunks 0 mmap
size 0 used 5182576 unused 1174416 top-most 133424
03/13/2008 11:12:51|schedd|gally2|W|sbrk 6356992 unused 2 mmap chunks 0 mmap
size 0 used 5359008 unused 997984 top-most 133424
03/13/2008 11:12:52|schedd|gally2|W|sbrk 6696960 unused 8519 mmap chunks 0 mmap
size 0 used 1326608 unused 5370352 top-most 133360
03/13/2008 11:12:54|schedd|gally2|W|sbrk 6356992 unused 521 mmap chunks 0 mmap
size 0 used 1224624 unused 5132368 top-most 133424
03/13/2008 11:12:55|schedd|gally2|W|sbrk 6356992 unused 2 mmap chunks 0 mmap
size 0 used 5161344 unused 1195648 top-most 133424
03/13/2008 11:12:57|schedd|gally2|W|sbrk 7032832 unused 2 mmap chunks 0 mmap
size 0 used 5337472 unused 1695360 top-most 809264
03/13/2008 11:12:58|schedd|gally2|W|sbrk 6356992 unused 524 mmap chunks 0 mmap
size 0 used 1326048 unused 5030944 top-most 133424
03/13/2008 11:13:00|schedd|gally2|W|sbrk 6356992 unused 2 mmap chunks 0 mmap
size 0 used 5124816 unused 1232176 top-most 133424
03/13/2008 11:13:01|schedd|gally2|W|sbrk 6455296 unused 7751 mmap chunks 0 mmap
size 0 used 1224624 unused 5230672 top-most 131824
03/13/2008 11:13:03|schedd|gally2|W|sbrk 6455296 unused 7751 mmap chunks 0 mmap
size 0 used 1224624 unused 5230672 top-most 131824
03/13/2008 11:13:04|schedd|gally2|W|sbrk 6455296 unused 7751 mmap chunks 0 mmap
size 0 used 1224624 unused 5230672 top-most 131824
03/13/2008 11:13:06|schedd|gally2|W|sbrk 6455296 unused 7751 mmap chunks 0 mmap
size 0 used 1224624 unused 5230672 top-most 131824
03/13/2008 11:13:07|schedd|gally2|W|sbrk 6455296 unused 540 mmap chunks 0 mmap
size 0 used 1224624 unused 5230672 top-most 231728

I built this binary from the current V61BRANCH, i.e. running it in an environment
with 6.1u3 binaries should be no problem.

@Richard/Olle: My hope is that mallinfo(3) helps us to finally gain certainty
about this phenomenon. It tells us exactly how much the malloc library got from
the OS ("sbrk"), how much it passed to the application ("used") and how much is
unused ("unused").

If you see a chance to run it, send me mail to: andreas@sunsource.net
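
The actual change is in the attached mallinfo.diff; as a rough standalone
illustration, logging the glibc mallinfo(3) counters in the format shown above
could look like this (the mapping of struct fields to the labels in the log
lines is an assumption):

#include <malloc.h>
#include <stdio.h>

/* Dump the allocator statistics once per scheduling interval. */
static void log_mallinfo(void)
{
    struct mallinfo mi = mallinfo();

    fprintf(stderr,
            "sbrk %d unused %d mmap chunks %d mmap size %d "
            "used %d unused %d top-most %d\n",
            mi.arena,     /* bytes obtained from the OS via sbrk         */
            mi.ordblks,   /* number of free (unused) chunks              */
            mi.hblks,     /* number of mmap'd chunks                     */
            mi.hblkhd,    /* bytes in mmap'd chunks                      */
            mi.uordblks,  /* bytes currently in use by the application   */
            mi.fordblks,  /* bytes free inside the heap                  */
            mi.keepcost); /* releasable top-most chunk (sbrk-trimmable)  */
}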

   ------- Additional comments from andreas Thu Mar 13 03:33:37 -0700 2008 -------
Created an attachment (id=158)
mallinfo(3) patch for sge_schedd

   ------- Additional comments from olle Tue Mar 18 02:05:24 -0700 2008 -------
Created an attachment (id=159)
log file from exploding schedd with mallinfo patch

   ------- Additional comments from olle Tue Mar 18 02:51:54 -0700 2008 -------
Unfortunately there is no mallinfo written out to the message file after the
point where schedd starts to grow.

   ------- Additional comments from andreas Thu Apr 3 02:12:59 -0700 2008 -------
I have prepared a new 6.1-branch patched Linux sge_schedd binary with much more
fine-grained mallinfo(3) logging. Based on that information it should be
possible to figure out the code section that causes the memory growth. Please
contact me under

  ---> andreas@sunsource.net

if you want to have this binary.

For illustration I'll attach the source diff and the scheduler messages file I
got from it.

   ------- Additional comments from andreas Thu Apr 3 02:14:46 -0700 2008 -------
Created an attachment (id=161)
mallinfo(3) source patch

   ------- Additional comments from andreas Thu Apr 3 02:19:44 -0700 2008 -------
Created an attachment (id=162)
Scheduler messages file with much more mallinfo(3) loggings

   ------- Additional comments from olle Tue Apr 8 06:12:28 -0700 2008 -------
Created an attachment (id=163)
message file from schedd with more mallinfo

   ------- Additional comments from andreas Tue Apr 8 07:13:05 -0700 2008 -------
We believe we have found the culprit:

static void schedd_mes_find_others(lList *job_list,
                                   int ignore_category)
{
   if (tmp_sme && job_list) {
      lListElem *message_elem = NULL;  /* MES_Type */
      lRef category = NULL;            /* Category pointer (void*) */
      lList *jid_cat_list = NULL;      /* ULNG */
      int create_new_jid_cat_list = 0;
      lList *message_list = lGetList(tmp_sme, SME_message_list);

      /*
       * Here we have a list of message elements where each
       * MES_job_number_list contains only one id.
       * We have to find the other jobs (jids) which have the same category.
       */
      for_each(message_elem, message_list) {
         lList *jid_list = lGetList(message_elem, MES_job_number_list);
         u_long32 jid;
         lRef jid_category;
         jid = lGetUlong(lFirst(jid_list), ULNG);
         jid_category = schedd_mes_get_category(jid, job_list);

         /*
          * Initialize jid_cat_list if not initialized
          * or if category differs from the last run
          */
         if (category != jid_category || ignore_category) {
            jid_cat_list = schedd_mes_get_same_category_jids(jid_category,
                                                             job_list,
                                                             ignore_category);
            category = jid_category;
            create_new_jid_cat_list = 0;
         }
         else {
            create_new_jid_cat_list = 1;
         }

         /*
          * Replace the MES_job_number_list which contains only one jid
          * with a list of jids (all have the same category)
          */
         if (create_new_jid_cat_list) {
            lSetList(message_elem, MES_job_number_list,
                     lCopyList("", jid_cat_list));
         } else {
            lSetList(message_elem, MES_job_number_list, jid_cat_list);
         }
      }
   }
}

It appears that *FORMALLY* it is not a memory leak, even though immense memory
growth occurs.

Anyone suffering from this symptom, please try setting schedd_job_info in
sched_conf(5) to 'false'.

   ------- Additional comments from olle Wed Apr 9 00:04:22 -0700 2008 -------
Andreas, can this really explain the pattern we see? If so, what triggers the
sudden memory explosion that occurs after hours of stable operation?

The normal virtual size we see is about 200 MB. When it starts to grow it becomes
unresponsive and increases above 18 GB in no time at all. I guess it would grow
even more on a host with more memory.

In my last attachment we had schedd restarted by a cronjob when it exceeded
500MB, but I can increase the limit if it helps debugging.

   ------- Additional comments from andreas Thu Apr 10 00:14:17 -0700 2008 -------
It can, Olle. We ran into this memory growth by accident in a simulated cluster
setup with 3000 hosts x 3 cluster queues x 500 jobs. The jobs were not
dispatchable due to an unsatisfiable -l arch request.

The outcome was that the scheduler generated 9000 individual schedd_job_info
messages for the first job, i.e. one for each queue instance. Then for each
subsequent job schedd_mes_find_others() was called to attach that job's id to
each of the 9000 messages in order to actually reuse the messages. This activity
took ages while schedd memory consumption skyrocketed through the roof. Roland
and I tracked it down using dtrace on a S10/AMD64 machine.
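
(Spelling out the arithmetic: 3000 hosts x 3 cluster queues gives 9000 queue
instances, hence the 9000 messages for the first job; attaching the ids of the
remaining ~500 jobs to each of those messages leaves the message list holding on
the order of 9000 x 500 = 4.5 million job-id entries, which goes a long way
toward explaining the observed growth.)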

Btw. the workaround for qstat -j <jobid> is qalter -w v <jobid>. The latter is
far better behaved, since diagnosis messages are created on demand and
only for the job in question.

In our opinion repairing schedd_job_info is not doable, as it will always consume
immense resources in large clusters in pathological cases. That means we plan to
abolish it in a post-6.2 version and, as a replacement, possibly flesh out the
"-w v|w|e|n" option, e.g. with a new option argument 'V' whose purpose would be
to do the dry scheduling run in the cluster with all resource debits in place.
Note that 'v' also does a dry scheduling run, but it assumes the cluster is
empty, so as to answer the question of whether the job could run at all.
--> Feedback appreciated!

   ------- Additional comments from uddeborg Thu Apr 10 00:28:08 -0700 2008 -------
I've now disabled schedd_job_info.  (Olle and I are working with the same
installation.)  It will be interesting to see if the problems go away.

   ------- Additional comments from rems0 Thu Apr 10 02:24:04 -0700 2008 -------
Andreas, great that you have found the culprit!

Is there a way to trigger this "schedd_job_info" thing, so I can get the "memory
leak" before turning "schedd_job_info" to false?
Submitting lots of jobs which cannot be dispatched?

thanks, Richard

   ------- Additional comments from andreas Thu Apr 10 06:23:35 -0700 2008 -------
Switched schedd_job_info to false in the install-time scheduler profile "normal"
for 6.2 in MAINTRUNK.

@Richard:
Submitting lots of jobs which cannot be dispatched will trigger the case only
if you do it in a way that forces SGE to generate a qstat -j message for each
queue instance. Otherwise the code behaves relatively well. In our case the jobs
were requesting a resource string value -l name=<value> which didn't match
anything. In the case of wildcard PE requests with one PE for each host, this
gets you

<number of PEs> x <number of hosts> x <number of queues>

single messages if these jobs can't be dispatched, e.g. due to a wrong -l
arch=<val> request.


   ------- Additional comments from rems0 Fri Apr 11 08:23:50 -0700 2008 -------
Thanks Andreas.
Doing what you proposed (+ having enough free queues/slots) I could trigger the
"memory leak".
And after disabling schedd_job_info everything is running fine.
So I can confirm that disabling schedd_job_info works for ME. 8)
This is great news!

Thank you Andreas and the SUN SGE Team for all your effort and good work !!!

Richard

   ------- Additional comments from andreas Fri Apr 11 08:35:35 -0700 2008 -------
Thank *YOU* Richard for your patience!

   ------- Additional comments from uddeborg Wed Apr 23 04:53:31 -0700 2008 -------
After having run without schedd_job_info for almost two weeks without schedd
blowing up, I feel pretty confident this change did indeed help.

The lack of scheduler information in qstat output is a problem, though.  Now it
is much harder to debug why a job doesn't get dispatched.  I haven't found
"qalter -w v <jobid>" to be very useful in most cases.  The common question is
something like: "Why isn't it running on X@Y, for example, which is free?"  That
isn't answered by "qalter -w v".

   ------- Additional comments from andreas Wed Apr 23 07:46:13 -0700 2008 -------
@Uddeborg: The problem you raise should be covered by the -w V which I proposed
in issue #2548, whose aim is to abolish schedd_job_info entirely.

   ------- Additional comments from uddeborg Wed Apr 23 23:40:45 -0700 2008 -------
Then it is fine.  I wasn't quite sure what information your "-w V" flag would
give, so I thought I had better describe the typical use case, just to be clear.
#486 fixed IZ2482: sge_qquota not processed properly Dave Love <d.love@…> clarinet
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2482]

        Issue #:      2482             Platform:     PC       Reporter: clarinet (clarinet)
       Component:     gridengine          OS:        Linux
     Subcomponent:    clients          Version:      6.1u3       CC:    ernst
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    roland (roland)
      QA Contact:     roland
          URL:
       * Summary:     sge_qquota not processed properly
   Status whiteboard:
      Attachments:

     Issue 2482 blocks:
   Votes for issue 2482:


   Opened: Mon Feb 4 15:28:00 -0700 2008 
------------------------


Parameters in ~/.sge_qquota or the system-wide sge_qquota file are not processed
properly; the qquota command seems to ignore the first word in the file.

If ~/.sge_qquota contains "-u *", running qquota generates the following
error: 'error: ERROR! invalid option argument "*"'.

If ~/.sge_qquota is empty, running qquota generates the following
error: 'error: ERROR! invalid option argument "pwf"'.

   ------- Additional comments from clarinet Tue Feb 5 03:13:29 -0700 2008 -------
Isn't there a problem on line 470 of the qquota.c file? Using "++argv" seems fine
for real command-line arguments but may fail when processing arguments read
from a file (where there is no argv[0]).

Also, I am not sure if qquota processes duplicate arguments (those that appear
in a file as well as on a command line) correctly.

   ------- Additional comments from andreas Tue Feb 5 03:52:52 -0700 2008 -------
The

   rp = ++argv;

looks strange in any event. With

   rp = argv++;

the code in sge_parse_cmdline_qquota() makes more sense.

Thanks for reporting!
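
For anyone not seeing the difference at a glance, here is a tiny standalone
illustration (not the actual qquota.c code; the array mimics options read from
~/.sge_qquota, which have no argv[0] entry):

#include <stdio.h>

int main(void)
{
    /* options as they would be read from ~/.sge_qquota -- no argv[0] */
    char *file_args[] = { "-u", "*", NULL };
    char **argv = file_args;
    char **rp;

    rp = ++argv;   /* pre-increment: skips "-u" and lands on "*" */
    printf("with ++argv the parser sees: %s\n", *rp);

    argv = file_args;
    rp = argv++;   /* post-increment: yields "-u", then advances */
    printf("with argv++ the parser sees: %s\n", *rp);

    return 0;
}

The ++argv variant hands the parser "*" as if it were an option, which lines up
with the 'invalid option argument "*"' error reported above.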