Opened 13 years ago

Last modified 9 years ago

#343 new defect

IZ2033: Possible memory leak with libdrmaa

Reported by: andreas Owned by:
Priority: normal Milestone:
Component: sge Version: 6.0u7
Severity: Keywords: PC Linux drmaa
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2033]

        Issue #:      2033             Platform:     PC       Reporter: andreas (andreas)
       Component:     gridengine          OS:        Linux
     Subcomponent:    drmaa            Version:      6.0u7       CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    templedf (templedf)
      QA Contact:     templedf
          URL:
       * Summary:     Possible memory leak with libdrmaa
   Status whiteboard:
      Attachments:
                      Date/filename:                            Description:                    Submitted by:
                      Wed Apr 19 05:28:00 -0700 2006: example.c Modified example.c (text/plain) andreas

     Issue 2033 blocks:
   Votes for issue 2033:


   Opened: Wed Apr 19 05:24:00 -0700 2006 
------------------------


Hey Guys,
   This is sort of a long email, but here goes.  I have been experiencing
memory leaks when I run cluster jobs multiple times using the grid.
First the good news: I found some simple things you can do to the SWIG
wrapper that should resolve part of the problem.
And the bad news: I'm pretty sure that the leaks I'm seeing are in the
SGE DRMAA library itself, so there's not much you can do about it.

  For the wrappers, I was looking at the mallocs you do for the fixed-length
string buffer arguments (like the error-code strings and job-name strings),
and it looked like you didn't need a malloc at all, since the buffer sizes
are fixed and the buffer is only needed within a single function call.  I
turned the mallocs into stack-allocated buffers with no ill effects, which
means there is no need to free the memory (it was not being freed before).
Since you use PyString_FromString, Python makes its own internal copy of the
memory and the local version is no longer needed.
   For example:
         arg4 = (char*) malloc( sizeof(char) * DRMAA_ERROR_STRING_BUFFER );
   turns into:
         char arg4[sizeof(char) * DRMAA_ERROR_STRING_BUFFER];

  If you still want to use malloc, my suggestion would be to make sure there
is a matching free for every fixed-length buffer at the end of each function,
though I think the stack-allocated version is a little nicer.
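  For illustration, here is a minimal sketch of what such a wrapper function
could look like with a stack buffer; the function name wrap_drmaa_init and
the return convention are made up for this sketch and are not the actual
SWIG-generated code:

    /* Hypothetical hand-written wrapper illustrating the stack-buffer idea. */
    #include <Python.h>
    #include "drmaa.h"

    static PyObject *wrap_drmaa_init(PyObject *self, PyObject *args)
    {
        const char *contact = NULL;
        /* was: arg4 = (char*) malloc(sizeof(char) * DRMAA_ERROR_STRING_BUFFER); */
        char error[DRMAA_ERROR_STRING_BUFFER];
        int rc;

        if (!PyArg_ParseTuple(args, "z", &contact))
            return NULL;

        rc = drmaa_init(contact, error, sizeof(error));

        /* PyString_FromString() makes Python's own copy of the buffer, so the
         * stack storage simply goes out of scope here -- nothing to free(). */
        return PyString_FromString(rc == DRMAA_ERRNO_SUCCESS ? "" : error);
    }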

  Now, that fixed a small memory leak, but unfortunately I kept testing the
simple invocation of jobs, and no matter what I did the memory usage kept
going up.  I tried deleting/recreating templates to no effect (I'm pretty
sure the templates are not the source of the leak).  I also made doubly sure
to retrieve and delete the job info object returned by the wait call for
every job I started, again to no effect.  I tried just running a synchronize
with the dispose argument set so that DRMAA reaps everything internally, and
that did not fix the problem either.  Long story short: after some more
testing, I don't think the leak is the wrapper's fault; I think it's
something deeper in the C library.
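  (For reference, the per-job cleanup sequence referred to above looks
roughly like this, assuming the standard DRMAA 1.0 C calls, with jobid and
jt coming from earlier drmaa_run_job()/drmaa_allocate_job_template() calls:)

    char jobid_out[DRMAA_JOBNAME_BUFFER];
    char error[DRMAA_ERROR_STRING_BUFFER];
    drmaa_attr_values_t *rusage = NULL;
    int stat = 0;

    /* Collect the job's exit information, then release everything DRMAA
     * handed back: the usage list from the wait call and the job template. */
    drmaa_wait(jobid, jobid_out, sizeof(jobid_out), &stat,
               DRMAA_TIMEOUT_WAIT_FOREVER, &rusage, error, sizeof(error));
    drmaa_release_attr_values(rusage);
    drmaa_delete_job_template(jt, error, sizeof(error));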

    Below is a hacked-up version of the example DRMAA C code that you can
get from the grid engine site, and that I'm pretty sure you based your own
example code on.  I've modified it to skip bulk jobs (just normal ones for
now) and to skip any fancy processing of job information.  Instead I run
normal jobs in an infinite loop with a synchronize call that is set to reap
the jobs.  I'm deallocating/reallocating the template on each loop pass,
which is not necessary but also has no effect on the leak.
  The main memory handling I do is to make sure I free the all_jobids string
list that the code builds when it runs a batch of jobs, so that should not
be part of the leak.  To try it out, compile the code and submit any job you
like (I recommend 'sleep 1' to make it fast).  It will run forever, and as
far as I can see the memory usage jumps by ~64 KB for every pass through the
execution loop.
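  (The attached example.c is the authoritative reproducer; the following is
only a rough sketch of the loop structure described above -- the command,
the per-pass job count, and the omitted error handling are simplifications,
not the attachment itself:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "drmaa.h"

    #define JOBS_PER_PASS 4                 /* made-up per-pass job count */

    int main(void)
    {
        char error[DRMAA_ERROR_STRING_BUFFER];
        char jobid[DRMAA_JOBNAME_BUFFER];
        const char *all_jobids[JOBS_PER_PASS + 1];
        const char *args[] = { "1", NULL };
        drmaa_job_template_t *jt = NULL;
        int i;

        if (drmaa_init(NULL, error, sizeof(error)) != DRMAA_ERRNO_SUCCESS) {
            fprintf(stderr, "drmaa_init() failed: %s\n", error);
            return 1;
        }

        for (;;) {  /* runs forever; watch the resident size grow each pass */
            /* Recreate the template on every pass (not required, but it
             * makes no difference to the leak). */
            drmaa_allocate_job_template(&jt, error, sizeof(error));
            drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/sleep",
                                error, sizeof(error));
            drmaa_set_vector_attribute(jt, DRMAA_V_ARGV, args,
                                       error, sizeof(error));

            for (i = 0; i < JOBS_PER_PASS; i++) {
                drmaa_run_job(jobid, sizeof(jobid), jt, error, sizeof(error));
                all_jobids[i] = strdup(jobid);
            }
            all_jobids[i] = NULL;

            /* dispose = 1: let DRMAA reap the finished jobs internally. */
            drmaa_synchronize(all_jobids, DRMAA_TIMEOUT_WAIT_FOREVER, 1,
                              error, sizeof(error));

            /* Free the job-id strings duplicated above. */
            for (i = 0; i < JOBS_PER_PASS; i++)
                free((void *)all_jobids[i]);

            drmaa_delete_job_template(jt, error, sizeof(error));
        }

        /* not reached */
        drmaa_exit(error, sizeof(error));
        return 0;
    }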

  Now, I'm still new to DRMAA, and I'm hoping that I'm missing something
really obvious about freeing memory here.  I'm a little worried that I was
able to make the pre-built example leak like this so quickly, but I'm hoping
you guys know how to stop this problem.  If you want, I'll forward the part
of this email related to the C library to the DRMAA list to see who else
knows about this type of thing.

  Thanks again, code follows below.
      -- Chuck

   ------- Additional comments from andreas Wed Apr 19 05:28:12 -0700 2006 -------
Created an attachment (id=77)
Modified example.c

   ------- Additional comments from andreas Wed Apr 19 07:44:59 -0700 2006 -------
Added Grid Engine version and OS/HW information.

Chuck: I should have included more details: we are running SGE 6.0u7, which I
think is the newest mainline version available from the grid engine web site.
Some more info that might be helpful: we are using the x86-64 version on
Intel Xeons with the 2.6.5 Linux kernel from SUSE Linux Enterprise Server 9.

I've found that the sample program I've included does leak, but more slowly:
using the 'synchronize' call with process reaping set to '1' seems to keep
the rate of the leak down.  Even so, when I set that and let it run
overnight, the memory usage had roughly tripled.  The leaks are much faster
with the DRMAA Python wrapper, but I'm trying to separate a possible
internal leak from one in the wrapper itself.

  Once again, I'm really hoping you'll see that I just didn't free something
properly in the test program, and that the calling code or the wrapper can be
fixed by using the appropriate calls.  I'm probably pushing things a bit,
since I have a long-running process that will potentially dispatch thousands
of jobs over its lifetime, although the number of jobs being run and
monitored at any one time should only be a few dozen or so.

Attachments (1)

77 (3.8 KB) - added by dlove 9 years ago.


Change History (1)

Changed 9 years ago by dlove
