Opened 15 years ago
Last modified 10 years ago
#343 new defect
IZ2033: Possible memory leak with libdrmaa
| Reported by: | andreas | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | sge | Version: | 6.0u7 |
| Severity: | | Keywords: | PC Linux drmaa |
| Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2033]
Issue #: 2033
Platform: PC
Reporter: andreas (andreas)
Component: gridengine
OS: Linux
Subcomponent: drmaa
Version: 6.0u7
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: templedf (templedf)
QA Contact: templedf
URL:
* Summary: Possible memory leak with libdrmaa
Status whiteboard:
Attachments:
Date/filename: Wed Apr 19 05:28:00 -0700 2006: example.c
Description: Modified example.c (text/plain)
Submitted by: andreas
Issue 2033 blocks:
Votes for issue 2033:

Opened: Wed Apr 19 05:24:00 -0700 2006
------------------------

Hey Guys,

This is sort of a long email, but here goes. I have been experiencing memory leaks when I run cluster jobs multiple times using the grid.

First the good news: I found some simple things you can do to the SWIG wrapper that should resolve part of the problem.

And the bad news: I'm pretty sure that the leaks I'm seeing are in the DRMAA SGE library itself, so there's not much you can do about it.

For the wrappers, I was looking at the mallocs you do for the fixed-length string buffer arguments (like the error code strings and jobname strings), and it looked like you didn't need to do a malloc, since the buffer sizes were fixed and the scope needed for the buffer was only a single function call. I redid the mallocs as stack-allocated buffers with no ill effects, and this means there is no need to free the memory (which was not being freed before). Since you are using the PyString_FromString call, Python makes its own internal copy of the memory and the local version is no longer needed. For example:

    arg4 = (char*) malloc( sizeof(char) * DRMAA_ERROR_STRING_BUFFER );

turns into:

    char arg4 [sizeof(char) * DRMAA_ERROR_STRING_BUFFER ];

If you still want to do mallocs, my suggestion would be to make sure there is a free for all of the fixed-length memory buffers at the end of each function. I think the stack-allocated version is a little nicer.
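The two options described above (a stack buffer, or a malloc paired with a free) can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the wrapper's actual code: `fake_drmaa_call` and `copy_for_caller` are hypothetical stand-ins for a DRMAA call that fills an error buffer and for `PyString_FromString` making its own copy, and `DRMAA_ERROR_STRING_BUFFER` is defined locally so the sketch compiles without `drmaa.h`.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in so the sketch builds without drmaa.h (assumption for illustration). */
#ifndef DRMAA_ERROR_STRING_BUFFER
#define DRMAA_ERROR_STRING_BUFFER 1024
#endif

/* Hypothetical library call: writes a diagnosis into a caller-supplied buffer. */
int fake_drmaa_call(char *error_diagnosis, size_t error_diag_len)
{
    snprintf(error_diagnosis, error_diag_len, "no error");
    return 0;
}

/* Hypothetical stand-in for PyString_FromString: returns its own heap copy. */
char *copy_for_caller(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Option 1: stack buffer. Nothing local needs to be freed. */
char *wrapper_stack(void)
{
    char diag[DRMAA_ERROR_STRING_BUFFER];
    fake_drmaa_call(diag, sizeof diag);
    return copy_for_caller(diag);   /* caller owns and frees the copy */
}

/* Option 2: heap buffer, released before the function returns. */
char *wrapper_heap(void)
{
    char *diag = malloc(DRMAA_ERROR_STRING_BUFFER);
    char *result;
    if (diag == NULL)
        return NULL;
    fake_drmaa_call(diag, DRMAA_ERROR_STRING_BUFFER);
    result = copy_for_caller(diag);
    free(diag);                     /* the free that the leaking version lacked */
    return result;
}
```

Both variants hand back an independent copy, so the temporary buffer never outlives the call. The only difference is whether the compiler or an explicit `free` releases it.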
Now, that solved a small memory leak, but unfortunately I kept testing the simple invocation of jobs and no matter what I did the memory usage kept going up. I tried deleting/recreating templates to no effect (I'm pretty sure the templates are not the source of the leak). I also made doubly sure to get back the job Info object from the wait call for each job I started, so that I was getting and deleting the object, to no effect. I tried just running a synchronize with the dispose argument asserted to have DRMAA reap the whole thing internally, and that did not fix the problem either.

Long story short... after some more testing I don't think the leak is the wrapper's fault; I think it's something deeper in the C library.

Below is a hacked-up version of the example DRMAA C code that you can get from the grid engine site and that I'm pretty sure you based your own example code on. I've modified the code to not do any bulk jobs (just normal ones for right now), and to not do any fancy processing of job information. Instead I run normal jobs in an infinite loop with a synchronize call that is set to reap the jobs. I'm deallocating/reallocating the template in each loop pass, which is not necessary but also has no effect on the memory leak. The main memory handling I'm doing is to make sure that I free up the all_jobids string list that the code builds when it runs a bunch of jobs, so this should not be part of the memory leak.

In order to get this going, get the code compiled and submit any job you like (I recommend 'sleep 1' to make it fast). It will run forever, and as far as I can see the memory usage jumps by ~64 KB for every pass through the execution loop.

Now... I'm still new to DRMAA and I'm hoping that I'm missing something really obvious for freeing memory here. I'm a little bit worried that I was able to quickly modify the pre-built example to leak like this, but I'm hoping you guys know how to stop this problem.
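One way to put a number on the per-pass growth described above is to read the process's resident set size from `/proc/self/status` before and after each loop pass. The sketch below is not part of the attached example.c; the function names `parse_vmrss_kb` and `current_rss_kb` are made up for illustration, and it assumes a Linux system that exposes a `VmRSS:` line in that file.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/<pid>/status. Returns the VmRSS value in kB,
   or -1 if the line is not a well-formed VmRSS line. */
long parse_vmrss_kb(const char *line)
{
    long kb;
    if (strncmp(line, "VmRSS:", 6) != 0)
        return -1;
    if (sscanf(line + 6, "%ld", &kb) != 1)
        return -1;
    return kb;
}

/* Current resident set size in kB, or -1 if it cannot be determined
   (for example on a system without /proc). */
long current_rss_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        kb = parse_vmrss_kb(line);
        if (kb >= 0)
            break;                  /* found the VmRSS line */
    }
    fclose(f);
    return kb;
}
```

Calling `current_rss_kb()` at the top and bottom of the submit/synchronize loop and printing the difference would show whether the growth is steady per pass. Resident size is a coarse measure, so a tool such as valgrind would still be needed to locate the allocation itself.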
If you want, I'll forward the part of this email related to the C library to the DRMAA list to see who else knows about this type of thing.

Thanks again, code follows below.

-- Chuck

------- Additional comments from andreas Wed Apr 19 05:28:12 -0700 2006 -------

Created an attachment (id=77)
Modified example.c

------- Additional comments from andreas Wed Apr 19 07:44:59 -0700 2006 -------

Added Grid Engine version and OS/HW information.

Chuck:

I should have included more details: we are running SGE 6.0_u7, which I think is the newest mainline version available from the grid engine web site. Some more info that might be helpful is that we are using the x86-64 version on Intel Xeons with the 2.6.5 Linux kernel from SUSE Enterprise version 9.

I've found that the sample program I've included will leak, but more slowly, since using the 'synchronize' call with process reaping set to '1' seems to keep the speed of the leak down. However, when I set it and let it run overnight, the memory usage had gone up by about 3 times. The leaks are much faster with the DRMAA Python wrapper, but I'm trying to isolate a possible internal leak from one in the wrapper itself.

Once again... I'm really hoping you'll see that I just didn't free something properly in the test program and that the calling code or the wrapper can be fixed by using the appropriate calls. I'm probably stretching things, since I have a long-running process that will potentially dispatch thousands of jobs over its lifetime, although the number of jobs being run and monitored at any one time should only be a few dozen or so.
Attachments (1)