Custom Query (431 matches)
Results (154 - 156 of 431)
Ticket | Resolution | Summary | Owner | Reporter |
---|---|---|---|---|
#532 | fixed | IZ2628: Tasks held with array dependency may get deleted prematurely | johna | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2628]
Issue #: 2628
Platform: All
Reporter: johna (johna)
Component: gridengine
OS: All
Subcomponent: qmaster
Version: 6.2beta
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
* Summary: Tasks held with array dependency may get deleted prematurely
Status whiteboard:
Attachments:
Issue 2628 blocks:
Votes for issue 2628:
Opened: Tue Jun 24 21:25:00 -0700 2008
------------------------
It seems that tasks in the JB_ja_a_h_ids hold range can get ignored, leading to the parent job being deleted before they are scheduled to run. This bug does not appear in the ARI branch and seems to occur only when the dependent job held with the -hold_jid_ad option has higher priority. This probably means that the QA testing procedure does not detect this issue, since it probably does not submit the jobs with different priorities.
This can be reproduced as follows (aimk options are '-spool-classic -parallel 3 -no-dump -debug -no-secure -no-jni -no-java'):

[root@xen-grid1 johna]# qsub -t 1-10 -p -100 -b y /bin/sleep 20
Your job-array 1.1-10:1 ("sleep") has been submitted
[root@xen-grid1 johna]# qsub -t 1-10 -p 100 -hold_jid_ad 1 -b y /bin/sleep 20
Your job-array 2.1-10:1 ("sleep") has been submitted
[root@xen-grid1 johna]# qstat
job-ID  prior    name   user  state  submit/start at      queue                       slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1       0.50617  sleep  root  r      06/25/2008 12:52:18  all.q@xen-grid1.rsp.com.au  1      1
1       0.00000  sleep  root  qw     06/25/2008 12:52:13                              1      2-10:1
2       0.00000  sleep  root  hqw    06/25/2008 12:52:20                              1      1-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior    name   user  state  submit/start at      queue                       slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1       0.50617  sleep  root  qw     06/25/2008 12:52:13                              1      2-10:1
2       0.00000  sleep  root  qw     06/25/2008 12:52:20                              1      1
2       0.00000  sleep  root  hqw    06/25/2008 12:52:20                              1      2-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior    name   user  state  submit/start at      queue                       slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2       0.60383  sleep  root  r      06/25/2008 12:52:48  all.q@xen-grid1.rsp.com.au  1      1
1       0.50617  sleep  root  qw     06/25/2008 12:52:13                              1      2-10:1
2       0.00000  sleep  root  hqw    06/25/2008 12:52:20                              1      2-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior    name   user  state  submit/start at      queue                       slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2       0.00000  sleep  root  hqw    06/25/2008 12:52:20                              1      2-10:1
1       0.50617  sleep  root  qw     06/25/2008 12:52:13                              1      2-10:1
[root@xen-grid1 johna]# qstat
job-ID  prior    name   user  state  submit/start at      queue                       slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1       0.50617  sleep  root  r      06/25/2008 12:53:18  all.q@xen-grid1.rsp.com.au  1      2
1       0.50617  sleep  root  qw     06/25/2008 12:52:13                              1      3-10:1

End result, job 2 is "gone" despite it having some tasks left that are held with AD. A preliminary investigation on MT has found some missing code lines in sge_job_qmaster.c, but I have not as yet been able to isolate this defect. |
|||
#535 | fixed | IZ2633: memory leak after sge_peopen() in AFS/DCE/KERBEROS code | andreas | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2633]
Issue #: 2633
Platform: All
Reporter: andreas (andreas)
Component: gridengine
OS: All
Subcomponent: qmaster
Version: 6.1u4
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
* Summary: memory leak after sge_peopen() in AFS/DCE/KERBEROS code
Status whiteboard:
Attachments:
Issue 2633 blocks:
Votes for issue 2633:
Opened: Wed Jun 25 08:46:00 -0700 2008
------------------------
The AFS/DCE/KERBEROS code in libs/gdi/sge_security.c leaks memory. Each time sge_peopen() is called to launch one of the script plug-in procedures, sge_bin2string() allocates memory that is not free()'d later. I'm filing this bug against qmaster because some of the procedures are launched by qmaster. |
|||
#536 | fixed | IZ2635: Fails to build libs/uti/sge_getloadavg.c with gcc 4.3.1 | paulmillar | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2635]
Issue #: 2635
Platform: All
Reporter: paulmillar (paulmillar)
Component: gridengine
OS: All
Subcomponent: build
Version: current
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: andreas
URL:
* Summary: Fails to build libs/uti/sge_getloadavg.c with gcc 4.3.1
Status whiteboard:
Attachments:
Date/filename:                             Description:                                            Submitted by:
Wed Jul 2 01:53:00 -0700 2008: 2635.diff   A source diff that should fix the problem (text/plain)  andreas
Wed Jul 2 01:59:00 -0700 2008: 2635.diff   New try (former diff was buggy) (text/plain)            andreas
Issue 2635 blocks:
Votes for issue 2635:
Opened: Tue Jul 1 16:36:00 -0700 2008
------------------------
I tried to build gridengine from cvs on Debian sid. The current HEAD failed whilst building the libraries, in particular with libs/uti/sge_getloadavg.c. The problem was with line 1306 and I've copied the output below:

_________C_O_R_E__S_Y_S_T_E_M_____________
gcc -O3 -Wall -Werror -Wstrict-prototypes -D__GRIDENGINE_FD_SETSIZE=8192 -DLINUX -DLINUX86 -DLINUX86_26 -D_GNU_SOURCE -DGETHOSTBYNAME_R6 -DGETHOSTBYADDR_R8 -DLOAD_OPENSSL -I/vol2/SW/db-4.4.20/lx26-x86/include/ -DSGE_ARCH_STRING=lx26-x86 -DTARGET_32BIT -DSPOOLING_dynamic -DSECURE -I/vol2/tools/SW/openssl-0.9.8g-origin/lx26-x86/include -Wno-strict-aliasing -D_FILE_OFFSET_BITS=64 -DCOMPILE_DC -D__SGE_COMPILE_WITH_GETTEXT__ -D__SGE_NO_USERMAPPING__ -I../common -I../libs -I../libs/uti -I../libs/juti -I../libs/gdi -I../libs/japi -I../libs/sgeobj -I../libs/cull -I../libs/rmon -I../libs/comm -I../libs/comm/lists -I../libs/sched -I../libs/evc -I../libs/evm -I../libs/mir -I../libs/lck -I../daemons/common -I../daemons/qmaster -I../daemons/execd -I../daemons/schedd -I../clients/common -I. -I/usr/lib/jvm/java-6-sun/include -I/usr/lib/jvm/java-6-sun/include/linux -fPIC -c ../libs/uti/sge_getloadavg.c
cc1: warnings being treated as errors
../libs/uti/sge_getloadavg.c: In function 'get_cpu_load':
../libs/uti/sge_getloadavg.c:1306: error: array subscript is above array bounds
../libs/uti/sge_getloadavg.c:1306: error: array subscript is above array bounds
make: *** [sge_getloadavg.o] Error 1

I've not traced the logic of the function, but the code doesn't pass the "sniff test". I've copied a patch that fixes this issue, allowing the compilation to progress, although it failed later on.

Index: libs/uti/sge_getloadavg.c
===================================================================
RCS file: /cvs/gridengine/source/libs/uti/sge_getloadavg.c,v
retrieving revision 1.38
diff -u -r1.38 sge_getloadavg.c
--- libs/uti/sge_getloadavg.c 15 Apr 2008 12:40:54 -0000 1.38
+++ libs/uti/sge_getloadavg.c 1 Jul 2008 23:19:22 -0000
@@ -1302,10 +1302,11 @@
    /* calculate percentages based on overall change, rounding up */
    half_total = total_change / 2l;
    for (i = 0; i < cnt; i++) {
-      *out = ((double)((*diffs++ * 1000 + half_total) / total_change))/10;
+      *out = ((double)((*diffs * 1000 + half_total) / total_change))/10;
       DPRINTF(("diffs: %lu half_total: %lu total_change: %lu -> %f", *diffs, half_total, total_change, *out));
       out++;
+      diffs++;
    }
    DEXIT;

Naturally, someone who understands the precise semantics of this function should review the patch.

Cheers, Paul.

PS. Can one attach patches to a bug with this issue-tracker? I'd guess that posting patches in-line is fragile.
------- Additional comments from andreas Wed Jul 2 01:53:01 -0700 2008 -------
Created an attachment (id=176) A source diff that should fix the problem

------- Additional comments from andreas Wed Jul 2 01:59:21 -0700 2008 -------
Created an attachment (id=177) New try (former diff was buggy)

------- Additional comments from andreas Wed Jul 2 02:02:08 -0700 2008 -------
Paul, could you try the second diff that I attached to this issue and let me know the result? Note, the first one was buggy, since DPRINTF expressions are evaluated in monitoring mode only. For that reason increments must be done outside the DPRINTF statements.

Regards, Andreas

------- Additional comments from paulmillar Wed Jul 2 14:48:01 -0700 2008 -------
Hi Andreas,

Thanks for looking into this. Both patches look broken to me. The first patch *only* increments the two ptrs inside the DPRINTF, which (as you say) is broken if monitoring is switched off; the second patch increments both inside and outside the DPRINTF, which is broken if monitoring is switched on! Could you have another look at my patch? I still believe this is the correct fix.

Cheers, Paul.

PS. Is it possible to use unified output for diffs ("cvs diff -u")? I find these easier to read.

------- Additional comments from paulmillar Mon Jul 14 17:06:04 -0700 2008 -------
Hi Andreas,

A couple of updates on this issue:

The first point is I've tried the second version of the patch, as you recommended. At first it seemed to work; however, I was concerned that the compiler was somehow factoring out the DPRINTF macro (hence the diffs++ and out++ within the DPRINTF macro are never evaluated). This would hide the problem until someone attempts to compile with an enabled DPRINTF. To test this, I replaced the DPRINTF macro with a simple printf and the compilation broke again:

gcc -O3 -Wall -Werror -Wstrict-prototypes -D__GRIDENGINE_FD_SETSIZE=8192 [... many more arguments ...] -I/usr/lib/jvm/java-6-sun-1.6.0.07/jre/include/linux -fPIC -c ../libs/uti/sge_getloadavg.c
cc1: warnings being treated as errors
../libs/uti/sge_getloadavg.c: In function 'get_cpu_load':
../libs/uti/sge_getloadavg.c:1305: error: array subscript is above array bounds
../libs/uti/sge_getloadavg.c:1305: error: array subscript is above array bounds
../libs/uti/sge_getloadavg.c:1305: error: array subscript is above array bounds
../libs/uti/sge_getloadavg.c:1305: error: array subscript is above array bounds
make: *** [sge_getloadavg.o] Error 1

I believe this demonstrates that the second version (of the patch) is still broken, although I don't know why DPRINTF is not having any effect: is DPRINTF not available for the uti library?

The second point is that I've just noticed that there's a function called percentages_new() in the same file that is similar to percentages(). The patches so far only fix percentages() and not percentages_new(). The latter looks to have the same problem as the former, but was not picked up by gcc as the code is wrapped by some preprocessor tests for (I believe) compilation architecture.

HTH, Paul. |
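The disagreement in the comments above comes down to pointer increments placed inside the arguments of a debug macro that is not always evaluated. The following is a minimal sketch of that hazard, not the Grid Engine code: `TRACE`, `trace_enabled`, `percentages_fragile()` and `percentages_fixed()` are hypothetical names introduced for illustration, and `TRACE` only assumes the behaviour described above (arguments evaluated in monitoring mode only). The arithmetic is copied from the loop in the posted diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a debug macro whose arguments are evaluated
 * only while tracing is enabled. This is NOT the real DPRINTF definition. */
static int trace_enabled = 0;
#define TRACE(args) do { if (trace_enabled) printf args; } while (0)

/* Fragile variant: both pointers advance only inside the TRACE arguments,
 * so with tracing off every iteration rewrites out[0] from diffs[0]. */
static void percentages_fragile(size_t cnt, double *out,
                                const unsigned long *diffs,
                                unsigned long total_change)
{
    unsigned long half_total = total_change / 2;
    size_t i;

    assert(total_change != 0);
    for (i = 0; i < cnt; i++) {
        *out = ((double)((*diffs * 1000 + half_total) / total_change)) / 10;
        TRACE(("diffs: %lu -> %f\n", *diffs++, *out++));
    }
}

/* Variant in the spirit of the patch posted in the ticket: TRACE only reads
 * the values, and both increments are ordinary statements. */
static void percentages_fixed(size_t cnt, double *out,
                              const unsigned long *diffs,
                              unsigned long total_change)
{
    unsigned long half_total = total_change / 2;
    size_t i;

    assert(total_change != 0);
    for (i = 0; i < cnt; i++) {
        *out = ((double)((*diffs * 1000 + half_total) / total_change)) / 10;
        TRACE(("diffs: %lu -> %f\n", *diffs, *out));
        out++;
        diffs++;
    }
}
```

With `diffs = {1, 1, 2}` and `total_change = 4`, `percentages_fixed()` fills `out` with 25.0, 25.0 and 50.0 whether or not `trace_enabled` is set. `percentages_fragile()` produces that result only when tracing is on; with tracing off it writes `out[0]` three times and leaves the rest untouched. Incrementing both inside and outside the macro would instead step the pointers twice per iteration when tracing is on, which is the failure mode Paul describes for the second attachment.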