Opened 13 years ago
Last modified 10 years ago
#472 new defect
IZ2403: qmaster becomes unresponsive after drmaa_control release call
Reported by: | tholzer | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.1u2 |
Severity: | | Keywords: | PC Linux drmaa |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2403]
Issue #: 2403
Platform: PC
Reporter: tholzer (tholzer)
Component: gridengine
OS: Linux
Subcomponent: drmaa
Version: 6.1u2
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: templedf (templedf)
QA Contact: templedf
URL:
* Summary: qmaster becomes unresponsive after drmaa_control release call
Status whiteboard:
Attachments:
Issue 2403 blocks:
Votes for issue 2403:
Opened: Tue Oct 16 18:50:00 -0700 2007
------------------------

When releasing a large number of jobs (50,000) from a hold through drmaa_control with DRMAA_JOB_IDS_SESSION_ALL, the call returns an error and the qmaster process becomes unresponsive to client requests.

drmaa_control returns error code 2 (failed receiving gdi request) and other clients (e.g. qstat, qmon) return the same error message:

    # qstat
    error: failed receiving gdi request 0

The qmaster log shows:

    10/17/2007 11:27:39|qmaster|abc|E|acknowledge timeout after 600 seconds for event client (drmaa:178) on host "xyz"

However, all jobs eventually get released and the qmaster process starts responding again.

When releasing the 50,000 jobs through qrls, this behaviour does not occur.
To reproduce:

    // gcc -I $SGE_ROOT/include/ -L $SGE_ROOT/lib/lx26-x86/ -l drmaa test.c -o test
    #include <stdio.h>
    #include <stdlib.h>
    #include <strings.h>
    #include "drmaa.h"

    int err = 0;
    char buff[1024];

    void check() {
        if (err != 0) {
            printf("Error %d : %s\n", err, buff);
            exit(err);
        }
    }

    int main(int argc, char **argv) {
        drmaa_job_template_t *jt = NULL;
        char jobid[1024];
        char drmsys[1024];
        int i = 0;

        // Init
        bzero(buff, sizeof(buff));
        bzero(drmsys, sizeof(drmsys));
        bzero(jobid, sizeof(jobid));

        // DRMAA
        err = drmaa_get_DRM_system(drmsys, sizeof(drmsys), buff, sizeof(buff)-1);
        check();
        printf("System %s\n", drmsys);

        err = drmaa_init(NULL, buff, sizeof(buff) - 1);
        check();

        err = drmaa_allocate_job_template(&jt, buff, sizeof(buff) - 1);
        check();

        err = drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/true", buff, sizeof(buff) - 1);
        check();

        err = drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION, "-o /dev/null -h -j y -v n -b y -shell n", buff, sizeof(buff) - 1);
        check();

        for (i = 0; i < 50000; i++) {
            if (i % 1000 == 0) {
                printf("Jobs %d\n", i);
            }
            err = drmaa_run_job(jobid, sizeof(jobid), jt, buff, sizeof(buff) - 1);
            check();
        }

        // Fails with error code 2 (failed receiving gdi request)
        err = drmaa_control(DRMAA_JOB_IDS_SESSION_ALL, DRMAA_CONTROL_RELEASE, buff, sizeof(buff) - 1);
        check();

        err = drmaa_delete_job_template(jt, buff, sizeof(buff) - 1);
        check();

        err = drmaa_exit(buff, sizeof(buff) - 1);
        check();

        return 0;
    }
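For comparison with the single DRMAA_JOB_IDS_SESSION_ALL call, one could release the jobs one ID at a time. The following is only a sketch of such a loop and has not been tried against 6.1u2, so it is unknown whether it avoids the acknowledge timeout. The names release_fn, release_each, dry_run_release and dry_run_calls are hypothetical helpers and are not part of DRMAA or Grid Engine. The release step is passed in as a callback so the loop can be exercised without a running qmaster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type: releases one job, returns 0 on success. */
typedef int (*release_fn)(const char *jobid);

/* Calls `release` once per job ID and returns the number of failures. */
size_t release_each(const char *const *jobids, size_t n, release_fn release)
{
    size_t failed = 0;
    size_t i;

    assert(jobids != NULL && release != NULL);
    for (i = 0; i < n; i++) {
        if (release(jobids[i]) != 0) {
            fprintf(stderr, "release failed for job %s\n", jobids[i]);
            failed++;
        }
    }
    return failed;
}

/* Dry-run callback (assumption, for testing only): counts calls and
 * contacts no DRM system. */
size_t dry_run_calls = 0;

int dry_run_release(const char *jobid)
{
    (void)jobid;
    dry_run_calls++;
    return 0;
}
```

In the reproducer above, the job IDs written by drmaa_run_job would have to be saved in an array, and the callback would wrap drmaa_control(jobid, DRMAA_CONTROL_RELEASE, buff, sizeof(buff) - 1) for a single job ID.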