Opened 12 years ago

Last modified 9 years ago

#472 new defect

IZ2403: qmaster becomes unresponsive after drmaa_control release call

Reported by: tholzer
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.1u2
Severity:
Keywords: PC Linux drmaa
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2403]

             Issue #: 2403
             Summary: qmaster becomes unresponsive after drmaa_control release call
           Component: gridengine
        Subcomponent: drmaa
             Version: 6.1u2
            Platform: PC
                  OS: Linux
              Status: NEW
          Resolution:
            Priority: P3
          Issue type: DEFECT
    Target milestone: ---
            Reporter: tholzer (tholzer)
         Assigned to: templedf (templedf)
          QA Contact: templedf
                  CC: None defined
                 URL:
   Status whiteboard:
         Attachments:
   Issue 2403 blocks:
Votes for issue 2403:


   Opened: Tue Oct 16 18:50:00 -0700 2007 
------------------------


When releasing a large number of jobs (50,000) from a hold through drmaa_control
with DRMAA_JOB_IDS_SESSION_ALL, the call returns an error and the qmaster
process becomes unresponsive to client requests.

drmaa_control returns error code 2 (failed receiving gdi request) and other
clients (e.g. qstat, qmon) return the same error message:

# qstat
error: failed receiving gdi request
0

The qmaster log shows:

10/17/2007 11:27:39|qmaster|abc|E|acknowledge timeout after 600 seconds for event client (drmaa:178) on host "xyz"

However, all jobs eventually get released and the qmaster process starts
responding again. When releasing the 50,000 jobs through qrls, this behaviour
does not occur.

To reproduce:

// gcc -I $SGE_ROOT/include/ -L $SGE_ROOT/lib/lx26-x86/ test.c -l drmaa -o test

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include "drmaa.h"

int err = 0;
char buff[1024];

void check() {
  if (err != 0) {
    printf("Error %d : %s\n", err, buff);
    exit(err);
  }
}

int main(int argc, char **argv) {
  drmaa_job_template_t *jt = NULL;
  char jobid[1024];
  char drmsys[1024];
  int i = 0;

  // Init
  bzero(buff, sizeof(buff));
  bzero(drmsys, sizeof(drmsys));
  bzero(jobid, sizeof(jobid));

  // DRMAA
  err = drmaa_get_DRM_system(drmsys, sizeof(drmsys), buff, sizeof(buff)-1);
  check();

  printf("System %s\n", drmsys);

  err = drmaa_init(NULL, buff, sizeof(buff) - 1);
  check();

  err = drmaa_allocate_job_template(&jt, buff, sizeof(buff) - 1);
  check();

  err = drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND,
                            "/bin/true", buff, sizeof(buff) - 1);
  check();

  err = drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION,
                            "-o /dev/null -h -j y -v n -b y -shell n", buff, sizeof(buff) - 1);
  check();

  for (i = 0; i < 50000; i++) {
    if (i % 1000 == 0) {
      printf("Jobs %d\n", i);
    }
    err = drmaa_run_job(jobid, sizeof(jobid), jt, buff, sizeof(buff) - 1);
    check();
  }

  // Fails with error code 2 (failed receiving gdi request)
  err = drmaa_control(DRMAA_JOB_IDS_SESSION_ALL, DRMAA_CONTROL_RELEASE,
                      buff, sizeof(buff) - 1);
  check();

  err = drmaa_delete_job_template(jt, buff, sizeof(buff) - 1);
  check();

  err = drmaa_exit(buff, sizeof(buff) - 1);
  check();

  return 0;
}
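
The report notes that releasing the same jobs through qrls does not trigger the problem, which suggests the single DRMAA_JOB_IDS_SESSION_ALL request may be the trigger. One possible workaround, untested against this issue, is to keep the job ids returned by drmaa_run_job and release them one at a time in small batches. The sketch below shows only the batching logic. The names release_in_batches, release_fn, pause_fn, release_stats, stub_release and stub_pause are hypothetical and not part of DRMAA or Grid Engine; the stubs only count calls so that the logic can be exercised without a qmaster.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: release one job by id, return 0 on success.
 * In a DRMAA client this could wrap
 * drmaa_control(jobid, DRMAA_CONTROL_RELEASE, diag, diag_len). */
typedef int (*release_fn)(const char *jobid, void *ctx);

/* Hypothetical callback: invoked between batches, e.g. to sleep. */
typedef void (*pause_fn)(void *ctx);

/* Counters used by the dry-run stubs below. */
struct release_stats {
  size_t released;
  size_t pauses;
};

/* Dry-run stub: counts the job instead of contacting the qmaster. */
int stub_release(const char *jobid, void *ctx) {
  struct release_stats *st = ctx;
  (void)jobid;
  st->released++;
  return 0;
}

/* Dry-run stub: counts the pause instead of sleeping. */
void stub_pause(void *ctx) {
  struct release_stats *st = ctx;
  st->pauses++;
}

/* Release jobids[0..count) one at a time, calling pause() after every
 * batch_size jobs except after the final batch. Stops at the first failure.
 * Returns the number of jobs released; *err_out receives the first non-zero
 * code returned by release(), or 0 if every call succeeded. */
size_t release_in_batches(const char *const *jobids, size_t count,
                          size_t batch_size, release_fn release,
                          pause_fn pause, void *ctx, int *err_out) {
  size_t i;

  assert(batch_size > 0);
  assert(release != NULL && err_out != NULL);

  *err_out = 0;
  for (i = 0; i < count; i++) {
    if (i > 0 && i % batch_size == 0 && pause != NULL) {
      pause(ctx);
    }
    *err_out = release(jobids[i], ctx);
    if (*err_out != 0) {
      return i;
    }
  }
  return count;
}
```

In the reproducer above, this would mean copying each jobid after drmaa_run_job into an array and passing that array to release_in_batches with a callback that calls drmaa_control for a single job id. Whether this avoids the qmaster stall is an assumption that would need to be verified on an affected installation.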

Change History (0)
