Opened 11 years ago

Last modified 9 years ago

#519 new defect

IZ2581: DRMAA event client stops after qmaster restart

Reported by: templedf Owned by:
Priority: normal Milestone:
Component: sge Version: 6.0u10
Severity: Keywords: drmaa
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2581]

Issue #: 2581
Summary: DRMAA event client stops after qmaster restart
Component: gridengine
Subcomponent: drmaa
Version: 6.0u10
Platform: All
OS: All
Status: NEW
Resolution:
Priority: P3
Issue type: DEFECT
Target milestone: ---
Reporter: templedf (templedf)
Assigned to: templedf (templedf)
QA Contact: templedf
CC: None defined
URL:
Status whiteboard:
Attachments:
Issue 2581 blocks:
Votes for issue 2581:


   Opened: Thu May 22 15:09:00 -0700 2008 
------------------------


From the customer:

---
I noticed this very strange behavior that occurs when we fail over the
qmaster.  As you can see below, the qmaster finally failed over from the
primary xxx4 host to xxx1.

03/24/2008 10:14:35|qmaster|xxx4|I|starting up GE 6.0u10 (sol-sparc64)
03/24/2008 10:17:42|qmaster|xxx4|I|starting up GE 6.0u10 (sol-sparc64)
03/24/2008 10:35:26|qmaster|xxx1|I|starting up GE 6.0u10 (sol-sparc64)

At 03/24 12:02:08, our nite_jobs scheduler calls drmaa_init to connect
to the Grid.
Jobs run fine up until the next day around 9:01am when the scheduler
gets this event client error (return code 5).

I log all the DRMAA calls as you can see here.  The job submit got back
the job id 82738 so that was fine but when it called drmaa_wait, we got
this error.
03/25 09:01:06 drmaa_run_job(rpt.nite_control): ok (82738)
03/25 09:01:06 drmaa_wait(jobXXX): The event client has not been started. [5]

What I can tell you is drmaa_delete_job_template is called immediately
after drmaa_run_job returns a good status code.

I'm able to reproduce this problem.
---

According to the customer, the failed call is not the first one that happens
after the qmaster restart.  If that's accurate, the event client is able to
reconnect with the qmaster after the fail-over, but it subsequently dies.

From the customer:

---
I did find these errors in the qmaster messages log:
messages:03/24/2008 10:58:55|qmaster|XXX1|E|acknowledge timeout after 600 seconds for event client (drmaa:16) on host "XXX2"
messages:03/24/2008 15:17:20|qmaster|XXX1|E|acknowledge timeout after 600 seconds for event client (drmaa:119) on host "XXX2"
messages:03/24/2008 15:18:12|qmaster|XXX1|E|acknowledge timeout after 600 seconds for event client (drmaa:122) on host "XXX2"
messages:03/25/2008 15:15:53|qmaster|piaptss001|E|acknowledge timeout after 600 seconds for event client (drmaa:217) on host "XXX2"
---
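
To line these acknowledge timeouts up against the failing drmaa_wait call,
it may help to pull the timestamp and event client id out of each entry.
The following is only a rough sketch: the regular expression is inferred
from the four entries quoted above, not from any documented log format,
and parse_ack_timeouts and SAMPLE are names made up for this example. It
tolerates entries that wrap across lines and ignores the "messages:"
prefix left by grep.

```python
import re
from datetime import datetime

# Pattern inferred from the entries quoted in this ticket (an assumption,
# not a documented format).
ACK_TIMEOUT = re.compile(
    r'(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|qmaster\|(?P<master>[^|]+)\|E\|'
    r'acknowledge timeout after (?P<secs>\d+) seconds for event client '
    r'\((?P<name>\w+):(?P<id>\d+)\) on host "(?P<host>[^"]+)"'
)


def parse_ack_timeouts(text):
    """Return one dict per acknowledge-timeout entry found in text."""
    # Collapse all whitespace runs so entries wrapped by mail or a
    # terminal are rejoined before matching.
    flat = " ".join(text.split())
    records = []
    for m in ACK_TIMEOUT.finditer(flat):
        records.append({
            "time": datetime.strptime(m.group("ts"), "%m/%d/%Y %H:%M:%S"),
            "master": m.group("master"),
            "client_name": m.group("name"),
            "client_id": int(m.group("id")),
            "host": m.group("host"),
            "timeout_s": int(m.group("secs")),
        })
    return records


# Two of the entries quoted above, wrapped as they arrived.
SAMPLE = '''messages:03/24/2008 10:58:55|qmaster|XXX1|E|acknowledge timeout after
600 seconds for event client (drmaa:16) on host "XXX2"
messages:03/25/2008 15:15:53|qmaster|piaptss001|E|acknowledge timeout
after 600 seconds for event client (drmaa:217) on host "XXX2"
'''

for rec in parse_ack_timeouts(SAMPLE):
    print(rec["time"].isoformat(), rec["master"], rec["client_id"])
```

Sorting the resulting records by time and comparing them with the
timestamps in the scheduler's own DRMAA call log would show whether a
timeout was logged shortly before a given drmaa_wait failure.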

   ------- Additional comments from templedf Thu May 22 15:09:52 -0700 2008 -------
Changed platform.
