Opened 8 years ago

Last modified 6 years ago

#580 new defect

IZ2756: Builtin qrsh kills slaves too early, no accounting written w/side effects

Reported by: reuti Owned by:
Priority: high Milestone:
Component: sge Version: 6.2
Severity: Keywords: clients
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2756]

   Issue #:              2756
   Summary:              Builtin qrsh kills slaves too early, no accounting written w/side effects
   Reporter:             reuti (reuti)
   Component:            gridengine
   Subcomponent:         clients
   Version:              6.2
   Platform:             All
   OS:                   All
   Status:               NEW
   Priority:             P2
   Resolution:
   Issue type:           DEFECT
   Target milestone:     ---
   Assigned to:          roland (roland)
   QA Contact:           roland
   CC:                   None defined
   URL:
   Status whiteboard:
   Attachments:
   Issue 2756 blocks:
   Votes for issue 2756:

   Opened: Wed Oct 15 10:01:00 -0700 2008
------------------------


With the new setting of "accounting_summary=TRUE" in the PE I get the correct accumulated time, hence
the new built-in qrsh is working fine. But with "accounting_summary=FALSE" I also get only one record:
the one for the master task. The slaves are missing.

   ------- Additional comments from reuti Wed Oct 15 10:57:11 -0700 2008 -------
This seems to be a race condition: when the master exits a little earlier than the slaves, the slaves are
killed, so no accounting records are written for them.

Workaround: end the parallel (e.g. MPI) program with:

//
// End of program
//

   MPI_Finalize();

   // Only the master rank waits, so the slaves can exit and write their records first.
   if (rank==0)
   {
       system("sleep 5");
   }
}

So the master rank waits 5 seconds and all records are written. A built-in grace period would be
good.
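
A minimal sketch of the same workaround without spawning a shell: the master rank sleeps via POSIX sleep() after MPI_Finalize(). The helper names grace_seconds_for_rank and wait_for_slaves are hypothetical, and the 5-second value is simply the one used above. The MPI calls appear only in a comment, so the fragment compiles without an MPI installation.

```c
#include <unistd.h>

/* Hypothetical helper: only the master (rank 0) gets a grace period. */
unsigned int grace_seconds_for_rank(int rank, unsigned int grace)
{
    return rank == 0 ? grace : 0U;
}

/* Hypothetical helper: block for the grace period of this rank.
 * sleep() returns the unslept seconds if a signal interrupts it,
 * so loop until the whole period has elapsed. */
void wait_for_slaves(int rank, unsigned int grace)
{
    unsigned int remaining = grace_seconds_for_rank(rank, grace);
    while (remaining > 0U) {
        remaining = sleep(remaining);
    }
}

/* Assumed usage at the end of an MPI program (not compiled here):
 *
 *     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *     ...
 *     MPI_Finalize();
 *     wait_for_slaves(rank, 5);
 */
```

Like the original workaround, this only hides the race; it does not guarantee that the slaves finish within the chosen delay.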

I am changing the summary accordingly.

   ------- Additional comments from reuti Wed Oct 15 11:18:36 -0700 2008 -------
Blah - I touched the wrong button. Now the correct summary.

BTW: the messages file confirms the unwanted kills of the slaves.

PS: Besides the missing accounting, it might also damage user-written files if the slave tasks want to write
something before they exit gracefully just a few seconds later. Reminds me of issue:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1960

When should the slaves be killed: before or after the PE stop_proc_args? Or after an adjustable timeout?

   ------- Additional comments from joga Mon Jan 5 02:04:51 -0700 2009 -------
Hi Reuti,

do you see any indication that the slave tasks actually get killed, e.g. some
logging indicating this in the job's output file?
Or are just the accounting records missing?

If it is the latter, then you are most probably seeing IZ 2815:
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2815

Do you see this issue with short tasks (task runtime < load_report_time)?

   ------- Additional comments from reuti Mon Jan 5 09:34:43 -0700 2009 -------
At the time of writing this issue I found this in the messages file of the qmaster:

10/15/2008 18:40:53|worker|pc15370|E|execd@pc15370.Chemie.Uni-Marburg.DE reports running job (15.1/3.pc15370) in queue
"all.q@pc15370.Chemie.Uni-Marburg.DE" that was not supposed to be there - killing
10/15/2008 18:40:53|worker|pc15370|E|execd@pc15370.Chemie.Uni-Marburg.DE reports running job (15.1/2.pc15370) in queue
"all.q@pc15370.Chemie.Uni-Marburg.DE" that was not supposed to be there - killing
10/15/2008 18:40:53|worker|pc15370|E|execd@pc15370.Chemie.Uni-Marburg.DE reports running job (15.1/1.pc15370) in queue
"all.q@pc15370.Chemie.Uni-Marburg.DE" that was not supposed to be there - killing
10/15/2008 18:46:13|worker|pc15370|E|execd@pc15370.Chemie.Uni-Marburg.DE reports running job (17.1/3.pc15370) in queue
"all.q@pc15370.Chemie.Uni-Marburg.DE" that was not supposed to be there - killing
10/15/2008 18:46:13|worker|pc15370|E|execd@pc15370.Chemie.Uni-Marburg.DE reports running job (17.1/2.pc15370) in queue
"all.q@pc15370.Chemie.Uni-Marburg.DE" that was not supposed to be there - killing
10/15/2008 18:46:13|worker|pc15370|E|execd@pc15370.Chemie.Uni-Marburg.DE reports running job (17.1/1.pc15370) in queue
"all.q@pc15370.Chemie.Uni-Marburg.DE" that was not supposed to be there - killing

These are just the entries from before I extended the issue (daylight-saving time + 9 hrs). As I just found, it no longer happens in 6.2u1, at least during
my tests right now.
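
To check a messages file for such unwanted kills, the entries above can be picked apart with a small parser. This is only a sketch: the function name parse_kill_report is hypothetical, and reading "15.1/3.pc15370" as job id, array task id and PE task id is an assumption based on the lines quoted above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the ids from a qmaster message such as
 *   "... reports running job (15.1/3.pc15370) in queue"
 * Returns 1 and fills the outputs on a match, 0 otherwise.
 * pe_task must point to a buffer of at least 64 bytes. */
int parse_kill_report(const char *line, unsigned long *job_id,
                      unsigned long *array_task, char *pe_task)
{
    static const char marker[] = "reports running job (";
    const char *p = strstr(line, marker);

    if (p == NULL) {
        return 0;
    }
    p += sizeof marker - 1;  /* skip past the marker text */

    /* <job>.<array task>/<pe task>, with the PE task running up to ')' */
    return sscanf(p, "%lu.%lu/%63[^)]", job_id, array_task, pe_task) == 3;
}
```

Counting the matches per job id would show how many slave tasks of a job were reported as "not supposed to be there".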

   ------- Additional comments from joga Fri Jan 9 02:23:14 -0700 2009 -------
This is certainly highly dependent on timing.
6.2u1 fixes a number of issues with the new interactive job support; shutdown of
the tasks might be faster now.

But in general, I would expect the master task to wait for the slaves to finish
before exiting.
