Opened 8 years ago

Closed 7 years ago

#660 closed defect (fixed)

IZ2986: we need the ability for qrsh to ignore SIGUSR1 and SIGUSR2 to support -notify

Reported by: joga
Owned by:
Priority: high
Milestone:
Component: sge
Version: 5.3
Severity: minor
Keywords: Sun clients
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2986]

        Issue #:      2986             Platform:     Sun      Reporter: joga (joga)
       Component:     gridengine          OS:        All
     Subcomponent:    clients          Version:      5.3         CC:    None defined
        Status:       REOPENED         Priority:     P2
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: 6.2u3
      Assigned to:    joga (joga)
      QA Contact:     roland
          URL:
       * Summary:     we need the ability for qrsh to ignore SIGUSR1 and SIGUSR2 to support -notify
   Status whiteboard:
      Attachments:

     Issue 2986 blocks:
   Votes for issue 2986:


   Opened: Wed Apr 8 07:00:00 -0700 2009 
------------------------


When running a job under Open MPI, we use qrsh to start up the job on the remote nodes.  We make one call to qrsh for each node.  Each call
starts a process called an orted on that node, which in turn forks/execs the a.outs.  To start the orted, we invoke qrsh in this manner:

qrsh -inherit -nostdin -V burl-ct-v440-1 orted

This all works fine.  The issue arises when we attempt to support the suspend/resume feature along with the -notify flag.  To support these
features, we have added special signal handling to Open MPI.  First, we make sure that both mpirun and the orteds ignore SIGUSR1 and
SIGUSR2.  We also catch SIGTSTP in mpirun so that it can forward the signal to the orteds, which change it to SIGSTOP and deliver it to the
a.outs.  That part works.  The problem is that the one process we cannot get to ignore SIGUSR1/SIGUSR2 is qrsh.  I
actually set the disposition of SIGUSR1/SIGUSR2 to ignore before execv'ing qrsh, but it appears that qrsh changes it back to the
default.  This means that when a SIGUSR1 arrives because we are running the job with -notify, the job dies.

   ------- Additional comments from joga Wed Apr 8 07:02:55 -0700 2009 -------
Evaluation:

qrsh behaves exactly like rsh here (which in general was a goal when developing qrsh): both exit on SIGUSR1 and SIGUSR2.
But to allow tightly integrated jobs to be notified, we have to block SIGUSR1 and SIGUSR2, at least for qrsh -inherit.
If the notification signals are changed via the execd_params NOTIFY_KILL and NOTIFY_SUSP, the configured signals have to be blocked instead.

See also IZ 2979.

   ------- Additional comments from joga Fri Apr 24 02:35:29 -0700 2009 -------
SuggestedFix

Block the required signals, only for qrsh -inherit.

If NOTIFY_KILL or NOTIFY_SUSP is configured in the execd params,
we have to transport this information to the qrsh -inherit.

qrsh -inherit itself does not contact qmaster, so it does not have the host's configuration.
The signal mapping information is contained in the file "config" in the job's active_jobs directory. But we certainly do not want a possibly
high number of qrsh -inherit calls all reading the config file.
So we had better transport this information via environment variables,
similar to the already existing variable SGE_RSH_COMMAND:
only for the master task of tightly integrated parallel jobs,
and only when a signal mapping is requested. We use the 2 variables
SGE_NOTIFY_KILL_SIGNAL and SGE_NOTIFY_SUSP_SIGNAL.

We have the full fix for the builtin qrsh transport only.

If ssh is used as transport, the fix will not work, as the ssh client spawned by qrsh -inherit will terminate on SIGUSR1/2.

For the rsh client delivered with SGE we do a minimal fix:
We block SIGUSR1/2, so the fix will work even with the old interactive job support, unless a signal mapping to signals other than SIGUSR1/2
is done via execd_params NOTIFY_KILL or NOTIFY_SUSP.

   ------- Additional comments from joga Fri Apr 24 02:36:10 -0700 2009 -------
Fixed in maintrunk for 6.2u3 and
V61_BRANCH for 6.1u7.

   ------- Additional comments from brooks Tue Sep 1 08:32:40 -0700 2009 -------
This fix introduced non-portable sigignore() calls.  While sigaction() is certainly a pain, it is the only fully portable option.  One option to reduce the pain might be to introduce
an sge_sigignore() implemented using sigaction().

   ------- Additional comments from reuti Tue Sep 1 08:53:04 -0700 2009 -------
AFAICS you can simply set sa_handler to SIG_IGN in the struct passed to sigaction(), which should give the desired behavior.

Change History (1)

comment:1 Changed 7 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed