Opened 10 years ago

Closed 7 years ago

#507 closed defect (fixed)

IZ2552: No core dump if SGE daemons crash when admin_user != "root"

Reported by: andreas Owned by:
Priority: high Milestone:
Component: sge Version: 6.1AR_snapshot3_6
Severity: minor Keywords: kernel
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2552]

   Issue #: 2552   Platform: All   Reporter: andreas (andreas)
   Component: gridengine   OS: All
   Subcomponent: kernel   Version: 6.1AR_snapshot3_6   CC: None defined
   Status: REOPENED   Priority: P2
   Resolution:   Issue type: DEFECT
     Target milestone: ---
   Assigned to: andreas (andreas)
   QA Contact: andreas
   URL:
   * Summary: No core dump if SGE daemons crash when admin_user != "root"
   Status whiteboard:
   Attachments:
   Date/filename:                                Description:                                                                          Submitted by:
   Fri Apr 11 08:10:00 -0700 2008: libcore.so.gz libcore.so for AMD64 Linux (application/x-gzip)                                       andreas
   Fri Apr 11 08:12:00 -0700 2008: libcore.c     Source code for libcore.so (text/plain)                                               andreas
   Mon Apr 28 04:00:00 -0700 2008: libcore.so.gz libcore.so for lx24-ia64 (application/x-gzip)                                         andreas
   Mon Apr 28 04:01:00 -0700 2008: libcore.so.gz libcore.so for lx24-x86 (text/plain)                                                  andreas
   Mon Apr 28 06:49:00 -0700 2008: 2552.diff     Proposed patch (maintrunk) (text/plain)                                               andreas
   Tue May 13 02:23:00 -0700 2008: build.sh      Build.sh that I used to build libcore.so from libcore.c attached earlier (text/plain) andreas
     Issue 2552 blocks:
   Votes for issue 2552:

   Opened: Thu Apr 10 02:51:00 -0700 2008 
------------------------


DESCRIPTION:
When SGE daemons crash, no core file is written if admin_user != "root". The
operating system suppresses the dump for security reasons.

WORKAROUND/FIX:
Under Solaris, coreadm(1M) can be used to grant an exemption (per process or
globally) so that core files are written in this case.

Under Linux there are two options:
(1) To override the restriction for all processes, use

      # sysctl -w kernel.core_setuid_ok=1

    This is mentioned in

      http://kbase.redhat.com/faq/FAQ_49_3652.shtm

    for RHEL3, so I would assume it works on RHEL4 as well.

(2) To override it for an individual process, there is the call

      prctl(PR_SET_DUMPABLE,1,42,42,42);

    Since

      https://bugzilla.redhat.com/show_bug.cgi?id=104310

    treats it as a bug when this call is broken, I would assume one can rely on it.
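
A minimal C sketch of option (2), for illustration only. The helper names
enable_core_dumps() and core_dumps_enabled() are hypothetical and not part of
SGE. The ticket passes 42 for the three trailing arguments; this sketch passes 0
instead, on the assumption that the kernel ignores them for this option.

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Hypothetical helper: mark the calling process as dumpable so that the
 * kernel may write a core file for it. Returns 0 on success, -1 on error. */
int enable_core_dumps(void)
{
    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) == -1) {
        perror("prctl(PR_SET_DUMPABLE)");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: return the current value of the dumpable flag
 * (1 = dumpable, 0 = not dumpable), or -1 on error. */
int core_dumps_enabled(void)
{
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

A daemon would call enable_core_dumps() once it has switched to the admin user.
The flag alone is not sufficient if the core file size limit (ulimit -c) of the
process is 0.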

   ------- Additional comments from andreas Thu Apr 10 05:00:34 -0700 2008 -------
Use of

  prctl(PR_SET_DUMPABLE,1,42,42,42)

under Linux seems problematic, as it would be necessary to issue this prctl()
again each time the uid/euid changes:

  http://linux-documentation.com/en/man/man2/prctl.html

   ------- Additional comments from andreas Thu Apr 10 05:38:01 -0700 2008 -------
The best approach to address this issue is to have the documentation explain how
to still get the core file.

The plan is to add a troubleshooting section to the 6.2 Install Guide that refers
to coreadm(1M) and sysctl -w kernel.core_setuid_ok

   ------- Additional comments from andreas Fri Apr 11 08:07:50 -0700 2008 -------
It turned out that e.g. RHEL4 no longer knows

# sysctl -w kernel.core_setuid_ok=1

so the only way to get a core dump under Linux appears to be to issue

   prctl(PR_SET_DUMPABLE,1,42,42,42);

after each call to setuid(), seteuid(), setgid(), and setegid().
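
The pattern described above could be sketched in C as follows. This is an
assumption about how such a call site might look, not the patch attached to this
issue; seteuid_keep_dumpable() is a hypothetical name, and the same shape would
apply to setuid(), setgid(), and setegid().

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: change the effective uid, then set the dumpable
 * flag again because the kernel may have cleared it as a side effect of
 * the uid change. Returns 0 on success, -1 on error. */
int seteuid_keep_dumpable(uid_t euid)
{
    if (seteuid(euid) == -1)
        return -1;

    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) == -1)
        return -1;

    return 0;
}
```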

As a workaround, preloading libcore.so via LD_PRELOAD turned out to solve the
issue. E.g. to apply it to sge_execd one must change, in

   $SGE_ROOT/$SGE_CELL/common/sgeexecd

the line

    $bin_dir/sge_execd

where sge_execd is started into

    env LD_PRELOAD=/path/to/libcore.so $bin_dir/sge_execd

After the execd is restarted, a core.<pid> file is written to the spool directory
$SGE_ROOT/$SGE_CELL/spool/<host>/ of this execd when it gets killed using

    # kill -SEGV <pid>

Note that LD_PRELOAD is inherited by the shepherd processes forked by such an
execd, but the jobs themselves will not have it in their environments unless
INHERIT_ENV=LD_PRELOAD is added to the execd_params section of the cluster
configuration, sge_conf(5).
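
For readers unfamiliar with the LD_PRELOAD technique, the following C sketch
shows how a preloaded library could interpose the four id-changing calls and set
the dumpable flag again after each one. It is NOT the attached libcore.c, whose
contents are not reproduced in this ticket. The WRAP macro and the file name
used below are assumptions made for illustration.

```c
#define _GNU_SOURCE             /* for RTLD_NEXT */
#include <dlfcn.h>
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Define a replacement for libc function `fn` taking one argument of
 * type `type`. The replacement looks up the next definition of `fn` in
 * library search order (normally the one in libc), calls it, and then
 * sets the dumpable flag again while preserving the result and errno. */
#define WRAP(fn, type)                                         \
    int fn(type id)                                            \
    {                                                          \
        static int (*real)(type);                              \
        if (real == NULL)                                      \
            real = (int (*)(type))dlsym(RTLD_NEXT, #fn);       \
        if (real == NULL) {                                    \
            errno = ENOSYS;                                    \
            return -1;                                         \
        }                                                      \
        int rc = real(id);                                     \
        int saved_errno = errno;                               \
        prctl(PR_SET_DUMPABLE, 1, 0, 0, 0);                    \
        errno = saved_errno;                                   \
        return rc;                                             \
    }

WRAP(setuid, uid_t)
WRAP(seteuid, uid_t)
WRAP(setgid, gid_t)
WRAP(setegid, gid_t)
```

Assuming the sketch is saved as dumpable_sketch.c (a hypothetical name), it
could be built as a shared object with something like
gcc -shared -fPIC -o dumpable_sketch.so dumpable_sketch.c -ldl
and then be preloaded in the same way as libcore.so above.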

   ------- Additional comments from andreas Fri Apr 11 08:10:12 -0700 2008 -------
Created an attachment (id=164)
libcore.so for AMD64 Linux

   ------- Additional comments from andreas Fri Apr 11 08:12:04 -0700 2008 -------
Created an attachment (id=165)
Source code for libcore.so

   ------- Additional comments from andreas Mon Apr 28 04:00:50 -0700 2008 -------
Created an attachment (id=166)
libcore.so for lx24-ia64

   ------- Additional comments from andreas Mon Apr 28 04:01:50 -0700 2008 -------
Created an attachment (id=167)
libcore.so for lx24-x86

   ------- Additional comments from andreas Mon Apr 28 06:49:52 -0700 2008 -------
Created an attachment (id=168)
Proposed patch (maintrunk)

   ------- Additional comments from andreas Wed Apr 30 06:47:05 -0700 2008 -------
Fixed in Maintrunk for Linux sge_execds.

   ------- Additional comments from andreas Tue May 13 02:23:05 -0700 2008 -------
Created an attachment (id=171)
Build.sh that I used to build libcore.so from libcore.c attached earlier

Change History (1)

comment:1 Changed 7 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed