Opened 15 years ago

Closed 9 years ago

#185 closed enhancement (fixed)

IZ1102: user belonging to too many groups sets all queues into error state

Reported by: wig
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 5.3
Severity: minor
Keywords: Solaris qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1102]

   Issue #:           1102
   Platform:          All
   Reporter:          wig (wig)
   Component:         gridengine
   OS:                Solaris
   Subcomponent:      qmaster
   Version:           5.3
   CC:                None defined
   Status:            NEW
   Priority:          P3
   Resolution:
   Issue type:        ENHANCEMENT
   Target milestone:  ---
   Assigned to:       andreas (andreas)
   QA Contact:        ernst
   URL:
   Summary:           user belonging to too many groups sets all queues into error state
   Status whiteboard:
   Attachments:

   Issue 1102 blocks:
   Votes for issue 1102:


   Opened: Fri Jun 18 07:51:00 -0700 2004 
------------------------


Follow-up to the following discussion on the
users@gridengine mailing list:

Please change the handling of jobs whose user
belongs to too many groups from "set queue into
error state and resubmit job to another queue"
to "set job into error state".


> Date: Wed, 28 May 2003 10:42:24 +0200 (MEST)
> From: Andy Schwierskott <andy.schwierskott@sun.com>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
> Subject: Re: [GE users] Too many groups send queues go into (E)rror state
>
>
> Hi,
>
> > --- Jon  Kleinsmith <jon.kleinsmith@conexant.com> wrote:
> > > Looking in the messages log, I see two things that
> > > bother me:
> > >
> > > 1. the scheduler is trying to set uid and euid 0 on
> > > the job?
> >
> > Actually the shepherd, not the scheduler. Each job has
> > its own shepherd, which is used for
> > collecting/controlling the job.
> >
> > The message (uid=0, euid=0) tells you that SGE
> > couldn't add an additional group, but uid=0/euid=0 is
> > not what SGE is trying to set to, it is actually the
> > uid/euid of shepherd.
> >
> > > 1. Why does the scheduler need to add the uid/euid
> > > info for the job?
> >
> > The shepherd needs to add an additional group id to
> > the job for job control.
>
> There are two reasons why setting the additional group id for the job may
> fail:
>
>    - the root user on that machine already has the maximum number
>      (usually 16) of add. group id's.
>      This should be easy to fix for the admin. All jobs on that machine
>      would be affected and put the queue into error state.
>
>      (thinking about this I'm wondering if it could be possible to reduce
>       the number of add. group id's in the shepherd process to avoid this
>       source of failure, but I'm not sure if that couldn't cause other
>       unexpected problems???)
>
>    - the 'job user' has 16 add. group id's. While this is purely admin
>      related, it's not necessarily the responsibility of the SGE admin.
>
>      I think in practice it should be relatively easy to reduce the number
>      of add. group id's of the users to 15, but I understand that an SGE
>      admin does not want this to block compute resources until this
>      problem is solved.
>
> > > 2. Why does this error "lock-up" my queues?
> >
> > So that the cluster admin knows that something wrong
> > is going on.
>
> We have had this problem reported a couple of times in the past. I see it
> would be better to put the job into error state if the user is already in
> 16 add. groups.
>
> Does anyone see a reason why it could be a problem to put the job into
> error state and not the queue in that case? If no one sees a problem we
> might file this as an enhancement issue.
>
> Andy
>
> > > 3. Is there a simple way of resetting the error state
> > > of the queues
> > > without losing scheduled/running jobs?
> >
> > As mentioned by other people, use qmod -c <queue>. You
> > can consider using a cron job to clear the error state
> > of the queues.
> >
> >  -Ron
> >
> > >
> > > Any info would be helpful.
> > >
> > > -Jon
> > >
> > >
> > > --
> > > Jon  Kleinsmith <jon.kleinsmith@conexant.com>
> > >
> > >

   ------- Additional comments from sgrell Mon Dec 12 02:45:29 -0700 2005 -------
Changed subcomponent.

Stephan

   ------- Additional comments from sgrell Mon Dec 12 02:48:52 -0700 2005 -------
Changed subcomponent.

Stephan

Change History (1)

comment:1 Changed 9 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed

Fixed by [3752]
